Distilling Knowledge for Empathy Detection

Empathy is the link between self and others. Detecting and understanding empathy is a key element for improving human-machine interaction. However, annotating data for detecting empathy at a large scale is a challenging task. This paper employs multi-task training with knowledge distillation to incorporate knowledge from available resources (emotion and sentiment) to detect empathy from natural language in different domains. This approach yields better results on an existing news-related empathy dataset compared to strong baselines. In addition, we build a new dataset for empathy prediction with fine-grained empathy direction, seeking or providing empathy, from Twitter. We release our dataset for research purposes.


Introduction
Empathy is the ability to feel, understand, and correlate with the thoughts and feelings of another person (Decety and Jackson, 2004). Empathy enables us to build rapport with other people by acknowledging their cognitive state and making them feel that they are being heard and understood. Applications of analyzing and detecting empathy have been examined from numerous perspectives, including medical and healthcare (Decety and Fotopoulou, 2015; Williams et al., 2015; Raab, 2014), human-computer interaction (De Vicente and Pain, 2002; Buechel et al., 2018), neuroscience (Decety and Ickes, 2011), philosophy and psychology (Yan and Tan, 2014; Coplan and Goldie, 2011; Batson, 2009), and education (Virvou and Katsionis, 2003).
Social platforms facilitate expressing empathy and sharing of thoughts and information through natural language and text-based communication. Consequently, many people turn to social networks to share their experiences and feelings in different situations. Several psychological and social science studies have recently examined the relationship between users' empathetic ability in a social network and their behavioral patterns (Kardos et al., 2017; Morelli et al., 2017; Medeiros and Bosse, 2016; Reis et al., 2004). For example, Kardos et al. (2017) examined social networks and observed that more empathetic capabilities in users lead to a larger group of close friends. Morelli et al. (2017) and Medeiros and Bosse (2016) also showed that empathy as an individual's personality influences their ability to attract social ties.
To analyze and understand empathy at scale, it is important to devise models to detect empathy from natural language. Effectively training such models depends on the presence of quality labeled data. However, annotating such data at scale is challenging due to the subjective nature of empathy (Decety and Jackson, 2004) and the high annotation costs. Consequently, existing datasets on the task of text-based empathy classification are small in size. To address the small data issue, we study the use of data-rich tasks related to empathy and utilize their correlation in a multi-task learning setup. Multi-task learning delivers an efficient means of using supervised data from multiple related tasks. It is beneficial for various relevant tasks to be learned jointly so that each task can benefit from the knowledge learned in other tasks (Fukuda et al., 2017; Zagoruyko and Komodakis, 2016; Ma et al., 2018).
To select the relevant tasks, we follow the notion of the correlation between empathy and emotion discussed by Szanto and Krueger (2019) and Hein and Singer (2008). Szanto and Krueger (2019) showed that empathy is correlated with affective and emotional expression. Hein and Singer (2008) also characterized empathy as "an affective state, caused by sharing of the emotions of another person." Therefore, we can expect that empathetic sentences are rich in emotion and sentiment. It can be seen from Table 1 that when expressing empathy, people often show emotional behavior. For example, sentences like "I'm sorry to hear that about Dakota's parents" or "I cry everyday" are rich in sadness emotion and negative sentiment polarity.

Table 1: Examples from the NewsEmp and TwittEmp datasets.

NewsEmp (empathetic; emotion: sadness; polarity: negative): "I'm sorry to hear that about Dakota's parents. Even when you are adult it must be hard to see your parents splitting up. No one wants that to happen and it's unfortunate that her parents couldn't work it out. I hope they are able to still remain civil around the kids and family. Just because it didn't work romantically doesn't mean it won't work at all."

NewsEmp (non-empathetic; emotion: anger, disgust; polarity: negative): "It's a shame that air pollution has potentially been linked to increased mental damage with young children. We often don't take into account all the damage that the fossil fuel companies have done to our society. We only praise them for creating the fuels we use but never tax them appropriately for all the damage that they cause us."

TwittEmp: "My granddaughter has Wilms Cancer stage 4, she has been fighting since January. I cry everyday. There is not much to say, anyone who outlives a child suffers heartache, and the grandparents suffers both for their child and their grandchild."
In this paper, we show that better performance can be obtained by leveraging external knowledge related to empathy: emotion and sentiment. To this end, we use multi-task training with a knowledge distillation technique (Clark et al., 2019) to incorporate knowledge from emotion and sentiment into empathetic content. In particular, we utilize two available resources as external knowledge to improve empathy prediction: (1) EmoNet, an emotion detection dataset; and (2) SST (Socher et al., 2013), a sentiment classification dataset. We employ EmoNet and SST as single-task models to teach a multi-task model to detect empathy. We show that multi-task training with knowledge distillation outperforms strong baselines on two empathy datasets, each collected from a different platform in a different domain: news and health. Table 1 shows examples from these datasets: NewsEmp by Buechel et al. (2018) and TwittEmp, our dataset created from Twitter health posts.
We explore empathy at a coarse granularity (empathy versus non-empathy) and a fine granularity (seeking versus providing empathy). Results of our experiments show that at the coarse granularity, detecting empathy in the news context is more challenging than in the health domain. However, detecting empathy (in the health domain) at the seeking and providing granularity is harder for the models. This may imply that empathy detection at a fine granularity requires more implicit reasoning, which is not present as surface-level lexical information.
Our contributions are as follows: (1) We propose to use multi-task training with knowledge distillation for empathy classification to incorporate emotion and sentiment knowledge into empathetic content (§3); (2) We achieve better performance on the news empathy reactions dataset (NewsEmp) (Buechel et al., 2018), yielding (on average) a +4% F1 improvement (§5.2). Moreover, we bridge the domain gap between the existing empathy datasets (e.g., NewsEmp (Buechel et al., 2018)) and our TwittEmp dataset by employing unsupervised domain adaptation from news to health (§6). To our knowledge, we are the first to explore unsupervised domain adaptation for empathy detection; (3) We introduce TwittEmp (§4), a Twitter dataset of perceived empathy annotated with fine-grained empathy direction. We release our dataset 1 to facilitate research in social domains.

Related Work
Numerous studies have discussed the importance of empathy and its impacts on individuals' physiological condition and medical health. The applications of empathy and its benefits have been examined from numerous perspectives, including human-computer interaction (De Vicente and Pain, 2002; Virvou and Katsionis, 2003; Kort and Reilly, 2002), healthcare (Raab, 2014; Williams et al., 2015), psychology (Batson, 2009; Davis, 1983), cognitive science (Wakabayashi et al., 2006; Launay et al., 2015), and neuroscience (Carr et al., 2003; Singer and Lamm, 2009; Keysers et al., 2004). Empathy is shown to have correlation with gender and language, as well as behavior and culture (Chung and Bemak, 2002; Chung et al., 2010; Gungordu, 2017). Gungordu (2017) analyzed the impacts of gender and cultural orientations on individuals' empathetic expression and observed that women are more empathetic compared to men, and that people from different cultures express empathy in diverse ways.
However, only recently have computational studies analyzed empathy from text (Sharma et al., 2020; Yang et al., 2019; Sedoc et al., 2019; Buechel et al., 2018; Khanpour et al., 2017) and from spoken dialogues (Alam et al., 2018; Pérez-Rosas et al., 2017; Fung et al., 2016). For example, Khanpour et al. (2017) proposed a neural network model to detect empathetic messages in health-related posts from lung and breast cancer discussion boards in a cancer support network. Their work differs from ours in that they focus only on the high-level empathy presented in the text and do not detect the direction of empathy at a fine-grained level. Another study identified a pathogenic type of empathy by collecting ≈1.8M Facebook posts. For text-based empathy prediction, to our knowledge only three prior works (Hosseini and Caragea, 2021; Sharma et al., 2020; Buechel et al., 2018) have built publicly available datasets. Hosseini and Caragea (2021) used BERT to detect the direction of empathetic support in an online cancer network. Unlike our work, Hosseini and Caragea (2021) modeled the empathy direction at the sentence level, without considering the whole message expressing empathy (which usually contains more than one sentence; see Table 1). Sharma et al. (2020) employed a RoBERTa-based bi-encoder model to detect empathy in conversations on online mental health platforms. In contrast to our work, Sharma et al. (2020) focused on the level of communication (weak, strong, or no communication) in a response post and developed a framework of expressed empathy consisting of three communication mechanisms: emotional reactions, interpretations, and explorations. Buechel et al. (2018) built a corpus of messages written in reaction to news articles. Other publicly available datasets address other empathy tasks, such as empathetic dialogue generation (Rashkin et al., 2018) and learning word ratings for empathy (Sedoc et al., 2019).

Detecting Empathy
Detecting empathy from textual input is challenging due to the scarcity of labeled training data, and manually annotating a corpus at large scale is not a feasible solution either, due to the task's difficulty and the high cost of the annotation process. Here, we propose to use multi-task learning with knowledge distillation and teacher annealing to leverage knowledge from available resources on sentiment and emotion to detect empathy.

Multi-Task Learning
In multi-task learning (MTL) (Liu et al., 2019; Caruana, 1997), a target task is learned by employing knowledge from related auxiliary tasks so that knowledge learned in one task is shared across all tasks. In our setting, the target task is empathy detection and the auxiliary tasks are emotion and sentiment classification. As in Liu et al. (2019), we build all the models on top of the pre-trained BERT language model (Devlin et al., 2018). In MTL, the bottom layers (corresponding to BERT) are shared across all three tasks, and the top layers are specific to each task, as shown in Figure 1 (right side). Specifically, we use a fully connected layer for each task followed by softmax for classification.
During MTL training, examples from the three tasks are shuffled together (within minibatches) and the sum of the losses of all three tasks is minimized via backpropagation. That is, let D_τ = {(x_i^τ, y_i^τ)}_i be the training set for task τ, where τ is any of the three tasks (empathy, emotion, or sentiment). The loss of the MTL model with parameters θ is:

\mathcal{L}(\theta) = \sum_{\tau}\sum_{i} \ell\left(y_i^{\tau},\, f^{\tau}(x_i^{\tau}, \theta)\right) \quad (1)

where f^τ(x_i^τ, θ) is the output of model θ on the input x_i^τ and ℓ is the cross-entropy loss. That is, the MTL model is optimized based on one-hot labels.
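The summed loss in Eq. (1) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the paper's implementation: a real system would compute batched tensor losses, and the task names and data shapes here are hypothetical.

```python
import math

def cross_entropy(logits, gold):
    """Negative log-probability of the gold class under softmax(logits),
    computed stably via the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[gold]

def mtl_loss(task_batches):
    """Eq. (1): sum of cross-entropy losses over a minibatch that mixes
    examples from all tasks. `task_batches` maps a task name (empathy,
    emotion, or sentiment) to (class_logits, gold_index) pairs produced
    by the shared encoder plus that task's classification head."""
    return sum(cross_entropy(logits, gold)
               for examples in task_batches.values()
               for logits, gold in examples)
```

Because the loss is a plain sum over tasks, gradients from all three tasks flow into the shared BERT layers in every update.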

Multi-Task Learning with Knowledge Distillation and Teacher Annealing
Rather than optimizing the model based on one-hot labels, a better training signal can be obtained by distilling knowledge in a teacher-student framework, in which the student model learns from the teachers' outputs. Thus, we use the MTL model of Clark et al. (2019), which distills knowledge from the auxiliary tasks into the target task by applying knowledge distillation (Ba and Caruana, 2014; Buciluǎ et al., 2006; Hinton et al., 2015) so that single-task models (teachers) teach a multi-task model (student), with the aim that the student becomes better than its teachers. During training, as before, examples from the various tasks are mixed together and the aggregated loss over all three tasks is minimized. Formally, let D_τ = {(x_i^τ, y_i^τ)}_i be the training set for task τ (empathy, emotion, or sentiment), as before. A single-task (teacher) model, denoted θ_τ, is trained on each task τ, producing output f^τ(x_i^τ, θ_τ) on the input x_i^τ (see Figure 1). Then, a multi-task shared (student) model with parameters θ (right side of Figure 1) learns to imitate the outputs of the single-task (teacher) models θ_τ (left side of Figure 1). The loss of the multi-task (student) model becomes:

\mathcal{L}(\theta) = \sum_{\tau}\sum_{i} \ell\left(f^{\tau}(x_i^{\tau}, \theta_{\tau}),\, f^{\tau}(x_i^{\tau}, \theta)\right) \quad (2)

That is, the MTL model with knowledge distillation is optimized based on the teachers' predictions. However, merely emulating the teacher models may prevent the student model from surpassing them. Clark et al. (2019) therefore use a training strategy called teacher annealing, which combines gold-standard labels with teacher predictions:

\mathcal{L}(\theta) = \sum_{\tau}\sum_{i} \ell\left(\lambda\, y_i^{\tau} + (1 - \lambda)\, f^{\tau}(x_i^{\tau}, \theta_{\tau}),\, f^{\tau}(x_i^{\tau}, \theta)\right) \quad (3)

where λ is linearly increased from 0 to 1 over the course of training. This helps the student model outperform its teachers. We adopt this approach in our experiments.
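A minimal sketch of the teacher-annealed target in Eq. (3), in plain Python. The function names are hypothetical and the logits are toy values; a real implementation would anneal λ per training step and work on batched tensors.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def soft_cross_entropy(target_probs, student_logits):
    """H(target, student) = -sum_c target_c * log softmax(student)_c."""
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs))

def annealed_distillation_loss(gold_label, teacher_logits, student_logits, lam):
    """Eq. (3): the target interpolates between the teacher's predicted
    distribution (lam = 0) and the gold one-hot label (lam = 1); lam is
    increased linearly from 0 to 1 over the course of training."""
    num_classes = len(teacher_logits)
    onehot = [1.0 if c == gold_label else 0.0 for c in range(num_classes)]
    teacher_probs = softmax(teacher_logits)
    target = [lam * o + (1 - lam) * t for o, t in zip(onehot, teacher_probs)]
    return soft_cross_entropy(target, student_logits)
```

At λ = 1 this reduces to the ordinary one-hot cross-entropy of Eq. (1), so late training is driven by gold labels while early training is driven by the teachers.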

Data
We incorporate knowledge from the data-rich tasks of emotion and sentiment to detect empathy. We specifically use SST-2 (Socher et al., 2013) and EmoNet. SST-2 is a binary sentiment analysis dataset consisting of sentences from movie reviews labeled with their sentiment (positive or negative); EmoNet is an emotion detection dataset of tweets.

We evaluate empathy detection on two datasets, chosen from (1) different domains: news and health; and (2) different platforms: online news platforms and Twitter. Despite the significance of empathy in improving patients' positive feelings, only a few datasets are publicly available. We model empathy on the recent dataset by Buechel et al. (2018), leveraging available resources; we refer to it as the NewsEmp dataset. In addition, to experiment with data from a different domain, we introduce TwittEmp, a new dataset of perceived empathy collected from Twitter. We describe the datasets below.

NewsEmp Dataset
NewsEmp is a dataset of empathic reactions to news stories released by Buechel et al. (2018). The dataset contains 1,860 messages written in reaction to news articles, rated with a numeric level of empathy and distress on a 7-point scale. Buechel et al. (2018) provided binary empathy labels, indicating whether a message contains empathetic content or not. We leverage these labels to model empathy in a binary setting. We split the dataset into three sets of train, validation, and test, with 80% of the data used for training, 10% for validation, and 10% for test.

TwittEmp Dataset
We present our dataset of perceived empathy annotated with fine-grained empathy direction (seeking vs. providing). TwittEmp contains 3,000 English tweets, which will be publicly available for further research in social domains.

Definitions of Seeking and Providing Empathy.
Empathy requires one to embrace the subjective standpoint of others (Decety and Jackson, 2004). We characterize seeking empathy as a need to be heard and understood: when people experience challenging situations, they need their feelings to be recognized and acknowledged. Providing empathy can be defined as the psychological perception of the feelings, thoughts, or attitudes of individuals who are enduring challenging experiences. Our definitions are derived in consultation with a psychologist and follow Decety and Jackson (2004) and online definitions of empathy.

Data Collection and Annotation
We collect a dataset of 3,000 tweets from Twitter, annotated with three categories: seeking-empathy, providing-empathy, or none. We collect data related to the cancer topic using the Twitter streaming API, from July 2015 to August 2020. We employ filtering techniques to ensure that the collected tweets are likely to contain empathetic content. Specifically, we use the empathy and distress lexicon 2 by Sedoc et al. (2019), which consists of 9,356 word types, each with associated empathy and distress ratings. The lexicon is context-independent; therefore, several words in the lexicon with high empathy ratings, such as gaza, zambia, and myanmar, do not correlate with our topic of interest (i.e., health). Consequently, we select the 200 words with the highest empathy ratings that are relevant to the health topic. The selected words and their corresponding empathy ratings are presented in Appendix A.
We require that candidate tweets contain at least one of the 200 high-rating empathy words plus the word "cancer". As part of the preprocessing, we remove duplicate tweets and replace links and usernames with <URL> and <USER>, respectively.
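The filtering and preprocessing steps above can be sketched as follows. The lexicon entries and their ratings are illustrative placeholders, not actual values from Sedoc et al. (2019), and the function names are hypothetical.

```python
import re

# Hypothetical subset of a high-empathy word list; real ratings
# come from the Sedoc et al. (2019) empathy and distress lexicon.
EMPATHY_LEXICON = {"grieving": 6.2, "heartbroken": 6.0, "survivor": 5.8}

def keep_tweet(text, lexicon, keyword="cancer"):
    """Keep a tweet only if it mentions the topic keyword and at
    least one high-empathy lexicon word."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return keyword in tokens and bool(tokens & set(lexicon))

def preprocess(text):
    """Replace links and @-mentions with placeholder tokens."""
    text = re.sub(r"https?://\S+", "<URL>", text)
    text = re.sub(r"@\w+", "<USER>", text)
    return text
```

Duplicate removal (not shown) can then be done by keeping a set of already-seen normalized texts.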
To ensure the quality of annotations and the reliability of the labels, we trained two graduate students through multiple iterations, with a psychologist in the loop for the initial round of labeling. Following prior studies (D'Mello, 2015; Fort, 2016), the annotation task was done iteratively. In each round, the annotators were asked to annotate 200 tweets and to discuss the disagreements with the researchers. After each round of discussions, 100% inter-annotator agreement (IAA), measured by Cohen's kappa coefficient, was obtained. After three initial rounds, annotation continued until we obtained 1,000 annotated samples per class of seeking-empathy, providing-empathy, and none. Finally, the last round of annotations was reviewed and finalized by one of the authors of this paper.
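Cohen's kappa, used above to measure IAA, corrects observed agreement for the agreement expected by chance given each annotator's label marginals. A small self-contained sketch:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement rate and p_e is the chance agreement implied by each
    annotator's label frequencies."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:          # degenerate case: both use a single label
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

For example, annotators who disagree on one of four mixed-label items score kappa = 0.5 rather than the raw 75% agreement.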

Characteristics of Datasets
Characteristics of TwittEmp compared with the NewsEmp dataset are outlined in Table 2. As shown in Table 2, Buechel et al. (2018) modeled "intended" empathy, as they obtained empathy scores from the writer of a text. In contrast, we study "perceived" fine-grained empathy from the reader's perspective. This allows us to examine and model empathy from different perspectives. Table 3 presents the top frequent noun phrases (4-grams) in the TwittEmp and NewsEmp datasets. The top noun phrases reveal a distinct theme and storyline for each dataset. Unlike NewsEmp, which is collected from reactions to news stories, TwittEmp covers health-related content. For instance, "Sorry for your loss, cancer has robbed our lives of some wonderful people." represents a user's intention to provide empathetic support for others. In contrast, sentences like "So I just read an article where 2 friends went diving to a place they shouldnt have and ended up dying. While they were using brand new equipment, I feel like idiots who take stupid risks and go to places where no humans should be, kind of deserve what ends up happening to them. If you dont sky dive, you never have to worry about going splat when your chute doesnt open", from NewsEmp, describe a reaction to a heartbreaking news story. Table 1 in §1 shows samples from NewsEmp and TwittEmp, along with their Plutchik-8 emotions and sentiment polarity.
The average length of a tweet in TwittEmp is around 37 words (max = 62 words), while NewsEmp has an average message length of 82 words (max = 163 words). TwittEmp also averages 3 sentences per tweet, while NewsEmp averages 5 sentences per message. Figures 2a and 2c compare the tweet and message length distributions across the TwittEmp and NewsEmp datasets, respectively. Figures 2b and 2d show the length distributions in the datasets per class. Comparing the two suggests that NewsEmp often carries longer sentences.

Experiments
We model empathy in a binary setting in both datasets, detecting whether a message contains empathetic content or not. For modeling empathy in the TwittEmp dataset, we keep tweets labeled seeking-empathy and providing-empathy as positive samples and tweets in the none class as negative samples. We then split the dataset into train, validation, and test sets, with 80% of the data used for training and the remaining 20% split equally between validation and test.
Detecting Fine-grained Empathy. Given a tweet, our goal is to classify it into one of two categories: seeking-empathy or providing-empathy. We create two binary classifiers, one to detect tweets seeking empathy and one to detect tweets providing empathy. For the seeking classifier, we keep seeking-empathy as positive samples and combine the two classes of none and providing-empathy as negative samples. Similarly, for the providing classifier, we keep providing-empathy as positive samples and combine the two classes of none and seeking-empathy as negative samples. We then split the datasets, keeping 60% of the data for the training set, 20% for validation, and 20% for the test set.
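The binary label construction and the 60/20/20 split can be sketched as below. Class names follow the paper; the helper functions themselves are hypothetical.

```python
import random

def make_binary(examples, positive_classes):
    """Map TwittEmp's three classes to a binary task, e.g. the seeking
    classifier uses {'seeking-empathy'} as positive and folds
    'providing-empathy' and 'none' into the negative class."""
    return [(text, 1 if label in positive_classes else 0)
            for text, label in examples]

def split(examples, train=0.6, valid=0.2, seed=0):
    """Shuffle deterministically and cut into train/validation/test."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    n = len(data)
    a, b = int(n * train), int(n * (train + valid))
    return data[:a], data[a:b], data[b:]
```

The providing classifier is built the same way with `{'providing-empathy'}` as the positive set.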

Models
The details of the experiments are as follows. We contrast multi-task learning with knowledge distillation and teacher annealing (§3.2), which learns from both teachers' outputs and one-hot labels (denoted KD; Eq. 3), with multi-task learning (§3.1) that uses only one-hot labels (denoted MT; Eq. 1), and with the following baselines.
Standard Neural Methods. We experiment with (1) CNN (Kim, 2014); (2) LSTM (Hochreiter and Schmidhuber, 1997); (3) ConvLSTM, a combination of the two previous models used in prior work on empathetic message identification (Khanpour et al., 2017); and (4) BiLSTM (Hochreiter and Schmidhuber, 1997). All neural models were trained with pre-trained 100d GloVe (Pennington et al., 2014) word embeddings. The best hyper-parameters reported by Kim (2014) are used for the CNN. For the LSTM-based models, we used 128 hidden units and a dropout rate of 0.5, with a softmax layer on top to obtain the final predictions.
Pre-Trained Language Models. We fine-tune BERT (Devlin et al., 2018), in particular bert-base-uncased, with an added single linear layer on top of the [CLS] token.

Results
Our main results for the NewsEmp dataset (Buechel et al., 2018) are shown in Table 4. We observe that multi-task training with knowledge distillation and teacher annealing achieves clear improvements over the best BERT model and over multi-task training. When both auxiliary tasks are combined, the F1 score further improves to 68.41 (+4.62), suggesting that the two tasks provide a complementary signal that is beneficial for the empathy prediction task. The results also suggest that using the teachers' output distributions over classes (i.e., 'KD-*') instead of one-hot labels (i.e., 'MT-*') improves performance, indicating that the teachers' outputs provide additional information about the training examples.

Table 5 shows the main results for empathy detection on the TwittEmp dataset, where we see that leveraging knowledge from EmoNet alone ('KD-EmoNet') outperforms 'KD-SST' and 'KD-SST+EmoNet' on this dataset. This could be attributed to EmoNet's content, which consists of general tweets and thus resembles the TwittEmp dataset. The results also suggest that MT+KD outperforms MT with one-hot labels. Comparing these results with Table 4 suggests that modeling empathy in NewsEmp is more challenging than in TwittEmp, possibly because the longer messages in NewsEmp are harder to classify.

Table 6 shows the main results for fine-grained empathy direction on the TwittEmp dataset. We see similar patterns for both the seeking and providing classifiers: each multi-task model improves performance, and 'KD-EmoNet' is the most effective, showing that leveraging knowledge from the more closely related task helps to a greater extent. We also observe that detecting empathy at a finer granularity is more challenging than coarse-grained empathy detection. This may indicate that modeling empathy at the fine-grained level requires more implicit reasoning, making the task harder.
As in the previous tasks, we can see that knowledge distillation provides more information than one-hot labels alone, resulting in improved performance.

Unsupervised Domain Adaptation
Empathy annotations are not always available. Nevertheless, from a psychological perspective, such annotations would be valuable for understanding users' empathetic profiles during difficult situations.
In this section, we examine methods that leverage supervision from an existing empathy dataset (NewsEmp (Buechel et al., 2018)) to provide labels for the TwittEmp empathy dataset. We set up this task as unsupervised domain adaptation: NewsEmp is the labeled source domain (SRC), and our TwittEmp dataset is the unlabeled target domain (TRG). Below, we provide details on the adaptation method. We employ BERT as the classifier. Following Han and Eisenstein (2019), we mainly focus on pre-training techniques that facilitate effective transfer between domains. We experiment with pre-training on dynamic masked language modeling by leveraging unsupervised data from different domains and platforms: (1) Unsupervised EMPATHETICDIALOGUES (Rashkin et al., 2018): a dataset of crowdsourced conversations grounded in emotional situations; (2) Unsupervised Twitter: a large amount of unsupervised data we collect from Twitter in the health domain, using the same lexicon words as before (Sedoc et al., 2019); (3) Unsupervised GoEmotions (Demszky et al., 2020): a large-scale emotion detection dataset of Reddit comments; (4) Unsupervised ISEAR (Scherer and Wallbott, 1994): a survey of emotion antecedents and reactions to emotional situations; (5) Unsupervised DailyDialog (Li et al., 2017): dialogues from educational websites.
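Dynamic masked language modeling re-samples the masked positions every time an example is seen, rather than fixing them once at preprocessing time. The sketch below follows the standard BERT-style 80/10/10 replacement scheme; the mask token, toy vocabulary, and function name are illustrative, not the paper's implementation.

```python
import random

MASK = "[MASK]"
VOCAB = ["empathy", "cancer", "hope", "support", "care"]  # toy vocabulary

def dynamic_mask(tokens, mask_prob=0.15, rng=None):
    """Corrupt a token sequence for MLM pre-training. Each call draws a
    fresh set of masked positions (~mask_prob of tokens); a selected
    token becomes [MASK] 80% of the time, a random vocabulary token 10%,
    and stays unchanged 10%. Returns (corrupted_tokens, targets), where
    targets[i] is the original token if position i is predicted, else None."""
    rng = rng or random.Random()
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)               # model must predict the original
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB))
            else:
                corrupted.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)              # position not predicted
    return corrupted, targets
```

In the adaptation setup, BERT would first be pre-trained with this objective on the unsupervised corpus, then fine-tuned on the labeled source domain.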
For comparison, we experiment with different systems: (1) SOURCE-ONLY: the source domain is used for fine-tuning BERT (the training portion) and the target domain is used for evaluation (the test portion); (2) TARGET-ONLY: the target domain is used for both training and evaluation of BERT; these results are adopted from Table 5 to show in-domain performance; (3) PRETRAIN-*: BERT undergoes dynamic masked language modeling (MLM) pre-training on a large set of unsupervised data from task/dataset *, i.e., EMPATHETICDIALOGUES, GoEmotions, ISEAR, Twitter, or DailyDialog (one at a time); BERT is then fine-tuned on the source domain (the training portion) and ultimately evaluated on the target domain (the test portion) (Han and Eisenstein, 2019). Table 7 presents the results of the unsupervised domain adaptation. Generally, we do not observe a noticeable improvement over the SOURCE-ONLY baseline when using EMPATHETICDIALOGUES, GoEmotions, or Twitter. Leveraging unsupervised data from DailyDialog improves performance by 1.48%, and incorporating ISEAR yields a 1.13% improvement. Table 7 also shows that pre-training adds a small improvement in recall in most settings; we posit that incorporating knowledge from a different domain helps retrieve most of the relevant results (fewer false negatives). Still, a large gap remains between PRETRAIN-* and TARGET-ONLY. The results suggest that more explicit strategies may be needed to enable domain adaptation for empathy.

Conclusion
In this study, we show that distilling knowledge from available related resources on emotion and sentiment can be effectively used to inform empathy classification. We use multi-task training with a knowledge distillation technique to incorporate knowledge from EmoNet and SST into empathetic content. This approach achieves better results on two datasets from different domains. We also show promising results on unsupervised domain adaptation for empathy detection, which represents an interesting future direction.

A Words and Empathy Ratings
Table 8 presents the selected words and their corresponding empathy ratings, chosen from the lexicon of Sedoc et al. (2019).