EmpathBERT: A BERT-based Framework for Demographic-aware Empathy Prediction

Affect preferences vary with user demographics, and tapping into demographic information provides important cues about the users’ language preferences. In this paper, we utilize the user demographics and propose EmpathBERT, a demographic-aware framework for empathy prediction based on BERT. Through several comparative experiments, we show that EmpathBERT surpasses traditional machine learning and deep learning models, and illustrate the importance of user demographics, for predicting empathy and distress in user responses to stimulative news articles. We also highlight the importance of affect information in the responses by developing affect-aware models to predict user demographic attributes.


Introduction
Modeling complex human reactions and affect from text has been a challenging research area, with innovations focusing on sentiment and emotion understanding (Picard, 1997; Li and Liu, 2015; Rosenthal et al., 2017; Socher et al., 2011, 2013). The study of non-trivial human reactions has been limited. Approaches to such reactions, often rooted in psychological theories, have turned out to be more complex in terms of annotation and modeling (Strapparava and Mihalcea, 2007). A critical affective phenomenon, empathy, has received surprisingly little attention.
Empathy assesses feelings of sympathy towards others, and Distress measures anxiety and discomfort oriented towards self (Davis, 1980). Empathy has been positively associated with a number of wellbeing activities, such as volunteering (Batson et al., 1987), charity (Pavey et al., 2012), and longevity (Poulin et al., 2013), and has been studied in consumer marketing, advertising, and customer interfaces (Wang et al., 2016; Escalas and Stern, 2003). Works on empathy in text have focused on spoken dialogue, addressing conversational agents, psychological interventions, or call center transcripts (McQuiggan and Lester, 2007; Fung et al., 2016; Pérez-Rosas et al., 2017; Alam et al., 2018; Demasi et al., 2019). Buechel et al. (2018) collected an empathy-distress dataset by leveraging users' reactions to textual stimulus content. Sedoc et al. (2019) constructed an empathy lexicon by deriving word ratings from the document-level ratings in this dataset. Xiao et al. (2012), Gibson et al. (2015), and Khanpour et al. (2017) presented predictive models for empathy in the healthcare domain. However, we believe none of the above works focus on (a) predicting empathy from textual reactions, and (b) studying the impact of demographics on the expression of empathy.
Language preferences vary with user demographics (Tresselt and Mayzner, 1964;Eckert and McConnell-Ginet, 2013;Garimella et al., 2016;Lin et al., 2018;Loveys et al., 2018), and this has led to studies leveraging the user demographic information to obtain better language representations and classification models for various NLP tasks (Volkova et al., 2013;Bamman et al., 2014;Hovy, 2015;Garimella et al., 2017). Owing to the recent success of large language models such as BERT (Devlin et al., 2018) and GPT-2 (Radford et al., 2019) in improving the performances of several downstream tasks, we propose a BERT-based demographic-aware framework for empathy (distress) prediction, and through several comparative experiments, show that it surpasses existing baselines and demographic-agnostic approaches.
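This paper makes three main contributions.
(1) We propose EmpathBERT, a BERT-based demographic-aware framework for empathy (distress) prediction.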
(2) Through comparisons against several baseline and demographic-agnostic approaches, we illustrate the importance of user demographics in end-to-end modeling and predicting empathy (distress).
(3) Conversely, we show that empathy (distress) also contributes to demographic attribute prediction, by developing affect-aware models for demographic attribute prediction, backed by empirical comparison with baselines and generic models. To the best of our knowledge, ours is the first computational effort addressing empathy (distress) through the lens of demographic biases, a phenomenon well understood in psychology.

Dataset
We use the empathy-distress dataset introduced by Buechel et al. (2018). It consists of 418 news articles from popular news platforms, and responses to them from 403 annotators (5 articles each), resulting in a total of 2,015 responses. Filtering out the responses that deviated from the task description left 1,860 responses (empathy: 916, distress: 905), with a total token count of 173,686 (min: 52, max: 198, median: 84). The number of responses per article ranges from 1 to 7, with an average of 4.46 responses per article. We report some example responses from the dataset in Table 1. We focus on the responses only, and use the empathy (distress) tags associated with these responses. We group the data into binary classes for age (C_0: < 35, C_1: ≥ 35), income (C_0: ≤ $50,000, C_1: > $50,000), and education (C_0: no degree, C_1: bachelor's or above), to mitigate class imbalances. The resulting dataset is balanced for all dimensions, with a maximum deviation of 5.5% (age) among classes.
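The binary grouping of demographic attributes described above can be illustrated with a minimal sketch. The `Annotator` record and its field names are hypothetical and chosen only for illustration; the thresholds are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Annotator:
    """Hypothetical record for one annotator (field names are assumptions)."""
    age: int
    income: float          # annual income in USD
    has_bachelors: bool    # True if bachelor's degree or above


def binarize(annotator):
    """Map raw attributes to the binary classes C_0 (0) and C_1 (1)."""
    return {
        "age": 1 if annotator.age >= 35 else 0,            # C_0: < 35, C_1: >= 35
        "income": 1 if annotator.income > 50_000 else 0,   # C_0: <= $50,000, C_1: > $50,000
        "education": 1 if annotator.has_bachelors else 0,  # C_0: no degree, C_1: bachelor's or above
    }
```

Note that the boundary values fall on different sides: an annotator aged exactly 35 is in C_1 for age, while an income of exactly $50,000 is in C_0.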

EMPATHBERT
In this section, we describe our approach for demographic-aware empathy (distress) prediction from text. Figure 1 shows the proposed architecture. Our model takes as input a response (a sequence of words w_1, w_2, ..., w_n) and demographic information of the corresponding annotator. We represent the response using BERT, a bidirectional Transformer-based (Vaswani et al., 2017) language model. We use the final 768-dimensional hidden vector corresponding to the [CLS] token as the aggregate sequence representation. We employ cross-domain pre-training, fine-tuning, and multi-task fine-tuning techniques to customize BERT for our tasks.
Cross-domain Pre-training (PT). We use the pre-trained BERT language model trained on the English Wikipedia and BookCorpus (Zhu et al., 2015) datasets for masked word and next sentence prediction, and perform further pre-training on demographic-specific datasets to introduce demographic-specific language preferences. This enables slanting the BERT model towards a specific demographic group. For this, we use a corpus different from the empathy dataset, in two scenarios: (1) ALL: train the BERT model on all of the external corpus, and (2) DEMOGRAPHIC-SPECIFIC: train only on the demographic-specific samples from the external corpus.
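The two pre-training scenarios differ only in which part of the external corpus is used. A minimal sketch of that selection step follows, assuming each document is a dictionary with a `text` field and a `group` field holding its demographic label; both field names and the function name are hypothetical.

```python
def select_pretraining_texts(documents, scenario, group=None):
    """Return the texts used for further pre-training under one scenario.

    documents: list of dicts with hypothetical keys "text" and "group".
    scenario:  "ALL" or "DEMOGRAPHIC-SPECIFIC".
    group:     the demographic label to keep (required for the second scenario).
    """
    if scenario == "ALL":
        # Use the whole external corpus, regardless of demographic label.
        return [doc["text"] for doc in documents]
    if scenario == "DEMOGRAPHIC-SPECIFIC":
        if group is None:
            raise ValueError("a demographic group is required for this scenario")
        # Keep only the samples written by the target demographic group.
        return [doc["text"] for doc in documents if doc["group"] == group]
    raise ValueError(f"unknown scenario: {scenario!r}")
```

The selected texts would then be passed to whatever masked language model training loop is in use; that step is outside the scope of this sketch.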
Fine-tuning Only (tBERT). BERT-based fine-tuning has had significant success, due to the ease of implementation and the performance gains reported for various NLP tasks (Liu and Lapata, 2019). We fine-tune BERT for sequence classification by adding a classification layer, where the input is the response represented by the hidden vector of the [CLS] token, and the output is the prediction for empathy (distress). We train on generic data and on demographic-specific portions, and compare the performances to study the demographic effect on empathy (distress) prediction.
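To make the role of the added classification layer concrete, the following sketch applies a single dense layer followed by a softmax to a [CLS] vector. It is a plain-Python illustration under the assumption of one dense layer with two output classes; the function and parameter names are hypothetical, and in practice the weights would be learned jointly with BERT during fine-tuning.

```python
import math


def classify(h_cls, weights, bias):
    """Dense layer + softmax over a [CLS] vector.

    h_cls:   list of floats (the sequence representation).
    weights: one row of floats per class, each as long as h_cls.
    bias:    one float per class.
    Returns a list of class probabilities.
    """
    logits = [
        sum(w * x for w, x in zip(row, h_cls)) + b
        for row, b in zip(weights, bias)
    ]
    # Subtract the maximum logit for numerical stability before exponentiating.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The predicted class is the index of the largest probability, for example `probs.index(max(probs))`.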
Multi-task Fine-tuning (tBERT-MT). We fine-tune BERT in a multi-task learning (MTL) setup for classification, where the tasks under consideration are empathy (distress) classification and demographic attribute prediction. Both tasks share the BERT layers, while the classification heads containing the final dense and softmax layers are specific to each task. We replace the final dense and softmax layers in the tBERT setup with multiple classification heads based on the number of tasks. We experiment with (1) Alternative training: in each epoch, we cyclically train only one classification head, freezing the parameters of the remaining heads; and (2) Parallel training: in each epoch, we train the model end-to-end on the joint loss from all the classification heads.

Table 1: Qualitative examples of high empathy (above) and high distress (below) with scores on empathy and distress dimensions as predicted by our tBERT-C (fnn) model.
Response: "This is just crazy, you have to feel for the mother, but at the same time what kind of apartment has that many violations and is still not punished. They need to sue them and anybody involved with this. I can't believe that in today's society that tragedies like this are tolerated. Somebody needs to go to jail for the death of this little girl and the injuries that her mother suffered. I can't imagine what the mother is going through and she probably blames herself. Things like this should just not happen."
Annotator: Male, Age ≥ 35, Education ≥ Bachelors, Income ≤ $50,000
Score: 0.82

Explicit Demographic Knowledge. PT, tBERT and tBERT-MT intrinsically infuse demographic information. We also incorporate this explicitly by concatenating a demographic vector $\vec{d}$ to the output of the global average pooling layer (Lin et al., 2013) from tBERT or tBERT-MT (concatenation in Figure 1) in two ways.
(1) tBERT-[MT]-C: $\vec{d}$ is a d-dimensional one-hot encoding vector (d: number of demographics).
(2) tBERT-[MT]-C (fnn): $\vec{d}$ is the output of a feedforward neural network (FNN), the input for which is a one-hot encoding vector. Three dense layers are stacked before the task-specific heads, and this model is trained end-to-end for empathy (distress) prediction.
In tBERT-MT, where one of the task heads predicts a demographic attribute, the corresponding binary value in $\vec{d}$ is removed. To assess the contribution of specific attributes, we also propose to concatenate a 1-bit encoding (tBERT-[MT]-C (attribute)) for each given attribute.
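The explicit concatenation and the two multi-task training schedules can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the attribute order in `ATTRIBUTES`, all function names, and the use of an unweighted sum as the joint loss are assumptions made here.

```python
# Attribute order is an assumption; the text only lists the attributes used.
ATTRIBUTES = ("gender", "age", "education", "income")


def demographic_vector(classes, predicted=()):
    """One binary entry per attribute, dropping any attribute that a
    multi-task head is asked to predict."""
    return [float(classes[name]) for name in ATTRIBUTES if name not in predicted]


def concat_features(pooled, classes, predicted=()):
    """tBERT-[MT]-C style input: pooled text vector followed by the
    demographic vector."""
    return list(pooled) + demographic_vector(classes, predicted)


def head_for_epoch(epoch, heads):
    """Alternative training: cycle through the heads, one per epoch.
    The remaining heads would be frozen for that epoch."""
    return heads[epoch % len(heads)]


def joint_loss(head_losses):
    """Parallel training: combine per-head losses into one objective.
    An unweighted sum is assumed here."""
    return sum(head_losses.values())
```

For example, with heads for empathy, distress, and gender, `head_for_epoch` would select empathy in epoch 0, distress in epoch 1, gender in epoch 2, and empathy again in epoch 3, while `concat_features(..., predicted=("gender",))` would omit the gender entry from the concatenated vector.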
We model empathy (distress) prediction as a binary classification task. To study the efficacy of empathy (distress) in predicting demographic attributes, we also conduct experiments for empathy (distress)-aware demographic attribute prediction. Such a prediction can be used for further demographic removal from text to mitigate adversarial attacks and protect the privacy of users (Elazar and Goldberg, 2018).
Implementation Details. (1) Cross-domain Pre-training: We use the Blog Authorship Corpus (Schler et al., 2006), which consists of 681,288 blog posts and the self-provided demographic attributes (gender, age, industry, and astrological sign) of the corresponding 19,320 bloggers, to further pre-train BERT. Of these, we use the gender attribute for the male-specific and female-specific pre-training experiments. We train the model on the Masked Language Model task (Taylor, 1953) for 10 epochs using a learning rate of 3e-5. (2) Fine-tuning: We train the model end-to-end (110M parameters) using binary cross-entropy loss and the decoupled weight decay Adam optimizer (Loshchilov and Hutter, 2017), in batches of 32. The best performance is observed when the maximum input sequence length is set to 150, the learning rate to 3e-5, and the number of epochs to 3. (3) Explicit Demographic Attributes: We use the gender, age, education, and income attributes corresponding to each annotator in the empathy dataset. The d-dimensional vector has size 4, resulting in a 16-d FNN output.
Evaluation metrics. We use five-fold cross validation (five random shuffled restarts) with 80-20 train-test proportions, and report the F1 and accuracy (Ac) averaged across the 5 runs on the test set.
Baselines. We compare our model against the Random Forest (RF) model with GloVe embeddings (Pennington et al., 2014) for text and demographic attributes (excluding the prediction attribute) as one-hot vectors as features.
We also report performance against deep learning baselines (CNN (Kim, 2014), biLSTM, and biLSTM with Attention) and the pre-trained BERT without further training.
Table 2 shows the accuracies using BERT for pre-training (PT), fine-tuning (tBERT), and both (PT + tBERT) for gender-specific empathy (distress) prediction. On the M and F test sets, models trained on the same demographic subset (M or F) outperform those trained on the opposite subset or on A_s. The accuracies of plain BERT are 48.37, 49.49, and 50.42 on the A_s, M, and F test sets respectively for empathy prediction. tBERT outperforms all other variants. The results support our hypothesis that empathy is dependent on and influenced by the gender associated with the author. We note similar patterns for age, income, and education (Table 3).
Table 4 shows results for empathy (distress) prediction using tBERT-[MT]-[C (fnn/attribute)] variants trained on the full dataset. In the notation, we replace [MT] with the heads on which the multi-tasking is performed. For example, tBERT-MT-(E+D)-G-C denotes fine-tuned BERT with empathy prediction, distress prediction, and gender prediction multi-tasking heads, with demographic information concatenated to the text representation directly before classification. (In the models where a demographic attribute prediction is involved, we remove that attribute from the demographic vector.) We report performances on demographic-wise test sets (A, M, F).
Insights: (1) tBERT variants with a single training objective outperform all baselines. (2) Performance of tBERT-MT varies with the affect dimension. Empathy prediction shows a marginal loss in performance with explicit concatenation (tBERT-C) and a further loss in the multi-task setup. (3) For distress, introducing gender as the demographic attribute shows an observable improvement across different test sets. (4) A similar trend is observed for age.
Table 5 shows the performance of age and gender prediction with empathy (distress)-aware models on affect-wise test sets (Empathy (Em) and Distress (Dist)). Empathy-aware gender prediction models show consistent improvement over baselines, with tBERT (G) reporting the best score when tested on the complete dataset and the empathy-specific test set. tBERT (A) helps improve the accuracies for age prediction by at least 5% over baselines for the complete (All) test set. For the empathy-specific test set, the best results are observed with MTL (tBERT-MT-(E+D)). We infer that while having affect-aware demographic prediction models does improve performance over fine-tuned models, they may also lead to a marginally negative impact.
The overall inference from the above experiments is that demographic-aware models aid affect predictions, but the reverse relationship is much weaker. End-to-end training across a variety of train sets and demographic attributes establishes that the variance observed in language preferences and expressions has an impact on the manner of expressing empathy and distress in reactions.
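The evaluation protocol used for these results (five randomly shuffled 80-20 train-test splits, with F1 and accuracy averaged over the five runs) can be sketched as follows. The function names, the fixed seed, and the choice of positive-class (binary) F1 are assumptions for illustration; `fit_predict` stands in for any model that trains on the training indices and returns predictions for the test indices.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """F1 of the positive class (assumed variant; the averaging mode is
    not specified in the text)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def shuffled_splits(n, runs=5, test_frac=0.2, seed=0):
    """Yield (train_indices, test_indices) for each random restart."""
    rng = random.Random(seed)
    n_train = round(n * (1 - test_frac))
    for _ in range(runs):
        indices = list(range(n))
        rng.shuffle(indices)
        yield indices[:n_train], indices[n_train:]


def evaluate(labels, fit_predict, runs=5, seed=0):
    """Average F1 and accuracy on the test portion across all runs."""
    f1_scores, accuracies = [], []
    for train_idx, test_idx in shuffled_splits(len(labels), runs, seed=seed):
        y_true = [labels[i] for i in test_idx]
        y_pred = fit_predict(train_idx, test_idx)
        f1_scores.append(f1_binary(y_true, y_pred))
        accuracies.append(accuracy(y_true, y_pred))
    return sum(f1_scores) / runs, sum(accuracies) / runs
```

Because the splits are independent random shuffles, a response may appear in the test portion of more than one run, which differs from a partition-based k-fold scheme.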

Conclusion
We proposed a novel demographic-aware empathy prediction framework based on fine-tuning and multi-tasking using BERT, showed that it surpasses existing methods, and illustrated the impact of demography in modeling subjective phenomena such as empathy and distress. Our framework is generalizable, and we extended it to empathy-aware demography prediction, and showed that empathy also improves demographic prediction. We believe this is a significant checkpoint towards developing models for empathy (distress), and tapping into demographic information while doing so.