Modelling Context Emotions using Multi-task Learning for Emotion Controlled Dialog Generation

A recent topic of research in natural language generation has been the development of automatic response generation modules that can respond to a user's utterance in an empathetic manner. Previous research has tackled this task using neural generative methods by augmenting the input sequences with emotion classes. However, the outputs of these models may be inconsistent. We employ multi-task learning to predict the emotion label and to generate a viable response for a given utterance using a common encoder with multiple decoders. Our proposed encoder-decoder model consists of a self-attention based encoder and a decoder with a dot-product attention mechanism to generate a response with a specified emotion. We use the focal loss to handle the imbalanced data distribution, and utilize the consistency loss to allow coherent decoding by the decoders. Human evaluation reveals that our model produces more emotionally pertinent responses. In addition, our model outperforms multiple strong baselines on automatic evaluation measures such as F1 and BLEU scores, thus resulting in more fluent and adequate responses.


Introduction
One of the key skills for dialogue agents in a dialog system is to acknowledge the feelings of the user and respond accordingly. It is quite instinctive for humans to identify and understand other people's emotions, but it is hard for Artificial Intelligence (AI) systems, owing to the lack of representative, publicly available data sets for training and evaluating an intelligent and robust dialog management system. Table 1 shows an example of an emotion-labelled conversation from the dataset. The example shows how two different emotionally inclined responses can lead a conversation in two different directions.

Agent 1 (Curious): Do you like wearing hats? It has so many functions.
Agent 2 (Neutral): I don't like them on myself but I know a lot of people that can pull them off.
Agent 1: Yes me as well. In the military hats denote a nationality, branch of service, rank or regiment.
Table 1: An example of an emotion-labelled conversation from the dataset.

An engaging conversation usually involves empathetic responses by the conversing partners, and these responses can carry varied emotion labels. It is important for any dialog agent to capture the user's affective information in order to build an intelligent and socially engaging open-domain chatbot. When learning new tasks, we often apply the knowledge we have acquired from learning similar tasks. For instance, in Table 2 the context history has several utterances labelled with the Happy, Fearful, Disgusted and Curious to dive deeper emotions, and the target responses are labelled with the Fearful and Happy emotions. Context emotion can play an important role in transferring the target style while predicting the responses. The words terrified, scary, afraid and like can help in generating responses with the given target emotion labels. An auxiliary task of emotion classification can thus help to improve the main task of text generation.
In prior research, neural-network-based models handled emotion-controlled generation either by appending the target emotion label (Zhou et al., 2018; Zhou and Wang, 2017; Wang and Wan, 2018; Hu et al., 2017; Huang et al., 2018; Logeswaran et al., 2018; Song et al., 2019) or by using emotion embeddings (Asghar et al., 2018) in addition to the input sentence representation. Although label information is effective, it still seems underutilized for effective response generation. Wang and Wan (2018) showed sentiment transfer using discriminator networks.
We hypothesize that emotional construct in a conversation can be formed by focusing on specific words in a dialog. To acknowledge the presence of annotated emotion labels in a multi-turn conversation, we perform emotion analysis of user utterances as an auxiliary task for open-domain dialogue generation. Our objective is to generate responses according to the target emotion style. Specifically, if we want to choose words that can provide information about the emotion of a sentence, we exploit an emotion classification model to govern the selection strategy. We train a self-attention (Vaswani et al., 2017) based encoder to compute the context features in a dialog. Words with higher attention weights are selected to be in the set of selections while decoding the response.
In this work, we propose to apply multi-task learning to leverage emotion information for open-domain response generation. Multi-task learning allows the encoder to learn the common and prominent features in the input sequence. Our emotion-incorporated weights achieve a good balance between language fluency and emotion quality in model responses. We utilize focal loss (Lin et al., 2017) for emotion classification to address the imbalanced structure of the emotion distribution in the dataset. Furthermore, to attain better attention scores, we compute a consistency loss in order to preserve the attention performance of the individual tasks. Our empirical study does not show performance degradation in language fluency while classifying emotion-rich sequences.
We evaluate our proposed model on the Topical Chat dataset (Gopalakrishnan et al., 2019). We design human evaluation to score the following three metrics, viz. fluency, adequacy and emotional accuracy of the generated response. The human evaluation results indicate that our model improves not only the fluency and adequacy scores but also the emotional accuracy scores. In addition, we conduct automatic evaluation on the topical chat dataset. The automatic evaluation results show that our method improves significantly on the F1 and BLEU metrics.
The key contributions of our current work are summarized as follows:
1. We propose an effective deep multi-task framework that jointly performs emotion classification and response generation.
2. To handle the imbalanced data distribution, we use Focal Loss (Lin et al., 2017) instead of the regular cross-entropy loss for emotion classification of utterances.
3. To maintain uniformity between the attention weights of the different tasks, we utilise consistency loss (Nishino et al., 2019) in addition to the original task-specific losses.

Related Work
Early representative works were mostly based on hand-crafted rules (Skowron, 2010; Polzin and Waibel, 2000) for generating responses with a specific emotion. Although rule-based approaches show high accuracy, they often fail to handle complex emotions, especially for large corpora. In (Prendinger and Ishizuka, 2005), computational experiments established that empathetic agents ensure good communication. Ochs et al. (2008) designed an empathetic virtual agent that can express emotions based on cognitive appraisal theories, which requires an extensive hand-crafted rule base. In recent years, there has been an emerging research trend in end-to-end neural-network-based generative conversational systems (Vinyals and Le, 2015; Shang et al., 2015). To improve the content quality of neural conversational models, many techniques have been proposed, such as improving response diversity using Conditional Variational Autoencoders (CVAE) (Zhao et al., 2017) and encoding commonsense knowledge using an external facts corpus (Ghazvininejad et al., 2018).
By expressing emotions, people show mutual respect, empathy and understanding to each other, and thus improve the relationship between them. The Emotional Chatting Machine (ECM) (Zhou et al., 2018) extended the basic encoder-decoder architecture using three mechanisms, viz. emotion category embedding, internal emotion memory, and external memory, in order to generate sequences with a particular emotion label. Affect transfer in text using Recurrent Neural Networks (RNNs) (Ghosh et al., 2017) and text generation using emojis as the target labels (Zhou and Wang, 2017) were proposed for the controlled generation of text. The research reported in (Niu and Bansal, 2018; Golchha et al., 2019) introduced state-of-the-art techniques for stylistic transfer of user behaviour, such as courteousness (e.g. polite, rude or neutral). Li et al. (2019) proposed an empathetic dialogue system (EmpGAN) based on adversarial learning, comprising a multi-resolution empathetic generator along with two interactive discriminators. Song et al. (2019) presented an attention framework based on emotion lexicons. Colombo et al. (2019) generated affect-driven dialogues using emotion embeddings and affective sampling methods. Various techniques that can capture the user's emotional state for empathetic response generation were developed in (Asghar et al., 2018; Lubis et al., 2018). An affective attention-based model coupled with a weighted cross-entropy loss was proposed by Zhong et al. (2019) for affective dialogue generation. Lin et al. (2020) built an empathetic chatbot which fine-tunes a Generative Pre-trained Transformer (GPT) with multiple objectives: response language modeling, response prediction, and dialogue emotion detection.
Multi-task learning with deep neural networks, which learn from different related tasks, has achieved remarkable success in improving the performance of many natural language processing (NLP) tasks (Luong et al., 2015a; Hashimoto et al., 2016; Liu et al., 2019). A multi-task learning framework usually consists of an encoder which is shared across multiple tasks to learn a common set of shared features. Moreover, the encoder learns to focus more on important and desirable features, and ignores redundant and noisy features (Ruder, 2017). Rashkin et al. (2018) proposed a new dataset with ∼25k conversations for empathetic dialogue generation. The conversations in that dataset are prepared for a single given emotion label. In contrast, our model handles a dataset which has a different emotion label for every utterance in a dialog. To the best of our knowledge, there is no existing work that proposes a multi-task learning architecture for heterogeneous emotions in a conversation.
In our current work, we propose a multi-task framework with a shared multi-head self-attention based hierarchical encoder for response generation and emotion classification. We also utilize focal loss for emotion classification. Additionally, we incorporate a consistency-based loss to enable persistent output generation in our multi-task architecture. The experiments are performed on the knowledge- and emotion-grounded Topical Chat dataset (Gopalakrishnan et al., 2019), containing a significant amount of human-human conversations in an open-domain setting. Our approach tends to produce adequate responses.

Problem Statement
In this work, we aim to produce emotion-controlled responses for multi-turn conversations using the relevant context knowledge and emotion labels. Let U = u^(1), ..., u^(k), ..., u^(K) denote the set of K utterances of our multi-turn conversation, and let w^(k)_1, ..., w^(k)_I denote the I words of the k-th utterance u^(k). Each utterance is annotated with an emotion label, i.e., E = e^(1), ..., e^(k), ..., e^(K). Hence, our task is to generate a response y = y_1, y_2, ..., y_m with m words, given the set of previous k context utterances and their emotion labels.

Encoder
The encoder is used to transform the input utterance into a hidden representation q^(k). The embedding e(w^(k)_i) of the current word and the positional embedding PE(i) are combined and fed as input to the encoder. The combined embedding representation is subsequently passed into the Gated Recurrent Unit (GRU) model (Cho et al., 2014), which encodes the input utterance and yields the relevant features.
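As a concrete illustration of this encoding step, the following minimal NumPy sketch sums the word and positional embeddings and runs them through a single GRU cell. The parameter layout and the absence of bias terms are simplifying assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, params):
    """One GRU step (Cho et al., 2014): gates decide how much of the
    previous hidden state h to keep versus overwrite with new content."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1.0 - z) * h + z * h_tilde

def encode_utterance(word_embs, pos_embs, params, hidden_dim):
    """Combine word embedding e(w_i) and positional embedding PE(i),
    then run the GRU over the utterance; the final hidden state
    serves as the utterance representation q^(k)."""
    h = np.zeros(hidden_dim)
    for e_w, e_p in zip(word_embs, pos_embs):
        h = gru_cell(e_w + e_p, h, params)
    return h
```

Because the state is always a gated blend of the previous state and a tanh candidate, each component of q^(k) stays in (-1, 1).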

Context-level Encoder:
We use a GRU network to encode the previous context of utterances in a multi-turn conversation. The initial state of the decoder GRU is initialised with the final hidden state of the context GRU.

Decoder:
Intuitively, this layer takes what we have decoded so far, h_{d,t-1}, and all of what we have encoded, q^(k), to produce a vector of attention weights a^(k)_t that signifies the most important words in the source sentence for correctly decoding y_{t+1}. We calculate the energy e^(k)_{ij} between them by concatenating them together and passing the result through a linear layer (attn) and a tanh activation function. The desired conditioning on the previous utterances (the context history) is obtained by initializing the hidden state of the GRU decoder with the final hidden state of the context GRU.
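The energy computation just described can be sketched as follows. The vector `v` that reduces the tanh output to a scalar energy is our assumption about the unstated final projection, so this is an illustrative sketch rather than the exact model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_weights(h_dec, enc_states, W_attn, v):
    """Score each encoder state q_i^(k) against the current decoder
    state h_{d,t-1}: concatenate the pair, project it through a
    linear layer (`attn`), squash with tanh, then reduce to a scalar
    energy e_i.  A softmax over the energies yields the attention
    weights a_t^(k)."""
    energies = []
    for q_i in enc_states:
        concat = np.concatenate([h_dec, q_i])
        energies.append(v @ np.tanh(W_attn @ concat))
    return softmax(np.array(energies))
```

The resulting weights are non-negative and sum to one, so they can directly form the context vector as a weighted sum of encoder states.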

Multitasking Dialog Generation and Emotion Recognition
We perform multi-tasking using a shared encoder layer for encoding the input sequences and two decoder layers for utterance prediction and classification. Figure 1 gives an overview of our proposed model.
Shared encoder: We use the encoder from Section 3.2.1, which converts the input sequence into hidden vectors q^(k) that are shared across the multiple tasks.
Classifier: The classifier transforms the shared representation from the encoder into the emotion class probability distribution p.
Decoder: We employ a GRU-based decoder which takes the hidden representation from the shared encoder and generates a response y = y_1, y_2, ..., y_m comprising m words.

Focal Loss
Focal Loss (Lin et al., 2017) is employed to address the imbalance between the emotion classes during training. We use focal loss as a replacement for the cross-entropy loss for emotion recognition. It is defined in Eq. 14, where p_t is the model's estimated probability for the true class and γ is a focusing parameter:

FL(p_t) = −(1 − p_t)^γ log(p_t) (14)
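A minimal sketch of the focal loss for a single classification example, following Lin et al. (2017); the optional class-balancing weight α_t from the original paper is omitted here.

```python
import math

def focal_loss(probs, target, gamma=2.0):
    """Focal loss (Lin et al., 2017): scales cross entropy by the
    modulating factor (1 - p_t)^gamma, down-weighting well-classified
    examples so training focuses on hard, rare emotion classes.
    `probs` are softmax outputs; `target` is the true class index."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

Setting gamma to 0 recovers the standard cross-entropy loss; larger gamma shrinks the loss of confident predictions far more than that of misclassified ones.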

Consistency Loss
We use the "consistency loss" (Nishino et al., 2019) to reduce the difference between the attention weights of the different tasks. Attention agreement favours emotional words while decoding the responses. The consistency loss between two different tasks p and q is defined as:

L_cl = Σ_k Σ_{i,j} |a^(k)_{p,ij} − a^(k)_{q,ij}|_+

where a^(k)_{p,ij} is the attention weight over the k-th utterance for the p-th task. To compare the two attention weights, a ramp function |x|_+ = max(x, 0) is used.
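A hedged sketch of how such a consistency term might be computed over the attention maps of the two tasks; the exact pairing and normalisation in Nishino et al. (2019) may differ from this simplified version.

```python
import numpy as np

def ramp(x):
    """Ramp function |x|_+ = max(x, 0)."""
    return np.maximum(x, 0.0)

def consistency_loss(attn_p, attn_q):
    """Penalise disagreement between the attention maps of the two
    tasks (response generation and emotion classification), summed
    over every utterance k and every position pair (i, j).  Each
    element of `attn_p`/`attn_q` is one utterance's attention matrix."""
    return float(sum(ramp(a_p - a_q).sum() for a_p, a_q in zip(attn_p, attn_q)))
```

When the two tasks attend identically the loss is zero, so minimising it pushes the decoders toward a shared notion of which words matter.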

Training: Dialog generation
We denote the negative log-likelihood loss for dialog generation by L_2.
The overall loss function for our proposed model is calculated as the sum of the losses from the two tasks and the consistency loss:

L = L_1 + L_2 + L_cl

where L_1 and L_2 signify the losses of the emotion classification and dialog generation tasks, and L_cl indicates the consistency loss.

Dataset
We perform our experiments on the knowledge and emotion grounded Topical Chat dataset (Gopalakrishnan et al., 2019) with ∼11K dialogues. It is a multi-turn conversational dataset in which every utterance is annotated with an emotion label. There are a total of eight emotions (angry, disgusted, fearful, sad, happy, surprised, curious to dive deeper, and neutral) in the dataset. The data is split into 5 distinct groups: Train, Valid Frequent, Valid Rare, Test Frequent, and Test Rare. The frequent set contains conversations on entities frequently seen in the training set. The rare set contains conversations on entities infrequently seen in the training set. Table 3 provides the details of the dataset.

Baselines
In order to prove the usefulness of our model, we compare it with the following baselines:
1. HRED: This baseline is based on the hierarchical encoder-decoder model of Serban et al. (2015, 2016). In this model, the encoder RNN encodes the words of the utterances, and the context RNN encodes the dialog history.
2. HRED-A: We apply word-level attention (Luong et al., 2015b) to the encoder of the HRED model to capture the important words of the input sequence.
3. HRED-SA: Another extension of the generative hierarchical Seq2Seq model, with a self-attention mechanism on the encoder which takes the dialog conversations as input.
4. EmoHRED-A-FL-CL: We extend the HRED-A model to EmoHRED-A-FL-CL, a deep multi-task learning framework that jointly performs the tasks of response generation and emotion analysis. We add the focal loss and consistency loss to the existing task-specific losses.
To prove the effectiveness of the consistency loss in EmoHRED-SA-FL-CL, we conduct an ablation study by removing the consistency loss from the EmoHRED-SA-FL-CL model. We name this model EmoHRED-SA-FL. We also show the strength of the focal loss by eliminating FL from the EmoHRED-SA-FL model. The resulting model is named EmoHRED-SA.

Experimental Setup
For the HRED model, we use a single layer bidirectional GRU (Cho et al., 2014). We extend the HRED model to HRED-A using the global attention mechanism (Luong et al., 2015b) at the encoder. For our proposed self-attention-based model, the number of encoder and decoder layers is set to 2 and the number of attention heads is 8 with the filter size equal to 2048. Word embedding dimension is chosen as 300, hidden dimension is set to 300. For the generator, we use the ADAM optimizer (Kingma and Ba, 2014) whose learning rate is fixed to 0.0001. While decoding the responses we use beam search with beam size set to 4.
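Since decoding uses beam search with beam size 4, the following self-contained sketch illustrates the procedure. Here `step_fn` stands in for the model's next-token log-probability function and is purely illustrative; it is not part of the described system.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_size=4, max_len=20):
    """Breadth-limited decoding: at every step keep only the
    `beam_size` partial responses with the highest cumulative
    log-probability, and return the best finished sequence."""
    beams = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:
                finished.append((seq, score))
                continue
            for tok, logp in step_fn(seq):   # model's next-token log-probs
                candidates.append((seq + [tok], score + logp))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]
```

With beam size 1 this reduces to greedy decoding; widening the beam trades computation for a better approximation of the most probable response.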

Evaluation Metrics
Automatic Evaluation: We utilise well-known metrics for evaluating generated sequences, such as BLEU (Papineni et al., 2002), F1, perplexity (PPL) (Vinyals and Le, 2015) and n-gram diversity (Div.) (Gopalakrishnan et al., 2019).
1. Perplexity: We define perplexity in Equation 18. It measures how well a model can predict human responses; lower is better. We report perplexity values on our frequent and rare test sets. N is the total number of samples in the test set and N_w is the total number of tokens in the entire test set.

PPL = exp{ −(1/N_w) Σ_{n=1}^{N} log P(y^(n) | U^(n)) } (18)

2. BLEU: To evaluate the predicted responses we compute the BLEU score, a word-based metric which performs n-gram matching against the ground-truth responses.
3. F1: We compute the unigram F1-score between the model predictions and the ground-truth responses.
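The unigram F1-score can be computed as in the following sketch; whitespace tokenisation is an assumption here, and the actual evaluation script may tokenise differently.

```python
from collections import Counter

def unigram_f1(prediction, reference):
    """Harmonic mean of unigram precision and recall between a
    predicted response and its ground-truth reference."""
    pred, ref = prediction.split(), reference.split()
    # Multiset intersection: each shared word counts at most as often
    # as it appears in both responses.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A perfect match scores 1.0, disjoint responses score 0.0, and over-long responses are penalised through precision.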

N-gram diversity:
We evaluate the informativeness and diversity of sentences using n-gram diversity, defined in Eq. 19, where the counts are accumulated over all M samples in the test set. The results are shown under the columns Div. (n=1) and Div. (n=2) in Table 5 for the frequent and rare test sets.

Div.(n) = (# unique n-grams) / (# words in the predicted responses) (19)
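A small sketch of the Div.(n) computation defined above, again assuming whitespace tokenisation:

```python
def distinct_n(responses, n):
    """Div.(n): number of unique n-grams divided by the total number
    of words across all predicted responses in the test set."""
    ngrams, total_words = set(), 0
    for resp in responses:
        tokens = resp.split()
        total_words += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / max(total_words, 1)
```

Repetitive responses shrink the unique n-gram set while still adding words, so heavy repetition directly lowers the score.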
Human Evaluation: To measure the quality of the generated text from a human perspective, we randomly sample 100 conversations from each model and with the help of two experts with postgraduate exposure we evaluate the predicted responses using the following metrics: (i) Fluency: It is used to measure the grammatical correctness. (ii) Adequacy: It is used to measure contextual relevancy of the predicted response. (iii) Emotional Accuracy (EA): It checks how accurately one can infer the target emotion in the predicted response.
We assign scores in {0, 1, 2} (representing "wrong", "acceptable" and "perfect") to indicate the level of fluency and adequacy of the responses. We measure emotional accuracy on a 0-1 scale, with '0' indicating the incorrect emotion and '1' the correct emotion. We compute the Fleiss' kappa (Fleiss, 1971) score to measure inter-annotator agreement, obtaining kappa scores of 0.90, 0.75 and 0.76 for fluency, adequacy and emotional content, respectively, denoting "good agreement".

Results and Analysis
We present the results for all our experiments in this section. Detailed results using both the automatic and human evaluation methods are shown in Table 5.

Automatic evaluation results
In Table 5, we observe that the proposed model has high unigram and bigram diversities, demonstrating that it learns to decode fluent and informative responses with great diversity. Owing to the good Div. (n=1) and Div. (n=2) scores, we observe relatively fewer repeated segments in the responses generated by our proposed model. We also observe significant improvements in the BLEU and F1-scores when compared with the baseline models, which supports our multi-task learning architecture. Our proposed model appears to exploit multi-task learning and effectively utilize the emotion labels associated with each utterance.
We also perform an ablation study to better understand the contributions of the components of our model. As shown in Table 5, after we remove the consistency loss, both the emotion accuracy and the perplexity become noticeably worse, indicating that consistency between attention weights is critical for emotion understanding and generation quality. We also test the importance of the focal loss using the EmoHRED-SA model. As shown in Table 5, eliminating the focal loss causes a significant drop in EA and F1, which justifies our use of focal loss. We perform statistical significance tests between our proposed model and the baselines using the t-test at the 5% (0.05) significance level, and find that the improvement of our model is statistically significant. Table 5 also illustrates that our proposed model outperforms the baseline models in terms of fluency, adequacy and emotion quality. Owing to the good fluency score of our proposed model, we observe less copying of sentences from the input utterance into the predicted response. The increase in adequacy scores w.r.t. the baseline models verifies that the responses generated by the proposed model are more relevant, and the emotional content score indicates that the generated responses are more in line with the emotional sensitivity of the sentences.

Human evaluation results
In Table 6, we present a few examples of the responses generated by one of the baseline models (HRED) and our proposed model given the desired emotion. As shown in the table, the EmoHRED-SA-FL-CL model mostly predicts adequate and emotionally relevant responses compared to the baseline HRED model. For the fourth utterance, even though the HRED model gives an emotionally relevant reply, it is highly inadequate with respect to the context, whereas the EmoHRED-SA-FL-CL model responds with an emotionally as well as contextually relevant reply. Detailed examples with outputs from all of our baseline and proposed models with the required emotion label can be found in the appendix in Table 7.

Error Analysis:
In this section, we report the most commonly occurring errors that our proposed and baseline models encounter.
1. Common phrases: Some common phrases are repeated across generated responses, for instance 'i don't think i've ever heard about it though', 'i don't know much about it so i don't know much about it either.' and 'i'm not sure either. i've never been there'. Due to data scarcity and limited diversity in the data, the models may only have learned to predict the most frequent utterances. Since the dialogues are inherently ambiguous, predicting them accurately would require more data.

2. Repetition: The proposed model (EmoHRED-SA-FL-CL), in a few cases, keeps repeating information within the predicted response, e.g., Predicted Response: 'that's terrible. i'll have to check that out. i'll have to check it out!'. This lowers the count of unique unigram words in the generated response and hence the F1-score.
3. Emotional inconsistencies: In some cases, the proposed model (EmoHRED-SA-FL-CL) is unable to produce responses with particular emotion labels due to the low frequency of instances from those classes (angry, sad, fearful and disgusted). These less frequent emotion classes get confused with the recurring classes curious to dive deeper and surprised. Also, instances from the 'Happy' and 'Surprised' emotion classes get mixed up with each other. For example, in Table 6, the target response for Utterance 5 should have the emotion 'Happy', but it gets confused with the emotion 'Surprised' and an irrelevant response is generated. Table 4 shows the distribution of the emotion classes present in the dataset.
More detailed examples can be found in Table 8 and Table 9 of the Appendix.

Table 6: Generated examples from a continuous conversation in the frequent test set. EmoHRED-SA-FL-CL and HRED predict responses using the previous set of utterances and emotion labels.

Comparisons to the state-of-the-arts
The existing transformer-based baseline did not take the structure of the context into consideration; instead, it simply concatenated the context utterances and passed them as a single sequence into the transformer model. We observe a significant improvement in the F1-score for our proposed model: we achieve scores of 0.23 / 0.19 on the frequent / rare test sets for our task of emotion-controlled dialog generation. We also adopt ECM (Zhou et al., 2018) for comparison, a Seq2Seq model that first proposed generating emotional responses using emotion category embeddings along with internal and external memory mechanisms. We concatenate the dialog history into a long sequence and feed it as input to the model. Evaluation shows an F1-score of 0.14 / 0.13 and a BLEU score of 1.9 / 1.6 on the frequent / rare test sets. Our model clearly outperforms the baselines by a large margin.

Conclusion and Future Work
In this paper, we have proposed a new deep learning framework for modeling emotion-grounded conversations using emotion labels as the guiding attributes. Building an emotion-aware conversational agent is crucial for enhancing user interactions with long, engaging conversations. Extensive experiments show that the predicted responses exhibit high levels of emotional accuracy and content adequacy. We have also provided details of the different kinds of errors found, in Section 5.3. In general, we show how a related task of emotion recognition, along with appropriate loss functions, can ensure the emotional relevancy of the generated response and improve user engagement.
In the future, we intend to use pre-trained language models for the task of dialog generation using emotion labels. We also aim to extend our model to handle knowledge-grounded conversations.

A Samples Generated by the Proposed Model
A.1 Predicted responses when we have different emotion labels for every utterance.

A.1.1 When the emotions occur frequently
We observe that the predicted responses, as shown in Table 7, tend to follow the target emotions accurately; however, they may sometimes lack adequacy.

A.1.2 When the emotions occur rarely
We observe that the model, as shown in Table 8, fails to generate adequate as well as emotionally relevant responses for these cases.
A.2 Predicted responses when we have the same emotion label for every utterance.
We observe that the predicted responses as shown in Table 9 are very close to the ground truth response and are also emotionally accurate.