Emotion Inference in Multi-Turn Conversations with Addressee-Aware Module and Ensemble Strategy

Emotion inference in multi-turn conversations aims to predict a participant's emotion in the next upcoming turn before the participant's response is known, and is a necessary step for applications such as dialogue planning. However, perceiving and reasoning about participants' future feelings is highly challenging, because no utterance information from the future is available. Moreover, it is crucial for emotion inference to capture the characteristics of emotional propagation in conversations, such as persistence and contagiousness. In this study, we investigate emotion inference in multi-turn conversations by modeling the propagation of emotional states among participants in the conversation history, and propose an addressee-aware module that automatically learns whether the participant keeps the historical emotional state or is affected by others in the next upcoming turn. In addition, we propose an ensemble strategy to further enhance model performance. Empirical studies on three benchmark conversation datasets demonstrate the effectiveness of the proposed model over several strong baselines.


Introduction
In this paper, we investigate the task of emotion inference in multi-turn conversations, which aims to explore how the conversation history affects a participant's future emotion, and to predict the participant's emotion in the next upcoming turn without knowing the participant's response yet. An example of the task is shown in Figure 1. Emotion inference is a necessary step for applications such as dialogue planning, dialogue generation, and interpretability (Lin et al., 2008; Hasegawa et al., 2013; Gaonkar et al., 2020). For example, in a human-machine conversation scenario, if a chatbot tries to say something to cheer you up when you feel down, then the chatbot must predict the emotional consequence first, and avoid words that may offend you or elicit negative emotions in you.
Previous studies on emotion analysis in conversations have mainly focused on recognizing the emotion of a given utterance, including bc-LSTM (Poria et al., 2017), DialogueRNN (Majumder et al., 2019), DialogueGCN (Ghosal et al., 2019), and COSMIC (Ghosal et al., 2020), whereas the emotion inference task is to predict the emotion of the next upcoming utterance, for which the utterance in the next turn is not given. Hasegawa et al. (2013) studied a task similar to emotion inference; however, they took only two turns as context and did not consider the multi-party multi-turn scenario. Bothe et al. (2017) and Wang et al. (2020) estimated the sentiment polarity (negative or positive) of the next utterance, while our work anticipates fine-grained emotions such as happy, sad, angry, excited, and frustrated.
Although extensive related work has been conducted, emotion inference in multi-turn conversations remains an understudied and challenging task, due to the lack of utterance information from the future and the complexity of capturing the characteristics of emotional propagation in multi-turn conversations, such as persistence and contagiousness. To address these issues, an addressee-aware module is designed for both a sequence-based and a graph-based model to capture the propagation of emotional states in conversations and automatically learn whether the participant keeps the historical emotional state or is affected by others.
In addition, we propose an ensemble strategy to further enhance the model performance. Since the exact response of the participant in the next upcoming turn is unknown, there may be multiple potential emotional reactions. We run the models with different random seeds to generate multiple candidate results, and then train a fusion classifier to automatically select from the candidates the final result most suitable for the current context and dialogue scene.

Figure 1: A dialogue example from the MELD dataset (Poria et al., 2019). The task of emotion inference in multi-turn conversations is to predict Chandler's emotion in the next upcoming turn (8) based on the previous 7 turns of the dialogue.
The main novelty and contribution of this work is an addressee-aware module for the emotion inference task that models the emotional characteristics and anticipates the emotion trend in multi-turn conversations. Moreover, an ensemble strategy is proposed to further enhance the model performance. Experiments on three benchmark conversational datasets show that our model achieves new state-of-the-art F1 scores.

Task Definition
Given a multi-party multi-turn conversation history D along with the participant information, the emotion inference task aims to infer and anticipate the participant's emotion in the next upcoming turn. Formally, the conversation history D = [(U_1, p^w_1), (U_2, p^w_2), ..., (U_m, p^w_m), p^a_{m+1}] is a sequence of utterances, where U_m is the utterance at time m and consists of N words, i.e., U_m = (w^m_1, w^m_2, ..., w^m_N), and p^w_m is the writer/speaker of utterance U_m at timestamp m. p^a_{m+1} is the addressee/listener in the next upcoming turn m+1.
The task is to infer the addressee p^a_{m+1}'s emotion E^a_{m+1} at time m+1 based on the utterances of the previous m turns along with the participant information, without knowing the utterance information at time m+1 yet:

E^a_{m+1} = f(U_1, U_2, ..., U_m, p^w_1, ..., p^w_m, p^a_{m+1}).    (1)
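To make the setting concrete, the input to the task can be pictured as a short sketch (illustrative only; the names, utterances, and the `task_input` helper are hypothetical, not from any dataset):

```python
# The model sees the m-turn history plus the identity of the next-turn
# addressee; the next utterance U_{m+1} itself is never available.
history = [
    ("I finally fixed the bug!", "Alice"),   # (U_1, p^w_1)
    ("Nice, that took forever.", "Bob"),     # (U_2, p^w_2)
    ("Yeah, I'm so relieved.", "Alice"),     # (U_3, p^w_3)
]
next_addressee = "Bob"  # p^a_{m+1}: whose emotion we infer at turn m+1

def task_input(history, next_addressee):
    """Package the inference input: utterances and speakers, no future turn."""
    return {"utterances": [u for u, _ in history],
            "speakers": [s for _, s in history],
            "addressee": next_addressee}

x = task_input(history, next_addressee)
```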

Methodology
Feature Extraction: First, we employ both a GloVe-based CNN encoder (Kim, 2014; Pennington et al., 2014) and a RoBERTa-Large encoder (Liu et al., 2019) to encode each utterance in the dataset. Following Ghosal et al. (2019, 2020), we fine-tune each encoder for the context-independent utterance-level emotion recognition task on the training set, and then extract the emotional representation of each utterance from the last layer of the encoder, obtaining a 100-dimensional and a 1024-dimensional vector for each utterance from the GloVe-based encoder and the RoBERTa-based encoder, respectively. The encoding process can be simplified as:

u_t = Encoder(U_t), t = 1, 2, ..., m,    (2)

where (U_1, U_2, ..., U_m) is the conversation history, U_t is the utterance at time t, and u_t ∈ R^H is the corresponding utterance representation encoded by CNN/RoBERTa, with H = 100 or 1024.
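The extraction step can be sketched as follows (a minimal NumPy illustration; `encode_history` and `toy_encoder` are hypothetical names, and any callable mapping a string to a fixed-size vector stands in for the fine-tuned CNN/RoBERTa last-layer feature extractor):

```python
import numpy as np

H = 1024  # RoBERTa-Large feature size; the GloVe-CNN encoder uses H = 100

def encode_history(utterances, encoder):
    """Map each utterance U_t to its emotional representation u_t in R^H.

    `encoder` is any callable str -> np.ndarray of shape (H,), standing in
    for the fine-tuned utterance-level encoder."""
    return np.stack([encoder(u) for u in utterances])  # shape (m, H)

# Toy stand-in encoder, for illustration only.
rng = np.random.default_rng(0)
toy_encoder = lambda u: rng.standard_normal(H)
feats = encode_history(["hi", "hello", "how are you"], toy_encoder)
```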

Addressee-Aware Module
To infer and anticipate the participant's emotion, it is important to model the emotion shift in conversations. In this work, we consider two basic characteristics of emotion, persistence and contagiousness (Picard, 1995; Hazarika et al., 2018), as the basis for inferring the participant's emotion.
• Persistence. Participants may be consistently affected by their own mood and keep the existing emotional state for a period of time. For example, if a participant's car breaks down, then the emotion of this participant may be sad for a long period of time in the conversation.
• Contagiousness. The emotional states of participants are interactive, influential and contagious to each other. For example, a sad participant can be encouraged or comforted by others to be happy.
Thus, the addressee p^a_{m+1} either keeps her/his own historical emotional state or is affected by others. In this paper, an addressee-aware module is proposed for both a sequence-based and a graph-based model to capture these two kinds of emotion flow simultaneously.
Sequence-based Model: We first categorize each utterance u_t in the conversation history (u_1, u_2, ..., u_t, ..., u_m) into two types, according to whether the utterance u_t was spoken by the addressee p^a_{m+1} or by others. Two different LSTM units, LSTM_store and LSTM_affect, are then employed to control the different flows of emotional information. Persistence: If the historical utterance u_t at time t was spoken by the addressee p^a_{m+1}, i.e., p^w_t = p^a_{m+1}, then we expect the LSTM_store unit to open the input gate i_t and store u_t into the internal state c^a_t as much as possible. Contagiousness: If the utterance u_t was spoken by someone other than the addressee p^a_{m+1}, i.e., p^w_t ≠ p^a_{m+1}, then we expect that if the utterance u_t is highly contagious and likely to affect the addressee's emotion, the LSTM_affect unit will open the forget gate f_t to forget the addressee's own past state c^a_{t-1} and update the current state c^a_t with the other participant's utterance u_t; otherwise, if the utterance u_t is not contagious, the LSTM_affect unit will close the input gate i_t and carry the addressee's own historical state c^a_{t-1} into the internal state at time t. This process can be formalized as:

λ^a_t = 1 if p^w_t = p^a_{m+1}, and 0 otherwise,
(h^a_t, c^a_t) = λ^a_t · LSTM_store(u_t, h^a_{t-1}, c^a_{t-1}) + (1 − λ^a_t) · LSTM_affect(u_t, h^a_{t-1}, c^a_{t-1}),    (3)

where t = 1, 2, ..., m; u_t ∈ R^H is the utterance feature; (h^a_t, c^a_t) are the hidden state and cell state of the LSTM unit, h^a_t, c^a_t ∈ R^F, F = 100; λ^a_t is the information coefficient at time t; p^w_t is the writer/speaker of the utterance u_t at time t; and p^a_{m+1} is the addressee/listener at time m+1.
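A minimal sketch of this per-speaker routing, in plain NumPy with toy random weights (the `LSTMCell` here is a generic LSTM, not the trained units, and `addressee_aware_encode` is a hypothetical name illustrating only how utterances are routed between LSTM_store and LSTM_affect):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal LSTM cell with toy weights; one instance plays LSTM_store,
    another plays LSTM_affect."""
    def __init__(self, H, F, seed):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((4 * F, H + F)) * 0.1  # i, f, o, g stacked
        self.b = np.zeros(4 * F)
        self.F = F

    def step(self, u_t, h, c):
        z = self.W @ np.concatenate([u_t, h]) + self.b
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c_new = f * c + i * g          # gated cell-state update
        return o * np.tanh(c_new), c_new

def addressee_aware_encode(feats, speakers, addressee, lstm_store, lstm_affect):
    """Route each utterance through LSTM_store if it was spoken by the
    addressee (persistence), otherwise through LSTM_affect (contagiousness)."""
    h = np.zeros(lstm_store.F)
    c = np.zeros(lstm_store.F)
    for u_t, spk in zip(feats, speakers):
        unit = lstm_store if spk == addressee else lstm_affect
        h, c = unit.step(u_t, h, c)
    return h  # h^a_m, fed to the downstream linear classifier

H, F = 8, 4
feats = np.random.default_rng(1).standard_normal((3, H))
h_final = addressee_aware_encode(feats, ["A", "B", "A"], "A",
                                 LSTMCell(H, F, seed=2), LSTMCell(H, F, seed=3))
```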
The last hidden state h^a_m is then fed to a linear classifier to obtain the emotion distribution es^a_{m+1} of the addressee p^a_{m+1} in the next upcoming turn m+1:

es^a_{m+1} = softmax(W_c^T (W_s h^a_m)),    (4)

where h^a_m ∈ R^F, W_s ∈ R^{F×F} is the parameter matrix, W_c ∈ R^{F×C} is the weight of the linear classifier, and C is the total number of emotion categories. es^a_{m+1} ∈ R^C is the final emotion distribution of the addressee p^a_{m+1}.

Graph-based Model: A graph-based model is also proposed to model the conversational data for the emotion inference task. We construct a directed graph for each conversation, G = (g, e, α), with nodes g_t ∈ g, edges e_{m,t} ∈ e, and edge weights α_{m,t} ∈ α between nodes g_m and g_t, where t = 1, 2, ..., m. Each node g_t represents the dialogue state at turn t, and is initialized with the utterance representation u_t through a linear transform layer (Eq 5). The edges between nodes represent the dependencies between dialogue states. In our emotion inference task setting, each node is connected to all the previous nodes (including itself); all the historical information is accumulated into the node g_m based on the edges and edge weights (Eq 6-7), and the emotion of the next upcoming turn is then predicted from g_m (Eq 8). We formally describe this process below.
For t = 1, 2, ..., m, we represent each utterance u_t as a node g_t in the directed graph G through a linear transform layer:

g_t = W_l u_t,    (5)

where u_t ∈ R^H is the utterance feature, g_t ∈ R^F is the node in the graph, W_l ∈ R^{F×H} is the weight of the linear transform layer, and F = 100 is the dimension of the nodes. We then employ two different attention functions, ATT_store and ATT_affect, to compute the edge weight between node g_m and node g_t, analogous to the sequence-based addressee-aware model. If the historical utterance u_t at time t was spoken by the addressee p^a_{m+1}, i.e., p^w_t = p^a_{m+1}, then we employ ATT_store to compute the edge weight between g_m and g_t; otherwise, ATT_affect. The edge weight α^a_{m,t} between node g_m and node g_t can be formalized as:

α^a_{m,t} = softmax(λ^a_t · ATT_store(g_m, g_t) + (1 − λ^a_t) · ATT_affect(g_m, g_t)),    (6)

where α^a_{m,t} represents the attention weight between the nodes g_m and g_t, λ^a_t is the information coefficient at time t, || is the concatenation operation, and W_a ∈ R^F and W_f ∈ R^{2F×F} are the parameter matrices inside the attention functions.
We then update the nodes. The updated node g_m is a linear combination of all the connected nodes with the attention coefficients α^a_{m,t}:

g_m = Σ_{g_t ∈ H_{g_m}} α^a_{m,t} · g_t,    (7)

where H_{g_m} denotes the set of all historical nodes g_t connected with g_m. After updating, all the historical information that contributes to the addressee's emotion is accumulated into the node g_m. The emotion distribution eg^a_{m+1} of the addressee p^a_{m+1} is then obtained:

eg^a_{m+1} = softmax(W_c^T (W_g g_m)),    (8)

where g_m ∈ R^F and eg^a_{m+1} ∈ R^C; W_g ∈ R^{F×F} is the parameter matrix and W_c ∈ R^{F×C} is the weight of the linear classifier.
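The edge-weight computation and node update can be sketched as follows (a NumPy illustration with hypothetical names; simple dot-product scorers stand in for the learned ATT_store/ATT_affect functions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def graph_infer_node(feats, speakers, addressee, W_l, w_store, w_affect):
    """Build nodes g_t = W_l u_t, score each edge (g_m, g_t) with a
    per-speaker scorer, softmax into edge weights, then aggregate."""
    g = feats @ W_l.T              # (m, F) node features
    g_m = g[-1]                    # last dialogue state
    scores = np.array([
        (w_store if spk == addressee else w_affect) @ np.concatenate([g_m, g_t])
        for g_t, spk in zip(g, speakers)
    ])
    alpha = softmax(scores)        # edge weights over all historical nodes
    return alpha @ g               # updated g_m: weighted sum of all nodes

rng = np.random.default_rng(0)
H, F, m = 8, 4, 3
out = graph_infer_node(rng.standard_normal((m, H)), ["A", "B", "A"], "A",
                       rng.standard_normal((F, H)),
                       rng.standard_normal(2 * F), rng.standard_normal(2 * F))
```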

Ensemble Strategy
We denote the sequence-based and graph-based models as DialogInfer-S (Equation 4) and DialogInfer-G (Equation 8). We also integrate the two models, combining the emotion distributions es^a_{m+1} and eg^a_{m+1} into ei^a_{m+1} (Equation 9), and denote this combination as DialogInfer-(S+G). Since the exact response of the participant in the next upcoming turn is unknown, there may be multiple potential emotional reactions. Therefore, the three models above may output different results due to the uncertainty of the emotion inference task. Moreover, even the same model with different parameter initializations may give different results. An ensemble strategy is proposed to address this issue.
We train DialogInfer-S, DialogInfer-G, and DialogInfer-(S+G) five times each with different random seeds to generate 15 candidate results, and then train a fusion classifier to automatically select from the candidates the final result most suitable for the current context and dialogue scene:

ef^a_{m+1} = softmax(W_f^T [es^{a,1}_{m+1} || es^{a,2}_{m+1} || ··· || ei^{a,5}_{m+1}]),    (10)

where ef^a_{m+1} ∈ R^C is the output emotion probability distribution of the ensemble strategy; es^a_{m+1}, eg^a_{m+1}, and ei^a_{m+1} are the output emotion probability distributions of DialogInfer-S, DialogInfer-G, and DialogInfer-(S+G), respectively; the superscripts 1, 2, ..., 5 denote the 5 different random initializations; || is the concatenation operation, stacking the 15 candidate distributions; and W_f ∈ R^15 is the parameter matrix. The ensemble model is denoted as DialogInfer-Ensemble.
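A minimal sketch of the fusion step, assuming the 15 candidate distributions are stacked row-wise and combined with one weight each (`ensemble_fuse` is a hypothetical name, and the random weights here only illustrate the shapes; the real fusion classifier's W_f is trained):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ensemble_fuse(candidates, w_f):
    """Fuse 15 candidate emotion distributions (5 seeds x 3 models).

    candidates: (15, C) matrix, one emotion distribution per row.
    w_f: one learned weight per candidate (W_f in R^15)."""
    assert candidates.shape[0] == w_f.shape[0] == 15
    return softmax(w_f @ candidates)  # fused distribution over C emotions

rng = np.random.default_rng(0)
C = 6  # e.g., 6 emotion classes
cands = np.apply_along_axis(softmax, 1, rng.standard_normal((15, C)))
ef = ensemble_fuse(cands, rng.standard_normal(15))
```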

Datasets
We evaluate our model on three multi-turn conversational datasets: IEMOCAP (Busso et al., 2008), MELD (Poria et al., 2019), and EmoryNLP (Zahiri and Choi, 2018). For more dataset details, please refer to the original papers.

Baseline and State-of-the-art Methods
We compare our model with the following recent neural-network-based methods, modified to fit the emotion inference task: CNN (Kim, 2014) and RoBERTa-Large (Liu et al., 2019) models are trained at the utterance level to infer the emotion class of the next turn. sc-LSTM (Poria et al., 2017) is a simple contextual unidirectional LSTM model. DialogueRNN (Majumder et al., 2019) is an RNN-based model that uses three separate GRU networks to keep track of the individual speaker states. DialogueGCN (Ghosal et al., 2019) uses a relational GCN to model the relations between utterances. COSMIC (Ghosal et al., 2020) is the state-of-the-art model for emotion recognition in conversations, which incorporates different elements of commonsense. All the baseline methods in our experiments use the same input features (Eq 2) as our proposed methods to ensure a fair comparison (300-dimensional pretrained 840B GloVe vectors (Pennington et al., 2014) for the GloVe-based models, and 1024-dimensional RoBERTa-Large features (Liu et al., 2019) for the RoBERTa-based models).

Experimental Settings
We use a batch size of 16, a learning rate of 0.001, and a dropout rate of 0.2 to train the inference models. Cross entropy is used as the optimization objective, and the optimization algorithm is Adam (Kingma and Ba, 2015). The hidden size F is set to 100. All models are trained for 60 epochs, and the model checkpoint that achieves the best results on the development set is used for testing. Other hyper-parameters are optimized using grid search.
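The checkpoint-selection rule described above can be sketched as follows (the `CONFIG` keys and the `select_best` helper are illustrative names, not from the paper's code):

```python
# Hyperparameters as reported above; dict keys are illustrative.
CONFIG = {"batch_size": 16, "lr": 1e-3, "dropout": 0.2,
          "hidden_F": 100, "epochs": 60, "optimizer": "Adam"}

def select_best(dev_scores):
    """Return (epoch, score) of the checkpoint with the best dev-set
    result; this checkpoint is the one used for testing."""
    best = max(range(len(dev_scores)), key=lambda e: dev_scores[e])
    return best, dev_scores[best]

# Toy per-epoch dev scores, for illustration only.
epoch, score = select_best([0.40, 0.52, 0.49, 0.55, 0.53])
```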

Results and Discussion
We compare the performance of our proposed models with the baselines on the three benchmark conversational datasets, and the results are listed in Table 1. As we can see from the results, our sequence-based and graph-based addressee-aware models surpass the baseline methods, which shows that our models can capture more essential information for inferring the addressee's emotion than other models. In addition, the ensemble model achieves significant improvements in most cases, which also proves the effectiveness of the ensemble strategy for further enhancing the performance of emotion inference in multi-turn conversations. The performance of the utterance-level models, CNN and RoBERTa-Large, is worse than that of the models based on conversation history in most cases, which shows that inferring the addressee's emotion relies heavily on evidence from the conversation history. Comparing the GloVe-based models with the RoBERTa-based models, most of the results obtained by the RoBERTa-based models are better than those obtained by the GloVe-based models. This is because the RoBERTa model has been pre-trained on large-scale unstructured text, so the features it extracts are more informative.

In Table 2, we also report the results of ablation studies, in which the addressee-aware module is removed and the same LSTM unit or attention function is used in Equation 3 and Equation 6. The results show that the performance of both the GloVe-based and RoBERTa-based models drops after removing the addressee-aware module, which proves the effectiveness of our addressee-aware module and indicates that it can model the persistence and contagiousness of emotion and learn the emotion shift in multi-turn conversations.

Conclusion
In this paper, we investigate emotion inference in multi-turn conversations, which explores how the conversation history affects the participant's future emotion. To model the characteristics of emotion propagation in conversations, namely persistence and contagiousness, an addressee-aware module is designed for both a sequence-based and a graph-based model. In addition, an ensemble strategy is proposed to further enhance the model performance. Extensive experimental results on three benchmark datasets show that the proposed models achieve new state-of-the-art F1 scores, demonstrating the effectiveness of both the addressee-aware module and the ensemble strategy.