DialogueCRN: Contextual Reasoning Networks for Emotion Recognition in Conversations

Emotion Recognition in Conversations (ERC) has gained increasing attention for developing empathetic machines. Recently, many approaches have been devoted to perceiving conversational context with deep learning models. However, these approaches understand the context insufficiently because they lack the ability to extract and integrate emotional clues. In this work, we propose novel Contextual Reasoning Networks (DialogueCRN) to fully understand the conversational context from a cognitive perspective. Inspired by the Cognitive Theory of Emotion, we design multi-turn reasoning modules to extract and integrate emotional clues. The reasoning module iteratively performs an intuitive retrieving process and a conscious reasoning process, which imitates uniquely human cognitive thinking. Extensive experiments on three public benchmark datasets demonstrate the effectiveness and superiority of the proposed model.


Introduction
Emotion recognition in conversations (ERC) aims to detect the emotion expressed by the speaker in each utterance of a conversation. The task is an important topic for developing empathetic machines in a variety of areas, including social opinion mining (Kumar et al., 2015), intelligent assistants (König et al., 2016), and health care (Pujol et al., 2019).
A conversation often contains contextual clues that trigger the current utterance's emotion, such as the cause or situation. Recent context-based works on ERC (Poria et al., 2017; Hazarika et al., 2018b) have been devoted to perceiving situation-level or speaker-level context with deep learning models. However, these methods are insufficient for understanding the context, which usually contains rich emotional clues. We argue that they mainly suffer from the following challenges. 1) The extraction of emotional clues. Most approaches (Hazarika et al., 2018a,b; Jiao et al., 2020b) retrieve the relevant context from a static memory, which limits the ability to capture richer emotional clues. 2) The integration of emotional clues. Many works (Ghosal et al., 2019; Lu et al., 2020) use the attention mechanism to integrate encoded emotional clues, ignoring their intrinsic semantic order. This loses the logical relationships between clues, making it difficult to capture the key factors that trigger emotions.
The Cognitive Theory of Emotion (Schachter and Singer, 1962; Scherer et al., 2001) suggests that cognitive factors are potent determinants of the formation of emotional states. These cognitive factors can be captured by iteratively performing the intuitive retrieving process and the conscious reasoning process in our brains (Evans, 1984, 2003, 2008; Sloman, 1996). Motivated by these findings, this paper attempts to model both critical processes to reason about emotional clues and sufficiently understand the conversational context. By following the mechanism of working memory (Baddeley, 1992) in the cognitive phase, we can iteratively perform both cognitive processes to guide the extraction and integration of emotional clues, which imitates uniquely human cognitive thinking.
In this work, we propose novel Contextual Reasoning Networks (DialogueCRN) to recognize the utterance's emotion by sufficiently understanding the conversational context. The model introduces a cognitive phase to extract and integrate emotional clues from the context retrieved by the perceptive phase. Firstly, in the perceptive phase, we leverage Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) networks to capture situation-level and speaker-level context. Based on this context, global memories are obtained to store different contextual information. Secondly, in the cognitive phase, we design multi-turn reasoning modules to iteratively extract and integrate the emotional clues. The reasoning module performs two processes, i.e., an intuitive retrieving process and a conscious reasoning process. The former utilizes the attention mechanism to match relevant contextual clues by retrieving static global memories, which imitates the intuitive retrieving process. The latter adopts LSTM networks to learn the intrinsic logical order and integrate contextual clues by retaining and updating a dynamic working memory, which imitates the conscious reasoning process. It is slower but has human-unique rationality (Baddeley, 1992). Finally, according to the above contextual clues at the situation level and speaker level, an emotion classifier is used to predict the emotion label of the utterance.
To evaluate the performance of the proposed model, we conduct extensive experiments on three public benchmark datasets, i.e., the IEMOCAP, SEMAINE, and MELD datasets. Results consistently demonstrate that our proposed model significantly outperforms comparison methods. Moreover, understanding emotional clues from a cognitive perspective can boost the performance of emotion recognition.
The main contributions of this work are summarized as follows:
• We propose novel Contextual Reasoning Networks (DialogueCRN) to fully understand the conversational context from a cognitive perspective. To the best of our knowledge, this is the first attempt to explore cognitive factors for emotion recognition in conversations.
• We design multi-turn reasoning modules to extract and integrate emotional clues by iteratively performing the intuitive retrieving process and the conscious reasoning process, which imitates uniquely human cognitive thinking.
• We conduct extensive experiments on three public benchmark datasets. The results consistently demonstrate the effectiveness and superiority of the proposed model.

Problem Statement
Formally, let $U = [u_1, u_2, \ldots, u_N]$ be a conversation, where $N$ is the number of utterances. There are $M$ speakers/parties $p_1, p_2, \ldots, p_M$ ($M \geq 2$). Each utterance $u_i$ is spoken by the speaker $p_{\phi(u_i)}$, where $\phi$ maps the index of the utterance to that of the corresponding speaker. Moreover, for each $\lambda \in [1, M]$, we define $U_\lambda$ to represent the set of utterances spoken by the speaker $p_\lambda$, i.e.,
$$U_\lambda = \{\, u_i \mid u_i \in U \text{ and } \phi(u_i) = \lambda,\ \forall i \in [1, N] \,\}.$$
The task of emotion recognition in conversations (ERC) aims to predict the emotion label $y_i$ for each utterance $u_i$ from the pre-defined emotions $\mathcal{Y}$.
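As a small illustration of the notation, the following sketch groups utterance indices by speaker to form the per-speaker sets. The function name `group_by_speaker` and the 0-based indexing are assumptions made for this example only.

```python
from collections import defaultdict

def group_by_speaker(phi):
    """Return {speaker index: [utterance indices]} for one conversation.

    phi[i] is the speaker index of utterance i (the mapping called phi in
    the text). Indices are 0-based here purely for convenience.
    """
    groups = defaultdict(list)
    for i, speaker in enumerate(phi):
        groups[speaker].append(i)
    return dict(groups)

# Toy conversation with N = 5 utterances and M = 2 speakers.
phi = [0, 1, 0, 1, 1]
U_by_speaker = group_by_speaker(phi)
print(U_by_speaker)
```

Each list keeps the utterances in conversation order, which is what a speaker-level sequence model would consume.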

Textual Features
Convolutional neural networks (CNNs) (Kim, 2014) are capable of capturing n-gram information from an utterance. Following previous works (Hazarika et al., 2018b; Ghosal et al., 2019), we leverage a CNN layer with max-pooling to extract context-free textual features from the transcript of each utterance. Concretely, the input is the 300-dimensional pre-trained 840B GloVe vectors (Pennington et al., 2014). We employ three filters of size 3, 4, and 5 with 50 feature maps each. These feature maps are further processed by max-pooling and ReLU activation (Nair and Hinton, 2010). Then, these activation features are concatenated and finally projected onto a dense layer with dimension $d_u = 100$, whose output forms the representation of an utterance. We denote $\{u_i\}_{i=1}^{N}$, $u_i \in \mathbb{R}^{d_u}$, as the representation for $N$ utterances.
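The convolution, pooling, and concatenation steps can be sketched in plain Python as below. This is a toy illustration under stated assumptions: the weights are random rather than trained, the sizes are tiny, the final dense projection is omitted, and the names `conv_max_relu` and `utterance_features` are hypothetical.

```python
import random

def conv_max_relu(tokens, filt):
    """Slide one filter over the token sequence, then max-pool and apply ReLU.

    tokens: list of T embedding vectors (each a list of floats).
    filt:   list of k weight vectors, one per position in the window.
    """
    k = len(filt)
    scores = []
    for start in range(len(tokens) - k + 1):
        window = tokens[start:start + k]
        scores.append(sum(w * x
                          for w_vec, x_vec in zip(filt, window)
                          for w, x in zip(w_vec, x_vec)))
    return max(0.0, max(scores))  # max over time, then ReLU

def utterance_features(tokens, filters_by_width):
    """Concatenate the pooled activations of every filter, ordered by width."""
    return [conv_max_relu(tokens, filt)
            for width in sorted(filters_by_width)
            for filt in filters_by_width[width]]

rng = random.Random(0)
# Toy sizes (assumed); the text uses 300-d embeddings and 50 maps per width.
emb_dim, n_tokens, n_maps = 4, 7, 2
tokens = [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(n_tokens)]
filters_by_width = {
    k: [[[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(k)]
        for _ in range(n_maps)]
    for k in (3, 4, 5)
}
features = utterance_features(tokens, filters_by_width)
print(len(features))  # 3 widths x 2 maps
```

With the sizes stated in the text, the concatenated vector would have 150 entries before the dense layer maps it to $d_u = 100$.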

Model
We now present Contextual Reasoning Networks (DialogueCRN) for emotion recognition in conversations. DialogueCRN comprises three integral components, i.e., the perception phase (Section 2.3.1), the cognition phase (Section 2.3.2), and an emotion classifier (Section 2.3.3). The overall architecture is illustrated in Figure 1.

Perception Phase
In the perceptive phase, based on the input textual features, we first generate the representation of the conversational context at the situation level and speaker level. Then, global memories are obtained to store different contextual information.

Conversational Context Representation. Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) introduces the gating mechanism into recurrent neural networks to capture long-term dependencies from the input sequences. In this part, two bi-directional LSTM networks are leveraged to capture situation-level and speaker-level context dependencies, respectively. For learning the context representation at the situation level, we apply a bi-directional LSTM network to capture sequential dependencies between adjacent utterances in a conversational situation. The input is each utterance's textual features $u_i \in \mathbb{R}^{d_u}$. The situation-level context representation $c^s_i \in \mathbb{R}^{2d_u}$ can be computed as:
$$c^s_i, h^s_i = \overleftrightarrow{\mathrm{LSTM}}^s(u_i, h^s_{i-1}), \tag{1}$$
where $h^s_i \in \mathbb{R}^{d_u}$ is the $i$-th hidden state of the situation-level LSTM.
For learning the context representation at the speaker level, we employ another bi-directional LSTM network to capture self-dependencies between adjacent utterances of the same speaker. Given the textual features $u_i$ of each utterance, the speaker-level context representation $c^v_i \in \mathbb{R}^{2d_u}$ is computed as:
$$c^v_i, h^v_{\lambda,j} = \overleftrightarrow{\mathrm{LSTM}}^v(u_i, h^v_{\lambda,j-1}), \quad j \in [1, |U_\lambda|], \tag{2}$$
where $\lambda = \phi(u_i)$, $U_\lambda$ refers to all utterances of the speaker $p_\lambda$, and $h^v_{\lambda,j} \in \mathbb{R}^{d_u}$ is the $j$-th hidden state of the speaker-level LSTM for the speaker $p_\lambda$.
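The speaker-level pass differs from the situation-level one mainly in how the sequence is formed: each speaker's utterances are encoded as their own sequence and the outputs are placed back in conversation order. The sketch below shows only that regrouping. The encoder is a deliberately simple stand-in (prefix means), not a bi-directional LSTM, and all names are hypothetical.

```python
def running_mean_encoder(seq):
    """Hypothetical stand-in for a sequence encoder: prefix means of the inputs."""
    out, total = [], None
    for n, x in enumerate(seq, start=1):
        total = list(x) if total is None else [a + b for a, b in zip(total, x)]
        out.append([a / n for a in total])
    return out

def speaker_level_context(utterances, phi, encoder):
    """Encode each speaker's utterances as one sequence, then restore the order."""
    context = [None] * len(utterances)
    for speaker in set(phi):
        idx = [i for i, s in enumerate(phi) if s == speaker]
        encoded = encoder([utterances[i] for i in idx])
        for i, c in zip(idx, encoded):
            context[i] = c
    return context

utterances = [[1.0], [10.0], [3.0], [20.0]]  # toy 1-d utterance features
phi = [0, 1, 0, 1]                           # speaker of each utterance
print(speaker_level_context(utterances, phi, running_mean_encoder))
```

Replacing `running_mean_encoder` with a trained recurrent encoder would leave the gather and scatter logic unchanged.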
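Global Memory Representation. Based on the above conversational context representation, global memories are obtained via a linear layer to store different contextual information. That is, the global memory representation of the situation-level context $G^s = [g^s_1, g^s_2, \ldots, g^s_N]$ and that of the speaker-level context $G^v = [g^v_1, g^v_2, \ldots, g^v_N]$ can be computed as:
$$g^s_i = W^s_g c^s_i + b^s_g, \tag{3}$$
$$g^v_i = W^v_g c^v_i + b^v_g, \tag{4}$$
where $W^s_g, W^v_g \in \mathbb{R}^{2d_u \times 2d_u}$ and $b^s_g, b^v_g \in \mathbb{R}^{2d_u}$ are learnable parameters.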

Cognition Phase
According to the Cognitive Theory of Emotion (Schachter and Singer, 1962; Scherer et al., 2001), cognitive factors are potent determinants of the formation of emotional states. Therefore, in the cognitive phase, we design multi-turn reasoning modules to iteratively extract and integrate the emotional clues. The architecture of a reasoning module is depicted in Figure 2.
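The reasoning module performs two processes: the intuitive retrieving process and the conscious reasoning process. In the $t$-th turn, for the reasoning process, we adopt an LSTM network to learn the intrinsic logical order and integrate contextual clues in the working memory, which is slower but has human-unique rationality (Baddeley, 1992). That is,
$$\tilde{q}^{(t-1)}_i, h^{(t)}_i = \overrightarrow{\mathrm{LSTM}}(q^{(t-1)}_i, h^{(t-1)}_i), \tag{5}$$
where $\tilde{q}^{(t-1)}_i \in \mathbb{R}^{2d_u}$ is the output vector. The query $q^{(t)}_i \in \mathbb{R}^{4d_u}$ is initialized from the context representation $c_i$ of the current utterance, i.e., $q^{(0)}_i = W_q c_i + b_q$, where $W_q \in \mathbb{R}^{4d_u \times 2d_u}$ and $b_q \in \mathbb{R}^{4d_u}$ are learnable parameters. $h^{(t)}_i \in \mathbb{R}^{2d_u}$ refers to the working memory, which can not only store and update the previous memory $h^{(t-1)}_i$, but also guide the extraction of clues in the next turn. As the working memory flows sequentially, we can learn the implicit logical order among clues, which resembles the conscious thinking process of humans. $h^{(t)}_i$ is initialized with zero. $t$ is the index that indicates how many "processing steps" are carried out to compute the final state.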
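For the retrieving process, we utilize an attention mechanism to match relevant contextual clues from the global memory. The detailed calculations are as follows:
$$e^{(t-1)}_{ij} = f(g_j, \tilde{q}^{(t-1)}_i), \tag{6}$$
$$\alpha^{(t-1)}_{ij} = \frac{\exp(e^{(t-1)}_{ij})}{\sum_{j=1}^{N} \exp(e^{(t-1)}_{ij})}, \tag{7}$$
$$r^{(t-1)}_i = \sum_{j=1}^{N} \alpha^{(t-1)}_{ij} g_j, \tag{8}$$
where $f$ is a function that computes a single scalar from $g_j$ and $\tilde{q}^{(t-1)}_i$ (e.g., a dot product). Then, we concatenate the output of the reasoning process $\tilde{q}^{(t-1)}_i$ with the resulting attention readout $r^{(t-1)}_i$ to form the next-turn query $q^{(t)}_i$. That is,
$$q^{(t)}_i = [\tilde{q}^{(t-1)}_i; r^{(t-1)}_i]. \tag{9}$$
The query $q^{(t)}_i$ is updated under the guidance of the working memory $h^{(t)}_i$, so that more contextual clues can be retrieved from the global memory.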
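A single retrieving step can be sketched as follows, assuming the scoring function $f$ is a dot product (the option the text names as an example). The function name `retrieve` and the toy values are assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def retrieve(memory, q_tilde):
    """One retrieving step over the global memory.

    memory:  list of N memory vectors g_j.
    q_tilde: output vector of the reasoning step, same width as each g_j.
    Returns the attention weights and the next-turn query.
    """
    # Score every memory slot against the reasoning output (dot product).
    scores = [sum(g * q for g, q in zip(g_j, q_tilde)) for g_j in memory]
    # Normalize the scores into attention weights.
    alpha = softmax(scores)
    # Attention readout: weighted sum of the memory vectors.
    readout = [sum(a * g_j[d] for a, g_j in zip(alpha, memory))
               for d in range(len(q_tilde))]
    # The next query concatenates the reasoning output and the readout.
    return alpha, q_tilde + readout

memory = [[1.0, 0.0], [0.0, 1.0]]
alpha, next_query = retrieve(memory, [0.0, 0.0])
print(alpha, next_query)
```

Because the memory itself is never modified, repeated calls with different reasoning outputs read different clues from the same static store.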
To sum up, given the context representation $c_i$ of the utterance $u_i$, the global memory representation $G$, and the number of turns $T$, the whole cognitive phase (Eq. 5-9) can be denoted as $q_i = \mathrm{Cognition}(c_i, G; T)$. In this work, we design two individual cognition phases to explore contextual clues at the situation level and speaker level, respectively. The outputs are defined as:
$$q^s_i = \mathrm{Cognition}^s(c^s_i, G^s; T^s), \tag{10}$$
$$q^v_i = \mathrm{Cognition}^v(c^v_i, G^v; T^v), \tag{11}$$
where $T^s$ and $T^v$ are the numbers of turns in the situation-level and speaker-level cognitive phases, respectively. Based on the above output vectors, the final representation $o_i$ is defined as the concatenation of both vectors, i.e., $o_i = [q^s_i; q^v_i]$.
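The full multi-turn loop can be sketched in plain Python as below. This is a minimal illustration, not the authors' implementation: the LSTM cell is bias-free with random untrained weights, the initial query is assumed to come from a linear map `W_q` applied to the context vector, the scoring function is a dot product, and the toy call reuses a memory vector as the context vector.

```python
import math
import random

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class LSTMCell:
    """Bias-free LSTM cell with randomly initialized (untrained) weights."""

    def __init__(self, in_dim, hid_dim, rng):
        self.W = {gate: [[rng.uniform(-0.5, 0.5) for _ in range(in_dim + hid_dim)]
                         for _ in range(hid_dim)]
                  for gate in "ifog"}

    def step(self, x, h, c):
        z = x + h  # concatenate input and previous hidden state
        i = [sigmoid(a) for a in matvec(self.W["i"], z)]
        f = [sigmoid(a) for a in matvec(self.W["f"], z)]
        o = [sigmoid(a) for a in matvec(self.W["o"], z)]
        g = [math.tanh(a) for a in matvec(self.W["g"], z)]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new

def cognition(c_i, memory, turns, cell, W_q):
    """Run `turns` rounds of reasoning (LSTM) followed by retrieving (attention)."""
    dim = len(c_i)
    q = matvec(W_q, c_i)                 # initial query (assumed linear map)
    h, c = [0.0] * dim, [0.0] * dim      # working memory starts at zero
    for _ in range(turns):
        h, c = cell.step(q, h, c)        # conscious reasoning step
        q_tilde = h                      # single-layer LSTM: output is the hidden state
        alpha = softmax([sum(a * b for a, b in zip(g_j, q_tilde)) for g_j in memory])
        readout = [sum(w * g_j[d] for w, g_j in zip(alpha, memory)) for d in range(dim)]
        q = q_tilde + readout            # next-turn query
    return q

rng = random.Random(0)
dim, n_utts = 3, 5                       # toy sizes (assumed)
memory = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_utts)]
W_q = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(2 * dim)]
cell = LSTMCell(in_dim=2 * dim, hid_dim=dim, rng=rng)
q_out = cognition(memory[-1], memory, turns=2, cell=cell, W_q=W_q)
print(len(q_out))  # twice the memory width
```

Running one such loop on the situation-level inputs and another on the speaker-level inputs, then concatenating the two outputs, gives a vector four times the memory width, which matches the $8d_u$ input size of the classifier.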

Emotion Classifier
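Finally, according to the above contextual clues, an emotion classifier is used to predict the emotion label of the utterance:
$$\hat{y}_i = \mathrm{softmax}(W_o^{\top} o_i + b_o), \tag{12}$$
where $W_o \in \mathbb{R}^{8d_u \times |\mathcal{Y}|}$ and $b_o \in \mathbb{R}^{|\mathcal{Y}|}$ are trainable parameters, and $|\mathcal{Y}|$ is the number of emotion labels. Cross-entropy loss is used to train the model. The loss function is defined as:
$$\mathcal{L} = -\frac{1}{\sum_{l=1}^{L} \tau(l)} \sum_{l=1}^{L} \sum_{i=1}^{\tau(l)} \sum_{k=1}^{|\mathcal{Y}|} y^{l}_{i,k} \log \hat{y}^{l}_{i,k}, \tag{13}$$
where $L$ is the total number of conversations/samples in the training set, $\tau(l)$ is the number of utterances in sample $l$, and $y^{l}_{i,k}$ and $\hat{y}^{l}_{i,k}$ denote the one-hot vector and the probability vector for emotion class $k$ of utterance $i$ of sample $l$, respectively.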
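A minimal sketch of the classifier and loss follows. It assumes the loss is averaged over all utterances in all conversations, and the function names `classify` and `cross_entropy` are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(o, W, b):
    """Class probabilities for one utterance.

    o: final representation (length D).
    W: D x K weight matrix as a list of rows; b: length-K bias.
    """
    logits = [sum(o[d] * W[d][k] for d in range(len(o))) + b[k]
              for k in range(len(b))]
    return softmax(logits)

def cross_entropy(conversations):
    """Mean negative log-likelihood of the gold labels.

    conversations: list of samples; each sample is a list of
    (probabilities, gold class index) pairs, one per utterance.
    The averaging over utterances is an assumption of this sketch.
    """
    total, count = 0.0, 0
    for sample in conversations:
        for probs, gold in sample:
            total -= math.log(probs[gold])
            count += 1
    return total / count

W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # toy D = 2, K = 3 with zero weights
b = [0.0, 0.0, 0.0]
probs = classify([0.3, -0.7], W, b)       # uniform, since all logits are zero
loss = cross_entropy([[(probs, 0), (probs, 2)], [(probs, 1)]])
print(probs, loss)
```

With uniform predictions the loss equals the logarithm of the number of classes, a convenient sanity check for an untrained classifier.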

Datasets
We evaluate our proposed model on the following benchmark datasets: IEMOCAP (Busso et al., 2008), SEMAINE (McKeown et al., 2012), and MELD. The statistics are reported in Table 1. These are multimodal datasets with textual, visual, and acoustic features. In this paper, we focus on emotion recognition in textual conversations. Multimodal emotion recognition in conversations is left as future work.
IEMOCAP: The dataset (Busso et al., 2008) contains videos of two-way conversations of ten unique speakers, where only the first eight speakers from sessions one to four belong to the training set. The utterances are annotated with one of six emotion labels, namely happy, sad, neutral, angry, excited, and frustrated. Following previous works (Hazarika et al., 2018a; Ghosal et al., 2019; Jiao et al., 2020b), the validation set is extracted from the randomly shuffled training set with a ratio of 80:20, since no pre-defined train/val split is provided in the IEMOCAP dataset.
SEMAINE: [...] (Nicolle et al., 2012). Following Hazarika et al. (2018a) and Ghosal et al. (2019), the attributes are averaged over the span of an utterance to obtain utterance-level annotations. We use the standard training and testing splits provided in the sub-challenge.
MELD: The Multimodal Emotion Lines Dataset (MELD), an extension of EmotionLines (Hsu et al., 2018), is collected from the TV series Friends and contains more than 1400 multiparty conversations and 13000 utterances. Each utterance is annotated with one of seven emotion labels (i.e., happy/joy, anger, fear, disgust, sadness, surprise, and neutral). We use the pre-defined train/val split provided in the MELD dataset.
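The 80:20 validation split described for IEMOCAP can be sketched as below. Splitting at the level of whole conversations, the fixed seed, and the function name `split_train_val` are assumptions of this example, not details stated in the text.

```python
import random

def split_train_val(conversation_ids, val_ratio=0.2, seed=0):
    """Shuffle the training conversations and hold out a validation share."""
    ids = list(conversation_ids)
    random.Random(seed).shuffle(ids)          # seeded for reproducibility
    n_val = int(round(len(ids) * val_ratio))
    return ids[n_val:], ids[:n_val]           # (train, validation)

train_ids, val_ids = split_train_val(range(10))
print(len(train_ids), len(val_ids))
```

Keeping every utterance of a conversation on the same side of the split avoids leaking context between the training and validation sets.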

Comparison Methods
We compare the proposed model against the following baseline methods. TextCNN (Kim, 2014) is a convolutional neural network trained on context-independent utterances. Memnet (Sukhbaatar et al., 2015) is an end-to-end memory network that updates memories in a multi-hop fashion. bc-LSTM+Att (Poria et al., 2017) adopts a bidirectional LSTM network to capture the contextual content from the surrounding utterances. Additionally, an attention mechanism is adopted to re-weight features and provide a more informative output. CMN (Hazarika et al., 2018b) encodes conversational context from dialogue history with two distinct GRUs for two speakers. ICON (Hazarika et al., 2018a) extends CMN by connecting the outputs of individual speaker GRUs using another GRU for inter-speaker modeling. DialogueRNN is a recurrent network that consists of two GRUs to track speaker states and context during the conversation. DialogueGCN (Ghosal et al., 2019) is a graph-based model where nodes represent utterances and edges represent the dependency between the speakers of the utterances.

Evaluation Metrics
Following previous works (Hazarika et al., 2018a; Jiao et al., 2020b), for the IEMOCAP and MELD datasets, we choose the accuracy score (Acc.) to measure the overall performance. We also report the weighted-average F1 score (Weighted-F1) and the macro-averaged F1 score (Macro-F1) to evaluate the model performance on majority and minority classes, respectively. For the SEMAINE dataset, we report the Mean Absolute Error (MAE) for each attribute. The lower the MAE, the better the detection performance.
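The four metrics can be computed as in the sketch below. One simplification is assumed: the set of classes is taken from the gold labels only. The toy labels and values are made up for illustration.

```python
def f1_per_class(gold, pred, labels):
    """Per-class F1 from true positives, false positives, and false negatives."""
    scores = {}
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    labels = sorted(set(gold))
    f1 = f1_per_class(gold, pred, labels)
    return sum(f1.values()) / len(labels)

def weighted_f1(gold, pred):
    """Per-class F1 weighted by class frequency in the gold labels."""
    labels = sorted(set(gold))
    f1 = f1_per_class(gold, pred, labels)
    return sum(f1[c] * gold.count(c) for c in labels) / len(gold)

def mean_absolute_error(gold, pred):
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)

gold = ["neutral", "neutral", "neutral", "angry"]
pred = ["neutral", "neutral", "angry", "angry"]
print(accuracy(gold, pred), weighted_f1(gold, pred), macro_f1(gold, pred))
print(mean_absolute_error([0.1, 0.5], [0.2, 0.3]))
```

In this toy case the rare class pulls Macro-F1 below Weighted-F1, which is why the two are reported together.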

Implementation Details
We use the validation set to tune hyperparameters. In the perceptive phase, we employ a two-layer bi-directional LSTM on the IEMOCAP and SEMAINE datasets and a single-layer bi-directional LSTM on the MELD dataset. In the cognitive phase, a single-layer LSTM is used on all datasets. The batch size is set to 32. We adopt Adam (Kingma and Ba, 2015) as the optimizer with an initial learning rate of {0.0001, 0.001, 0.001} and L2 weight decay of {0.0002, 0.0005, 0.0005} for the IEMOCAP, SEMAINE, and MELD datasets, respectively. The dropout rate is set to 0.2. We train all models for a max- [...]
[...] emotional clues by exploring cognitive factors. Accordingly, our model obtains more effective performance. That is, as shown in Tables 2 and 3, for the IEMOCAP dataset, DialogueCRN gains 3.2%, 4.0%, and 4.7% relative improvements over the previous best baselines in terms of Acc., Weighted-F1, and Macro-F1, respectively. For the SEMAINE dataset, DialogueCRN achieves a large margin of 11.1% in MAE for the Arousal attribute.
As shown in Table 1, the number of speakers in each conversation of the MELD dataset is large (up to 9), and the average length of conversations is 10. The shorter conversation length of the MELD dataset indicates that it contains less contextual information. Interestingly, the results in Table 4 show that TextCNN, which ignores conversational context, achieves better results than most baselines. This indicates that it is difficult to learn useful features from a limited and incomplete context. Besides, DialogueGCN leverages graph structure to perceive the interaction of multiple speakers, which is sufficient to perceive the speaker-level context. Thereby, its performance is slightly improved. Compared with the baselines, DialogueCRN can perform sequential thinking over the context and understand emotional clues from a cognitive perspective. Therefore, it achieves the best recognition results, e.g., 2.9% improvements on Weighted-F1.

Ablation Study
To better understand the contribution of different modules in DialogueCRN to the performance, we conduct several ablation studies on both the IEMOCAP and SEMAINE datasets. Different modules that model the situation-level and speaker-level context in both the perceptive and cognitive phases are removed separately. The results are shown in Table 5. When the cognition and perception modules are removed successively, the performance declines greatly. This indicates the importance of both the perception and cognition phases for ERC.
Effect of Cognitive Phase. When only the cognition phase is removed, as shown in the third block of Table 5, the performance on the IEMOCAP dataset decreases by 4.3%, 4.3%, and 6.5% in terms of Acc., Weighted-F1, and Macro-F1, respectively. On the SEMAINE dataset, the MAE scores of the Valence, Arousal, and Expectancy attributes increase by 2.3%, 12.5%, and 2.9%, respectively. These results indicate the efficacy of the cognitive phase, which can reason consciously and sequentially over the perceived contextual information. Besides, if the cognitive phase is removed for either the speaker-level or the situation-level context, as shown in the second block, the results degrade on both datasets. This reflects that both situational factors and speaker factors are critical in the cognitive phase.
Effect of Perceptive Phase. As shown in the last row, when the perception module is removed, the performance drops sharply. The inferior results reveal the necessity of the perceptive phase to unconsciously match relevant context based on the current utterance.
Effect of Different Context. When either the situation-level or the speaker-level context is removed from both the cognitive and perceptive phases, the performance declines to a certain degree. This shows that both situation-level and speaker-level context play an effective role in the perceptive and cognitive phases. Besides, the margin of the performance drop differs between the two datasets. This suggests that speaker-level context plays a greater role in the perception phase, while the more complex situation-level context works well in the cognitive phase. The explanation is that intuitive matching in the perception phase can learn only limited informative features from the context, whereas conscious cognitive reasoning enables better understanding.
Figure 3: Results against the number of turns. We report the Weighted-F1 score on the IEMOCAP dataset and the MAE of the Arousal attribute on the SEMAINE dataset. The lighter the color, the better the performance.

Parameter Analysis
We investigate how our model performs w.r.t. the number of turns in the cognitive phase. From Figure 3, the best $\{T_s, T_v\}$ is {2, 2} and {1, 3} on the IEMOCAP and SEMAINE datasets, which obtain 66.20% Weighted-F1 and 0.1522 MAE on the Arousal attribute, respectively. Note that the SEMAINE dataset needs more turns for the speaker-level cognitive phase. This implies that speaker-level contextual clues may be more vital in arousal emotion, especially empathetic clues that require complex reasoning.
[Figure 4: A conversation sampled from the IEMOCAP dataset in which a speaker at an airport disputes a fifty-dollar reimbursement for a bag; utterances are labeled NEUTRAL, ANGRY, or FRUSTRATED.]
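The search over the number of turns described above can be sketched as a small grid search. The grid bounds, the function name `best_turns`, and the toy scoring function are all assumptions for illustration; the scores are made up and are not results from the paper.

```python
from itertools import product

def best_turns(evaluate, max_turns=3, higher_is_better=True):
    """Return the (T_s, T_v) pair with the best validation score, and that score.

    evaluate(t_s, t_v) is a hypothetical callback that trains and validates a
    model with the given numbers of situation-level and speaker-level turns.
    """
    grid = product(range(max_turns + 1), repeat=2)
    pick = max if higher_is_better else min
    return pick(((pair, evaluate(*pair)) for pair in grid),
                key=lambda item: item[1])

def toy_score(t_s, t_v):
    """Made-up score surface peaking at (2, 2); not a result from the paper."""
    return -((t_s - 2) ** 2 + (t_v - 2) ** 2)

print(best_turns(toy_score))
```

For an error metric such as MAE, passing `higher_is_better=False` selects the pair with the lowest score instead.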
Besides, if we solely consider either the situation-level or the speaker-level context in the cognitive phase, results on the two datasets are significantly improved within a certain number of turns. This indicates the effectiveness of using multi-turn reasoning modules to understand contextual clues.

Case Study
Figure 4 shows a conversation sampled from the IEMOCAP dataset. The goal is to predict the emotion label of utterance 8. Methods such as DialogueRNN and DialogueGCN lack the ability to consciously understand emotional clues, e.g., the cause of the emotion (failed expectation). They easily misidentify the emotion as angry or neutral.
Our model DialogueCRN can understand the conversational context from a cognitive perspective. In the cognitive phase, the following two processes are performed iteratively to extract and integrate emotional clues: the intuitive retrieving process of 8-7-2-1 (blue arrows) and the conscious reasoning process of a-b-c (red arrows). We can infer that utterance 8 implies that the greater compensation expected by the female speaker was not obtained. The failed compensation makes the speaker's emotion more negative, and the utterance is thus correctly identified as depression.

Emotion Recognition
Emotion recognition (ER) has been drawing increasing attention in natural language processing (NLP) and artificial intelligence (AI). Existing works generally regard the ER task as a classification task based on context-free blocks of data, such as individual reviews or documents. They can be roughly divided into two categories, i.e., feature-engineering based methods (Devillers and Vidrascu, 2006) and deep-learning based methods (Tang et al., 2016; Wei et al., 2020).

Emotion Recognition in Conversations
Recently, the task of Emotion Recognition in Conversations (ERC) has received attention from researchers. Different from traditional emotion recognition, both situation-level and speaker-level context play a significant role in identifying the emotion of an utterance in conversations. Neglecting them leads to quite limited performance (Bertero et al., 2016). Existing works generally capture contextual characteristics for the ERC task with deep learning methods, which can be divided into sequence-based and graph-based methods.
Sequence-based Methods. Many works capture contextual information in utterance sequences. Poria et al. (2017) employed LSTM (Hochreiter and Schmidhuber, 1997) to capture conversational context features. Hazarika et al. (2018a,b) used end-to-end memory networks (Sukhbaatar et al., 2015) to capture contextual features that distinguish different speakers. Zhong et al. (2019) and Li et al. (2020) utilized the transformer (Vaswani et al., 2017) to capture richer contextual features based on the attention mechanism. Another work introduced a speaker state and a global state for each conversation based on GRUs (Cho et al., 2014). Moreover, Jiao et al. (2020a) introduced a conversation completion task to learn from unsupervised conversation data. Jiao et al. (2020b) proposed a hierarchical memory network for real-time emotion recognition without future context. Another work modeled ERC as sequence tagging to learn the emotional consistency. Lu et al. (2020) proposed an iterative emotion interaction network to explicitly model the emotion interaction.
Graph-based Methods. Some works (Zhang et al., 2019;Ghosal et al., 2019;Ishiwatari et al., 2020;Lian et al., 2020) model the conversational context by designing a specific graphical structure. They utilize graph neural networks (Kipf and Welling, 2017;Velickovic et al., 2017) to capture multiple dependencies in the conversation, which have achieved appreciable performance.
Different from previous works, inspired by the Cognitive Theory of Emotion (Schachter and Singer, 1962;Scherer et al., 2001), this paper makes the first attempt to explore cognitive factors for emotion recognition in conversations. To sufficiently understand the conversational context, we propose a novel DialogueCRN to extract and then integrate rich emotional clues in a cognitive manner.

Conclusion
This paper has investigated cognitive factors for the task of emotion recognition in conversations (ERC). We propose novel contextual reasoning networks (DialogueCRN) to sufficiently understand both situation-level and speaker-level context. DialogueCRN introduces the cognitive phase to extract and integrate emotional clues from the context retrieved by the perceptive phase. In the cognitive phase, we design multi-turn reasoning modules to iteratively perform the intuitive retrieving process and the conscious reasoning process, which imitates uniquely human cognitive thinking. Finally, emotional clues that trigger the current emotion are successfully obtained and used for better classification. Experiments on three benchmark datasets demonstrate the effectiveness and superiority of the proposed model. The case study shows that considering cognitive factors leads to a better understanding of emotional clues and boosts the performance of ERC.