Past, Present, and Future: Conversational Emotion Recognition through Structural Modeling of Psychological Knowledge

Conversational Emotion Recognition (CER) is the task of predicting the emotion of an utterance in the context of a conversation. Although modeling the conversational context and the interactions between speakers has been studied broadly, it is important to consider the speaker's psychological state, which controls the speaker's action and intention. The state-of-the-art method introduces CommonSense Knowledge (CSK) to model psychological states in a sequential way (forwards and backwards). However, it ignores the structural psychological interactions between utterances. In this paper, we propose a pSychological-Knowledge-Aware Interaction Graph (SKAIG). In this locally connected graph, the targeted utterance is enhanced with the information of action inferred from the past context and of intention implied by the future context. The utterance is also self-connected to consider the present effect from itself. Furthermore, we utilize CSK to enrich edges with knowledge representations and process the SKAIG with a graph transformer. Our method achieves state-of-the-art or competitive performance on four popular CER datasets.


Introduction
As one of the most ubiquitous ways of communicating, conversations contain rich information and emotional expressions of the participants. With the explosive growth of conversational data on the Internet, it is of great importance to employ machines to automatically identify the emotions expressed by speakers in conversations. Therefore, in recent years, Conversational Emotion Recognition (CER) has received increasing attention from researchers (Poria et al., 2017; Jiao et al., 2019; Shen et al., 2021).
Unlike traditional emotion recognition, CER needs to model not only the semantic information of an utterance, but also the conversational contextual information between utterances (Jiao et al., 2019, 2020; Shen et al., 2021). Additionally, the speaker information attached to each utterance is thought to facilitate modeling the conversational context. Different speaker modeling schemes and corresponding solutions have been proposed to enhance the interactions between utterances (Li et al., 2020b; Ghosal et al., 2019; Li et al., 2020a).

Figure 1: The utterance #1 provides the action of speaker B for #2, and #4 provides the intention. Both give positive and rational hints for #2 to predict the positive emotion excited. The descriptions of action and intention are generated by COMET (Bosselut et al., 2019).

* Zheng Lin is the corresponding author.
Although these works yield significant performance, their modeling of conversational context and speakers does not consider the psychological states of speakers. The psychological state controls the speaker's action and intention along the conversation, which can help predict the emotion more reasonably. As the original conversation provides no extra information about psychological states, CommonSense Knowledge (CSK) can be introduced to guide a model toward them. From the perspective of ATOMIC, a widely used socialized CSK base (Hwang et al., 2021), action means what the speaker wants to do in the next step, which can be triggered by the speaker him/herself or by other speakers. Intention means what the speaker wanted to do before this step, which can only be inferred by the speaker him/herself. Therefore, for a targeted utterance, the action can be inferred from its past context and the intention from its future context. As illustrated in Fig. 1, the targeted utterance #2 can be positively enhanced by the action inferred from #1 of speaker A and the intention from #4. COSMIC (Ghosal et al., 2020) introduces this kind of CSK into CER to model the speaker's psychological state, and then utilizes bidirectional GRUs to model these states at every time step. However, COSMIC ignores the structural psychological influences from contextual utterances on the targeted utterance (i.e., an utterance can directly and explicitly pass psychological messages to other utterances over several time steps, which is more than the sequential and implicit modeling of psychological states over utterances). In addition, modeling all psychological states both forwards and backwards does not conform with the nature of the CSK mentioned above (e.g., intention cannot be inferred forwards and should only be inferred backwards, as illustrated in Fig. 1).
To alleviate these issues, we propose a pSychological-Knowledge-Aware Interaction Graph (SKAIG). Utterances, which are locally connected, act as the nodes in the graph. Four relations are considered in SKAIG: xWant, oWant, xIntent, and xEffect. For a targeted utterance, xWant and oWant model the action indicated by utterances in the past context with the same speaker (x) and other speakers (o), respectively. Conversely, xIntent models the intention inferred from utterances in the future context, and xEffect is the self-connected relation that models the influence of the present utterance itself. By taking the three sources of past, present, and future into consideration, we believe the graph can enhance context modeling more structurally and rationally. These relations are assigned to edges between utterances accordingly, so edges take the role of modeling the psychological interactions between utterances. To realize this, we enrich edges with their corresponding knowledge representations, which are produced by the commonsense transformer COMET (Bosselut et al., 2019) taking utterances and relations as inputs. As edges in SKAIG possess knowledge representations that need to be considered, we utilize the graph transformer (Shi et al., 2021) for message passing and use the final outputs for classification.
To evaluate our method, we conduct experiments on four datasets: IEMOCAP, DailyDialog, EmoryNLP, and MELD. Our method achieves state-of-the-art performance on the first three datasets and competitive performance on MELD. Further experiments also demonstrate the efficacy of our proposed method.

Methodology
In this section, we first formalize the CER task, and then elaborate on our proposed model. The framework of the model (illustrated in Fig. 2) consists of three parts: Utterance-level Encoder, Conversation-level Encoder, and Emotion Classifier.

Task Definition
In the following, a conversation containing N textual utterances is denoted as C = [u_1, u_2, ..., u_N]. An utterance u_n = [w_1, w_2, ..., w_{L_n}] contains L_n words. In addition, a conversation involves at least two speakers, and each utterance is expressed by its corresponding speaker s ∈ (S_1, S_2, ..., S_P). The CER task therefore aims to classify all utterances in a conversation into their correct emotion labels, which belong to the set (E_1, E_2, ..., E_M).
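The notation above can be mirrored in a minimal data structure; the class and field names below are our own illustration, not the paper's:

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    words: list[str]   # [w_1, ..., w_{L_n}]
    speaker: str       # one of S_1, ..., S_P
    emotion: int       # index into the label set E_1, ..., E_M

# A conversation C = [u_1, ..., u_N]
conversation = [
    Utterance(["How", "are", "you", "?"], speaker="A", emotion=0),
    Utterance(["Great", ",", "thanks", "!"], speaker="B", emotion=1),
]
```

The model predicts one emotion label per utterance, so inputs and targets are aligned at the utterance level.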

Utterance-level Encoder
For each utterance taken out of the conversational context, it is important to extract the contextual information among its words. We employ the widely used pretrained model RoBERTa (Liu et al., 2019) to encode the utterance. An utterance u_n = [w_1, w_2, ..., w_{L_n}] is fed into RoBERTa, and we obtain the hidden states of the last layer:

W = RoBERTa([w_1, w_2, ..., w_{L_n}]),

where W ∈ R^{L_n × d_w} and d_w is the dimension of the hidden states of words. The goal of the utterance-level encoder is to encode a representation for each utterance. Therefore, we deploy a max-pooling operation and a linear projection following Ishiwatari et al. (2020) and Li et al. (2020b):

c_n = W_u MaxPooling(W) + b_u,

where c_n ∈ R^{d_u} is the representation of the utterance u_n and d_u is the dimension of the representation. After all utterances are encoded, we obtain the representation of the conversation C ∈ R^{N × d_u}.
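The pooling-and-projection step can be sketched in PyTorch. To keep the sketch self-contained we pass in pre-computed word hidden states rather than running RoBERTa itself, and the module and parameter names are our own:

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Pools per-word hidden states into a single utterance vector.

    In the paper the hidden states W come from RoBERTa's last layer;
    here they are an input so the sketch runs without the pretrained model.
    """
    def __init__(self, d_w: int, d_u: int):
        super().__init__()
        self.proj = nn.Linear(d_w, d_u)   # linear projection after pooling

    def forward(self, word_states: torch.Tensor) -> torch.Tensor:
        # word_states: (L_n, d_w) -> element-wise max over words -> (d_w,)
        pooled, _ = word_states.max(dim=0)
        return self.proj(pooled)          # c_n in R^{d_u}

d_w, d_u = 768, 300                       # RoBERTa-base width, utterance dim
enc = UtteranceEncoder(d_w, d_u)
c_n = enc(torch.randn(12, d_w))           # a 12-word utterance
```

Stacking the resulting c_n vectors over a conversation yields the matrix C ∈ R^{N × d_u}.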

Conversation-Level Encoder
Each utterance sits in a conversational context that carries rich information. For an utterance, the action and intention of the speaker and the interactions with other utterances in the past, present, and future are crucial for modeling the context more precisely. Therefore, we construct a pSychological-Knowledge-Aware Interaction Graph (SKAIG) over the utterances of a conversation, and then utilize the Graph Transformer (Shi et al., 2021) network to process it.

SKAIG Construction
We construct a directed graph modeling interactions between utterances, denoted as G = (V, E, R, A). Specifically, u_n ∈ V is an utterance node, r ∈ R is an edge type, e_{i,j} = (u_i, r, u_j) ∈ E is the edge between utterances i and j, and a_{i,j} ∈ A is the edge attribute (representation) of e_{i,j}.

Vertices: For an utterance u_n acting as a node in the graph, we use the representation c_n ∈ R^{d_u} encoded by the utterance-level encoder to initialize the node feature h_n^0. The initial node feature contains no conversational contextual information.

Relations:
The interaction between utterances is often indicated by the relations between the speakers. In previous works (Ghosal et al., 2019; Ishiwatari et al., 2020), two important speaker relations r are considered: self-dependency and inter-speaker dependency. Based on this scheme, we propose more refined types of relations so that the speaker's action and intention in the conversation can be modeled. In our setting, utterances in the past context can guide the action of the current utterance, and those in the future context can predict the intention. Therefore, for two utterances u_i and u_j, where u_i appears before u_j, if they share the same speaker, the relation u_i → u_j means that u_i passes the guidance of the speaker's action to u_j, and we denote this relation as xWant. The relation u_i ← u_j represents that u_j can predict the intention of the speaker, as u_j is in the future context of u_i, and we denote it as xIntent. Conversely, if u_i and u_j do not share the speaker, u_i → u_j provides the influence of u_i's speaker on the action of u_j's speaker, and we denote it as oWant. As the intention can only be inferred by the speaker him/herself, no "intent" relation connects two utterances with different speakers. Furthermore, an utterance is self-connected (u_i → u_i), and this self-effect relation is denoted as xEffect. Therefore, we obtain four types of edge relations: R = (xEffect, xWant, oWant, xIntent).
Edges: An edge e_{i,j} = (u_i, r, u_j) between two utterances u_i and u_j models the interaction between them. We assume that the influence of an utterance on contextual utterances is locally effective, so we connect the targeted node with the contextual nodes of every speaker within a window of size k. When k = 1, the targeted utterance considers one utterance of every speaker in the past and future context respectively, as exemplified in Fig. 2.
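One possible reading of this construction can be sketched as plain Python; the exact per-speaker window bookkeeping is our assumption about details the paper leaves implicit:

```python
def build_skaig_edges(speakers, k=1):
    """Return edges (i, relation, j): node i sends a message to node j.

    For each target j: a self-loop (xEffect), up to k past utterances per
    speaker (xWant if same speaker, oWant otherwise), and up to k future
    utterances of the same speaker (xIntent) -- intent is never drawn
    from a different speaker, per the relation scheme above.
    """
    edges = []
    n = len(speakers)
    for j in range(n):
        edges.append((j, "xEffect", j))                 # present: self-loop
        seen_past, seen_future = {}, {}
        for i in range(j - 1, -1, -1):                  # past context
            s = speakers[i]
            if seen_past.get(s, 0) < k:
                rel = "xWant" if s == speakers[j] else "oWant"
                edges.append((i, rel, j))
                seen_past[s] = seen_past.get(s, 0) + 1
        for i in range(j + 1, n):                       # future context
            s = speakers[i]
            if s == speakers[j] and seen_future.get(s, 0) < k:
                edges.append((i, "xIntent", j))         # same speaker only
                seen_future[s] = seen_future.get(s, 0) + 1
    return edges

edges = build_skaig_edges(["A", "B", "A", "B"], k=1)
```

For the four-turn conversation A, B, A, B with k = 1, node #2 (speaker A) receives xWant from #0 and oWant from #1, matching the k = 1 example described for Fig. 2.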
Edge Representations: Different from previous works (Ghosal et al., 2019; Ishiwatari et al., 2020) that only assign a weight to each edge, we introduce commonsense knowledge to enrich the edges of different relations. Fortunately, the commonsense transformer COMET (Bosselut et al., 2019), which is a GPT (Radford et al., 2018) model, can provide such features for all of our relations. We utilize a COMET model trained on ATOMIC, a knowledge base of If-Then reasoning. The nine relations in ATOMIC cover all of the relations we require. Under such circumstances, COMET can generate descriptions of "then" based on the input and the selected relation. For example, taking u_n and the relation xWant as inputs, COMET will generate a reasoning sequence following "If u_n, then X wants to".
We concatenate u_n and a relation with mask tokens (e.g., u_n [MASK] <xWant>) in the input format of COMET, and then COMET processes the input. Following Ghosal et al. (2020), we take the hidden state of the relation token from the last layer of the COMET transformer encoder as the relation's representation. For an edge e_{i,j} = (u_i, xWant, u_j), the corresponding representation is a_{i,j}, whose dimension is mapped from 768 to d_u by a following linear unit.
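The input formatting and the 768 → d_u projection can be sketched as follows. Running COMET itself is out of scope here, so its relation-token hidden state is stood in by a random vector, and `comet_input` is our own helper name:

```python
import torch
import torch.nn as nn

def comet_input(utterance: str, relation: str) -> str:
    # Input format described in the paper: "u_n [MASK] <relation>"
    return f"{utterance} [MASK] <{relation}>"

d_u = 300
edge_proj = nn.Linear(768, d_u)     # maps COMET's 768-dim state to d_u

# Stand-in for the relation token's last-layer hidden state from COMET.
comet_hidden = torch.randn(768)
a_ij = edge_proj(comet_hidden)      # edge representation a_{i,j}
```

In the full model, one such a_{i,j} is computed for every edge in the SKAIG before message passing begins.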

Graph Transformer
We utilize an L-layer graph transformer to propagate the interactive information through the SKAIG. We update the node representation h_i^l at layer l with a gated residual aggregation:

h_i^{l+1} = β_i W_s h_i^l + (1 − β_i) Σ_{j∈N(i)} α_{i,j} m_j,   (1)

where N(i) is the set of source nodes connected with the targeted node i, m_j is the message passed by these nodes, α_{i,j} is the attention score, β_i ∈ R^1 is the gate for the residual connection, and W_s ∈ R^{d_u × d_u} is a mapping weight.
The message passed by neighboring nodes contains two parts of information, the contextual relevance and the psychological information, so the message is computed by:

m_j = W_v h_j^l + W_e a_{i,j},   (2)

where W_e ∈ R^{d_head × d_u} and W_v ∈ R^{d_head × d_u} are trainable weights. Furthermore, the attention score that controls how much information should be gathered from neighbors can be computed by:

α_{i,j} = softmax_{j∈N(i)}((W_q h_i^l)^T m_j / √d_head),   (3)

where W_q ∈ R^{d_head × d_u} is a trainable weight. Eq. (3) only considers one attention head, while multiple heads are involved in practice. We concatenate the outputs from all heads after message aggregation and denote the result as o_i. As for the gate,

β_i = σ(W_g [o_i; W_s h_i^l; o_i − W_s h_i^l]),   (4)

where [;] is the concatenating operation and W_g is a trainable weight.
In addition, we replace the original operation after the attention in Shi et al. (2021) with a point-wise feed-forward network proposed by Vaswani et al. (2017). We denote the final output of the conversation as H^L ∈ R^{N × d_u}.
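The message passing above can be sketched as a minimal single-head layer; this is our own simplification of the edge-feature-aware attention in Shi et al. (2021), omitting the multi-head concatenation and feed-forward sublayer, with all module names our own:

```python
import math
import torch
import torch.nn as nn

class EdgeAwareAttention(nn.Module):
    """Single-head sketch: messages carry the edge representation a_ij,
    and a sigmoid gate beta blends the aggregated message with a
    projection of the node's own state (gated residual)."""
    def __init__(self, d: int):
        super().__init__()
        self.q = nn.Linear(d, d)           # query projection
        self.kv = nn.Linear(d, d)          # key/value projection of nodes
        self.e = nn.Linear(d, d)           # edge-feature projection W_e
        self.s = nn.Linear(d, d)           # residual projection W_s
        self.gate = nn.Linear(3 * d, 1)    # beta from [o; s; o - s]
        self.d = d

    def forward(self, h, edges, edge_feat):
        # h: (N, d); edges: list of (src, dst); edge_feat: (len(edges), d)
        out = torch.zeros_like(h)
        for i in range(h.size(0)):
            idx = [e for e, (_, dst) in enumerate(edges) if dst == i]
            if not idx:
                continue                   # no in-edges: leave zero in sketch
            src = torch.tensor([edges[e][0] for e in idx])
            msg = self.kv(h[src]) + self.e(edge_feat[idx])        # m_j
            score = (msg @ self.q(h[i])) / math.sqrt(self.d)
            alpha = torch.softmax(score, dim=0)                   # attention
            o = (alpha.unsqueeze(-1) * msg).sum(dim=0)            # aggregate
            s = self.s(h[i])
            beta = torch.sigmoid(self.gate(torch.cat([o, s, o - s])))
            out[i] = beta * s + (1 - beta) * o                    # gated residual
        return out

torch.manual_seed(0)
d = 8
layer = EdgeAwareAttention(d)
h = torch.randn(3, d)
edges = [(0, 1), (2, 1), (1, 1)]           # all messages target node 1
feat = torch.randn(len(edges), d)
h_new = layer(h, edges, feat)
```

Stacking L such layers (with the feed-forward sublayer the paper adds) propagates psychological messages over several hops of the SKAIG.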

Emotion Classifier
We utilize a linear unit as the classifier to predict the emotion distribution:

P_i = softmax(W_c h_i^L + b_c),

where W_c ∈ R^{d_u × M} and b_c ∈ R^M. The cross-entropy loss utilized to train the model is calculated over a conversation by:

L = − Σ_{i=1}^{N} Σ_{e=1}^{M} y_{i,e} log P_{i,e},

where y_i is the one-hot vector denoting the emotion of utterance i in the conversation, and e indexes the emotion dimensions.

Experimental Setup

Datasets
DailyDialog DailyDialog is a dataset containing two-way dialogues about daily life. Seven emotions are included: neutral, happiness, sadness, anger, surprise, disgust, and fear. In DailyDialog, over 83% of the utterances are labeled as neutral.
EmoryNLP EmoryNLP is collected from the TV series Friends and contains multi-speaker conversations. Seven emotions are annotated: neutral, mad, sad, scared, powerful, peaceful, and joyful.
MELD MELD is also collected from Friends and is therefore likewise a dataset of multi-speaker conversations. The emotion labels are the same as those in DailyDialog.

Baselines and Compared Methods
We compare our model with the following baselines and state-of-the-art models. CNN (Kim, 2014) is the widely used text convolution network. DialogueRNN employs GRUs to track speakers' global and emotional states; Ghosal et al. (2020) implement both CNN- and RoBERTa-based versions of DialogueRNN. DialogueGCN (Ghosal et al., 2019) uses graph convolutional networks to process the graph constructed from self-dependency and inter-speaker dependency. KET (Zhong et al., 2019) is a hierarchical transformer using a proposed graph attention to extract information from a knowledge base. HiTrans (Li et al., 2020a) is a hierarchical transformer based on BERT, augmented with a speaker relation prediction task. RGAT-POS (Ishiwatari et al., 2020) is a relation-aware graph attention network utilizing the proposed relational position encoding; its speaker modeling is based on DialogueGCN. DialogXL (Shen et al., 2021) is an all-in-one XLNet that processes the conversation in one step and also utilizes the speaker modeling of DialogueGCN. COSMIC (Ghosal et al., 2020) is a modified DialogueRNN based on RoBERTa-large; it models more refined speaker states with bidirectional GRUs and utilizes commonsense knowledge from COMET to initialize part of the inputs of the speaker's internal, external, and intent GRUs. RoBERTa (Liu et al., 2019) is the utterance-level encoder directly followed by a classifier. RoBERTa-Transformer replaces the graph transformer with a transformer, which can be regarded as a locally and fully connected graph without mental relation modeling. We implement RoBERTa and RoBERTa-Transformer in the setting of our method; for the other models, we take the reported performance from the corresponding papers.

Implementation
For IEMOCAP, we use RoBERTa-base to initialize the utterance-level encoder. For the other datasets, RoBERTa-large is selected, deployed via the HuggingFace Transformers toolkit (Wolf et al., 2019). RoBERTa is fine-tuned during training. The batch size is set to 1 for IEMOCAP and 8 for the other datasets. For the graph transformer, the dimension of the utterance is set to 200 for MELD and 300 for the other datasets; the dimension of the feed-forward network is set to 200 for MELD and 600 for the other datasets; the head dimension is set to 50 for all datasets; and the number of layers is searched from 1 to 6. We train the model with the AdamW optimizer (Loshchilov and Hutter, 2019), with the learning rate set to 8e-6 for MELD and 1e-5 for the other datasets. The number of training steps is set to 10000, with the first 1000 steps used for warming up and the remaining steps decaying the learning rate. Early stopping is activated after 10 epochs. For IEMOCAP, EmoryNLP, and MELD, the weighted F1 score is selected as the evaluation metric. For DailyDialog, following previous works, we report the micro F1 score excluding utterances labeled as neutral, as well as the macro F1 score. All of our results are averaged over 5 runs.
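The warmup-then-decay schedule can be sketched with a PyTorch `LambdaLR`; the linear shape of both phases is our assumption, since the text only states 1000 warmup steps followed by decay over 10000 total steps:

```python
import torch

def lr_lambda(step, warmup=1000, total=10000):
    """Multiplier on the base learning rate: linear warmup for the
    first `warmup` steps, then linear decay to zero at `total`."""
    if step < warmup:
        return step / warmup
    return max(0.0, (total - step) / (total - warmup))

model = torch.nn.Linear(300, 7)                       # stand-in model
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)  # 8e-6 for MELD
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
```

Calling `sched.step()` once per training step then traces the warmup/decay curve on top of the base learning rate.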

Overall Results
As illustrated in Tab. 2, our method achieves state-of-the-art results on IEMOCAP, DailyDialog, and EmoryNLP, and competitive results on MELD.
For IEMOCAP, RoBERTa performs poorly compared to models with conversational context modeling, which indicates that IEMOCAP contains rich conversational context. COSMIC achieves limited improvement over RoBERTa-DialogueRNN, while our method outperforms RoBERTa-Transformer. We think the reason is that our method benefits from the structural modeling of psychological knowledge in IEMOCAP, whereas COSMIC only models psychological states by updating step by step. However, interactions between utterances several steps apart play an important role in IEMOCAP, as elucidated in Fig. 3. To this end, our method models the conversational context better and outperforms COSMIC by 1.68 weighted-F1. Models based on pretrained encoders achieve similar performance; that our method performs better indicates the importance of CSK for enhancing psychological states.
For DailyDialog, our method exceeds COSMIC by 1.27 micro-F1 and RoBERTa-Transformer by 1.47 micro-F1. RoBERTa-Transformer is competitive in micro-F1 but performs poorly in macro-F1. Conversely, our method achieves the best macro-F1, which demonstrates that introducing SKAIG can partly mitigate the influence of data imbalance on the transformer.
For EmoryNLP, the contextual information provided by conversations is limited, as RoBERTa achieves performance similar to that of RoBERTa-Transformer and RoBERTa-DialogueRNN. Even so, our method still exceeds COSMIC by 0.77 weighted-F1. For MELD, our method achieves competitive performance against COSMIC. We think the reason may be that MELD contains short conversations involving multiple speakers, which leads to limited interactive influence from psychological states. Therefore, our method does not show advantages on MELD. The error analysis on MELD is presented in section 4.5.

Effect of Relations
We evaluate the effect of different relations on our model. We take one relation off our proposed SKAIG while keeping the corresponding edges, so that the modeling of conversational context is preserved; to achieve this, we only remove the edge representations of the selected relation. In addition, we train a variant in which the CSK edge representations are replaced with freely trainable parameters.

Taking off different relations in SKAIG leads to different degrees of performance drop. Removing the self-connected relation xEffect degrades performance, which indicates the importance of modeling the self-effect of the present state. Furthermore, removing xWant or oWant, the two relations that model the action information provided by the past context from different speakers, also degrades performance, demonstrating that action information can enhance interactions between utterances. On the other hand, xIntent likewise affects the performance of our model, which indicates the necessity of considering intent information from the future context. The trainable variant performs the worst, achieving the lowest F1 scores. We deduce that the reason may be that no CSK is provided to inform the model what kind of relation is modeled between two utterances. This emphasizes the importance of CSK in guiding the model to learn more rational information about the speaker's psychological states.

Effect of Window Size
In this section, we evaluate the effect of the window size on our method. The performance on the validation set is illustrated in Fig. 3. Except for IEMOCAP, the upper window size for the other datasets is 6 or 7, because these datasets contain relatively short conversations, as shown in Tab. 1: the growth in the number of edges in the graph slows down once the window size exceeds 6 or 7.
From the illustration, only IEMOCAP shows significant improvement as the window widens, while the other datasets show flat trends. The reason may be that IEMOCAP contains more contextual information and more obvious interactions between utterances in the conversation (as elucidated in section 4.1). Although the trend lines vary over different ranges on different datasets, the performance basically increases first and then drops as the window grows large, except on EmoryNLP. This observation accords with our claim that the psychological interactions between utterances are locally effective. On the other hand, the reason our method is insensitive to the window size on EmoryNLP may be that the contextual information provided by conversations in this dataset is limited, which can be inferred from the similar performance of RoBERTa and RoBERTa-Transformer (-DialogueRNN) in Tab. 2.

Case Study
In Fig. 4 (a), we exemplify a simple case in which a targeted utterance receives messages of intent from future utterances. Specifically, the attention scores are averaged over the attention heads in the top layer of the graph transformer. For the xEffect of #1, CSK can provide a positive indication of the speaker's self-effect state, whereas #1 is likely to be predicted as neutral by models without CSK. As for #5, which is two steps away from #1, the xIntent it provides can positively enhance #1. In addition, the attention score of the edge (5, xIntent, 1) is the highest among all the in-degree edges of #1. This coincides with our claim that an utterance can directly and explicitly pass psychological messages to other utterances over several steps, and indicates the necessity of modeling structural interactions. In Fig. 5, we illustrate a case in which our method gives the correct prediction while COSMIC fails.
In this case, messages of action from history utterances contribute the most, while the self-effect (#46 → #46) and intent (#(>46) → #46) have lower importance. The xWant directly passed by #37 and the oWant from #38 can provide positive guidance to #46, and both are several steps away, which further demonstrates the importance of direct structural psychological interactions. Conversely, due to its sequential modeling, COSMIC considers intent, effect, and reaction from #46 itself and effect and reaction from the neighboring #45. Although the knowledge can provide useful clues such as "alone", COSMIC fails to make the correct prediction. This indicates that COSMIC is hindered by its implicit and limited psychological interactions with contextual utterances, even though those contextual utterances can provide more effective psychological information.

Error Analysis
In Fig. 4 (b), we illustrate the confusion matrix of predictions on IEMOCAP. To study the conditions under which our model fails, the diagonal of the confusion matrix is set to zero. Deeper colors denote more misclassified samples. From the heatmap, happy samples are likely to be predicted as excited, and negative emotions such as sad are often confused with frustrated. These observations indicate that the difficulty of discriminating similar emotions in emotion recognition still troubles our method.
In Fig. 4 (c), we illustrate the effect of an increasing number of speakers in a conversation on our method and COSMIC on MELD. At first, the performance of both increases, unlike that of HiTrans (Li et al., 2020a), which constantly decays. Compared with COSMIC, our method achieves competitive performance, demonstrating that it can handle conversations involving a small number of speakers. However, as the number keeps increasing, our method shows the same dropping trend as HiTrans and COSMIC, but the drop is sharper than that of COSMIC. This indicates that our method struggles when a conversation involves a large number of speakers. In future work, we will explore more effective speaker modeling schemes to deal with conversations involving many speakers.

Related Work
Conversational Emotion Recognition has become a popular task in recent years. Unlike traditional emotion recognition, CER involves conversational context. Hazarika et al. (2018b,a) and Jiao et al. (2020) utilize memory networks to model such context. To consider the speaker and listener in the conversation, Majumder et al. (2019) propose DialogueRNN, which utilizes GRUs to update speakers' states and the global line of the conversation. DialogueGCN (Ghosal et al., 2019) models two relations between speakers, self- and inter-speaker dependencies, and utilizes graph networks (Schlichtkrull et al., 2018; Kipf and Welling, 2017) to model the graph constructed from these relations. Zhong et al. (2019) propose a graph attention mechanism to extract information from an external knowledge base and utilize the Transformer (Vaswani et al., 2017) to model conversations.
Furthermore, with the spread of pretrained models, recent works build on these high-performance, large-scale models. Ishiwatari et al. (2020) propose a relation-aware position encoding based on DialogueGCN and utilize BERT (Devlin et al., 2019) to encode utterances. Li et al. (2020a) utilize BERT and propose a speaker relation prediction task to augment CER. Shen et al. (2021) utilize XLNet and model the whole conversation in one step. Introducing commonsense knowledge to CER, Ghosal et al. (2020) propose COSMIC, which is based on DialogueRNN equipped with RoBERTa (Liu et al., 2019) to model the speakers' internal, external, and intent states.

Conclusion
In this paper, we study conversational emotion recognition. The state-of-the-art method ignores the psychological interactions between utterances over several time steps and does not conform with the nature of psychological states. We therefore propose a pSychological-Knowledge-Aware Interaction Graph (SKAIG), which contains four relations to model the psychological states of speakers. Enhanced by commonsense knowledge and the deployment of a graph transformer, our method yields state-of-the-art or competitive performance on benchmark datasets.