A Discourse-Aware Graph Neural Network for Emotion Recognition in Multi-Party Conversation

Emotion recognition in multi-party conversation (ERMC) is an emerging and increasingly popular research topic in natural language processing. Prior research focuses on exploring sequential information but ignores the discourse structures of conversations. In this paper, we investigate the importance of discourse structures in handling informative contextual cues and speaker-specific features for ERMC. To this end, we propose a discourse-aware graph neural network (ERMC-DisGCN) for ERMC. In particular, we design a relational convolution to leverage the self-speaker dependency of interlocutors to propagate contextual information. Furthermore, we exploit a gated convolution to select more informative cues for ERMC from dependent utterances. The experimental results show that our method outperforms multiple baselines, illustrating that discourse structures are of great value to ERMC.


Introduction
In the past few years, emotion recognition in conversation (ERC) has become increasingly popular in natural language processing (NLP) with the proliferation of open conversational data on social media platforms (Poria et al., 2019a). Similar to text sentiment analysis, ERC is the task of determining the emotion of each utterance within a conversation, as shown in Fig. 1, and plays an important role in many NLP applications, such as opinion mining in conversation, social media analysis, and emotion-aware dialogue systems (Ghosal et al., 2019). However, ERC, particularly emotion recognition in multi-party conversation (ERMC), is often more difficult than traditional text sentiment analysis due to the emotional dynamics of conversations (Poria et al., 2019b). Consequently, recognizing

[Figure 1: An example of the ERC task:
(1) A: How does she do that? (surprise)
(2) B: I cannot sleep in a public place.
(3) A: Would you look at her? She...
The gold labels are the different emotions of the utterances, and the discourse structure is shown on the left. QAP, Ack, Ela, and Expl respectively represent the Question-Answer Pair, Acknowledgment, Elaboration, and Explanation relations.]
the emotion of an utterance in a multi-party conversation depends not only on the utterance itself and its context but also on the self and interpersonal dependencies and the emotions expressed in the preceding utterances (Jiao et al., 2019; Zhong et al., 2019; Shen et al., 2021).
Many approaches have been proposed for ERC, focusing on conversational context representation and speaker-specific modeling. While earlier works on ERC focus on two-party conversation and exploit recurrent neural networks (RNNs) to capture sequential context features of conversations (Jiao et al., 2019; Ghosal et al., 2019), recent studies devote more effort to ERMC and explore techniques such as multi-task learning and pre-trained language modeling (Shen et al., 2021) to capture speaker-specific information. Although these studies have greatly advanced ERC, most of them ignore the important conversational discourse structures. Therefore, they can only leverage cues in the neighboring context of a conversation and struggle to handle informative distant dependencies for ERC.
Actually, conversational discourse structures contain discourse relations or discourse dependencies between utterances and thus provide a straightforward way to capture both adjacent and distant cues for ERMC. Fig. 1 illustrates a multi-party conversation example with its discourse structure obtained from the discourse parser proposed by Shi and Huang (2019). As we can see, although the first and the fourth utterances are distant in position within the conversation, they have an immediate discourse relation and are thus annotated with the same emotion type, surprise. Therefore, such discourse relations offer important contextual cues for ERMC. On the other hand, discourse structures have proven useful for document-level sentiment analysis (Bhatia et al., 2015; Märkle-Huß et al., 2017; Kraus and Feuerriegel, 2019), and we believe that they are also beneficial for ERMC. Moreover, recent progress in conversational discourse parsing (Shi and Huang, 2019; Li et al., 2021) makes it feasible to explore discourse structures to help model conversational contexts and speakers for ERMC.
However, two new problems may arise when discourse structures are applied to ERMC. First, previous works have shown that speaker-specific information is very important for ERMC. A key issue is thus how to incorporate conversational discourse structures into speaker-specific modeling for ERMC. Second, discourse structures involve dependent relations between utterances, but not all information from dependent utterances is useful for conversational emotion recognition. Therefore, another important problem is how to select more informative cues for ERMC.
To address the aforementioned issues, we propose a discourse-aware graph neural network for emotion recognition in multi-party conversation, named ERMC-DisGCN. It consists of three main modules. Firstly, a sequential context encoding module exploits Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) to capture the sequential features of utterances in a conversation. Then, we exploit discourse dependency links and discourse relations to construct a graph, which contains two main convolution operations, namely a relational convolution and a gated convolution. The relational convolution is used to model the self-speaker dependency based on discourse structures, where individual speakers resist the change of their own emotion against external influence (Ghosal et al., 2019), while the gated convolution adopts a gated mechanism to select informative cues for ERMC from dependent utterances. Similar to previous graph-based approaches, we take utterances as nodes of the constructed graph. Finally, a decoding module is applied to predict the emotion label of each utterance. In addition, we employ the deep sequential discourse parser developed by Shi and Huang (2019) to obtain the explicit discourse dependency trees of input conversations.
In summary, we make the following contributions:
• We propose a discourse-aware graph neural network for emotion recognition in multi-party conversation (ERMC).
• We devise a discourse-based relational graph convolution to exploit the self-speaker dependency of interlocutors to propagate contextual information, and further use a gated convolution to select more informative cues for ERMC from dependent utterances.
• We conduct experiments on both multi-party and two-party conversation corpora, and demonstrate that using conversational discourse structures can benefit ERMC.

Related work
Recently, ERC has become a new trend due to the emergence of publicly available conversational datasets collected from social media platforms and scripted situations (Busso et al., 2008; Zahiri and Choi, 2018; Poria et al., 2019a). Earlier works focus on capturing sequential context features for emotion recognition in two-party conversation, exploiting recurrent networks or the Transformer (Vaswani et al., 2017) to capture contextual information.
For emotion recognition in multi-party conversation (ERMC), studies devote more effort to handling speaker-specific information. Some represent the entire conversational corpus as a large graph to model speaker-sensitive dependency, while others use speaker identification as an auxiliary task to capture speaker-specific features. Shen et al. (2021) propose an all-in-one XLNet model with dialog-aware self-attention to deal with multi-party structures. However, these studies neglect the informative discourse structures in multi-party conversations. To the best of our knowledge, we are the first to investigate the importance of discourse structures in handling informative contextual cues and speaker-specific features for ERMC.
Discourse structures have been successfully applied to document-level sentiment analysis (Bhatia et al., 2015; Märkle-Huß et al., 2017; Kraus and Feuerriegel, 2019), where the discourse structures are produced by a Rhetorical Structure Theory (RST) parser (Li et al., 2014). Recently, Shi and Huang (2019) propose a deep sequential model for conversational discourse parsing and achieve new state-of-the-art (SOTA) results. With this model, Jia et al. (2020) transform dialogue histories into threads for multi-turn response selection. Inspired by Xia et al. (2019), we exploit discourse dependency links and discourse relations to construct a graph. In particular, we stack two convolutional layers to aggregate contextual and speaker-specific information from the neighborhood of each utterance in the graph.

Problem Definition
Suppose there are N constituent utterances u_1, u_2, ..., u_N in a conversation with X (X ≥ 2) speakers s_1, s_2, ..., s_X. Utterance u_i is uttered by speaker s_{m(u_i)}, where the function m maps an utterance to its corresponding speaker. ERMC aims to predict the emotion label of each utterance.

Pre-processing
Similar to most existing studies, the input of our model is a multi-party conversation consisting of context-independent utterance-level feature vectors. Besides, we need to obtain discourse structures to construct a graph. We complete these works in this pre-processing module.
Utterance Encoding: Earlier works adopt a Convolutional Neural Network (CNN) (Kim, 2014) to obtain feature vectors for utterances. To compare with the latest model (Shen et al., 2021) based on XLNet, we use the BERT model (Devlin et al., 2019) to extract context-independent utterance-level feature vectors. Let an utterance u consist of a sequence of tokens x_1, x_2, ..., x_N. First, the special token [CLS] is prepended to the utterance to create the input sequence [CLS], x_1, x_2, ..., x_N. Then, we pass this sequence to BERT and extract the activations of the final four layers at the [CLS] position. Finally, these four vectors are averaged to obtain a feature vector of dimension 768.
Discourse Parsing: To obtain discourse dependency trees, we utilize the discourse parser proposed by Shi and Huang (2019), a deep sequential model that achieves SOTA performance on the STAC corpus (Asher et al., 2016). We feed the conversations into the discourse parser, which outputs quadruples (i, j, r_ij, p_ij): the directed edges of a discourse dependency tree with head i and tail j, indicating that u_i has an immediate relation r_ij with u_j, where p_ij is the confidence score of the dependency link. Note that i, j = 1, 2, ..., N and j > i.
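The averaging step for the utterance features can be sketched as follows, assuming the per-layer activations have already been extracted from BERT (the function name and input layout are our own illustration):

```python
import numpy as np

def utterance_vector(layer_activations):
    """Average the [CLS] activations of the final four BERT layers.

    layer_activations: list of (seq_len, 768) arrays, one per layer,
    ordered from first layer to last. The [CLS] token sits at position 0.
    """
    cls_vectors = [layer[0] for layer in layer_activations[-4:]]
    return np.mean(cls_vectors, axis=0)  # shape: (768,)
```

The resulting 768-dimensional vector is the context-independent representation u_i fed to the sequential encoder.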

Model Overview
As illustrated in Fig. 2, there are three components in our proposed framework: (1) sequential context encoding; (2) discourse graph modeling; (3) emotion recognition. In the following sections, we explain each component in detail.
After the pre-processing, we obtain not only the dependency trees of conversations but also the context-independent utterance-level feature vectors. Then, we use a Bi-directional LSTM to transform these vectors into context-dependent ones. Next, a discourse-based graph stacks two different convolutional layers to aggregate contextual and speaker-specific information. Finally, the output feature vectors of the graph are used to recognize the emotions of utterances.

Sequential Context Encoding
Similar to previous strategies, the sequential context encoder processes the constituent utterances of a conversation as a sequence along the timeline. Following prior work, we use a Bi-directional LSTM to capture sequential context information: g_i = BiLSTM(u_i), i = 1, 2, ..., N, where u_i and g_i are the context-independent and sequential utterance representations, respectively.
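A minimal PyTorch sketch of this encoder (the class name is ours; dimensions follow the implementation details reported later, i.e., 768-dimensional BERT features and 100-dimensional g_i):

```python
import torch
import torch.nn as nn

class SequentialContextEncoder(nn.Module):
    """Bi-directional LSTM over the utterance sequence of one conversation."""

    def __init__(self, in_dim=768, hidden_dim=50):
        super().__init__()
        # bidirectional=True concatenates forward/backward states: 2 * hidden_dim
        self.lstm = nn.LSTM(in_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, utterances):
        # utterances: (1, N, in_dim) context-independent vectors u_i
        g, _ = self.lstm(utterances)
        return g  # (1, N, 2 * hidden_dim) sequential representations g_i
```

Each g_i then initializes one graph vertex in the next module.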

Discourse Graph Modeling
Conversational discourse structures provide a straightforward way to capture both adjacent and distant cues for ERMC. Inspired by Xia et al. (2019), we exploit discourse dependency trees to construct graphs that propagate contextual and speaker-specific information. The framework is detailed below.

Graph Construction
First, we introduce the following notation: a multi-party conversation with N utterances is represented as a directed graph G = (V, E, R, W), with vertices/nodes v_i ∈ V and labeled edges (relations) e_ij ∈ E, where r_ij ∈ R is the relation type of the edge between v_i and v_j, and α_ij ∈ W is the weight of the labeled edge e_ij, with 0 ≤ α_ij ≤ 1 and i, j = 1, 2, ..., N. The graph is constructed from the discourse dependency trees in the following way.
Vertices: Each utterance in a multi-party conversation is represented as a vertex v_i ∈ V. Each vertex v_i is initialized with the corresponding sequentially encoded representation g_i, i = 1, 2, ..., N.
Edges: The construction of the edges E depends on the discourse dependency trees. For instance, if there is a quadruple (i, j, r_ij, p_ij) in a dependency tree, there is an edge e_ij in the graph with head u_i and tail u_j. As the graph is directed, e_ij is not equivalent to e_ji. In most cases an utterance depends only on its historical utterances, so edges are generally directed from earlier utterances to later ones, forming a topological order.
For speaker-specific information, Ghosal et al. (2019) model the emotional inertia of speakers in two-party conversations, where individual speakers resist the change of their own emotion against external influence. However, it is a challenge to incorporate discourse structures into speaker modeling for ERMC. In our model, we leverage the self-speaker dependency of interlocutors to model the emotional inertia of speakers by directly letting one utterance know whether its dependent utterance belongs to the same speaker. In Fig. 2, we use a dashed line to represent discourse dependencies between utterances from the same speaker and use a solid line to denote discourse dependencies between utterances from different speakers.
Edge Weights: The spatial graph convolution operation essentially propagates node information along edges (Wu et al., 2020), so proper edge weights are helpful. In our graph model, we set the edge weights statically: α_ij = p_ij, the confidence score of edge e_ij obtained from the discourse parser.
Relations: The relation r_ij of an edge e_ij is determined by two aspects.
Discourse relations - These depend on the discourse dependency trees: r_ij is the discourse relation type of edge e_ij, i.e., of the dependency link between utterances u_i and u_j. According to Shi and Huang (2019), there are 16 types of discourse relations: Comment, Clarification question, Elaboration, Acknowledgment, Continuation, Explanation, Conditional, Question-Answer Pair (QAP), Alternation, Question-Elaboration (Q-Elab), Result, Background, Narration, Correction, Parallel, and Contrast.
Self-speaker dependency - This relation depends on the speakers. If two utterances are from the same speaker and have discourse relation r_q (one of the 16 discourse relations), we transform r_q into a distinct relation type r_q' to model the self-speaker dependency.
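Putting this subsection together, the graph construction from parser quadruples can be sketched as follows (the function name and the "-self" relation suffix are our own illustration; the parser itself is not shown):

```python
def build_discourse_graph(quadruples, speaker_of):
    """Build an edge list (head i, tail j, relation type, weight) from parser output.

    quadruples: iterable of (i, j, rel, p) from the discourse parser, with j > i.
    speaker_of: mapping from utterance index to its speaker (the function m).
    If head and tail share a speaker, the discourse relation rel is replaced
    by a distinct self-speaker variant, doubling the 16 relation types.
    """
    edges = []
    for i, j, rel, p in quadruples:
        r = rel + "-self" if speaker_of[i] == speaker_of[j] else rel
        edges.append((i, j, r, p))  # edge weight alpha_ij = confidence p_ij
    return edges
```

For example, a QAP link between two speakers keeps its type, while an Acknowledgment link within one speaker's utterances becomes its self-speaker variant.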

Feature Transformation
We now describe the methodology to transform the sequentially encoded feature vectors using the graph network. After a two-step graph convolution process, the vertex representations g i are transformed into contextual and speaker-specific ones.
In the first step, we consider discourse dependencies as important cues to propagate contextual and speaker-specific information. As there are multiple types of edges, inspired by Schlichtkrull et al. (2018), the new feature h_i^1 of utterance u_i is computed as:

h_i^1 = σ( Σ_{r∈R} Σ_{j∈N_i^r} (α_ij / c_{i,r}) W_r^1 g_j + W_0^1 g_i )

where α_ij is the edge weight, N_i^r denotes the neighboring indices of node g_i under relation r ∈ R, and c_{i,r} is a problem-specific normalization constant set in advance (c_{i,r} = |N_i^r|). σ is an activation function such as ReLU, and W_0^1 and W_r^1 are trainable parameters; only edges of the same relation type r share the projection weight W_r^1. In the second step, to select more informative cues from dependent utterances, a residual gated convolutional layer (Bresson and Laurent, 2018) is applied over the output of the first step:

h_i^2 = h_i^1 + σ( W_0^2 h_i^1 + Σ_{j∈N_i} η_ij ⊙ W_1^2 h_j^1 ),  η_ij = sigmoid( W_2^2 h_i^1 + W_3^2 h_j^1 )

where W_0^2, W_1^2, W_2^2, and W_3^2 are trainable parameters and ⊙ denotes element-wise multiplication. This stack of graph convolutional layers effectively aggregates normalized contextual and speaker-specific information from the neighborhood of each utterance in the graph.
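A numpy sketch of the two convolution steps as we understand them (weight shapes are kept square here for simplicity; in the model the dimensions go 100 → 64 → 64):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relational_conv(g, edges, W_rel, W0):
    """Step 1: relation-specific aggregation in the style of Schlichtkrull et al. (2018).

    g: (N, d) node features; edges: (i, j, r, alpha) tuples propagating j -> i;
    W_rel: dict mapping relation type -> (d_out, d) matrix; W0: (d_out, d) self matrix.
    """
    n, d_out = g.shape[0], W0.shape[0]
    agg = np.zeros((n, d_out))
    c = {}  # c_{i,r} = |N_i^r|, the per-relation neighbor count
    for i, _, r, _ in edges:
        c[(i, r)] = c.get((i, r), 0) + 1
    for i, j, r, alpha in edges:
        agg[i] += (alpha / c[(i, r)]) * (W_rel[r] @ g[j])
    return relu(agg + g @ W0.T)

def gated_conv(h, edges, W0, W1, W2, W3):
    """Step 2: residual gated convolution in the style of Bresson and Laurent (2018).

    The gate eta_ij decides, element-wise, how much of neighbor j flows to node i.
    Output dimension equals input dimension so the residual connection applies.
    """
    agg = np.zeros_like(h)
    for i, j, _, _ in edges:
        eta = sigmoid(W2 @ h[i] + W3 @ h[j])
        agg[i] += eta * (W1 @ h[j])
    return h + relu(h @ W0.T + agg)
```

The gate is what lets the model down-weight uninformative dependent utterances while the relation-specific weights carry the self-speaker distinction.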

Emotion Recognition
After the feature transformation, we regard h_i^2 as the contextual and speaker-specific representation of utterance u_i. We then classify each utterance using a fully connected network:

P_i = softmax( W h_i^2 + b )

To train the model, we use the cross-entropy loss:

L = − Σ_{l∈y_V} Σ_f Y_{lf} ln P_{lf}

where y_V is the set of node indices that have labels and Y is the label indicator matrix.
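This decoding step can be sketched in a few lines of numpy (W, b, and the masking by labeled indices are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict(H2, W, b):
    """H2: (N, d) graph outputs; returns (N, C) per-utterance class probabilities."""
    return softmax(H2 @ W + b)

def cross_entropy(P, Y, labeled):
    """P: (N, C) probabilities; Y: (N, C) one-hot label indicator matrix;
    labeled: the indices y_V of utterances that have gold labels."""
    return -np.sum(Y[labeled] * np.log(P[labeled] + 1e-12)) / len(labeled)
```

The predicted emotion of utterance u_i is simply the argmax over the C classes of row i of P.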

Datasets
To verify the effectiveness of integrating discourse structures for ERMC, we evaluate our model on both multi-party and two-party conversation corpora. All these datasets contain multimodal information for each utterance, but we focus only on the textual information in this work. Table 1 shows the corpus statistics.
MELD (Poria et al., 2019a): A multi-party conversation corpus collected from the TV show Friends. Each utterance is annotated with one of seven emotion classes: neutral, surprise, fear, sadness, joy, disgust, and anger.
EmoryNLP (Zahiri and Choi, 2018): A multiparty conversation corpus collected from Friends, but varies from MELD in the choice of scenes and emotion labels. The emotion labels include neutral, joyful, peaceful, powerful, scared, mad, and sad.
IEMOCAP (Busso et al., 2008): A two-party conversation corpus. The emotion labels include neutral, happiness, sadness, anger, frustrated, and excited. Since this dataset has no validation set, we follow Shen et al. (2021) and use the last 20 dialogues of the training set for validation.

Implementation Details
We use pre-trained BERT-Base 1 to encode utterances and adopt Adam (Kingma and Ba, 2015) as the optimizer, with an initial learning rate of 1e-4 and L2 weight decay of 1e-5 for all three datasets. The batch size is set to 32, 32, and 16 for MELD, EmoryNLP, and IEMOCAP, respectively. The dimensions of g_i, h_i^1, and h_i^2 are set to 100, 64, and 64, respectively. The dropout rate (Srivastava et al., 2014) is set to 0.5. We train all models for a maximum of 100 epochs and stop training if the validation loss does not decrease for 20 consecutive epochs.
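The stopping criterion above can be expressed as a small helper (the class and method names are our own):

```python
class EarlyStopper:
    """Signal a stop when validation loss has not decreased for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

Calling `should_stop` once per epoch with the validation loss reproduces the "20 consecutive epochs" rule.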

Baseline Methods
For a comprehensive evaluation of our proposed ERMC-DisGCN, we compare it with the following baseline methods.
cLSTM: Contextual utterance representations are generated by capturing content from surrounding utterances using a Bi-directional LSTM network.
DialogueRNN: A recurrent network that uses three GRUs to track individual speaker states, the global context, and the emotional state within conversations.
HiGRU (Jiao et al., 2019): It is a hierarchical GRU structure that trains utterance-level and conversation-level encoders jointly.
ConGCN: This model represents the entire conversational corpus as a large heterogeneous graph to capture context-sensitive and speaker-sensitive features.
DialogueGCN (Ghosal et al., 2019): This is a graph-based model to encode speaker dependencies and temporal information within a window context.
BERT-MTL: A multi-task learning framework where features extracted from BERT are used for emotion recognition and speaker identification.
DialogueXL (Shen et al., 2021): An all-in-one XLNet model with dialog-aware self-attention to deal with multi-party structures.
BERT-LSTM: A variation of cLSTM in which the CNN-based utterance-level feature vectors are replaced by our BERT-based feature vectors. We consider this model our strong baseline.
ERMC-GCN: A variation of our approach in which the graph modeling is based only on the timeline of conversations, i.e., without any discourse structures.
Table 2: Overall performance on both multi-party and two-party conversation corpora; improvements are statistically significant under the paired t-test (p<0.05). We use the average F1 score to evaluate each model. Scores marked with "*" are based on our re-implementation, owing to differences between the datasets used in the corresponding work and ours.

Comparison with Baseline Methods
We compare the performance of our proposed ERMC-DisGCN framework with multiple baselines in Table 2. To verify the effectiveness of integrating discourse structures for ERMC, we conduct experiments on both multi-party and two-party conversation datasets.
MELD and EmoryNLP: On these multi-party conversation datasets, we first report our baseline results, which achieve performance comparable to previous systems. Our proposed ERMC-DisGCN then achieves average F1 scores of 64.22% and 36.38%, around 2% better than the strong baseline. Compared to ERMC-GCN, integrating discourse structures leads to F1 improvements of around 1.5% on the two datasets. We attribute this gap in performance to the nature of conversations. Many utterances, like "yeah", "okay", and "no", can express different emotions depending on the context within the conversation. In these cases, discourse structures indicate the most informative historical utterances, which contributes to emotion recognition.
IEMOCAP: On this two-party conversation dataset, we observe that our baseline BERT-LSTM underperforms DialogueXL. The average conversation length in IEMOCAP is 50 utterances, much longer than in MELD and EmoryNLP, so the LSTM fails to propagate rich long-term information, while DialogueXL retains the SOTA result thanks to its enhanced memory for historical context. Compared to ERMC-GCN, integrating discourse structures leads to an F1 increase of only 0.42%. In the following section, we explain why integrating discourse structures performs differently across these datasets.
[Figure 3: The discourse dependency rate between distant utterances on the three datasets.]

Multi-Party vs Two-Party
According to the results in Table 2, integrating discourse structures leads to more significant improvements in multi-party conversations than in two-party conversations. To explain this difference, it is important to understand the nature of multi-party and two-party conversations. After examining the datasets, we report their distant dependency rates in Fig. 3. As we can see, discourse structures in multi-party conversations are much more complex: about 25% of utterances have discourse relations with distant ones, and this rate rises as conversation length increases. In MELD and EmoryNLP there are often more than 5 interlocutors in a conversation, so speakers' turns change quickly and one speaker may respond to another after many turns. In two-party conversations, however, the distant dependency rate is only around 11% and stays steady as conversation length increases; since there are only two interlocutors, they tend to take turns cyclically, and adjacent utterances are more closely related. From the above discussion, we conclude that it is more necessary to exploit discourse structures to handle the rich dependencies between distant utterances in multi-party conversations.
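For concreteness, the statistic discussed above can be computed along these lines (the exact distance threshold defining "distant" is our assumption, not stated in the text):

```python
def distant_dependency_rate(edges, window=1):
    """Fraction of discourse dependency links whose endpoints are more than
    `window` positions apart in the conversation.

    edges: iterable of (i, j, ...) tuples with utterance positions i and j.
    """
    edges = list(edges)
    if not edges:
        return 0.0
    distant = sum(1 for i, j, *_ in edges if abs(j - i) > window)
    return distant / len(edges)
```

Applied per conversation and averaged by conversation length, this yields curves of the kind shown in Fig. 3.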

Different Speaker Modeling Methods
Previous studies have proven that capturing speaker-specific features benefits emotion recognition in conversation. In this section, we conduct experiments to answer the following two questions: (1) Is it helpful to propagate speaker information based on discourse structures? (2) Which speaker modeling method contributes most to our approach?
We replace our self-speaker dependency modeling method with the following three methods. The first is a variation of ours in which the self-speaker dependency is modeled independently of discourse structures, by directly letting one utterance know whether the adjacent one is from the same speaker. The second uses speaker-specific GRUs (Hazarika et al., 2018) to process the histories of each speaker, representing the individual states of speakers. The third is speaker role embedding, which maps each interlocutor to a trainable vector. These methods are all independent of discourse structures but capture different speaker-specific features.
The results of the different speaker modeling methods are shown in Table 3. We observe that the discourse-based self-speaker modeling method performs better than the independent method. This gap supports our hypothesis that the discourse dependencies between distant utterances offer informative cues for capturing speaker-specific features, so it is necessary to integrate discourse structures into speaker modeling. Besides, although the other two methods capture different kinds of speaker-specific features, they perform similarly to our independent variant.

[Figure 4: A case study from the MELD test set, listing for each utterance its ID, speaker, text, gold emotion, and model prediction; e.g., (3) Chandler: "Yeah, can you guys just throw him in the pool later?"]

Ablation Study
We perform an ablation study on three components of our model by removing them one at a time. Experimental results are shown in Table 4. First, we find that the self-speaker dependency is significant in our model. This is consistent with previous findings that capturing speaker-specific features benefits emotion recognition in multi-party conversation, where there are often more than 5 interlocutors. Eliminating the gated convolutional layer in the graph causes our model to drop by 0.55% on MELD and 0.49% on EmoryNLP: discourse structures only offer contextual cues, and not all information from dependent utterances helps emotion recognition, so this gated convolutional layer is necessary to select informative cues in our graph modeling. Finally, the relational convolutional layer successfully aggregates contextual and speaker-specific information from the neighborhood of each utterance according to edge types and makes the largest contribution to our approach.

Case Study
For a comprehensive understanding of our proposed method, we visualize its performance in a case study selected from the MELD test set. As illustrated in Fig. 4, utterance (6) is too short to carry rich semantic features for emotion recognition. However, its dependent utterance (3) offers an informative cue and helps make the right prediction. From the ablation study we concluded that modeling the self-speaker dependency benefits ERMC, but this is not always the case. For instance, we observe two wrong predictions for the adjacent utterances (11) and (12), which are from the same speaker and have a discourse relation. Modeling the self-speaker dependency struggles with emotional shifts (i.e., when the emotion labels of two consecutive utterances from the same speaker differ) (Poria et al., 2019a; Shen et al., 2021). Roughly, our model makes mistakes on 40% of similar cases, which calls for further investigation.

Conclusion
In this paper, we investigate the importance of discourse structures in handling informative contextual cues and speaker-specific features for ERMC. We propose a discourse-aware graph neural network and devise two graph convolutional layers to aggregate normalized contextual and speaker-specific information for each utterance in the graph. Experimental results show that our proposed model outperforms all baselines on all multi-party conversation datasets. Furthermore, we conduct extensive analyses of the proposed model and reach the following findings. First, discourse structures are more helpful for emotion recognition in multi-party conversation than in two-party conversation. Second, it is important to integrate discourse structures into speaker modeling. Third, the gated mechanism helps select more informative cues from dependent utterances for ERMC. In future work, we would like to capture various speaker-specific features to deal with emotional shifts. Since our method relies on explicit discourse structures, we also plan to explore implicit methods to avoid error propagation and address the consequent issues.