MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation

Emotion recognition in conversation (ERC) is a crucial component in affective dialogue systems, which helps the system understand users’ emotions and generate empathetic responses. However, most works focus on modeling speaker and contextual information primarily on the textual modality or simply leveraging multimodal information through feature concatenation. In order to explore a more effective way of utilizing both multimodal and long-distance contextual information, we propose a new model based on multimodal fused graph convolutional network, MMGCN, in this work. MMGCN can not only make use of multimodal dependencies effectively, but also leverage speaker information to model inter-speaker and intra-speaker dependency. We evaluate our proposed model on two public benchmark datasets, IEMOCAP and MELD, and the results prove the effectiveness of MMGCN, which outperforms other SOTA methods by a significant margin under the multimodal conversation setting.


Introduction
Emotion is an important part of human daily communication. Emotion Recognition in Conversation (ERC) aims to automatically identify and track the emotional status of speakers during a dialogue. It has attracted increasing attention from researchers in the fields of natural language processing and multimodal processing. ERC has a wide range of potential applications, such as assisting conversation analysis for legal trials and e-health services. It is also a key component for building natural human-computer interaction that can produce emotional responses in a dialogue.
[Figure 1: An example two-speaker conversation. Transcript excerpt: "Well, you know I'm leaning towards, like, communication. Okay. They have a lot of good schools, right?" / "Yeah, I mean it's just a really good school." / "I'm looking forward to getting there. I mean, God, the campus is cool."]

The fast-growing availability of conversational data on social media is one of the factors that boost the research focus on emotion recognition in conversation. Different from traditional emotion recognition on isolated utterances, emotion recognition in conversation requires context modeling of individual utterances. The context can be attributed to the preceding utterances, temporality in conversation turns, or speaker-related information, etc. Different models have been proposed to capture the contextual information in previous works, including the LSTM-based model (Poria et al., 2017), the conversational memory network (CMN) model (Hazarika et al., 2018b), the interactive conversational memory network (ICON) model (Hazarika et al., 2018a), and the DialogueRNN model, etc. In the example conversation shown in Figure 1, the two speakers are chatting in the context of the male speaker being admitted to USC. In this chatting scene, they change topics a few times, such as the female speaker inviting the male speaker out to play, and so on. But they keep coming back to the topic of USC, and both of them then express an excited emotional status. This shows that long-distance contextual information is of great help in predicting speakers' emotions. However, previous models cannot effectively capture both speaker and long-distance dialogue contextual information simultaneously in multi-speaker conversation scenarios. Ghosal et al. (2019), therefore, first propose the DialogueGCN model, which applies a graph convolutional network (GCN) to capture long-distance contextual information in a conversation. DialogueGCN takes each utterance as a node and connects any nodes that are within the same window in a conversation. It models both the dialogue context and speaker information well, which leads to state-of-the-art ERC performance.
However, like most previous models, DialogueGCN only focuses on the textual modality of the conversation, ignoring the effective combination of other modalities such as the visual and acoustic modalities. Works that do consider multimodal contextual information often perform only simple feature-concatenation-style multimodal fusion.
In order to effectively explore multimodal information and at the same time capture long-distance contextual information, we propose a new multimodal fused graph convolutional network (MMGCN) model in this work. MMGCN constructs a fully connected graph within each modality and builds edge connections between nodes corresponding to the same utterance across different modalities, so that contextual information from different modalities can interact. In addition, speaker information is injected into MMGCN via speaker embeddings. Furthermore, unlike DialogueGCN, which is a non-spectral-domain GCN whose many optimized matrices consume substantial computing resources, we encode the multimodal graph using a spectral-domain GCN and extend the GCN from a single layer to deep layers. To verify the effectiveness of the proposed model, we carry out experiments on two benchmark multimodal conversation datasets, IEMOCAP and MELD. MMGCN significantly outperforms other models on both datasets.
The rest of the paper is organized as follows: Section 2 discusses related works; Section 3 introduces the proposed MMGCN model in detail; Sections 4 and 5 present the experiment setups on two public benchmark datasets, the analysis of experiment results, and the ablation study; finally, Section 6 draws conclusions.

Emotion Recognition in Conversation
With the fast development of social media, much more interaction data has become available, including several open-source conversation datasets such as IEMOCAP (Busso et al., 2008), AVEC (Schuller et al., 2012), MELD, etc. ERC has attracted much research attention recently.
Many previous works focus on modeling contextual information due to its importance in ERC. Poria et al. (2017) leverage an LSTM-based model to capture interaction-history context. Hazarika et al. (2018b,a) first pay attention to the importance of speaker information and exploit different memory networks to model different speakers. DialogueRNN leverages distinct GRUs to capture speakers' contextual information. DialogueGCN (Ghosal et al., 2019) constructs a graph considering both speaker and conversation sequential information and achieves state-of-the-art performance.

Multimodal Fusion
Most recent studies on ERC focus primarily on the textual modality. Poria et al. (2017) and Hazarika et al. (2018b,a) leverage multimodal information by concatenating features from the three modalities without modeling the interaction between modalities. Chen et al. (2017) conduct multimodal fusion at the word level for emotion recognition on isolated utterances. Sahay et al. (2018) consider contextual information and use relations in the emotion labels across utterances to predict the emotion. Zadeh et al. (2018) propose MFN to fuse information from multiple views, which aligns features from different modalities well. However, MFN neglects to model speaker information, which is also significant for ERC. The state-of-the-art DialogueGCN model only considers the textual modality. In order to explore a more effective way of fusing multiple modalities while capturing contextual conversation information, we propose MMGCN, which constructs a graph based on all three modalities.

Graph Convolutional Network
Graph convolutional networks have been widely used in the past few years for their ability to cope with non-Euclidean data. Mainstream GCN methods can be divided into spectral-domain methods and non-spectral-domain methods (Veličković et al., 2017). Spectral-domain GCN methods (Zhang et al., 2019) are based on Laplacian spectral decomposition theory and can only deal with undirected graphs. Non-spectral-domain GCN methods (Veličković et al., 2017; Schlichtkrull et al., 2018; Li et al., 2015) can be applied to both directed and undirected graphs, but consume more computing resources. Recently, researchers have proposed methods to make spectral-domain GCNs deeper without over-smoothing (Chen et al., 2020). In order to further improve MMGCN on ERC, we encode the multimodal graph using a spectral-domain GCN with deep layers.

Method
A dialogue can be defined as a sequence of utterances $\{u_1, u_2, ..., u_N\}$, where N is the number of utterances. Each utterance involves three sources of utterance-aligned data corresponding to three modalities, namely the acoustic (a), visual (v) and textual (t) modalities, which can be represented as follows:

$u_i = \{u_i^a, u_i^v, u_i^t\}$

where $u_i^a$, $u_i^v$, $u_i^t$ denote the raw feature representations of $u_i$ from the acoustic, visual and textual modality, respectively. The emotion recognition in conversation task aims to predict the emotion label of each utterance $u_i$ in the conversation based on the available information from all three modalities. Figure 2 illustrates the overall framework of our proposed emotion recognition in conversation system, which consists of three key modules: the Modality Encoder, the Multimodal Fused Graph Convolutional Network (MMGCN), and the Emotion Classifier.

Modality Encoder
As mentioned above, the dialogue context information is important for predicting the emotion label of each utterance. Therefore, it is beneficial to encode the contextual information into the utterance feature representation. We generate the context-aware utterance feature encoding for each modality through the corresponding modality encoder. To be specific, we apply a bidirectional Long Short-Term Memory (LSTM) network to encode the sequential context information for the textual modality. For the acoustic and visual modalities, we apply a fully connected network. The context-aware feature encoding for each utterance can be formulated as follows:

$h_i^t = \mathrm{BiLSTM}(u_i^t), \quad h_i^a = W_a u_i^a + b_a, \quad h_i^v = W_v u_i^v + b_v$

where $u_i^a$, $u_i^v$, $u_i^t$ are the context-independent raw feature representations of utterance i from the acoustic, visual and textual modalities, respectively. The modality encoder outputs the context-aware feature encodings $h_i^a$, $h_i^v$ and $h_i^t$ accordingly.
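The modality encoders above can be sketched as follows. This is a minimal NumPy illustration in which the BiLSTM is replaced by a toy vanilla bidirectional RNN, and all dimensions (D, H) are assumed values rather than the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 8, 16  # assumed raw-feature and hidden sizes

def fc_encoder(x, W, b):
    """Fully connected encoder, as used for acoustic/visual features."""
    return np.tanh(W @ x + b)

def bidirectional_rnn(xs, Wf, Wb, U):
    """Toy stand-in for the paper's BiLSTM: a vanilla bidirectional RNN
    producing one context-aware vector per utterance."""
    def run(seq, W):
        h, out = np.zeros(H), []
        for x in seq:
            h = np.tanh(W @ np.concatenate([x, h]))  # recurrent update
            out.append(h)
        return out
    fwd = run(xs, Wf)                 # left-to-right pass
    bwd = run(xs[::-1], Wb)[::-1]     # right-to-left pass, realigned
    return [U @ np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

xs = [rng.standard_normal(D) for _ in range(4)]  # 4 textual utterance features
Wf = rng.standard_normal((H, D + H)) * 0.1
Wb = rng.standard_normal((H, D + H)) * 0.1
U = rng.standard_normal((H, 2 * H)) * 0.1
h_t = bidirectional_rnn(xs, Wf, Wb, U)  # context-aware textual encodings
```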

Multimodal fused GCN (MMGCN)
In order to capture the utterance-level contextual dependencies across multiple modalities, we propose a Multimodal fused Graph Convolutional Network (MMGCN). Inspired by (Chen et al., 2020), we construct a spectral-domain graph convolutional network to encode the multimodal contextual information, and we stack more layers to build a deep GCN. Furthermore, we add learned speaker embeddings to encode the speaker-level contextual information.

Speaker Embedding
As mentioned above, speaker information is important for ERC. In order to encode the speaker identity, we add speaker embeddings to the features before constructing the graph. Assuming there are M parties in a dialogue, the size of the speaker-embedding vocabulary is M. We show a two-speaker conversation case in Figure 2. The original speaker identity can be denoted as a one-hot vector $s_i$, and the speaker embedding $S_i$ is calculated as follows:

$S_i = s_i W_{\mathrm{spk}}$

where $W_{\mathrm{spk}} \in \mathbb{R}^{M \times D}$ is a trainable embedding matrix. The speaker embedding can then be leveraged to attach speaker information during graph construction.
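A minimal sketch of the speaker-embedding lookup: a one-hot speaker index selects a row of a trainable table (here a fixed random matrix, with an assumed embedding size EMB):

```python
import numpy as np

M, EMB = 2, 8  # number of speakers; assumed embedding size
rng = np.random.default_rng(2)
speaker_table = rng.standard_normal((M, EMB))  # learnable W_spk in the real model

def speaker_embedding(speaker_id):
    """One-hot vector s_i times the embedding table: S_i = s_i @ W_spk."""
    s = np.zeros(M)
    s[speaker_id] = 1.0  # one-hot speaker identity
    return s @ speaker_table

S0 = speaker_embedding(0)  # embedding of the first speaker
```

Multiplying a one-hot vector by the table is equivalent to a row lookup, which is how embedding layers implement it in practice.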

Graph Construction
A dialogue with N utterances can be represented as an undirected graph G = (V, E), where V (|V| = 3N) denotes the utterance nodes in three modalities and E ⊂ V × V is a set of relationships containing context, speaker and modality dependencies. We construct the graph as follows:

Nodes: Each utterance $u_i$ is represented by three nodes $v_i^a$, $v_i^v$ and $v_i^t$, respectively corresponding to the three modalities. Thus, given a dialogue with N utterances, we construct a graph with 3N nodes.

Edges: We assume that each utterance has some connection to every other utterance in the same dialogue. Therefore, any two nodes of the same modality in the same dialogue are connected in the graph. Furthermore, each node is connected with the nodes that correspond to the same utterance but come from different modalities. For example, $v_i^a$ is connected with $v_i^v$ and $v_i^t$ in the graph.

Edge Weighting: We assume that if two nodes have higher similarity, the information interaction between them is more important, and the edge weight between them should be higher. In order to capture the similarities between node representations, following (Skianis et al., 2018), we use the angular similarity to represent the edge weight between two nodes.
There are two types of edges in the graph: 1) edges connecting nodes from the same modality, and 2) edges connecting nodes from different modalities. To differentiate them, we use different edge weighting strategies. For the first type of edges, the edge weight is computed as:

$\mathcal{A}_{ij} = 1 - \frac{\arccos\left(\mathrm{sim}(n_i, n_j)\right)}{\pi}$

where $n_i$ and $n_j$ denote the feature representations of the i-th and j-th nodes in the graph and $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity. For the second type of edges, the edge weight is computed as:

$\mathcal{A}_{ij} = \gamma \left(1 - \frac{\arccos\left(\mathrm{sim}(n_i, n_j)\right)}{\pi}\right)$

where γ is a hyper-parameter.
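The edge weighting described above can be sketched as follows. Using cosine similarity inside the arccos, and scaling cross-modal edges by γ, are assumptions consistent with the surrounding text, since the exact formulas did not survive extraction:

```python
import numpy as np

GAMMA = 0.7  # cross-modal scaling hyper-parameter (value from the implementation details)

def angular_sim(ni, nj):
    """Angular similarity: 1 - arccos(cosine_similarity) / pi, in [0, 1]."""
    cos = ni @ nj / (np.linalg.norm(ni) * np.linalg.norm(nj))
    return 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def edge_weight(ni, nj, same_modality):
    """Same-modality edges get the raw angular similarity;
    cross-modal edges are down-weighted by GAMMA."""
    w = angular_sim(ni, nj)
    return w if same_modality else GAMMA * w

v = np.array([1.0, 0.0])
w_same = edge_weight(v, v, True)                       # identical vectors -> 1.0
w_cross = edge_weight(v, np.array([0.0, 1.0]), False)  # orthogonal, cross-modal
```

Note the clipping of the cosine value: floating-point round-off can push it marginally outside [-1, 1], which would make `arccos` return NaN.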
The graph convolution is performed on the normalized adjacency matrix

$\hat{P} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$

where A denotes the adjacency matrix, $\tilde{D}$ denotes the diagonal degree matrix of graph G with self-loops, and I denotes the identity matrix. The iteration of the GCN across layers can be formulated as:

$H^{(l+1)} = \sigma\left(\left((1-\alpha)\hat{P}H^{(l)} + \alpha H^{(0)}\right)\left((1-\beta^{(l)})I + \beta^{(l)}W^{(l)}\right)\right)$

where α and $\beta^{(l)}$ are two hyper-parameters, σ denotes the activation function and $W^{(l)}$ is a learnable weight matrix. To ensure that the decay of the weight matrix adaptively increases when stacking more layers, we set $\beta^{(l)} = \log(\frac{\eta}{l} + 1)$, where η is also a hyper-parameter. A residual connection to the first layer $H^{(0)}$ is added to the representation $\hat{P}H^{(l)}$, and an identity mapping I is added to the weight matrix $W^{(l)}$. With such residual connections, we can make MMGCN deeper to further improve performance.
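The layer iteration above can be sketched as a GCNII-style update. α and η take the values reported in the implementation details, while the toy graph, feature sizes, and random weights here are illustrative assumptions:

```python
import numpy as np

ALPHA, ETA = 0.1, 0.5  # hyper-parameter values from the implementation details

def normalize_adj(A):
    """P_hat = D~^{-1/2} (A + I) D~^{-1/2} for an undirected weighted graph."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    d = A_tilde.sum(axis=1)                   # degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(H_l, H_0, P_hat, W_l, layer):
    """One deep-GCN layer with initial residual (alpha * H_0)
    and identity mapping on the weight matrix (GCNII-style)."""
    beta = np.log(ETA / layer + 1.0)          # beta^(l) = log(eta / l + 1)
    support = (1 - ALPHA) * (P_hat @ H_l) + ALPHA * H_0
    mapped = support @ ((1 - beta) * np.eye(W_l.shape[0]) + beta * W_l)
    return np.maximum(0.0, mapped)            # ReLU activation

rng = np.random.default_rng(3)
n, d = 6, 4                                   # toy graph: 6 nodes, 4-dim features
A = np.abs(rng.standard_normal((n, n)))
A = (A + A.T) / 2                             # symmetric (undirected) weights
P_hat = normalize_adj(A)
H0 = rng.standard_normal((n, d))
H = H0
for l in range(1, 5):                         # 4 layers, as in the experiments
    H = gcn_layer(H, H0, P_hat, rng.standard_normal((d, d)) * 0.1, l)
```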

Emotion Classifier
As described in Sec. 3.2.2, we initialize each graph node with the combination of the utterance feature and the speaker embedding, $h_i$.
Let $g_i^a$, $g_i^v$ and $g_i^t$ be the features of the different modalities encoded by the GCN. The features corresponding to the same utterance are concatenated:

$g_i = [g_i^a; g_i^v; g_i^t]$

We then concatenate $g_i$ and $h_i$ to generate the final feature representation for each utterance:

$e_i = [g_i; h_i]$

$e_i$ is then fed into an MLP with fully connected layers to predict the emotion label $\hat{y}_i$ for the utterance:

$\hat{y}_i = \arg\max_k \, \mathrm{softmax}(W e_i + b)[k]$
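A minimal sketch of the classifier: concatenate the GCN outputs of the three modalities with the input node feature and apply a small MLP with a softmax output. The hidden size, class count, and weights are assumed values:

```python
import numpy as np

rng = np.random.default_rng(5)
D, C = 4, 6  # assumed per-vector feature size and number of emotion classes

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def classify(g_a, g_v, g_t, h_i, W1, W2):
    """Concatenate GCN outputs with the input node feature, then a 2-layer MLP."""
    g_i = np.concatenate([g_a, g_v, g_t])  # g_i = [g_a; g_v; g_t]
    e_i = np.concatenate([g_i, h_i])       # e_i = [g_i; h_i]
    return softmax(W2 @ np.maximum(0.0, W1 @ e_i))

g_a, g_v, g_t, h_i = (rng.standard_normal(D) for _ in range(4))
W1 = rng.standard_normal((8, 4 * D)) * 0.1
W2 = rng.standard_normal((C, 8)) * 0.1
y_hat = classify(g_a, g_v, g_t, h_i, W1, W2)  # predicted class distribution
```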

Training Objectives
We use categorical cross-entropy along with L2-regularization as the loss function during training:

$L = -\frac{1}{\sum_{i=1}^{N} c(i)} \sum_{i=1}^{N} \sum_{j=1}^{c(i)} \log P_{i,j}[y_{i,j}] + \lambda \lVert \theta \rVert_2$

where N is the number of dialogues, c(i) is the number of utterances in dialogue i, $P_{i,j}$ is the predicted probability distribution over emotion labels for utterance j of dialogue i, $y_{i,j}$ is the expected class label of utterance j in dialogue i, λ is the L2-regularization weight, and θ is the set of all trainable parameters. We use the stochastic-gradient-descent-based Adam (Kingma and Ba, 2014) optimizer to train the network. Hyper-parameters are optimized using grid search.
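The loss above can be sketched as follows: the negative log-likelihood of the gold label, averaged over all utterances of all dialogues, plus an L2 term. Averaging over the total utterance count is an assumption consistent with the formula:

```python
import numpy as np

def erc_loss(probs, labels, theta, lam):
    """Categorical cross-entropy over all utterances of all dialogues, plus L2.
    probs: list (one per dialogue) of (c_i, n_classes) predicted distributions.
    labels: list of per-dialogue gold label index arrays.
    theta: list of parameter arrays; lam: L2-regularization weight."""
    total_utts = sum(len(y) for y in labels)
    nll = 0.0
    for P, y in zip(probs, labels):
        # pick out the probability assigned to each gold label
        nll -= np.log(P[np.arange(len(y)), y]).sum()
    l2 = lam * sum((w ** 2).sum() for w in theta)
    return nll / total_utts + l2

# One dialogue, two utterances, three classes:
P = [np.array([[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1]])]
y = [np.array([0, 1])]
loss = erc_loss(P, y, theta=[np.ones((2, 2))], lam=0.0)
# loss = -(ln 0.7 + ln 0.8) / 2
```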

Dataset
We evaluate our proposed MMGCN model on two benchmark datasets, IEMOCAP (Busso et al., 2008) and MELD. Both are multimodal datasets with aligned acoustic, visual and textual information for each utterance in a conversation. We follow the experimental setup of (Ghosal et al., 2019). Table 1 shows the distribution of train and test samples for both datasets.

IEMOCAP: The dataset contains 12 hours of videos of two-way conversations from ten unique speakers, where only the first eight speakers from sessions one to four are used in the training set. Each video contains a single dyadic dialogue, segmented into utterances. There are 7433 utterances and 151 dialogues in total. Each utterance in a dialogue is annotated with an emotion label from six classes: happy, sad, neutral, angry, excited and frustrated.

MELD: The Multimodal EmotionLines Dataset (MELD) is a multimodal, multi-speaker conversation dataset. Compared to the EmotionLines dataset (Chen et al., 2018), MELD provides modality-aligned conversation data of higher quality. There are 13708 utterances, 1433 conversations and 304 different speakers in total. Unlike dyadic conversation datasets such as IEMOCAP, MELD has three or more speakers in a conversation. Each utterance in a dialogue is annotated with an emotion label from seven classes: anger, disgust, fear, joy, neutral, sadness and surprise.

Utterance-level Raw Feature Extraction
The textual raw features are extracted using TextCNN, following (Hazarika et al., 2018a). The acoustic raw features are extracted using the OpenSMILE toolkit with the IS10 configuration (Schuller et al., 2011). The visual facial expression features are extracted using a DenseNet (Huang et al., 2015) pre-trained on the Facial Expression Recognition Plus (FER+) corpus (Barsoum et al., 2016).

Implementation Details
The hyper-parameters are set as follows: the number of GCN layers is 4 for both IEMOCAP and MELD; the dropout rate is 0.4; the learning rate is 0.0003; the L2-regularization weight is 0.00003; α, η and γ are set to 0.1, 0.5 and 0.7, respectively. Considering the class imbalance in MELD, we use the focal loss when training MMGCN on MELD. In addition, we add layer normalization after the speaker embedding.
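For the class-imbalance handling mentioned above, here is a minimal sketch of the focal loss on the true-class probability. The focusing parameter (2.0 here, a common default) is not specified in this section and is distinct from the edge-weight hyper-parameter γ:

```python
import numpy as np

def focal_loss(p_true, gamma=2.0):
    """Focal loss on the probability assigned to the true class:
    FL = -(1 - p)^gamma * log(p).
    The (1 - p)^gamma factor down-weights easy examples (high p),
    so rare/hard classes dominate the gradient."""
    p_true = np.asarray(p_true)
    return -((1.0 - p_true) ** gamma) * np.log(p_true)

easy = focal_loss(0.9)  # confident, correct -> tiny loss
hard = focal_loss(0.1)  # confidently wrong -> large loss
```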

Evaluation Metrics and Significance Test
Following previous works (Hazarika et al., 2018a; Ghosal et al., 2019), we use the weighted-average F1-score as the evaluation metric. A paired t-test is performed to test the significance of performance improvements, with a default significance level of 0.05.
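The weighted-average F1 metric can be sketched as follows: the per-class F1 scores are averaged with each class weighted by its support (number of true instances):

```python
def weighted_f1(y_true, y_pred):
    """Weighted-average F1: per-class F1 weighted by the class's support."""
    classes = sorted(set(y_true))
    total = len(y_true)
    score = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        support = sum(1 for t in y_true if t == c)
        score += (support / total) * f1
    return score
```

This matches the behavior of `f1_score(..., average='weighted')` in scikit-learn, which is what ERC papers typically report.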

Compared Baselines
In order to verify the effectiveness of our model, we implement and compare the following models for emotion recognition in conversation.

BC-LSTM (Poria et al., 2017): it encodes contextual information through a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) network. The context-aware features are then used for emotion classification. BC-LSTM ignores speaker information, as it does not attach any speaker-related information to the model.

CMN (Hazarika et al., 2018b): it leverages speaker-dependent GRUs to model utterance context by combining dialogue history information. The utterance features with contextual information are fed to two distinct memory networks, one per speaker. Due to the fixed number of memory network blocks, CMN can only handle dyadic conversation scenarios.

ICON (Hazarika et al., 2018a): it extends CMN to model distinct speakers respectively. As in CMN, two speaker-dependent GRUs are leveraged. In addition, a global GRU is used to track the change of emotional status over the entire conversation, and multi-layer memory networks are leveraged to model the global emotional status. Although ICON improves ERC results, it still cannot adapt to multi-speaker scenarios.
DialogueRNN: it models speakers and sequential information in dialogues through three different GRUs: a Global GRU, a Speaker GRU and an Emotion GRU. Specifically, the Global GRU models context information, while the speaker-dependent GRU models the status of each speaker; the two modules update interactively. The Emotion GRU detects the emotion of utterances in the conversation. In the multimodal setting, the concatenation of acoustic, visual and textual features is used when the speaker talks, while only visual features are used otherwise. However, DialogueRNN does not improve much in the multimodal setting.

DialogueGCN (Ghosal et al., 2019): it applies a GCN to ERC, in which the generated features can integrate rich information. Specifically, utterance-level features encoded by a Bi-LSTM are used to initialize the nodes of the graph, and edges are constructed within a certain window, so that utterances in the same dialogue but at a long distance can be connected directly. Relational GCN (Schlichtkrull et al., 2018) and GNN (Morris et al., 2019), which are both non-spectral-domain GCN models, are leveraged to encode the graph. However, DialogueGCN only focuses on the textual modality. In order to compare with MMGCN under the multimodal setting, we extend DialogueGCN by simply concatenating the features of the three modalities.

Results and Discussions
We compare our proposed MMGCN with all the baseline models presented in Section 4.5 on the IEMOCAP and MELD datasets under the multimodal setting. In order to compare results under the same experimental settings, we reimplement the baseline models. Table 2 shows the performance comparison of MMGCN with the other models on the two benchmark datasets under the multimodal setting. DialogueGCN was the best-performing model when using only the textual modality. Under the multimodal setting, DialogueGCN fed with the concatenation of acoustic, visual and textual features achieves a slight improvement over the single textual modality. Our proposed MMGCN improves the F1-score over DialogueGCN under the multimodal setting by an absolute 1.18% on IEMOCAP and 0.42% on MELD on average, and the improvement is significant with p-value < 0.05. Table 3 shows the performance of MMGCN under different modality settings on both benchmark datasets. From Table 3 we can see that the best single-modality performance is achieved on the textual modality and the worst on the visual modality, which is consistent with previously reported findings. Adding the acoustic and visual modalities brings additional performance improvement over the textual modality alone.

Comparison with other fusion methods
To verify the effectiveness of MMGCN in multimodal fusion, we compare it with other multimodal fusion methods, including early fusion, late fusion, fusion through gated attention, and other representative fusion methods such as MFN (Zadeh et al., 2018) and MulT (Tsai et al., 2019). The first three fusion methods are illustrated in Figure 3, where $m_j$ and $m_k$ can be any modality among {a, v, t}, $h_i^{m_j}$ and $h_i^{m_k}$ represent the features encoded by the corresponding modality encoders, and $e_i$ represents the final feature representation for the i-th utterance. Since MFN and MulT fuse multimodal information sequentially, they are used to replace the Modality Encoder; the fused multimodal features are subsequently fed to the GCN module.
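Early and late fusion, as contrasted above, can be sketched as follows; the dimensions and random weights are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
h_a, h_v, h_t = (rng.standard_normal(4) for _ in range(3))  # per-modality features

# Early fusion: concatenate modality features before any classifier.
early = np.concatenate([h_a, h_v, h_t])

# Late fusion: run a separate classification head per modality,
# then combine (here: average) the class distributions.
def head(W, h):
    z = W @ h
    e = np.exp(z - z.max())
    return e / e.sum()  # softmax over 3 toy classes

Ws = [rng.standard_normal((3, 4)) for _ in range(3)]
late = sum(head(W, h) for W, h in zip(Ws, (h_a, h_v, h_t))) / 3.0
```

Early fusion lets a single model see cross-modal interactions but cannot weight modalities per decision; late fusion keeps modalities independent until the very end, which is what MMGCN's cross-modal graph edges are designed to improve on.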

MMGCN with different layers
We investigate the impact of the number of GCN layers in MMGCN on the ERC performance.

Impact of Speaker Embedding
Speaker Embedding can differentiate input features from different speakers. Previous works have reported that speaker information can help improve emotion recognition performance. We conduct the ablation study to verify the contribution of speaker embedding in MMGCN as shown in Table 6. As expected, dropping speaker embedding in MMGCN leads to performance degradation, which is significant by t-test with p<0.05.

Case Study
Figure 4 depicts a scene in which a man and a woman quarrel over a female friend of the man who traveled 700 miles to meet him. They are frustrated or angry in most cases. At the beginning of the conversation, both of their emotional states are neutral. Over time, they become emotional, and both are angry at the end of the conversation. The heatmaps of the adjacency matrix for the 20th utterance in the conversation, from the three modalities, demonstrate that, unlike simple sequential models, MMGCN pays attention not only to the close context but also relates to long-distance context. For example, as shown in the textual heatmap, MMGCN successfully aggregates information from the most relevant utterances, even long-distance ones such as the 3rd utterance.

Conclusion
In this paper, we propose a multimodal fused graph convolutional network (MMGCN) for multimodal emotion recognition in conversation (ERC). MMGCN provides a more effective way of utilizing both multimodal and long-distance contextual information. It constructs a graph that captures not only intra- and inter-speaker context dependency but also inter-modality dependency. With residual connections, MMGCN can employ deep layers to further improve recognition performance. We carry out experiments on two public benchmark datasets, IEMOCAP and MELD, and the results prove the effectiveness of MMGCN, which outperforms other state-of-the-art methods by a significant margin under the multimodal conversation setting.