Conversation Understanding using Relational Temporal Graph Neural Networks with Auxiliary Cross-Modality Interaction

Emotion recognition is a crucial task for human conversation understanding. It becomes more challenging with the notion of multimodal data, e.g., language, voice, and facial expressions. As a typical solution, global and local context information is exploited to predict the emotional label for every single sentence, i.e., utterance, in the dialogue. Specifically, the global representation can be captured via modeling cross-modal interactions at the conversation level, while the local one is often inferred using the temporal information of speakers or emotional shifts, which neglects vital factors at the utterance level. Additionally, most existing approaches take fused features of multiple modalities as a unified input without leveraging modality-specific representations. Motivated by these problems, we propose the Relational Temporal Graph Neural Network with Auxiliary Cross-Modality Interaction (CORECT), a novel neural network framework that effectively captures conversation-level cross-modality interactions and utterance-level temporal dependencies in a modality-specific manner for conversation understanding. Extensive experiments demonstrate the effectiveness of CORECT via its state-of-the-art results on the IEMOCAP and CMU-MOSEI datasets for the multimodal ERC task.


Introduction
Our social interactions and relationships are all influenced by emotions. Given the transcript of a conversation and speaker information for each constituent utterance, the task of Emotion Recognition in Conversations (ERC) aims to identify the emotion expressed in each utterance from a predefined set of emotions (Poria et al., 2019). The multimodal nature of human communication, which involves verbal/textual, facial expressions, vocal/acoustic, bodily/postural, and symbolic/pictorial expressions, adds complexity to the ERC task (Wang et al., 2022).
Multimodal ERC, which aims to automatically detect a speaker's emotional state during a conversation using information from text content, facial expressions, and audio signals, has garnered significant attention in recent years and has been applied to many real-world scenarios (Sharma and Dhall, 2021; Joshi et al., 2022).
Numerous methods have been developed to model a conversation's context. These approaches can be categorized into two main groups: graph-based methods (Ghosal et al., 2019; Zhang et al., 2019; Shen et al., 2021b) and recurrence-based methods (Hazarika et al., 2018a; Ghosal et al., 2020; Majumder et al., 2019; Hu et al., 2021). In addition, there have been advancements in multimodal models that leverage the dependencies and complementarities of multiple modalities to improve ERC performance (Poria et al., 2017; Hazarika et al., 2018b; Zadeh et al., 2018). One limitation of these methods is their heavy reliance on nearby utterances when updating the state of the query utterance, which can restrict their overall performance. Recently, Graph Neural Network (GNN)-based methods have been proposed for the multimodal ERC task due to their ability to capture long-distance contextual information through relational modeling. However, those models treat fused inputs as single nodes in the graph (Ghosal et al., 2019; Joshi et al., 2022), which limits their ability to capture modality-specific representations and ultimately hampers their overall performance.
The temporal aspect of conversations is crucial, as past and future utterances can significantly influence the query utterance, as illustrated in Figure 1. The sentence "I know me neither" appears with opposing labels in different dialogues, which could be caused by sequential effects from previous or future steps. Only a few methods take the temporal aspect of conversations into account. MMGCN (Wei et al., 2019) represents modality-specific features as graph nodes but overlooks the temporal factor. DAG-ERC (Shen et al., 2021b) incorporates temporal information but focuses solely on the text modality. Recently, COGMEN (Joshi et al., 2022) proposed to learn contextual, inter-speaker, and intra-speaker relations, but it neglects modality-specific features and only partially utilizes cross-modal information by fusing all modalities' representations at the input stage.
The aforementioned limitations motivate us to propose a COnversation understanding model using RElational Temporal Graph Neural Network with Auxiliary Cross-Modality Interaction (CORECT). It comprises two key components: (i) the Relational Temporal Graph Convolutional Network (RT-GCN) and (ii) the Pairwise Cross-modal Feature Interaction (P-CM). The RT-GCN module is based on RGCNs (Schlichtkrull et al., 2018) and the Graph Transformer (Yun et al., 2019), while P-CM is built upon (Tsai et al., 2019). Overall, our main contributions are as follows: • We propose the CORECT framework for multimodal ERC, which concurrently exploits the utterance-level local context features from multimodal interactions with temporal dependencies via RT-GCN, and the cross-modal global context features at the conversation level via P-CM. These features are aggregated to enhance utterance-level emotion recognition.
• We conduct extensive experiments to show that CORECT consistently outperforms the previous SOTA baselines on two public real-life datasets, IEMOCAP and CMU-MOSEI, for the multimodal ERC task.
• We conduct ablation studies to investigate the effect of various components and modalities on CORECT for conversation understanding.

Related Works
This section presents a literature review on multimodal Emotion Recognition in Conversations (ERC) and the application of Graph Neural Networks to ERC.

Multimodal Emotion Recognition in Conversation
The complexity of conversations, with multiple speakers, dynamic interactions, and contextual dependencies, presents challenges for the ERC task.
Multimodal machine learning has gained popularity due to its ability to address the limitations of unimodal approaches in capturing complex real-world phenomena (Baltrušaitis et al., 2018). It is recognized that human perception and understanding are influenced by the integration of multiple sensory inputs. Several notable approaches aim to harness the power of multiple modalities in various applications (Poria et al., 2017; Zadeh et al., 2018; Majumder et al., 2019). CMN (Hazarika et al., 2018b) combines features from different modalities by direct concatenation and utilizes a Gated Recurrent Unit (GRU) to model contextual information. ICON (Hazarika et al., 2018a) extracts multimodal conversation features and employs global memories to model emotional influences hierarchically, resulting in improved performance for utterance-video emotion recognition. ConGCN (Zhang et al., 2019) models utterances and speakers as nodes in a graph, capturing context dependencies and speaker dependencies as edges. However, ConGCN considers only textual and acoustic features and no other modalities. MMGCN (Wei et al., 2019), on the other hand, is a graph convolutional network (GCN)-based model that effectively captures both long-distance contextual information and multimodal interactive information.
More recently, Lian et al. (2022) propose a novel framework that combines semi-supervised learning with multimodal interactions. However, it currently addresses only two modalities, i.e., text and audio, with visual information reserved for future work. Shi and Huang (2023) introduce MultiEMO, an attention-based multimodal fusion framework that effectively integrates information from the textual, audio, and visual modalities. However, neither of these models addresses the temporal aspect of conversations.

Graph Neural Networks
In the past few years, there has been growing interest in representing non-Euclidean data as graphs. However, the complexity of graph data has presented challenges for traditional neural network models. Building on initial research on graph neural networks (GNNs) (Gori et al., 2005; Scarselli et al., 2008), much attention has been paid to generalizing the operations of deep neural networks, such as convolution (Kipf and Welling, 2017), recurrence (Nicolicioiu et al., 2019), and attention (Velickovic et al., 2018), to graph structures. When faced with intricate interdependencies between modalities, GNNs offer an efficient approach to exploiting the potential of multimodal datasets. Their strength lies in their ability to capture and model intra-modal and inter-modal interactions. This flexibility makes them an appealing choice for multimodal learning tasks.
There have been extensive studies using the capability of GNNs to model conversations. DialogueGCN (Ghosal et al., 2019) models a conversation as a directed graph with utterances as nodes and dependencies as edges, fitting it into a GCN structure. MMGCN (Wei et al., 2019) adopts an undirected graph to effectively fuse multimodal information and capture long-distance contextual and inter-modal interactions. Lian et al. (2020) proposed a GNN-based architecture for ERC that utilizes both text and speech modalities. DialogueCRN (Hu et al., 2021) incorporates multi-turn reasoning modules to extract and integrate emotional clues, enabling a comprehensive understanding of the conversational context from a cognitive perspective. MTAG (Yang et al., 2021) is capable of both fusion and alignment of asynchronously distributed multimodal sequential data. COGMEN (Joshi et al., 2022) uses a GNN-based architecture to model complex dependencies, including local and global information in a conversation. Chen et al. (2023) present the Multivariate Multi-frequency Multimodal Graph Neural Network, M3Net for short, to explore the relationships between modalities and context. However, it primarily focuses on modality-level interactions and does not consider the temporal aspect within the graph.

Methodology
Figure 2 illustrates the architecture of CORECT for tackling the multimodal ERC task. It consists of two main components, namely the Relational Temporal Graph Convolutional Network (RT-GCN) and the Pairwise Cross-modal Feature Interaction (P-CM) module. For a given utterance in a dialogue, the former learns the local-context representation by leveraging various topological relations between utterances and modalities, while the latter infers the cross-modal global-context representation from the whole dialogue.
Given a multi-speaker conversation $C$ consisting of $N$ utterances $[u_1, u_2, \ldots, u_N]$, let $S$ denote the respective set of speakers. Each utterance $u_i$ is associated with three modalities, namely audio (a), visual (v), and textual (l), represented as $u_i^a$, $u_i^v$, $u_i^l$ respectively. Using local- and global-context representations, the ERC task aims to predict the label for every utterance $u_i \in C$ from a set of $M$ predefined emotion labels $Y = [y_1, y_2, \ldots, y_M]$.

Utterance-level Feature Extraction
Here, we perform pre-processing procedures to extract utterance-level features to facilitate the learning of CORECT in the next section.

Unimodal Encoder
Given an utterance $u_i$, each data modality manifests a distinct view of its nature. To capture this, we employ dedicated unimodal encoders that generate utterance-level features $x_i^a \in \mathbb{R}^{d_a}$, $x_i^v \in \mathbb{R}^{d_v}$, and $x_i^l \in \mathbb{R}^{d_l}$ for the acoustic, visual, and lexical modalities respectively, where $d_a$, $d_v$, $d_l$ are the dimensions of the extracted features for each modality.
For the textual modality, we utilize a Transformer (Vaswani et al., 2017) as the unimodal encoder to extract the semantic feature $x_i^l$ from $u_i^l$ as follows:

$$x_i^l = \mathrm{Transformer}(u_i^l, W_{trans}^l),$$

where $W_{trans}^l$ denotes the Transformer parameters to be learned. For the acoustic and visual modalities, we employ a fully connected network as the unimodal encoder to extract context features for each modality type:

$$x_i^\tau = \mathrm{FC}(u_i^\tau, W_{fc}^\tau), \quad \tau \in \{a, v\},$$

where FC is the fully connected network, $W_{fc}^\tau \in \mathbb{R}^{d_\tau \times d_\tau^{in}}$ are trainable parameters, and $d_\tau^{in}$ is the input dimension of modality $\tau$.
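As a minimal sketch, the acoustic/visual encoder can be viewed as a single projection of the raw utterance features; the ReLU nonlinearity and bias below are illustrative assumptions, since the text only specifies a fully connected network:

```python
import numpy as np

rng = np.random.default_rng(0)

def fc_encoder(u, W, b):
    # Fully connected unimodal encoder; the ReLU and bias are
    # assumptions for illustration, not specified in the text.
    return np.maximum(0.0, u @ W + b)

d_a_in, d_a = 100, 64                       # e.g. IEMOCAP audio: 100-dim raw features
u_a = rng.normal(size=(1, d_a_in))          # one utterance's raw acoustic features
W_a = rng.normal(size=(d_a_in, d_a)) * 0.01
x_a = fc_encoder(u_a, W_a, np.zeros(d_a))
print(x_a.shape)  # (1, 64)
```

In practice the weights would be trained jointly with the rest of the model; the sketch only shows the shape contract $u_i^\tau \in \mathbb{R}^{d_\tau^{in}} \mapsto x_i^\tau \in \mathbb{R}^{d_\tau}$.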

Speaker Embedding
Inspired by MMGCN (Wei et al., 2019), we leverage the significance of speaker information. Let Embedding be a procedure that takes the identities of speakers and produces the respective latent representations. The embedding of multiple speakers can be inferred as:

$$S_{emb} = \mathrm{Embedding}(S, N_S),$$

where $S_{emb} \in \mathbb{R}^{N \times N_S}$ and $N_S$ is the total number of participants in the conversation. The extracted utterance-level features can be enhanced by adding the corresponding speaker embedding:

$$\hat{X}^\tau = X^\tau + \eta S_{emb},$$

where $X^\tau \in \mathbb{R}^{N \times d_\tau}$ refers to the representation of the whole dialogue obtained from the respective unimodal encoder; $\hat{X}^\tau$ represents the enhanced representation with the inclusion of the speaker embedding; and $\eta \in [0, 1]$ indicates the contribution ratio.
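A hedged sketch of this speaker-enhancement step, using a hypothetical lookup-table instantiation of the Embedding procedure (the exact form of Embedding is not fixed by the text):

```python
import numpy as np

def speaker_embedding(speaker_ids, n_speakers, d):
    # Hypothetical instantiation of Embedding: a table mapping each
    # speaker identity to a d-dim vector (learnable in practice; the
    # random table here is purely for illustration).
    rng = np.random.default_rng(0)
    table = rng.normal(size=(n_speakers, d)) * 0.1
    return table[np.asarray(speaker_ids)]          # (N, d)

N, d, eta = 5, 8, 0.5
X_tau = np.ones((N, d))                    # utterance features from a unimodal encoder
S_emb = speaker_embedding([0, 1, 0, 1, 1], n_speakers=2, d=d)
X_enhanced = X_tau + eta * S_emb           # X^tau <- X^tau + eta * S_emb
print(X_enhanced.shape)  # (5, 8)
```

Utterances by the same speaker receive the same additive offset, which is how speaker identity is injected into the utterance-level features.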

Relational Temporal Graph Convolutional Network (RT-GCN)
RT-GCN is proposed to capture local context information for each utterance in the conversation by exploiting the multimodal graph built over utterances and their modalities.

Multimodal Graph Construction
Let $\mathcal{G}(\mathcal{V}, \mathcal{R}, \mathcal{E})$ denote the multimodal graph built from conversations, where $\mathcal{V}$, $\mathcal{E}$, and $\mathcal{R}$ refer to the set of utterance nodes over the three modality types ($|\mathcal{V}| = 3 \times N$), the set of edges, and their relation types, respectively. Figure 3 provides an illustrative example of the relations represented in the constructed graph.
Nodes. Each utterance $u_i$ generates three nodes $u_i^a$, $u_i^v$, and $u_i^l$, whose initial features are the respective audio, visual, and lexical feature vectors $x_i^a$, $x_i^v$, and $x_i^l$.
Edges. An edge $(u_i^{\tau_1}, u_j^{\tau_2}, r_{ij}) \in \mathcal{E}$, with $\tau_1, \tau_2 \in \{a, v, l\}$, represents the interaction between $u_i^{\tau_1}$ and $u_j^{\tau_2}$ under relation type $r_{ij} \in \mathcal{R}$. In the scope of this paper, we consider two groups of relations: $\mathcal{R}_{multi}$ and $\mathcal{R}_{temp}$. Specifically, $\mathcal{R}_{multi}$ represents the intra-utterance connections between the three modalities of the same utterance, reflecting multimodal interactions. On the other hand, $\mathcal{R}_{temp}$ captures the inter-utterance connections between utterances of the same modality within a specified time window. This temporal relationship distinguishes past/previous utterances, denoted as $\mathcal{P}$, from next/future utterances, denoted as $\mathcal{F}$. As a result, 15 edge types are created from the definitions of the two groups.
Multimodal Relation. Emotions in dialogues cannot be conveyed solely through the lexical, acoustic, or visual modality in isolation; the interactions between utterances across different modalities play a crucial role. For example, given an utterance in the graph, its visual node has a different interactive magnitude with the acoustic and textual nodes. Additionally, each node has a self-connection to reinforce its own information. Therefore, we can formalize the 9 edge types of $\mathcal{R}_{multi}$ capturing the multimodal interactions within the dialogue as:

$$\mathcal{R}_{multi} = \{ (u_i^{\tau_1} \rightarrow u_i^{\tau_2}) \mid \tau_1, \tau_2 \in \{a, v, l\} \}.$$

Temporal Relation. It is vital to treat interactions between nodes that occur in different temporal orders distinctly (Poria et al., 2017). To capture this temporal aspect, we set a sliding window $[\mathcal{P}, \mathcal{F}]$ to control the number of past/previous and next/future utterances connected to the current node $u_i^\tau$. This window enables us to define the temporal context for each node and capture the relevant information from the dynamic surrounding utterances. Therefore, we have the 6 edge types of $\mathcal{R}_{temp}$ as follows:

$$\mathcal{R}_{temp} = \{ (u_j^{\tau} \xrightarrow{past} u_i^{\tau}),\; (u_j^{\tau} \xrightarrow{future} u_i^{\tau}) \mid \tau \in \{a, v, l\} \},$$

where $\tau \in \{a, v, l\}$; $i, j \in [1, N]$; and $\xrightarrow{past}$ and $\xrightarrow{future}$ indicate the past and future relations, with $i - \mathcal{P} \le j < i$ and $i < j \le i + \mathcal{F}$ respectively.
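The 15 edge types can be enumerated with a short sketch; the node naming and relation labels below are ours, chosen for illustration:

```python
def build_edges(num_utts, P, F, modalities=("a", "v", "l")):
    # Sketch of the 15 relation types: 9 multimodal edges (all ordered
    # modality pairs within one utterance, including self-loops) plus
    # 6 temporal edges (past/future per modality inside the [P, F] window).
    edges = []
    for i in range(num_utts):
        for m1 in modalities:              # R_multi: intra-utterance
            for m2 in modalities:
                edges.append(((i, m1), (i, m2), f"{m1}->{m2}"))
        for m in modalities:               # R_temp: same modality, windowed
            for j in range(max(0, i - P), i):
                edges.append(((j, m), (i, m), f"{m}-past"))
            for j in range(i + 1, min(num_utts, i + F + 1)):
                edges.append(((j, m), (i, m), f"{m}-future"))
    return edges

edges = build_edges(num_utts=4, P=2, F=1)
relation_types = {r for _, _, r in edges}
print(len(relation_types))  # 15
```

With 4 utterances this yields 36 multimodal edges (9 per utterance) and the 6 windowed temporal relation types, matching the 15-type count stated above.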

Graph Learning
With the objective of leveraging the nuances and variations of heterogeneous interactions between utterances and modalities in the multimodal graph, we employ Relational Graph Convolutional Networks (RGCN) (Schlichtkrull et al., 2018). For each relation type $r \in \mathcal{R}$, the node representation is inferred via a mapping function $f(H, W_r)$, where $W_r$ is the weight matrix. Aggregating over all 15 edge types, the final node representation can be computed as $\sum_{r \in \mathcal{R}} f(H, W_r)$. More specifically, the representation of the $i$-th utterance is inferred as follows:

$$g_i^\tau = W_0 x_i^\tau + \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_r(i)} \frac{1}{|\mathcal{N}_r(i)|} W_r x_j^\tau,$$

where $\mathcal{N}_r(i)$ is the set of node $i$'s neighbors under relation $r \in \mathcal{R}$; $W_0, W_r \in \mathbb{R}^{d_{h_1} \times d_\tau}$ are learnable parameters ($d_{h_1}$ is the dimension of the hidden layer used by RGCN); and $x_i^\tau \in \mathbb{R}^{d_\tau \times 1}$ denotes the feature vector of node $u_i^\tau$, $\tau \in \{a, v, l\}$. To extract rich representations from node features, we utilize a Graph Transformer model (Yun et al., 2019), where each layer comprises a self-attention mechanism followed by feed-forward neural networks. The self-attention mechanism allows vertices to exploit information from their neighborhoods and to capture local and global patterns in the graph. Given $g_i^\tau$, the representation of the $i$-th utterance with modality $\tau \in \{a, v, l\}$ obtained from the RGCN, its representation is transformed into:

$$\hat{g}_i^\tau = W_1 g_i^\tau + \Big\Vert_{c=1}^{C} \sum_{j \in \mathcal{N}(i)} \alpha_{i,j}^{\tau} W_2 g_j^\tau,$$

where $W_1, W_2 \in \mathbb{R}^{d_{h_2} \times d_{h_1}}$ are learned parameters ($d_{h_2}$ is the dimension of the hidden layer used by the Graph Transformer); $\mathcal{N}(i)$ is the set of nodes connected to node $i$; $\Vert$ is the concatenation over $C$ attention heads; and the attention coefficient $\alpha_{i,j}^{\tau}$ of node $j$ is calculated by the softmax activation function:

$$\alpha_{i,j}^{\tau} = \mathrm{softmax}\left( \frac{(W_3 g_i^\tau)^\top (W_4 g_j^\tau)}{\sqrt{d_\alpha}} \right),$$

where $W_3, W_4 \in \mathbb{R}^{d_\alpha \times d_{h_1}}$ are learned parameters.
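A minimal NumPy sketch of the relational aggregation above (single layer, no nonlinearity), assuming mean normalization over each relation-specific neighborhood as in Schlichtkrull et al. (2018):

```python
import numpy as np

def rgcn_layer(x, edges_by_rel, W_r, W0):
    # One relational GCN step:
    #   g_i = W0 x_i + sum_r sum_{j in N_r(i)} (1/|N_r(i)|) W_r x_j
    # x: (N, d_in); edges_by_rel: {rel: [(src, dst), ...]}.
    g = x @ W0.T                              # self-connection term
    for rel, edges in edges_by_rel.items():
        deg = np.zeros(len(x))
        for _, dst in edges:
            deg[dst] += 1                     # |N_r(i)| per target node
        for src, dst in edges:
            g[dst] += (W_r[rel] @ x[src]) / deg[dst]
    return g

x = np.array([[1., 0.], [0., 1.], [2., 2.]])
g = rgcn_layer(x, {"past": [(0, 1), (0, 2), (1, 2)]},
               {"past": np.eye(2)}, np.eye(2))
print(g)
```

With identity weights, node 2 receives the mean of its two "past" neighbors on top of its own features, which makes the per-relation normalization easy to verify by hand.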
After the aggregation throughout the whole graph, we obtain new representation vectors $G^\tau$ for each modality, where $\tau \in \{a, v, l\}$ indicates the corresponding audio, visual, or textual modality.

Pairwise Cross-modal Feature Interaction
The cross-modal heterogeneities often elevate the difficulty of analyzing human language. Exploiting cross-modality interactions may help to reveal the "unaligned" nature of and long-term dependencies across modalities. Inspired by the idea of Tsai et al. (2019), we incorporate the Pairwise Cross-modal Feature Interaction (P-CM) method into our proposed framework for conversation understanding.
A more detailed illustration of the P-CM module is presented in Appendix A.1.2. Given two modalities, e.g., audio $a$ and textual $l$, let $X_a \in \mathbb{R}^{N \times d_a}$, $X_l \in \mathbb{R}^{N \times d_l}$ denote the respective modality-sensitive representations of the whole conversation produced by the unimodal encoders. Based on the Transformer architecture (Vaswani et al., 2017), we define the Queries as $Q_a = X_a W_{Q_a}$, the Keys as $K_l = X_l W_{K_l}$, and the Values as $V_l = X_l W_{V_l}$. The enriched representation of $X_a$ after performing cross-modal attention on modality $a$ by modality $l$, referred to as $CM_{l \rightarrow a} \in \mathbb{R}^{N \times d_V}$, is computed as:

$$CM_{l \rightarrow a} = \sigma\!\left( \frac{Q_a K_l^\top}{\sqrt{d_k}} \right) V_l,$$

where $\sigma$ is the softmax function, $\sqrt{d_k}$ is a scaling factor, and $d_{(.)}$ denotes the dimensions of the queries, keys, and values respectively.
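The cross-modal attention $CM_{l \rightarrow a}$ can be sketched as single-head scaled dot-product attention; the dimensions below are arbitrary illustrative choices, and the full module uses multiple heads and learned projections:

```python
import numpy as np

def cross_modal_attention(Xa, Xl, WQ, WK, WV):
    # CM_{l->a} = softmax(Q_a K_l^T / sqrt(d_k)) V_l  (single-head sketch)
    Q, K, V = Xa @ WQ, Xl @ WK, Xl @ WV
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)         # row-wise softmax
    return attn @ V

rng = np.random.default_rng(0)
N, d_a, d_l, d_k = 4, 6, 5, 8
Xa, Xl = rng.normal(size=(N, d_a)), rng.normal(size=(N, d_l))
CM = cross_modal_attention(Xa, Xl,
                           rng.normal(size=(d_a, d_k)),
                           rng.normal(size=(d_l, d_k)),
                           rng.normal(size=(d_l, d_k)))
print(CM.shape)  # (4, 8)
```

Note that queries come from the target modality $a$ while keys and values come from the source modality $l$, which is what transfers information across modalities.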
To model the cross-modal interactions on unaligned multimodal sequences, e.g., audio, visual, and lexical, we utilize $D$ cross-modal transformer layers. Suppose that $Z_a[i]$ is the modality-sensitive global-context representation of the whole conversation for modality $a$ at the $i$-th layer, with $Z_a[0] = X_a$. The enriched representation $Z_{l \rightarrow a}$, obtained by applying cross-modal attention of modality $l$ on modality $a$, is computed layer by layer as follows:

$$Z_{l \rightarrow a}[i] = CM_{l \rightarrow a}\big(\mathrm{LN}(Z_a[i-1]),\, \mathrm{LN}(Z_l[0])\big) + \mathrm{LN}(Z_a[i-1]),$$
$$Z_a[i] = \mathrm{FFN}\big(\mathrm{LN}(Z_{l \rightarrow a}[i])\big) + \mathrm{LN}(Z_{l \rightarrow a}[i]),$$

where LN is layer normalization and FFN expresses the transformation by the position-wise feed-forward block:

$$\mathrm{FFN}(x) = \max(0, x\Omega_1 + b_1)\,\Omega_2 + b_2,$$

where $\Omega_1$ and $\Omega_2$ are linear projection matrices, and $b_1$ and $b_2$ are biases.
Likewise, we can easily compute the cross-modal representation $Z_{a \rightarrow l}$, indicating that information from modality $a$ is transferred to modality $l$. Finally, we concatenate all representations at the last layer, i.e., the $D$-th layer, to obtain the final cross-modal global-context representation $Z$.

Multimodal Emotion Classification
The local- and global-context representations produced by the RT-GCN and P-CM modules are fused together to create the final representation of the conversation:

$$H = \mathrm{Fusion}\big(G^\tau, Z\big), \quad \tau \in \{a, v, l\},$$

where Fusion denotes the concatenation method. $H$ is then fed to a fully connected layer to predict the emotion label $\hat{y}_i$ for the utterance $u_i$:

$$\hat{y}_i = \mathrm{softmax}\big(\Phi_1(\Phi_0 H_i)\big),$$

where $\Phi_0$, $\Phi_1$ are learned parameters.
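A hedged sketch of the fusion-and-classification step; the concatenation order and the ReLU between the two projections are assumptions for illustration:

```python
import numpy as np

def classify(h, Phi0, Phi1):
    # Emotion head sketch: hidden projection, then softmax over M labels.
    # The ReLU between the two layers is an assumption.
    z = np.maximum(0.0, Phi0 @ h)
    logits = Phi1 @ z
    p = np.exp(logits - logits.max())
    return p / p.sum()

rng = np.random.default_rng(0)
g_a, g_v, g_l, z_cross = (rng.normal(size=8) for _ in range(4))
h = np.concatenate([g_a, g_v, g_l, z_cross])   # Fusion = concatenation
probs = classify(h, rng.normal(size=(16, 32)), rng.normal(size=(6, 16)))
print(probs.shape)  # (6,) - a distribution over six emotion labels
```

The predicted label is then the argmax over the six-way distribution, matching the IEMOCAP (6-way) setting.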

Experiments
This section investigates the efficacy of CORECT for the ERC task through extensive experiments in comparison with state-of-the-art (SOTA) baselines.

Experimental Setup
Datasets. We investigate two public real-life datasets for the multimodal ERC task: IEMOCAP (Busso et al., 2008) and CMU-MOSEI (Bagher Zadeh et al., 2018). The dataset statistics are given in Table 1.
IEMOCAP contains 12 hours of videos of two-way conversations from 10 speakers. Each dialogue is divided into utterances; there are 7433 utterances and 151 dialogues in total. The 6-way dataset assigns one of six emotion labels, i.e., happy, sad, neutral, angry, excited, and frustrated, to each utterance. As a simplified version, ambiguous pairs such as (happy, excited) and (sad, frustrated) are merged to form the 4-way dataset.
Evaluation Metrics. We use the weighted F1-score (w-F1) and Accuracy (Acc.) as evaluation metrics. The w-F1 is computed as $\text{w-F1} = \sum_{k=1}^{K} freq_k \times F1_k$, where $freq_k$ is the relative frequency of class $k$. Accuracy is defined as the percentage of correct predictions on the test set.
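The w-F1 computation is simply a frequency-weighted average of per-class F1 scores, e.g.:

```python
def weighted_f1(f1_per_class, counts):
    # w-F1 = sum_k freq_k * F1_k, where freq_k is class k's share of samples.
    total = sum(counts)
    return sum(f1 * c / total for f1, c in zip(f1_per_class, counts))

# Two classes: F1 of 0.8 on 30 samples and F1 of 0.6 on 10 samples.
print(round(weighted_f1([0.8, 0.6], [30, 10]), 4))  # 0.75
```

This matches the `average='weighted'` behavior of common metric libraries, and weights majority classes more heavily than a macro average would.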
Implementation Details.Due to the space limit, the implementation details for feature extraction and interaction are described in Appendix A.1.
IEMOCAP: On the IEMOCAP (6-way) dataset (Table 2), CORECT performs better than the previous baselines in terms of F1 score for individual labels, except for the Sad and Excited labels. The reason could be the ambiguity between similar emotions, such as Happy & Excited and Sad & Frustrated (see Figure 6 in Appendix A.2 for details). Nevertheless, the accuracy and weighted F1 score of CORECT are 2.89% and 2.75% higher than the baseline models on average. Likewise, we observe a similar phenomenon on the IEMOCAP (4-way) dataset, with a 2.49% improvement over the previous state-of-the-art models, as shown in Table 3. These results affirm the efficiency of CORECT for the multimodal ERC task.

Ablation study
Effect of Main Components. The impact of the main components of CORECT is presented in Table 5. The model performance on the 6-way IEMOCAP dataset degrades remarkably when the RT-GCN or P-CM module is removed, with decreases of 3.47% and 3.38% respectively. A similar phenomenon is observed on the 4-way IEMOCAP dataset. We can thus deduce that the effect of RT-GCN in the CORECT model is slightly more significant than that of P-CM.
For the different relation types, ablating either $\mathcal{R}_{multi}$ or $\mathcal{R}_{temp}$ results in a significant decrease in performance. However, the number of labels may affect the multimodal graph construction, so it is not easy to rank the importance of $\mathcal{R}_{multi}$ and $\mathcal{R}_{temp}$ for the multimodal ERC task.
Table 8 (Appendix A.2) presents the ablation results for uni- and bi-modal combinations. In the unimodal settings, i.e., for each individual modality (A, V, T), it is important to note that both the P-CM module and the multimodal relations $\mathcal{R}_{multi}$ are absent. In the bimodal combinations, the advantage of leveraging cross-modality information between audio and text (A+T) stands out, with a significant performance boost of over 2.75% compared to text and visual (T+V) and a substantial 14.54% compared to visual and audio (V+A).
Additionally, our experiments show a slight drop in overall model performance (e.g., 68.32% on IEMOCAP 6-way, a drop of 1.70%) when excluding the Speaker Embedding $S_{emb}$ from CORECT.
Effect of the Past and Future Utterance Nodes. We conduct an analysis to investigate the influence of past nodes (P) and future nodes (F) on the model's performance. Unlike previous studies (Joshi et al., 2022; Li et al., 2023) that treated P and F pairs equally, we explore various combinations of P and F settings to determine their effects. Figure 4 indicates that the numbers of past and future nodes can have different impacts on the performance. From the empirical analysis, the setting [P, F] = [11, 9] yields the best performance. This finding shows that contextual information from the past has a stronger influence on the multimodal ERC task than the future context.

Effect of Modalities. For IEMOCAP (Table 2 and Table 3), the textual modality performs best among the unimodal settings, while the visual modality yields the lowest results. This can be attributed to noise caused by factors such as camera position and environmental conditions. In the bi-modal settings, combining the textual and acoustic modalities achieves the best performance, while combining the visual and acoustic modalities produces the worst result. A similar trend is observed on the CMU-MOSEI dataset (Table 4), where fusing all modalities leads to a better result than using individual or paired modalities.

Conclusion
In this work, we propose CORECT, a novel network architecture for multimodal ERC. It consists of two main components: RT-GCN and P-CM. The former learns local-context representations by leveraging modality-level topological relations, while the latter infers cross-modal global-context representations from the entire dialogue. Extensive experiments on two popular benchmark datasets, i.e., IEMOCAP and CMU-MOSEI, demonstrate the effectiveness of CORECT, which sets a new state-of-the-art record for multimodal conversational emotion recognition. Furthermore, we provide ablation studies to investigate the contribution of various components of CORECT. Interestingly, by analyzing the temporal aspect of conversations, we validate that capturing long-term dependencies, e.g., past relations, improves the performance of multimodal emotion recognition in conversations.

Limitations
Hyper-parameter tuning is a vital part of optimizing machine learning models, and the learning of CORECT is no exception: it is affected by hyper-parameters such as the number of attention heads in the P-CM module and the sizes of the Future and Past windows. Due to time constraints and limited computational resources, it was not possible to tune or explore all possible combinations of these hyper-parameters, which might lead to local-minima convergence. In the future, one solution to this limitation is to employ automated hyper-parameter optimization algorithms to systematically explore the hyper-parameter space and improve the robustness of the model. Another solution is to upgrade CORECT with learning mechanisms that automatically leverage important information, e.g., an attention mechanism over future and past utterances.

A Appendix
A.1 Implementation Details

A.1.1 Multimodal Raw Feature Extraction
The multimodal feature extraction process involves extracting features from the acoustic, lexical, and visual modalities for each utterance.
For IEMOCAP, the audio features, with a size of 100, are obtained using the OpenSmile Toolkit (Eyben et al., 2010); visual features, with a size of 512, are extracted using OpenFace (Baltrusaitis et al., 2018); textual features, with a size of 768, are derived using sBERT (Reimers and Gurevych, 2019).
For CMU-MOSEI, the audio features are extracted using librosa (McFee et al., 2015) with 80 filter banks, resulting in a feature vector of size 80. The visual features, with a size of 35, are obtained from (Bagher Zadeh et al., 2018). The textual features, with a size of 768, are obtained using sBERT (Reimers and Gurevych, 2019).

A.2 Additional Experiment Result
Table 8 showcases the results on the IEMOCAP dataset (both 6-way and 4-way) for all the modality combinations of the CORECT model, while Table 7 presents an ablation study conducted on the CMU-MOSEI dataset, considering various modality combination settings.

A.3 Reproducibility
CORECT is implemented using PyTorch, and experiments are run on Google Colab Pro. We choose Adam as the optimizer and set the dropout rate to 0.5. The numbers of multi-head attentions used in the Graph Transformer and P-CM are 7 and 2, respectively. For the IEMOCAP dataset, the learning rate is 0.0003; the window size [P, F] is tested on various settings in the range [1, 15]. For the CMU-MOSEI dataset, the learning rate is 0.0006; the window size [P, F] is set to [5, 4] due to the short dialogues in CMU-MOSEI. Refer-

Figure 1 :
Figure 1: Examples of temporal effects on conversations

Figure 2 :
Figure 2: Framework illustration of CORECT for the multimodal emotion recognition in conversations

Figure 3 :
Figure 3: An example of a constructed graph illustrating the relationships among utterance nodes representing the audio (square), visual (circle), and text (triangle) modalities with window size [P, F] = [2, 1] for the i-th query utterance. The solid blue, solid red, and dashed red arrows indicate cross-modal, past temporal, and future temporal connections respectively.

Figure 4 :
Figure 4: The effects of the P (past) and F (future) nodes of the CORECT model on IEMOCAP (6-way). The red dashed line indicates our best setting for P and F.

Figure 5
Figure 5 illustrates the details of the Pairwise Cross-modal Feature Interaction (P-CM) module.

Figure 5 :
Figure 5: Illustration of the P-CM module between modality β and α.

Figure 6 :
Figure 6: Visualization of the confusion matrices of CORECT under the multimodal (A+V+T) setting. Most false predictions on IEMOCAP (6-way) come from the ambiguity between pairs of labels: Happy and Excited, and Neutral and Frustrated.

Table 2 :
The results on the IEMOCAP (6-way) dataset in the multimodal (A+V+T) setting. Results in bold indicate the highest performance, while underlined results represent the second highest. The ↑ illustrates the improvement compared to the previous state-of-the-art model.

Table 3 :
The results on the IEMOCAP (4-way) dataset in the multimodal (A+V+T) setting.The ↑ indicates the improvement compared to the previous SOTA model.

Table 4 :
Results on CMU-MOSEI dataset compared with previous works.The bolded results indicate the best performance, while the underlined results represent the second best performance.

Table 5 :
The performance of CORECT in different strategies under the fully multimodal (A+V+T) setting.Bolded results represent the best performance, while underlined results depict the second best.The ↓ represents the decrease in performance when a specific module is ablated compared to our CORECT model.

Table 6 :
The performance of CORECT under various modality settings.