Joyful: Joint Modality Fusion and Graph Contrastive Learning for Multimodal Emotion Recognition

Multimodal emotion recognition aims to recognize the emotion of each utterance from multiple modalities, and has received increasing attention for its applications in human-machine interaction. Current graph-based methods fail to simultaneously depict global contextual features and local diverse uni-modal features in a dialogue. Furthermore, as the number of graph layers increases, they easily fall into over-smoothing. In this paper, we propose a method for joint modality fusion and graph contrastive learning for multimodal emotion recognition (Joyful), where multimodality fusion, contrastive learning, and emotion recognition are jointly optimized. Specifically, we first design a new multimodal fusion mechanism that provides deep interaction and fusion between global contextual and uni-modal specific features. Then, we introduce a graph contrastive learning framework with inter-view and intra-view contrastive losses to learn more distinguishable representations for samples with different sentiments. Extensive experiments on three benchmark datasets indicate that Joyful achieves state-of-the-art (SOTA) performance compared to all baselines.

1 Introduction

"Integration of information from multiple sensory channels is crucial for understanding tendencies and reactions in humans" (Partan and Marler, 1999). Multimodal emotion recognition in conversations (MERC) aims exactly to identify and track the emotional state of each utterance from heterogeneous visual, audio, and text channels. Due to its potential applications in creating human-computer interaction systems (Li et al., 2022b), social media analysis (Gupta et al., 2022; Wang et al., 2023), and recommendation systems (Singh et al., 2022), MERC has received increasing attention in the natural language processing (NLP) community (Poria et al., 2019b, 2021).
Figure 1 shows that emotions expressed in a dialogue are affected by three main factors: 1) multiple uni-modalities (different modalities complement each other to provide a more informative utterance representation); 2) global contextual information (u^A_3 depends on the topic "The ship sank into the sea", indicating fear); and 3) intra-person and inter-person dependencies (u^A_6 becomes sad, affected by the sadness in u^B_4 and u^B_5). Depending on how they model intra-person and inter-person dependencies, current MERC methods can be categorized into sequence-based and graph-based methods. The former (Dai et al., 2021; Mao et al., 2022; Liang et al., 2022) use recurrent neural networks or Transformers to model the temporal interaction between utterances. However, they fail to distinguish intra-speaker and inter-speaker dependencies and easily lose uni-modal specific features through the cross-modal attention mechanism (Rajan et al., 2022). Graph structures (Joshi et al., 2022; Wei et al., 2019) solve these issues by using edges between nodes (speakers) to distinguish intra-speaker and inter-speaker dependencies. Graph Neural Networks (GNNs) further help nodes learn common features by aggregating information from neighbours while maintaining their uni-modal specific features.
Although graph-based MERC methods have achieved great success, several problems remain to be solved: 1) Current methods directly aggregate features of multiple modalities (Joshi et al., 2022) or project modalities into a latent space to learn representations (Li et al., 2022e), which ignores the diversity of each modality and fails to capture richer semantic information from each modality. They also ignore global contextual information during the feature fusion process, leading to poor performance. 2) Since all graph-based methods adopt GNNs (Scarselli et al., 2009) or Graph Convolutional Networks (GCNs) (Kipf and Welling, 2017), as the number of layers deepens, over-smoothing starts to appear, making the representations of similar sentiments indistinguishable. 3) Most methods use a two-phase pipeline (Fu et al., 2021; Joshi et al., 2022), where they first extract and fuse uni-modal features as utterance representations and then fix them as input for graph models. However, the two-phase pipeline leads to sub-optimal performance since the fused representations are fixed and cannot be further improved to benefit from the downstream supervisory signals.
To solve the above-mentioned problems, we propose Joint multimodality fusion and graph contrastive learning for MERC (JOYFUL), where multimodality fusion, graph contrastive learning (GCL), and multimodal emotion recognition are jointly optimized in an overall objective function. 1) We first design a new multimodal fusion mechanism that can simultaneously learn and fuse a global contextual representation and uni-modal specific representations. For the global contextual representation, we smooth it with a proposed topic-related vector to maintain its consistency, where the topic-related vector is temporally updated since the topic usually changes. For uni-modal specific representations, we project them into a shared subspace to fully explore their richer semantics without losing alignment with other modalities. 2) To alleviate the over-smoothing issue of deeper GNN layers, inspired by You et al. (2020), who showed that contrastive learning can provide more distinguishable node representations for various downstream tasks, we propose a cross-view GCL-based framework to alleviate the difficulty of categorizing similar emotions, which helps learn more distinctive utterance representations by making samples with the same sentiment cohesive and those with different sentiments mutually exclusive. Furthermore, graph augmentation strategies are designed to improve JOYFUL's robustness and generalizability. 3) We jointly optimize each part of JOYFUL in an end-to-end manner to ensure globally optimized performance. The main contributions of this study can be summarized as follows:

• We propose a novel joint learning framework for MERC, where multimodality fusion, GCL, and emotion recognition are jointly optimized for globally optimal performance. Our new multimodal fusion mechanism can obtain better representations by simultaneously depicting global contextual and local uni-modal specific features.
• To the best of our knowledge, JOYFUL is the first method to utilize graph contrastive learning for MERC, which significantly improves the model's ability to distinguish different sentiments. Multiple graph augmentation strategies further improve the model's stability and generalization.
• Extensive experiments conducted on three multimodal benchmark datasets demonstrated the effectiveness and robustness of JOYFUL.

Multimodal Emotion Recognition
Depending on how they model the context of utterances, existing MERC methods are categorized into three classes: recurrent-based methods (Majumder et al., 2019; Mao et al., 2022) adopt RNNs or LSTMs to model the sequential context of each utterance; Transformer-based methods (Ling et al., 2022; Liang et al., 2022; Le et al., 2022) use Transformers with cross-modal attention to model intra- and inter-speaker dependencies; graph-based methods (Joshi et al., 2022; Zhang et al., 2021; Fu et al., 2021) can control the context information for each utterance and provide accurate intra- and inter-speaker dependencies, achieving SOTA performance on many MERC benchmark datasets.

Multimodal Fusion Mechanism
Learning effective fusion mechanisms is one of the core challenges in multimodal learning (Shankar, 2022). By capturing the interactions between different modalities more reasonably, deep models can acquire more comprehensive information. Current fusion methods can be classified into aggregation-based (Wu et al., 2021; Guo et al., 2021), alignment-based (Liu et al., 2020; Li et al., 2022e), and their mixture (Wei et al., 2019; Nagrani et al., 2021). Aggregation-based fusion methods (Zadeh et al., 2017; Chen et al., 2021) adopt concatenation, tensor fusion, and memory fusion to combine multiple modalities. Alignment-based fusion centers on latent cross-modal adaptation, which adapts streams from one modality to another (Wang et al., 2022a). Different from the above methods, we learn global contextual information by concatenation while fully exploring the specific patterns of each modality in an alignment manner.

Graph Contrastive Learning
GCL aims to learn representations by maximizing feature consistency under differently augmented views that exploit data- or task-specific augmentations to inject the desired feature invariance (You et al., 2020). GCL has been widely used in the NLP community in self-supervised and supervised settings. Self-supervised GCL first creates augmented graphs by edge/node deletion and insertion (Zeng and Xie, 2021) or attribute masking (Zhang et al., 2022); it then captures the intrinsic patterns and properties of the augmented graphs without using human-provided labels. Supervised GCL designs adversarial (Sun et al., 2022) or geometric (Li et al., 2022d) contrastive losses to make full use of label information. For example, Li et al. (2022c) first used supervised CL for emotion recognition, greatly improving performance. Inspired by previous studies, we jointly consider self-supervised (suitable graph augmentation) and supervised (cross-entropy) manners to fully explore graph structural information and downstream supervisory signals.

Methodology
Figure 2 shows an overview of JOYFUL, which mainly consists of four components: (A) a uni-modal extractor, (B) a multimodal fusion (MF) module, (C) a graph contrastive learning module, and (D) a classifier. Hereafter, we give formal notations and the task definition, and then introduce each component in detail.

Notations and Task Definition
In dialogue emotion recognition, a training dataset {(C_i, Y_i)}^N_{i=1} is given, where C_i represents the i-th conversation, each conversation contains several utterances C_i = {u_1, ..., u_m}, and Y_i ∈ Y^m, given a label set Y = {y_1, ..., y_k} of k emotion classes. Let X_v, X_a, X_t be the visual, audio, and text feature spaces, respectively. The goal of MERC is to learn a function F : X_v × X_a × X_t → Y that recognizes the emotion label of each utterance. We utilize three widely used multimodal conversational benchmark datasets, namely IEMOCAP, MOSEI, and MELD, to evaluate the performance of our model. Please see Section 4.1 for their detailed statistics.

Multimodal Fusion Module
Though the uni-modal extractors can capture long-term temporal context, they are unable to handle feature redundancy and noise due to the modality gap. Thus, we design a new multimodal fusion module (Figure 2 (B)) that inherently separates multiple modalities into two disjoint parts, contextual representations and specific representations, to extract the consistency and specificity of heterogeneous modalities collaboratively and individually.

Contextual Representation Learning
Contextual representation learning aims to explore and learn the hidden contextual intent/topic knowledge of the dialogue, which can greatly improve the performance of JOYFUL. In Figure 2 (B1), we first project all uni-modal inputs x_{v,a,t} into a latent space using three separate fully connected deep neural networks f^g_{v,a,t}(·) to obtain hidden representations z^g_{v,a,t}. Then, we concatenate them as z^g_m and apply a multi-layer Transformer to maximize the correlation between multimodal features, from which we learn a global contextual multimodal representation ẑ^g_m. Considering that the contextual information changes over time, we design a temporal smoothing strategy for ẑ^g_m (Eq. (1)), where z_con is the topic-related vector describing high-level global contextual information without requiring topic-related inputs, following the definition in Joshi et al. (2022). For the (i+1)-th utterance, we update z_con ← z_con + e^{η·i} ẑ^g_m, where η is the exponential smoothing parameter (Shazeer and Stern, 2018), indicating that more recent information is more important.
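The temporal update of the topic-related vector can be sketched in a few lines. This is an illustrative toy (the function name, toy dimensions, and example values are our own, not the actual implementation):

```python
import numpy as np

def update_topic_vector(z_con, z_hat_g, i, eta=0.2):
    """Exponentially weighted update of the topic-related vector:
    z_con <- z_con + e^(eta * i) * z_hat_g.
    Later utterances (larger i) receive exponentially larger weight,
    so more recent contextual information dominates."""
    return z_con + np.exp(eta * i) * z_hat_g

# toy dialogue: 3 utterances with 4-dimensional fused contextual features
z_con = np.zeros(4)
for i, z_hat in enumerate(np.ones((3, 4))):
    z_con = update_topic_vector(z_con, z_hat, i, eta=0.2)
# z_con now equals (e^0 + e^0.2 + e^0.4) in every dimension
```

Because the exponent grows with the utterance index, the most recent utterance contributes the largest weight, matching the intuition that topics drift over a conversation.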
To ensure that fused contextual representations capture enough details from hidden layers, Hazarika et al. (2020) minimized the reconstruction error between fused representations and hidden representations. Inspired by their work, to ensure that ẑ^g_m contains essential modality cues for downstream emotion recognition, we reconstruct z^g_m from ẑ^g_m by minimizing their Euclidean distance:

L_rec = ‖z^g_m − ẑ^g_m‖²_2. (2)

Specific Representation Learning
Specific representation learning aims to fully explore specific information from each modality to complement one another. Figure 2 (B2) shows that we first use three fully connected deep neural networks f^ℓ_{v,a,t}(·) to project uni-modal embeddings x_{v,a,t} into a hidden space with representations z^ℓ_{v,a,t}. Considering that visual, audio, and text features are extracted with different encoding methods, directly applying multiple specific features as input to the downstream emotion recognition task would degrade the model's accuracy. To solve this, the multimodal features are projected into a shared subspace, and a shared trainable basis matrix is designed to learn aligned representations for them. Therefore, the multimodal features can be fully integrated and interact to mitigate feature discontinuity and remove noise across modalities. We define a shared trainable basis matrix B ∈ R^{q×d_b} with q basis vectors, where d_b is the dimensionality of each basis vector. Then, z^ℓ_{v,a,t} and B are projected into the shared subspace:

z̃^ℓ_{v,a,t} = W_{v,a,t} z^ℓ_{v,a,t}, B̃ = B W_b, (3)

where W_{v,a,t,b} are trainable parameters. To learn new representations for each modality, we calculate the cosine similarity between them and B̃ (Eq. (4)), where S^v_{ij} denotes the similarity between the i-th visual feature (z̃^ℓ_v)_i and the j-th basis vector b̃_j. To prevent inaccurate representation learning caused by an excessive weight on any single item, the similarities are further normalized (Eq. (5)). The new representations ẑ^ℓ_{v,a,t} are then obtained as similarity-weighted combinations of the basis vectors (Eq. (6)), and we also apply a reconstruction loss to their concatenation (Eq. (7)), where Concat(·, ·) indicates concatenation. Finally, we define the multimodal fusion loss by combining Eqs. (1), (2), and (7) as Eq. (8).
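The projection-and-alignment step can be sketched numerically. This is a minimal sketch under our own assumptions: softmax is used as the similarity normalization and the aligned representation is taken as a similarity-weighted sum of the projected basis vectors; the actual normalization and recombination may differ:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align_to_basis(z, W_m, B, W_b):
    """Project one modality's features and the shared basis matrix into a
    common subspace, then rebuild each feature as a similarity-weighted
    combination of the projected basis vectors."""
    z_t = z @ W_m.T                                  # (n, d) projected features
    B_t = B @ W_b                                    # (q, d) projected basis
    zn = z_t / np.linalg.norm(z_t, axis=1, keepdims=True)
    bn = B_t / np.linalg.norm(B_t, axis=1, keepdims=True)
    S = softmax(zn @ bn.T, axis=1)                   # normalized cosine similarities (n, q)
    return S @ B_t                                   # aligned representations (n, d)

rng = np.random.default_rng(0)
n, d, q = 5, 8, 4
z_v = rng.normal(size=(n, d))                        # e.g. toy visual features
out = align_to_basis(z_v, rng.normal(size=(d, d)),
                     rng.normal(size=(q, d)), rng.normal(size=(d, d)))
```

Because every modality is expressed over the same basis vectors, the resulting representations live in a common, aligned space.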

Graph Construction
Graph construction aims to establish relations between past and future utterances that preserve both intra- and inter-speaker dependencies in a dialogue. We define the i-th dialogue with P speakers as C_i = {U^{S_1}, ..., U^{S_P}}, where U^{S_i} = {u^{S_i}_1, ..., u^{S_i}_m} represents the set of utterances spoken by speaker S_i. Following Ghosal et al. (2019), we define a graph with nodes representing utterances and directed edges representing their relations: R_ij = u_i → u_j, where the arrow represents the speaking order. Intra-dependency (R_intra ∈ {U^{S_i} → U^{S_i}}) represents the intra-relations between utterances (red lines), and inter-dependency (R_inter ∈ {U^{S_i} → U^{S_j}}, i ≠ j) represents the inter-relations between utterances (purple lines), as shown in Figure 3. All nodes are initialized by concatenating contextual and specific representations as h_m = Concat(ẑ^g_m, ẑ^ℓ_m). The window size is a hyper-parameter that controls the amount of context information for each utterance and provides accurate intra- and inter-speaker dependencies.
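A simplified sketch of the graph construction follows. It connects each utterance only to subsequent utterances within the window and labels edges as intra- or inter-speaker; the actual model also connects past and future utterances within the window, so this forward-only version is an assumption for brevity:

```python
def build_dialogue_graph(speakers, window=2):
    """Connect utterance i to the next `window` utterances with directed
    edges (following speaking order), each labeled as intra-speaker
    (same speaker) or inter-speaker (different speakers)."""
    edges = []
    for i in range(len(speakers)):
        for j in range(i + 1, min(i + 1 + window, len(speakers))):
            rel = "intra" if speakers[i] == speakers[j] else "inter"
            edges.append((i, j, rel))
    return edges

# two speakers, four utterances, window of 1
edges = build_dialogue_graph(["A", "A", "B", "A"], window=1)
```

Enlarging the window adds more context edges per node, which is exactly the trade-off tuned in the parameter study (Section "Parameter Sensitive Study").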

Graph Augmentation
Graph Augmentation (GA): Inspired by Zhu et al. (2020), creating two augmented views by corrupting the original graph in different ways can provide highly heterogeneous contexts for nodes. By maximizing the mutual information between the two augmented views, we can improve the robustness of the model and obtain distinguishable node representations (You et al., 2020). However, there are no universally appropriate GA methods for various downstream tasks (Xu et al., 2021), which motivates us to design specific GA strategies for MERC. Considering that MERC is sensitive to the initialized representations of utterances and to intra-speaker and inter-speaker dependencies, we design three corresponding GA methods:
- Feature Masking (FM): given the initialized representations of utterances, we randomly select p% of their dimensions and mask the corresponding elements with zero, which is expected to enhance the robustness of JOYFUL to multimodal feature variations;
- Edge Perturbation (EP): given the graph G, we randomly drop and add p% of the intra- and inter-speaker edges, which is expected to enhance the robustness of JOYFUL to local structural variations;
- Global Proximity (GP): given the graph G, we first use the Katz index (Katz, 1953) to calculate high-order similarities between intra- and inter-speaker utterances, and randomly add p% high-order edges between speakers, which is expected to enhance the robustness of JOYFUL to global structural variations (examples in Appendix A).
We propose a hybrid scheme for generating graph views at both the structure and attribute levels to provide diverse node contexts for the contrastive objective. Figure 2 (C) shows that the combinations of (FM & EP) and (FM & GP) are adopted to obtain two correlated views.
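The FM and EP augmentations can be sketched as below. This is an illustrative toy (function names and the exact drop/add policy are our own simplifications), not the paper's implementation:

```python
import random

def feature_mask(x, p=0.2, seed=0):
    """FM: zero out a random fraction p of the feature dimensions."""
    rng = random.Random(seed)
    masked = set(rng.sample(range(len(x)), int(p * len(x))))
    return [0.0 if i in masked else v for i, v in enumerate(x)]

def edge_perturb(edges, n_nodes, p=0.2, seed=0):
    """EP: drop p% of the existing edges and add the same number of
    random new edges between nodes."""
    rng = random.Random(seed)
    k = max(1, int(p * len(edges)))
    dropped = set(rng.sample(edges, k))
    kept = [e for e in edges if e not in dropped]
    added = [(rng.randrange(n_nodes), rng.randrange(n_nodes)) for _ in range(k)]
    return kept + added

x_masked = feature_mask([1.0] * 10, p=0.2)
g_perturbed = edge_perturb([(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)],
                           n_nodes=5, p=0.2)
```

Applying two such corruptions with different random seeds yields the two correlated views whose agreement the contrastive loss maximizes.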

Graph Contrastive Learning
Graph contrastive learning adopts an L-layer GCN as a graph encoder to extract node hidden representations H = {h_1, ..., h_m} for the two augmented graphs, where h_i is the hidden representation of the i-th node. We follow an iterative neighborhood aggregation (or message passing) scheme to capture the structural information within the nodes' neighborhoods. Formally, the propagation and aggregation of the ℓ-th GCN layer is:

h^{(i,ℓ)} = COM^{(ℓ)}(h^{(i,ℓ−1)}, AGG^{(ℓ)}({h^{(j,ℓ−1)} : j ∈ N_i})), (9)

where h^{(i,ℓ)} is the embedding of the i-th node at the ℓ-th layer, h^{(i,0)} is the initialization of the i-th utterance, N_i represents all neighbour nodes of the i-th node, and AGG^{(ℓ)}(·) and COM^{(ℓ)}(·) are the aggregation and combination functions of the ℓ-th GCN layer (Hamilton et al., 2017). For convenience, we define h_i = h^{(i,L)}. After the L-th GCN layer, the final node representations of the two views are H^(1) and H^(2).
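One message-passing layer can be sketched as follows; mean aggregation and a tanh combination are placeholder choices for AGG and COM (the actual functions are learnable and may differ):

```python
import numpy as np

def gcn_layer(H, neighbors, W_self, W_agg):
    """One message-passing layer: mean-aggregate neighbour embeddings
    (AGG), then combine with the node's own embedding (COM)."""
    out = []
    for i in range(len(H)):
        if neighbors[i]:
            agg = np.mean([H[j] for j in neighbors[i]], axis=0)
        else:
            agg = np.zeros_like(H[i])
        out.append(np.tanh(H[i] @ W_self + agg @ W_agg))
    return np.stack(out)

H0 = np.eye(2)            # two nodes with 2-dim one-hot features
nbrs = {0: [1], 1: [0]}   # each node is the other's neighbour
H1 = gcn_layer(H0, nbrs, np.eye(2), np.eye(2))
```

Stacking L such layers lets each node absorb information from its L-hop neighbourhood, which is also why over-smoothing appears when L grows too large.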
In Figure 2 (C3), we design intra- and inter-view graph contrastive losses to learn distinctive node representations. We start with the inter-view contrastiveness, which pulls the representations of the same node in the two augmented views closer while pushing other nodes away, as depicted by the red and blue dashed lines in Figure 2 (C3). Defining positive pairs as (h^(1)_i, h^(2)_i)^+ and negative pairs as (h^(1)_i, h^(2)_j)^−, where i ≠ j, the inter-view loss for the i-th node is formulated as:

L_inter(i) = −log [ exp(sim(h^(1)_i, h^(2)_i)) / Σ^m_{j=1} exp(sim(h^(1)_i, h^(2)_j)) ], (11)

where sim(·, ·) denotes the similarity between two vectors, i.e., the cosine similarity in this paper. Intra-view contrastiveness regards all nodes except the anchor node as negatives within a particular view (green dashed lines in Figure 2 (C3)), with negative pairs defined as (h^(1)_i, h^(1)_j)^−, where i ≠ j. The intra-view contrastive loss for the i-th node is defined as:

L_intra(i) = −log [ exp(sim(h^(1)_i, h^(2)_i)) / (exp(sim(h^(1)_i, h^(2)_i)) + Σ_{j≠i} exp(sim(h^(1)_i, h^(1)_j))) ]. (12)
By combining the inter- and intra-view contrastive losses of Eqs. (11) and (12), the contrastive objective function L_ct is formulated as:

L_ct = (1 / 2m) Σ^m_{i=1} [L_inter(i) + L_intra(i)]. (13)
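A small numerical sketch of the inter- and intra-view losses described above (the temperature parameter is omitted and the loss forms are reconstructed from the text, so treat this as an approximation):

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def inter_view_loss(H1, H2, i):
    """Pull (h1_i, h2_i) together; negatives are all other nodes in
    the second view."""
    sims = np.array([cos_sim(H1[i], H2[j]) for j in range(len(H2))])
    return -np.log(np.exp(sims[i]) / np.exp(sims).sum())

def intra_view_loss(H1, H2, i):
    """Same positive pair; negatives are the other nodes within the
    anchor's own view."""
    pos = np.exp(cos_sim(H1[i], H2[i]))
    neg = sum(np.exp(cos_sim(H1[i], H1[j]))
              for j in range(len(H1)) if j != i)
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(0)
H1, H2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
L_ct = np.mean([inter_view_loss(H1, H2, i) + intra_view_loss(H1, H2, i)
                for i in range(4)]) / 2
```

Both terms are standard InfoNCE-style losses; they differ only in where the negatives come from (the other view vs. the anchor's own view).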

Emotion Recognition Classifier
We use the cross-entropy loss for classification:

L_ce = −(1/m) Σ^m_{i=1} Σ^k_{j=1} y^j_i log(ŷ^j_i), (14)

where k is the number of emotion classes, m is the number of utterances, ŷ^j_i is the i-th predicted label, and y^j_i is the i-th ground truth of the j-th class. Combining the MF loss of Eq. (8), the contrastive loss of Eq. (13), and the classification loss of Eq. (14), the final objective function is

L = L_ce + α L_mf + β L_ct, (15)

where α and β are trade-off hyper-parameters. We give our pseudo-code in Appendix F. Please note that the detailed label distribution of the datasets is given in Appendix I.

Implementation Details. We selected the augmentation pairs (FM & EP) and (FM & GP) for the two views. We set the augmentation ratio p = 20% and the smoothing parameter η = 0.2, and applied the Adam (Kingma and Ba, 2015) optimizer with an initial learning rate of 3e-5. For a fair comparison, we followed the default parameter settings of the baselines and repeated all experiments ten times to report the average accuracy. We conducted significance tests (t-test) with Benjamini-Hochberg (Benjamini and Hochberg, 1995) correction (please see details in Appendix G).
Baselines. Different MERC datasets have different best-performing systems; following COGMEN, we selected SOTA baselines for each dataset.

Parameter Sensitive Study
We first examined whether applying different data augmentation methods improves JOYFUL. We observed in Figure 4 (A) that: 1) all data augmentation strategies are effective; 2) applying augmentation pairs of the same type does not yield the best performance; and 3) applying augmentation pairs of different types improves performance. Thus, we selected (FM & EP) and (FM & GP) as the default augmentation strategy since they achieved the best performance (for more details, please see Appendix C). JOYFUL has three hyper-parameters: α and β determine the importance of MF and GCL in Eq. (15), and the window size controls the contextual length of conversations. In Figure 4 (B), we observed how α and β affect the performance of JOYFUL by varying α from 0.02 to 0.10 in 0.02 intervals and β from 0.1 to 0.5 in 0.1 intervals. The results indicate that JOYFUL achieves the best performance when α ∈ [0.06, 0.08] and β = 0.3. Figure 4 (C) shows that JOYFUL achieved the best performance when window_size = 8. A small window size misses much contextual information, while a longer one contains too much noise, so we set it to 8 in the experiments (details in Appendix D).

Performance of JOYFUL
Tables 2 & 3 show that JOYFUL outperformed all baselines in terms of accuracy and WF1, improving WF1 by 5.0% and 1.3% for the 6-way and 4-way settings, respectively. Graph-based methods, COGMEN and JOYFUL, outperform Transformer-based methods, Mult and FE2E. Transformer-based methods cannot distinguish intra- and inter-speaker dependencies, which distracts their attention from important utterances. Furthermore, they use a cross-modal attention layer, which enhances common features among modalities while losing uni-modal specific features (Rajan et al., 2022). JOYFUL outperforms other GNN-based methods since it explores features at both the contextual and specific levels and uses GCL to obtain more distinguishable features. However, JOYFUL does not improve on Happy for 4-way or on Excited for 6-way, since the samples in IEMOCAP were insufficient for distinguishing these similar emotions (Happy is 1/3 of Neutral in Figure 4 (D)). Without labels' guidance to re-sample or re-weight the under-represented samples, the self-supervised GCL utilized in JOYFUL cannot ensure distinguishable representations for samples of minority classes by only exploring graph topological information and vertex attributes. Tables 4 & 5 show that JOYFUL outperformed the baselines in more complex scenes with multiple speakers and various emotion labels. Compared with COGMEN and MM-DFN, which directly aggregate multimodal features, JOYFUL can fully explore features from each uni-modality via specific representation learning to improve performance.
The GCL module can better aggregate similar emotional features across utterances, yielding better performance for multi-label classification. We cannot improve on Happy on MOSEI since the samples are imbalanced and Happy has only 1/6 as many samples as Surprise, making it hard for JOYFUL to identify.
To verify the performance gain from each component, we conducted additional ablation studies. We deepened the GNN layers to verify JOYFUL's ability to alleviate over-smoothing. In Table 7, COGMEN with a four-layer GNN was 9.24% lower than with one layer, demonstrating that over-smoothing decreases performance, while JOYFUL relieved this issue by using the GCL framework. To verify robustness, following Tan et al. (2022), we randomly added 5%∼20% noisy edges to the training data. In Table 7, COGMEN was easily affected by the noise, with an average 10.8% performance decrease under 20% noisy edges, while JOYFUL showed strong robustness with only an average 2.8% reduction.
To show the distinguishability of the node representations, we visualize the node representations of FE2E, COGMEN, and JOYFUL on 6-way IEMOCAP. In Figure 5, COGMEN and JOYFUL obtained more distinguishable node representations than FE2E, demonstrating that graph structures are more suitable for MERC than Transformers. JOYFUL performed better than COGMEN, illustrating the effectiveness of GCL. In Figure 6, we randomly sampled one example from each emotion of IEMOCAP (6-way) and chose the best-performing baseline COGMEN for comparison. JOYFUL obtained more discriminative prediction scores among emotion classes, showing that GCL can push samples from different emotion classes farther apart.

Conclusion
We proposed a joint learning model (JOYFUL) for MERC that involves a new multimodal fusion mechanism and a GCL module to effectively improve the performance of MERC. The MF mechanism can extract and fuse contextual and uni-modal specific emotion features, and the GCL module helps learn more distinguishable representations.
For future work, we plan to investigate the performance of using supervised GCL for JOYFUL on unbalanced and small-scale emotional datasets.

Limitations
JOYFUL has a limited ability to classify minority classes with fewer samples in unbalanced datasets. Although we utilized self-supervised graph contrastive learning to learn a distinguishable representation for each utterance by exploring vertex attributes, graph structure, and contextual information, GCL failed to separate classes with fewer samples from those with more samples, because the utilized self-supervised learning lacks label information and does not balance the label distribution. Another limitation of JOYFUL is that its framework was designed specifically for multimodal emotion recognition tasks, so it is not as straightforward and general as language models (Devlin et al., 2019; Liu et al., 2019) or image processing techniques (LeCun et al., 1995). This may limit the applications of JOYFUL to other multimodal tasks, such as multimodal sentiment analysis (detailed experiments in Appendix J) and multimodal retrieval. Finally, although JOYFUL achieved SOTA performance on three widely used MERC benchmark datasets, its performance on larger-scale and more heterogeneous data in real-world scenarios remains unclear.

A Example for Global Proximity
In Figure 7, given the graph G and a ratio p, we first use the Katz index (Katz, 1953) to calculate high-order similarities between the vertices. We consider an arbitrary number of high-order distances. For example, the second-order similarity between u^A_1 and u^B_4 is u^A_1 → u^B_4 = 0.83, the third-order similarity between u^A_1 and u^B_5 is u^A_1 → u^B_5 = 0.63, and the fourth-order similarity between u^A_1 and u^B_7 is u^A_1 → u^B_7 = 0.21. We then define a threshold score of 0.5, where a high-order similarity score lower than the threshold will not be selected as an added edge. Finally, we randomly select p% of the edges whose scores are higher than the threshold and add them to the original graph G to construct the augmented graph.
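The Katz index itself has a closed form, K = Σ_k β^k A^k = (I − βA)^(−1) − I, which can be computed directly. A minimal sketch on a toy path graph (β value and graph are our own illustration):

```python
import numpy as np

def katz_index(A, beta=0.1):
    """Katz similarity: sum over paths of all lengths weighted by beta^k,
    K = (I - beta*A)^(-1) - I. beta must be smaller than 1/lambda_max(A)
    for the series to converge."""
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)

# path graph u1 - u2 - u3: nodes 0 and 2 are only connected via node 1
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
K = katz_index(A, beta=0.1)
# K[0, 2] > 0 even though (0, 2) is not a direct edge: a candidate
# high-order edge for the GP augmentation
```

Entries for directly connected pairs dominate (first-order paths get the largest weight β), while longer paths contribute progressively less.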

B Dimensions of Mathematical Symbols
Since we do not have much space to detail the dimensions of the mathematical symbols in the main body, we carefully list all dimensions of the mathematical symbols for IEMOCAP in Table 8. For the symbols of the other two datasets, please see our source code.

C Observations of Graph Augmentation
As shown in Figure 8, when we take the combinations (FM & EP) and (FM & GP) as the two graph augmentation methods applied to the original graph, we achieve the best performance. Furthermore, we have the following observations:

Obs. 1: Graph augmentations are crucial. Without any data augmentation, the GCL module does not improve accuracy, judging from the averaged WF1 gain of the pair (None, None) in the upper-left corner of Figure 8. In contrast, composing an original graph with an appropriate augmentation benefits the averaged WF1 of emotion recognition, judging from the pairs (None, any) in the top row or the left-most column of Figure 8. Similar observations were made in GraphCL (You et al., 2020): without augmentation, GCL simply compares two original samples as a negative pair, with the positive-pair loss becoming zero, which homogeneously pushes all graph representations away from each other. Appropriate augmentations can force the model to learn representations invariant to the desired perturbations by maximizing the agreement between a graph and its augmentation.
Obs. 2: Composing different augmentations benefits the model's performance more. Applying augmentation pairs of the same type does not often result in the best performance (see the diagonals of Figure 8). In contrast, applying augmentation pairs of different types results in better performance gains (see the off-diagonals of Figure 8). Similar observations were made in SimCSE (Gao et al., 2021); as mentioned in that study, composing augmentation pairs of different types corresponds to a "harder" contrastive task.

For the text-only setting, we compared against DAG-ERC (Shen et al., 2021b), CESTa (Wang et al., 2020), SumAggGIN (Sheng et al., 2020), DiaCRN (Hu et al., 2021), DialogXL (Shen et al., 2021a), DiaGCN (Ghosal et al., 2019), and COGMEN (Joshi et al., 2022). Following COGMEN, the text-based models were specifically optimized for the text modality and incorporated architectural changes to cater to text. As shown in Table 11, JOYFUL, being a fairly generic architecture, still achieved better or comparable performance with respect to the state-of-the-art uni-modal methods. Adding more information via other modalities helped to further improve the performance of JOYFUL (Text vs. A+T+V). When using only the text modality, the DAG-ERC baseline achieved a higher WF1 than JOYFUL. We conjecture the main reason is that DAG-ERC (Shen et al., 2021b) fine-tuned the RoBERTa-large model (Liu et al., 2019), with 354 million parameters, as its text encoder. By fine-tuning RoBERTa-large under the guidance of downstream emotion recognition signals, it can provide the most suitable text features for ERC. In contrast, JOYFUL and the other methods directly use Sentence-BERT (Reimers and Gurevych, 2019), with 110 million parameters, as the text encoder without fine-tuning on ERC datasets.
To verify whether the above inference is reasonable, we used the RoBERTa-large model as our text feature extractor, called Text (RoBERTa-large), and fine-tuned it on the downstream IEMOCAP (6-way) dataset following the same method as DAG-ERC, called Fine-tuned Text (RoBERTa-large). The observation meets our intuition: with RoBERTa-large, JOYFUL improved the performance (68.05 vs. 67.48) compared with using Sentence-BERT as the text encoder, and JOYFUL obtained a better WF1 (68.45 vs. 68.03) than DAG-ERC with fine-tuned RoBERTa-large, demonstrating that fine-tuning a large-scale model helps obtain richer text features to improve performance. However, for a fair comparison with the other multimodal emotion recognition baselines (which do not involve a fine-tuning process (Joshi et al., 2022; Ghosal et al., 2019)) and to save the additional time spent on fine-tuning, we directly adopt Sentence-BERT as our text encoder for IEMOCAP.

F Pseudo-Code of JOYFUL
To make JOYFUL easier to understand, we provide its pseudo-code in Algorithm 1.

G Benjamini-Hochberg Correction
The Benjamini-Hochberg correction (B-H) (Benjamini and Hochberg, 1995) is a powerful tool for decreasing the false discovery rate. For the reproducibility of the multiple significance tests, we describe how we adopt the B-H correction and give the hyper-parameter values we used. We first conduct a t-test (Yang et al., 1999) with default parameters to calculate the p-value between each comparison method and JOYFUL. We then put the individual p-values in ascending order as input to the B-H procedure.
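The B-H step-up procedure can be sketched in a few lines (the example p-values are illustrative, not our experimental results):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: sort p-values in ascending
    order, find the largest rank k with p_(k) <= k/m * alpha, and reject
    the k hypotheses with the smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            k_max = rank
    return sorted(order[:k_max])   # indices of rejected hypotheses

rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20], alpha=0.05)
# rejects the hypotheses with p = 0.001 and p = 0.008
```

Unlike the stricter Bonferroni correction, B-H controls the false discovery rate rather than the family-wise error rate, so it retains more power when comparing against many baselines.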


Figure 1 :
Figure 1: Emotions are affected by multiple uni-modal, global contextual, and intra- and inter-person dependencies. Images are from the movie "Titanic".

Figure 2 :
Figure 2: Overview of JOYFUL. We first extract uni-modal features, fuse them using a multimodal fusion module, and use them as input to the GCL-based framework to learn better representations for emotion recognition.

Figure 3 :
Figure 3: An example of graph construction.
Figure 6: Visualization of emotion probabilities; in each pair, the first row is JOYFUL and the second row is COGMEN.

Figure 7 :
Figure 7: Example of adding p% high-order edges to explore the global topological information of the graph.

Figure 9 :
Figure 9: Parameter tuning for α and β on the validation datasets for all multimodal emotion recognition tasks.

Table 5 :
Results on MOSEI with the multimodal setting.

Table 6 :
Ablation study with different modalities. Table 6 shows that multiple modalities can greatly improve JOYFUL's performance compared with each single modality; GCL and each component of MF also contribute to the gains.

Table 7 :
Adversarial attacks on GNNs with different depths on 6-way IEMOCAP.

Table 11 :
Overall performance comparison on MOSEI with Text Modality.