Self-adaptive Context and Modal-interaction Modeling For Multimodal Emotion Recognition



Introduction
Emotion is a crucial part of human conversation. The emotion recognition in conversation task is to analyze each utterance in a conversation and give the corresponding emotion. This task has recently received more and more attention from researchers in both NLP and multimodal fields because of its potential applications, such as human-computer interaction and opinion mining in social media (Chatterjee et al., 2019; Majumder et al., 2020). Traditional paradigms for emotion recognition in conversation are based either on unrelated utterances in a dialogue or on a single modality, such as text. However, in many cases, people's emotions are elusive and cannot be delivered well by just one utterance or a single modality. As multimodality is closer to real-world application scenarios, multimodal emotion recognition in conversation has gained increasing research attention in recent years. To identify emotions more accurately, DialogueRNN (Majumder et al., 2019) first designs an RNN-based model which includes four GRUs to model both intra- and inter-speaker relations. DialogueGCN (Ghosal et al., 2019) then uses a graph neural network to model conversations. Later, MMGCN (Hu et al., 2021) proposes a graph-based method under the additional multimodal setting.
Figure 1: Motivation of the proposed method. This is an example from the IEMOCAP dataset that contains three different kinds of context dependencies, including long, short, and independent dependency, which are marked by red arrows, green arrows, and blue squares, respectively. Besides, the primary modality for the final prediction varies for different samples.
* Jianlong Wu is the corresponding author.
Although pioneering studies have achieved promising progress, they largely ignore both the varying difficulty of recognizing each utterance and the multimodal interaction in conversation, which leads to the following two limitations. First, existing methods treat all samples equally without considering their specific characteristics or recognition difficulty. For example, they lack detailed modeling of diverse dependency ranges, i.e., long, short, and independent context-specific representations for each utterance. As illustrated in Figure 1, some utterances in a conversation require a long-range dependency, while others only require a short-range dependency or can determine the emotion on their own. Existing methods do not model these varying dependency ranges separately.
Second, current approaches treat the contribution of each modality equally and simply concatenate the features of different modalities. However, the contribution of each modality varies, and it is of great importance to investigate the correlation and interaction among different modalities. In particular, Figure 1 illustrates the different contributions among modalities for different utterances, where the primary modality for recognition varies from case to case. We argue that it is necessary to explore the modality-specific contributions.
To address the above issues, we propose the Self-adaptive Context and Modal-interaction Modeling (SCMM) method for multimodal emotion recognition. First, to model different ranges of context dependency, we design the context representation module, which consists of three submodules: global, local, and direct mapping. Second, to account for the different contributions of various modalities, we propose the modal-interaction module, which also contains three submodules, namely full, partial, and biased interaction, to investigate the correlation among them. Thereafter, faced with multiple outputs from each module, we devise the self-adaptive path selection strategy to adaptively select an appropriate path to obtain the final representation for each utterance. We also put forward a contrastive learning loss to learn more discriminative representations. Finally, we conduct extensive experiments to validate the effectiveness of our approach.
Our main contributions are four-fold:
• We propose a novel SCMM framework for multimodal emotion recognition in conversation. A new contextual representation module is designed to model different kinds of relation dependency, including long, short, and independent dependency.
• To capture the specific contribution of each modality, we design the modal-interaction module, which consists of three submodules, including full, partial, and biased interactions, to fully investigate the correlation among different modalities.
• We devise the self-adaptive path selection strategy to adaptively select an appropriate path based on module outputs. Moreover, we present a cross-modal contrastive learning loss for discriminative feature learning.
• Extensive experiments on three multimodal emotion recognition datasets, including IEMOCAP, MELD, and MOSEI, demonstrate the superiority of our method. Specifically, on the IEMOCAP dataset under both settings, the absolute improvement over state-of-the-art methods is higher than 4.0%.
Related Work

Emotion Recognition in Conversation
Recent years have witnessed growing research interest in Emotion Recognition in Conversation (ERC) due to its wide range of potential applications (Sebe et al., 2005; Yalamanchili et al., 2021). With the development of streaming services, many ERC datasets such as IEMOCAP (Busso et al., 2008), MELD (Poria et al., 2019), and MOSEI (Bagher Zadeh et al., 2018) provide a new platform for ERC researchers.
To tackle the ERC task, DialogueRNN (Majumder et al., 2019) first proposes an RNN-based model which consists of four GRUs (Global, Speaker, Party, and Emotion) to keep track of the individual and global contextual states in the conversation simultaneously. Following that, DialogueGCN (Ghosal et al., 2019) presents a graph-based model that uses a context window to capture local contextual information. Later, DAG-ERC (Shen et al., 2021b) applies a GNN to construct directed acyclic graphs in conversations and an RNN to model local contextual representations. Moreover, COGMEN (Joshi et al., 2022) and MMGCN (Hu et al., 2021) adopt graph-based methods in the same period to model local and global contextual representations, respectively.
Previous work in ERC can be roughly divided into unimodal (Yu et al., 2019; Shen et al., 2021a; Wang et al., 2020) and multimodal approaches (Datcu and Rothkrantz, 2015; Wöllmer et al., 2010). The former uses a single textual modality in experiments, whereas the latter considers acoustic, textual, and visual modalities at the same time. We focus on the multimodal setting.

Multimodal Fusion
Multimodal fusion aims to make full use of the information in various modalities to improve the recognition results (Atrey et al., 2010; Bramon et al., 2011). This strategy is simple and effective and has drawn many researchers' attention. For example, in ERC scenarios, DialogueRNN (Majumder et al., 2019) first conducts experiments with a text-only setting but also concatenates multimodal features as an additional experiment. Furthermore, COGMEN (Joshi et al., 2022) follows the modality concatenation setting of DialogueRNN and designs a GNN model based on this setting. Moreover, MMGCN (Hu et al., 2021) and EmoCaps (Li et al., 2022) concatenate the modalities after passing each of them through a simple LSTM or linear layer.
However, the multimodal interactions in existing efforts are still very simple and inevitably lead to suboptimal performance. For example, COGMEN and MMGCN simply concatenate the features of different modalities. We argue that the contribution of different modalities varies and should be treated separately. It is of vital importance to exploit the modal interaction.

Problem Formulation
In ERC, a conversation is defined as a sequence of utterances $C = \{u_1, u_2, \ldots, u_n\}$, where $n$ is the number of utterances. Each utterance $u_i$ is labeled with a discrete value $y_i \in S$, where $S$ is the set of emotion labels. The task is to predict the emotion label $y_t$ for a given query utterance $u_t$, based on the dialogue context $u_1$ to $u_n$ and the corresponding speaker identities. Each conversation dataset $D$ contains $N$ dialogues and can be denoted as $D = \{C_j \mid j = 1, \ldots, N\}$.
In a general multimodal setting, each utterance $u_i$ consists of three modalities: audio, text, and video. It can therefore be expressed as $u_i = \{u_i^a, u_i^t, u_i^v\}$, where $u_i^a$, $u_i^t$, and $u_i^v$ denote the acoustic, textual, and visual features of the $i$-th utterance, with dimensions $d_a$, $d_t$, and $d_v$, respectively. The conversation-level features of each modality are denoted as $U^a$, $U^t$, and $U^v$.
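The notation above can be pictured as a small data structure. The following is only an illustrative sketch: the class name `Utterance`, its field names, and the helper `stack_modalities` are hypothetical and do not come from the paper.

```python
class Utterance:
    """One utterance u_i with its three modality features (hypothetical container)."""

    def __init__(self, audio, text, video, speaker, label):
        self.audio = audio      # u_i^a, a list of d_a floats
        self.text = text        # u_i^t, a list of d_t floats
        self.video = video      # u_i^v, a list of d_v floats
        self.speaker = speaker  # speaker identity
        self.label = label      # y_i, an index into the emotion label set S


def stack_modalities(conversation):
    """Collect per-utterance features into U^a, U^t, U^v (n rows each)."""
    U_a = [u.audio for u in conversation]
    U_t = [u.text for u in conversation]
    U_v = [u.video for u in conversation]
    return U_a, U_t, U_v
```

A conversation $C$ is then simply a Python list of such objects in chronological order, and a dataset $D$ is a list of conversations.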

Overview of the Proposed SCMM
As discussed in Section 1, existing methods do not consider the specific characteristics of diverse dependency ranges for different samples and simply concatenate multimodal features, leading to undesirable results. Therefore, we propose Self-adaptive Context and Modal-interaction Modeling (SCMM) for multimodal emotion recognition. As shown in Figure 2(a), our model first takes the features of each modality as input and obtains the context representation of each modality after passing through the context representation module. Then, the context-represented features of the modalities fully interact and complement each other in the modal-interaction module, after which we use the self-adaptive path selection module to select appropriate features and obtain the multimodal representation for final classification.
In the context representation module, we develop three submodules to obtain context representations for utterances with different dependency ranges. First, with the help of the attention mechanism, each utterance can attend to the information of other utterances, so we use a Transformer structure to extract the global context representation for a long dependency range. Second, the GRU structure contains a gate mechanism that can filter out information from long-distance utterances, so we use this unit to obtain the local contextual representation of the utterance for a short dependency range. Finally, for utterances that do not need the assistance of contextual information, we use a linear layer to extract the information. The arrows within each submodule of the context representation module, illustrated on the left of Figure 2(a), indicate how contextual information flows during the representation process.
For multimodal features, we also consider how difficult each utterance is for the model to recognize and model this with three modality interaction submodules. For simple utterances, e.g., sentences that contain emotional words, we directly concatenate all modality features and pass them through a linear layer. For slightly complex utterances, we use diverse combinations and interactions among modalities. For more difficult utterances, we take the text modality as the primary modality and the others as auxiliary modalities for interaction. An additional Transformer with a local attention mask is applied in this phase to leverage more modality information from adjacent utterances.

Context Representation Module
Integrating contextual information into the features of utterances is essential, but the demands to establish dependencies between different utterances vary. These dependencies can be summarized into three basic types: long, short, and independent dependency. Based on these different requirements, we design three submodules to consider each case separately.
Global Context Representation: People may discuss several topics in a conversation, and different topics may have different emotional vibes. The current utterance's emotion may be based on another topic raised a relatively long time ago, which is a long-distance emotional dependency relationship. We design the global context representation submodule to model this scenario. With the commonly used attention mechanism, each utterance can attend to other utterances regardless of distance, which ensures effectiveness during long-distance context representation. We use the following multi-head self-attention mechanism to capture global contextual information:
$$\mathrm{MultiHead}(Q, K, V) = [\mathrm{head}_1, \ldots, \mathrm{head}_k], \tag{1}$$
where $[\cdot, \ldots, \cdot]$ denotes concatenation along the feature dimension, and $Q$, $K$, and $V$ are feature matrices with $Q, K, V \in \mathbb{R}^{n \times d}$. For the self-attention mechanism, $Q$, $K$, and $V$ are derived from the input features with separate linear layers. They are divided equally into $k$ heads along the feature dimension, so the $i$-th head can be denoted as $\mathrm{head}_i = \mathrm{Attn}(Q_i, K_i, V_i)$ with $Q_i, K_i, V_i \in \mathbb{R}^{n \times d/k}$, and $\mathrm{Attn}$ is calculated by Eq. (2) for each head:
$$\mathrm{Attn}(Q_i, K_i, V_i) = \sigma\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d/k}}\right) V_i, \tag{2}$$
where $\sigma$ denotes the softmax operation.
For the dialogue features $U^x$ of each modality, where $x \in \{a, t, v\}$ and $U^x \in \mathbb{R}^{n \times d_x}$, the intermediate representation obtained by MultiHead is then passed through the commonly used residual connection, LayerNorm, and feed-forward layers to obtain the final output $U_g^x$ of this submodule, i.e., $U_g^a$, $U_g^t$, and $U_g^v$.
Local Context Representation: In multi-turn conversations, the emotion of a speaker's utterance may be influenced by adjacent utterances, which is a short-distance emotional dependency that occurs at a local scale. To handle this scenario, we design the local context representation submodule. The update mechanism of the Gated Recurrent Unit (GRU) ensures that each utterance integrates contextual information from closer utterances while forgetting information about farther utterances. Therefore, we use a bidirectional GRU network to obtain the local context representation of each utterance. For any modality input $U^x$, the local context representation feature $U_l^x$ is computed by:
$$U_l^x = \mathrm{BiGRU}(U^x).$$
We denote the features of each modality obtained by this submodule as $U_l^a$, $U_l^t$, and $U_l^v$, respectively.
Direct Mapping: For utterances that contain enough information on their own, the process of context representation may introduce additional noise. Therefore, we design the direct mapping submodule to directly extract information for each utterance through a linear layer as follows:
$$U_d^x = \mathrm{Linear}(U^x).$$
In this submodule, the output features of each modality are $U_d^a$, $U_d^t$, and $U_d^v$, respectively.
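To make the global context submodule more concrete, the following is a minimal pure-Python sketch of multi-head scaled dot-product self-attention. It is a simplification under stated assumptions, not the authors' implementation: the linear projections that produce $Q$, $K$, and $V$, the residual connection, LayerNorm, and feed-forward layers are omitted, and the scaling by the square root of the per-head dimension follows the standard Transformer convention. The function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_head)) V."""
    d_head = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


def multi_head(Q, K, V, k):
    """Split the feature dimension into k equal heads, attend, and concatenate."""
    d = len(Q[0])
    assert d % k == 0, "feature dimension must be divisible by the number of heads"
    step = d // k
    heads = []
    for h in range(k):
        sl = slice(h * step, (h + 1) * step)
        heads.append(attention([r[sl] for r in Q],
                               [r[sl] for r in K],
                               [r[sl] for r in V]))
    # Concatenate the head outputs of each utterance along the feature dimension.
    return [sum((head[i] for head in heads), []) for i in range(len(Q))]
```

Because every utterance attends to every other utterance, the distance between two utterances plays no role in the attention weights, which is the property the global submodule relies on.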

Modal-interaction Module
Given multimodal features $U^a$, $U^t$, and $U^v$, a multimodal interaction module takes these three features as input and outputs a multimodal feature $U^{atv}$. By effectively exploiting the potential complementarity of information among these modalities, the multimodal features can be more discriminative, allowing the model to perform better than unimodal models. Considering the different difficulties among utterances, we design different interaction submodules to handle simple, more complex, and difficult scenarios, respectively.
Full Interaction: For simple utterances and ideal cases where the three modalities $U^a$, $U^t$, and $U^v$ complement each other and each modality contains relatively equal information, we design the full interaction submodule, which concatenates the three modality features directly and uses a linear layer to extract the multimodal feature. We denote this feature as $U_f^{atv}$ and formulate it as follows:
$$U_f^{atv} = \mathrm{Linear}([U^a, U^t, U^v]).$$
Partial Interaction: For slightly complex utterances, the contribution of different modalities varies due to the lack of key information or the mixing of noise. In this regard, we design the partial interaction submodule to alleviate this problem through diversified modality interactions. Specifically, we combine $U^a$, $U^v$, and $U^t$ in pairs to obtain the $U^{at}$, $U^{vt}$, and $U^{av}$ features; for example, $U^{at}$ is obtained from $U^a$ and $U^t$. Finally, we concatenate all paired features and reduce the dimension with a linear layer. We denote this feature as $U_p^{atv}$.
Biased Interaction: For more difficult utterances, we design the biased interaction submodule. In previous work, many experiments have shown that textual modality features are critical to the performance of the final model in predicting emotions, which indicates that the textual modality contains the primary information in most cases. Therefore, in this interaction process, we first take the text as the primary modality and the others as auxiliary modalities to alleviate the information loss of text. Second, we use a small Transformer with a local attention mask to further leverage more modality information from adjacent utterances.
Specifically, the biased interaction submodule first concatenates $U^t$ with $U^a$ and $U^v$, respectively, to obtain $U_b^{at}$ and $U_b^{vt}$. These two features are concatenated after passing through their respective linear layers. A Transformer with a local attention mask is then applied to incorporate multimodal information within a local range of utterances.
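The concatenate-then-project pattern shared by the three interaction submodules can be sketched for a single utterance as follows. This is only an illustration under assumptions: the paired features are assumed to be formed by concatenation followed by a linear layer, the order of concatenation is arbitrary, and all function and parameter names are hypothetical. The Transformer stage of the biased submodule is not included.

```python
def linear(x, W, b):
    """y = W x + b for one feature vector x; W is a list of weight rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def full_interaction(u_a, u_t, u_v, W, b):
    """Concatenate all three modalities, then apply one linear layer."""
    return linear(u_a + u_t + u_v, W, b)


def partial_interaction(u_a, u_t, u_v, pair_params, W, b):
    """Fuse the modalities in pairs (at, vt, av), concatenate, and reduce."""
    pairs = {"at": u_a + u_t, "vt": u_v + u_t, "av": u_a + u_v}
    fused = []
    for name in ("at", "vt", "av"):
        W_pair, b_pair = pair_params[name]  # assumed per-pair linear layer
        fused += linear(pairs[name], W_pair, b_pair)
    return linear(fused, W, b)


def biased_pre_transformer(u_a, u_t, u_v, params_at, params_vt):
    """Text-centred fusion before the locally masked Transformer is applied."""
    u_at = linear(u_a + u_t, *params_at)
    u_vt = linear(u_v + u_t, *params_vt)
    return u_at + u_vt
```

In this sketch the three submodules differ only in which modalities are grouped before projection: all three at once, every pair, or only the pairs that contain text.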
Given $Q$ and $K$ from the self-attention mechanism, the attention mask is a binary matrix $M$ of dimension $n \times n$. $M_{i,j} = 1$ means that $Q_i$ can attend to $K_j$ during the attention process; otherwise, it cannot. In masked attention, the mask is applied within the attention computation through element-wise multiplication, denoted by $\odot$.
For the local attention mask used in this part, we define the parameters $w_p$ and $w_f$ for the length of the dependency context, and the binary vector $M_i \in \mathbb{R}^n$, whose $j$-th element specifies whether utterance $i$ can attend to utterance $j$ within this context length.
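One plausible reading of the local attention mask is sketched below. It assumes that $w_p$ and $w_f$ count the past and future utterances visible to utterance $i$, so that position $j$ is visible when $i - w_p \le j \le i + w_f$; this window form is an assumption, not a formula taken from the paper. The masked attention in the sketch also differs from the element-wise product described above: it uses the common implementation choice of excluding masked positions before the softmax. Function names are hypothetical.

```python
import math


def local_mask(n, w_p, w_f):
    """Binary n x n mask: row i is 1 for positions i - w_p .. i + w_f (assumed window)."""
    return [[1 if i - w_p <= j <= i + w_f else 0 for j in range(n)]
            for i in range(n)]


def masked_attention(Q, K, V, M):
    """Single-head attention in which row i only attends to positions with M[i][j] == 1."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        visible = [j for j in range(len(K)) if M[i][j] == 1]
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d) for j in visible]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * V[j][c] for e, j in zip(exps, visible))
                    for c in range(len(V[0]))])
    return out
```

With `w_p = w_f = 0` the mask reduces to the identity and every utterance sees only itself, while larger windows gradually approach full attention.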

Self-Adaptive Path Selection
To make the best use of the submodule outputs obtained in Sections 3.3 and 3.4, we design the self-adaptive path selection module to adaptively select the most appropriate route and integrate the outputs by group for the next stage. The path selection is done in a soft way, like an attention mechanism. As illustrated in Figure 2(b), for given features $X_1$, $X_2$, and $X_3$ of the same dimension, we first calculate the similarity between these features and a trainable parameter $Q_p$ to get the score of each feature. The scores are then normalized with the softmax operation and used as the weights of the features. Finally, we take the weighted average of these features as the final output, which we write as $\mathrm{Select}(X_1, X_2, X_3)$; in this formulation, $[\cdot, \cdot]$ denotes the feature concatenation operation.
In the context representation module, we denote the output of each modality after self-adaptive path selection as $U_c^a$, $U_c^t$, and $U_c^v$. In the modal-interaction module, we obtain the final multimodal feature $U_{all}^{atv}$ by $\mathrm{Select}(U_f^{atv}, U_p^{atv}, U_b^{atv})$.
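The soft selection described in this section can be sketched for one utterance as follows. The sketch assumes that the similarity is a plain dot product between each candidate feature and a trainable vector `q_p` of the same dimension; the exact scoring function and any scaling are not recoverable from the text, so this is an assumption, and the function name `select` simply mirrors the $\mathrm{Select}$ notation.

```python
import math


def select(features, q_p):
    """Soft path selection: softmax-weighted average of candidate features.

    features: list of candidate vectors (e.g. the three submodule outputs)
    q_p:      trainable query vector (assumed dot-product scoring)
    """
    scores = [sum(q * x for q, x in zip(q_p, f)) for f in features]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    fused = [sum(w * f[c] for w, f in zip(weights, features)) for c in range(dim)]
    return fused, weights
```

Returning the weights alongside the fused feature makes it possible to inspect which path dominates for a given utterance, in the spirit of the path-weight visualization discussed in the experiments.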

Cross-modal Contrastive Learning
We obtain the final prediction by passing $U_{all}^{atv}$ through a linear layer, and the final emotion label $\hat{Y}$ of the input dialogue $U$ is calculated by the softmax (denoted by $\sigma$) and $\arg\max$ operations:
$$\hat{Y} = \arg\max\big(\sigma(\mathrm{Linear}(U_{all}^{atv}))\big).$$
We first define the following classification loss:
$$\mathcal{L}_{cls} = -\frac{1}{\sum_{i=1}^{N} c(i)} \sum_{i=1}^{N} \sum_{j=1}^{c(i)} \log p_{i,j}[y_{i,j}],$$
where $N$ is the number of dialogues, $c(i)$ is the number of utterances in the $i$-th dialogue, $p_{i,j}$ is the probability distribution of utterance $j$ in the $i$-th dialogue, and $y_{i,j}$ is the expected class label of utterance $j$ in the $i$-th dialogue.
To improve the discriminability of multimodal features, we introduce a supervised cross-modal contrastive loss in the modal-interaction module. In this stage, all dialogues within the batch are flattened into utterance feature sequences. The supervised cross-modal contrastive loss is calculated for any two features of the same dimension $X_1, X_2 \in \mathbb{R}^{C \times d}$, where $C$ denotes the number of utterances in the current batch. In this loss, $|M_i|$ denotes the number of samples that have the same emotion label as the $i$-th sample, $\tau$ denotes the temperature defined in the original contrastive loss, and $\mathrm{sim}(x_{1,i}, x_{2,i})$ calculates the cosine similarity of the two vectors. The cross-modal contrastive loss $\mathcal{L}_{cc}$ between two feature sets $X_1$ and $X_2$ is then calculated from this supervised loss. We set text as the primary modality and assign the cross-modal contrastive loss to the three interaction submodules, which gives six loss terms. The overall training objective combines the classification loss with these terms, where $\beta$ is a constant that controls the loss weight.
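As an illustration of how such a loss can be computed, the sketch below follows the widely used supervised contrastive formulation, adapted to two feature sets. Several details are assumptions rather than facts from the paper: the positive set of sample $i$ is taken to include every sample with the same label (including $i$ itself), the per-anchor terms are averaged over the batch, the cross-modal loss is taken as the sum of both directions, the objective adds $\beta$ times the sum of the contrastive terms to the classification loss, and the default temperature of 0.1 is arbitrary. All function names are hypothetical, and feature vectors are assumed to be non-zero.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sup_contrastive(X1, X2, labels, tau=0.1):
    """Anchors from X1, candidates from X2; positives share the anchor's label."""
    C = len(X1)
    total = 0.0
    for i in range(C):
        logits = [cosine(X1[i], X2[k]) / tau for k in range(C)]
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        positives = [j for j in range(C) if labels[j] == labels[i]]
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
    return total / C


def cross_modal_loss(X1, X2, labels, tau=0.1):
    """Symmetric cross-modal loss (assumed to be the sum of both directions)."""
    return sup_contrastive(X1, X2, labels, tau) + sup_contrastive(X2, X1, labels, tau)


def total_objective(cls_loss, cc_losses, beta):
    """Assumed combination: classification loss plus beta-weighted contrastive terms."""
    return cls_loss + beta * sum(cc_losses)
```

Under these assumptions, features of two modalities that agree for same-label utterances yield a loss close to zero, while features that align utterances with different labels are penalized heavily.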

Dataset
We evaluated our method on three benchmark datasets, including IEMOCAP (Busso et al., 2008), MELD (Poria et al., 2019), and MOSEI (Bagher Zadeh et al., 2018), all of which are multimodal datasets with aligned acoustic, textual, and visual information for each utterance in a conversation. In the literature, two IEMOCAP settings are used, one with four emotions (IEMOCAP-4) and one with six emotions (IEMOCAP-6), so there are four benchmarks to be compared. For the train/validation/test splits, following previous work, we split IEMOCAP and MOSEI according to the setting of Joshi et al. (2022), and MELD according to the setting of Hu et al. (2021).

Feature Extraction
We extracted uniform features to ensure a fair comparison. For IEMOCAP, audio and video features are obtained in the same way as COGMEN (Joshi et al., 2022), and text features are re-extracted by sBERT. For MELD, audio features (size 300) are extracted by the OpenSmile toolkit with the IS10 configuration (Schuller et al., 2011), video features (size 600) are extracted by DenseNet (Huang et al., 2017) in the same way as MMGCN (Hu et al., 2021), and text features are extracted by sBERT. For MOSEI, audio features (size 640) are extracted using librosa with 640 filter banks, video features (size 35) are extracted by Facets, and text features are extracted by sBERT. The dimensions of the final extracted features for each dataset are presented in Table 2.

Compared Baselines
We compared against both unimodal and multimodal methods proposed in the emotion recognition field to verify the effectiveness of our model. For unimodal methods, our model was compared with three baselines: DialogueRNN (Majumder et al., 2019), DialogueGCN (Ghosal et al., 2019), and DAG-ERC (Shen et al., 2021b). For multimodal baselines, our model was compared with MMGCN (Hu et al., 2021), COGMEN (Joshi et al., 2022), and EmoCaps (Li et al., 2022). We reimplemented all these methods under the same experimental settings for a fair comparison. The BERT structure in the transformers library (Wolf et al., 2020) is adopted as the Transformer structure used in SCMM, and scipy (Virtanen et al., 2020) is used to calculate the F1-score. For more information, please refer to Appendix A.2.

Implementation Details
Our architecture trained on the IEMOCAP dataset has 304 million parameters and takes around 3 minutes to train for 55 epochs on one 2080Ti GPU.We fixed the random seed for all experiments to ensure the reproducibility of our experiments.
We trained our network using the Adam optimizer with a learning rate of 1e-4. The dependency context lengths $w_f$ and $w_p$ are set to 5 for IEMOCAP and 2 for MELD and MOSEI. In the biased interaction submodule, the numbers of Transformer layers used for IEMOCAP, MELD, and MOSEI are 6, 2, and 2, respectively. $\beta$ is set to 0.2 for MOSEI and 1 for the other datasets. These parameter values were selected by grid search.
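For reference, the hyperparameters stated above can be collected in one place. The values are taken from the text; the dictionary layout and key names are hypothetical and are not the format of any released configuration file.

```python
# Hyperparameters as stated in the implementation details; key names are illustrative.
CONFIGS = {
    "IEMOCAP": {
        "optimizer": "Adam",
        "learning_rate": 1e-4,
        "w_p": 5,  # dependency context length
        "w_f": 5,  # dependency context length
        "biased_transformer_layers": 6,
        "beta": 1.0,  # weight of the cross-modal contrastive loss
    },
    "MELD": {
        "optimizer": "Adam",
        "learning_rate": 1e-4,
        "w_p": 2,
        "w_f": 2,
        "biased_transformer_layers": 2,
        "beta": 1.0,
    },
    "MOSEI": {
        "optimizer": "Adam",
        "learning_rate": 1e-4,
        "w_p": 2,
        "w_f": 2,
        "biased_transformer_layers": 2,
        "beta": 0.2,
    },
}
```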

Main Results
Table 3 shows the results of our model compared with other models on several multimodal emotion conversation datasets. We have the following observations. On the one hand, our method achieves significant improvement over existing state-of-the-art methods. Specifically, our results are 6.84%, 4.44%, 2.36%, and 1.25% higher in absolute terms than the second-best result on IEMOCAP-6, IEMOCAP-4, MELD, and MOSEI, respectively, demonstrating the superiority of SCMM.
On the other hand, by comparing the results in the last two rows, we can see that the cross-modal contrastive learning loss brings consistent improvement on all these datasets, with an average improvement of about 0.8%. The reason is that the proposed contrastive loss benefits the learning of discriminative features and makes the margin between different classes clearer.

Effect of Submodules
We compared the effects of different context representation submodules and modal-interaction submodules. We divided these submodules into three parts based on their complexity: direct mapping with full interaction, local context representation with partial interaction, and global context representation with biased interaction. We then tested the effectiveness of these three parts. The results are shown in  one or two modalities except $U^v$, especially $U^t$, the performance will decrease significantly. The above results also verify that text is the primary modality for this task.

Effect of Self-adaptive Path Selection
The self-adaptive path selection is designed for the integration of features in different modules.To demonstrate that this module plays a key role in our model, we replaced it with an alternative implementation, where the input features are directly concatenated and then reduced in dimension by a linear layer, which we call the linear selection module.
Table 5 shows that replacing our self-adaptive path selection module with the linear selection module leads to performance losses on all datasets, suggesting that self-adaptive path selection yields better features. We also illustrate the weights of each path for several samples to gain deeper insight. As shown in Figure 3, in the context representation module, the global context representation submodule is the most important one. In the modal-interaction module, all the cases show that the biased and partial interaction submodules are the most important, which implies that modal interaction requires more diverse interaction strategies rather than directly concatenating multimodal features.

Influence of Feature Extractor
For the results in Table 3, we reimplemented all compared baseline methods and used the same extracted features to ensure a fair comparison, which may lead to results that differ from those reported in the original papers. To demonstrate the generalization ability of our method, we also conducted additional experiments on IEMOCAP-6 based on the features extracted by COGMEN (Joshi et al., 2022) and EmoCaps (Li et al., 2022). The detailed differences between the features can be found in the Appendix. The results are shown in Table 6. We can see that SCMM still achieves much better performance than these methods under their settings, validating its superiority and robustness.

Parameter Sensitivity Analysis
According to the training objective in Eq. (15), there is mainly one parameter, $\beta$, which controls the contribution of the cross-modal contrastive learning loss. In experiments, we find the optimal value for $\beta$ by grid search. We present the results of our method on IEMOCAP-4 with respect to different values of $\beta$ in Figure 4. We can observe that our method is relatively stable when $\beta$ varies in the range of [0.8, 1.2], which shows that SCMM is insensitive to this parameter within a certain range.

Conclusion
In this paper, we propose the self-adaptive context and modal-interaction modeling method for the task of multimodal emotion recognition. We first design the context representation module with global modeling, local modeling, and direct mapping to address long, short, and independent dependencies. Then, the modal-interaction module uses full, partial, and biased interactions to fully investigate the correlation and potential complementarity among different modalities. Finally, we propose the self-adaptive path selection module for better combination and a cross-modal contrastive learning loss for discriminative feature learning. Extensive experiments on three datasets under four settings demonstrate the effectiveness and superiority of our proposed method.

Limitations
Our proposed method is an offline system in which the input is a dialogue containing all utterances rather than single utterances arriving in chronological order. An online system for emotion recognition can be applied in real-time conference systems or human-computer interaction, so online systems have potential value for future research. Our method can be built into online systems by creating buffer systems such as history windows. However, all the past baseline methods, such as COGMEN and DialogueRNN, are offline systems. In addition, the form of the datasets also leads us to construct an offline system for training and testing. On the other hand, offline systems also have application scenarios, such as analyzing emotions in posted videos and opinion mining in social media. Therefore, our method only builds an offline system under the offline experimental setting, in which it can be compared and evaluated.
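The history-window idea mentioned above could look roughly like the following. This is a hypothetical sketch of a buffer only; the class name and interface are assumptions, and no online variant of the model is described in the paper.

```python
from collections import deque


class HistoryWindow:
    """Keeps the most recent utterance features for an online setting (illustrative)."""

    def __init__(self, max_len):
        # Older utterances are dropped automatically once max_len is reached.
        self.buffer = deque(maxlen=max_len)

    def push(self, utterance_features):
        """Add the newest utterance and return the current window, oldest first."""
        self.buffer.append(utterance_features)
        return list(self.buffer)
```

Each time a new utterance arrives, the returned window would play the role of the dialogue passed to the offline model, with the last element as the query utterance.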
Besides, the input of our method is feature-based. The original text, audio, and video files first pass through feature extractors to obtain multimodal features, which may cause information loss and hurt performance. We focus on feature-based training because training on the original files is costly. For example, training a video encoder generally requires several V100 GPUs and days of training time. Therefore, like the baseline methods we compare against, we adopt feature-based training. When the cost permits, training on the source files is worth exploring in future work. With feature-based training, different baseline methods use different feature extractors to obtain features, leading to a lack of fairness in method comparison. In this regard, we reimplemented all open-source methods and compared them using a unified feature file to ensure the fairness of the experimental results. At the same time, we also conducted evaluations with different feature files to verify the generalization of the method.

A.1 Datasets and Feature Extraction
We summarize the statistics of these three datasets in Table 1. All datasets used are commonly used for emotion recognition in the English language. The IDs of the data are anonymized by sequential IDs or random hash values.
IEMOCAP: IEMOCAP is a multimodal dataset that contains approximately 12 hours of videos for human emotion recognition analysis. Each video consists of a single dyadic dialogue, and every utterance in a conversation is annotated with an emotion label from six categories: happy, sad, neutral, angry, excited, and frustrated. IEMOCAP has two settings, one with four emotion classes (angry, sad, happy, neutral) and one with six emotion classes (happy, sad, neutral, angry, excited, and frustrated). We conducted experiments on both of these settings. The IEMOCAP dataset uses its own license, and we have obtained the authorization of the Signal Analysis and Interpretation Laboratory required for accessing and using the dataset.
MELD: MELD is a large-scale multimodal and multi-speaker emotional dialogue dataset collected from the Friends TV series. There are more than 1.4k dialogues in the dataset, and the dialogues involve multiple speakers instead of only two. Each utterance in a conversation is annotated with an emotion label from seven categories: anger, disgust, sadness, joy, neutral, surprise, and fear. It uses the GNU General Public License v3.0.
MOSEI: MOSEI is an emotion recognition dataset made up of 23k sentence utterance video clips taken from YouTube. Unlike multi-speaker datasets such as IEMOCAP and MELD, MOSEI has only one speaker in a video clip. Each utterance is annotated with an emotion label from six categories: happiness, sadness, disgust, fear, surprise, and anger. CMU-MOSEI also uses its own license, which declares that the dataset is free for anyone.
We extracted uniform features to ensure a fair comparison. For IEMOCAP, audio and video features are obtained in the same way as COGMEN (Joshi et al., 2022), and text features are re-extracted by sBERT. For MELD, audio features (size 300) are extracted by the OpenSmile toolkit with the IS10 configuration (Schuller et al., 2011), video features (size 600) are extracted by DenseNet (Huang et al., 2017) in the same way as MMGCN (Hu et al., 2021), and text features are extracted by sBERT. For MOSEI, audio features (size 640) are extracted using librosa with 640 filter banks, video features (size 35) are extracted by Facets, and text features are extracted by sBERT.
The distribution of the data used in our evaluation may have some bias. For example, IEMOCAP comes from performances by actors, and MELD is obtained from the TV series Friends. In real-world scenarios, conversations may be more complex: the position of the camera may be more variable, the types of emotions may be more varied, some modalities of the collected data may be missing, and so on. However, all baselines we compared are evaluated on these datasets. In the future, datasets in the wild or collected from natural scenes can be considered to verify the effectiveness of our algorithms.

A.2 Baselines and Implementation
DialogueGCN (Ghosal et al., 2019): it leverages self and inter-speaker dependency based on a graph convolutional network. Each node of the graph represents the features of an individual utterance encoded by a bi-LSTM, and the edges between a pair of nodes are constructed according to the dependency between speakers within a sliding window. Because DialogueGCN uses only the text modality, we simply concatenated the features of the three modalities as its input to make it comparable to SCMM.
DialogueRNN (Majumder et al., 2019): it employs four gated recurrent units (GRUs), namely the global, party, speaker, and emotion GRUs, to model the speaker, the context, and the emotion of the preceding utterances. Specifically, the global, party, and speaker GRUs update the context, party state, and speaker state, respectively. The emotion GRU is used to model the emotionally relevant representations.
DAG-ERC (Shen et al., 2021b): it models the conversation context through a directed acyclic graph with constraints on speaker identity and positional relations. Furthermore, DAG-ERC gathers contextual information for utterances in a single layer based on a directed acyclic graph neural network.
COGMEN (Joshi et al., 2022): it models local information in a dialogue with a GNN and uses GraphTransformers to fuse multiple modalities. However, instead of exploiting the intrinsic connections between features of different modalities, COGMEN simply concatenates them and therefore gains little in multimodal settings.
MMGCN (Hu et al., 2021): it utilizes both multimodal and long-distance contextual information based on a graph convolutional network. In addition, MMGCN constructs a graph in each modality and builds edges between nodes corresponding to the same utterance across modalities. Although it achieves good results on IEMOCAP and MELD, it still treats different modalities in nearly the same way, which somewhat reduces its performance on multimodal tasks.
EmoCaps (Li et al., 2022): it designs a Transformer-based model named Emoformer for feature extraction. After feature extraction, the features of the three modalities are concatenated. Finally, a model based on bi-LSTM layers is applied for emotion prediction.
We used PyTorch to implement SCMM and to reimplement all these methods. The BERT structure in the transformers library (Wolf et al., 2020) is adopted as the Transformer structure used in SCMM, and scipy (Virtanen et al., 2020) is used to calculate the F1-score. Our architecture trained on the IEMOCAP dataset has 304 million parameters and takes around 3 minutes to train for 55 epochs on one 2080Ti GPU. We fixed the random seed in all experiments to ensure reproducibility.
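To make the reported metric concrete, the following is a minimal pure-Python sketch of an F1-score over utterance-level predictions. It is an illustration only, not the evaluation code used in the experiments: the averaging scheme is not stated in this appendix, so support-weighted averaging is an assumption here, and the function name `weighted_f1` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (assumed averaging scheme)."""
    assert len(y_true) == len(y_pred) and len(y_true) > 0
    support = Counter(y_true)  # number of gold utterances per emotion class
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1  # weight each class by its support
    return total / len(y_true)


# Toy labels, made up for illustration only.
gold = ["happy", "happy", "sad", "sad"]
pred = ["happy", "sad", "sad", "sad"]
score = weighted_f1(gold, pred)
```

Weighting by support keeps frequent classes such as neutral from being under-represented in the average; a macro average would instead treat all classes equally.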

A.3 Visualization of Contrastive Learning Features
We adopted t-SNE to visualize the features before and after adding the cross-modal contrastive learning loss. As shown in Figure 5, our contrastive learning loss widens the gap among different classes, leading to more discriminative feature representations.

A.4 Error Analysis
After analyzing the predictions on the dataset, we found that our model's errors mainly come from confusing similar emotions. As shown in Figure 6, most misclassified happy samples are predicted as excited, and most misclassified frustrated samples are predicted as angry. These problems also exist in DialogueRNN, COGMEN, and DAG-ERC. Although our final results improve on previous work, the model still cannot avoid such prediction bias.
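This kind of error analysis can be sketched in a few lines: for each gold emotion, count which wrong label the model predicts most often. The snippet below is a hypothetical illustration in pure Python; the helper name `top_confusions` and the toy labels are assumptions, not data or code from the experiments.

```python
from collections import Counter


def top_confusions(y_true, y_pred):
    """Map each gold label to its most frequent wrong prediction and the count."""
    errors = {}
    for t, p in zip(y_true, y_pred):
        if t != p:  # only misclassified utterances contribute
            errors.setdefault(t, Counter())[p] += 1
    return {t: c.most_common(1)[0] for t, c in errors.items()}


# Toy labels, made up to mimic the happy/excited and frustrated/angry confusions.
gold = ["happy", "happy", "happy", "frustrated", "frustrated", "sad"]
pred = ["excited", "excited", "happy", "angry", "frustrated", "sad"]
confusions = top_confusions(gold, pred)
```

Classes that are never misclassified, such as sad in the toy data, do not appear in the result.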

Figure 2: Overview of the proposed SCMM. (a) The overall framework. After feature extraction, the multimodal features pass through the context representation, modal-interaction, and self-adaptive path selection modules in turn, and the classifier produces the final predictions. (b) The internal structure of the self-adaptive path selection module drawn in (a). (c) The diagram of the local attention mask.
Eventually, we obtain the local attention mask $M = [M_1, M_2, \ldots, M_n]$, where $M \in \mathbb{R}^{n \times n}$. The final multimodal feature $U^{atv}_{b}$ is obtained after the Transformer with the local attention mask $M$.
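As a rough illustration of how an n x n local attention mask can be assembled row by row, consider the sketch below. The definition of each row M_i precedes this fragment and is not reproduced here, so the rule used in the code is an assumption: utterance i may attend to utterance j only when |i - j| is at most a hypothetical half-width `window`, with 1 marking an allowed position. The function name `local_attention_mask` is likewise hypothetical.

```python
def local_attention_mask(n, window):
    """Build an n x n 0/1 mask; row i plays the role of M_i.

    Assumed rule (not taken from the paper): position j is visible from
    position i when |i - j| <= window.
    """
    return [
        [1 if abs(i - j) <= window else 0 for j in range(n)]
        for i in range(n)
    ]


# A conversation of 4 utterances where each one sees its direct neighbours.
mask = local_attention_mask(4, 1)
```

With `window=1`, each utterance attends to itself and its immediate neighbours, which corresponds to a short-range context; a larger `window` approaches full attention.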

Table 5: Comparison of experimental results using the self-adaptive path selection module and the linear selection module.

Figure 3: Illustration of the weights computed by the self-adaptive path selection module for two utterances.

Figure 5: t-SNE representation of IEMOCAP-6 before and after applying $L_{cc}$.

Table 1: Statistics of three datasets under four settings.

Table 2: Feature dimensions of each dataset.
Statistics for these three datasets are summarized in Table 1. For more information, please refer to Appendix A.1.

Table 4: Ablation study of our method.

Table 6: Comparison of results under other methods' feature extraction settings on IEMOCAP-6.