UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition

Multimodal sentiment analysis (MSA) and emotion recognition in conversation (ERC) are key research topics for computers to understand human behaviors. From a psychological perspective, emotions are the expression of affect or feelings during a short period, while sentiments are formed and held over a longer period. However, most existing works study sentiment and emotion separately and do not fully exploit the complementary knowledge behind the two. In this paper, we propose a multimodal sentiment knowledge-sharing framework (UniMSE) that unifies the MSA and ERC tasks at the level of features, labels, and models. We perform modality fusion at the syntactic and semantic levels and introduce contrastive learning between modalities and samples to better capture the difference and consistency between sentiments and emotions. Experiments on four public benchmark datasets, MOSI, MOSEI, MELD, and IEMOCAP, demonstrate the effectiveness of the proposed method, which achieves consistent improvements over state-of-the-art methods.


Introduction
With the rapid development of multimodal machine learning (Liang et al., 2022; Baltrušaitis et al., 2018) and dialog systems (He et al., 2022a,b,c), Multimodal Sentiment Analysis (MSA) and Emotion Recognition in Conversations (ERC) have become key technologies for machines to perceive, recognize, and understand human behaviors and intents (Zhang et al., 2021a,b; Hu et al., 2021a,b). Multimodal data provide not only verbal information, such as textual (spoken words) features, but also non-verbal information, including acoustic (prosody, rhythm, pitch) and visual (facial attributes) features. These different modalities allow the machine to make decisions from different perspectives, thereby achieving more accurate predictions (Ngiam et al., 2011). The goal of MSA is to predict sentiment intensity or polarity, while ERC aims to predict predefined emotion categories. There are many research directions in MSA and ERC, such as multimodal fusion (Yang et al., 2021), modality alignment (Tsai et al., 2019a), context modeling (Mao et al., 2021), and external knowledge (Ghosal et al., 2020). However, most existing works treat MSA and ERC as separate tasks, ignoring the similarities and complementarities between sentiments and emotions.
On the one hand, from a psychological perspective, both sentiments and emotions are experiences that result from the combined influences of biological, cognitive, and social factors (Stets, 2006), and they can be expressed similarly. In Figure 1, we illustrate how sentiments and emotions are related in the verbal and non-verbal channels and can be projected into a unified embedding space. On the other hand, emotions are reflections of a perceived change in the present within a short period (Batson et al., 1992), while sentiments are formed and held over longer periods (Murray and Morgan, 1945). In our preliminary study, we found that the average video duration in MSA is almost twice that in ERC, which is consistent with the above definitions. A variety of psychological literature (Davidson et al., 2009; Ben-Ze'ev, 2001; Shelly, 2004) explains the similarities and differences between sentiment and emotion. Munezero et al. (2014) also investigate the relevance and complementarity between the two and point out that analyzing sentiment and emotion together can lead to a better understanding of human behaviors.
Based on the above motivation, we propose a multimodal sentiment knowledge-sharing framework that unifies the MSA and ERC tasks (UniMSE). UniMSE reformulates MSA and ERC as a generative task to unify input, output, and task. We extract and unify audio and video features, and formalize MSA and ERC labels into Universal Labels (UL) to unify sentiment and emotion.
Besides, previous works lack multimodal fusion over multi-level textual features (Peters et al., 2018; Vaswani et al., 2017), such as syntax and semantics. Therefore, we propose a pre-trained modality fusion layer (PMF) and embed it into the Transformer (Vaswani et al., 2017) layers of T5 (Raffel et al., 2020), fusing acoustic and visual information with textual features at different levels to probe richer information. Last but not least, we perform inter-modality contrastive learning (CL) to minimize intra-class variance and maximize inter-class variance across modalities.
Our contributions are summarized as follows: 1. We propose a multimodal sentiment knowledge-sharing framework (UniMSE) that unifies the MSA and ERC tasks. The proposed method exploits the similarities and complementarities between sentiments and emotions for better prediction.

Related Work
Multimodal Sentiment Analysis (MSA) MSA aims to predict sentiment polarity and sentiment intensity under a multimodal setting (Morency et al., 2011). MSA research can be divided into four groups. The first is multimodal fusion. Early works on multimodal fusion mainly operate via geometric manipulation in the feature space (Zadeh et al., 2017). More recent works develop a reconstruction loss (Hazarika et al., 2020) or hierarchical mutual information maximization (Han et al., 2021) to optimize multimodal representations. The second group focuses on modality consistency and difference through multi-task joint learning (Yu et al., 2021a) or translating from one modality to another (Mai et al., 2020). The third is multimodal alignment.
Unified Framework In recent years, unifying related but different tasks into a single framework has achieved significant progress (Chen et al., 2022; Xie et al., 2022; Zhang et al., 2022c). For example, T5 (Raffel et al., 2020) unifies various NLP tasks by casting all text-based language problems into a text-to-text format and achieves state-of-the-art results on many benchmarks. More recently, several works (Wang et al., 2021a; Cheng et al., 2021b) have continued this unification trend.

Method

Overall Architecture
As shown in Figure 2, UniMSE comprises task formalization, pre-trained modality fusion, and inter-modality contrastive learning. First, we process the labels of the MSA and ERC tasks offline into the universal label (UL) format. Then we separately extract audio and video features using feature extractors that are unified across datasets. After obtaining the audio and video features, we feed them into two individual LSTMs to exploit long-term contextual information. For the textual modality, we use T5 as the encoder to learn contextual information over the sequence. Unlike previous works, we embed multimodal fusion layers into T5, each following the feed-forward layer in one of several Transformer layers. Besides, we perform inter-modality contrastive learning to differentiate the multimodal fusion representations among samples. Specifically, contrastive learning aims to narrow the gap between modalities of the same sample and push the modality representations of different samples further apart.

Task Formalization
Given a multimodal signal I_i^m, m ∈ {t, a, v}, representing the unimodal raw sequence drawn from video fragment i, where {t, a, v} denote the three modality types: text, acoustic, and visual. MSA aims to predict a real number y_i^r ∈ R that reflects the sentiment strength, and ERC aims to predict the emotion category of each utterance. MSA and ERC are unified in input features, model architecture, and label space through task formalization. Task formalization contains input formalization and label formalization: input formalization processes the dialogue text and modality features, and label formalization unifies the MSA and ERC tasks by transferring their labels into universal labels. Furthermore, we formalize MSA and ERC as a generative task to unify them in a single architecture.

Input Formalization
Contextual information in conversation is especially important for understanding human emotions and intents (Lee and Lee, 2021; Hu et al., 2022). Based on this observation, we concatenate the current utterance u_i with its former 2-turn utterances {u_{i-1}, u_{i-2}} and its latter 2-turn utterances {u_{i+1}, u_{i+2}} as the raw text. Additionally, we set a segment id S_i^t to distinguish utterance u_i from its contexts in the textual modality. The utterances are processed into the format I_i^t, and we take I_i^t as the textual modality of I_i. Furthermore, we process the raw acoustic input into numerical sequential vectors with librosa (https://github.com/librosa/librosa), extracting the Mel-spectrogram as audio features; it is the short-term power spectrum of sound and is widely used in modern audio processing. For video, we extract a fixed number of T frames from each segment and use EfficientNet (Tan and Le, 2019), pre-trained (supervised) on the VGGFace (https://www.robots.ox.ac.uk/~vgg/software/vgg_face/) and AFEW datasets, to obtain video features.
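The context-window construction described above can be sketched as follows. This is a hypothetical sketch: the separator token and the exact segment-id encoding are not specified in the text, so both are assumptions here.

```python
def formalize_input(utterances, i, window=2):
    """Build the textual input I_i^t for utterance u_i by concatenating
    its previous and following `window` turns, plus a per-turn segment
    id marking the current utterance. The "</s>" separator and the 0/1
    segment-id encoding are illustrative assumptions."""
    lo = max(0, i - window)
    hi = min(len(utterances), i + window + 1)
    context = utterances[lo:hi]
    text = " </s> ".join(context)
    # segment id S_i^t: 1 for the current utterance's turn, 0 for context turns
    segment_ids = [1 if lo + k == i else 0 for k in range(len(context))]
    return text, segment_ids
```

For an utterance near the start of a dialogue, the window is simply clipped, so the first utterance keeps only its two following turns as context.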

Label Formalization
To break the information boundary between MSA and ERC, we design a universal label (UL) scheme and take the UL as the target sequence of UniMSE. The universal label aims to fully exploit the knowledge about sentiment and emotion shared between MSA and ERC. A universal label y_i = (y_i^p, y_i^r, y_i^c) is composed of a sentiment polarity y_i^p ∈ {positive, negative, neutral} contained in both MSA and ERC, a sentiment intensity y_i^r (the supervision signal of the MSA task, a real number ranging from -3 to +3), and an emotion category y_i^c (the supervision signal of the ERC task, a predefined emotion category). We align samples with similar semantics (like 1.6 and joy), where one is annotated with sentiment intensity and the other with an emotion category. After the alignment of the label space, each sample's label is formalized into the universal label format. Next, we introduce in detail how we unify MSA and ERC in the label space. First, we classify the samples of MSA and ERC into positive, neutral, and negative sample sets according to their sentiment polarity. Then we calculate the similarity between two samples that have the same sentiment polarity but belong to different annotation schemes, thereby completing the missing part of the universal label. We show an example in Figure 3. Given an MSA sample m_2, it carries a positive sentiment and an annotation score of 1.6. Benchmarked against the universal label format, m_2 lacks an emotion category label. In this example, e_1 has the maximal semantic similarity to m_2, so we assign the emotion category of e_1 as m_2's emotion category.
Previous works (Tsai et al., 2019a; Yang et al., 2021) have demonstrated that the textual modality is more indicative than the other modalities, so we adopt textual similarity as the semantic similarity between samples. Specifically, we utilize SimCSE (Gao et al., 2021), a strong sentence embedding framework, to calculate the semantic similarity of two texts for completing the universal label. Similarly, each sample in the ERC datasets is assigned a real number by finding the MSA sample most similar to it. After our formalization, the samples of MSA and ERC are processed into {(I_0, y_0), (I_1, y_1), ..., (I_N, y_N)}, where I_i denotes the raw multimodal signal of sample i and y_i denotes its universal label. We can obtain the predictions for MSA and ERC by decoding the predicted UL. Additionally, we evaluate the quality of the automatically generated part of the universal labels: we randomly selected 80 samples with universal labels from MOSI and manually evaluated the generated labels used for universal label completion; the accuracy is about 90%.
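The similarity-based label completion can be sketched as below. The `embed` argument stands in for a sentence encoder such as SimCSE (any text-to-vector function works here); the dictionary-shaped samples and helper names are our own illustrative conventions, not the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def complete_emotion(msa_text, msa_polarity, erc_pool, embed):
    """Fill the missing emotion slot of an MSA sample by copying the
    emotion of the most semantically similar ERC sample that shares
    the same sentiment polarity."""
    candidates = [s for s in erc_pool if s["polarity"] == msa_polarity]
    anchor = embed(msa_text)
    best = max(candidates, key=lambda s: cosine(anchor, embed(s["text"])))
    return best["emotion"]
```

Completing the missing intensity of an ERC sample works symmetrically, copying the score of the most similar MSA sample.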

Pre-trained Modality Fusion (PMF)
Unlike previous works that use a pre-trained model (such as T5) only as a text encoder, we embed multimodal fusion layers into the pre-trained model. In this way, multi-level textual features (Peters et al., 2018; Vaswani et al., 2017), from low-level syntactic features encoded by the shallow Transformer layers to high-level semantic features encoded by the deep Transformer layers, are fused with audio and video features into a multimodal representation. Besides, injecting audio and vision into T5 can probe relevant information in the massive pre-trained text knowledge, thereby incorporating richer pre-trained understanding into the multimodal fusion representation. We name this multimodal fusion process pre-trained modality fusion (PMF).
We use T5 as the backbone of UniMSE. T5 contains multiple stacked Transformer layers, and each Transformer layer of the encoder and decoder contains a feed-forward layer. The multimodal fusion layer is placed after this feed-forward layer. The PMF unit in the first fused Transformer layer of T5 receives a triplet M_i = (X_i^t, X_i^a, X_i^v) as input, where X_i^m ∈ R^{l_m × d_m} denotes the modality representation of I_i^m, m ∈ {t, a, v}, and l_m and d_m are the sequence length and representation dimension of modality m, respectively. We view the multimodal fusion layer as an adapter (Houlsby et al., 2019) inserted into T5 to optimize parameters specific to multimodal fusion. The multimodal fusion layer receives the modality representation triplet M_i and maps the concatenated multimodal representation back to the layer's input size. Specifically, we concatenate the three modality representations and feed the concatenation into a down-projection and an up-projection layer to fuse the representations. For the j-th PMF, the multimodal fusion is given by

F_i^(j) = F_i^(j-1) ⊙ (σ([F_i^(j-1); X_i^a; X_i^v] W_d + b_d) W_u + b_u),

where [·;·] is the concatenation operation on the feature dimension, σ is the Sigmoid function, {W_d, W_u, b_d, b_u} are learnable parameters, F_i^(j-1) denotes the fusion representation after (j-1) Transformer layers, and ⊙ denotes element-wise addition. The output of the fusion layer is then passed directly into the following layer normalization (Ba et al., 2016).
Although we could embed a multimodal fusion layer in every Transformer layer of T5's encoder and decoder, doing so may bring two shortcomings: 1) it disturbs the encoding of text sequences, and 2) it may cause overfitting, as more parameters are introduced for the multimodal fusion layers. Considering these issues, we use the first j Transformer layers to encode the text, and only the remaining Transformer layers are injected with the non-verbal (i.e., acoustic and visual) signals.
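A minimal numpy sketch of one adapter-style fusion step, following the concatenate / down-project / up-project / element-wise-add description above. It assumes all modality representations already share the same sequence length (the model aligns them with 1D convolutions), and the parameter names and shapes are illustrative; this is a sketch of the mechanism, not the paper's implementation.

```python
import numpy as np

def pmf_layer(F_prev, X_a, X_v, W_d, b_d, W_u, b_u):
    """One pre-trained modality fusion (PMF) step as a bottleneck
    adapter: concatenate the text-stream representation with the
    acoustic and visual representations along the feature dimension,
    down-project through a sigmoid, up-project back to the text hidden
    size, and add the result to the text stream element-wise."""
    concat = np.concatenate([F_prev, X_a, X_v], axis=-1)  # (l, d_t + d_a + d_v)
    h = 1.0 / (1.0 + np.exp(-(concat @ W_d + b_d)))       # sigmoid down-projection
    fused = h @ W_u + b_u                                 # up-projection to d_t
    return F_prev + fused                                 # element-wise addition
```

In the full model this output then flows into the Transformer layer's layer normalization, so the adapter only perturbs the text stream rather than replacing it.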

Inter-modality Contrastive Learning
Contrastive learning (CL) has achieved major advances in representation learning by viewing samples from multiple views (Gutmann and Hyvärinen, 2010; Khosla et al., 2020; Gao et al., 2021). The principle of contrastive learning is that an anchor and its positive sample should be pulled closer, while the anchor and negative samples should be pushed apart in feature space. In our work, we perform inter-modality contrastive learning to enhance the interaction between modalities and magnify the differentiation of fusion representations among samples. To ensure that each element of the input sequence is aware of its context, we process each modality representation to the same sequence length, passing the acoustic representation X_i^a, visual representation X_i^v, and fusion representation F_i^(j) through 1D temporal convolutional layers:

X̂_i^u = Conv1D(X_i^u, k_u), u ∈ {a, v},  F̂_i^(j) = Conv1D(F_i^(j), k_f),

where F_i^(j) is obtained after j Transformer layers containing pre-trained modality fusion, k_u is the size of the convolutional kernel for modality u, u ∈ {a, v}, and k_f is the kernel size for the fusion modality.
We construct each mini-batch with K samples (each sample consists of acoustic, visual, and textual modalities). Previous works (Han et al., 2021; Tsai et al., 2019a) have shown that the textual modality is more important than the other two, so we take the textual modality as the anchor and the other two modalities as its augmented versions.
For each anchor, the randomly sampled pairs in a batch consist of two positive pairs and 2K negative pairs. Here, the positive pairs are the text-acoustic pair and the text-visual pair from the same sample. The negative pairs are composed of the text and the acoustic or visual modalities of the other samples. For each anchor sample, the self-supervised contrastive loss is formulated as the sum of L_{ta,j} and L_{tv,j}, which represent the contrastive losses of the text-acoustic and text-visual pairs computed on the j-th Transformer layer of the encoder, respectively.
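The scheme above can be made concrete with an InfoNCE-style sketch: text is the anchor, the same sample's acoustic and visual representations are positives, and the other samples' acoustic/visual representations are negatives. The temperature value and the cosine similarity normalization are assumptions, since the section does not spell them out.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, tau=0.1):
    """One InfoNCE term: pull the positive toward the anchor and push
    the negatives away. tau is a temperature (value assumed)."""
    pos = math.exp(cos_sim(anchor, positive) / tau)
    neg = sum(math.exp(cos_sim(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

def inter_modality_loss(T, A, V, tau=0.1):
    """Text representations T are anchors; the same sample's acoustic
    (A) and visual (V) representations are positives, while the other
    samples' acoustic/visual representations serve as negatives."""
    K = len(T)
    total = 0.0
    for i in range(K):
        negs = [A[j] for j in range(K) if j != i] + [V[j] for j in range(K) if j != i]
        total += info_nce(T[i], A[i], negs, tau) + info_nce(T[i], V[i], negs, tau)
    return total / K
```

When the modalities of each sample are well aligned, the loss is small; shuffling which acoustic representation belongs to which text increases it.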

Grounding UL to MSA and ERC
During the training phase, we use the negative log-likelihood to optimize the model, which takes the universal label as the target sequence. The overall loss function can be formulated as

L = L_task + Σ_j (α · L_{ta,j} + β · L_{tv,j}),

where L_task denotes the generative task loss, j is the index of the Transformer layer of the encoder, and {α, β} are weights between 0 and 1. Moreover, during inference, we use a decoding algorithm (Appendix A.2) to convert the output sequence into a real number for MSA and an emotion category for ERC.
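The combination of the generative loss with the per-layer contrastive terms can be sketched as follows; the container mapping layer indices to loss pairs is our own convention, while α = β = 0.5 follows the experimental settings reported later.

```python
def total_loss(task_loss, cl_losses, alpha=0.5, beta=0.5):
    """Combine the generative task loss L_task with the per-layer
    contrastive terms. `cl_losses` maps an encoder-layer index j to
    the pair (L_ta_j, L_tv_j)."""
    return task_loss + sum(alpha * l_ta + beta * l_tv
                           for l_ta, l_tv in cl_losses.values())
```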
Datasets

MOSI contains 2,199 utterance video segments, each manually annotated with a sentiment score ranging from -3 to +3 to indicate the sentiment polarity and relative sentiment strength of the segment. MOSEI is an upgraded version of MOSI, annotated with both sentiment and emotion. MOSEI contains 22,856 movie review clips from YouTube. Most existing studies use only MOSEI's sentiment annotations; since MOSEI's emotion annotations are multi-label, we do not use them although they are available. Note that there is no overlap between MOSI and MOSEI, and the data collection and labeling processes for the two datasets are independent.
IEMOCAP consists of 7,532 samples. Following previous works (Wang et al., 2019; Hu et al., 2022), we select six emotions for emotion recognition: joy, sadness, anger, neutral, excited, and frustrated. MELD contains 13,707 video clips of multi-party conversations, with labels following Ekman's six universal emotions: joy, sadness, fear, anger, surprise, and disgust.

Evaluation metrics
For MOSI and MOSEI, we follow previous works (Han et al., 2021) and adopt mean absolute error (MAE), Pearson correlation (Corr), seven-class classification accuracy (ACC-7), binary classification accuracy (ACC-2), and the F1 score, where ACC-2 and F1 are computed under both the positive/negative and non-negative/negative settings. For MELD and IEMOCAP, we use accuracy (ACC) and weighted F1 (WF1) for evaluation.
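The two ACC-2 conventions can be made concrete with a small sketch; thresholding the real-valued scores at zero follows common MSA practice, and the function and parameter names are ours.

```python
def acc2(preds, labels, exclude_zero=True):
    """Binary sentiment accuracy over real-valued scores. With
    exclude_zero=True this is the positive/negative setting (samples
    whose gold score is exactly zero are dropped); with
    exclude_zero=False it is the non-negative/negative setting
    (a zero gold score counts as non-negative)."""
    if exclude_zero:
        pairs = [(p, l) for p, l in zip(preds, labels) if l != 0]
        return sum((p > 0) == (l > 0) for p, l in pairs) / len(pairs)
    return sum((p >= 0) == (l >= 0) for p, l in zip(preds, labels)) / len(preds)
```

The same split into two settings applies to the reported F1 scores, computed over the same binarized predictions.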

Baselines
We compare the proposed method with competitive baselines on the MSA and ERC tasks. For MSA, the baselines can be grouped into 1) early multimodal fusion methods such as the Tensor Fusion Network TFN (Zadeh et al., 2017), Low-rank Multimodal Fusion LMF (Liu et al., 2018), and the Multimodal Factorization Model MFM (Tsai et al., 2019b); 2) methods that fuse multiple modalities by modeling modality interaction, such as the multimodal Transformer MulT (Tsai et al., 2019a), the interaction canonical correlation network ICCN (Sun et al., 2020), the sparse phased Transformer SPC (Cheng et al., 2021a), and the modal-temporal attention graph MTAG (Yang et al., 2021); and 3) methods focusing on modality consistency and difference, in which MISA (Hazarika et al., 2020) constrains the modality representation space, Self-MM (Yu et al., 2021a) learns from unimodal representations using multi-task learning, MAG-BERT (Rahman et al., 2020) designs a fusion gate, and MMIM (Han et al., 2021) hierarchically maximizes mutual information.
With the rise of multimodal information, MMGCN (Hu et al., 2021c), MM-DFN (Hu et al., 2022), and COGMEN (Joshi et al., 2022) consider the multimodal conversational context to solve the ERC task. Some works use only the textual modality to recognize emotion: ERMC-DisGCN (Sun et al., 2021), Psychological (Li et al., 2021a), DAG-ERC (Shen et al., 2021), and DialogueGCN (Ghosal et al., 2019) adapt GNN-based models to capture context. Additionally, CoG-BART (Li et al., 2021b) learns contextual knowledge from a pre-trained model, COSMIC (Ghosal et al., 2020) incorporates different elements of commonsense, and TODKAT (Zhu et al., 2021) uses a topic-driven knowledge-aware Transformer to model affective states. Like these MSA and ERC works, UniMSE also aims to improve the multimodal fusion representation and modality comparison in feature space. However, UniMSE unifies the MSA and ERC tasks into a single architecture to implement knowledge sharing.

Experimental Settings
We use pre-trained T5-Base as the backbone of UniMSE. We merge the training sets of MOSI, MOSEI, MELD, and IEMOCAP to train the model and use the validation sets to select hyperparameters. The batch size is 96, the learning rate for T5 fine-tuning is 3e-4, and the learning rates for the main model and the pre-trained modality fusion layers are both 1e-4. The hidden dimension of the acoustic and visual representations is 64, the T5 embedding size is 768, and the fusion vector size is 768. We insert a pre-trained modality fusion layer into each of the last 3 Transformer layers of T5's encoder. Contrastive learning is performed on the last 3 Transformer layers of T5's encoder, and we set α = 0.5 and β = 0.5. More details are given in Appendix A.3.

Results
We compare UniMSE with the baselines on MOSI, MOSEI, IEMOCAP, and MELD; the comparative results are shown in Table 2. UniMSE significantly outperforms the SOTA in all metrics on all four datasets. Compared to the previous SOTA, UniMSE improves ACC-2 on MOSI, ACC-2 on MOSEI, ACC on MELD, and ACC on IEMOCAP by 1.65%, 1.16%, 2.6%, and 2.35%, respectively, and improves F1 on MOSI, F1 on MOSEI, and WF1 on IEMOCAP by 1.73%, 1.29%, and 2.48%, respectively. It can be observed that early works like LMF, TFN, and MFM were evaluated on all four datasets, whereas later works, whether for MSA or ERC, only evaluate their models on some of the datasets or metrics; in contrast, we provide results on all datasets and their corresponding metrics. For example, MTAG only conducts experiments on MOSI, and most ERC works only report WF1, which tends to keep the sentiment knowledge of MSA and ERC isolated. Unlike these works, UniMSE unifies the MSA and ERC tasks on these four datasets and evaluates them with the common metrics of the two tasks. In summary, 1) UniMSE covers all benchmark datasets of MSA and ERC, and 2) UniMSE significantly outperforms the SOTA in most cases. These results illustrate the superiority of UniMSE on the MSA and ERC tasks and demonstrate the effectiveness of a unified framework for knowledge sharing across tasks and datasets.

Ablation Study
We conducted a series of ablation studies on MOSI, and the results show that removing any of the proposed components degrades performance on the four metrics. The proposed UniMSE is orthogonal to existing works, and we believe that introducing our unified framework to other tasks can also bring improvements.

Visualization
To verify the effects of UniMSE's UL and cross-task learning on the multimodal representation, we visualize the multimodal fusion representation (i.e., F_i) of the last Transformer layer. Specifically, we select samples carrying positive/negative sentiment polarity from the test set of MOSI and samples with the joy/sadness emotion from the test set of MELD. The representation visualization is shown in Figure 4(a). It can be observed that the representations of samples with positive sentiment cover those of samples with the joy emotion, which demonstrates that although these samples come from different tasks, a common feature space exists between samples with joy emotion and positive sentiment.
Moreover, we also select the MOSI samples with the generated emotions joy/sadness and compare them with MELD samples carrying the original joy/sadness emotion labels in the embedding space. The visualization is shown in Figure 4(b). The samples with joy emotion, whether annotated with the original label or generated based on the UL, share a common feature space. These results verify the benefit of UniMSE for representation learning across samples and demonstrate the complementarity between sentiment and emotion.

Conclusion
This paper provides a psychological perspective to argue that jointly modeling sentiment and emotion is feasible and reasonable. We present a unified multimodal knowledge-sharing framework, UniMSE, to solve the MSA and ERC tasks. UniMSE not only captures knowledge of sentiment and emotion but also aligns the input features and output labels. Moreover, we fuse acoustic and visual modality representations with multi-level textual features and introduce inter-modality contrastive learning. We conduct extensive experiments on four datasets and achieve SOTA results in all metrics. We also provide a visualization of the multimodal representations, showing the relevance of sentiment and emotion in the embedding space. We believe this work presents a new experimental setting that can provide a different perspective to the MSA and ERC research communities.

A.1 Datasets
We count the durations of the video segments in MSA and ERC and give the results in Table 4. We take the length of a video segment as the duration of the sentiment or emotion. We can observe that the average duration of sentiment in MSA is longer than that of emotion in ERC, demonstrating the difference between sentiment and emotion. The average length of a video segment in MOSEI is 7.6 seconds. This may explain why MOSEI is usually used to study sentiments rather than emotions. Furthermore, we count the emotion categories of MELD and IEMOCAP, and their distributions over the train, validation, and test sets are shown in Table 5 and Table 6, respectively.

A.2 Decoding Algorithm for MSA and ERC tasks
In this part, we introduce the decoding algorithm used to convert the predicted target sequence of UniMSE into a sentiment intensity for MSA and an emotion category for ERC. The algorithm is shown in Algorithm 1.
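A simplified stand-in for this decoding step is sketched below. It assumes the universal label is serialized as comma-separated text such as "positive, 1.6, joy"; the exact serialization and fallback behavior are defined by Algorithm 1, so both are assumptions here.

```python
def decode_universal_label(generated, task, emotions):
    """Convert a generated universal-label sequence into a task
    prediction: the first token that parses as a real number for MSA,
    or the first token matching a known emotion category for ERC."""
    parts = [p.strip() for p in generated.split(",")]
    if task == "MSA":
        for p in parts:
            try:
                return float(p)
            except ValueError:
                continue
        return 0.0  # fallback when no intensity was generated (assumed)
    for p in parts:
        if p in emotions:
            return p
    return "neutral"  # fallback emotion (assumed)
```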

A.3 Experimental Environment
All experiments are conducted on NVIDIA A100 and NVIDIA V100 GPUs.

Figure 1: Illustration of sentiment and emotion sharing a unified embedding space. The bottom shows a unified label after formalizing sentiment and emotion according to the similarity sim between samples with the same sentiment polarity label.

Figure 3: The generation process of a universal label (UL); the red dashed line denotes that e_1 is the sample with the maximal semantic similarity to m_2.
Figure 4: T-SNE visualization comparison of the multimodal fusion representations between (a) samples with sentiment and emotion, and (b) samples with original emotion and generated emotion, where joy^ and sadness^ denote the generated emotions.

Table 1 :
The details of MOSI, MOSEI, MELD, and IEMOCAP, including data splits and the labels each dataset contains, where Senti. and Emo. represent the sentiment polarity and intensity labels of MSA and the emotion category label of ERC, respectively, and ✓/✗ denote whether the dataset has the label.

Table 2 :
Results on MOSI, MOSEI, MELD, and IEMOCAP. *The performances of baselines are updated by their authors in the official code repositories; baselines in italics use only the textual modality. Underlined results denote the previous SOTA performance.

Table 4 :
Average video length of samples. D-AVL(s) and T-AVL(s) denote the average video length per dataset and per task, respectively.

Table 5 :
The distribution of emotion categories on the MELD dataset.

Table 6 :
The distribution of emotion categories on the IEMOCAP dataset.

The T5-Base model has 220M parameters, including 12 layers, a hidden size of 768, and 12 attention heads.

Algorithm 1: Decoding. Input: target task t ∈ {MSA, ERC} and target sequence Y = {y_1, y_2, ..., y_N}, where y_i = (y_i^p, y_i^r, y_i^c). Output: task predictions Y^t = {y^t_1, y^t_2, ..., y^t_N}.
1: Y^t = {}
2: for each y_i in Y do