Topic and Style-aware Transformer for Multimodal Emotion Recognition

Understanding emotional expressions in multimodal signals is key for machines to better understand human communication. While the language, visual, and acoustic modalities can each provide clues from different perspectives, the visual modality has been shown to contribute little to performance in emotion recognition due to its high dimensionality. We therefore first leverage the strong multimodal backbone VATT to project the visual signal into a common space with the language and acoustic signals. On top of it, we propose content-oriented features, topic and speaking style, to address the subjectivity of emotion expression. Experiments conducted on the benchmark dataset MOSEI show that our model outperforms SOTA results, effectively incorporates visual signals, and handles subjectivity by letting these features serve as a content "normalization".


Introduction
Emotion recognition is essential for social robots to interact with people naturally. The ability to detect emotional changes and propose timely interventions can help maintain people's mental health and social relations. Though the traditional task of sentiment analysis is based purely on text (Wang et al., 2020; Ghosal et al., 2020; Shen et al., 2021), humans express emotions not only with spoken words but also through non-verbal signals such as facial expressions and changes in tone. Therefore, following the current trend of multimodal emotion recognition (Delbrouck et al., 2020; Zadeh et al., 2017; Rahman et al., 2020; Gandhi et al., 2022), we focus on understanding the emotions expressed in videos along with their audio and transcripts.
In this work, we tackle the multimodal emotion recognition task from two major issues: the minimal contribution of the visual modality, and emotional subjectivity. Previous works using multimodal approaches (Rahman et al., 2020; Joshi et al., 2022; Delbrouck et al., 2020) have shown that text+audio outperforms the combination of all three modalities. While facial and gesture signals contain abundant information, they tend to introduce more noise due to their high dimensionality. In order to increase the contribution from the visual modality, we propose to take advantage of the strong multimodal backbone VATT (Akbari et al., 2021), which can project features of different granularity levels into a common space. On the other hand, the expression of emotion is subjective. People's emotion judgments can be influenced by the enclosing scenario. As shown in the left two columns of Figure 1, although both examples are labeled as "happy", the signals we use to detect "happy" may not be the same: in a public speech, showing gratitude may indicate a positive sentiment, while in movie reviews we may focus more on sentiment words like good or bad. Subjectivity may also come from individual differences in emotional intensity. As the examples in the right three columns of Figure 1 show, the sadness and happiness of the person with the excited style are more distinguishable through the face, while the person with the calm style always keeps a calm face that makes sad and happy less recognizable. Therefore, we introduce content-oriented features, topic and speaking style, serving as a content "normalization" for each person.

Figure 1: Left table: "happy" under different topics. Right table: speaking styles can affect how emotion is displayed on the face.
Our work makes the following contributions: 1) We propose to leverage a multimodal backbone to reduce the high dimensionality of the visual modality and increase its contribution to the emotion recognition task. 2) We incorporate emotion-related content features to handle emotional subjectivity. 3) Experiments conducted on the benchmark dataset MOSEI show that our model outperforms SOTA results, effectively incorporates visual signals, and handles subjectivity issues.

Related Work
Emotion recognition using a fusion of input modalities such as text, speech, and images is a key research direction in human-computer interaction. Specific to the area of sentiment analysis, the Multimodal Transformer applies pairwise cross-attention to different modalities (Tsai et al., 2019). The Memory Fusion Network synchronizes multimodal sequences using a multi-view gated memory that stores intra-view and cross-view interactions through time (Zadeh et al., 2018). TFN performs the outer product of the modalities to learn both intra-modality and inter-modality dynamics (Sahay et al., 2018). (Rahman et al., 2020) begins the endeavor of taking BERT (Devlin et al., 2018), pretrained on a large-scale corpus, as a strong backbone. (Arjmand et al., 2021) follows this direction and combines RoBERTa with a lightweight audio encoder to fuse text and audio features. A recent work (Yang et al., 2022a) presents a self-supervised framework to pretrain features within a single modality and across different modalities. Other frameworks include context- and speaker-aware RNNs (Shenoy and Sardana, 2020; Wang et al., 2021) and graph neural networks modeling knowledge graphs and inter/intra relations between videos (Joshi et al., 2022; Fu et al., 2021; Lian et al., 2020), while (Zhu et al., 2021) uses topic information to improve emotion detection.

Method

Overview
Our model aims to predict the presence of different emotions given an utterance-level video input along with its audio and transcript. Figure 2 shows the overall structure of our model. To first get a better alignment of features from different modalities, we project each modality into a common space with the pretrained VATT backbone, and then combine the aligned features with content-oriented topic and style features for classification.

Backbone
The Video-Audio-Text Transformer (VATT) is a framework for learning multimodal representations from raw signals. For each modality encoder, VATT appends an aggregation token at the beginning of the input sequence, and the corresponding latent feature is fed to the projection head for this modality. For pretraining, a contrastive loss is applied to align features from different modalities in a common projected space. Details can be found in (Akbari et al., 2021).
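VATT's exact objectives (NCE for video-audio pairs and MIL-NCE for video-text pairs) are described in the original paper; below is only a minimal, illustrative sketch of a symmetric noise-contrastive loss that pulls the projected features of matching clips together in the common space. The function name, temperature value, and batching convention are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(z_a, z_b, temperature=0.07):
    # z_a, z_b: (batch, dim) projection-head outputs of two modalities;
    # row i of z_a and row i of z_b come from the same clip (positive pair).
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                   # pairwise cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric InfoNCE: each clip must identify its counterpart in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```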

Topic
For each utterance input, we first predict the topic of the utterance and feed the corresponding topic embedding into the model. Since we do not have ground-truth topic labels, we use a Latent Dirichlet Allocation (LDA) (Blei et al., 2003) model to cluster all the text from the training set into 3 topics. The number of topics is decided by grid search.
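A minimal sketch of this clustering step using the scikit-learn LDA implementation mentioned in the experimental setup; the preprocessing choices (vocabulary size, stop-word removal) are illustrative assumptions rather than the exact configuration we used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def fit_topic_model(train_texts, n_topics=3):
    # Fit LDA on bag-of-words counts of the training transcripts.
    vectorizer = CountVectorizer(stop_words="english", max_features=5000)
    counts = vectorizer.fit_transform(train_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)
    return vectorizer, lda

def predict_topic(vectorizer, lda, text):
    # Topic ID = argmax of the per-document topic distribution.
    dist = lda.transform(vectorizer.transform([text]))
    return int(dist.argmax(axis=1)[0])
```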

Speaking Style
We define speaking style based on the expression coefficients and the projection parameters of a 3DMM model (Blanz and Vetter, 1999). In a 3DMM model, the face shape is represented as an affine model of facial identity and facial expression: S = S̄ + B_id α + B_exp β. This 3D face is projected onto a 2D image by a translation and rotation p. Since there are multiple video frames, the expression coefficient β and the projection parameter p become time series β(t) and p(t). For a detailed analysis of the relation between 3DMM parameters and talking styles, (Wu et al., 2021) collected a dataset consisting of 3 talking styles: excited, tedious, and solemn. They find that the standard deviation of these time series and of their gradients is closely related to the styles. The final style code is denoted as σ(β(t)) ⊕ σ(∂β(t)/∂t) ⊕ σ(∂p(t)/∂t), where ⊕ signifies vector concatenation and σ the standard deviation over time.
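As an illustration, the style code above can be assembled from per-frame 3DMM parameters roughly as follows; the array shapes and the use of frame-wise finite differences for the temporal gradient are assumptions.

```python
import numpy as np

def style_code(beta_t, p_t):
    # beta_t: (T, d_exp) expression coefficients over T frames
    # p_t:    (T, d_pose) rotation/translation parameters over T frames
    std_beta  = beta_t.std(axis=0)                        # sigma(beta(t))
    std_dbeta = np.gradient(beta_t, axis=0).std(axis=0)   # sigma(d beta / dt)
    std_dp    = np.gradient(p_t, axis=0).std(axis=0)      # sigma(dp / dt)
    return np.concatenate([std_beta, std_dbeta, std_dp])  # vector concatenation
```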

Aggregating Different Features
Given each data input with its corresponding video ID, we collect all the transcripts with the same video ID as the context, and the context feature is extracted from the text encoder of VATT. To adapt the general topic and style features to the current speaker, we treat them as a feature sequence of length 2 and use an additional cross-attention layer to aggregate these features queried by the video context. This aggregated information, along with the context and the aligned features, is then concatenated and fed into the final linear classifier.
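The aggregation and classification step can be sketched as follows; the hidden dimension, number of attention heads, and the exact set of concatenated features are illustrative assumptions rather than our actual configuration.

```python
import torch
import torch.nn as nn

class ContentAggregator(nn.Module):
    """Cross-attention over the length-2 topic/style sequence, queried by the
    video context, followed by concatenation and a linear emotion classifier."""
    def __init__(self, dim=768, n_heads=8, n_emotions=6):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # context + three aligned modality features + aggregated content feature
        self.classifier = nn.Linear(5 * dim, n_emotions)

    def forward(self, context, aligned, topic_emb, style_emb):
        # context: (B, dim); aligned: (B, 3, dim); topic_emb, style_emb: (B, dim)
        content = torch.stack([topic_emb, style_emb], dim=1)         # (B, 2, dim)
        agg, _ = self.cross_attn(context.unsqueeze(1), content, content)
        fused = torch.cat([context, aligned.flatten(1), agg.squeeze(1)], dim=-1)
        return self.classifier(fused)                                # per-emotion logits
```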

Experiments

Setup
We train our models on 8 V100 GPUs for 8 hours using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-4 and a mini-batch size of 64. The total number of parameters of our model is 155M. For topic clustering, we adopt the scikit-learn LDA implementation (Pedregosa et al., 2011). We extract the style code for each video using https://github.com/wuhaozhe/style_avatar. The final model is selected based on validation accuracy on the development set.
Task We evaluate the performance of our model on two tasks: 1) Multi-label emotion recognition: the model needs to classify whether each of the 6 emotion classes is present or not. 2) Sentiment analysis: the model is tested on both 2-class (sentiment is positive or negative) and 7-class (a scale from -3 to +3) classification.
Evaluation Since the labels in MOSEI are unbalanced, we use the weighted F1 score for each emotion as the evaluation metric. We compare the performance with Multilogue-Net (Shenoy and Sardana, 2020), which adopts a context- and speaker-aware RNN; TBJE (Delbrouck et al., 2020), a state-of-the-art method using cross-attention for modality fusion; and MESM (Dai et al., 2021), the first fully end-to-end trainable model for the multimodal emotion recognition task. There are two recent works on emotion recognition, COGMEN (Joshi et al., 2022) and i-Code (Yang et al., 2022b). Since COGMEN adopts a structural representation that can exploit more relational information from other data samples, and i-Code does not report the same metrics and is not open-sourced, we do not compare with them in this paper.
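Concretely, the metric corresponds to a per-emotion binary weighted F1, which can be computed with scikit-learn as sketched below; the helper name and evaluation loop are assumptions for illustration.

```python
from sklearn.metrics import f1_score

def per_emotion_weighted_f1(y_true, y_pred):
    # y_true, y_pred: binary presence labels/predictions for one emotion class
    # over the test set; computed independently for each of the 6 emotions.
    return f1_score(y_true, y_pred, average="weighted")
```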

Emotion Recognition
Table 1 shows our quantitative results. Compared with the other SOTA methods in the first three rows, our full model achieves the best performance on recognizing happy, sad, and angry. We reckon that, due to the very limited data for surprise and fear for training the large backbone, our model does not gain much improvement on these classes (shown in Table 3). To further analyze the contribution of each component of our model design, we also conduct a detailed ablation study: 1) We first remove the aligned features from the backbone one at a time. We can see from the results in the second block that combining all three modalities in our full model outperforms the bi-modality inputs. In particular, contrasting the rows with and without video input, the comparative performance validates that our model can learn effectively from the visual modality. 2) In the third block, we report the performance when we simply concatenate the aligned features as the input to the emotion classification layer, without the high-level features. The degraded performance reveals the efficacy of our content feature design. 3) Lastly, we investigate the influence of each content feature and of the aggregation using context. To remove the context, we directly apply a self-attention layer to the feature sequence and use a linear layer to project the outputs into the aggregate feature dimension. For topic and style, we simply remove the corresponding feature from the input. As shown in the last block, removing any part results in a performance drop. Overall, our full model yields the best performance.

Sentiment Analysis
To further validate our method, we run our model on the other subtask, sentiment analysis. For each data sample, the annotation of sentiment polarity is a continuous value from -3 to 3, where -3 means extremely negative and 3 means extremely positive. Our model is trained to regress the sentiment intensity. We then discretize the continuous value into 2 or 7 classes to calculate the accuracy. Contrasting the 2-class and 7-class results in Table 2, our model works better for the more fine-grained classification.
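A sketch of this discretization for evaluation; the exact thresholding conventions (e.g. how a score of exactly 0 is binned in the 2-class case) follow common MOSEI practice and are assumptions here rather than a statement of our evaluation code.

```python
import numpy as np

def discretize_sentiment(pred, n_classes):
    # pred: regressed sentiment scores in [-3, 3]
    if n_classes == 2:
        return (pred >= 0).astype(int)                  # negative vs. non-negative
    # 7-class: round to the nearest integer in {-3, ..., 3}, shifted to {0, ..., 6}
    return (np.clip(np.rint(pred), -3, 3) + 3).astype(int)
```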

Limitations
For modeling simplicity, we adopt the classic LDA method to get the topic ID for each video segment. We plan to investigate more advanced topic clustering methods and check how they can be applied to multilingual cases. Also, we propose a two-stage framework that first extracts topic and style features, based on which the emotion classifier is trained. In the future, we hope to extend this work to learn these features in an end-to-end manner.

Topic 1
Words: movie, umm, uhh, like, know, really, one, im, good, go, see, two, kind, would, think, even, thats, going, there
Examples: 1) hi there today we're going to be reviewing cheaper by the dozen which is umm the original version; 2) i was a huge fan of the original film bruce almighty but i did think it was funny like jim

Topic 2
Words: people, get, think, make, business, u, want, time, world, need, company, way, also, work, one, year, take, money, right, new
Examples: 1) future and it's a retirement future that can ultimately turned in to an income for you when you no longer have an income and you're fully retired; 2) um this year switching up how we approach funding and hopefully going to be able to arrange for some sustainable more officially recognized sorts of funding

Topic 3
Words: going, thing, like, know, one, want, really, well, also, im, video, make, way, thats, something, think, were, time, get, look
Examples: 1) is you can say hey i really like baby skin they are so soft they have any hair on their face so nice; 2) okay what happens at this point after we've taken this brief walk down memory lane is the presentation of the gift now

Figure 3: Our model can recognize happy/sad under 3 different topics

Figure 4: People expressing different emotions with excited/calm styles

Table 1: Impact of different input modalities and content features

Table 2: Sentiment analysis on 2-class and 7-class classification

Table 4: Topic clustering results