Sentiment Knowledge Enhanced Self-supervised Learning for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis (MSA) has made great progress, benefiting from sophisticated fusion schemes. However, labeled data in this field is scarce, so supervised models suffer from severe over-fitting and poor generalization. In this paper, we propose Sentiment Knowledge Enhanced Self-supervised Learning (SKESL) to capture common sentimental patterns in unlabeled videos, which facilitates further learning on limited labeled data. Specifically, with the help of sentiment knowledge and non-verbal behavior, SKESL performs sentiment word masking and predicts fine-grained word sentiment intensity, embedding word-level sentiment information into the pre-trained multimodal representation. In addition, a non-verbal injection method is proposed to integrate non-verbal information into word semantics. Experiments on two standard MSA benchmarks show that SKESL significantly outperforms the baseline and achieves new State-Of-The-Art (SOTA) results.


Introduction
Multimodal Sentiment Analysis (MSA) is a rapidly developing research field that extends conventional text sentiment analysis to a multimodal setup in which three modalities are present: text, audio and visual (Morency et al., 2011). With the abundance of user-generated opinion videos, MSA has a wide range of applications in e-commerce, intelligent customer service, human-computer interaction, etc.
In MSA, the construction of training datasets relies on human perceptual evaluation of the sentiment of opinion videos, a very time-consuming and labor-intensive task, which is why sentiment-annotated video data is scarce. As a result, supervised models applied in this field suffer from severe overfitting and poor generalization (Dai et al., 2021). Although previous studies have used several methods to alleviate overfitting, most rely on general approaches such as multi-task learning (Dai et al., 2021; Akhtar et al., 2019; Chauhan et al., 2020; Yu et al., 2020), parameter regularization (Liang et al., 2019; Mai et al., 2020) and data augmentation (Liu et al., 2022), and neglect the large number of unlabeled opinion videos that naturally exist on the Internet.
These opinion videos contain common sentimental patterns, or compositional sentiment semantics about how the three modalities in a video are fused to express the overall sentiment, which can be leveraged to learn better sentiment representations. Inspired by recent knowledge-enhanced pre-training models for text sentiment analysis (Tian et al., 2020; Yin et al., 2020; Ke et al., 2020; Zhao et al., 2022), we argue that pre-training models enriched with the sentiment knowledge of words and non-verbal behavior will facilitate the characterization of sentimental patterns in videos, thereby resulting in better performance on multimodal sentiment analysis.
In this paper, we propose a Sentiment Knowledge Enhanced Self-supervised Learning (SKESL) method, which uses contextual and non-verbal information to predict the fine-grained sentiment intensity of a word, thereby learning the common sentimental patterns in opinion videos, as shown in Figure 1. Specifically, given a speaker video without sentiment annotation, we first use Automatic Speech Recognition (ASR) to obtain the transcribed text and then mask the most sentimentally salient words in the text according to a pre-specified sentiment lexicon. A pre-trained language representation model is utilized to acquire the sequence representation of the processed text.
To integrate non-verbal information into the text representations, we further propose a non-verbal information aggregation method based on the cross-modal attention mechanism to derive non-verbal information-enhanced text representations. Finally, the masked word representations are exploited to predict the sentiment intensity of the masked word.
After SKESL pre-training is completed, we transfer the pre-trained model to the multimodal sentiment analysis task and fine-tune it with a small amount of sentiment-annotated data. To evaluate the effectiveness of SKESL, we test on two benchmark datasets: CMU-MOSI (Zadeh et al., 2016) and CMU-MOSEI (Zadeh and Pu, 2018). Experimental results demonstrate that our model outperforms both the baseline and the current State-of-the-Art (SOTA) approaches.
The main contributions can be summarized as follows:
• To the best of our knowledge, this is the first self-supervised learning method for multimodal sentiment analysis that leverages sentiment knowledge from large-scale unlabeled videos to facilitate improved sentiment representation learning.
• This paper proposes a novel non-verbal information aggregation method for obtaining text sequence representations enhanced by audio and visual information.
• The proposed SKESL method not only surpasses the baseline in experimental performance, but also achieves SOTA in the field of multimodal sentiment analysis.

Related Work
Pre-training Language Models In NLP, it has become a paradigm to pre-train language models on large-scale unlabeled data in an auto-encoding (Devlin et al., 2019) or auto-regressive manner (Radford et al., 2018, 2019), and fine-tune the pre-trained models on downstream tasks using task-specific labeled data.
Recently, many more pre-trained language models have been proposed, which can be roughly divided into four categories (Ke et al., 2020): 1) Knowledge enhancement: Introducing domain-specific knowledge into the pre-training of language representation models has been shown to be effective. A representative model is ERNIE (Zhang et al., 2019), which explicitly introduces knowledge graphs into pre-trained language models. 2) Transferability: On the basis of a general pre-trained language model, further post-training is performed on more specific auxiliary tasks (Li et al., 2021). 3) Model compression: Compressed pre-trained language models can be widely applied on resource-constrained devices and in tasks requiring real-time capability. Commonly used model compression methods include knowledge distillation (Sanh et al., 2019; Jiao et al., 2020), quantization (Shen et al., 2020) and pruning (Gordon et al., 2020). 4) Pre-training objectives: To handle rich and variable text expressions, many studies have further improved text features on the basis of the general Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives (Devlin et al., 2019). For instance, SpanBERT (Joshi et al., 2020) masks consecutive spans randomly instead of individual tokens, while BERT-WWM (Cui et al., 2021) utilizes the Whole Word Masking (WWM) strategy to force the model to learn complete semantics.
Knowledge Enhanced Pre-training Language Models Incorporating external knowledge into pre-training language models has become prevalent and has been shown to be significant. Such external knowledge includes commonsense knowledge for tasks such as entity typing and relation classification (Zhang et al., 2019; Peters et al., 2019; Liu et al., 2020; Xiong et al., 2020), sentiment knowledge for sentiment analysis (Tian et al., 2020; Yin et al., 2020; Ke et al., 2020), word sense knowledge for word sense disambiguation (Levine et al., 2020), commonsense knowledge for commonsense reasoning and sarcasm generation (Klein and Nabi, 2020; Chakrabarty et al., 2020), legal knowledge for legal element extraction (Zhong et al., 2020), and biomedical knowledge for health question answering and medical inference (He et al., 2020).
Knowledge Enhanced Pre-training Models for Text Sentiment Analysis Some research (Tian et al., 2020; Yin et al., 2020; Ke et al., 2020; Zhao et al., 2022) integrates sentiment knowledge, including sentiment words, word polarity and aspect-sentiment pairs, into the pre-training process. The learned representations are more sentiment-specific and appropriate for text sentiment analysis.
Knowledge Enhanced Models for Multimodal Sentiment Analysis In MSA, some works consider sentiment knowledge with explicit supervision. For example, SWAFN (Chen and Li, 2020) designs a sentiment word prediction objective as an auxiliary task to incorporate sentiment word knowledge. MAGCN (Xiao et al., 2022) also incorporates sentiment knowledge into inter-modality learning.

Methodology
In this section, we describe our proposed Sentiment Knowledge Enhanced Self-supervised Learning (SKESL) framework for multimodal sentiment analysis, as shown in Figure 2. The framework contains Sentiment Word Masking (SWM), text representation learning, non-verbal information injection (a.k.a., multimodal fusion), and Sentiment Intensity Prediction (SIP) modules. The following subsections detail these four modules.

Formulation
Our task is defined as follows: given the set of three modalities $M = \{T\ (\text{Text}), A\ (\text{Audio}), V\ (\text{Visual})\}$, an opinion video, i.e., a multimodal sequence, can be represented as $X^m = \{x^m_1, x^m_2, \ldots, x^m_{T_m}\}$, where $x^m_i \in \mathbb{R}^{d_m}$ denotes the extracted sentiment feature corresponding to modality $m$, $d_m$ is the feature dimension, and $T_m$ is the sequence length of modality $m$. Our goal is to predict the sentiment intensity $y \in \mathbb{R}$ or polarity $y \in \{\text{positive}, \text{neutral}, \text{negative}\}$ of the whole video.

Sentiment Word Masking
Sentiment Word Masking (SWM) aims to construct a corrupted version of each input sequence in which sentiment information is masked. For a speaker video without sentiment annotation, an ASR system is first used to transcribe the speech into text $S = \{w_1, w_2, \ldots, w_N\}$. Since sentiment words in the text, especially those with the most salient sentiment, are the most essential textual clues for detecting sentiment, we employ a sentiment lexicon to find them and then mask them, i.e., replace them with special tokens. The sentiment lexicon (Hutto and Gilbert, 2014) provides an explicit sentiment intensity score for each sentiment word, so we can easily find the sentiment word with the highest sentiment intensity. Meanwhile, the score $y_{\text{MASK}}$ of the highest sentiment intensity is chosen as the label for guiding SKESL. The corrupted sentence is represented as $S' = \{w_1, w_2, \ldots, w_{\text{MASK}}, \ldots, w_N\}$, where $w_{\text{MASK}}$ denotes the masked word.
It is worth noting that a sentence with a sentiment tendency does not necessarily contain sentiment words. To cope with this situation, we adopt a random masking strategy and assign a label with sentiment intensity 0.0 to the masked word. The motivation is that the pre-training model is induced to distinguish, based on contextual and non-verbal information, whether the masked position holds a word without any sentiment. In this way, the model acquires a stronger sentimental semantic cognition of the words in the sentence and can learn better sentimental multimodal representations.
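For illustration, a minimal sketch of the SWM procedure is given below. The helper name mask_sentiment_word and the fallback of randomly masking one position with label 0.0 when no lexicon word is found are our assumptions, not the paper's exact implementation.

```python
# Sketch of Sentiment Word Masking (SWM), assuming a VADER-style lexicon
# that maps words to signed intensity scores already rescaled to [-3, +3].
import random

MASK_TOKEN = "[MASK]"

def mask_sentiment_word(tokens, lexicon):
    """Mask the most sentimentally salient word; return (S', index, y_MASK)."""
    scored = [(i, abs(lexicon[w.lower()])) for i, w in enumerate(tokens)
              if w.lower() in lexicon]
    if scored:
        # Mask the word with the highest absolute sentiment intensity.
        idx = max(scored, key=lambda p: p[1])[0]
        label = lexicon[tokens[idx].lower()]   # signed intensity as supervision
    else:
        # No sentiment word found: randomly mask a position with label 0.0,
        # inducing the model to recognize sentiment-neutral positions.
        idx = random.randrange(len(tokens))
        label = 0.0
    corrupted = tokens[:idx] + [MASK_TOKEN] + tokens[idx + 1:]
    return corrupted, idx, label
```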

Text representation learning
After obtaining the corrupted sentence $S'$, we need to encode it into a sequence of word representations for subsequent processing. Given BERT's outstanding language representation capability and widespread use, we adopt it as the text encoder and feed the corrupted sentence $S'$ into it:

$$X^T = \text{BERT}(S'; \theta_{LM}),$$

where $\theta_{LM}$ represents the parameters of BERT, $x^T_i \in \mathbb{R}^{d_T}$ denotes the encoded word representation, and $d_T$ is the dimension of the representation.
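The encoding step can be sketched with the HuggingFace transformers library as follows; the choice of the bert-base-uncased checkpoint here is illustrative.

```python
# Minimal sketch of encoding the corrupted sentence S' with BERT.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

corrupted = "The food is very [MASK] !"
inputs = tokenizer(corrupted, return_tensors="pt")
outputs = encoder(**inputs)

# X^T: one d_T-dimensional representation per (sub)word token.
word_reprs = outputs.last_hidden_state                       # (1, N, d_T)
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()
```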

Non-verbal information injection
Unlike knowledge-enhanced pre-training models for text sentiment analysis, our SKESL deals with opinion videos that contain multiple modalities rather than just text. The same word can carry different sentiments under different non-verbal accompaniments; therefore, the exact sentiment semantics of a word is determined by the word itself and the accompanying non-verbal behavior (Wang et al., 2019). Without the help of non-verbal information, it is difficult for the pre-trained language model to determine the masked word and infer its sentiment intensity. Therefore, to integrate non-verbal information into word representations, inspired by the Multimodal Transformer (MulT) (Tsai et al., 2019), which provides a latent cross-modal adaptation that fuses multimodal information by directly attending to low-level features in other modalities, we propose a new non-verbal information injection (a.k.a., multimodal fusion) method, as shown in Figure 3. The method repeatedly reinforces the text representations with low-level features from the audio and visual modalities by learning attention across the features of the two modalities. The low-level features help the model preserve the original sentiment semantics of non-verbal behavior and learn text-centric multimodal representations. Formally, we first define $X^T_0 = X^T$, $X^V_0 = X^V$ and $X^A_0 = X^A$ to represent the text, visual and audio feature sequences before multimodal fusion, respectively. The Queries, Keys and Values sequences for Cross-Modal Attention (CMA) are computed by linear transformations as follows:

$$Q^T_l = X^T_{l-1} W_Q, \quad K^m_l = X^m_0 W^m_K, \quad V^m_l = X^m_0 W^m_V,$$

where $W_Q \in \mathbb{R}^{d_T \times d_T}$, $W^m_K \in \mathbb{R}^{d_m \times d_T}$ and $W^m_V \in \mathbb{R}^{d_m \times d_T}$ are learnable projection matrices. After obtaining the Queries, Keys and Values sequences, we utilize the CMA to inject audio and visual information into the text representations:

$$Y^m_l = \text{CMA}(Q^T_l, K^m_l, V^m_l) = \text{softmax}\left(\frac{Q^T_l (K^m_l)^\top}{\sqrt{d_T}}\right) V^m_l,$$

where $m \in M - \{T\}$ and $Y^m_l \in \mathbb{R}^{N \times d_T}$ denotes the text sequence enhanced by modality $m$. In this way, each word receives information from all elements of the audio and visual feature sequences. The enhanced text representations $Y^m_l$ are then aggregated with the previous text representation $X^T_{l-1}$:

$$Y_l = X^T_{l-1} + \sum_{m \in M - \{T\}} Y^m_l.$$

Similar to the structure of the vanilla Transformer (Vaswani et al., 2017), $Y_l \in \mathbb{R}^{N \times d_T}$ then goes through layer normalization, a Feed-Forward Neural Network (FFNN) and a residual connection:

$$Z_l = \text{LN}(Y_l), \quad X^T_l = Z_l + \text{FFNN}(Z_l; \theta_{FF}),$$

where $\theta_{FF}$ represents the parameters of the FFNN and $X^T_l \in \mathbb{R}^{N \times d_T}$ is the output of block $l$.
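A sketch of one injection block is shown below, using PyTorch's built-in multi-head attention for the CMA; the summation-based aggregation and the normalization placement follow the reconstruction above and are our assumptions.

```python
# Sketch of one non-verbal information injection block.
import torch.nn as nn

class NonVerbalInjectionBlock(nn.Module):
    def __init__(self, d_t, d_a, d_v, n_heads=8):
        super().__init__()
        # Cross-modal attention: text queries, audio/visual keys and values.
        self.cma_a = nn.MultiheadAttention(d_t, n_heads, kdim=d_a, vdim=d_a,
                                           batch_first=True)
        self.cma_v = nn.MultiheadAttention(d_t, n_heads, kdim=d_v, vdim=d_v,
                                           batch_first=True)
        self.norm = nn.LayerNorm(d_t)
        self.ffnn = nn.Sequential(nn.Linear(d_t, 4 * d_t), nn.ReLU(),
                                  nn.Linear(4 * d_t, d_t))

    def forward(self, x_t, x_a0, x_v0):
        # Y^A_l, Y^V_l: text sequences enhanced by low-level audio/visual features.
        y_a, _ = self.cma_a(x_t, x_a0, x_a0)
        y_v, _ = self.cma_v(x_t, x_v0, x_v0)
        z = self.norm(x_t + y_a + y_v)     # aggregate, then layer-normalize
        return z + self.ffnn(z)            # FFNN with residual connection
```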

Sentiment Intensity Prediction
After $L$ blocks, the refined text representations are $X^T_L \in \mathbb{R}^{N \times d_T}$, in which $x^T_{\text{MASK},L}$ is the refined representation of the masked word. We simply use a two-layer fully connected network with a non-linear activation function to predict the sentiment intensity of the masked word:

$$y_{pred} = \text{FC}(x^T_{\text{MASK},L}; \theta_{FC}),$$
where $\theta_{FC}$ represents the parameters of the fully connected network and $y_{pred}$ is the predicted sentiment intensity. We define $\theta = \{\theta_{LM}, \theta_{CMA}, \theta_{FF}, \theta_{FC}\}$; the objective of the model is then

$$\min_{\theta} \mathcal{L}(y_{pred}, y_{\text{MASK}}),$$

where $\mathcal{L}$ is chosen as the Mean Absolute Error (MAE) loss function. The model is pre-trained in an end-to-end way.
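A minimal sketch of the prediction head and loss is given below; the hidden width and ReLU activation are assumptions, as the paper only specifies a two-layer network with a non-linear activation.

```python
# Sketch of the Sentiment Intensity Prediction (SIP) head with MAE loss.
import torch.nn as nn

class SIPHead(nn.Module):
    def __init__(self, d_t, d_hidden=128):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(d_t, d_hidden), nn.ReLU(),
                                nn.Linear(d_hidden, 1))

    def forward(self, x_mask):
        # x_mask: refined masked-word representation, shape (batch, d_T).
        return self.fc(x_mask).squeeze(-1)   # predicted intensity y_pred

mae_loss = nn.L1Loss()   # L(y_pred, y_MASK): Mean Absolute Error
```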

Fine-tuning
We verify the effectiveness of SKESL on the multimodal sentiment analysis task. On top of the pre-trained language model and multimodal fusion module, an output layer is added to perform task-specific prediction. The neural network is then fine-tuned on labeled multimodal data.
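For example, fine-tuning can be sketched as follows; attaching the regression head to the first ([CLS]) token is our assumption, as the paper does not specify which representation feeds the output layer.

```python
# Sketch of the fine-tuning model: pre-trained backbone + output layer.
import torch.nn as nn

class FineTuneModel(nn.Module):
    def __init__(self, pretrained_backbone, d_t):
        super().__init__()
        self.backbone = pretrained_backbone   # BERT + non-verbal injection blocks
        self.out = nn.Linear(d_t, 1)          # task-specific output layer

    def forward(self, text_inputs, x_a, x_v):
        x_t = self.backbone(text_inputs, x_a, x_v)   # (batch, N, d_T)
        return self.out(x_t[:, 0]).squeeze(-1)       # sentiment score in [-3, +3]
```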

Datasets
We pre-train our models on two speaker video datasets: VoxCeleb1 (Nagrani et al., 2017) and VoxCeleb2 (Chung et al., 2018). In addition, two multimodal sentiment datasets are used for fine-tuning and testing: CMU-MOSI (Zadeh et al., 2016) and CMU-MOSEI (Zadeh and Pu, 2018). The CMU-MOSI and CMU-MOSEI datasets consist of 2,199 and 22,846 opinion video clips from YouTube movie reviews, respectively. Each video clip is scored between −3 (strongly negative) and +3 (strongly positive). Following previous works (Tsai et al., 2019; Rahman et al., 2020; Qian et al., 2022), for the CMU-MOSI dataset we use 1,284 segments for training, 229 for validation and 686 for testing; for the CMU-MOSEI dataset we use 16,326 segments for training, 1,861 for validation and 4,659 for testing. Tables 1 and 2 show the statistics of all datasets.

Sentiment Features
We extract sentiment-related features for the non-verbal modalities.

Experimental Design
Sentiment Lexicon We use the VADER sentiment lexicon (Hutto and Gilbert, 2014) to search for and mask sentiment words. The VADER lexicon is sensitive to both the polarity and the intensity of sentiments expressed in social media contexts, and it contains rich sentiment words with explicit sentiment scores from −4 to +4. To be consistent with the CMU-MOSI and CMU-MOSEI datasets, we linearly scale the scores to [−3, +3], as sketched in the snippet after this subsection.
ASR We use the widely adopted Google Cloud Speech API to obtain transcripts for the pre-training datasets. Since we do not have access to ground-truth transcripts, the exact ASR word error rate cannot be computed. We expect, however, that better ASR performance would yield better outcomes, since a low-performing ASR system may misrecognize sentiment words and thus introduce noisy supervision.
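The linear rescaling can be done directly on the lexicon shipped with the vaderSentiment package; this snippet is a sketch of that preprocessing step.

```python
# Rescale VADER word intensities from [-4, +4] to [-3, +3].
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()           # exposes a word -> score dict
scaled_lexicon = {w: s * 3.0 / 4.0 for w, s in vader.lexicon.items()}
```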
Training Details All models are built with the PyTorch (Paszke et al., 2019) toolbox on NVIDIA RTX 3090 GPUs. The Adam (Kingma and Ba, 2014) optimizer is adopted for both pre-training and fine-tuning. The initial learning rate is 5e-6 for BERT and 1e-4 for the other parameters. The batch size is 32 and the number of epochs is 200. All experiments were run with the same random seed. The models use the designated validation sets of CMU-MOSI and CMU-MOSEI to find the best hyper-parameters; see Appendix A for more details.

Evaluation Metrics
Following previous works (Tsai et al., 2019; Yu et al., 2021), we report results in two forms: classification and regression. For classification, we report the weighted F1 score and binary accuracy. For regression, we report Mean Absolute Error (MAE) and Pearson correlation (Corr). Except for MAE, higher values denote better performance for all metrics.
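Under the assumption that predictions and labels are real-valued scores in [−3, +3] and that binary labels are obtained by thresholding at zero (the common convention for CMU-MOSI/MOSEI), the four metrics can be computed as sketched below.

```python
# Sketch of the evaluation metrics: MAE, Corr, binary accuracy, weighted F1.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_pred, y_true):
    mae = np.mean(np.abs(y_pred - y_true))            # lower is better
    corr = np.corrcoef(y_pred, y_true)[0, 1]          # Pearson correlation
    bin_pred, bin_true = y_pred > 0, y_true > 0       # threshold at zero
    acc2 = accuracy_score(bin_true, bin_pred)
    f1 = f1_score(bin_true, bin_pred, average="weighted")
    return {"MAE": mae, "Corr": corr, "Acc-2": acc2, "F1": f1}
```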

Baseline Models
Our model does not require manual alignment of language words with the visual and audio streams, since the unlabeled video data has no explicit word timestamps. We conduct a comprehensive comparative study of SKESL against various baseline and state-of-the-art models in both the aligned and unaligned settings, as detailed below.

Aligned Setting
MARN (Zadeh et al., 2018b) models intra-modal and cross-modal interactions with a Long-short Term Hybrid Memory and a Multi-attention Block. MFN (Zadeh et al., 2018a) focuses on continuously modeling view-specific and cross-view interactions and aggregating them through time with a Multi-view Gated Memory. RMFN (Liang et al., 2018) decomposes the modeling process into multi-stage fusion, with each stage specifically targeting a subset of multimodal signals. RAVEN (Wang et al., 2019) considers the fine-grained structure of non-verbal subword sequences and dynamically adjusts word representations based on these non-verbal cues. MCTN (Pham et al., 2019) learns joint representations by cyclically translating from source to target modalities, ensuring robustness even in the presence of noisy or missing target modalities. MISA (Hazarika et al., 2020) incorporates a combination of losses, including distributional similarity, orthogonality, reconstruction and task prediction losses, to learn both modality-invariant and modality-specific representations. MAG-BERT (Rahman et al., 2020) improves over RAVEN on aligned data by applying a multimodal adaptation gate at different layers of the BERT backbone.

Unaligned Setting
MulT (Tsai et al., 2019) extends the standard Transformer with directional pairwise cross-modal attention, which latently adapts streams from one modality to another without requiring alignment. PMR (Lv et al., 2021) introduces a message hub to explore three-way interactions across all involved modalities within the context of multimodal fusion on unaligned multimodal sequences. LMR-CBT (Fu et al., 2021) achieves complementary learning of different modalities by incorporating three effective components: local temporal learning, cross-modal feature fusion and global self-attention representations. Self-MM (Yu et al., 2021) designs a label generation module to obtain independent unimodal supervisions, effectively balancing the learning progress across different sub-tasks.

Results and Analysis
In this section, we make a detailed analysis and discussion about our experimental results.

Quantitative Results
As shown in Table 3, our model achieves state-of-the-art performance on all metrics on both datasets. In particular, consistent and significant improvement is observed over previous unaligned models. Even compared with aligned models, our model still achieves competitive or better results.
To investigate the impact of the amount of unlabeled video data, we pre-train on the VoxCeleb1 (132K) and VoxCeleb2 (947K) datasets; results are shown in Tables 4 and 5. It is evident that a larger amount of pre-training data leads to more significant performance improvements. Furthermore, we find that the performance improvement on the CMU-MOSEI dataset is larger than that on the CMU-MOSI dataset; for example, accuracy is relatively improved by 0.39% and 1.07%, respectively. The most likely reasons are that the CMU-MOSI dataset is too small and contains noisy labels. We therefore conjecture that it will be difficult to further improve performance on the CMU-MOSI dataset in the future.
In addition, to study the effect of different sizes of the backbone language model, we use the bert-base and bert-large models with 110M and 340M parameters, respectively. Results are shown in Tables 6 and 7. Performance improves as the pre-trained language model has more parameters and stronger expressiveness. This fits with our intuition and with the previous conclusion (Brown et al., 2020) that scaling up language models can greatly improve performance.

Ablation Study
To further explore the contributions of different components, we conduct an ablation study on the CMU-MOSI dataset; the results are shown in Table 8. Without SKESL, the model's accuracy and F1 score drop by 1.40% and 1.46%, respectively. This suggests that it is indeed useful to transfer sentiment knowledge mined from unlabeled video data to downstream prediction tasks. Furthermore, if the audio and visual modalities are not used, i.e., only the BERT language model is used for the sentiment prediction task, performance degrades further. This aligns with the observation in prior work (Tsai et al., 2019; Rahman et al., 2020; Qian et al., 2022) that multimodal sentiment analysis outperforms text-only sentiment analysis.

Case Study
To visually validate the reliability of our model, we present some examples in Figure 4. The examples are taken from the first three speaker videos of the test set of the CMU-MOSI dataset. The prediction results demonstrate that enriching pre-training models with the sentiment knowledge of words and non-verbal behavior facilitates the characterization of sentimental patterns in videos, thereby resulting in better performance on multimodal sentiment analysis.

Conclusion
In this paper, we presented sentiment knowledge enhanced self-supervised learning for MSA. We find that mining sentimental prior information from unlabeled video data leads to better predictions on labeled data: the larger the amount of unlabeled video data and the stronger the language modeling ability, the better the performance that can be achieved.

Limitations
We note several limitations of this sentiment knowledge enhanced self-supervised learning approach. First, preprocessing massive numbers of videos is time-consuming and laborious. Second, pre-training our model demands substantial GPU resources. Finally, the pre-training corpus should not contain too many videos without sentiment words; otherwise the model may develop a large bias and fail to learn any sentiment knowledge.

Figure 1 :
Figure 1: The pipeline of SKESL. The purple dashed box denotes an opinion video, which includes text, visual and audio modalities. Visual and audio modalities with shaded boxes are not used. The red circle represents the searched sentiment word.

Figure 2 :
Figure 2: The model framework of Sentiment Knowledge Enhanced Self-supervised Learning (SKESL). SKESL contains two parts: (1) Sentiment Word Masking searches for the most sentimentally salient word of an input sentence based on the sentiment lexicon, and generates a corrupted version by replacing it with a special token [MASK]. (2) Sentiment Intensity Prediction requires the model to infer the exact sentiment intensity according to contextual and non-verbal information.

Figure 3 :
Figure 3: The framework of multimodal fusion. The superscripts {T, A, V} denote the text, audio and visual modalities, respectively.

Figure 4 :
Figure 4: Examples from the CMU-MOSI dataset. For each example, we show the Ground Truth and the prediction output of the model with and without SKESL.

Table 2 :
Dataset statistics for fine-tuning and testing.

Table 3 :
Results of multimodal sentiment analysis on the CMU-MOSI and CMU-MOSEI datasets. NOTE: The unit of Acc and F1 is %. ↑ means higher is better, and ↓ the opposite. Results marked 3 are from (Hazarika et al., 2020) and ⊗ from (Yu et al., 2021); * denotes our reimplementation with the non-verbal sentiment features mentioned in Section 4.2. Best results are highlighted in bold.

Table 4 :
Results on the CMU-MOSI dataset with different amounts of pre-training data.

Table 5 :
Results on CMU-MOSEI dataset with different amounts of pre-training data.

Table 6 :
Results on CMU-MOSI dataset under different pre-trained language models.

Table 7 :
Results on CMU-MOSEI dataset under different pre-trained language models.