Learning Language-guided Adaptive Hyper-modality Representation for Multimodal Sentiment Analysis

Though Multimodal Sentiment Analysis (MSA) proves effective by utilizing rich information from multiple sources (e.g., language, video, and audio), the potential sentiment-irrelevant and conflicting information across modalities may hinder further performance improvement. To alleviate this issue, we present the Adaptive Language-guided Multimodal Transformer (ALMT), which incorporates an Adaptive Hyper-modality Learning (AHL) module to learn an irrelevance/conflict-suppressing representation from visual and audio features under the guidance of language features at different scales. With the obtained hyper-modality representation, the model can derive a complementary and joint representation through multimodal fusion for effective MSA. In practice, ALMT achieves state-of-the-art performance on several popular datasets (e.g., MOSI, MOSEI, and CH-SIMS), and abundant ablation studies demonstrate the validity and necessity of our irrelevance/conflict suppression mechanism.


Introduction
Multimodal Sentiment Analysis (MSA) focuses on recognizing the sentiment attitude of humans from various types of data, such as video, audio, and language. It plays a central role in several applications, such as healthcare and human-computer interaction (Jiang et al., 2020; Qian et al., 2019). Compared with unimodal methods, MSA methods are generally more robust because they exploit and explore the relationships between different modalities, showing significant advantages in improving the understanding of human sentiment.
Most recent MSA methods can be grouped into two categories: representation learning-centered methods (Hazarika et al., 2020; Yang et al., 2022; Yu et al., 2021; Han et al., 2021; Guo et al., 2022) and multimodal fusion-centered methods (Zadeh et al., 2017; Liu et al., 2018; Tsai et al., 2019a; Huang et al., 2020). The representation learning-centered methods mainly focus on learning refined modality semantics that contain rich and varied human sentiment clues, which can further improve the efficiency of multimodal fusion for relationship modelling. The multimodal fusion-centered methods, on the other hand, mainly focus on directly designing sophisticated fusion mechanisms to obtain a joint representation of multimodal data.

Figure 1: In multimodal sentiment analysis, the language modality is usually the dominant modality, while the audio and visual modalities do not contribute as much to performance as the language modality.
In addition, some works and corresponding ablation studies (Hazarika et al., 2020; Rahman et al., 2020; Guo et al., 2022) further imply that different modalities contribute unequally to recognition, with the language modality standing out as the dominant one. We note, however, that information from different modalities may be ambiguous and conflicting due to sentiment-irrelevant content, especially in the non-dominant modalities (e.g., lighting and head pose in video, background noise in audio). Such disruptive information can greatly limit the performance of MSA methods. We have observed this phenomenon on several datasets (see Section 4.5.1), and an illustration is given in Figure 1. To the best of our knowledge, no prior work has explicitly and actively taken this factor into account.

Motivated by the above observation, we propose a novel Adaptive Language-guided Multimodal Transformer (ALMT) to improve the performance of MSA by addressing the adverse effects of disruptive information in the visual and audio modalities. In ALMT, each modality is first transformed into a unified form by a Transformer with initialized tokens. This operation not only suppresses redundant information within each modality but also compresses long sequences to facilitate efficient computation. We then introduce an Adaptive Hyper-modality Learning (AHL) module that uses language features at different scales to guide the visual and audio modalities in producing an intermediate hyper-modality representation that contains less sentiment-irrelevant information. Finally, we apply a cross-modality fusion Transformer with language features serving as query and hyper-modality features serving as key and value. In this way, the complementary relations between the language, visual, and audio modalities are implicitly reasoned about, yielding robust and accurate sentiment predictions.

In summary, the major contributions of our work are as follows:

• We present a novel multimodal sentiment analysis method, namely the Adaptive Language-guided Multimodal Transformer (ALMT), which for the first time explicitly tackles the adverse effects of redundant and conflicting information in auxiliary modalities (i.e., the visual and audio modalities), achieving more robust sentiment understanding.
• We devise a novel Adaptive Hyper-modality Learning (AHL) module for representation learning. The AHL uses different scales of language features to guide the visual and audio modalities in forming a hyper-modality representation that complements the language modality.
• ALMT achieves state-of-the-art performance on several public and widely adopted datasets. We further provide in-depth analyses with rich empirical results to demonstrate the validity and necessity of the proposed approach.

Related Work
In this part, we briefly review previous work from two perspectives: multimodal sentiment analysis and Transformers.

Multimodal Sentiment Analysis
As mentioned in the section above, most previous MSA methods are mainly classified into two categories: representation learning-centered methods and multimodal fusion-centered methods.
For representation learning-centered methods, Hazarika et al. (2020) and Yang et al. (2022) treated representation learning of multiple modalities as a domain adaptation task. They respectively used metric learning and adversarial learning to learn modality-invariant and modality-specific subspaces for multimodal fusion, achieving advanced performance on several popular datasets. Han et al. (2021) proposed a framework named MMIM that improves multimodal fusion with hierarchical mutual information maximization. Rahman et al. (2020) and Guo et al. (2022) devised different architectures to enhance language representations by incorporating multimodal interactions between language and non-verbal behavioral information. However, these methods do not pay enough attention to the sentiment-irrelevant redundant information that is more likely to be present in the visual and audio modalities, which limits the performance of MSA.
For multimodal fusion-centered methods, Zadeh et al. (2017) proposed a tensor fusion network (TFN) that models the relationships between different modalities by computing their Cartesian product. Tsai et al. (2019a) and Huang et al. (2020) introduced multimodal Transformers to align sequences and model long-range dependencies between elements across modalities. However, these methods directly fuse information from the uni-modalities, which makes it easier for sentiment-irrelevant information to be introduced, thus yielding sub-optimal results.

Transformer
The Transformer is an attention-based building block for machine translation introduced by Vaswani et al. (2017). It learns the relationships between tokens by aggregating information from the entire sequence, showing excellent modeling ability in various tasks such as natural language processing, speech processing, and computer vision (Kenton and Toutanova, 2019; Carion et al., 2020; Chen et al., 2022; Liu et al., 2023a). In MSA, this technique has been widely used for feature extraction, representation learning, and multimodal fusion (Tsai et al., 2019a; Huang et al., 2020; Liu et al., 2023b; Yuan et al., 2021).

Figure 2: Processing pipeline of the proposed ALMT for multimodal sentiment analysis (MSA). With the multimodal input, we first apply three Transformer layers to embed modality features with low redundancy. Then, we employ an Adaptive Hyper-modality Learning (AHL) module to learn a hyper-modality representation from the visual and audio modalities under the guidance of language features at different scales. Finally, a Cross-modality Fusion Transformer is applied to incorporate hyper-modality features based on their relations to the language features, thus obtaining a complementary and joint representation for MSA.

Overview
The overall processing pipeline of the proposed Adaptive Language-guided Multimodal Transformer (ALMT) for robust multimodal sentiment analysis is shown in Figure 2. As illustrated, ALMT first extracts unified modality features from the input. Then, an Adaptive Hyper-modality Learning (AHL) module is employed to learn an adaptive hyper-modality representation under the guidance of language features at different scales. Finally, we apply a Cross-modality Fusion Transformer to synthesize the hyper-modality features with the language features as anchors, thus obtaining a language-guided hyper-modality network for MSA.

Multimodal Input
Regarding the multimodal input, each sample consists of language (l), audio (a), and visual (v) sources. Following previous works, we first obtain pre-computed feature sequences extracted by BERT (Kenton and Toutanova, 2019), Librosa (McFee et al., 2015), and OpenFace (Baltrusaitis et al., 2018), respectively. We denote these sequence inputs as U_m ∈ R^(T_m × d_m), where m ∈ {l, v, a}, T_m is the sequence length, and d_m is the feature dimension of modality m. In practice, T_m and d_m differ across datasets. For example, on the MOSI dataset, T_v, T_a, T_l, d_a, d_v, and d_l are 50, 50, 50, 5, 20, and 768, respectively.
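As a minimal sketch of this input format (dimensions taken from the MOSI example above), the three sequences can be represented as plain arrays. Note the values below are random placeholders, not real BERT/Librosa/OpenFace features:

```python
import numpy as np

# Sketch of the multimodal input on MOSI: U_m has shape (T_m, d_m).
# Real features would come from BERT (language), Librosa (audio),
# and OpenFace (visual); here we use random placeholders.
rng = np.random.default_rng(0)
U = {
    "l": rng.standard_normal((50, 768)),  # language: T_l=50, d_l=768
    "a": rng.standard_normal((50, 5)),    # audio:    T_a=50, d_a=5
    "v": rng.standard_normal((50, 20)),   # visual:   T_v=50, d_v=20
}
for m, x in U.items():
    print(m, x.shape)
```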

Modality Embedding
With the multimodal input U_m, we introduce three Transformer layers to unify the features of each modality. More specifically, we randomly initialize a low-dimensional token sequence H^0_m ∈ R^(T × d) for each modality and use the Transformer to embed the essential modality information into these tokens:

H^1_m = E^0_m(concat(H^0_m, U_m); θ_{E^0_m}),

where H^1_m is the unified feature of modality m with a size of T × d, E^0_m and θ_{E^0_m} respectively denote the modality feature extractor and its parameters, and concat(·) denotes the concatenation operation.

In practice, T and d are set to 8 and 128, respectively. The structure of the Transformer layer is the same as that of the Vision Transformer (ViT) (Dosovitskiy et al., 2021) with a depth of 1. Moreover, it is worth noting that transferring the essential modality information into initialized low-dimensional tokens helps reduce redundant information that is irrelevant to human sentiment, thus achieving higher efficiency with fewer parameters.
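A minimal PyTorch sketch of this embedding stage follows. It is an illustration, not the authors' code: the linear projection from d_m to d, the module names, and the use of a standard `nn.TransformerEncoderLayer` in place of the ViT-style block are all assumptions.

```python
import torch
import torch.nn as nn

class ModalityEmbedding(nn.Module):
    """Sketch: a learnable token sequence H0 (length T, width d) is
    concatenated with the projected input sequence and passed through one
    Transformer encoder layer; the first T output positions are kept as
    the unified feature H1. T=8, d=128, depth 1 follow the paper; the
    input projection is an assumption."""
    def __init__(self, d_in, T=8, d=128, heads=8):
        super().__init__()
        self.proj = nn.Linear(d_in, d)                    # map d_m -> d (assumed)
        self.tokens = nn.Parameter(torch.randn(1, T, d))  # H0_m
        self.encoder = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                                  batch_first=True)
        self.T = T

    def forward(self, U):                                 # U: (B, T_m, d_m)
        x = torch.cat([self.tokens.expand(U.size(0), -1, -1),
                       self.proj(U)], dim=1)              # concat(H0_m, U_m)
        return self.encoder(x)[:, :self.T]                # H1_m: (B, T, d)

emb = ModalityEmbedding(d_in=768)                         # e.g., language on MOSI
H1_l = emb(torch.randn(2, 50, 768))
print(H1_l.shape)  # torch.Size([2, 8, 128])
```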

Adaptive Hyper-modality Learning
After modality embedding, we further employ an Adaptive Hyper-modality Learning (AHL) module to learn a refined hyper-modality representation that suppresses irrelevant/conflicting information and highly complements the language features. The AHL module consists of two Transformer layers and three AHL layers, which learn language features at different scales and adaptively learn a hyper-modality feature from the visual and audio modalities under the guidance of those language features. In practice, we found that the language features significantly impact the modeling of the hyper-modality (see Section 4.5.4 for more details).

Construction of Two-scale Language Features
We define the feature H^1_l as the low-scale language feature. From it, we introduce two Transformer layers to learn language features at middle and high scales (i.e., H^2_l and H^3_l). Different from the Transformer layer in the modality embedding stage, which transfers essential information to an initialized token, the layers in this stage directly model the language features:

H^i_l = E^i_l(H^{i-1}_l; θ_{E^i_l}), i ∈ {2, 3},

where H^i_l is the language feature at scale i with a size of T × d, and E^i_l and θ_{E^i_l} denote the i-th Transformer layer for language feature learning and its parameters. In practice, we use 8-head attention to model the information of each modality.
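A sketch of the two extra language layers, assuming standard PyTorch Transformer encoder layers stand in for the paper's layers E^2_l and E^3_l:

```python
import torch
import torch.nn as nn

# Middle- and high-scale language features are produced by two further
# Transformer layers applied directly to H^1_l (no extra tokens are added).
# d=128 and 8 heads follow the paper; the layer type is an assumption.
d, heads = 128, 8
layer2 = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
layer3 = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)

H1_l = torch.randn(2, 8, d)   # unified language feature from the embedding stage
H2_l = layer2(H1_l)           # middle-scale: H^2_l
H3_l = layer3(H2_l)           # high-scale:   H^3_l
print(H3_l.shape)  # torch.Size([2, 8, 128])
```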

Adaptive Hyper-modality Learning Layer
With the language features H^i_l at different scales, we first initialize a hyper-modality feature H^0_hyper ∈ R^(T × d), and then update it by computing the relationships between the language features and the two remaining modalities using multi-head attention (Vaswani et al., 2017). As shown in Figure 3, using the extracted H^i_l as query and H^1_a as key, we obtain the similarity matrix α between the language and audio features:

α = softmax((H^i_l W_Ql)(H^1_a W_Ka)^T / √d_k),

where softmax denotes the weight normalization operation, and W_Ql, W_Ka ∈ R^(d × d_k) are learnable parameters. In practice, we use 8-head attention and set d_k to 16. Similarly, β denotes the similarity matrix between the language and visual modalities:

β = softmax((H^i_l W_Ql)(H^1_v W_Kv)^T / √d_k),

where W_Kv ∈ R^(d × d_k) is a learnable parameter. The hyper-modality feature H^j_hyper can then be updated with the weighted audio and visual features:

H^j_hyper = H^{j-1}_hyper + α(H^1_a W_Va) + β(H^1_v W_Vv), j ∈ {1, 2, 3},

where H^j_hyper ∈ R^(T × d) denotes the output of the j-th AHL layer, and W_Va ∈ R^(d × d_k) and W_Vv ∈ R^(d × d_k) are learnable parameters.
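The AHL update can be sketched in NumPy as follows (one forward pass with the paper's settings T=8, d=128, 8 heads, d_k=16; all weight names and the random initialization are illustrative assumptions, and the 8 heads are concatenated so that 8 × 16 = 128 = d):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Paper settings: token length T=8, width d=128, 8 heads with d_k=16 each.
T, d, n_heads, d_k = 8, 128, 8, 16
rng = np.random.default_rng(0)
H_l, H_a, H_v = rng.standard_normal((3, T, d))   # H^i_l, H^1_a, H^1_v
H_hyper = np.zeros((T, d))                        # H^0_hyper (initialized)

# Per-head projection weights (names are assumptions).
W_Ql, W_Ka, W_Kv, W_Va, W_Vv = rng.standard_normal((5, n_heads, d, d_k)) * 0.1

Q  = np.einsum('td,hdk->htk', H_l, W_Ql)          # language queries
Ka = np.einsum('td,hdk->htk', H_a, W_Ka)          # audio keys
Kv = np.einsum('td,hdk->htk', H_v, W_Kv)          # visual keys

# Similarity matrices alpha (language vs. audio) and beta (language vs. visual).
alpha = softmax(Q @ Ka.transpose(0, 2, 1) / np.sqrt(d_k))   # (heads, T, T)
beta  = softmax(Q @ Kv.transpose(0, 2, 1) / np.sqrt(d_k))

# Language-weighted audio/visual values; heads concatenated back to width d.
upd = (alpha @ np.einsum('td,hdk->htk', H_a, W_Va)
       + beta @ np.einsum('td,hdk->htk', H_v, W_Vv))        # (heads, T, d_k)
H_hyper = H_hyper + upd.transpose(1, 0, 2).reshape(T, n_heads * d_k)
print(H_hyper.shape)  # (8, 128)
```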

Multimodal Fusion and Output
In the multimodal fusion stage, we first obtain a new language feature H_l and a new hyper-modality feature H_hyper by concatenating an initialized token H^0 ∈ R^(1 × d) with H^3_l and H^3_hyper, respectively. We then apply a Cross-modality Fusion Transformer to transfer the essential joint and complementary information to these tokens. In practice, the Cross-modality Fusion Transformer fuses the language features H_l (serving as the query) and the hyper-modality features H_hyper (serving as the key and value), thus obtaining a joint multimodal representation H ∈ R^(1 × d) for final sentiment analysis. Denoting the Cross-modality Fusion Transformer as CrossTrans, the fusion process can be written as:

H = CrossTrans(H_l, H_hyper; θ_CrossTrans).

After the multimodal fusion, we obtain the final sentiment prediction ŷ by applying a classifier to the output H of the Cross-modality Fusion Transformer. In practice, we also use 8-head attention to model the relationships between the language modality and the hyper-modality. For more details of the Cross-modality Fusion Transformer, we refer readers to Tsai et al. (2019a).
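A hedged PyTorch sketch of this fusion stage: a single `nn.MultiheadAttention` layer and a linear regressor stand in for the full Cross-modality Fusion Transformer of Tsai et al. (2019a), and the module names are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch: a learnable token H0 is prepended to H^3_l (query side) and
    to H^3_hyper (key/value side); cross attention uses language as query
    and hyper-modality as key/value, and the first output position serves
    as the joint representation H fed to a sentiment regressor."""
    def __init__(self, d=128, heads=8):
        super().__init__()
        self.token = nn.Parameter(torch.randn(1, 1, d))   # H^0
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.head = nn.Linear(d, 1)                       # sentiment score

    def forward(self, H3_l, H3_hyper):                    # (B, T, d) each
        tok = self.token.expand(H3_l.size(0), -1, -1)
        q  = torch.cat([tok, H3_l], dim=1)                # language as query
        kv = torch.cat([tok, H3_hyper], dim=1)            # hyper as key/value
        H, _ = self.attn(q, kv, kv)
        return self.head(H[:, 0])                         # y_hat: (B, 1)

fusion = CrossModalFusion()
y_hat = fusion(torch.randn(4, 8, 128), torch.randn(4, 8, 128))
print(y_hat.shape)  # torch.Size([4, 1])
```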

Overall Learning Objectives
To summarize, our method involves only one learning objective, i.e., the sentiment prediction loss L:

L = (1/N_b) Σ_{n=1}^{N_b} |y_n − ŷ_n|,

where N_b is the number of samples in the training set, y_n is the sentiment label of the n-th sample, and ŷ_n is the prediction of our ALMT.
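Assuming an L1 (mean absolute error) form for the objective, a worked example of the loss over a small batch of N_b = 3 samples:

```python
import numpy as np

# L = (1/N_b) * sum_n |y_n - y_hat_n|: mean absolute error over a batch.
# The labels and predictions below are made-up illustrative numbers.
y     = np.array([2.0, -1.5, 0.0])   # sentiment labels in [-3, 3]
y_hat = np.array([1.5, -1.0, 0.5])   # model predictions
L = np.abs(y - y_hat).mean()
print(L)  # 0.5
```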
In addition, thanks to our simple optimization goal, ALMT is much easier to train than advanced methods with multiple optimization goals (Hazarika et al., 2020; Yu et al., 2021), without tuning extra hyper-parameters. More details are given in Section 4.5.10.

Datasets
MOSI. The dataset comprises 2,199 multimodal samples encompassing the visual, audio, and language modalities. Specifically, the training set consists of 1,284 samples, the validation set contains 229 samples, and the test set encompasses 686 samples. Each sample is assigned a sentiment score ranging from -3 (strongly negative) to 3 (strongly positive).
MOSEI. The dataset comprises 22,856 video clips collected from YouTube with diverse factors (e.g., spontaneous expressions, head poses,

Performance Comparison
Table 1 and Table 2 list the comparison results of our proposed method and state-of-the-art methods on the MOSI, MOSEI, and CH-SIMS, respectively.
As shown in Table 1, the proposed ALMT obtains state-of-the-art performance on almost all metrics. On the more difficult and fine-grained sentiment classification task (Acc-7), our model achieves remarkable improvements. For example, on the MOSI dataset, ALMT achieved a relative improvement of 1.69% over the second-best result, obtained by CHFN. This demonstrates that eliminating the redundancy of auxiliary modalities is essential for effective MSA.
Moreover, it is worth noting that the scenarios in CH-SIMS are more complex than those in MOSI and MOSEI, making the multimodal data more challenging to model. However, as shown in Table 2, ALMT achieved state-of-the-art performance on all metrics. For example, compared to the second-best approach, Self-MM, it achieved relative improvements of 1.44% on Acc-2 and 1.40% on the corresponding F1. Achieving such superior performance on CH-SIMS, with its more complex scenarios, demonstrates ALMT's ability to extract effective sentiment information from various scenarios.

Effects of Different Modalities
To better understand the influence of each modality in the proposed ALMT, Table 3 reports the ablation results of removing each modality from ALMT on the MOSI and CH-SIMS datasets, respectively. It shows that if the AHL is also removed on top of the modality subtraction, the performance decreases significantly on all metrics. This phenomenon demonstrates that AHL is beneficial in reducing the sentiment-irrelevant redundancy of the visual and audio modalities, thus improving the robustness of MSA.
In addition, we note that after removing the video and audio inputs, the performance of ALMT remains relatively high. Therefore, in the MSA task, we argue that eliminating the sentiment-irrelevant information present in the auxiliary modalities (i.e., the visual and audio modalities) and improving their contribution to performance deserve more attention.

Effects of Different Components
To verify the effectiveness of each component of ALMT, Table 4 presents the ablation results of removing each component on the MOSI and CH-SIMS datasets, respectively. We observe that deactivating the AHL (replacing it with feature concatenation) greatly decreases the performance, demonstrating that the language-guided hyper-modality representation learning strategy is effective. Moreover, removing the fusion Transformer or the modality embedding also degrades the performance, supporting that both can effectively improve ALMT's ability to explore the sentiment information in each modality.

Table 5 presents the experimental results of different query, key, and value settings in the fusion Transformer on the MOSI and MOSEI datasets, respectively. We observe that ALMT obtains better performance when aligning the hyper-modality features to the language features (i.e., using H^3_l as query and H^3_hyper as key and value). We attribute this to the fact that the language information is relatively clean and can provide more sentiment-relevant information for effective MSA.

Effects of the Guidance of Different Language Features in AHL
To discuss the effect of the guidance of different language features in AHL, we show the ablation results of different guidance settings on MOSI and CH-SIMS in Table 6. In practice, we replace the AHL layers that do not receive language guidance with MLP layers. Clearly, ALMT obtains the best performance when all scales of language features (i.e., H^1_l, H^2_l, H^3_l) are involved in guiding the hyper-modality learning.
In addition, we found that the model has more difficulty converging when AHL is removed. This indicates that sentiment-irrelevant and conflicting information in the visual and audio modalities may limit the improvement of the model.

Effects of Different Fusion Techniques
To analyze the effects of different fusion techniques, we conducted experiments whose results are shown in Table 7. On the MOSI dataset, using our Cross-modality Fusion Transformer to fuse the language and hyper-modality features is clearly the most effective. On the CH-SIMS dataset, although TFN achieves better performance on the MAE metric, its Acc-5 is lower. Overall, using a Transformer for feature fusion is an effective choice.

Analysis on Model Complexity
As shown in Table 8, we compare the number of parameters of ALMT with other state-of-the-art Transformer-based methods. Since different hyper-parameter configurations for each dataset may lead to slight differences in the parameter count, we calculated the model parameters under the hyper-parameter settings used on MOSI. ALMT obtains the best performance (Acc-7 of 49.42%) with the second-lowest parameter count (2.50M), showing that it achieves a better trade-off between accuracy and computational burden.

In Figure 4, we present the average attention matrices (i.e., α and β) on CH-SIMS. As shown, ALMT pays more attention to the visual modality, indicating that the visual modality provides more complementary information than the audio modality. In addition, as shown in Table 3, the performance of ALMT decreases more noticeably when the video input is removed than when the audio input is removed, which also suggests that the visual modality provides more complementary information.

Visualization of Robustness of AHL
To test the AHL's ability to perceive sentiment-irrelevant information, as shown in Figure 5, we visualize the attention weights (β) of the last AHL layer between the language features (H^3_l) and the visual features (H^1_v) on CH-SIMS. More specifically, we first randomly selected a sample from the test set, then added random noise to a peak frame (marked by the black dashed boxes) of H^1_v, and finally observed the change in the attention weights between H^3_l and H^1_v. When the random noise is added to the peak frame, the attention weights between the language and the corresponding peak frame decrease remarkably. This phenomenon demonstrates that AHL can suppress sentiment-irrelevant information, thus obtaining a more robust hyper-modality representation for multimodal fusion.

Visualization of Different Representations
In Figure 6, we visualize the hyper-modality representation H^3_hyper, the visual representation H^1_v, and the audio representation H^1_a in a 3D feature space using t-SNE (Van der Maaten and Hinton, 2008) on CH-SIMS. Clearly, a distribution gap exists between the audio and visual features, as well as within each modality. However, the hyper-modality representations learned from the audio and visual features converge to the same distribution, indicating that AHL can narrow the inter-/intra-modality distribution differences of the audio and visual representations, thus reducing the difficulty of multimodal fusion.

Visualization of Convergence Performance
In Figure 7, we compare the convergence behavior of ALMT with three state-of-the-art methods (i.e., MulT, MISA, and Self-MM) on CH-SIMS. We choose the MAE curve for comparison, as MAE indicates the model's ability to predict fine-grained sentiment. On the training set, although Self-MM converges the fastest, its MAE at convergence is larger than that of ALMT by the final epoch. On the validation set, ALMT is more stable, while the curves of the other methods show relatively more dramatic fluctuations. This demonstrates that ALMT is easier to train and has better generalization capability.

Conclusion
In this paper, a novel Adaptive Language-guided Multimodal Transformer (ALMT) is proposed to better model sentiment cues for robust Multimodal Sentiment Analysis (MSA). By effectively suppressing the adverse effects of redundant information in the visual and audio modalities, the proposed method achieves highly improved performance on several popular datasets. We further present rich in-depth studies investigating the reasons behind its effectiveness, which may help other researchers better handle MSA-related tasks.

Limitations
Our ALMT, as a Transformer-based model, usually has a large number of parameters. It requires comprehensive training and is thus subject to the size of the training datasets. As current sentiment datasets are typically small, the performance of ALMT may be limited. For example, compared to classification metrics such as Acc-7 and Acc-2, the more fine-grained regression metrics (i.e., MAE and Corr) may need more training data, resulting in relatively small improvements over other advanced methods.

Figure 3: An example of the Adaptive Hyper-modality Learning (AHL) layer.

Figure 4: Visualization of average attention weights from the last AHL layer on the CH-SIMS dataset. (a) Average attention matrix α between the language and audio modalities; (b) average attention matrix β between the language and visual modalities. Note: darker colors indicate higher attention weights.

Figure 5: Visualization of the attention weights between the language and visual modalities learned by the AHL for a randomly selected sample, with and without random noise, on the CH-SIMS dataset. (a) The attention weights without random noise; (b) the attention weights with random noise. Note: darker colors indicate higher attention weights.

Figure 6: Visualization of different representations in 3D space using t-SNE.

Figure 7: Visualization of convergence performance on the training and validation sets of CH-SIMS. (a) Comparison of MAE curves on the training set; (b) comparison of MAE curves on the validation set. Note: the results of the other methods were reproduced by the authors from open-source code with default hyper-parameters.

Table 3: Effects of different modalities. Note: the best result is highlighted in bold.

Table 4: Effects of different components. Note: the best result is highlighted in bold.

Table 5: Effects of different query, key, and value settings in the fusion Transformer.

Table 6: Effects of the guidance of different language features in AHL. Note: the best result is highlighted in bold.

Table 7: Effects of different fusion techniques.

Table 8: Analysis of model complexity. Note: the parameters of the other Transformer-based methods were calculated by the authors from open-source code with default hyper-parameters on MOSI.