Analyzing Modality Robustness in Multimodal Sentiment Analysis

Building robust multimodal models is crucial for achieving reliable deployment in the wild. Despite its importance, little attention has been paid to identifying and improving the robustness of Multimodal Sentiment Analysis (MSA) models. In this work, we address this gap by (i) proposing simple diagnostic checks for modality robustness in a trained multimodal model; using these checks, we find MSA models to be highly sensitive to a single modality, which creates issues in their robustness; and (ii) analyzing well-known robust training strategies to alleviate these issues. Critically, we observe that robustness can be achieved without compromising on the original performance. We hope our extensive study, performed across five models and two benchmark datasets, and the proposed procedures will make robustness an integral component of MSA research. Our diagnostic checks and robust training solutions are simple to implement and available at https://github.com/declare-lab/MSA-Robustness


Introduction
Multimodal Sentiment Analysis (MSA) is a burgeoning field of research that has seen accelerated developments in recent years. Numerous models have been proposed that utilize multiple modalities, such as audio, visual, and language signals, to predict sentiments, emotions, and other forms of affect.
While progress in MSA has been driven mainly by improvements in multimodal performance, we call for attention towards an equally important aspect of multimodal systems: multimodal robustness.
Robustness is crucial when models are deployed in the wild, where it is common to encounter inadvertent errors in the source modalities due to data loss, data corruption, jitter, and privacy issues, amongst others.
A well-known fact in MSA research is that the language modality tends to be the most effective, which has prompted models to utilize language as their core modality (Wu et al., 2021; Han et al., 2021a).

3 Testing Robustness via Diagnostic Checks
In this section, we perform an elaborate study on modality robustness by simulating potential issues with modality signals during testing (or deployment) of MSA models.

Proposed Diagnostic Checks
We propose two diagnostic checks that introduce i) missing modalities, which drop (or nullify) a modality from the input, and ii) noisy modalities, which corrupt it with white noise.¹ In each model, we locate the encoded language representation u_l and intervene on this representation, applying our diagnostics as follows. We sample 30% of u_l from the testing set² and modify them as û_l = f(u_l), where f(x) is defined as either f(x) = x ⊙ 0 for modality dropping (nulling the vector to zeros by elementwise multiplication) or f(x) = x + N(0, 1) to add white noise. The modified û_l is then fed to the rest of the network as usual.

¹ While missing and noisy errors are predominant in the wild, we leave other potential forms of errors, such as affine transformations to the representations, for future work.
² We set 30% arbitrarily to simulate modality errors on a proportion of the input signals.
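The perturbation function f and the 30% test-set sampling can be sketched as follows (a minimal NumPy illustration, not the authors' released code; function and variable names are our own):

```python
import numpy as np

def perturb(u, mode, rng):
    """f(x) for the two diagnostic checks:
    'missing' -> x ⊙ 0 (null the representation)
    'noisy'   -> x + N(0, 1) (add standard white noise)"""
    if mode == "missing":
        return u * 0.0
    if mode == "noisy":
        return u + rng.standard_normal(u.shape)
    raise ValueError(f"unknown mode: {mode}")

def apply_diagnostic(U, mode, frac=0.3, seed=0):
    """Perturb a random `frac` of the rows of U (n_samples x dim),
    mirroring the 30% sampling of the testing set."""
    rng = np.random.default_rng(seed)
    U = U.copy()
    idx = rng.choice(len(U), size=int(frac * len(U)), replace=False)
    U[idx] = perturb(U[idx], mode, rng)
    return U, idx
```

The perturbed representations are then passed to the remainder of the network unchanged.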
In the selected models, we apply the diagnostics at different network locations. These include the representations before the hidden projection, as in MISA, or before the fusion operation, as in Self-MM.

Observations. Fig. 2 presents the results. Across both the MOSI and MOSEI datasets, we find that all models are highly sensitive to modality errors in the language source. This trend is observed for both the missing and noisy modality checks, highlighting concerns over the robustness of these SOTA models. These diagnostic checks are easy to run and analyze, and we hope they will become an integral part of the model-development pipeline in MSA.
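As a toy illustration of where such an intervention sits in a model, the sketch below wires an optional perturbation into the language branch just before a (naive) fusion step. The model itself is a hypothetical stand-in, not MISA or Self-MM:

```python
import numpy as np

class ToyMSAModel:
    """Minimal stand-in for an MSA model with an optional
    intervention applied to the language representation
    before fusion (hypothetical; not the authors' code)."""
    def __init__(self, intervene=None):
        self.intervene = intervene  # callable applied to u_l, or None

    def forward(self, u_l, u_a, u_v):
        if self.intervene is not None:
            u_l = self.intervene(u_l)  # diagnostic applied pre-fusion
        # naive fusion: concatenate language/audio/visual features
        fused = np.concatenate([u_l, u_a, u_v], axis=-1)
        return fused.mean(axis=-1)     # dummy prediction head
```

Comparing predictions with and without the intervention quantifies the model's sensitivity to that modality.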

Robust Training
In this section, we explore how to reduce the sensitivity of the models to the dominant modality, i.e., language. A popular way to alleviate such issues is to expose the model to such scenarios during training. We dub this approach modality-perturbation; it is conceptually similar to removing modalities (Ma et al., 2021) or adding noise (Miyato et al., 2018). It simulates the modality errors during training so that the model learns to expect such events during testing/deployment. The procedure is as follows:

1. Training:
(a) For a particular batch of data, sample a proportion of the data to be perturbed.
(b) Similar to the diagnostic checks in § 3, perturb the dominant modality (in our case, language) of half of this data with the missing perturbation and the other half with the noisy perturbation.
(c) Repeat both these steps for the next batch.

2. Testing: Apply the diagnostic checks as in § 3.
This simple approach can be interpreted as regularization, akin to dropout or the noising strategies used in de-noising auto-encoders.
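A minimal sketch of the training-side perturbation, assuming the language features arrive as a NumPy batch (names are ours, not from the released code):

```python
import numpy as np

def perturb_language_batch(u_l, frac=0.3, rng=None):
    """Modality-perturbation for one training batch: sample `frac`
    of the rows; zero out half of them (missing) and add white
    noise to the other half (noisy)."""
    rng = rng or np.random.default_rng()
    x = u_l.copy()
    idx = rng.permutation(len(x))[: int(frac * len(x))]
    half = len(idx) // 2
    x[idx[:half]] = 0.0  # missing half: null the representations
    # noisy half: add standard white noise
    x[idx[half:]] += rng.standard_normal((len(idx) - half, x.shape[1]))
    return x
```

Calling this on each batch before the forward pass implements step 1 of the procedure; testing then reuses the diagnostic checks unchanged.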

Results
Robustness. Table 1 presents the results, where we perform balanced perturbation between missing and noisy modalities: of the 30% perturbable data in training, we drop the language modality on 15% and add noise to the other 15%. This setting improves the diagnostics under both kinds of errors.

Appendix C presents results on other proportions of the training data.
With balanced perturbation, (BBFN, MOSI) reduces the relative drop on the missing-language diagnostic by 31% (in F1) and by 98% on the noisy diagnostic. Likewise, the missing drop reduces by 11% in Corr and by 99% for the noisy diagnostic. (Self-MM, MOSI) slightly increases the relative drop in Corr on the missing diagnostic, but in all other cases the drop is significantly reduced; for example, the F1 drop on MOSI for the noisy diagnostic reduces by 93%.
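For clarity, the relative drop figures quoted above can be computed with a small helper (a hypothetical utility; the paper reports the metric, not this code):

```python
def relative_drop(clean, perturbed):
    """Relative drop (%) of a metric (e.g., F1 or Corr) between
    the clean testing setup and the perturbed diagnostic setup."""
    return 100.0 * (clean - perturbed) / clean

# Illustrative (made-up) numbers showing the drop shrinking:
baseline_drop = relative_drop(80.0, 60.0)  # 25.0% drop without robust training
robust_drop = relative_drop(80.0, 79.0)    # 1.25% drop with robust training
```

A "reduction of the relative drop" then compares these two percentages across training regimes.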

A Model Details
MISA: We get the MISA model from its official repository. In this model, we apply the interventions at the encoded language representation from the original paper; that is, the interventions are applied before the language representation is projected to its shared and private subspaces.

BBFN: We get the BBFN model from its official repository. In this model, we execute the interventions after the language embedding from the original paper.

Self-MM: We get the Self-MM model from its official repository. In this model, we apply the interventions after the language features encoded in the original paper.

For each model, we train the models to achieve performances close to those reported in the respective papers. Table 2 presents the hyper-parameters we used to reproduce their results.

Modality-Perturbation
We also analyze varying proportions of perturbation in the training and testing phases. As seen in Table 3, as the perturbation gradually increases from 5% to 15%, the drop in Corr on MOSI is gradually reduced, showing that robustness improves until it reaches the optimum at 30% perturbation (15% missing + 15% noisy). In the other models, 30% perturbation is also advantageous; for example, in Table 7, (MulT, MOSEI) reduces the Corr drop while improving F1 performance at 30% perturbation. Although the improvement is small at present, we believe more substantial gains are possible in future work.
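The proportion sweep described above can be scripted as follows; `train_and_diagnose` is a hypothetical stand-in for training with a given missing/noisy split and returning the diagnostic drop:

```python
def sweep_perturbation(train_and_diagnose, proportions=(0.0, 0.1, 0.2, 0.3)):
    """Run robust training at each total perturbation proportion
    (split evenly between missing and noisy, as in the paper's
    balanced setting) and collect the diagnostic metric drops."""
    results = {}
    for p in proportions:
        results[p] = train_and_diagnose(missing=p / 2, noisy=p / 2)
    return results
```

Comparing the collected drops across proportions reproduces the kind of analysis shown in Tables 3 and 7.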

Figure 1: Removing modalities one at a time from the testing set of CMU-MOSI (Zadeh et al., 2016) on a trained MISA (Hazarika et al., 2020).

3.1 Experiment Setup

Models. In order to fully verify the universality of our experiments, we select a series of diverse SOTA models, ranging from RNN-based to Transformer-based architectures. These models work across different granularities, from word-level to sentence-level variants: (i) MISA (Hazarika et al., 2020) is a popular model that generates modality-invariant and -specific features of multimodal data to learn both shared and unique characteristics of each modality. (ii) BBFN (Han et al., 2021a), in a similar vein, performs fusion and separation to increase cross-modal relevances and differences. This work acknowledges the dominance of the text modality in MSA and proposes two text-centric bi-modal transformers to increase performance. (iii) Self-MM (Yu et al., 2021) focuses on the relationship between multi- and uni-modal predictions by multitasking consistencies and differences between them. (iv) MMIM (Han et al., 2021b) incorporates mutual information (MI) into MSA by maximizing MI at the input and fusion levels. (v) MulT (Tsai et al., 2019) merges multimodal time series through multiple sets of directional pairwise cross-modal transformers. It accounts for long-range dependencies across modality elements to create a strong baseline (see Appendix B).

Datasets. We consider two benchmark datasets widely used in multimodal sentiment analysis: CMU-MOSI (Zadeh et al., 2016), a popular dataset for studying the intensity of multimodal sentiment, and CMU-MOSEI (Bagher Zadeh et al., 2018), a larger counterpart of MOSI with richer annotations and more diverse samples. Both datasets contain short utterance videos and provide language, audio, and visual modality features.

For MulT and BBFN, we apply the interventions right after the word embeddings. A detailed discussion on the location of interventions is provided in Appendix A.

Table 1 also shows that our method performs well on both RNN-based and Transformer-based models, demonstrating its wide applicability.

Performance Trade-off. While improving robustness via regularization is well known in the literature, there is often a trade-off with absolute performance in the original testing setup; most approaches that achieve robustness take a hit on their best performance on clean input (Zhang et al., 2022; Nakkiran, 2019; Su et al., 2018; Tsipras et al., 2019). This raises the question of whether introducing modality-perturbation reduces the performance of the model on the original testing set. We find the answer is no: surprisingly, our robust training procedure does not degrade performance in the original setting.


Figure 3: Diagnostic checks (noisy modality) for modality robustness in the MOSI and MOSEI datasets. Results are averaged over three independent runs. Each modality error is applied to 30% of the testing data.

Table 2: Hyper-parameter configurations used to train the models.

MMIM: We get the MMIM model from its official repository. For this model, we perform the interventions after the encoded language representation from the original paper: x_l = BERT(X_l; θ_l^BERT) (Eq. 8).

Table 3: MISA Robust Training. Results are averaged over three random runs.

Table 4: BBFN Robust Training. Results are averaged over three random runs.

Table 5: Self-MM Robust Training. Results are averaged over three random runs.

Table 6: MMIM Robust Training. Results are averaged over three random runs.

Table 7: MulT Robust Training. Results are averaged over three random runs.