Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis

In multimodal sentiment analysis (MSA), model performance highly depends on the quality of the synthesized embeddings. These embeddings are generated by an upstream process called multimodal fusion, which aims to extract and combine the input unimodal raw data to produce a richer multimodal representation. Previous work either back-propagates the task loss or manipulates the geometric properties of feature spaces to produce favorable fusion results, neglecting the preservation of critical task-related information that flows from the inputs to the fusion results. In this work, we propose a framework named MultiModal InfoMax (MMIM), which hierarchically maximizes the Mutual Information (MI) in unimodal input pairs (inter-modality) and between the multimodal fusion result and the unimodal inputs, in order to maintain task-related information through multimodal fusion. The framework is trained jointly with the main MSA task to improve downstream performance. To address the intractability of the MI bounds, we further formulate a set of computationally simple parametric and non-parametric methods to approximate their true values. Experimental results on two widely used datasets demonstrate the efficacy of our approach.


Introduction
With the unprecedented advances of social media in recent years and the availability of smartphones with high-quality cameras, we have witnessed an explosive growth of multimodal data, such as movies and short-form videos. In real life, multimodal data usually consists of three channels: visual (image), acoustic (voice), and transcribed text. Much of it expresses some form of sentiment, which is a long-term disposition evoked when a person encounters a specific topic, person, or entity (Deonna and Teroni, 2012; Poria et al., 2020). Mining and understanding these emotional elements in multimodal data, namely multimodal sentiment analysis (MSA), has become a hot research topic because of its numerous appealing applications, such as obtaining overall product feedback from customers or gauging polling intentions from potential voters (Melville et al., 2009). Generally, different modalities in the same data segment are often complementary to each other, providing extra cues for semantic and emotional disambiguation (Ngiam et al., 2011). The crucial part of MSA is multimodal fusion, in which a model aims to extract and integrate information from all input modalities to understand the sentiment behind the observed data. Existing methods to learn unified representations fall into two categories: loss back-propagation and geometric manipulation in the feature spaces. The former tunes the parameters based only on gradients back-propagated from the task loss (Tsai et al., 2019a), a reconstruction loss (Mai et al., 2020), or auxiliary task losses (Yu et al., 2021). The latter additionally rectifies the spatial orientation of unimodal or multimodal representations by matrix decomposition (Liu et al., 2018) or Euclidean measure optimization (Sun et al., 2020; Hazarika et al., 2020).
Although these methods have achieved excellent results on MSA tasks, they are limited by the lack of control over the information flow from the raw inputs to the fusion embeddings, which risks losing useful information and introducing unexpected noise carried by each modality (Tsai et al., 2020). To alleviate this issue, and unlike previous work, we leverage mutual information (MI), a concept from information theory. MI measures the dependencies between paired multi-dimensional variables. Maximizing MI has been demonstrated to be efficacious in removing redundant information irrelevant to the downstream task and in capturing invariant trends or messages across time or different domains (Poole et al., 2019), and has achieved remarkable success in the field of representation learning (Veličković et al., 2018). Building on this experience, we propose MultiModal InfoMax (MMIM), a framework that hierarchically maximizes mutual information in multimodal fusion. Specifically, we enhance two types of mutual information in representation pairs: between unimodal representations, and between the fusion result and the low-level unimodal representations. Due to the intractability of mutual information (Belghazi et al., 2018), researchers typically maximize a tractable lower bound on MI instead. However, some terms in the expressions of these lower bounds remain difficult to compute in our formulation. Hence, for convenient and accurate estimation of these terms, we propose a hybrid approach composed of parametric and non-parametric parts based on data and model characteristics. The parametric part refers to neural network-based methods, and in the non-parametric part we exploit a Gaussian Mixture Model (GMM) with learning-free parameter estimation. Our contributions can be summarized as follows: 1. We propose a hierarchical MI maximization framework for multimodal sentiment analysis.
MI maximization occurs at the input level and the fusion level to reduce the loss of valuable task-related information. To the best of our knowledge, this is the first attempt to bridge MI and MSA.
2. We formulate the computation details in our framework to solve the intractability problem. The formulation includes parametric learning and non-parametric GMM with stable and smooth parameter estimation.
3. We conduct comprehensive experiments on two publicly available datasets and obtain superior or comparable results relative to state-of-the-art models.

Related Work
In this section, we briefly overview some related work in multimodal sentiment analysis and mutual information estimation and application.

Multimodal Sentiment Analysis (MSA)
MSA is an NLP task that collects and processes data from multiple sources, such as acoustic, visual, and textual information, to comprehend varied human emotions (Morency et al., 2011). Early fusion models adopted simple network architectures, such as RNN-based models (Wöllmer et al., 2013) that capture temporal dependencies from low-level multimodal inputs, and SAL-CNN, which designed a select-additive learning procedure to improve the generalizability of trained neural networks. Meanwhile, there were many attempts to combine geometric measures as auxiliary learning goals in deep learning frameworks. For instance, Hazarika et al. (2018) and Sun et al. (2020) optimized the deep canonical correlation between modality representations for fusion and then passed the fusion result to downstream tasks. More recently, formulations influenced by novel machine learning topics have emerged constantly: Akhtar et al. (2019) presented a deep multi-task learning framework to jointly learn sentiment polarity and emotional intensity in a multimodal setting. Pham et al. (2019) proposed a method that cyclically translates between modalities to learn robust joint representations for sentiment analysis. Tsai et al. (2020) proposed a routing procedure that dynamically adjusts weights among modalities to provide interpretability for multimodal fusion. Motivated by advances in the field of domain separation, Hazarika et al. (2020) projected modality features into private and common feature spaces to capture exclusive and shared characteristics across different modalities. Yu et al. (2021) designed a multi-label training scheme that generates extra unimodal labels for each modality and trains them concurrently with the main task.
In this work, we build a hierarchical MI-maximization guided model to improve the fusion outcome as well as the performance on the downstream MSA task, where MI maximization is applied not only between unimodal representations but also between fusion embeddings and unimodal representations.

Mutual Information in Deep Learning
Mutual information (MI) is a concept from information theory that estimates the relationship between pairs of variables. It is a reparameterization-invariant measure of dependency (Tishby and Zaslavsky, 2015) defined as:

$$I(X; Y) = \mathbb{E}_{p(x, y)}\left[\log \frac{p(x, y)}{p(x)\, p(y)}\right] \quad (1)$$

(see also Amjad and Geiger, 2019). However, since direct MI estimation in high-dimensional spaces is nearly impossible, many works attempt to approximate the true value with variational bounds (Belghazi et al., 2018; Cheng et al., 2020; Poole et al., 2019). In our work, we apply MI lower bounds at both the input level and the fusion level, and formulate or reformulate estimation methods for these bounds based on data characteristics and the mathematical properties of the terms to be estimated.
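For intuition, the definition in (1) can be evaluated exactly for discrete variables. The sketch below (illustrative, not part of the proposed model) computes MI in nats from a joint distribution table:

```python
import numpy as np

def mutual_information(joint):
    """MI in nats from a discrete joint distribution table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = joint > 0                          # skip zero-probability cells
    return float((joint[nz] * np.log(joint[nz] / (px * py)[nz])).sum())

# Independent variables: p(x, y) = p(x) p(y), so MI is 0
mi_indep = mutual_information([[0.25, 0.25], [0.25, 0.25]])
# Perfectly dependent variables: MI equals H(X) = log 2
mi_dep = mutual_information([[0.5, 0.0], [0.0, 0.5]])
assert abs(mi_indep) < 1e-12
assert abs(mi_dep - np.log(2)) < 1e-12
```

The continuous, high-dimensional case has no such closed form, which is why the variational bounds discussed below are needed.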

Problem Definition
In MSA tasks, the input to a model consists of unimodal raw sequences $X_m \in \mathbb{R}^{l_m \times d_m}$ drawn from the same video fragment, where $l_m$ is the sequence length and $d_m$ is the representation vector dimension of modality $m$. In particular, in this paper $m \in \{t, v, a\}$, where $t, v, a$ denote the three modalities (text, visual, and acoustic) obtained from the datasets. The goal of the model is to extract and integrate task-related information from these inputs to form a unified representation, and then use it to make accurate predictions of the ground-truth value $y$, which reflects the sentiment strength.

Overall Architecture
As shown in Figure 1, our model first processes the raw input into numerical sequential vectors with feature extractors (fixed modules with no parameters to train, for visual and acoustic) and a tokenizer (for text). Then we encode them into individual unit-length representations. The model then works in two collaborative parts, fusion and MI maximization, marked by solid and dashed lines in Figure 1, respectively. In the fusion part, a fusion network F of stacked linear-activation layers transforms the unimodal representations into the fusion result Z, which is then passed through a regression multilayer perceptron (MLP) for final predictions. In the MI part, the MI lower bounds at two levels (input level and fusion level) are estimated and boosted. The two parts work concurrently to produce task- and MI-related losses for back-propagation, through which the model learns to infuse task-related information into the fusion results as well as to improve the accuracy of predictions in the main task.
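The fusion path described above can be sketched as follows. This is a minimal PyTorch illustration, not the released implementation; the layer widths and activation choices are assumptions for the example:

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Sketch of the fusion network F (stacked linear-activation layers)
    mapping concatenated unimodal representations to the fusion result Z,
    followed by a regression MLP head. Dimensions are illustrative."""
    def __init__(self, d_t=768, d_v=32, d_a=32, d_fusion=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(d_t + d_v + d_a, d_fusion), nn.ReLU(),
            nn.Linear(d_fusion, d_fusion), nn.Tanh(),
        )
        self.regress = nn.Sequential(
            nn.Linear(d_fusion, 64), nn.ReLU(), nn.Linear(64, 1),
        )

    def forward(self, h_t, h_v, h_a):
        z = self.fuse(torch.cat([h_t, h_v, h_a], dim=-1))  # fusion result Z
        return z, self.regress(z).squeeze(-1)              # Z and sentiment score

net = FusionNet()
z, y_hat = net(torch.randn(4, 768), torch.randn(4, 32), torch.randn(4, 32))
assert z.shape == (4, 128) and y_hat.shape == (4,)
```

In the full model, Z is additionally supervised by the MI-related losses described in the following sections.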

Modality Encoding
We first encode the multimodal sequential input $X_m$ into unit-length representations $h_m$. Specifically, we use BERT (Devlin et al., 2019) to encode an input sentence and extract the head embedding from the last layer's output as $h_t$. For visual and acoustic, following previous works (Hazarika et al., 2020; Yu et al., 2021), we employ two modality-specific unidirectional LSTMs (Hochreiter and Schmidhuber, 1997) to capture the temporal features of these modalities, taking the final hidden states as the representations:

$$h_m = \mathrm{LSTM}(X_m;\, \theta_m^{\mathrm{lstm}}), \quad m \in \{v, a\}.$$
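A minimal sketch of such a visual/acoustic encoder is shown below (the text side, handled by a pretrained BERT, is omitted). The input dimension 35 is purely illustrative:

```python
import torch
import torch.nn as nn

class ModalityLSTM(nn.Module):
    """Unidirectional LSTM encoder for a visual/acoustic stream; the final
    hidden state serves as the sequence-level representation h_m."""
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.lstm = nn.LSTM(d_in, d_hidden, batch_first=True)

    def forward(self, x):              # x: (batch, seq_len, d_in)
        _, (h_n, _) = self.lstm(x)     # h_n: (1, batch, d_hidden)
        return h_n.squeeze(0)          # h_m: (batch, d_hidden)

enc_v = ModalityLSTM(d_in=35, d_hidden=32)
h_v = enc_v(torch.randn(4, 50, 35))    # batch of 4 clips, 50 frames each
assert h_v.shape == (4, 32)
```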

Inter-modality MI Maximization
For a modality representation pair X, Y coming from a single video clip, although they appear to be independent sequences, there is a certain correlation between them (Arandjelovic and Zisserman, 2017). Formally, suppose we have a collection of videos V and assume that their prior distribution is known. Then the joint distribution of X and Y can be decomposed via the sampling process over V as

$$p(x, y) = \sum_{v \in V} p(x \mid v)\, p(y \mid v)\, p(v), \quad (2)$$

i.e., X and Y are conditionally independent given V. Unless this joint distribution factorizes as p(x)p(y), the MI is never trivially 0.
Given the analysis above, we hope that by promoting MI between the multimodal inputs we can filter out modality-specific random noise that is irrelevant to our task and retain modality-invariant content that spans all modalities as much as possible. As stated before, we maximize a tractable lower bound instead of computing MI directly. We exploit an accurate and straightforward MI lower bound introduced in Barber and Agakov (2004). It approximates the true conditional distribution p(y|x) with a variational counterpart q(y|x):

$$I(X; Y) \geq \mathbb{E}_{p(x, y)}\left[\log q(y \mid x)\right] + H(Y), \quad (3)$$

where H(Y) is the differential entropy of Y. This lower bound is tight, i.e., there is no gap between the bound and the true value, when q(y|x) = p(y|x). In our implementation, we optimize the bounds for two modality pairs, (text, visual) and (text, acoustic). In each pair, we treat text as X and the other modality as Y in (3). We do so because 1) since we have to train a predictor q(y|x) to approximate p(y|x), prediction from higher-dimensional vectors $h_t \in \mathbb{R}^{d_t}$ ($d_t$ = 768) to lower-dimensional ones $h_v \in \mathbb{R}^{d_v}$ and $h_a \in \mathbb{R}^{d_a}$ ($d_v, d_a < 50$) converges faster and with higher accuracy; 2) many previous works (Tsai et al., 2019a; Hazarika et al., 2020) pointed out that, empirically, the text modality is predominant and can integrate more task-related features than the other modalities at this step. We additionally examine the efficacy of this design choice in the ablation study. Following Cheng et al. (2020), we formulate q(y|x) as a multivariate Gaussian distribution $q_\theta(y|x) = \mathcal{N}(y \mid \mu_{\theta_1}(x), \sigma^2_{\theta_2}(x) I)$, with two neural networks parameterized by $\theta_1$ and $\theta_2$ to predict the mean and variance, respectively. The loss function for likelihood maximization is:

$$\mathcal{L}_{lld} = -\frac{1}{N} \sum_{i=1}^{N} \left[\log q_\theta(h_v^i \mid h_t^i) + \log q_\theta(h_a^i \mid h_t^i)\right], \quad (4)$$

where N is the training batch size and the sum aggregates the log-likelihoods of the two predictors (the tv and ta pairs).
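The Gaussian-parameterized predictor can be sketched as below. This is a simplified illustration (single linear layers for the mean and log-variance heads; the actual networks and their depths are not specified here), with the constant term of the Gaussian log-density dropped:

```python
import torch
import torch.nn as nn

class GaussianPredictor(nn.Module):
    """Variational q(y|x) = N(mu(x), diag(sigma^2(x))): two small nets
    predict the mean and log-variance of y given x (a sketch)."""
    def __init__(self, d_x, d_y):
        super().__init__()
        self.mu = nn.Linear(d_x, d_y)
        self.logvar = nn.Sequential(nn.Linear(d_x, d_y), nn.Tanh())

    def neg_loglik(self, x, y):
        mu, logvar = self.mu(x), self.logvar(x)
        # per-dimension Gaussian NLL: 0.5 * (logvar + (y - mu)^2 / var),
        # additive constants omitted since they do not affect gradients
        return 0.5 * (logvar + (y - mu) ** 2 / logvar.exp()).sum(-1).mean()

q_tv = GaussianPredictor(d_x=768, d_y=32)   # text -> visual predictor
loss = q_tv.neg_loglik(torch.randn(8, 768), torch.randn(8, 32))
assert loss.ndim == 0 and torch.isfinite(loss)
```

Minimizing this negative log-likelihood over both modality pairs corresponds to the likelihood-maximization loss above.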
For the entropy term H(Y ), we solve its computation with the Gaussian Mixture Model (GMM), a commonly utilized approach for unknown distribution approximation that can facilitate distribution-based estimation (Nilsson et al., 2002;Kerroum et al., 2010). GMM builds up multiple Gaussian distributions for different property classes. We choose the sentiment polarity (non-negative/negative), which is a natural property in the datasets, as the classification criterion, which can also balance the trade-off between estimation accuracy (requires more classes) and computational cost (requires fewer classses). We build up two normal distributions N pos (µ 1 , Σ 1 ) and N neg (µ 2 , Σ 2 ) for each class, where µ is the mean vector and Σ is the covariance matrix. The parameters are estimated via the maximum likelihood method on a sufficiently large sampling batch D s ⊂ D train : where c ∈ {pos, neg} represents the polarity class that the sample belongs to, N c is the number of samples in class c and is component-wise multiplication. The entropy of a multivariate normal distribution is given by: where k is the dimensionality of the vectors in GMM and det(Σ) is the determinant of Σ. Based on the nearly equal frequencies of the two polarity classes in the dataset, we assume the prior probability that one data point x = (x 1 , ..., x k ) belongs to each is equal, i.e., w pos = p(x ∈ pos) = w neg = p(x ∈ neg) = 1 2 . Under the assumption that the two sub-distributions are disjoint, from where h c is the entropy of the sub-distribution for class c. Taking the lower bound as an approximation, we obtain the entropy term for the MI lower bound: In this formulation, we implicitly assume that the prior probabilities of the two classes are equal. We further notice that H(Y ) changes every time during each training epoch but at a very slow pace in several continuous steps due to the small gradients and consequently slight fluctuation in parameters. 
This demands that we update the parameters in a timely manner to ensure estimation accuracy. Besides, according to statistical theory, we should increase the sample size to reduce estimation error, but the maximum batch size is restricted by the GPU's capacity. Considering this, we indirectly enlarge $D_s$ by including data from the recent history; in implementation, we store such data in a history data memory. The loss function for MI lower bound maximization at this level is given by:

$$\mathcal{L}_{BA} = -\left(\hat{\mathbb{E}}\left[\log q_\theta(h_v \mid h_t)\right] + \hat{H}(h_v) + \hat{\mathbb{E}}\left[\log q_\theta(h_a \mid h_t)\right] + \hat{H}(h_a)\right), \quad (9)$$

i.e., the negative sum of the two estimated lower bounds from (3).
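The entropy estimation with a history data memory can be sketched as follows. This is an illustrative NumPy version under simplifying assumptions (full-covariance Gaussian fit, a fixed-size buffer of recent batches, and a small diagonal regularizer for numerical stability):

```python
import numpy as np

def gaussian_entropy(samples):
    """Entropy of a Gaussian fit to samples, per (6):
    H = k/2 * log(2*pi*e) + 1/2 * log det(Sigma)."""
    k = samples.shape[1]
    cov = np.cov(samples, rowvar=False) + 1e-6 * np.eye(k)  # regularize
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * k * np.log(2 * np.pi * np.e) + 0.5 * logdet

class HistoryMemory:
    """Keeps representations from recent steps so (mu, Sigma) are estimated
    on an enlarged sample D_s without exceeding the GPU batch size."""
    def __init__(self, max_batches=4):
        self.buf, self.max_batches = [], max_batches
    def add(self, batch):
        self.buf = (self.buf + [batch])[-self.max_batches:]
    def samples(self):
        return np.concatenate(self.buf, axis=0)

mem = HistoryMemory(max_batches=4)
rng = np.random.default_rng(0)
for _ in range(6):                        # 6 steps of 32 samples each
    mem.add(rng.normal(size=(32, 8)))
H = gaussian_entropy(mem.samples())       # pooled over the last 4 batches
assert mem.samples().shape == (128, 8) and np.isfinite(H)
```

In the full method this estimate would be computed once per polarity class and averaged with equal weights, per (8).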

MI Maximization in the Fusion Level
To enforce the intermediate fusion results to capture modality-invariant cues among modalities, we repeat MI maximization between the fusion results and the input modalities. The optimization target is the fusion network F that produces the fusion result $Z = F(X_t, X_v, X_a)$. Since we already have a generation path from $X_m$ to Z, we expect an opposite path, i.e., to reconstruct $X_m$, $m \in \{t, v, a\}$, from Z. Inspired by, but different from, Oord et al. (2018), we use a score function that acts on the normalized prediction and truth vectors to gauge their correlation:

$$s(Z, h_m) = \left(\frac{G_\phi(Z)}{\|G_\phi(Z)\|_2}\right)^{\top} \left(\frac{h_m}{\|h_m\|_2}\right), \quad (10)$$

where $G_\phi$ is a neural network with parameters $\phi$ that generates a prediction of $h_m$ from Z, and $\|\cdot\|_2$ is the Euclidean norm, by dividing which we obtain unit-length vectors. We normalize because we find that the model otherwise tends to stretch both vectors to maximize the score. With this score we construct a noise-contrastive loss over the batch:

$$\mathcal{L}_N^{z,m} = -\mathbb{E}\left[\log \frac{\exp\left(s(Z, h_m)\right)}{\sum_{\bar{h}_m \in H_m} \exp\left(s(Z, \bar{h}_m)\right)}\right], \quad (11)$$

where $H_m$ contains the representations of modality m from all samples in the batch. Here is a short explanation of the rationale of this formulation. Contrastive Predictive Coding (CPC) scores the MI between context and future elements "across the time horizon" to keep the portion of "slow features" that span many time steps (Oord et al., 2018). Similarly, in our model, we ask the fusion result Z to reversely predict representations "across modalities" so that more modality-invariant information can be passed to Z. Besides, by aligning the prediction to each modality, we enable the model to decide how much information it should receive from each modality. This insight is further discussed with experimental evidence in Section 5.2. The loss function for this level is given by:

$$\mathcal{L}_{CPC} = \mathcal{L}_N^{z,t} + \mathcal{L}_N^{z,v} + \mathcal{L}_N^{z,a}. \quad (12)$$

Training
The training process consists of two stages in each iteration. In the first stage, we approximate p(y|x) with q(y|x) by minimizing the negative log-likelihood of the inter-modality predictors with the loss in (4). In the second stage, the hierarchical MI lower bounds from the previous subsections are added to the main loss as auxiliary losses. After obtaining the final prediction ŷ, along with the truth value y, we have the task loss:

$$\mathcal{L}_{task} = \mathrm{MAE}(\hat{y}, y), \quad (13)$$

where MAE stands for the mean absolute error loss, a common practice in regression tasks. Finally, we calculate the weighted sum of all these losses to obtain the main loss for this stage:

$$\mathcal{L}_{main} = \mathcal{L}_{task} + \alpha \mathcal{L}_{BA} + \beta \mathcal{L}_{CPC}, \quad (14)$$

where α, β are hyper-parameters that control the impact of MI maximization. We summarize the training algorithm in Algorithm 1:

Algorithm 1: Two-stage training of MMIM.
for each training iteration do
    Stage 1 (predictor training):
        Compute L_lld as in (4)
        Update the predictor parameters: θ_q ← θ_q − η_lld ∇_θ L_lld
    Stage 2 (MI-maximization joint training):
        Estimate the entropy terms and compute L_BA as in (8), (9)
        Produce fusion results Z_i = F(X_t^i, X_v^i, X_a^i) and predictions ŷ
        Compute L_N and L_CPC as in (10), (11), (12)
        Compute L_main as in (14)
        Update all parameters in the model except q: θ_k ← θ_k − η_k ∇_θ L_main
end
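The two-stage schedule can be sketched as a toy training step. Everything here is a stand-in (a tiny model, squared-error surrogates for the MI losses); only the alternation between fitting q and updating the remaining parameters is the point:

```python
import torch
import torch.nn as nn

class ToyMMIM(nn.Module):
    """Minimal stand-in: a fusion layer (main-task parameters) and a
    predictor q. The losses below are toy surrogates."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Linear(6, 1)   # main-task parameters
        self.q = nn.Linear(4, 2)      # variational predictor q(y|x)

def train_step(model, x_t, x_v, y, opt_q, opt_main, alpha=0.1):
    # Stage 1: fit q by minimizing a (toy) negative log-likelihood
    l_lld = ((model.q(x_t) - x_v) ** 2).mean()
    opt_q.zero_grad(); l_lld.backward(); opt_q.step()

    # Stage 2: task loss plus a weighted MI surrogate, q held fixed
    y_hat = model.fuse(torch.cat([x_t, x_v], -1)).squeeze(-1)
    l_task = (y_hat - y).abs().mean()                    # MAE task loss
    l_ba = ((model.q(x_t).detach() - x_v) ** 2).mean()   # stands in for L_BA
    l_main = l_task + alpha * l_ba
    opt_main.zero_grad(); l_main.backward(); opt_main.step()
    return l_task.item()

model = ToyMMIM()
opt_q = torch.optim.Adam(model.q.parameters(), lr=1e-3)
opt_main = torch.optim.Adam(model.fuse.parameters(), lr=1e-3)
x_t, x_v, y = torch.randn(8, 4), torch.randn(8, 2), torch.randn(8)
losses = [train_step(model, x_t, x_v, y, opt_q, opt_main) for _ in range(3)]
assert len(losses) == 3 and all(l >= 0 for l in losses)
```

Keeping q's update separate from the main update mirrors Algorithm 1: q chases the current encoder outputs, while the encoders and fusion network are trained against the combined loss.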

Experiments
In this section, we present experimental details, including datasets, baselines, feature extraction toolkits, and results.

Split        MOSI    MOSEI
Train        1284    16326
Validation    229     1871
Test          686     4659
All          2199    22856

Table 1: Dataset split.

Datasets and Metrics
We conduct experiments on two publicly available academic datasets widely used in MSA research: CMU-MOSI (Zadeh et al., 2016) and CMU-MOSEI (Zadeh et al., 2018). CMU-MOSI contains 2199 utterance video segments sliced from 93 videos in which 89 distinct narrators share opinions on various topics. Each segment is manually annotated with a sentiment value ranging from -3 to +3, indicating the polarity (by positive/negative) and the relative strength (by absolute value) of the expressed sentiment. The CMU-MOSEI dataset upgrades CMU-MOSI by expanding the dataset size. It consists of 23,454 movie review video clips from YouTube, with the same labeling scheme as CMU-MOSI. We provide the split specifications of both datasets in Table 1.
We use the same metric set that has been consistently presented and compared in prior work: mean absolute error (MAE), the average absolute difference between predicted and truth values; Pearson correlation (Corr), which measures the degree of prediction skew; seven-class classification accuracy (Acc-7), the proportion of predictions that fall into the same one of seven intervals between -3 and +3 as the corresponding truths; and binary classification accuracy (Acc-2) and F1 score, computed for both positive/negative and non-negative/negative classification results.
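These metrics can be computed as in the sketch below. One detail is an assumption for illustration: Acc-7 bins scores by rounding to the nearest integer in [-3, 3], and the binary metric shown is the positive/negative variant (zero-valued truths excluded):

```python
import numpy as np

def msa_metrics(y_pred, y_true):
    """MAE, Pearson correlation, 7-class accuracy (integer bins in [-3, 3]),
    and binary accuracy on the positive/negative subset (a sketch)."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    mae = np.abs(y_pred - y_true).mean()
    corr = np.corrcoef(y_pred, y_true)[0, 1]
    acc7 = (np.clip(np.round(y_pred), -3, 3)
            == np.clip(np.round(y_true), -3, 3)).mean()
    nonzero = y_true != 0                      # drop neutral truths
    acc2 = ((y_pred[nonzero] > 0) == (y_true[nonzero] > 0)).mean()
    return {"MAE": mae, "Corr": corr, "Acc-7": acc7, "Acc-2": acc2}

m = msa_metrics([2.1, -1.2, 0.4, -2.8], [2.0, -1.0, 0.0, -3.0])
assert m["MAE"] < 0.5 and m["Acc-2"] == 1.0
```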

Baselines
To inspect the relative performance of MMIM, we compare our model with many baselines. We consider purely learning-based models such as TFN (Zadeh et al., 2017), LMF (Liu et al., 2018), MFM (Tsai et al., 2019b), and MulT (Tsai et al., 2019a), as well as approaches involving feature space manipulation such as ICCN (Sun et al., 2020) and MISA (Hazarika et al., 2020). We also compare our model with more recent and competitive baselines, including the BERT-based MAG-BERT (Rahman et al., 2020) and Self-MM (Yu et al., 2021), which builds on multi-task learning and is the state-of-the-art method. Some of the baselines are available at https://github.com/declare-lab/multimodal-deep-learning. The baselines are listed below:

TFN (Zadeh et al., 2017): Tensor Fusion Network disentangles unimodal representations into tensors via a threefold Cartesian product, then computes the outer product of these tensors as the fusion result.
LMF (Liu et al., 2018): Low-rank Multimodal Fusion decomposes stacked high-order tensors into several low-rank factors and then performs efficient fusion based on these factors.
MFM (Tsai et al., 2019b): Multimodal Factorization Model couples an inference network and a generative network through intermediate modality-specific factors, to facilitate the fusion process with reconstruction and discrimination losses.
MulT (Tsai et al., 2019a): Multimodal Transformer constructs an architecture of unimodal and crossmodal transformer networks and completes the fusion process with attention.
ICCN (Sun et al., 2020): Interaction Canonical Correlation Network minimizes canonical loss between modality representation pairs to ameliorate fusion outcome.
MISA (Hazarika et al., 2020): Modality-Invariant and -Specific Representations projects features into two separate spaces with special constraints. Fusion is then performed on these features.

MAG-BERT (Rahman et al., 2020): Multimodal Adaptation Gate for BERT designs an alignment gate and inserts it into the vanilla BERT model to refine the fusion process.
SELF-MM (Yu et al., 2021): Self-supervised Multi-Task Learning assigns each modality a unimodal training task with automatically generated labels, which aims to adjust the gradient backpropagation.

Basic Settings and Results
Experimental Settings. We use unaligned raw data in all experiments, as in Yu et al. (2021). For visual and acoustic, we use COVAREP (Degottex et al., 2014) and P2FA (Yuan and Liberman, 2008), both widely used toolkits for feature extraction that have been regularly employed in prior work. We trained our model on a single RTX 2080Ti GPU and ran a grid search for the best set of hyper-parameters. The details are provided in the supplementary file.
Summary of the Results. In accordance with previous work, we ran our model five times under the same hyper-parameter settings and report the average results.

Ablation Study
To show the benefits of the proposed loss functions and the corresponding estimation methods in MMIM, we carried out a series of ablation experiments on CMU-MOSEI. The results under different ablation settings are categorized and listed in Table 4. First, we eliminate one or several MI loss terms, for both the inter-modality MI lower bound ($I_{BA}$) and the CPC loss ($\mathcal{L}_N^{z,m}$, where $m \in \{v, a, t\}$), from the total loss. We note a manifest performance degradation after removing part of the MI loss, and the results are even worse when removing all terms of one loss than when removing a single term, which shows the efficacy of our MI maximization framework. Besides, replacing the current optimization target pairs in inter-modality MI with a single pair or other pair combinations does not yield better results, which provides experimental evidence for the choice of candidate pairs at that level. Then we test the components for entropy estimation. We deactivate the history memory and estimate $\mu$ and $\Sigma$ in (5) using only the current batch. Surprisingly, the training process breaks down due to "NaN" gradients. Therefore, the history-based estimation has the additional advantage of guaranteeing training stability. Finally, we substitute the GMM with a single Gaussian whose $\mu$ and $\Sigma$ are estimated on all samples regardless of their polarity class. We observe a clear drop in all metrics, which implies that the GMM built on the natural classes leads to a more accurate estimation of the entropy terms.

Further Analysis
In this section, we dive into our model to explore how it functions in the MSA task. We first visualize all types of losses during training, and then analyze some representative cases.

Tracing the Losses
To better understand how the MI losses work, we visualize the variation of all losses during training in Figure 2. The values for plotting are the average losses over constant intervals of 20 steps. From the figure, we can see that throughout the training process, $\mathcal{L}_{task}$ and $\mathcal{L}_{CPC}$ keep decreasing nearly all the time, while $\mathcal{L}_{BA}$ goes down within each epoch except at its beginning. We also mark the time at which the best epoch ends, i.e., when the task loss on the validation set reaches its minimum. It is notable that $\mathcal{L}_{BA}$ and $\mathcal{L}_{CPC}$ reach a relatively low level at this time, while the task loss on the training set does not. This scenario reveals the crucial role that $\mathcal{L}_{BA}$ and $\mathcal{L}_{CPC}$ play in the training process: they offer supplementary unsupervised gradient rectification to the parameters along their respective back-propagation paths and counteract over-fitting on the task loss. Besides, because in our experimental settings $\alpha$ and $\beta$ are of the same order, and at the end of the best epoch $\mathcal{L}_{BA}$ reaches its lowest value, synchronized with the validation loss, while $\mathcal{L}_{CPC}$ does not, we conclude that $\mathcal{L}_{BA}$, i.e., MI maximization at the input (lower) level, has a more significant impact on the model's performance than $\mathcal{L}_{CPC}$, i.e., MI maximization at the fusion (higher) level.

Case Study
We display some predictions and truth values, as well as the corresponding raw inputs (for visual and acoustic we only describe them in words) and the three CPC scores, in Table 5. As described in Section 3.5, these scores imply how much the fusion results depend on each modality. Notably, the scores are above 0.35 in all cases, which demonstrates that the fusion results capture a certain amount of modality-invariant features. We also observe the different extents to which the fusion results depend on each modality. In case (A), visual provides the only clue to the true sentiment, and correspondingly $s_{zv}$ is higher than the other two scores. In case (B), the word "only" is a piece of additional evidence apart from what the visual modality exposes, and we find $s_{zt}$ reaches a higher level than in (A). For (C), acoustic and visual help infer a neutral sentiment, and thus $s_{zv}$ and $s_{za}$ are larger than $s_{zt}$. Therefore, we conclude that the model can intelligently adjust the amount of information that flows from each unimodal input into the fusion results, consistent with each modality's contribution to the final predictions. However, this mechanism may malfunction in cases like (D). The remark "I'm sorry" bewilders the model, while visual and acoustic offer no additional cues. In this circumstance, the model focuses on text and is misled into a wrong prediction in the opposite direction.

Conclusion
In this paper, we present MMIM, which hierarchically maximizes the mutual information (MI) in a multimodal fusion pipeline. The model applies two MI lower bounds, for the unimodal inputs and the fusion stage, respectively. To address the intractability of some terms in these lower bounds, we design precise, fast, and robust estimation methods that keep training stable while improving test performance. We then conduct comprehensive experiments on two datasets, followed by an ablation study, the results of which verify the efficacy of our model and the necessity of the MI maximization framework. We further visualize the losses and display some representative examples to provide deeper insight into our model. We believe this work can inspire future creativity in representation learning and multimodal sentiment analysis.