Improving Multimodal Fusion via Mutual Dependency Maximisation

Multimodal sentiment analysis is a trending area of research, and multimodal fusion is one of its most active topics. Acknowledging that humans communicate through a variety of channels (i.e. visual, acoustic, linguistic), multimodal systems aim at integrating different unimodal representations into a synthetic one. So far, a considerable effort has been made on developing complex architectures allowing the fusion of these modalities. However, such systems are mainly trained by minimising simple losses such as L_1 or cross-entropy. In this work, we investigate unexplored penalties and propose a set of new objectives that measure the dependency between modalities. We demonstrate that our new penalties lead to a consistent improvement (up to 4.3 accuracy points) across a large variety of state-of-the-art models on two well-known sentiment analysis datasets: CMU-MOSI and CMU-MOSEI. Our method not only achieves a new SOTA on both datasets but also produces representations that are more robust to modality drops. Finally, a by-product of our method is a statistical network which can be used to interpret the high-dimensional representations learnt by the model.


Introduction
Humans employ three different modalities to communicate in a coordinated manner: the language modality with the use of words and sentences, the vision modality with gestures, poses and facial expressions, and the acoustic modality through changes in vocal tones. Multimodal representation learning has shown great progress in a large variety of tasks including emotion recognition, sentiment analysis (Soleymani et al., 2017), speaker trait analysis (Park et al., 2014) and fine-grained opinion mining (Garcia et al., 2019a). Learning from different modalities is an efficient way to improve performance on the target tasks (Xu et al., 2013). Nevertheless, heterogeneities across modalities increase the difficulty of learning multimodal representations and raise specific challenges. Baltrušaitis et al. (2018) identify fusion as one of the five core challenges in multimodal representation learning, the four others being: representation, modality alignment, translation and co-learning. Fusion aims at integrating the different unimodal representations into one common synthetic representation. Effective fusion is still an open problem: the best multimodal models in sentiment analysis (Rahman et al., 2020) improve by less than 1.5% accuracy over their unimodal counterparts that rely on the text modality only. Additionally, fusion should not only improve accuracy but also make representations more robust to missing modalities. Multimodal fusion can be divided into early and late fusion techniques: early fusion takes place at the feature level (Ye et al., 2017), while late fusion takes place at the decision or scoring level (Khan et al., 2012). Current research in multimodal sentiment analysis mainly focuses on developing new fusion mechanisms relying on deep architectures (e.g. TFN, LFN (Liu et al., 2018), MARN (Zadeh et al., 2018b), MISA (Hazarika et al., 2020), MCTN (Pham et al., 2019), HFNN (Mai et al., 2019), ICCN (Sun et al., 2020)).
These models are evaluated on several multimodal sentiment analysis benchmarks such as IEMOCAP (Busso et al., 2008), MOSI (Wöllmer et al., 2013), MOSEI (Zadeh et al., 2018c) and POM (Garcia et al., 2019b; Park et al., 2014). The current state-of-the-art on these datasets uses architectures based on pre-trained transformers (Tsai et al., 2019; Siriwardhana et al., 2020), such as Multimodal BERT (MAGBERT) and Multimodal XLNET (MAGXLNET) (Rahman et al., 2020).
The aforementioned architectures are trained by minimising either an L_1 loss or a cross-entropy loss between the predictions and the ground-truth labels. To the best of our knowledge, few efforts have been dedicated to exploring alternative losses.
In this work, we propose a set of new objectives to improve over existing fusion mechanisms. These improvements are inspired by the InfoMax principle (Linsker, 1988), i.e. choosing the representation maximising the mutual information (MI) between two possibly overlapping views of the input. The MI quantifies the dependence of two random variables; contrary to correlation, MI also captures non-linear dependencies between the considered variables. Different from previous work, which mainly focuses on comparing two modalities, our learning problem involves multiple modalities (e.g. text, audio, video). Our proposed method, which induces no architectural changes, relies on jointly optimising the target loss with an additional penalty term measuring the mutual dependency between the different modalities.

Our Contributions
We study new objectives to build more performant and robust multimodal representations through an enhanced fusion mechanism, and evaluate them on multimodal sentiment analysis. Our method also allows us to explain the learnt high-dimensional multimodal embeddings. The paper's contributions can be summarised as follows:
A set of novel objectives using multivariate dependency measures. We introduce three new trainable surrogates to maximise the mutual dependencies between the three modalities (i.e. audio, language and video). We provide a general algorithm inspired by MINE (Belghazi et al., 2018), which was developed in a bi-variate setting for estimating the MI. Our new method enriches MINE by extending the procedure to a multivariate setting that allows us to maximise different Mutual Dependency Measures: the Total Correlation (Watanabe, 1960), the f-Total Correlation and the Multivariate Wasserstein Dependency Measure (Ozair et al., 2019).
Applications and numerical results. We apply our new set of objectives to five different architectures relying on LSTM cells (Huang et al., 2015) (e.g. EF-LSTM, LFN, MFN) or transformer layers (e.g. MAGBERT, MAG-XLNET). Our proposed method (1) brings a substantial improvement on two different multimodal sentiment analysis datasets (i.e. MOSI and MOSEI, sec. 5.1), (2) makes the encoder more robust to missing modalities (i.e. when predicting without language, audio or video, the observed performance drop is smaller, sec. 5.3), and (3) provides an explanation of the decisions taken by the neural architecture (sec. 5.4).

Problem formulation & related work
In this section, we formulate the problem of learning multimodal representations (sec. 2.1) and review both existing measures of mutual dependency (sec. 2.2) and estimation methods (sec. 2.3). In the rest of the paper, we focus on learning from three modalities (i.e. language, audio and video); however, our approach can be generalised to an arbitrary number of modalities.

Learning multimodal representations
A plethora of neural architectures have been proposed to learn multimodal representations for sentiment classification. Models often rely on a fusion mechanism (e.g. a multi-layer perceptron (Khan et al., 2012), tensor factorisation (Liu et al., 2018; Zadeh et al., 2019) or complex attention mechanisms (Zadeh et al., 2018a)) that is fed with modality-specific representations. The fusion problem boils down to learning a model M_f : X_a × X_v × X_l → R^d. M_f is fed with unimodal representations of the inputs X_{a,v,l} = (X_a, X_v, X_l) obtained through three embedding networks f_a, f_v and f_l. M_f has to retain both modality-specific interactions (i.e. interactions that involve only one modality) and cross-view interactions (i.e. more complex interactions that span across modalities). Overall, the learning of M_f involves both the minimisation of the downstream task loss and the maximisation of the mutual dependency between the different modalities.

Mutual dependency maximisation
Mutual information as mutual dependency measure: the core ideas we rely on to better learn cross-view interactions are not new. They consist of mutual information maximisation (Linsker, 1988) and deep representation learning. Thus, one of the most natural choices is to use the MI, which measures the dependence between two random variables, including high-order statistical dependencies (Kinney and Atwal, 2014). Given two random variables X and Y, the MI is defined by

I(X; Y) = ∫∫ p_XY(x, y) log [ p_XY(x, y) / (p_X(x) p_Y(y)) ] dx dy,   (1)

where p_XY is the joint probability density function (pdf) of the random variables (X, Y), and p_X, p_Y represent the marginal pdfs. MI can equivalently be defined through the KL divergence:

I(X; Y) = KL(p_XY || p_X ⊗ p_Y).   (2)

Extension of mutual dependency to different metrics: the KL divergence appears to be limited when used for estimating MI (McAllester and Stratos, 2020). A natural step is to replace the KL divergence in Eq. 2 with different divergences, such as the f-divergences, or with distances, such as the Wasserstein distance. Hence, we introduce new mutual dependency measures (MDM): the Wasserstein dependency measure (Ozair et al., 2019), denoted I_W, and the f-mutual information (Belghazi et al., 2018), denoted I_f. As previously, p_XY denotes the joint pdf, and p_X, p_Y denote the marginal pdfs. The new measures are defined as follows:

I_W(X; Y) = W(p_XY, p_X ⊗ p_Y),   (3)
I_f(X; Y) = D_f(p_XY || p_X ⊗ p_Y),   (4)

where W denotes the Wasserstein distance (Peyré et al., 2019) and D_f denotes any f-divergence.
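As a concrete reference point for the MI definition above, the bivariate Gaussian case admits a closed form: for two unit-variance Gaussians with correlation coefficient ρ, I(X; Y) = −½ log(1 − ρ²) nats. The following sketch (not from the paper, just a standard identity) makes the behaviour explicit:

```python
import math

def gaussian_mi(rho: float) -> float:
    """Closed-form MI (in nats) between two unit-variance Gaussian
    variables with correlation rho: I(X; Y) = -0.5 * log(1 - rho^2)."""
    return -0.5 * math.log(1.0 - rho ** 2)

# MI vanishes at rho = 0 (independence) and diverges as |rho| -> 1;
# unlike the correlation coefficient itself, it is symmetric in sign.
print(gaussian_mi(0.5))
print(gaussian_mi(0.9))
```

This is the same family of distributions used later to illustrate the neural dependency estimators in Fig. 1.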

Estimating mutual dependency measures
The computation of MI and other mutual dependency measures can be difficult without knowing the marginal and joint probability distributions; thus it is popular to maximise lower bounds to obtain better representations of different modalities, including image (Tian et al., 2019; Hjelm et al., 2018), audio (Dilpazir et al., 2016) and text (Kong et al., 2019) data. Several estimators have been proposed: MINE (Belghazi et al., 2018) uses the Donsker-Varadhan representation (Donsker and Varadhan, 1985) to derive a parametric lower bound; Nguyen et al. (2010, 2017) use a variational characterisation of f-divergences; other approaches rely on a multi-sample version of the density ratio, also known as noise contrastive estimation (Oord et al., 2018; Ozair et al., 2019). These methods have mostly been developed and studied in a bi-variate setting.
Illustration of neural dependency measures in a bivariate case. Fig. 1 shows the aforementioned dependency measures (i.e. Eq. 2, Eq. 4 and Eq. 3) when estimated with MINE (Belghazi et al., 2018) for multivariate Gaussian random variables X_a and X_b. The component-wise correlation for the considered multivariate Gaussian is defined as follows: corr(X_i, X_k) = δ_{i,k} ρ, where ρ ∈ (−1, 1) and δ_{i,k} is the Kronecker delta. We observe that the dependency measure based on the Wasserstein distance differs from those based on the divergences and will thus lead to different gradients. Although theoretical studies have been done on the use of different metrics for dependency estimation, it remains an open question which one is best suited. In this work, we provide an experimental response in a specific case.

Model and training objective
In this section, we introduce our new set of losses to improve fusion. In sec. 3.1, we first extend widely used bi-variate dependency measures to multivariate dependency measures (MDM) (James and Crutchfield, 2017). We then introduce variational bounds on the MDM and, in sec. 3.2, describe our method to minimise the proposed variational bounds.
Notations. We consider X_a, X_v, X_l as the multimodal data from the audio, video and language modalities respectively, with joint probability distribution p_{X_a X_v X_l}. We denote by p_{X_j} the marginal distribution of X_j, with j ∈ {a, v, l} corresponding to the j-th modality.
General loss. As previously mentioned, we rely on the InfoMax principle (Linsker, 1988) and aim at jointly maximising the MDM between the different modalities and minimising the task loss; hence, we are in a multi-task setting (Argyriou et al., 2007; Ruder, 2017) and the objective of interest can be defined as:

L_tot = L_down. + λ · L_MDM,   (5)

where L_down. represents a downstream-specific (target task) loss, i.e. a binary cross-entropy or an L_1 loss, λ is a meta-parameter and L_MDM is the multivariate dependency measure term (see sec. 3.2). Minimisation of our newly defined objectives requires deriving lower bounds on the L_MDM terms, and then obtaining trainable surrogates.
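The multi-task combination in Eq. 5 can be sketched in a few lines. This is a schematic only: the function name and the scalar inputs are illustrative, and L_MDM is taken to be minus the dependency estimate (so that minimising the total objective maximises the estimated dependency), as the practical-estimation paragraph of sec. 3.2 states:

```python
def total_loss(task_loss: float, mdm_estimate: float, lam: float = 0.1) -> float:
    """Schematic joint objective of Eq. 5: downstream loss plus
    lambda * L_MDM, where L_MDM = -(estimated mutual dependency)."""
    return task_loss + lam * (-mdm_estimate)

# A larger estimated dependency between modalities lowers the joint
# objective, pushing the encoder towards dependency-maximising fusion.
print(total_loss(1.0, mdm_estimate=2.0, lam=0.5))  # 0.0
```

In a real training loop `task_loss` would be the L_1 or cross-entropy term and `mdm_estimate` one of the neural surrogates of sec. 3.2.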

From bivariate to multivariate dependencies
In our setting, we aim at maximising cross-view interactions involving three modalities; thus we need to generalise bivariate dependency measures to multivariate dependency measures. Definition 3.1 (Multivariate Dependency Measures). Let X_a, X_v, X_l be a set of random variables with joint pdf p_{X_a X_v X_l} and respective marginal pdfs p_{X_j} with j ∈ {a, v, l}. Then we define the multivariate mutual information I_KL, also referred to as total correlation (Watanabe, 1960) or multi-information (Studenỳ and Vejnarová, 1998):

I_KL(X_a; X_v; X_l) = KL(p_{X_a X_v X_l} || p_{X_a} ⊗ p_{X_v} ⊗ p_{X_l}).
Similarly, for any f-divergence we define the multivariate f-mutual information I_f as:

I_f(X_a; X_v; X_l) = D_f(p_{X_a X_v X_l} || p_{X_a} ⊗ p_{X_v} ⊗ p_{X_l}).

Finally, we also extend Eq. 3 to obtain the multivariate Wasserstein dependency measure I_W:

I_W(X_a; X_v; X_l) = W(p_{X_a X_v X_l}, p_{X_a} ⊗ p_{X_v} ⊗ p_{X_l}),

where W denotes the Wasserstein distance.

From theoretical bounds to trainable surrogates
To train our neural architecture we need to estimate the previously defined multivariate dependency measures. We rely on the neural estimators given in Th. 1.
Theorem 1 (Multivariate Neural Dependency Measures). Let T_θ : X_a × X_v × X_l → R be a family of functions parametrised by a deep neural network with learnable parameters θ ∈ Θ. The neural multivariate mutual information measure I_KL is defined as:

I_KL = sup_{θ∈Θ} E_{p_{X_a X_v X_l}}[T_θ] − log E_{p_{X_a} ⊗ p_{X_v} ⊗ p_{X_l}}[exp(T_θ)].   (6)

The neural multivariate f-mutual information measure I_f is defined as follows:

I_f = sup_{θ∈Θ} E_{p_{X_a X_v X_l}}[T_θ] − E_{p_{X_a} ⊗ p_{X_v} ⊗ p_{X_l}}[f*(T_θ)],   (7)

where f* denotes the convex conjugate of f. The neural multivariate Wasserstein dependency measure I_W is defined as follows:

I_W = sup_{θ∈Θ, T_θ∈L} E_{p_{X_a X_v X_l}}[T_θ] − E_{p_{X_a} ⊗ p_{X_v} ⊗ p_{X_l}}[T_θ],   (8)

where L is the set of all 1-Lipschitz functions from R^d to R.
Sketch of proofs. Eq. 6 is a direct application of the Donsker-Varadhan representation of the KL divergence (we assume that the integrability constraints are satisfied). Eq. 7 comes from the work of Nguyen et al. (2017). Eq. 8 comes from the Kantorovich-Rubinstein duality; we refer the reader to (Villani, 2008; Peyré et al., 2019) for a rigorous and exhaustive treatment.
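Given critic scores on joint samples and on product-of-marginals samples, the surrogates of Eq. 6 and Eq. 8 reduce to simple batch statistics. The following sketch (illustrative names, assuming the Lipschitz constraint of Eq. 8 is enforced elsewhere, e.g. by weight clipping or gradient penalty) shows the two empirical bounds:

```python
import numpy as np

def dv_bound(t_joint: np.ndarray, t_product: np.ndarray) -> float:
    """Empirical Donsker-Varadhan surrogate of Eq. 6:
    mean of critic scores on joint samples minus the log of the
    mean exponentiated scores on shuffled (product) samples."""
    return float(t_joint.mean() - np.log(np.exp(t_product).mean()))

def wasserstein_bound(t_joint: np.ndarray, t_product: np.ndarray) -> float:
    """Empirical Kantorovich-Rubinstein surrogate of Eq. 8; valid as a
    dependency estimate only when the critic is 1-Lipschitz."""
    return float(t_joint.mean() - t_product.mean())
```

Maximising either quantity over the critic parameters yields the trainable dependency estimates used as L_KL and L_W.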
Practical estimate of the variational bounds.
The empirical estimators that we derive from Th. 1 can be used in a practical way: the expectations in Eq. 6, Eq. 7 and Eq. 8 are estimated using empirical samples from the joint distribution p_{X_a X_v X_l}. The empirical samples from p_{X_a} ⊗ p_{X_v} ⊗ p_{X_l} are obtained by shuffling the samples from the joint distribution within a batch. We integrate this into the minimisation of the multi-task objective (5) by taking L_MDM to be minus the estimator. We refer to the losses obtained with the penalties based on the estimators described in Eq. 6, Eq. 7 and Eq. 8 as L_KL, L_f and L_W respectively. Details on the practical minimisation of our variational bounds are provided in Algorithm 1.
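The shuffling step that manufactures samples from the product of marginals can be sketched as follows (a minimal illustration; the function name and toy batch are not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffle_negatives(xa, xv, xl):
    """Approximate samples from p_Xa * p_Xv * p_Xl by permuting each
    modality of a batch independently: the cross-modal pairing is
    destroyed while every marginal distribution is left intact."""
    n = len(xa)
    return (xa[rng.permutation(n)],
            xv[rng.permutation(n)],
            xl[rng.permutation(n)])

# Toy batch of 4 paired samples, one feature per modality.
batch_a = np.arange(4.0)[:, None]
batch_v = batch_a + 10.0
batch_l = batch_a + 20.0
neg_a, neg_v, neg_l = shuffle_negatives(batch_a, batch_v, batch_l)
```

Feeding the critic the original batch and this shuffled batch gives the two score sets needed by the empirical bounds.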
Remark. In this work we choose to generalise MINE to compute multivariate dependencies. Comparing our proposed algorithm to the other alternatives mentioned in sec. 2 is left for future work. This choice is driven by two main reasons: (1) our framework allows the use of various types of contrast measures (e.g. Wasserstein distance, f-divergences); (2) the critic network T_θ can be used for interpretability purposes, as shown in sec. 5.4.

Experimental setting
In this section, we present our experimental settings including the neural architectures we compare, the datasets, the metrics and our methodology, which includes the hyper-parameter selection.
Algorithm 1 Two-stage procedure to minimise multivariate dependency measures.
INPUT: dataset D, three permutations of [m] (one per modality), θ_c the weights of the deep classifier, θ the weights of the statistical network T_θ.
Initialization: parameters θ and θ_c.
while not converged do
  for each critic update do
    Sample a batch B from D.
    Build Negative Dataset: permute the modalities of B independently to obtain B̃.
    Update θ based on the empirical version of Eq. 6, Eq. 7 or Eq. 8.
  end for
  Sample a batch B from D.
  Update θ_c with B using Eq. 5.
end while
OUTPUT: classifier weights θ_c
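The two-stage structure of Algorithm 1 can be sketched schematically; `critic_step` and `model_step` below are illustrative stand-ins for the actual gradient updates of θ (Eq. 6/7/8) and θ_c (Eq. 5), and the counter is only there to make the control flow observable:

```python
def two_stage_train(batches, critic_step, model_step, unroll: int = 10):
    """Schematic Algorithm 1: for each batch, take `unroll` updates of
    the statistic network T_theta, then one classifier update."""
    for batch in batches:
        for _ in range(unroll):
            critic_step(batch)   # update theta (dependency estimator)
        model_step(batch)        # update theta_c (joint loss, Eq. 5)

counts = {"critic": 0, "model": 0}
two_stage_train(
    batches=[0, 1, 2],
    critic_step=lambda b: counts.update(critic=counts["critic"] + 1),
    model_step=lambda b: counts.update(model=counts["model"] + 1),
)
print(counts)  # {'critic': 30, 'model': 3}
```

The inner-loop count corresponds to the unroll hyperparameter mentioned in the appendix.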

Datasets
We empirically evaluate our methods on two English datasets: CMU-MOSI and CMU-MOSEI. Both datasets have been frequently used to assess model performance in human multimodal sentiment and emotion recognition. CMU-MOSI: the Multimodal Opinion Sentiment Intensity dataset (Wöllmer et al., 2013) is a sentiment-annotated dataset gathering 2,199 short monologue video clips. CMU-MOSEI: the CMU Multimodal Opinion Sentiment and Emotion Intensity dataset (Zadeh et al., 2018c) is an emotion- and sentiment-annotated corpus consisting of 23,454 movie review videos taken from YouTube. Both CMU-MOSI and CMU-MOSEI are labelled by humans with a sentiment score in [−3, 3]. For each dataset, three modalities are available; we follow prior work (Zadeh et al., 2018b; Rahman et al., 2020) for feature extraction. Audio: audio features are extracted using COVAREP (Degottex et al., 2014). This results in a vector of dimension 74, which includes 12 Mel-frequency cepstral coefficients (MFCCs), as well as pitch tracking and voiced/unvoiced segmenting features, peak slope parameters, maxima dispersion quotients and glottal source parameters. Video and audio are aligned to the text following the forced alignment described in (Yuan and Liberman, 2008).

Evaluation metrics
Multimodal opinion sentiment intensity prediction is treated as a regression problem. Thus, we report both the mean absolute error (MAE) and the correlation of model predictions with true labels. In the literature, the regression task is also turned into a binary classification task for polarity prediction. We follow standard practice (Rahman et al., 2020) and report the accuracy of our best performing models (Acc_7 denotes the 7-class accuracy and Acc_2 the binary accuracy).
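The three reported metrics are straightforward to compute from the regression outputs. This sketch is an assumption-laden simplification: here Acc_2 simply thresholds scores at zero, whereas published evaluation scripts sometimes handle exactly-neutral labels separately:

```python
import numpy as np

def evaluate(preds: np.ndarray, labels: np.ndarray) -> dict:
    """MAE, Pearson correlation and binary polarity accuracy for
    sentiment scores in [-3, 3]; polarity is the sign of the score."""
    return {
        "MAE": float(np.abs(preds - labels).mean()),
        "Corr": float(np.corrcoef(preds, labels)[0, 1]),
        "Acc_2": float(((preds > 0) == (labels > 0)).mean()),
    }

scores = evaluate(np.array([1.2, -0.5, 2.8, -2.0]),
                  np.array([1.0, -1.0, 3.0, 1.0]))
```

A 7-class accuracy would additionally bucket the scores into the integers of [−3, 3] before comparing.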

Neural architectures
In our experiments, we modify the loss function of several models that have been introduced for multimodal sentiment analysis. To assess the validity of the proposed losses, we also apply our method to a simple early-fusion LSTM (EF-LSTM) as a baseline model.
Model overview: the aforementioned models can be seen as a multimodal encoder f_θe providing a representation Z_avl containing the information and dependencies between the modalities X_l, X_a, X_v, namely:

Z_avl = f_θe(X_a, X_v, X_l).

As a final step, a linear transformation A_θp is applied to Z_avl to perform the regression.
EF-LSTM: the most basic architecture used in current multimodal analysis, where each sequence view is encoded separately with an LSTM channel. Then, a fusion function is applied to all representations.
TFN: computes a representation of each view, and then applies a fusion operator. Acoustic and visual views are first mean-pooled, then encoded through a 2-layer perceptron. Linguistic features are computed with an LSTM channel. Here, the fusion function is a cross-modal product capturing unimodal, bimodal and trimodal interactions across modalities.
MFN: enriches the previous EF-LSTM architecture with an attention module that computes a cross-view representation at each time step. These are then gathered, and a final representation is computed by a gated multi-view memory (Zadeh et al., 2018a).
MAG-BERT and MAG-XLNET: based on pre-trained transformer architectures (Devlin et al., 2018; Yang et al., 2019), allowing the input of each transformer unit to be multimodal thanks to a special gate inspired by Wang et al. (2018). Here Z_avl is the [CLS] representation provided by the last transformer head.
For each architecture, we use the optimal architecture hyperparameters provided by the associated papers (see sec. 8).
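The shared encoder abstraction above can be illustrated with the simplest case, early fusion by concatenation. This is a schematic stand-in for f_θe, not any model's actual implementation; only the audio dimension (74) is taken from the paper, the visual and linguistic dimensions and the random projection are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def early_fusion(xa, xv, xl, proj):
    """Schematic early fusion: concatenate per-modality feature vectors,
    then apply a learned projection to obtain the joint Z_avl."""
    z = np.concatenate([xa, xv, xl], axis=-1)
    return z @ proj

# Batch of 8 samples; 74 audio dims (COVAREP), assumed 47 visual and
# 300 linguistic dims, projected to a 64-dimensional representation.
xa = rng.normal(size=(8, 74))
xv = rng.normal(size=(8, 47))
xl = rng.normal(size=(8, 300))
proj = rng.normal(size=(74 + 47 + 300, 64))
z_avl = early_fusion(xa, xv, xl, proj)
```

The more elaborate models (TFN, MFN, MAG variants) replace the concatenation-plus-projection with their respective fusion operators while keeping the same input/output contract.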

Numerical results
We present and discuss here the results obtained using the experimental setting described in sec. 4. To better understand the impact of our new methods, we investigate the following points:
Efficiency of the L_MDM: to gain understanding of the usefulness of our new objectives, we study the impact of adding the mutual dependency term to the basic multimodal neural model EF-LSTM.
Improving model performance and comparing multivariate dependency measures: the choice of the most suitable dependency measure for a given task is still an open problem (see sec. 3). Thus, we compare the performance, on both multimodal sentiment and emotion prediction tasks, of the different dependency measures. The compared measures are combined with different models using various fusion mechanisms.
Improving the robustness to modality drop: a desirable quality of multimodal representations is robustness to a missing modality. We study how the maximisation of mutual dependency measures during training affects the robustness of the representation when a modality goes missing.
Towards explainable representations: the statistical network T_θ allows us to compute a dependency measure between the three considered modalities. We carry out a qualitative analysis in order to investigate whether a high dependency can be explained by complementariness across modalities.

Efficiency of the MDM penalty
For a simple EF-LSTM, we study the improvement induced by the addition of our MDM penalty. The results are presented in Tab. 1, where an EF-LSTM trained with no mutual dependency term is denoted by L_∅. On both studied datasets, we observe that the addition of an MDM penalty leads to stronger performance on all metrics, and that the best performing models are obtained by training with an additional mutual dependency measure term. Keeping in mind the example shown in Fig. 1, we can draw a first comparison between the different dependency measures. Although in a simple case L_f and L_KL estimate a similar quantity (see Fig. 1), in more complex practical applications they do not achieve the same performance. Even though the Donsker-Varadhan bound used for L_KL is stronger than the one used to estimate L_f, for a simple model the stronger bound does not lead to better results. It is possible that most of the observed differences in performance come from the optimisation process during training. Takeaways: in the simple case of the EF-LSTM, adding the MDM penalty improves performance on the downstream tasks.

Improving models and comparing multivariate dependency measures
In this experiment, we apply the different penalties to more advanced architectures using various fusion mechanisms.
General analysis. Tab. 2 shows the performance of various neural architectures trained with and without the MDM penalty. Results are coherent with the previous experiment: we observe that jointly maximising a mutual dependency measure leads to better results on the downstream task. For example, an MFN trained on CMU-MOSI with L_W outperforms the model trained without the mutual dependency term by 4.6 points on Acc_7. On CMU-MOSEI we also obtain consistent improvements when training with the MDM penalty. On CMU-MOSI the TFN also strongly benefits from the mutual dependency term, with an absolute improvement of 3.7% (on Acc_7) with L_W compared to L_∅. Tab. 2 shows that our methods not only perform well on recurrent architectures but also on pre-trained transformer-based models, which achieve higher results due to a superior capacity to model contextual dependencies (see (Rahman et al., 2020)).
Improving state-of-the-art models. MAGBERT and MAGXLNET are state-of-the-art models on both CMU-MOSI and CMU-MOSEI. From Tab. 2, we observe that our methods can improve the performance of both models. It is worth noting that, in both cases, L_W combined with pre-trained transformers achieves good results. This performance gain suggests that our method is able to capture dependencies that are not learnt during pretraining of the language model (i.e. BERT or XLNET).
Takeaways: the addition of the MDM penalty not only benefits simple models (e.g. EF-LSTM) but also improves performance when combined with both complex fusion mechanisms and pre-trained models. For practical applications, the Wasserstein distance is a good choice of contrast function.

Improved robustness to modality drop
Although fusion with visual and acoustic modalities provides a performance improvement (Wang et al., 2018), the performance of multimodal systems on sentiment prediction tasks is mainly carried by the linguistic modality (Zadeh et al., 2018a). Thus, it is interesting to study how a multimodal system behaves when the text modality is missing, because it gives insights into the robustness of the representation.
Experiment description. In this experiment, we focus on MAGBERT and MAGXLNET since they are the best performing models (because of space constraints, results corresponding to MAGXLNET are reported in sec. 8). As before, the considered models are trained using the losses described in sec. 3, and all modalities are kept during training time. During inference, we either keep only one modality (audio or video) or both; the text modality is always dropped.
Results. Results of the experiments conducted on CMU-MOSI are shown in Fig. 2, giving values for the ratio Acc_2^corrupt / Acc_2, where Acc_2^corrupt is the binary accuracy in the corrupted configuration and Acc_2 the accuracy obtained when all modalities are considered. We observe that models trained with an MDM penalty (either L_KL, L_f or L_W) resist better to missing modalities than those trained with L_∅. For example, when trained with L_KL or L_f, the drop in performance is limited to ≈ 25% in any setting. Interestingly, for MAGBERT, L_W and L_KL achieve comparable results; L_KL is more resistant to dropping the language modality and thus could be preferred in practical applications.
Takeaway: maximising the MDM allows an information transfer between modalities.

Table 3: Examples from the CMU-MOSI dataset using MAGBERT. The last column is computed using the statistic network T_θ; L stands for low values and H stands for high values.
Spoken Transcripts | Acoustic and visual behaviour | T_θ
um the story was all right | low energy monotonous voice + headshake | L
i mean its a Nicholas Sparks book it must be good | disappointed tone + neutral facial expression | L
the action is fucking awesome | head nod + excited voice | H
it was cute you know the actors did a great job bringing the smurfs to life such as joe george lopez neil patrick harris katy perry and a fourth | multiple smiles | H
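The modality-drop evaluation above can be mimicked with a small helper. How missing modalities are imputed at inference is an assumption of this sketch (zeros here), not a detail taken from the paper:

```python
import numpy as np

def drop_modalities(xa, xv, xl, keep=("a", "v")):
    """Corrupted-inference setting (schematic): modalities not listed
    in `keep` are replaced by zero vectors of the same shape."""
    xa = xa if "a" in keep else np.zeros_like(xa)
    xv = xv if "v" in keep else np.zeros_like(xv)
    xl = xl if "l" in keep else np.zeros_like(xl)
    return xa, xv, xl

# Text is always dropped in the experiment; keep audio and video.
a, v, l = np.ones(3), np.ones(3), np.ones(3)
ca, cv, cl = drop_modalities(a, v, l, keep=("a", "v"))
```

Running the trained encoder on such corrupted inputs and comparing Acc_2 with the uncorrupted run yields the ratio plotted in Fig. 2.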

Towards explainable representations
In this section, we propose a qualitative experiment allowing us to interpret the predictions made by the deep neural classifier. During training, T_θ estimates the mutual dependency measure using the surrogates introduced in Th. 1. However, the inference process only involves the classifier, and T_θ is unused. Eq. 6, Eq. 7 and Eq. 8 show that T_θ is trained to discriminate between valid representations (coming from the joint distribution) and corrupted representations (coming from the product of the marginals). Thus, T_θ can be used, at inference time, to measure the mutual dependency of the representations used by the neural model. In Tab. 3 we report examples of low and high dependency measures for MAGBERT on CMU-MOSI. We can observe that high values correspond to video clips where audio, text and video are complementary (e.g. use of a head nod (McClave, 2000)) and low values correspond to cases where there exist contradictions across several modalities. Results on MAGXLNET can be found in sec. 8.3. Takeaways: T_θ, used to estimate the MDM, provides a means to interpret the representations learnt by the encoder.
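Using the trained statistic network as an interpretability probe amounts to scoring each sample's modality triple and ranking. The sketch below is illustrative: `critic` is any callable standing in for T_θ, and the toy agreement-based critic is an assumption, not the paper's network:

```python
import numpy as np

def rank_by_dependency(critic, za, zv, zl):
    """Score each (audio, video, language) triple with the trained
    statistic network; return indices sorted from lowest to highest
    estimated mutual dependency, together with the raw scores."""
    scores = np.array([critic(a, v, l) for a, v, l in zip(za, zv, zl)])
    return np.argsort(scores), scores

# Toy critic: pairwise agreement between modality vectors raises the score.
toy_critic = lambda a, v, l: float(a @ v + v @ l + a @ l)
za = np.array([[1.0, 0.0], [1.0, 1.0]])
zv = np.array([[0.0, 1.0], [1.0, 1.0]])
zl = np.array([[-1.0, 0.0], [1.0, 1.0]])
order, scores = rank_by_dependency(toy_critic, za, zv, zl)
```

Samples at the low end of the ranking correspond to the "L" rows of Tab. 3 (contradicting modalities) and those at the high end to the "H" rows (complementary modalities).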

Conclusions
In this paper, we introduced three new losses based on MDM. Through an extensive set of experiments on CMU-MOSI and CMU-MOSEI, we have shown that SOTA architectures can benefit from these innovations with little modification. A by-product of our method is a statistical network that is a useful tool to explain the learnt high-dimensional multimodal representations. This work paves the way for using and developing new alternative methods to improve the learning (e.g. new estimators of mutual information (Colombo et al., 2021a), Wasserstein barycenters (Colombo et al., 2021b), data depths (Staerman et al., 2021), extreme value theory (Jalalzai et al., 2020)). A future line of research involves using these methods for emotion (Colombo et al., 2019; Witon et al., 2018) and dialogue act (Chapuis et al., 2020a, 2021) classification with pretrained models tailored for spoken language (Dinkar et al., 2020).

Acknowledgments
The research carried out in this paper has received funding from IBM, the French National Research Agency's grant ANR-17-MAOI and the DSAIDIS chair at Telecom-Paris. This work was also granted access to the HPC resources of IDRIS under the allocation 2021-AP010611665, as well as under the project 2021-101838 made by GENCI.

Appendix

In this section, we present a comprehensive illustration of Algorithm 1 and state the details of the experimental hyperparameter selection, as well as the architectures used for the statistic network T_θ.
8.1.1 Illustration of Algorithm 1
Fig. 3 describes Algorithm 1. As can be seen in the figure, to compute the mutual dependency measure the statistic network T_θ takes the embeddings of the two different batches B and B̃.
Figure 3: Illustration of the method described in Algorithm 1 for the different estimators derived from Th. 1. B and B̃ stand for the batches of data sampled from the joint probability distribution and from the product of the marginal distributions, respectively. Z_avl denotes the fused representation of the linguistic, acoustic and visual (resp. l, a and v) modalities provided by the multimodal architecture f_θe for the batch B. Z_lav denotes the same quantity for the batch B̃. A_θp denotes the linear projection applied before classification or regression.

Hyperparameters selection
We use dropout (Srivastava et al., 2014) and optimise the global loss of Eq. 5 by gradient descent using the AdamW optimiser (Loshchilov and Hutter, 2017; Kingma and Ba, 2014). The best learning rate is found in the grid {0.002, 0.001, 0.0005, 0.0001}. The best model is selected using the lowest MAE on the validation set. We set the unroll parameter of Algorithm 1 to 10.

Architectures of T θ
Across the different experiments we use a statistic network with the architecture described in Tab. 4. We follow (Belghazi et al., 2018) and use LeakyReLU (Agarap, 2018; Xu et al., 2015) as the activation function.

As in Fig. 2, we observe representations that are more robust to modality drop when jointly maximising L_W and L_KL with the target loss. Fig. 4 shows no improvement when training with L_f; this can be linked to Tab. 2, which similarly shows no improvement in this very specific configuration. In Fig. 4, the ratio between the accuracy Acc_2^corrupt achieved with a corrupted linguistic modality and the accuracy Acc_2 without any corruption is reported on the y-axis. The modalities preserved during inference are reported on the x-axis; A and V respectively stand for the acoustic and visual modalities.

Additional qualitative examples
Tab. 5 illustrates the use of T_θ to explain the representations learnt by the model. Similarly to Tab. 3, we observe that high values correspond to complementarity across modalities and low values to contradictions between them.

Spoken Transcripts | Acoustic and visual behaviour | T_θ
but the m the script is corny | high energy voice + headshake + (many) smiles | L
as for gi joe was it was just like laughing its the the plot the the acting is terrible | high energy voice + laughs + smiles | L
but i think this one did beat scream 2 now | headshake + long sigh | L
the xxx sequence is really well done | static head + low energy monotonous voice | L
you know of course i was waiting for the princess and the frog | smiles + high energy voice + high pitch | H
dennis quaid i think had a lot of fun | smiles + high energy voice | H
it was very very very boring | low energy voice + frown eyebrows | H
i do not wanna see any more of this | angry voice + angry facial expression | H
Table 5: Examples from the CMU-MOSI dataset using MAGXLNET trained with L_W. The last column is computed using the statistic network T_θ. L stands for low values and H stands for high values. Green, grey and red highlight positive, neutral and negative expressions/behaviours respectively.