Adaptive Fusion Techniques for Multimodal Data

Effective fusion of data from multiple modalities, such as video, speech, and text, is challenging due to the heterogeneous nature of multimodal data. In this paper, we propose adaptive fusion techniques that aim to model context from different modalities effectively. Instead of defining a deterministic fusion operation, such as concatenation, for the network, we let the network decide “how” to combine a given set of multimodal features more effectively. We propose two networks: 1) Auto-Fusion, which learns to compress information from different modalities while preserving the context, and 2) GAN-Fusion, which regularizes the learned latent space given context from complementing modalities. A quantitative evaluation on the tasks of multimodal machine translation and emotion recognition suggests that our lightweight, adaptive networks can better model context from other modalities than existing methods, many of which employ massive transformer-based networks.


Introduction
Multimodal deep learning is an active field of research, where for a single event, one has information across multiple modalities, such as video, speech, and text. Human brains can easily and perpetually perceive the context of an event from such heterogeneous data; however, it is not a trivial task for a computer system. In order for the machine to gain a contextual understanding, heterogeneous inputs must be combined first. Combining, or more precisely, fusing multimodal inputs is, thus, a vital step for any multimodal task. Naturally, a better fusion method will help a multimodal system learn better, ultimately enhancing its performance for a given task.
The most common fusion technique used in the literature involves the concatenation of representations from all the available modalities. However, this results in a shallow network (Ngiam et al., 2011), and the network focuses more on learning intra-modal features, ignoring inter-modal dynamics altogether. Later, Zadeh et al. (2017) proposed the Tensor Fusion Network (TFN), which models the unimodal, bimodal, and trimodal interactions using a 3-fold Cartesian product. TFN performs better than simple concatenation; however, it imposes high computational requirements since it projects all the information from the input modalities to a dense 3-D space as-is, without any prior information extraction. The computational overhead grows exponentially with respect to the dimensionality of the unimodal features. Liu et al. (2018) proposed a low-rank multimodal fusion technique (LMF) to address this problem. Such fusion techniques are useful but often result in a complex architecture. Moreover, the fusion methods mentioned above focus only on combining individual unimodal features rather than combining and extracting useful information simultaneously. This means that the final predictor module (the decoder in a Seq2Seq network (Sutskever et al., 2014), for example) bears the additional responsibility of identifying useful signals to focus on. This paper addresses these issues by proposing adaptive fusion techniques that allow the model to decide "how" to combine multimodal data more effectively for an event. The first technique, Auto-Fusion, learns to compress multimodal information while preserving as much meaning as possible. The second technique, GAN-Fusion, employs an adversarial network that regularizes the learned latent space for a given target modality in compliance with the information presented by the complementary modalities. Since our models are generic, the need to specify a pre-determined fusion operation such as a Cartesian product is alleviated, which further incentivizes the network to model multimodal interactions by itself. Moreover, our techniques are lightweight relative to existing heavier counterparts (Vaswani et al., 2017; Grönroos et al., 2018), thereby preventing unnecessary computational load.
We evaluate our models on three benchmark datasets: 1. the How2 dataset (Sanabria et al., 2018) with multimodal input for English-Portuguese translation, 2. the Multi30K dataset (Elliott et al., 2016) for English-French translation, and 3. the IEMOCAP dataset (Busso et al., 2008) for multimodal emotion recognition.
A quantitative evaluation shows that our models outperform the existing state-of-the-art methods in terms of BLEU scores (Papineni et al., 2002) for machine translation and Precision, Recall, and F1-score for emotion recognition. Our ablation studies also indicate that the learned multimodal representations are robust; they perform reasonably well even after removing information from a target modality. We now summarize our main contributions as follows: 1. We propose two lightweight, adaptive techniques for better multimodal fusion of data: Auto-Fusion and GAN-Fusion.
2. We propose a multi-task framework for end-to-end training of multimodal networks (for both classification and generation).
The rest of the paper is structured as follows: Section 2 covers relevant work, Section 3 discusses the proposed methodologies and overall architecture, Section 4 describes the experimental setup, Section 5 shows results, and Section 6 contains our concluding remarks.

Related Work
In this section, we briefly review previous work related to our task. Most earlier works in multimodal learning focus on traditional shallow classifiers such as support vector machines (Cortes and Vapnik, 1995) and Naive Bayes classifiers (Morade and Patnaik, 2015) to exploit bimodal data. Inspired by the success of deep learning over the last decade across multiple tasks, Ngiam et al. (2011) train end-to-end deep networks to reconstruct missing modalities during inference. They demonstrate that better features for one modality can be learned if relevant data from other modalities is available at training time; however, they employ simple concatenation for fusion. Hence, the learned joint representation is shallow and is not guaranteed to model inter-modal interactions. Their findings were later verified by Srivastava and Salakhutdinov (2012), who use a Deep Boltzmann Machine (Salakhutdinov and Hinton, 2009) to generate data from the image and text modalities. Huang et al. (2018) construct a multilingual common semantic space to achieve better machine translation performance by extending correlation networks (Chandar et al., 2016). They use multiple non-linear transformations to repeatedly reconstruct sentences from one language to another and finally build a common semantic space for all the different languages. To address the shallowness exhibited by some earlier fusion methods, techniques such as TFN (Zadeh et al., 2017), LMF (Liu et al., 2018), and T2FN (Liang et al., 2019) were proposed that aim to capture both intra- and inter-modal dynamics simultaneously; however, the problem of effectively modelling context in multimodal samples remains unsolved.
More recently, the Multimodal Transformer (MulT) (Tsai et al., 2019) was proposed to align data from different modalities implicitly. At a high level, MulT leverages cross-modal attention modules for each modality, each of which is responsible for aligning (or attending to) the target modality vector with those of the complementary modalities. It also imposes substantial computational overhead due to the use of transformer networks (Vaswani et al., 2017). Our methods, as discussed in detail in Section 3, use much simpler components. For instance, we use at most one attention module, compared to multiple self-attention heads in a transformer. Variational Mixture-of-Experts Autoencoders (Shi et al., 2019), a class of deep generative multimodal frameworks, have been employed to learn a synergic shared representation for multiple modalities; however, scaling such a model to all the modalities (video, speech, and text) simultaneously and to a more complex task such as multimodal machine translation is currently unexplored.

Figure 1: (a) Auto-Fusion module: Assuming that $z^{d_1}_{m_1}$, $z^{d_2}_{m_2}$, and $z^{d_3}_{m_3}$ represent the video, speech, and text latent vectors, respectively, we first concatenate them to obtain $z^k_m$. It is then passed through $T$, which outputs the "autofused" vector $z^t_m$. We then obtain the reconstructed concatenated vector $\hat{z}^k_m$ by passing the autofused vector through $F_c$, another transformation layer. Finally, we optimize the loss between $\hat{z}^k_m$ and $z^k_m$. (b) GAN-Fusion module for the text modality: Assuming that $z_s$, $z_v$, and $z_t$ are the latent speech, video, and text vectors, respectively, we first autofuse $z_s$ and $z_v$ to give $z_{tr}$. Simultaneously, we pass $z_t$ through the generator $G$, along with some noise, to get $z_g$. The generator loss tries to match $z_{tr}$ and $z_g$, and the discriminator $D$ tries to distinguish between $z_{tr}$ and $z_g$, the two sources of its input.

Proposed methods
This section discusses the proposed methodologies for effectively fusing inputs from multiple modalities and describes the overall architecture of our models for classification and generation. Most fusion techniques proposed in the literature, such as concatenation and TFN, involve a deterministic operation for constructing the joint multimodal representation. For instance, in TFN, the 3-fold Cartesian product of unimodal features is used for prediction. The method focuses more on learning rich unimodal features; there is no such "learning" procedure for the joint representation, which is simply constructed by combining unimodal features in a specific fashion (here, by the Cartesian product). In this paper, we refer to such techniques as static fusion techniques. Since there is no particular learning procedure for the joint representation, it becomes challenging for the final predictor module to model the complex dynamics of multimodal features. In other words, the model is unable to utilize multimodal information effectively. On the other hand, fusion methods such as LMF and MulT are adaptive because they involve a cognitive feature-processing step to construct the joint representation. In LMF, it is the decomposition module, and in MulT, it is the final feed-forward fusion mechanism. We refer the reader to Liu et al. (2018) and Tsai et al. (2019) for a more detailed explanation of these models.
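To make the distinction concrete, the short sketch below (not from the paper; tensor sizes are arbitrary assumptions) contrasts two static fusion operations, plain concatenation and a TFN-style outer product, in PyTorch:

```python
import torch

z_v = torch.randn(8, 32)   # video latent (batch of 8); sizes are arbitrary
z_s = torch.randn(8, 16)   # speech latent
z_t = torch.randn(8, 64)   # text latent

# 1) Concatenation: a fixed, parameter-free fusion rule.
z_concat = torch.cat([z_v, z_s, z_t], dim=-1)            # shape (8, 112)

# 2) TFN-style fusion: append a constant 1 to each vector and take the
#    3-fold outer product so that unimodal, bimodal, and trimodal terms coexist.
ones = torch.ones(8, 1)
zv1, zs1, zt1 = (torch.cat([z, ones], dim=-1) for z in (z_v, z_s, z_t))
z_tfn = torch.einsum('bi,bj,bk->bijk', zv1, zs1, zt1)     # shape (8, 33, 17, 65)
z_tfn = z_tfn.reshape(8, -1)                              # ~36K dims per sample

# In both cases the joint representation is produced by a fixed rule; the
# burden of extracting useful inter-modal signals falls on the final predictor.
```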
Our fusion methods involve the concatenation of unimodal embeddings as an initial step. To avoid any conflicts with past works, we will only consider steps after concatenation as a part of our fusion method because we do not use the concatenated vector for final prediction; it is only a preliminary step. Therefore, in order to mitigate the "staticness" of existing fusion methods, we propose two adaptive yet simple fusion techniques, Auto-Fusion and GAN-Fusion. They aim to effectively combine multimodal inputs and mitigate the problem of shallowness and computational overhead exhibited by prior fusion techniques.

Auto-Fusion
This method encourages the model to extract inter-modal features by maximizing the correlation between multimodal inputs. In this method, we first concatenate individual unimodal features and then pass them through a transformation layer to get an autofused latent vector. We use appropriate learners for individual modalities (see Section 4). We then try to reconstruct the originally concatenated vector from the autofused latent vector. Finally, we minimize the Euclidean distance between the original and reconstructed concatenated vectors. This process ensures that the learned autofused vector does not contain arbitrary signals from the input concatenated latent vector. Furthermore, training the model for a downstream task such as emotion recognition incentivizes it to "compress" information without losing any essential cues. In other words, it increases the correlation between the autofused and the concatenated latent vectors. This generic procedure applies to any scenario where multiple features need to be combined. For example, it can even be used to combine the forward and backward hidden states of LSTMs (Hochreiter and Schmidhuber, 1997), instead of pooling methods such as 1D pooling, max pooling, sum pooling, or even simple concatenation.

Figure 2: Overall architecture of GAN-Fusion. The solid and dashed lines at the bottom represent input from the target and the complementary modalities, respectively, for $G^m_f$, the GAN-Fusion module with target modality $m \in \{t, s, v\}$. Furthermore, $F^m_c$ represents the feed-forward layer that produces the fused multimodal representation $z_{fuse}$, which is, in turn, fed to the target decoder for generation networks and to the fully-connected network for classification networks.
We now discuss the Auto-Fusion network in detail. We pose fusion of multimodal inputs as a compression problem, where we must retain as much information from the individual modalities as possible. Given $n$ ($\leq 3$ in our case) $d_i$-dimensional multimodal latent vectors $z^{d_1}_{m_1}, z^{d_2}_{m_2}, \ldots, z^{d_n}_{m_n}$, we first concatenate them to obtain a vector $z^k_m$, where $k = \sum_{i=1}^{n} d_i$. Then, we apply a transformation $T$ to $z^k_m$, reducing its number of dimensions to $t$. We then use $z^t_m$ to reconstruct the originally concatenated vector, obtaining $\hat{z}^k_m$. Finally, we calculate the loss, $J_{tr}$, between $\hat{z}^k_m$ and $z^k_m$. The simplest version of the Auto-Fusion network employs the mean squared error (MSE) loss, which aligns with our motivation to compress multimodal features: filter out the less useful signals. These steps can be followed in Figure 1(a), and the MSE loss for the Auto-Fusion network is given by:

$$J_{tr} = \lVert \hat{z}^k_m - z^k_m \rVert_2^2 \qquad (1)$$

For Auto-Fusion, we consider the intermediate vector, $z^t_m$, as the fused multimodal representation.
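As an illustration, the following is a minimal PyTorch sketch of an Auto-Fusion module following the description above; the single-linear-layer choices for $T$ and $F_c$, the Tanh non-linearity, and all names and sizes are our own assumptions rather than the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoFusion(nn.Module):
    """Auto-Fusion: compress concatenated unimodal latents into z^t_m while
    keeping enough information to reconstruct the original concatenation."""
    def __init__(self, concat_dim: int, fused_dim: int):
        super().__init__()
        self.transform = nn.Sequential(               # T: k -> t dimensions
            nn.Linear(concat_dim, fused_dim), nn.Tanh())
        self.reconstruct = nn.Linear(fused_dim, concat_dim)   # F_c: t -> k

    def forward(self, z_list):
        z_k = torch.cat(z_list, dim=-1)                # z^k_m
        z_t = self.transform(z_k)                      # autofused vector z^t_m
        z_hat = self.reconstruct(z_t)                  # reconstruction of z^k_m
        j_tr = F.mse_loss(z_hat, z_k)                  # equation (1)
        return z_t, j_tr

# usage: fuse 100-d video, 50-d speech, and 100-d text latents into 100 dims
fusion = AutoFusion(concat_dim=250, fused_dim=100)
z_v, z_s, z_t = torch.randn(4, 100), torch.randn(4, 50), torch.randn(4, 100)
z_fuse, j_tr = fusion([z_v, z_s, z_t])
print(z_fuse.shape, j_tr.item())                       # torch.Size([4, 100]), scalar
```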

GAN-Fusion
In addition to the "staticness" of existing methods, there is also the challenge of distinguishing between ambiguous cases. For instance, the sentence "Your joke blew my mind away, Kevin," could be said in a funny or sarcastic manner. Resolving ambiguity becomes especially important when working on social problems such as hate speech detection. Even when fed with the corresponding speech vector, existing methods cannot effectively distinguish between similar but different emotions such as happiness and calmness. We hypothesize that this is because they do not learn the conditional distribution of sentiment given an utterance (an utterance includes input from all available modalities).
To address this issue, we propose an adversarial training regime that is incentivized to learn the desired conditional distribution. For a task such as emotion recognition, the objective would be sentiment given an utterance. For a more challenging generation task, the model could learn a more complex behaviour, such as the association of different sentences based on how similar they sound and their polarity. Our experiments show that our GAN-based approach is better able to learn such multimodal dynamics compared to other methods.
We now describe GAN-Fusion's architecture in detail for the target modality text (denoted by $t$). For a given multimodal sample $x$, we first encode the inputs from each modality (speech, visual, and text) to get the respective latent vectors $z_s$, $z_v$, and $z_t$. Choosing a target modality such as text, we pass $z_t$ (along with random normal noise) through a generator to obtain $z_g = G(z_t)$ and simultaneously autofuse the remaining latent vectors $z_s$ and $z_v$ to obtain $z_{tr}$. In the event where we have input from only one modality in addition to text, we do not need Auto-Fusion and can simply treat the other modality's vector as $z_{tr}$. Finally, we train the network in an adversarial fashion, labelling $z_{tr}$ as positive samples and $z_g$ as negative samples. The adversarial loss, $J^t_{adv}$, is given below:

$$J^t_{adv} = \min_{G}\max_{D}\; \mathbb{E}_{z_{tr}}\big[\log D(z_{tr})\big] + \mathbb{E}_{z_t}\big[\log\big(1 - D(G(z_t))\big)\big] \qquad (2)$$

Overall, the generator $G$ tries to align features of the target modality with features from the complementary modalities, and the discriminator $D$ tries to discern the source of its input. Such a translation between latent vectors has been shown to learn an "intermediate" latent vector denoting their joint representation (Pham et al., 2019; Gao et al., 2019). Learning the latent space in such an adversarial manner induces a clustering effect, where texts associated with similar sounds and visuals are grouped together. We conjecture that the adversarial training helps the model learn the relative topology of the complementary modalities' latent space, which improves sampling from the target modality.
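The sketch below illustrates one GAN-Fusion module for a single target modality under the above formulation; the single-hidden-layer generator and discriminator, the noise dimension, and the use of binary cross-entropy to realize the adversarial loss are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GANFusion(nn.Module):
    """One GAN-Fusion module (here, target modality = text, as in Figure 1(b)).
    Single-hidden-layer G and D and the noise size are illustrative choices."""
    def __init__(self, dim: int, noise_dim: int = 16):
        super().__init__()
        self.noise_dim = noise_dim
        self.generator = nn.Sequential(                       # G(z_t, noise) -> z_g
            nn.Linear(dim + noise_dim, dim), nn.LeakyReLU(0.2), nn.Linear(dim, dim))
        self.discriminator = nn.Sequential(                   # D: z_tr (real) vs z_g (fake)
            nn.Linear(dim, dim), nn.LeakyReLU(0.2), nn.Linear(dim, 1))

    def forward(self, z_target, z_tr):
        noise = torch.randn(z_target.size(0), self.noise_dim, device=z_target.device)
        z_g = self.generator(torch.cat([z_target, noise], dim=-1))

        # discriminator: label the autofused complementary vector z_tr as real
        # and the generated vector z_g as fake
        real_logit = self.discriminator(z_tr)
        fake_logit = self.discriminator(z_g.detach())
        d_loss = F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit)) + \
                 F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))

        # generator: try to make z_g indistinguishable from z_tr
        g_loss = F.binary_cross_entropy_with_logits(
            self.discriminator(z_g), torch.ones_like(real_logit))
        return z_g, g_loss, d_loss
```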
We elucidate this effect through the following example. For the sake of simplicity, let us consider only one complementary modality (video) in this case and let text be the target modality. First, we make a reasonable assumption that video embeddings for, say, Soccer and Golf (both falling under the general category of Sports) will be mapped closer to each other and farther from video embeddings of an unrelated topic such as Cooking. When learning the intermediate representation between video and text, the text latent space is constructed such that its relative topology partially reflects the video latent space. So, token embeddings in the text latent space related to videos of similar events (Soccer and Golf) will adopt a relative positioning similar to that of the video embeddings for Soccer and Golf in the video latent space. For multimodal machine translation, if the model is fed with a Golf video and the source text as input, it may be better able to sample jargon words for Golf from the text latent space due to this topology inheritance. This ultimately improves translation quality. This is also depicted in Figure 4.

Figure 3: Using the proposed fusion techniques for generation/classification. Unimodal inputs $x_v$, $x_s$, $x_t$ are passed through their respective learners $L_v$, $L_s$, $L_t$ to obtain unimodal representations $z_v$, $z_s$, $z_t$. Here, $v$, $s$, $t$ correspond to the visual, speech, and textual modalities, respectively. The individual unimodal representations are then passed through the fusion module (either Auto-Fusion or GAN-Fusion), which outputs the fused multimodal representation $z_{fuse}$. For generation, $z_{fuse}$ is then passed through a compatible decoder, which generates outputs for the desired target modality. For classification, $z_{fuse}$ is passed through a fully-connected layer instead, which predicts the appropriate class labels.

Figure 1(b) shows the GAN-Fusion module for the text modality. GAN-Fusion, overall, has one such module for every modality. The total adversarial loss is, therefore, given by:

$$J_{adv} = J^t_{adv} + J^s_{adv} + J^v_{adv} \qquad (3)$$

where the losses $J^s_{adv}$ and $J^v_{adv}$ for speech and video, respectively, are defined similarly to $J^t_{adv}$. Figure 2 shows the overall architecture of GAN-Fusion, which consists of $G^t_f$, $G^s_f$, and $G^v_f$, the respective fusion modules for the text, speech, and video modalities. We pass the outputs of these modules through a feed-forward layer to obtain the final fused multimodal representation $z_{fuse}$.
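Building on the two sketches above, the following (again, an assumption about the exact wiring rather than the paper's code) shows how the three per-modality GAN-Fusion modules and the final feed-forward layer of Figure 2 could be composed to produce $z_{fuse}$:

```python
import torch
import torch.nn as nn

class GANFusionBlock(nn.Module):
    """Full GAN-Fusion block (cf. Figure 2): one GANFusion module per target
    modality plus a feed-forward layer producing z_fuse. Reuses the AutoFusion
    and GANFusion sketches above; the exact wiring is our assumption."""
    def __init__(self, dim: int, fused_dim: int):
        super().__init__()
        self.gan = nn.ModuleDict({m: GANFusion(dim) for m in ('t', 's', 'v')})
        # for each target modality, autofuse the two complementary modalities
        self.auto = nn.ModuleDict({m: AutoFusion(2 * dim, dim) for m in ('t', 's', 'v')})
        self.ffn = nn.Linear(3 * dim, fused_dim)           # final feed-forward layer

    def forward(self, z):
        # z: dict {'t': z_t, 's': z_s, 'v': z_v}, each of shape (batch, dim)
        outs, g_loss, d_loss, rec_loss = {}, 0.0, 0.0, 0.0
        for m in ('t', 's', 'v'):
            others = [z[o] for o in ('t', 's', 'v') if o != m]
            z_tr, j_tr = self.auto[m](others)              # autofused complements
            outs[m], gl, dl = self.gan[m](z[m], z_tr)      # adversarial alignment
            g_loss, d_loss, rec_loss = g_loss + gl, d_loss + dl, rec_loss + j_tr
        z_fuse = self.ffn(torch.cat([outs['t'], outs['s'], outs['v']], dim=-1))
        # the summed per-modality adversarial terms correspond to J_adv (equation 3)
        return z_fuse, g_loss, d_loss, rec_loss
```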

Overall Architecture
In this section, we describe the end-to-end training process for using the proposed fusion methods for 1) Generation tasks (e.g., visual question answering, multimodal machine translation) and 2) Classification tasks (e.g., speech emotion recognition, hate speech detection).

Figure 4: Visualizing the induced clustering effect of GAN-Fusion. The circles represent clusters of words related to the indicated topic (text inside the circles).

Generation: Figure 3 shows the end-to-end pipeline integrating the proposed fusion techniques for generation. We first pass raw inputs from different modalities through their respective learners to obtain their respective latent representations. These are then passed through the fusion module, which outputs a fused representation to be used for decoding. We only validate this process for generating text, but the method could very well be used for generating outputs for other target modalities. Notably, our pipeline for generation looks very similar to a Seq2Seq network; we simply introduce a fusion module between the encoder and the decoder. It should also be noted that the GAN-Fusion modules for all the target modalities ($G^t_f$, $G^s_f$, $G^v_f$) are trained simultaneously with the rest of the network.
Classification: Figure 3 also shows how to adapt the previously described generation network for classification: simply replace the decoder with a fully-connected layer to predict appropriate class labels.
It is important to note that $z_{fuse} = z^t_m$ for Auto-Fusion, and it is obtained as shown in Figure 2 for GAN-Fusion. The overall loss function for our networks can be generalized as follows:

$$J = \lambda_1 J_{fusion} + \lambda_2 J_{task} \qquad (4)$$

Here, $J_{fusion}$ refers to the loss function of the fusion network. It equals $J_{tr}$ (from equation 1) when using Auto-Fusion and $J_{adv}$ (from equation 3) when using GAN-Fusion. Furthermore, $J_{task}$ refers to the task-specific loss, i.e., the classification loss (such as a max-margin loss) or the generation loss (such as the cross-entropy loss for a Seq2Seq network). $\lambda_1$ and $\lambda_2$ are hyperparameters to tune.
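As a sketch of how the pieces fit together during training, the following shows one hypothetical training step for the Auto-Fusion variant under the generalized loss above; the GAN-Fusion variant would additionally alternate discriminator updates. The learner and decoder interfaces and the lambda values are assumptions, not details from the paper.

```python
lambda_1, lambda_2 = 0.1, 1.0    # hypothetical loss weights

def train_step(batch, learners, fusion, decoder, task_criterion, optimizer):
    """One multi-task training step: J = lambda_1 * J_fusion + lambda_2 * J_task."""
    # 1) unimodal learners -> latent vectors (dict-like access is an assumption)
    z_list = [learners[m](batch[m]) for m in ('t', 's', 'v')]

    # 2) adaptive fusion: AutoFusion returns the fused vector and J_tr (equation 1)
    z_fuse, j_fusion = fusion(z_list)

    # 3) task head: a decoder for generation or a fully-connected layer for classification
    logits = decoder(z_fuse)
    j_task = task_criterion(logits, batch['target'])

    # 4) joint end-to-end update of learners, fusion module, and task head
    loss = lambda_1 * j_fusion + lambda_2 * j_task
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```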

Experimental Setup
We measure our models' effectiveness on two tasks: 1) multimodal machine translation and 2) multimodal emotion recognition. The subsequent sections describe our complete experimental setup, including datasets and baselines used, implementation details, and evaluation metrics.

Datasets
We choose three datasets for our experiments, which are described as follows: IEMOCAP: We use the benchmark Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset (Busso et al., 2008) for emotion recognition. We only use the textual and speech modalities for our emotion recognition experiments. The dataset is originally split into multiple utterances for each session, and we further split each utterance file based on the provided start and end timestamps to obtain wav files for each sentence. This results in a total of ∼10K audio files, which are then used to extract features for predicting a given utterance's emotion. Concretely, we identify the task as an emotion recognition problem, where, given a sentence and its audio, we aim to infer the correct emotion for that utterance.
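For illustration, a hypothetical version of this splitting step is sketched below; the file paths, the timestamp format, and the example utterance ID are assumptions rather than IEMOCAP's exact layout.

```python
# Cut a session-level recording into per-sentence WAV files using
# start/end timestamps (illustrative only).
import torchaudio

def split_utterance(wav_path, segments, out_dir):
    """segments: list of (utt_id, start_sec, end_sec) tuples."""
    waveform, sample_rate = torchaudio.load(wav_path)
    for utt_id, start, end in segments:
        s, e = int(start * sample_rate), int(end * sample_rate)
        torchaudio.save(f"{out_dir}/{utt_id}.wav", waveform[:, s:e], sample_rate)

# e.g. split_utterance("session_dialog.wav",
#                      [("dialog_utt000", 6.29, 8.23)], "wavs")
```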
How2: We evaluate our models on the multimodal How2 dataset (Sanabria et al., 2018), which comprises 79,114 instructional videos, their Kaldi (Povey et al., 2011) audio features, and word-level time alignments of English-to-Portuguese translations. The How2 dataset is trimodal, in contrast to other multimodal datasets (Matthews et al., 2002; Patterson et al., 2002). This makes it suitable for evaluating the contribution of each modality on different tasks. Further, as a large-scale multilingual dataset, it provides a convenient medium for neural machine translation in our work.

Figure 5: A multimodal sample from the How2 dataset (Sanabria et al., 2018).

Table 1: Results for machine translation on the How2 dataset. 't', 's', 'v' represent the text, speech, and video modalities, respectively. Here, 'attn' refers to word-level attention (Luong et al., 2015).

Table 3: Precision (P), Recall (R), F1-score (F), and Accuracy (A) for emotion recognition on IEMOCAP.

Multi30K: In addition to the How2 dataset, we also run experiments on the bimodal Multi30K dataset (Elliott et al., 2016), a benchmark dataset for machine translation extended for German, where each sample has an image, its description in the source language, and its translated version. We only run experiments on the En-Fr version, however.

Implementation details
Generation: We train the network shown in Figure 3 with Auto-Fusion and GAN-Fusion as fusion modules. We use an LSTM encoder with 256 hidden units as the learner for textual description in our generation experiments, unless otherwise stated. For the How2 dataset, we use the already provided 2048-dimensional feature vectors for video as raw input for the video learner, and we feed the Kaldi speech vectors to the speech learner, which is a simple feed-forward layer in this case. For the Multi30K dataset, we use a pre-trained VGG (Simonyan and Zisserman, 2015) to encode images, and we do not have a speech learner as there is no speech input in the dataset. The latent dimension is 100 for the fused vector.
Classification: For the task of speech emotion recognition, we train a multimodal classifier on the IEMOCAP dataset. We use LSTMs with 256 hidden units to encode text. For audio, we first pre-process the raw audio files to obtain a lower-dimensional feature vector and then use LSTMs with 50 hidden units as a learner. We predict emotion labels through a fully-connected layer as shown in Figure 3. The latent dimension for the fused vectors is 50 in this case.
When training the GAN-Fusion network, we follow numerous tricks (Goodfellow, 2016) to ensure training stability. A few that helped the most include input normalization, batch normalization, the Leaky ReLU activation function (Maas et al., 2013), and the Adam optimizer (Kingma and Ba, 2015) for the generator and discriminator networks. The presence of multiple auxiliary losses in our networks also helped.
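For illustration, the snippet below shows how the listed tricks might be wired up; the learning rates and Adam betas are common GAN defaults rather than values reported in this paper.

```python
# Illustrative only: wiring up the listed GAN stability tricks.
import torch
import torch.nn as nn

dim = 100
generator = nn.Sequential(
    nn.Linear(dim + 16, dim), nn.BatchNorm1d(dim), nn.LeakyReLU(0.2),  # batch norm + Leaky ReLU
    nn.Linear(dim, dim))
discriminator = nn.Sequential(
    nn.Linear(dim, dim), nn.LeakyReLU(0.2),
    nn.Linear(dim, 1))

# separate Adam optimizers for the generator and discriminator
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

def normalize(z):
    """Input normalization: zero-mean, unit-variance latent vectors."""
    return (z - z.mean(dim=0)) / (z.std(dim=0) + 1e-8)
```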
All the networks in our experiments are implemented in PyTorch (Paszke et al., 2019). To train the different classification networks and generation networks on the Multi30K dataset, we use an Nvidia RTX 2080Ti with 12GB of RAM. However, to train our trimodal networks on the How2 dataset, we use Nvidia P100 with 16 GB RAM to fit video feature vectors in memory.

Baselines
How2: We use the following baselines for experiments on the How2 dataset: • Seq2Seq: A sequence-to-sequence model with an attention mechanism (Luong et al., 2015). It employs the previously described learners for each modality and early fusion.
Multi30K: We use the following baselines for the Multi30K dataset: • Unimodal Seq2Seq: A text-only NMT system used by Elliott et al. (2017).

IEMOCAP:
We use the following baselines for the IEMOCAP dataset: • LSTM (t): A unimodal LSTM classifier with an attention mechanism, trained using only text.
• LSTM ([s;t]): A bimodal LSTM classifier with an attention mechanism applied to text only. We use the concatenation of speech and text features as the joint multimodal representation.
We compare the above baselines' performance with our two main models, GAN-Fusion and Auto-Fusion, which replace early fusion in the Seq2Seq baseline. We report the results in Tables 1, 2, and 3.

Evaluation metrics
We use Precision, Recall, F1-score, and classification accuracy to evaluate our classification networks trained for speech emotion recognition. For experiments on the How2 and Multi30K datasets, we use BLEU (Papineni et al., 2002) to evaluate the quality of the translated sentences. For How2, we compute BLEU1-BLEU4 scores of the different models under consideration. For Multi30K, we also use METEOR (Banerjee and Lavie, 2005), which is a weighted harmonic mean of unigram precision and unigram recall, providing a better indication of translation quality. For all the mentioned evaluation metrics, a higher number denotes better performance.

Figure 6: Ablation test on the How2 dataset: word drop probability vs. BLEU4. A sudden drop in BLEU scores as we move from 0.3 to 0.4 indicates that our model was able to compensate for ∼30% of the missing text.
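As an aside, corpus-level BLEU can be computed, for example, with the sacrebleu package as sketched below; the paper does not specify which BLEU implementation was used, and the sentences are made up for illustration.

```python
# One common way to compute corpus-level BLEU (an assumption, not necessarily
# the implementation used in the paper's experiments).
import sacrebleu

hypotheses = ["the player hits the golf ball onto the green"]
references = [["the player hits the golf ball onto the green area"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 2))   # corpus BLEU (4-gram BLEU by default)
```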

Results
Quantitative analysis: Results of our experiments on the How2, Multi30K, and IEMOCAP datasets are shown in Tables 1, 2, and 3, respectively. For speech emotion recognition, we observe that our models consistently perform well across all the evaluation metrics. For the relatively challenging multimodal machine translation task, we observe that our model outperforms all existing baselines in terms of BLEU scores. Compared against the best-performing baselines, GAN-Fusion improves BLEU4 scores by 3.63 points and 0.13 points on the How2 and Multi30K datasets, respectively. This shows that the fusion module was better able to extract signals from all modalities. On the Multi30K dataset, our model is competitive in terms of METEOR. Such performance becomes more pronounced considering that our models have only one attention module, in contrast to multiple self-attention heads in the transformer-based baselines.
The size of our best-performing models (employing GAN-Fusion) is roughly 6M and 11M parameters for classification and generation, respectively, which is significantly lower than traditional transformer models. Fewer trainable parameters reduce the computational cost and the training time. This also indicates the potential bypassing of a mechanism like distillation, which is used to reduce parameters in transformers (Sanh et al., 2019); however, more thorough experimentation is required to reach a concrete conclusion. Our models can also be used in conjunction with transformers, where we utilize the transformers to learn meaningful unimodal feature vectors initially and then employ the proposed fusion methods.
We also perform a comprehensive set of qualitative experiments on the How2 dataset to understand the capability of our fusion techniques. They are described as follows. Effect of introducing more modalities: To understand the effect of introducing new modalities separately, we perform experiments with different combinations of source modalities, including individual unimodal baselines (refer to Table 1). The results reveal that both the auditory and visual modalities always contribute towards enhanced translation, but the contribution of the visual modality is slightly lower (indicated by the smaller increase in BLEU scores in Table 1). This is also consistent with the findings of Grönroos et al. (2018).
Robustness of multimodal features: It is very important for the learned multimodal latent features to be robust, i.e., they should be able to exploit signals from complementary modalities to compensate for the presence of noise in one modality. Therefore, to gauge the robustness of learned multimodal features, we conduct an ablation test on the How2 dataset. We randomly replace some tokens in the test sentence with an <UNK> token and attempt to translate using our best performing model, GAN-Fusion. Figure 6 shows results of our ablation study. We can see that features from the complementary modalities are able to compensate for ∼ 30% of the missing text as we see a sharp drop in BLEU scores beyond that point. This shows that the model does not rely on just the textual description for translation; it also tries to gain a contextual understanding. Hence, it follows that the learned joint representation indeed contains rich information from other modalities.
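A minimal sketch of this word-drop ablation (whitespace tokenization and the exact replacement scheme are our assumptions) is given below:

```python
import random

def drop_words(sentence: str, p: float, unk: str = "<UNK>") -> str:
    """Randomly replace each token with <UNK> with probability p,
    as in the robustness ablation (details are illustrative assumptions)."""
    tokens = sentence.split()
    return " ".join(unk if random.random() < p else tok for tok in tokens)

# e.g. translate drop_words(src, p) for p in {0.0, 0.1, ..., 0.5} and track BLEU4
print(drop_words("the player hits the golf ball onto the green", 0.3))
```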

Conclusion and Future Work
In this paper, we propose two adaptive fusion techniques that allow for effective multimodal fusion. Instead of "fixing" a fusion operation a priori, we let the model decide "how" to extract and effectively combine signals from different modalities. Moreover, the joint multimodal representations learned by such models are empirically shown to be robust, which allows the system to maintain good performance even in the absence of some information. Our results indicate that such adaptive models, despite being lightweight, do not compromise on performance compared to their more massive counterparts, such as transformers, which is a significant gain.
Our experiments indicate the importance of learning richer unimodal representations. This also suggests that using these methods in conjunction with transformers, which may learn richer unimodal representations, should further improve performance on downstream tasks such as multimodal machine translation and speech emotion recognition. Currently, the attention mechanism is applied only to text, so another simple way to improve performance would be to introduce visual and acoustic attention mechanisms as well; however, we would still need to address the core problem of heterogeneity.
On training GANs: Training GANs is known to be difficult. However, we employed various tricks to ensure the training stability of our models, especially the ones employing GAN-Fusion. In fact, an amusing line of exploration towards learning a better adaptive model could be to probe the implicit assumptions of GANs themselves. GANs are known to exhibit numerous issues in practice (Arora et al., 2017; Sinn and Rawat, 2018). In Li and Malik (2018), the authors argue for the need to return to the principle of maximum likelihood, insisting on full recall.
It is essential to note that much is unknown about these models. More concrete and sound reasoning about their success will rely on two vital components: 1) understanding the dynamics of the learned latent space, and 2) aligning multimodal features to address heterogeneity. Both of these components require more interpretable representations of the otherwise black-box models.