MM-Align: Learning Optimal Transport-based Alignment Dynamics for Fast and Accurate Inference on Missing Modality Sequences

Existing multimodal tasks mostly target the complete input modality setting, i.e., each modality is either complete or completely missing in both training and test sets. However, randomly missing situations remain underexplored. In this paper, we present a novel approach named MM-Align to address the missing-modality inference problem. Concretely, we propose 1) an alignment dynamics learning module based on the theory of optimal transport (OT) for missing data imputation; and 2) a denoising training algorithm to enhance the quality of imputation as well as the accuracy of model predictions. Compared with previous generative methods, which are devoted to restoring the missing inputs, MM-Align learns to capture and imitate the alignment dynamics between modality sequences. Results of comprehensive experiments on two multimodal tasks empirically demonstrate that our method can perform more accurate and faster inference and alleviate the overfitting issue under different missing conditions.


Introduction
The topic of multimodal learning has grown unprecedentedly prevalent in recent years (Ramachandram and Taylor, 2017; Baltrušaitis et al., 2018), spanning a variety of machine learning tasks such as computer vision (Zhu et al., 2017; Nam et al., 2017), natural language processing (Fei et al., 2021; Ilharco et al., 2021), autonomous driving (Caesar et al., 2020) and medical care (Nascita et al., 2021). Despite the promising achievements in these fields, most existing approaches assume a complete input modality setting of the training data, in which every modality is either complete or completely missing (at inference time) in both training and test sets (Pham et al., 2019; Tang et al., 2021; Zhao et al., 2021), as shown in Fig. 1a and 1b.
Such synergies between train and test sets in the modality input patterns are usually far from the realistic scenario, where there is a certain portion of data without parallel modality sequences, probably due to noise pollution during collection and preprocessing. In other words, data from each modality are more likely to be missing at random (Fig. 1c and 1d) than completely present or missing (Fig. 1a and 1b) (Pham et al., 2019; Tang et al., 2021; Zhao et al., 2021). Based on the complete input modality setting, a family of popular routines for missing-modality inference is to design intricate generative modules attached to the main network and train the model under full supervision with complete modality data. By minimizing a customized reconstruction loss, the data restoration (a.k.a. missing data imputation (Van Buuren, 2018)) capability of the generative modules is enhanced (Pham et al., 2019; Wang et al., 2020; Tang et al., 2021) so that the model can be tested in the missing situations (Fig. 1b). However, we notice that (i) if modality-complete data in the training set is scarce, a severe overfitting issue may occur, especially when the generative model is large (Robb et al., 2020; Schick and Schütze, 2021; Ojha et al., 2021); and (ii) global attention-based imputation (i.e., attention over the whole sequence) may bring unexpected noise, since true correspondence mainly exists between temporally adjacent parallel signals (Sakoe and Chiba, 1978). Ma et al. (2021) proposed to leverage a unit-length sequential representation, produced from the observed complete modality, to represent the missing modality during training. Nevertheless, such methods inevitably overlook the temporal correlation between modality sequences and only achieve fair performance on the downstream tasks.
To mitigate these issues, in this paper we present MM-Align, a novel framework for fast and effective multimodal learning on randomly missing multimodal sequences. The core idea behind the framework is to imitate some indirect but informative clues about the paired modality sequences instead of learning to restore the missing modality directly. The framework consists of three essential functional units: 1) a backbone network that handles the main task; 2) an alignment matrix solver based on the optimal transport algorithm, which produces context-window style solutions in which only part of the values are non-zero, and an associated meta-learner to imitate the dynamics and perform imputation in the modality-invariant hidden spaces; and 3) a denoising training algorithm that optimizes and coalesces the backbone network and the learner so that they can work robustly on the main task in missing-modality scenarios. To empirically study the advantages of our models over current imputation approaches, we test on two settings of the random missing conditions, as shown in Fig. 1c and Fig. 1d, for all possible modality pair combinations. To the best of our knowledge, this is the first work that applies optimal transport and denoising training to the problem of inference on missing modality sequences. In a nutshell, the contribution of this work is threefold: • We propose a novel framework to facilitate the missing modality sequence inference task, in which we devise an alignment dynamics learning module based on the theory of optimal transport and a denoising training algorithm to coalesce it into the main network.
• We design a loss function that enables a context-window style solution for the dynamics solver.
• We conduct comprehensive experiments on three publicly available datasets from two multimodal tasks. Results and analysis show that our method leads to faster and more accurate inference on missing modalities.
2 Related Work

Multimodal Learning
Multimodal learning has attracted widespread attention, as it offers a more comprehensive view of the world for the task that researchers intend to model (Atrey et al., 2010; Lahat et al., 2015; Sharma and Giannakos, 2020). The most fundamental technique in multimodal learning is multimodal fusion (Atrey et al., 2010), which attempts to extract and integrate task-related information from the input modalities into a condensed representative feature vector. Conventional multimodal fusion methods encompass cross-modality attention (Tsai et al., 2018, 2019; Han et al., 2021a), matrix algebra based methods (Zadeh et al., 2017; Liu et al., 2018; Liang et al., 2019) and invariant space regularization (Colombo et al., 2021; Han et al., 2021b). While most of these methods focus on complete modality input, many take the missing modality inference situation into account as well (Pham et al., 2019; Wang et al., 2020; Ma et al., 2021), usually by incorporating a generative network that imputes the missing representations by minimizing a reconstruction loss. However, the formulation under missing patterns remains underexplored, and that is what we address in this paper.

Meta Learning
Meta-learning, or learning to learn, is an active research topic that focuses on how to generalize the learning approach from a limited number of visible tasks to broader task types. Early efforts to tackle this problem are based on comparison, such as relation networks (Sung et al., 2018) and prototype-based methods (Snell et al., 2017; Qi et al., 2018; Lifchitz et al., 2019). Other works reformulate this problem as transfer learning (Sun et al., 2019) or multi-task learning (Pentina et al., 2015; Tian et al., 2020), which seek an effective transformation from previous knowledge that can be adapted to new unseen data, and further fine-tune the model on handcrafted hard tasks. In our framework, we treat the alignment matrices as the training target for the meta-learner. Combined with a self-adaptive denoising training algorithm, the meta-learner can significantly enhance the accuracy of predictions in the missing modality inference problem.
3 Method

3.1 Problem Definition

Given a multimodal dataset D = {D_train, D_val, D_test}, where D_train, D_val and D_test are the training, validation and test sets, respectively. In the training set, x^{m_k}_i = {x^{m_k}_{i,1}, ..., x^{m_k}_{i,t}} are input modality sequences and m_1, m_2 denote the two modality types; some modality inputs are missing with probability p′. Following Ma et al. (2021), we assume that modality m_1 is complete and the random missing only happens on modality m_2, which we call the victim modality. Consequently, we can divide the training set into the complete and missing splits, denoted as D_train^c and D_train^m. For the validation and test sets, we consider two settings: a) the victim modality is missing completely (Fig. 1c), denoted as "Setting A" in the experiment section; b) the victim modality is missing with the same probability p′ (Fig. 1d), denoted as "Setting B", in line with Ma et al. (2021). We consider two multimodal tasks: sentiment analysis and emotion recognition, in which the label y_i represents the sentiment value (polarity as positive/negative and value as strength) and the emotion category, respectively.

Overview
Our framework encompasses a backbone network (green), an alignment dynamics learner (ADL, blue), and a denoising training algorithm that optimizes both the learner and the backbone network concurrently. We highlight the ADL, which serves as the core functional unit in the framework. Motivated by the idea of meta-learning, we seek to generate substitute representations for the missing modality through an indirect imputation clue, i.e., alignment matrices, instead of learning to restore the missing modality by minimizing reconstruction losses. To this end, the ADL incorporates an alignment matrix solver based on the theory of optimal transport (Villani, 2009), a non-parametric method to capture alignment dynamics between time series (Peyré et al., 2019; Chi et al., 2021), as well as an auxiliary neural network to fit the dynamics and generate meaningful representations, as illustrated in §3.4.

Architecture
Backbone Network The overall architecture of our framework is depicted in Fig. 2. We harness MulT (Tsai et al., 2019), a fusion network derived from the Transformer (Vaswani et al., 2017), as the backbone structure, since a number of its variants in preceding works acquire promising outcomes in multimodal tasks (Wang et al., 2020; Han et al., 2021a; Tang et al., 2021). MulT has two essential components: the unimodal self-attention encoder and the bimodal cross-attention encoder. Given modality sequences x^{m_1}, x^{m_2} (for unimodal self-attention we have m_1 = m_2) as the model's inputs, after padding a special token x^{m_1}_0 = x^{m_2}_0 = [CLS] to their individual heads, a single transformer layer (Vaswani et al., 2017) encodes a sequence through multi-head attention (MATT) and a feed-forward network (FFN), where LN is layer normalization. In our experiments, we leverage this backbone structure for both input modality encoding and multimodal fusion.
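As an illustrative sketch of the encoding step above (a single-head NumPy rendering, not the authors' multi-head implementation; the post-LN residual ordering and random weights are our assumptions):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each timestep's features to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transformer_layer(z_tgt, z_src, Wq, Wk, Wv, W1, W2):
    """One (cross-)attention layer: the target sequence attends to the source.
    For unimodal self-attention, pass z_src = z_tgt."""
    q, k, v = z_tgt @ Wq, z_src @ Wk, z_src @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))       # (l_tgt, l_src) weights
    h = layer_norm(z_tgt + attn @ v)                     # residual + LN after MATT
    return layer_norm(h + np.maximum(0.0, h @ W1) @ W2)  # residual + LN after FFN

rng = np.random.default_rng(0)
d = 8
z1, z2 = rng.normal(size=(5, d)), rng.normal(size=(7, d))
Ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(5)]
out = transformer_layer(z1, z2, *Ws)  # modality-1 sequence attends to modality-2
```

In MulT, several such layers are stacked and run with multiple heads in both the unimodal and bimodal encoders.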
Output Layer We extract the head embeddings z^{12}_0, z^{21}_0 from the output of the fusion network as features for regression. The regression network is a two-layer feed-forward network applied to the concatenation of the two head embeddings. The mean squared error (MSE) is adopted as the loss function for the regression task (Eq. 5).
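A minimal sketch of the output layer (the ReLU nonlinearity and random weights are illustrative assumptions; only the two-layer structure, the concatenated head embeddings and the MSE objective come from the text):

```python
import numpy as np

def regression_head(z12_0, z21_0, W1, b1, W2, b2):
    # Two-layer feed-forward network over the concatenated head embeddings.
    h = np.concatenate([z12_0, z21_0])
    return W2 @ np.maximum(0.0, W1 @ h + b1) + b2

def mse(y_hat, y):
    # Mean squared error used as the regression loss.
    y_hat, y = np.asarray(y_hat, float), np.asarray(y, float)
    return float(((y_hat - y) ** 2).mean())

rng = np.random.default_rng(0)
z12, z21 = rng.normal(size=8), rng.normal(size=8)
W1, b1 = rng.normal(scale=0.1, size=(16, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(1, 16)), np.zeros(1)
y_hat = regression_head(z12, z21, W1, b1, W2, b2)  # scalar sentiment prediction
```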

Alignment Dynamics Learner (ADL)
The learner has two functional modules, named the alignment dynamics solver and fitter, as shown in Fig. 2. It also runs in two functional modes, namely learning and decoding. The ADL works in learning mode when the model is trained on the complete data (marked by the solid lines in Fig. 2). The decoding mode is triggered when one of the modalities is missing, which happens during training on the missing split and during the entire test time (marked by the dashed lines in Fig. 2).
Learning Mode In the learning mode, the solver calculates an alignment matrix which provides information about the temporal correlations between the two modality sequences. Similar to previous works (Peyré et al., 2019; Chi et al., 2021), this problem can be formulated as an optimal transport (OT) task, where A is the transportation plan that implies the alignment information (Peyré et al., 2019) and M is the cost matrix. The subscript ij represents the component from the i-th timestamp in the source modality to the j-th timestamp in the target modality. Different from Peyré et al. (2019) and Chi et al. (2021), which allow alignment between any two positions of the two sequences, we believe that in parallel time series the temporal correlation mainly exists between signals inside a time-specific "window" (i.e., |j − i| ≤ W, where W is the window size) (Sakoe and Chiba, 1978). Additionally, the cost function should be negatively correlated with the similarity, as in the original OT problem. To realize these motivations, we borrow the concept of a barrier function (Nesterov et al., 2018) and define the cost function for our optimal transport problem accordingly, where z^m_i is the representation of modality m at timestamp i and cos(·, ·) is the cosine value of two vectors. We will show that this type of transportation cost function ensures a context-window style alignment solution, and we provide a proof in appendix C. To solve Eq. (6), a common practice is to add an entropic regularization term. The unique solution A* can then be calculated through Sinkhorn's algorithm (Peyré et al., 2019), where the vectors u and v are obtained through iteration until convergence. After quantifying the temporal correlation into alignment matrices, we enforce the learner to fit those matrices so that, in the decoding mode, it can automatically approximate the matrices from the non-victim modality. Specifically, a prediction network composed of a gated recurrent unit (Chung et al., 2014) and a linear projection layer takes the shared representations of the complete modality as input and outputs the prediction values for the entries, where ψ = {ψ_r, ψ_t} is the collection of parameters in the prediction network. T̂ = {t̂_1, t̂_2, ..., t̂_l} ∈ R^{l×(2W+1)} are the predictions for A*, and t̂_i ∈ R^{2W+1} is the prediction for the alignment matrix segment A*_{i,i−W:i+W}, i.e., the alignment components which span within the radius of W centered at the current timestamp i. We take the mean squared error (MSE) between the "truths" generated from the solver and the predictions as the fitting loss (Eq. 13), where the summation is over the entries within context windows and we define A*_{ij} = 0 if j ≤ 0 or j > l for better readability.
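The solver can be sketched as a small Sinkhorn routine. This is a hedged illustration: the paper's exact barrier-based cost is not reproduced here; we substitute a cosine-distance cost set to infinity outside the window, which likewise confines the support of the transport plan to the context window.

```python
import numpy as np

def alignment_matrix(z1, z2, W=3, eps=0.1, n_iter=50):
    """Entropic-OT alignment with a context-window constraint (sketch).
    Cost: 1 - cos(z1_i, z2_j) inside |i - j| <= W, infinite outside."""
    a = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    b = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    M = 1.0 - a @ b.T                            # cosine distance
    i, j = np.indices(M.shape)
    M = np.where(np.abs(i - j) <= W, M, np.inf)  # window "barrier"
    K = np.exp(-M / eps)                         # Gibbs kernel; 0 outside window
    mu = np.full(len(z1), 1 / len(z1))           # uniform source marginal
    nu = np.full(len(z2), 1 / len(z2))           # uniform target marginal
    u = np.ones(len(z1))
    for _ in range(n_iter):                      # Sinkhorn iterations
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]           # transport plan A*

rng = np.random.default_rng(0)
A = alignment_matrix(rng.normal(size=(10, 8)), rng.normal(size=(10, 8)), W=2)
```

Because the kernel vanishes outside the band, every Sinkhorn iterate, and hence the converged plan, is exactly zero there; this converged band is what the fitter is trained to predict.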
Decoding Mode In this mode, the learner behaves like a decoder that strives to generate meaningful substitutes for the missing modality sequences. The learner first decodes an alignment matrix Â via the fitting network, whose parameters are frozen during this stage. Afterward, the imputation of the missing modality at position j can be obtained through a linear combination of the alignment matrix and the visible sequence. We concatenate all these vectors to construct the imputation for the missing modality Ẑ² in the shared space, where ẑ²_0 is reassigned the initial embedding of the [CLS] token. The imputation results together with the complete modality sequences are then fed into the fusion network (Eq. (1)~(3)) to continue the subsequent procedure.
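A sketch of the decoding-mode imputation; the banded layout of the learner's predictions follows §3.4, while column-normalizing the decoded matrix before the linear combination is our assumption:

```python
import numpy as np

def band_to_full(T_hat, W):
    """Scatter banded predictions t_i (length 2W+1, centered at timestamp i)
    into a full l x l alignment matrix; out-of-range positions are dropped."""
    l = len(T_hat)
    A = np.zeros((l, l))
    for i, row in enumerate(T_hat):
        for k, val in enumerate(row):
            j = i - W + k
            if 0 <= j < l:
                A[i, j] = val
    return A

def impute(A_hat, Z1):
    """Impute the victim modality as z2_j = sum_i A_ij * z1_i,
    normalizing each column of A_hat so the weights sum to one."""
    col = A_hat.sum(axis=0, keepdims=True)
    return (A_hat / np.where(col > 0, col, 1.0)).T @ Z1

rng = np.random.default_rng(0)
W = 2
T_hat = rng.random(size=(6, 2 * W + 1))  # learner's banded predictions
Z1 = rng.normal(size=(6, 4))             # visible modality in the shared space
Z2_hat = impute(band_to_full(T_hat, W), Z1)
```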

Denoising Training
Inspired by previous work in data imputation (Kyono et al., 2021), we design a denoising training algorithm to promote prediction accuracy and imputation quality concurrently, as shown in Alg. 1. In the beginning, we warm up the model on the complete split of the training set. We utilize two transformer encoders to project the input modality sequences x^{m_1} and x^{m_2} into a shared feature space, denoted as Z¹ and Z². Following Han et al. (2021b), we apply a contrastive loss (Chen et al., 2020) as the regularization term to force a similar distribution of the generated vectors Z¹ and Z², where the summation is over the whole batch of size N_b and ϕ is a score function with an annealing temperature τ as the hyperparameter. Next, the denoising training loop proceeds to couple the ADL and the backbone network. In a single loop, we first train the alignment dynamics learner (lines 9~11), then we train the backbone network on the complete split (lines 12~13) and the missing split (lines 15~17). Since the learner training process uses the modality-complete split, and we found in experiments (§4.4) that the model's performance stays nearly constant if the tuning of the learner and the main network occurs concurrently on every batch, we merge them into a single loop (lines 8~14) to reduce redundant batch iteration.
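Since the extraction elides the exact forms of L_con and ϕ, the sketch below uses a standard InfoNCE-style instantiation with ϕ(a, b) = exp(cos(a, b)/τ), in the spirit of Chen et al. (2020); this is an assumption, not the authors' exact formula.

```python
import numpy as np

def contrastive_loss(Z1, Z2, tau=0.1):
    """InfoNCE-style regularizer pulling paired rows of Z1 and Z2 together
    while pushing apart non-paired rows within the batch (hedged sketch)."""
    a = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    b = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    sim = a @ b.T / tau            # temperature-scaled cosine similarities
    # Positive pairs sit on the diagonal; each row is a softmax over candidates.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.diag(log_prob).mean())

rng = np.random.default_rng(0)
loss = contrastive_loss(rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
```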

Datasets
We utilize CMU-MOSI (Zadeh et al., 2016) and CMU-MOSEI (Zadeh et al., 2018) for sentiment prediction, and MELD (Poria et al., 2019) for emotion recognition, to create our evaluation benchmarks. The statistics of these datasets and the preprocessing steps can be found in appendix A. All these datasets consist of three parallel modality sequences: text (t), visual (v) and acoustic (a). In a single run, we extract a pair of modalities and select one of them as the victim modality, from which we then randomly remove a fraction p′ = 1 − p of all its sequences. Here p is the surviving rate, for convenience of description. We preprocess the test sets as in Fig. 1c (removing all victim modality samples) in Setting A and as in Fig. 1d (randomly removing p′ of the victim modality samples) in Setting B. Setting B inherits from Ma et al. (2021), while the newly added Setting A is a complementary test case with a more severe missing situation, which can compare the efficacy of pure imputation methods and enrich the connotation of robust inference. We run experiments with two values p ∈ {10%, 50%}; dissimilar to Ma et al. (2021), we enlarge the gap between the two p values to strengthen the distinction between these settings.
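The construction of the complete and missing splits can be sketched as follows (field names and the dictionary layout are illustrative, not the authors' data format):

```python
import random

def make_missing_split(dataset, victim, p_survive, seed=0):
    """Split a dataset into a complete split and a split where the victim
    modality is removed with probability p' = 1 - p_survive."""
    rng = random.Random(seed)
    complete, missing = [], []
    for sample in dataset:
        if rng.random() < p_survive:
            complete.append(sample)  # parallel pair kept
        else:
            s = dict(sample)
            s[victim] = None         # victim modality dropped
            missing.append(s)
    return complete, missing

data = [{"t": f"text{i}", "v": f"vid{i}"} for i in range(1000)]
comp, miss = make_missing_split(data, victim="v", p_survive=0.10)  # p = 10%
```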

Baselines and Evaluation Metrics
We compare our models with the following relevant and strong baselines: • Supervised-Single trains and tests the backbone network on a single complete modality, which can be regarded as the lower bound (LB) for all the baselines.
• Supervised-Double trains and tests the backbone network on a pair of complete modalities, which can be regarded as the upper bound (UB).
• MFM (Tsai et al., 2018) learns modality-specific generative factors that can be produced from other modalities at training time and imputes the missing modality based on these factors at test time.
• SMIL (Ma et al., 2021) imputes the sequential representation of the missing modality by linearly adding clustered center vectors with weights from learned Gaussian distribution.
The characteristics of all these models are listed for comparison in Table 1. Previous work relies on either a Gaussian generative or a sequence-to-sequence formulation to reconstruct the victim modality or its sequential representations, while our model adopts neither of these architectures. We run our models under 5 different splits and report the average performance. The training details can be found in appendix B.
We compare these models on the following metrics: for the sentiment prediction task, we employ the mean absolute error (MAE), which quantifies how far the prediction value deviates from the ground truth, and the binary classification accuracy (Acc-2), which counts the proportion of samples correctly classified into positive/negative categories; for the emotion recognition task, we compare the average F1 score over the seven emotion classes.
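The three metrics can be computed as below (a minimal sketch; we assume an unweighted average for the F1 score, with classes absent from both predictions and labels contributing zero):

```python
import numpy as np

def mae(y_hat, y):
    # Mean absolute deviation of predictions from the ground truth.
    return float(np.abs(np.asarray(y_hat, float) - np.asarray(y, float)).mean())

def acc2(y_hat, y):
    # Binary accuracy over sentiment polarity (positive vs. negative).
    return float(((np.asarray(y_hat) > 0) == (np.asarray(y) > 0)).mean())

def avg_f1(y_hat, y, n_classes=7):
    # Per-class F1 from true/false positives and false negatives, then averaged.
    y_hat, y = np.asarray(y_hat), np.asarray(y)
    f1s = []
    for c in range(n_classes):
        tp = ((y_hat == c) & (y == c)).sum()
        fp = ((y_hat == c) & (y != c)).sum()
        fn = ((y_hat != c) & (y == c)).sum()
        f1s.append(0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn))
    return float(np.mean(f1s))
```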

Results
Due to the particularities of the three datasets, we report the results for the smallest p values at which most of these baselines yield results 1% higher than the lower bound, in Tables 2, 3 and 4. From them we mainly have the following observations: First, compared with the lower bounds, in Setting A, where models are tested with only the non-victim modality, our method gains a 6.6%~9.3% and 2.4%~4.9% accuracy increment on the CMU-MOSI and CMU-MOSEI datasets and a 0.6%~1.7% F1 increment on the MELD dataset (except A→V and A→T). Besides, MM-Align significantly outperforms all the baselines in most settings. These facts indicate that leveraging the local alignment information as indirect clues facilitates robust inference on missing modalities.
Second, model performance varies greatly, especially when the non-victim modality alters. It has been pointed out that the three modalities do not play an equal role in multimodal tasks (Tsai et al., 2019). Among them, text is usually the predominant modality that contributes the most to accuracy, while visual and acoustic have weaker effects on the model's performance. From the results, it is apparent that if the source modality is predominant, the model's performance gets closer to or even surpasses the upper bound, which reveals that the predominant modality can also offer richer clues to facilitate the dynamics learning process than the other modalities. Third, when moving from Setting A to Setting B by adding parallel sequences of the non-victim modality to the test set, results tend to stay constant in most settings. Intuitively, performance should improve if more parallel data are provided. However, as most of these models are unified and must learn to couple the restoration/imputation module and the backbone network, the classifier inevitably faces the dilemma of whether it should adapt more to the true parallel sequences or to the mixed sequences, since both patterns are included in a training epoch. Hence Setting B sometimes does not perform evidently better than Setting A. Particularly, we find that when Modal-Trans encounters overfitting, MM-Align can alleviate this trend, such as for T→A on all three datasets. Additionally, MM-Align acquires a 3~4× speedup in training. We record the time consumption and provide a detailed analysis in appendices D and E.

Ablation Study
We run our model under the following ablative settings on three randomly chosen modality pairs from the CMU-MOSI dataset in Setting A: 1) removing the contrastive loss, which serves as the invariant space regularizer; 2) removing the fitting loss, so that the ADL only generates a random alignment matrix when running in the inference mode; 3) separating the single iteration (SI) over the complete split, which concurrently optimizes the fitter and the backbone network in Alg. 1, into two independent loops. The results of these experiments are displayed in Table 5. We witness a performance drop after removing the contrastive loss, and the drop is larger if we disable the ADL, which implies the benefits of the alignment dynamics-based generalization process on the modality-invariant hidden space. Finally, merging the two optimization steps does not cause performance degradation. Therefore, it is more time-efficient to design the denoising loop as in Alg. 1 to prevent an extra dataset iteration.

Analysis
Impact of the Window Size To further explore the impact of the window size, we run our models while increasing the window size from 4 to 256, which exceeds the lengths of all sentences, so that all timestamps are enclosed by the window. The variation of MAE and F1 in this process is depicted in Fig. 4. There is a performance drop (MAE increment or F1 decrement) towards both sides of the optimal size. We argue that this is because, as the window expands, it is more probable for the newly included frame to add noise rather than provide valuable alignment information. In the beginning, the marginal benefit is large, so the performance almost keeps climbing. The optimal size is reached when the marginal benefit decreases to zero. To illustrate this claim, we randomly select a raw example from the CMU-MOSI dataset. As shown in Fig. 3, the textual expression does not advance at a uniform speed. From the second to the third word, 1.80 seconds elapse, while the last eight words are covered in only 2.53 seconds. Intuitively, we can assume that all the frames in the video that span the pronunciation of a word are causally correlated with that word, so that the representation mappings from the word to these frames are necessary and can benefit the downstream tasks. For example, the word "I", present at t = 1 in the text, can benefit the timestamps until at least t = 5 in the visual modality. Note that we may overlook some potential advantages that cannot be easily justified in this way and possess a different effect scope, but we deem that those advantages would likewise disappear as the window size keeps growing.

Conclusion
In this paper, we propose MM-Align, a fast and efficient framework for the problem of missing modality inference. It applies the theory of optimal transport to learn the alignment dynamics between temporal modality sequences for inference in the case of missing modality sequences. Experiments on three datasets demonstrate that MM-Align achieves much better performance and thus reveal the higher robustness of our method. We hope that our work can inspire other research in this field.

Limitations
Although our model has successfully tackled the two missing patterns, it may still fail in more complicated cases. For example, if the missing happens randomly in terms of frames (some timestamps within a unimodal clip) instead of instances (the entire unimodal clip), then our proposed approach cannot be directly used to deal with the problem, since we need at least several instances of complete parallel data to learn how to map from one modality sequence to the other. However, we believe these types of problems can still be properly solved by adding some mathematical tools such as interpolation. We will consider this idea as a direction for our future work.
Besides, the generalization capability of our framework to other multimodal tasks is not clear. But at least we know the feasibility highly depends on the type of target task, especially the input format: the inputs have to be parallel sequences so that the temporal alignment information between them can be utilized. The missing patterns should also be similar to those we described in the problem definition.

A Dataset Statistics and Preprocessing
The statistics of the three datasets are listed in Table 6. MELD is originally a dialogue emotion detection dataset, where each dialogue contains many sentences. To make it compatible with the tested models, we extract all sentences and remove those that lack at least one modality (text, visual, acoustic). Following previous work, for MOSI and MOSEI we use COVAREP (Degottex et al., 2014) to extract acoustic features and P2FA (Yuan et al., 2008) for word-level alignment.

B Hyperparameter Search
All models are trained on a single RTX A6000 GPU. We use GloVe (Pennington et al., 2014) 300d vectors to initialize the embeddings of all tokens. We perform a grid search over part of the hyperparameters, as listed in Table 7.

C.1 Visualization of Solutions
To verify our statement in Section 3.4 that the learned dynamics matrices are in the window style, we calculate and visualize the mean absolute value of each entry. Due to the various sentence lengths, the values are averaged over all matrices whose corresponding input sequences' lengths are no smaller than 20. We visualize the heat map of the average entry values in Fig. 5. It can be clearly seen that the values outside the window stay nearly 0 (black squares).

C.2 Proof of solution pattern
We formalize the window style solution in mathematical language.
Theorem 1. Given the optimal transport formulation in Eq. (6)~(8), all the entries A*_{ij} that satisfy |i − j| > W in the optimal transport plan A* are 0, where W is the window size.
Proof. We use proof by contradiction. Assume there is an entry A_{i′j′} in A* outside the window, i.e., |i′ − j′| > W, and A_{i′j′} > 0. Then the total transport cost contains the term A_{i′j′}M_{i′j′}; since the barrier-style cost M_{i′j′} is unbounded outside the window, this cost exceeds that of any feasible plan whose support lies inside the window, which means A* is not the optimal transport plan and contradicts our basic assumption. Hence, by applying this kind of cost function we can obtain a window-style solution.

D Complexity Analysis
We conduct a simple analysis of the computational complexity of MM-Align and Modal-Trans. We are concerned with the stage that occupies the most time in one training epoch: training on the missing split, when the ADL works in the decoding mode. Suppose the average sequence length, the embedding dimension and the window size are l, d and w (here w stands for the value of 2W + 1, for simplicity), respectively. The complexity (number of multiplication operations) of the alignment dynamics fitter is the sum of the complexity of the GRU and of the linear projection layer. The time spent on the alignment dynamics solver can be ignored, since it is a non-parametric module, so no gradients are back-propagated through it, and the number of iterations required for convergence is very small (about 5). The complexity of the transformer decoder is the sum of the complexity of the encoder-decoder attention, the encoder and decoder self-attention, and the linear projections. The last inequality is an empirical conclusion, since in our experiments l ≈ 10 while d = 32 in most situations.
In particular, the complexity of the encoder-decoder attention can be calculated as the sum of l individual attention steps in the decoding procedure. It should be highlighted that the computation only takes the number of multiplications into account. Since sequence-to-sequence decoding cannot be parallelized, it takes more time to train.

E Inference Speed
As mentioned before, the most competitive baseline, Modal-Trans, is a variant of an advanced sequence-to-sequence model. Apart from the performance improvement, MM-Align also speeds up the training process. To show this, we run and compare the average batch training time of MM-Align and Modal-Trans. As shown in Table 8, MM-Align achieves over 3× training acceleration over Modal-Trans while producing sequential imputations of higher quality. We also provide an estimation of the computational complexity in appendix D.

F Additional Results
In the main text, we present the results for the minimum p in both settings. Here we also provide the results when tested in Setting A for the two preservation rates.

Figure 1: Input patterns of different modality inference problems. Here the visual modality is the victim modality that may be missing randomly. (a) modalities are both complete in the train and test sets; (b) modalities are both complete in the train set but the victim modality is completely missing in the test set; (c) the victim modality is missing randomly in the train set but completely missing in the test set; (d) modalities are missing with the same probability in the train and test sets.

Algorithm 1: Denoising Training
Input: D_train = {D_train^c, D_train^m}, learning rates η_fit and η_main, parameters of the backbone network θ = {θ_enc, θ_fu, θ_out} and of the alignment dynamics learner ψ = {ψ_d}, batch size n_b, λ.
// Warm-up stage: for each warm-up epoch and each batch B from D_train^c, compute L_main and L_con by Eq. (1)~(5), (16), (17) and update θ.
// Denoising loop over batches from D_train^c (lines 8~14): compute A* by Sinkhorn's algorithm according to Eq. (7)~(11) and L_fit according to Eq. (13), then tune the dynamics learner (lines 9~11); compute L_main and L_con according to Eq. (1)~(5), (16), (17), then tune the backbone network, θ ← θ − η_main∇_θ(L_main + λL_con) (lines 12~13).
// Train on the missing split (lines 15~17): for each batch B from D_train^m, impute the representation sequences of the missing modality Ẑ² by Eq. (14)~(15), compute L_main by Eq. (1)~(5), (16), (17), and update θ ← θ − η_main∇_θ L_main.

Figure 3 :
Figure 3: An example from the CMU-MOSI dataset. The text below the time axis is aligned to the starting time of its pronunciation. The pictures are the central frames of clusters that last the same time interval. The dashed lines connect each word with the frames of its appearance in the video.

Figure 4 :
Figure 4: Performance variation under different window sizes. The optimal sizes for the three pairs are 9, 10 and 10.

Table 2 :
Results on the CMU-MOSI dataset (p = 10%). The reported results are the average of five runs using the same set of hyperparameters and different random seeds. "A → B" means the imputation from the complete modality A to the missing modality B at test time. ♮: results of our model are significantly better than the highest baselines with p-value < 0.05 based on the paired t-test.

Table 3 :
Results on the CMU-MOSEI dataset (p = 10%). Notations share the same meaning as in the last table.

Table 4 :
Results on MELD (p = 50%). Notations share the same meaning as in the last table.

Table 5 :
Results of ablation experiments on CMU-MOSI dataset.
… via mutual dependency maximisation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 231-245, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Jinming Zhao, Ruichen Li, and Qin Jin. 2021. Missing modality imagination network for emotion recognition with uncertain missing modalities. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2608-2618, Online. Association for Computational Linguistics.

Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, and Eli Shechtman. 2017. Toward multimodal image-to-image translation. Advances in Neural Information Processing Systems, 30.

Table 6 :
Statistics of three datasets we use for experiments.

Table 7 :
The hyperparameter search for three datasets

Table 8 :
The average training time of the imputation module (seconds) per batch.