Missing Modality Imagination Network for Emotion Recognition with Uncertain Missing Modalities

Multimodal fusion has been shown to improve emotion recognition performance in previous works. However, in real-world applications we often encounter the missing modality problem, and which modalities will be missing is uncertain, which makes fixed multimodal fusion fail in such cases. In this work, we propose a unified model, the Missing Modality Imagination Network (MMIN), to deal with the uncertain missing modality problem. MMIN learns robust joint multimodal representations, which can predict the representation of any missing modality given the available modalities under different missing-modality conditions. Comprehensive experiments on two benchmark datasets demonstrate that the unified MMIN model significantly improves emotion recognition performance under both uncertain missing-modality testing conditions and the full-modality ideal testing condition. The code will be available at https://github.com/AIM3-RUC/MMIN.


Introduction
Automatic multimodal emotion recognition is very important for natural human-computer interaction (Fragopanagos and Taylor, 2002). It aims to understand and interpret human emotions expressed through multiple modalities such as speech content, voice tone and facial expression. Previous works have shown that these different modalities are complementary for emotion expression, and have proposed many effective multimodal fusion methods to improve emotion recognition performance (Baltrušaitis et al., 2018; Tsai et al., 2019; Zhao et al., 2018). However, in real applications, many common causes can lead to the missing modality problem. For example, the camera is turned off or
(* Equal Contribution  † Corresponding Author)

Figure 1: Illustration of a missing-modality scenario for multimodal emotion recognition systems. In the pictured video segment (a speaker saying "I can't believe it!", with the acoustic and textual modalities available), we encounter the missing visual modality problem because the person's face is obscured by her hands.
blocked due to privacy issues; the speech content is unavailable due to automatic speech recognition errors; the voice and text are missing due to the silence of the user; or the faces cannot be detected due to lighting or occlusion issues as shown in Figure 1. Existing multimodal fusion models trained on full-modality samples usually fail when partial modalities are missing (Aguilar et al., 2019;Pham et al., 2019;Cai et al., 2018;Parthasarathy and Sundaram, 2020).
The missing modality problem has attracted increasing research attention in the past years, and existing solutions are mainly based on learning joint multimodal representations so that information from all modalities can be encoded. Han et al. propose a joint training approach that implicitly fuses multimodal information from auxiliary modalities, which improves monomodal emotion recognition performance. The recent cross-modality sequential translation-based methods proposed in (Pham et al., 2019; Wang et al., 2020) learn joint multimodal representations by translating a source modality into multiple target modalities, which improves the performance of the source modality as input at test time. However, these methods can only deal with the scenario where the source modality is input to the trained model: different models need to be built for different missing-modality cases. Additionally, the sequential translation-based models require translation and generation of videos, audios, and text, which are difficult to train, especially with limited training samples (Pham et al., 2019).
In this work, we propose a novel unified model, the Missing Modality Imagination Network (MMIN), to address the above issues. Specifically, MMIN learns robust joint multimodal representations through cross-modality imagination with a Cascade Residual Autoencoder (CRA) (Tran et al., 2017) and Cycle Consistency Learning (Zhu et al., 2017), based on sentence-level modality-specific representations, as the sentence-level representation is more suitable for modeling cross-modality emotion correlation. The imagination module aims to predict the sentence-level emotional representation of the missing modality from the other available modalities. To the best of our knowledge, this is the first work that investigates a unified model for multimodal emotion recognition with uncertain missing modalities.
Extensive experiments are carried out on two benchmark datasets, IEMOCAP and MSP-IMPROV, under both uncertain missing-modality and full-modality conditions. The proposed MMIN, as a unified multimodal emotion recognition model, learns robust joint multimodal representations and outperforms the standard multimodal fusion models on both benchmark datasets under both the uncertain missing-modality and the full-modality conditions. Furthermore, to evaluate the imagination ability of our MMIN model, we visualize the distributions of the imagined representations of the missing modalities and their ground-truth representations and find that they are very similar, which demonstrates that MMIN can imagine the representations of the missing modalities based on the representations of the available modalities.
In summary, the main contributions of this work are: 1) We propose a unified model, the Missing Modality Imagination Network (MMIN), to improve the robustness of emotion recognition systems under uncertain missing-modality testing conditions. 2) We design cross-modality imagination based on paired multimodal data and adopt the Cascade Residual Autoencoder (CRA) and Cycle Consistency Learning to learn robust joint multimodal representations. 3) Extensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed model, which improves emotion recognition performance under both the uncertain missing-modality and the full-modality conditions.

Related Work
Multimodal Emotion Recognition: Many previous works have focused on fusing multimodal information to improve emotion recognition performance. Temporal attention-based methods use the attention mechanism to selectively fuse different modalities based on frame-level or word-level temporal sequences, such as the Gated Multimodal Unit (GMU) (Aguilar et al., 2019), the Multimodal Alignment Model (MMAN) (Xu et al., 2019) and the Multi-modal Attention mechanism (cLSTM-MMA) (Pan et al., 2020). These methods use different uni-modal sub-networks to model the contextual representations for each modality and then use a multimodal attention mechanism to selectively fuse the representations of the different modalities. Liang et al. (2020) propose a semi-supervised multimodal (SSMM) emotion recognition model that uses cross-modality emotional distribution matching to leverage unlabeled data to learn robust representations, achieving state-of-the-art performance.

Missing Modality Problem: Existing methods for the missing modality problem can mainly be divided into three groups. The first group features the data augmentation approach, which randomly ablates the inputs to mimic missing-modality cases (Parthasarathy and Sundaram, 2020). The second group is based on generative methods that directly predict the missing modalities given the available modalities (Cai et al., 2018; Suo et al., 2019; Du et al., 2018). The third group aims to learn joint multimodal representations that contain related information from all modalities (Aguilar et al., 2019; Pham et al., 2019; Wang et al., 2020).
Data augmentation methods: Parthasarathy and Sundaram (2020) propose a strategy to randomly ablate visual inputs during training at the clip or frame level to mimic real-world missing-modality scenarios for audio-visual multimodal emotion recognition, which improves recognition performance under missing-modality conditions.
Generative methods: Tran et al. (2017) propose the Cascaded Residual Autoencoder (CRA), which utilizes a residual mechanism over the autoencoder structure and can take corrupted data and estimate a function that restores the incomplete data well. Cai et al. (2018) propose an encoder-decoder deep neural network to generate the missing modality (Positron Emission Tomography, PET) given the available modality (Magnetic Resonance Imaging, MRI); the generated PET provides complementary information that improves the detection and tracking of Alzheimer's disease.
Learning joint multimodal representations: Han et al. propose a joint training model that consists of two modality-specific encoders and one shared classifier, which implicitly fuses the audio and visual information into joint representations and improves mono-modality emotion recognition performance. Pham et al. (2019) propose a sequential translation-based model to learn the joint representation between a source modality and multiple target modalities; the hidden vectors of the source-modality encoder serve as the joint representations, which improve the emotion recognition performance of the source modality. Wang et al. (2020) follow this translation-based approach and propose a more efficient transformer-based translation model with parallel translation from textual features to acoustic features and from textual features to visual features. Moreover, both translation-based models adopt a forward-translation and backward-translation training strategy to ensure that the joint representations retain maximal information from all modalities.

Method
Given a set of video segments S, we use x = (x_a, x_v, x_t) to represent the raw multimodal features of a video segment s ∈ S, where x_a, x_v and x_t represent the raw features of the acoustic, visual and textual modalities respectively, and |S| is the number of video segments in S. We denote the target set as Y, where y_i is the target emotion category of video segment s_i and |C| is the number of emotion categories. Our method aims to recognize the emotion category y_i for every video segment s_i with full modalities or with only partial modalities available; for the example shown in Figure 1, only the acoustic and textual modalities are present when the visual modality is missing.

Missing Modality Imagination Network
In order to learn robust joint multimodal representations, we propose a unified model, the Missing Modality Imagination Network (MMIN), which can deal with different uncertain missing-modality conditions in real application scenarios. Figure 2 illustrates the framework of our proposed MMIN model, which contains three main modules: 1) a Modality Encoder Network for extracting modality-specific embeddings; 2) an Imagination Module based on the Cascade Residual Autoencoder (CRA) and Cycle Consistency Learning for imagining the representations of missing modalities given the representations of the corresponding available modalities, where the latent vectors of the autoencoders in the CRA are collected to form the joint multimodal representations; and 3) an Emotion Classifier for predicting the emotion category based on the joint multimodal representations. We introduce each module in detail in the following subsections.

Modality Encoder Network
The Modality Encoder Network extracts the modality-specific utterance-level embeddings from the raw modality features x. As shown in Figure 2(b), we first pretrain the Modality Encoder Network within a multimodal emotion recognition model, and it is further trained within the MMIN model. We define the modality-specific embeddings of each modality as h_a = EncA(x_a), h_v = EncV(x_v) and h_t = EncT(x_t), where EncA, EncV and EncT denote the acoustic, visual and textual encoders respectively, and h_a, h_v and h_t are the modality-specific embeddings generated by the corresponding encoders.

Figure 2: (a) The proposed MMIN framework (see Table 1 for the missing-modality conditions). (b) Modality Encoder Network: it is pretrained on the multimodal emotion recognition task with full-modality data and then updated during MMIN training (the orange-colored block in MMIN); the pretrained modality encoder network (the gray-colored block in MMIN) has the same structure but is fixed during training. (c) Missing Modality Imagination Network (MMIN) at the inference stage (taking the visual-modality-missing condition as an example); MMIN can perform inference under different missing-modality conditions.

Missing Modality Condition Creation
Given a training sample with all three modalities (x_a, x_v, x_t), there are 6 different possible missing-modality conditions, as shown in Table 1. We can build a cross-modality pair (available, missing) under each missing-modality condition, where available and missing denote the available modalities and the corresponding missing modalities respectively. In order to obtain a unified model that can handle the various missing-modality conditions, we enforce a unified triplet input format (x_a, x_v, x_t) for the modality encoder network. Under the missing-modality conditions, the raw features of the corresponding missing modalities are replaced by zero vectors. For example, the unified-format input of the available modalities under the visual-modality-missing condition (case 1 in Table 1) is (x_a, x_v^miss, x_t), where x_v^miss is a zero vector. Under the missing-modality training conditions, the input includes the cross-modality pairs, with both the available and the missing modalities in the unified triplet format (as shown in Table 1). The multimodal embeddings of such a cross-modality pair can be represented as (taking the visual-modality-missing condition as an example):

h = concat(h_a, h_v^miss, h_t),  ĥ = concat(h_a^miss, h_v, h_t^miss)    (1)

where h_a^miss, h_v^miss and h_t^miss represent the modality-specific embeddings when the corresponding modality is missing, produced by the corresponding modality encoder with zero-vector input.
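The condition-creation step above can be sketched as follows (a minimal numpy sketch; the feature dimensions and the helper names make_unified_input and make_cross_modality_pairs are illustrative assumptions, not taken from the paper's released code):

```python
import numpy as np

# Illustrative raw-feature dimensions for the three modalities.
DIMS = {"a": 130, "v": 342, "t": 1024}

# The six missing-modality conditions (Table 1), written as the set of
# AVAILABLE modalities; everything else in the triplet is zeroed out.
CONDITIONS = [("a",), ("v",), ("t",), ("a", "v"), ("a", "t"), ("v", "t")]

def make_unified_input(x, available):
    """Return the unified triplet (x_a, x_v, x_t) with the features of
    missing modalities replaced by zero vectors of the same shape."""
    return {m: (x[m] if m in available else np.zeros_like(x[m])) for m in "avt"}

def make_cross_modality_pairs(x):
    """For one full-modality sample, build the six (available, missing) pairs."""
    pairs = []
    for available in CONDITIONS:
        missing = tuple(m for m in "avt" if m not in available)
        pairs.append((make_unified_input(x, available),
                      make_unified_input(x, missing)))
    return pairs
```

Since every full-modality sample yields six pairs, the constructed missing-modality training set is six times the size of the original one, matching the description in the Dataset section.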

Imagination Module
We propose an autoencoder-based Imagination Module to predict the multimodal embeddings of the missing modalities given the multimodal embeddings of the available modalities. The Imagination Module is expected to learn robust joint multimodal representations through cross-modality imagination. As illustrated in Figure 2(a), we employ the Cascade Residual Autoencoder (CRA) (Tran et al., 2017) structure, which has sufficient learning capacity and more stable convergence than the standard autoencoder. The CRA structure is constructed by connecting a series of Residual Autoencoders (RAs). We further employ Cycle Consistency Learning (Zhu et al., 2017; Wang et al., 2020) with a coupled architecture of two independent networks that perform imagination in two directions: the forward (available → missing) and backward (missing → available) imagination directions.
To be specific, we use a CRA model with B RAs, each represented by φ_k, k = 1, 2, ..., B. The calculation of each RA can be defined as:

Δz_k = φ_k(z_{k-1}),  z_k = z_{k-1} + Δz_k,  z_0 = h    (2)

where h is the extracted multimodal embedding based on the available modalities in the unified cross-modality pair format (Eq. (1)) and Δz_k represents the output of the k-th RA. Taking the visual-modality-missing condition as an example (as shown in Figure 2(a)), the forward imagination aims to predict the multimodal embedding of the missing visual modality based on the available acoustic and textual modalities. The forward imagined multimodal embedding is expressed as:

h' = imagine_forward(h)    (3)

where imagine(·) represents the function of the Imagination Module. The backward imagination aims to predict the multimodal embedding of the available modalities based on the forward imagined multimodal embedding h' (Eq. (3)). The backward imagined multimodal embedding is expressed as:

h'' = imagine_backward(h')    (4)
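The residual cascade of Eq. (2) can be sketched as below (untrained, randomly initialized weights; the class names and the tanh/linear layer choices are illustrative assumptions, not the paper's exact architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

class ResidualAE:
    """One residual autoencoder phi_k: encodes to a latent c_k, decodes back,
    and adds the decoded output (Delta z_k) to its input."""
    def __init__(self, dim, latent_dim):
        self.W_enc = rng.normal(0, 0.1, (dim, latent_dim))
        self.W_dec = rng.normal(0, 0.1, (latent_dim, dim))

    def __call__(self, z):
        c = np.tanh(z @ self.W_enc)       # latent vector c_k
        delta = c @ self.W_dec            # Delta z_k = phi_k(z_{k-1})
        return z + delta, c               # z_k = z_{k-1} + Delta z_k

class CRA:
    """Cascade of B residual autoencoders with z_0 = h."""
    def __init__(self, dim, latent_dim, B):
        self.blocks = [ResidualAE(dim, latent_dim) for _ in range(B)]

    def __call__(self, h):
        z, latents = h, []
        for block in self.blocks:
            z, c = block(z)
            latents.append(c)
        return z, latents   # final output z_B plus the latent vectors c_1..c_B
```

The forward and backward imagination networks would each be one such CRA; the returned latents are what the classifier below consumes.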

Classifier
We collect the latent vectors of each autoencoder in the forward imagination module and concatenate them to form the joint multimodal representation R = concat(c_1, c_2, ..., c_B), where c_k is the latent vector of the autoencoder in the k-th RA. Based on the joint multimodal representation R, we calculate the probability distribution q as:

q = f_cls(R)    (5)

where f_cls(·) denotes the emotion classifier, which consists of several fully-connected layers.
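Continuing the sketch, Eq. (5) can be illustrated as follows (a single softmax layer stands in for the paper's several fully-connected layers; W_cls is an assumed, randomly initialized weight matrix):

```python
import numpy as np

def classify(latents, W_cls):
    """Form the joint representation R = concat(c_1, ..., c_B) from the
    CRA latent vectors and apply an illustrative linear classifier + softmax."""
    R = np.concatenate(latents, axis=-1)
    logits = R @ W_cls
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)                 # distribution q
```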

Joint Optimization
The loss function for MMIN training includes three parts: the emotion recognition loss L_cls, the forward imagination loss L_forward, and the backward imagination loss L_backward:

L_cls = H(p, q),  L_forward = Σ_i ||h'_i − ĥ_i||_2,  L_backward = Σ_i ||h''_i − h_i||_2    (6)

where p is the true one-hot label distribution and q is the predicted distribution calculated in Eq. (5). H(p, q) is the cross-entropy between distributions p and q. h_i and ĥ_i are the ground-truth representations extracted by the modality encoder network as shown in Eq. (1). We combine the three losses into the joint objective function below to jointly optimize the model parameters:

L = L_cls + λ_1 L_forward + λ_2 L_backward    (7)

where λ_1 and λ_2 are weighting hyperparameters for L_forward and L_backward respectively.
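The joint objective can be sketched as below (mean squared error is used here as the imagination distance and cross-entropy for classification; the exact distance and the default λ values are assumptions for illustration):

```python
import numpy as np

def joint_loss(q, y, h_fwd, h_gt_missing, h_bwd, h_gt_available,
               lam1=1.0, lam2=1.0):
    """L = L_cls + lam1 * L_forward + lam2 * L_backward."""
    # L_cls: cross-entropy H(p, q) with integer class labels y (one-hot p)
    l_cls = -np.log(q[np.arange(len(y)), y] + 1e-12).mean()
    # L_forward: forward imagined embedding vs. ground-truth missing embedding
    l_fwd = np.mean((h_fwd - h_gt_missing) ** 2)
    # L_backward: backward imagined embedding vs. available embedding
    l_bwd = np.mean((h_bwd - h_gt_available) ** 2)
    return l_cls + lam1 * l_fwd + lam2 * l_bwd
```

With perfect imagination both reconstruction terms vanish and only the classification loss remains, which is the intended division of labor between the three parts.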

Dataset
We evaluate our proposed model on two benchmark multimodal emotion recognition datasets, Interactive Emotional Dyadic Motion Capture (IEMOCAP) (Busso et al., 2008) and MSP-IMPROV (Busso et al., 2016). The statistics of the two datasets are shown in Table 2. IEMOCAP contains videos recorded in 5 dyadic conversation sessions. Each session contains multiple scripted plays and spontaneous dialogues between a male and a female speaker, with 10 speakers in total in the database. We follow the emotional label processing in (Xu et al., 2019; Liang et al., 2020) to form the four-class emotion recognition setup.
MSP-IMPROV contains video segments recorded in dyadic conversation scenarios with 12 actors. We first remove videos shorter than 1 second. We then select the videos in the "Other-improvised" group, which were recorded during improvisation scenarios, with happy, anger, sadness, or neutral labels to form the four-class emotion recognition setup.

Missing-Modality Training Set
We first define the original training set, which contains all three modalities, as the full-modality training set. Based on it, we construct another training set that contains cross-modality pairs to simulate the possible missing-modality conditions; we define it as the missing-modality training set and use it to train the proposed MMIN. Six different cross-modality pairs (Table 1) are generated for each training sample, so the number of generated cross-modality pairs is six times the number of full-modality training samples.

Missing-Modality Testing Set
We first define the original testing set, which contains all three modalities, as the full-modality testing set. To evaluate the performance of the proposed MMIN under the uncertain missing-modality conditions, we construct six different missing-modality testing subsets corresponding to the six possible missing-modality conditions. For example, at the inference stage under the missing-visual-modality condition, as shown in Figure 2(c), the raw feature of a missing-modality testing sample in the unified format is (x_a, x_v^miss, x_t). We combine all six missing-modality testing subsets together and denote the result as the missing-modality testing set.

Raw Feature Extraction
We follow the feature extraction methods described in (Liang et al., 2020; Pan et al., 2020) and extract frame-level raw features for each modality. (To facilitate fair comparison with the sequential translation-based missing-modality method MCTN, we adopt frame-level features, which can be directly used in the MCTN method.)
Acoustic features: The OpenSMILE toolkit (Eyben et al., 2010) with the "IS13 ComParE" configuration is used to extract frame-level features, which perform comparably to the IS10 utterance-level acoustic features used in (Liang et al., 2020). We denote these features as "ComParE"; the feature vectors have 130 dimensions.
Visual features: We extract facial expression features using a DenseNet (Huang et al., 2017) pretrained on the Facial Expression Recognition Plus (FER+) corpus (Barsoum et al., 2016). We denote these features as "Denseface"; they are frame-level sequential features based on the faces detected in the video frames, and the feature vectors have 342 dimensions.
Textual features: We extract contextual word embeddings using a pretrained BERT-large model (Devlin et al., 2019), one of the state-of-the-art language representations. We denote these embeddings as "Bert"; the features have 1024 dimensions.

Higher-level Feature Encoder
To generate more effective sentence-level modality-specific representations for the Imagination Module, we design different modality encoders for the different modalities.
Acoustic Modality Encoder (EncA): We apply a Long Short-Term Memory (LSTM) network (Sak et al., 2014) to capture temporal information from the sequential frame-level raw acoustic features x_a, and then use max-pooling over the LSTM hidden states to obtain the utterance-level acoustic embedding h_a.
Visual Modality Encoder (EncV): We adopt a method similar to EncA on the sequential frame-level facial expression features x_v to obtain the utterance-level visual embedding h_v.
Textual Modality Encoder (EncT): We apply a TextCNN (Kim, 2014) to the sequential word-level features x_t to obtain the utterance-level textual embedding h_t.
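The two encoder families can be sketched as below (a plain tanh RNN cell stands in for the LSTM, and all weights are random, untrained assumptions; only the shapes of the computation mirror the description above):

```python
import numpy as np

def rnn_maxpool_encoder(frames, W_in, W_rec):
    """EncA/EncV sketch: a recurrent cell over frame-level features,
    followed by max-pooling over the hidden states."""
    h = np.zeros(W_rec.shape[0])
    states = []
    for x_t in frames:                     # frames: (T, feat_dim)
        h = np.tanh(x_t @ W_in + h @ W_rec)
        states.append(h)
    return np.max(states, axis=0)          # utterance-level embedding

def textcnn_encoder(words, filters, widths):
    """EncT sketch: 1-D convolutions of several widths over word embeddings,
    each followed by max-over-time pooling, then concatenation."""
    outs = []
    for W, k in zip(filters, widths):      # W: (k * emb_dim, n_filters)
        convs = [np.tanh(words[i:i + k].reshape(-1) @ W)
                 for i in range(len(words) - k + 1)]
        outs.append(np.max(convs, axis=0))
    return np.concatenate(outs)
```

Both encoders collapse a variable-length frame or word sequence into a fixed-size utterance-level vector, which is what the Imagination Module consumes.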

Recognition Baselines
Our baseline model takes the structure shown in Figure 2(b); it is trained on the full-modality training set, and we use it as our full-modality baseline. To improve the system's robustness against the missing modality problem, one intuitive solution is to add samples under the missing-modality conditions to the training set. We therefore pool the missing-modality training set and the full-modality training set together to train the baseline model, and use it as our augmented baseline.

Implementation Details
Table 3 presents our implementation details. We use 10-fold and 12-fold speaker-independent cross-validation to evaluate the models on IEMOCAP and MSP-IMPROV respectively. For the experiments on IEMOCAP, we take four sessions for training, and the remaining session is split by speaker into validation and testing sets. For MSP-IMPROV, we take the utterances of 10 speakers for training, and the remaining 2 speakers are split by speaker into validation and testing sets. We train each model for at most 100 epochs. We select the best model on the validation set and report its performance on the testing set. To demonstrate the robustness of our models, we run each model three times to alleviate the influence of random parameter initialization and apply a significance test for model comparison. All models are implemented with the PyTorch deep learning toolkit and run on a single Nvidia GTX 1080Ti graphics card.
For the experiments on IEMOCAP, we use two evaluation metrics: weighted accuracy (WA) and unweighted accuracy (UA). Due to the imbalance of emotion categories on MSP-IMPROV, we use the f-score as the evaluation metric.

Full-modality Baseline Results
We first compare our full-modality baseline with several state-of-the-art multimodal recognition models under the full-modality condition. Results in Table 4 show that our full-modality baseline outperforms other state-of-the-art models, which proves that our modality encoder network can extract effective representations for multimodal emotion recognition.

Uncertain Missing-Modality Results
Table 5 presents the experimental results of our proposed MMIN model under different missing-modality testing conditions and the full-modality testing condition. On IEMOCAP, compared to the "full-modality baseline" results in Table 4, we see a significant performance drop under uncertain missing-modality testing conditions, which indicates that the model trained under the full-modality condition is very sensitive to the missing modality problem. The intuitive solution, the "augmented baseline", which combines the missing-modality training set with the full-modality training set to train the baseline model, does significantly improve over the full-modality baseline under missing-modality testing conditions, which indicates that data augmentation can help alleviate the data mismatch between training and testing. More notably, our proposed MMIN significantly outperforms both the full-modality baseline and the augmented baseline under every possible missing-modality testing condition. It also outperforms the two baselines under the full-modality testing condition, even though the MMIN model does not use the full-modality training data. These results indicate that our proposed MMIN model can learn robust joint multimodal representations and thus achieve consistently better performance under both the different missing-modality and the full-modality testing conditions. This is because our proposed MMIN method not only has the data augmentation capability, but also learns better joint representations, which preserve information from the other modalities.
We further analyze the performance under different missing-modality conditions. Our MMIN model achieves significant improvement over the augmented baseline under the single-available-modality conditions ({a}, {v}, or {t}), especially for the weak modalities {a} and {v}. It also brings some improvements over the augmented baseline even for strong modality combinations such as {a, t}. These experimental results indicate that the joint representation learned via MMIN does capture complementary information from the other modalities to compensate for the weak modalities. The bottom block in Table 5 shows the performance comparison on the MSP-IMPROV dataset. Our proposed MMIN model again significantly outperforms the two baselines under different missing-modality and full-modality testing conditions, which demonstrates the good generalization ability of MMIN across datasets.

We also compare to the MCTN model (Pham et al., 2019), which is the state-of-the-art model for the missing modality problem. As MCTN can-

Ablation Study
We conduct experiments to ablate the contributions of different components in MMIN, including the structure of the imagination module and cycle consistency learning.
Structure of the imagination module: We first investigate the impact of different network structures in the imagination module on performance. Specifically, we compare the Autoencoder and the CRA structure in MMIN, adopting the same parameter scale to ensure a fair comparison. As shown in Table 6, the imagination module with the Autoencoder structure, "MMIN-AE", performs worse than the one with the CRA structure under both the different missing-modality and the full-modality testing conditions. This comparison indicates that the CRA has a stronger imagination ability than the Autoencoder model.
Cycle Consistency Learning: To evaluate the impact of cycle consistency learning in MMIN, we conduct experiments using MMIN with and without it. As shown in Table 6, the model trained without cycle consistency learning suffers a performance loss under all conditions, which indicates that cycle consistency learning enhances the imagination ability and yields more robust joint multimodal representations.

Analysis of MMIN Core Competence
We conduct detailed experiments on IEMOCAP to demonstrate the joint representation learning ability and the imagination ability of our MMIN model.
Joint representation learning ability: Since the joint representation is expected to retain information from multiple modalities, we conduct experiments to evaluate the joint representation learning ability of MMIN. We compare MMIN to the baseline model under the matched-modality condition, in which the training data and the test data contain the same modalities. As shown in Table 7, compared to the baseline model, MMIN achieves performance on par with or even better than it, which demonstrates that MMIN can learn effective joint multimodal representations. We also notice that the data-augmented model cannot beat the corresponding matched partial-modality baseline model, which indicates that the data-augmented model cannot learn the joint representation.
Table 7: Evaluation (UA) of the joint representation learning ability on IEMOCAP. "Baseline" denotes the results of models individually trained with cross-entropy loss on partial-modality samples. "Augmented" and "MMIN" denote the evaluation results of our unified data-augmented baseline model and MMIN model under different test conditions, which are the same as in Table 5.
Imagination ability: Figure 3 visualizes the distributions of the ground-truth multimodal embeddings (ĥ in Figure 2) and the MMIN imagined multimodal embeddings (h' in Figure 2) for a male speaker and a female speaker using t-SNE (Maaten and Hinton, 2008).
Figure 3: Visualization of the ground-truth and imagined multimodal embeddings for (a) a female speaker and (b) a male speaker. For example, "a" denotes the ground-truth multimodal embeddings of the acoustic modality, and "a imagined" denotes the MMIN-imagined multimodal embeddings of the acoustic modality based on the visual and textual modalities.
We observe that the distributions of the ground-truth embeddings and the imagined embeddings are very similar; although the distribution of the visual modality embeddings deviates slightly, this is mainly because the quality of the visual modality is poor in this dataset. This demonstrates that MMIN can imagine the representations of the missing modalities based on the available modalities.

Conclusion
In this paper, we propose a novel unified multimodal emotion recognition model, the Missing Modality Imagination Network (MMIN), to improve emotion recognition performance under uncertain missing-modality conditions in real application scenarios. The proposed MMIN learns robust joint multimodal representations through cross-modality imagination via the Cascade Residual Autoencoder and Cycle Consistency Learning. Extensive experiments on two public benchmark datasets demonstrate the effectiveness and robustness of our proposed model, which significantly outperforms the baselines under both uncertain missing-modality and full-modality conditions. In future work, we will explore ways to further improve the robust joint multimodal representations.