Unimodal and Crossmodal Refinement Network for Multimodal Sequence Fusion

Effective unimodal representation and complementary crossmodal representation fusion are both important in multimodal representation learning. Prior works often modulate one modality's features directly with another's, underutilizing both unimodal and crossmodal representation refinement and thereby incurring a performance bottleneck. In this paper, the Unimodal and Crossmodal Refinement Network (UCRN) is proposed to enhance both unimodal and crossmodal representations. Specifically, to improve unimodal representations, a unimodal refinement module is designed to refine modality-specific learning by iteratively updating the distribution with transformer-based attention layers. Self-quality improvement layers then progressively generate the desired weighted representations. Subsequently, those unimodal representations are projected into a common latent space, regularized by a multimodal Jensen-Shannon divergence loss for better crossmodal refinement. Lastly, a crossmodal refinement module is employed to integrate all information. By hierarchically exploring unimodal, bimodal, and trimodal interactions, UCRN is highly robust against missing modalities and noisy data. Experimental results on the MOSI and MOSEI datasets show that the proposed UCRN outperforms recent state-of-the-art techniques, and its robustness is highly desirable in real multimodal sequence fusion scenarios. Code will be shared publicly.

With the development of various network architectures, considerable progress has been achieved in multimodal fusion (Williams et al., 2018; Tsai et al., 2019; Mai et al., 2020; Zadeh et al., 2019), showing that multimodal representations outperform unimodal ones on emotion and sentiment prediction tasks. Additionally, recent works seek to improve the efficacy of multimodal fusion by assuming that one modality can be translated to another (Tsai et al., 2019; Mai et al., 2020) or modulated by a pivot modality (Delbrouck et al., 2020), so as to align pairwise representations in a common space.
However, converting one modality to another appears inadequate for projecting all modalities into one feature space. A specific pattern may not coexist in all modalities (e.g., an utterance may express happiness in language/audio while showing a neutral facial expression). Also, the conversion is usually followed by a downstream fusion that produces the final fused result; the downstream fusion may already encompass the converted information as a byproduct, making conversion redundant for multimodal fusion tasks.
It has been frequently reported that lexical representations are stronger predictors than audio and vision representations. Hence, (Delbrouck et al., 2020) leverages language features to modulate the others, making language the main contributor to the multimodal fused representation. On the other hand, through this process the weak predictors are prone to undermining the modality-common representations. From this view, the strategy of direct modality modulation or pairwise translation may jeopardize learning rich fused representations and lead to suboptimal results. Moreover, most of these architectures require all modalities as input. As a result, the learned representations may perform poorly in the real world, where complete modalities might not always be simultaneously available (e.g., a specific modality is missing or noisy). This may be due to over-fusing or losing sight of the importance of unimodal refinement.
Although the presence of multiple modalities provides additional information, there are two key challenges to be addressed when learning from multimodal data: 1) models must learn the complex intramodal and crossmodal interactions for predictions, and 2) models must be robust to unexpected missing or noisy modalities during testing.
To address the aforementioned problems, the Unimodal and Crossmodal Refinement Network (UCRN) is proposed, which takes both robust unimodal representation and efficient crossmodal representation into consideration. Our hypotheses are: 1) a robust unimodal representation is essential for efficient multimodal fusion; 2) it is beneficial to reduce modality gaps before modality fusion; and 3) stacks of attention-based mechanisms can efficiently select the most salient features within a representation, favoring robust representation learning.
Guided by the above hypotheses, the proposed UCRN consists of the following sub-modules, taking basic modality sequence representations as input: 1) a unimodal refinement module is proposed to yield robust modality-specific representations; 2) the robust unimodal representations are projected to latent spaces with the aim of reducing representation gaps; and 3) the latent representations are concatenated to correlate modality-common features and produce a robust multimodal representation. The final predicted sequence label is the one with the highest probability in the network output.
To recap, the contributions of this paper can be summarized as follows:
• The Unimodal and Crossmodal Refinement Network (UCRN) is proposed to perform robust and efficient multimodal representation learning;
• Unimodal representation refinement is proposed to improve crossmodal fusion;
• Crossmodal refinement is explored to reduce modality gaps by progressively refining and fusing the flexibly concatenated unimodal representations;
• Experiments are conducted on widely studied multimodal datasets. The results demonstrate that, compared with recent state-of-the-art (SOTA) works, UCRN shows competitive performance and strong robustness against noisy and missing unimodal inputs.
2 Related Work

Multimodal Sequence Learning
Multimodal sequence learning extracts the inter- and intra-dependencies of multimodal data and uses complementary information to improve model performance. Many methods captured the sequential information by taking advantage of LSTMs (Hochreiter and Schmidhuber, 1997). To extract complementary information and to perform multimodal sequence fusion, many methods have investigated fusion strategies. Early fusion performed fusion at the input level by simply concatenating multimodal features (Morency et al., 2011; Pérez-Rosas et al., 2013; Poria et al., 2016), which did not model the intra-modality dynamics efficiently, because unimodal representations can be complicated and are not easy to learn within the whole model, and it also posed a risk of overfitting. Late fusion approaches trained unimodal classifiers individually and performed fusion by voting (Wang et al., 2016). Although this made intra-dynamics modeling more effective, simply applying a weighted average might not produce the best fusion results.

Translation-based Method
To better model the interactions among multimodal sequences, translation-based methods (Tsai et al., 2019; Pham et al., 2019; Mai et al., 2020) assumed that the representation of one modality can be converted to another, thus minimizing the gap between unimodal representations. For example, MulT (Tsai et al., 2019) proposed a multimodal transformer architecture that translated any two modalities to the remaining one and then combined the translated features for the final fusion; MCTN (Pham et al., 2019) leveraged an encoder-decoder structure to convert one modality to another, with a cyclic consistency loss to produce better modality translation results. AGFN (Mai et al., 2020) was proposed to learn a common embedding space by translating a modality to a target one, adopting adversarial training and a graph-based fusion mechanism for prediction. CIA (Chauhan et al., 2019) implemented translation-based fusion with contextual attention modeling, where crossmodal auto-encoding was utilized to extract features. Translation-based methods directly convert one modality representation to another, which relies on substantial reference information from pairwise modalities and limits their ability to handle missing modalities.

Transformer and Self-attention
Transformer (Vaswani et al., 2017) is an effective and strong network for sequence modeling. Different from recurrent modeling, its attention mechanism yields superior training efficiency and performance on many tasks (Yu et al., 2019; Naseem et al., 2020). It transforms one sequence into another with an encoder-decoder structure, where the attention mechanism weighs the input sequence to decide which part is important at each step. Using Transformer encoding, TBJE (Delbrouck et al., 2020) proposed monomodal and multimodal variants; yet they were not in a unified architecture, leading to a model-selection problem. Besides, the performance degraded when the added visual information was modulated by language features, which was unsatisfactory in terms of multimodal fusion.
Different from the attention mechanism in Transformer, self-attention (Hu et al., 2018) is another technique that has been widely used to extract contextual and correlated information within features.
Attention-based mechanisms show promising results in modeling sequences; thus, they are also adopted in the proposed UCRN. However, instead of using modality translation or separate models, this paper highlights unimodal representation refinement and crossmodal representation refinement by regularizing the multi-modality inputs into a common space.

Unimodal and Crossmodal Refinement Network (UCRN)
In this paper, the Unimodal and Crossmodal Refinement Network (UCRN) is proposed. As shown in Figure 1, UCRN is comprised of three main parts: the first part conducts unimodal representation refinement, the second part refines all the previous information for fusion and learns a modality-common representation, and the last part performs prediction. The Unimodal Refinement Module (URM) takes unimodal features (i.e., language, audio, and vision features) as input to learn refined unimodal representations. The refined unimodal representations are then mapped to a common latent space by imposing a Multimodal Jensen-Shannon (MJS) divergence regularizer. Following this, the Self-Quality Improvement Layers (SQIL) are used to further extract the desired weighted unimodal representations for fusion. Lastly, the Crossmodal Refinement Module (CRM) integrates all information and extracts multimodal interactions.

Problem Definition
Suppose we have the $i$-th input feature $X^i = \{x_m^i \in \mathbb{R}^{d_m \times t_m};\ m \in \{l, a, v\}\}$, where $l$, $a$, $v$ represent language, audio, and vision, respectively, and $d_m$ and $t_m$ denote the dimensions of the modal feature and the time sequence, respectively. Let $K$ be the batch size. The goal of multimodal sequence fusion is to determine a deep fusion network $F(X^i)$ such that the output $\hat{y}^i$ approximates the target $y^i$. This can be achieved by minimizing the loss
$$\min_{F} \frac{1}{K} \sum_{i=1}^{K} \mathcal{L}\big(F(X^i), y^i\big). \tag{1}$$

Unimodal Refinement
Unimodal Refinement Module (URM) is designed to reinforce modality-specific learning, since high-quality unimodal representations benefit multimodal fusion. Built on a transformer architecture, URM takes a single modal feature as input and learns a robust, refined unimodal representation. As shown in Fig. 1, $U_m$ denotes the URM for modality $m$. $U_m$ is trained to learn the unimodal representation with a multi-layer transformer-based network. Therefore, we have
$$x_m^U = U_m(x_m; \theta_m),$$
where $x_m^U$ represents the refined unimodal representation for modality $m$, and $\theta_m$ stands for the parameters of $U_m$. Before each unimodal feature is fed into the URM, it is first sent to a projection layer that converts it to a specific dimension, which simplifies the subsequent unified operations. The projected unimodal representation is then passed to a multi-layer multi-head transformer; five layers and three heads are used. Let $H_k$, $k \in \{1, 2, 3\}$, denote the head in each layer and $x_m \in \mathbb{R}^{d_m \times t_m}$ be the projected feature. As presented in Fig. 1 (b), $d_m$ is evenly split into $d_{h_k}$, the feature dimension of head $k$. The operations of the multi-head transformer are described by the following equations:
$$Q_k = F_{fc_1}(x_{m,h_k}), \quad K_k = F_{fc_2}(x_{m,h_k}), \quad V_k = F_{fc_3}(x_{m,h_k}),$$
$$H_k(x_{m,h_k}) = \mathrm{softmax}\!\left(\frac{Q_k K_k^{\top}}{\sqrt{d_{h_k}}}\right) V_k,$$
$$x_m^U = \mathrm{LN}\!\left(\bigoplus_{k=1}^{3} H_k(x_{m,h_k})\right),$$
where $F_{fc_i}(\cdot)$, $i \in \{1, 2, 3\}$, is a fully connected layer and $\mathrm{LN}(\cdot)$ denotes a layer normalization. $x_{m,h_k} \in \mathbb{R}^{d_{h_k} \times t_m}$ is the input and $H_k(x_{m,h_k})$ is the output of head $k$, and $\oplus$ is the concatenation operation over the outputs of all heads. URM learns the sequence representation by extracting the last time step along the time dimension as the key step, so the time dimension collapses to one. Overall, URM maps each modality sequence $x_m$ to a refined, time-collapsed representation $x_m^U = U_m(x_m; \theta_m)$.
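To make the data flow concrete, a minimal PyTorch sketch of a URM-style block is given below; the use of `nn.TransformerEncoder`, the hidden size, and the class name are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class UnimodalRefinement(nn.Module):
    """Sketch of a URM-style block: project a modality sequence to a fixed
    dimension, refine it with a multi-layer multi-head transformer encoder,
    and keep the last time step as the sequence-level representation."""

    def __init__(self, in_dim, d_model=48, n_heads=3, n_layers=5):
        super().__init__()
        # Projection layer: maps the raw modality dimension to d_model.
        self.proj = nn.Linear(in_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

    def forward(self, x):
        # x: (batch, t_m, in_dim) -- a single modality sequence.
        h = self.proj(x)       # (batch, t_m, d_model)
        h = self.encoder(h)    # refined sequence representation
        return h[:, -1, :]     # collapse time: keep the last time step

# Example: a 20-step audio sequence with 74-dim features (sizes are illustrative).
x_a = torch.randn(8, 20, 74)
urm_a = UnimodalRefinement(in_dim=74)
x_u_a = urm_a(x_a)             # (8, 48) refined unimodal representation
```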

Multimodal Jensen-Shannon Divergence Regularizer
Due to the heterogeneity across modalities, the fused multimodal representation follows an unknown yet complex distribution. To further enhance crossmodal refinement, we propose to regularize the distribution by explicitly adding a regularizer. The Kullback-Leibler divergence $D_{KL}$ can measure distribution differences; however, since the commutative consistency of a pair of modalities should be kept in our framework, the multimodal Jensen-Shannon divergence $D_M$ is employed, which is defined as
$$D_M\big(p(\alpha), p(\beta)\big) = \frac{1}{2} D_{KL}\!\left(p(\alpha)\,\middle\|\,\frac{p(\alpha)+p(\beta)}{2}\right) + \frac{1}{2} D_{KL}\!\left(p(\beta)\,\middle\|\,\frac{p(\alpha)+p(\beta)}{2}\right),$$
where $(\alpha, \beta) \in \{(l, a), (a, v), (v, l)\}$, and $p(\alpha)$ and $p(\beta)$ represent the probability distributions of the learned features over $n$ classes: $p(\alpha) = \{p_1(\alpha), p_2(\alpha), \ldots, p_n(\alpha)\}$ and $p(\beta) = \{p_1(\beta), p_2(\beta), \ldots, p_n(\beta)\}$. $D_M$ serves as a regularizer on $x_m^U$ and helps optimize the whole framework. To learn a common representation for fusion, the objective function $\mathcal{L}_M$ regularizing the probability distributions of all modalities is defined as
$$\mathcal{L}_M = \sum_{(\alpha, \beta)} D_M\big(p(\alpha), p(\beta)\big).$$
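For illustration, the pairwise JS regularizer over the three modality distributions can be computed as in the following sketch; the softmax over per-modality logits and the function names are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two categorical distributions of
    shape (batch, n_classes); symmetric, unlike plain KL."""
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps) / (m + eps)).log()).sum(dim=-1)
    kl_qm = (q * ((q + eps) / (m + eps)).log()).sum(dim=-1)
    return 0.5 * (kl_pm + kl_qm)

def mjs_loss(logits_l, logits_a, logits_v):
    """Multimodal JS regularizer: sum of JS divergences over the
    modality pairs (l, a), (a, v), (v, l)."""
    p_l = F.softmax(logits_l, dim=-1)
    p_a = F.softmax(logits_a, dim=-1)
    p_v = F.softmax(logits_v, dim=-1)
    pairs = [(p_l, p_a), (p_a, p_v), (p_v, p_l)]
    return sum(js_divergence(p, q).mean() for p, q in pairs)
```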

Self-Quality Improvement Layers
Self-Quality Improvement Layers (SQIL) are added to further produce the desired unimodal representations for fusion. SQIL is a stack of simple self-attention layers that learns the weighted unimodal representations for fusion.
$$x_m^S = W_S \otimes x_m^U, \quad W_S = F_S(x_m^U),$$
where $F_S$ represents linear transformations and nonlinear activations (ReLU), $W_S$ is the self-attention weight matrix, $\otimes$ denotes element-wise weighting, and $x_m^U$ is the output obtained from the URM, as shown in Fig. 1 (a).
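Because SQIL reweights the refined unimodal representation with a learned self-attention weight vector (global average pooling followed by a fully connected layer, per the implementation details), a squeeze-and-excitation-style sketch is shown below; the reduction ratio and gating nonlinearity are assumptions.

```python
import torch
import torch.nn as nn

class SelfQualityImprovement(nn.Module):
    """Sketch of an SQIL-style layer: learn a per-feature weight vector from
    the refined unimodal representation and use it to reweight that
    representation, in the spirit of squeeze-and-excitation (Hu et al., 2018)."""

    def __init__(self, dim, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid())

    def forward(self, x_u):
        # x_u: (batch, dim) refined unimodal representation from URM.
        w = self.gate(x_u)   # self-attention weights W_S
        return w * x_u       # weighted unimodal representation x_m^S
```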

Crossmodal Refinement
Crossmodal Refinement Module (CRM) aims to learn effective crossmodal representations by integrating all refined unimodal representations. CRM (in Fig. 1 (a)) is also built on a multi-head transformer, which takes the concatenation of the weighted unimodal representations as input. CRM becomes a bimodal or unimodal fusion module if one or two modalities are missing. It adaptively captures the dynamics of multimodal interactions and extracts the key information among the inputs, making it a light yet effective fusion module. Specifically, to make the proposed model robust to noise and missing information in any modality, flexible modality combinations are supported. Let $C(x_j; \theta_c)$ be the transformation of CRM with parameters $\theta_c$, and let
$$x_j = \bigoplus_{m \in j} x_m^S,$$
where $j \in \{l, a, v, (l, a), (a, v), (v, l), (l, a, v)\}$ denotes any possible combination of the three modality inputs, $\oplus$ denotes the concatenation operation, and $x_m^S$ are the weighted unimodal features produced by SQIL for the modalities in $j$. The fused representation can then be written as
$$x_j^C = C(x_j; \theta_c).$$
Lastly, the fused feature $x_j^C$ is passed through two fully connected layers $F_{fc}(\cdot)$ before performing classification or regression with a loss function $\mathcal{L}_C$. Either the cross-entropy loss $\mathcal{L}_{ce}$ or the least absolute deviations loss $\mathcal{L}_1$ is applied, depending on the learning task. Specifically,
$$\mathcal{L}_{ce} = -\sum_{c=1}^{N} y_{j,c} \log \hat{y}_{j,c}, \qquad \mathcal{L}_1 = \frac{1}{M}\sum_{c=1}^{M} \big|z_{j,c} - \hat{z}_{j,c}\big|,$$
where $N$ is the number of classes and $y_{j,c}$ is the ground-truth label for $\mathcal{L}_{ce}$, while $M$ is the number of data samples and $z_{j,c}$ is the target regression value for $\mathcal{L}_1$. The final UCRN objective function is
$$\mathcal{L} = \lambda_c \mathcal{L}_C + \lambda_m \mathcal{L}_M,$$
where URM, SQIL, and CRM are jointly optimized, and $\lambda_m$ and $\lambda_c$ are two trade-off parameters for the two loss terms; both are set to 1 by default in our experiments. Extensive experiments in the following section demonstrate that UCRN not only improves performance over multiple multimodal datasets but is also robust against missing modalities and noise.
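The following sketch illustrates a CRM-style fusion head that accepts any subset of the refined unimodal representations; stacking the modalities as a short sequence for the transformer encoder, the pooling choice, and the layer sizes are assumptions of this sketch rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class CrossmodalRefinement(nn.Module):
    """Sketch of a CRM-style head: combine whichever weighted unimodal
    representations are present, refine the stack with a transformer
    encoder, and predict from two fully connected layers."""

    def __init__(self, d_model=48, n_heads=3, n_layers=5, n_out=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, n_out))

    def forward(self, reps):
        # reps: list of (batch, d_model) tensors, one per available modality;
        # any subset of {language, audio, vision} is accepted.
        x = torch.stack(reps, dim=1)      # (batch, n_available, d_model)
        x = self.encoder(x)               # crossmodal refinement
        return self.head(x.mean(dim=1))   # pooled fused prediction
```

Under this sketch, the overall objective could then be assembled as `loss = lambda_c * task_loss + lambda_m * mjs_loss`, with both trade-off weights set to 1 as in the default configuration.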

Datasets
CMU-MOSI (Zadeh et al., 2016) is a multimodal opinion sentiment intensity analysis dataset, which consists of 2,199 short monologue video clips (opinion utterances). There are 35 facial action units that record facial muscle movement (Ekman et al., 1980; Ekman, 1992). Low-level acoustic features are extracted by COVAREP (Degottex et al., 2014). Language data are segmented by word and expressed as discrete word embeddings (Pennington et al., 2014).

TBJE-MOSEI
TBJE-MOSEI (Delbrouck et al., 2020) is preprocessed from the original CMU-MOSEI dataset but uses different feature extraction methods. For a fair comparison with TBJE, experimental results are also reported on this dataset.

Metrics 2-class sentiment accuracy (following the officially released evaluation metrics from https://github.com/A2Zadeh/CMU-MultimodalSDK), F1 score, MAE (mean absolute error, the lower the better), and Corr (correlation) are used as performance indexes. 7-class sentiment accuracy and emotion classification results are also reported for comparison with several strong benchmarks.

Implementation Details
Experimental Settings All the multi-head transformer-based architectures in the URM and CRM are implemented with 5 layers and 3 heads.
In each transformer encoding layer in the URM, the refined unimodal representation is trained by passing the same input as query, key, and value. In SQIL, the refined unimodal representation is first passed through a global average pooling along the feature dimension and then a fully connected layer to learn the correlations within features, resulting in a weighted unimodal representation. In CRM, the query, key, and value of the transformer inputs are the concatenated refined multimodal representation.
Training UCRN is trained in an end-to-end manner. It is light and easy to train. The batch size is set to 16 (32) and the base learning rate is 1e-3 (2e-3) on MOSI (MOSEI).
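A minimal optimizer setup consistent with these hyperparameters might look as follows; the choice of the Adam optimizer is an assumption, since the optimizer is not specified here.

```python
import torch

def make_optimizer(model, dataset="MOSI"):
    """Batch size 16 and learning rate 1e-3 on MOSI; 32 and 2e-3 on MOSEI.
    The Adam optimizer itself is an assumption of this sketch."""
    lr = 1e-3 if dataset == "MOSI" else 2e-3
    batch_size = 16 if dataset == "MOSI" else 32
    return torch.optim.Adam(model.parameters(), lr=lr), batch_size
```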
Test The proposed model is tested on MOSI and MOSEI for sentiments and emotions. More details are reported in the supplementary materials for readers to reproduce.

Quantitative Results Compared with Benchmarks
Extensive experiments are conducted on several multimodal sentiment analysis and emotion prediction datasets. The compared methods are state-of-the-art approaches, among which MulT (Tsai et al., 2019), Mu-Net (Shenoy and Sardana, 2020), and TBJE (Delbrouck et al., 2020) are strong benchmarks. MulT (Tsai et al., 2019) used explicit source-target modality translation, Mu-Net (Shenoy and Sardana, 2020) adopted a pairwise attention mechanism for fusion, and TBJE (Delbrouck et al., 2020) implicitly used one modality (language) to modulate the others. Table 1 lists both sentiment and emotion performance comparisons. Specifically, for sentiment analysis, 2-class and 7-class standard accuracies are both reported on CMU-MOSEI (marked by †) and TBJE-MOSEI (marked by *). UCRN shows competitive results on these two tasks.

Table 4: Performance percentage drop comparison by masking a certain modality on CMU-MOSEI (marked by †) and TBJE-MOSEI (marked by *). WA indicates the average weighted accuracy of emotion. 'L', 'A', and 'V' are abbreviations for the language, audio, and visual modalities, respectively.

On the emotion classification task, for a fair comparison, the weighted accuracies (WA) are reported, as Mu-Net and GRAPH-mfn adopted them; the unweighted results are compared with TBJE. UCRN shows the best average accuracy for emotion classification over all compared methods. Further comparisons on 2-class sentiment analysis are presented in Table 2 and Table 3. The compared methods include LSTM-based, translation-based, and pairwise-learning-based approaches. On both the CMU-MOSI and CMU-MOSEI datasets, UCRN shows improvement over the compared methods and outperforms the strong benchmarks in terms of average F1 score and accuracy.
The results indicate that refined unimodal and crossmodal representations are of vital importance for multimodal fusion. UCRN shows competitive performance owing to the ample exploration of unimodal dynamics by URM and the effective crossmodal representation from CRM. UCRN also has the advantage of adaptively taking any combination of input modalities. More results are presented in the following subsection.

Robustness Experiments
Missing modalities and noise are ever-present in the real world. Due to non-aligned, missing, or incomplete modalities, the information expressed by unimodal features is disproportionate; therefore, translation-based methods tend to fail in these cases. UCRN can alleviate these problems. Defining robustness as the percentage decrease in accuracy, i.e., (trimodal accuracy - masked or noisy modality accuracy) / trimodal accuracy, allows us to evaluate the robustness of UCRN objectively. Two kinds of experiments were conducted to validate the robustness of UCRN against missing modalities and noise, against several strong translation-based and pairwise-mapping-based benchmarks.
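The robustness score is a one-line computation; the small helper below (the function name is illustrative) makes the definition explicit, using UCRN's 2-class accuracies quoted later in this subsection as an example.

```python
def robustness_drop(trimodal_acc: float, degraded_acc: float) -> float:
    """Percentage decrease in accuracy when a modality is masked or noisy:
    (trimodal accuracy - degraded accuracy) / trimodal accuracy."""
    return (trimodal_acc - degraded_acc) / trimodal_acc

# Example: UCRN's 2-class accuracy drops from 84.36% (L+A+V) to 83.35% (L+A),
# a relative drop of about 1.2%.
print(robustness_drop(0.8436, 0.8335))  # ~0.012
```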
Firstly, to simulate a missing modality, the features of that modality are masked. Under such circumstances, UCRN still achieves higher accuracy than its counterparts, which demonstrates its robustness. As shown in Table 4, when one of the vision, audio, or language modalities is masked, UCRN degrades less on average. Assuming that more modalities provide greater representation capability, cases where performance degrades after adding a modality should not be counted toward robustness. Note that TBJE shows such a degradation with trimodal input compared to its bimodal one (i.e., a 2-class accuracy of 81.5% for L+A+V versus 82.4% for L+A according to (Delbrouck et al., 2020)). This is because TBJE cannot deal with the disparity among modalities or fully explore vision features, which impairs the overall representation.
In spite of that, UCRN still outperforms TBJE in terms of accuracy (i.e., a 2-class accuracy of 84.36% for L+A+V and 83.35% for L+A). Therefore, we only compare the performance drops, especially in the L+V and A+V cases.
Figure 2: Visualization of the distributions of multimodal features in the embedding space. Please zoom in for a better view.

Secondly, to simulate the presence of noise during information acquisition in the real world, noise following a Bernoulli distribution is randomly added to the entire modality features with a noise-presence probability of 0.5. Results in Table 5 show that UCRN is more robust on both the sentiment prediction and emotion classification tasks.
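A sketch of this noise injection is given below; whether the Bernoulli draw zeroes features or perturbs them in another way is an assumption of this example.

```python
import torch

def add_bernoulli_noise(x, p=0.5):
    """Corrupt a modality feature tensor by zeroing entries drawn from a
    Bernoulli distribution with probability of noise presence p.
    x: (batch, time, feat) features of one modality."""
    keep = torch.bernoulli(torch.full_like(x, 1.0 - p))
    return x * keep

noisy_audio = add_bernoulli_noise(torch.randn(8, 20, 74), p=0.5)
```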
UCRN is robust against missing modality and noise because it explores the refined unimodal representation and correlates the crossmodal features adaptively.
We argue that translation-based and implicit modulation methods have limited robustness because they rely on pairwise interactions and source-target translation.

Ablation Study
UCRN effectively reduces the modality gap. To show this, the distributions of multimodal features in the embedding space are visualized in Fig. 2. The t-SNE algorithm is used to transform the feature vectors into 2D maps. Fig. 2 (a) shows the feature embeddings right before 2-class sentiment prediction without SQIL and without L_M, whereas Fig. 2 (b) is obtained from UCRN with those modules. The feature embeddings in Fig. 2 (b) become more clustered and separable. Fig. 2 (c) and Fig. 2 (d) show the distributions of unimodal features before crossmodal fusion without and with L_M, respectively. Compared with Fig. 2 (c), the feature distributions in Fig. 2 (d) are more regularized, with a closer center distance (as can be seen from the red center points). This reveals that L_M benefits crossmodal refinement by reducing the modality gap for better predictions.
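The embedding visualization can be reproduced with a few lines of scikit-learn; the perplexity, the random features used as stand-ins, and the output filename are assumptions of this sketch.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# feats: (n_samples, dim) unimodal or fused features collected at inference;
# labels: a modality or class index per sample (both assumed available).
feats = np.random.randn(300, 48)
labels = np.repeat([0, 1, 2], 100)

coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(feats)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=8)
plt.savefig("tsne_embeddings.png", dpi=200)
```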
The proposed UCRN emphasizes the importance of unimodal and crossmodal refinement. The contributions of the different components are summarized in Table 6. We have the following observations: 1) URM greatly boosts performance; 2) adding SQIL yields a further improvement; 3) UCRN gains a large improvement by adding the multimodal JS divergence (MJS) regularizer; and 4) CRM adds further value on top. The results substantiate our assumptions that unimodal refinement contributes significantly to fusion and that crossmodal refinement is effective in exploring modality-common information and reducing the modality gap.

Size of Network Parameters
As listed in Table 7, UCRN is lightweight and achieves competitive performance with far fewer parameters compared with several benchmark methods.

Conclusion
In this work, the Unimodal and Crossmodal Refinement Network (UCRN) is proposed for robust and efficient multimodal representation learning. We hypothesize that unimodal representations should be refined before crossmodal fusion, and that it is beneficial to reduce modality gaps before crossmodal refinement. Following this line, the proposed network is designed with a unimodal refinement module, a multimodal JS divergence regularizer, self-quality improvement layers, and a crossmodal refinement module. The experimental results validate our assumptions. In particular, the robustness experiments demonstrate high efficiency and show that UCRN can handle missing modalities and noise. The experimental results also show that UCRN achieves state-of-the-art performance on multiple multimodal datasets.