CTFN: Hierarchical Learning for Multimodal Sentiment Analysis Using Coupled-Translation Fusion Network

Multimodal sentiment analysis is a challenging research area that addresses the fusion of multiple heterogeneous modalities. A key difficulty is the absence of some modalities during the fusion procedure. Existing techniques require all modalities as input and are therefore sensitive to missing modalities at prediction time. In this work, we propose the coupled-translation fusion network (CTFN), which models bi-directional interplay via couple learning and is thus robust to missing modalities. Specifically, a cyclic consistency constraint is introduced to improve translation performance, allowing us to discard the decoder and retain only the encoder of the Transformer, which yields a much lighter model. Owing to couple learning, CTFN conducts bi-directional cross-modality intercorrelation in parallel. On top of CTFN, a hierarchical architecture is established to exploit multiple bi-directional translations, yielding twice as many multimodal fusion embeddings as traditional translation methods. Moreover, a convolution block is utilized to further highlight explicit interactions among those translations. For evaluation, CTFN was verified on two multimodal benchmarks with extensive ablation studies. The experiments demonstrate that the proposed framework achieves state-of-the-art or competitive performance, and that CTFN remains robust when modalities are missing.


Introduction
Sentiment analysis has witnessed many significant advances in the artificial intelligence community, in which the text (Yadollahi et al., 2017), visual (Kahou et al., 2016), and acoustic (Luo et al., 2019) modalities are primarily employed in the related research, allowing human emotional characteristics and intentions to be exploited effectively (Deng et al., 2018). Intuitively, owing to the consistency and complementarity among different sources, joint representations can reason about multimodal messages and thereby boost the performance of a given task (Pan et al., 2016; Gebru et al., 2017; Al Hanai et al., 2018). Multimodal fusion incorporates multiple sources of knowledge to predict a precise and proper outcome (Baltrušaitis et al., 2018). Historically, fusion has generally been performed with model-agnostic processes: early fusion, late fusion, and hybrid fusion (Poria et al., 2017a). Among these, early fusion concatenates the unimodal representations (D'mello and Kory, 2015). In contrast, late fusion performs integration at the decision level by voting among the model outputs (Shutova et al., 2016). Hybrid fusion combines early fusion with unimodal predictions (Lan et al., 2014). Nevertheless, multimodal sentiment sequences are often unaligned, and traditional fusion methods fail to account carefully for this heterogeneity and misalignment, which calls for more sophisticated models for estimating emotional information (Tsai et al., 2020; Niu et al., 2017).

* Equal contribution. † Corresponding author: Wanzeng Kong.
Recently, Transformer-based multimodal fusion frameworks have been developed to address the above issues with the help of the multi-head attention mechanism (Rahman et al., 2020; Le et al., 2019; Tsai et al., 2019). Taking the standard Transformer network (Vaswani et al., 2017) as a basis, Tsai et al. (Tsai et al., 2019) captured interactions directly from unaligned multimodal streams in an end-to-end fashion, latently adapting streams from one modality to another with the cross-modal

Figure 1: Comparison of CTFN with existing translation-based models, with (c) showing the TransModality model. In our model, the cyclic consistency constraint improves translation performance, allowing us to discard the decoder and embrace only the encoder of the Transformer, which contributes to a much lighter model. Owing to couple learning, CTFN conducts bi-directional cross-modality intercorrelation in parallel, ensuring robustness to missing modalities.
attention module, regardless of the need for alignment. Furthermore, Wang et al. (Wang et al., 2020) proposed a parallel Transformer unit that effectively explores the correlations among multimodal knowledge. However, the decoder component of the standard Transformer is employed to improve translation performance, which may introduce redundancy. Moreover, explicit interactions among cross-modality translations were not considered. Essentially, compared to our CTFN, these architectures require access to all modalities as inputs to explore multimodal interplay with a sequential fusion strategy, and are thus rather sensitive to multiple missing modalities.
In this paper, CTFN is proposed to model bi-directional interplay based on couple learning, ensuring robustness to missing modalities. Specifically, a cyclic consistency constraint is proposed to improve translation performance, allowing us to discard the decoder and embrace only the encoder of the Transformer, which contributes to a much lighter model. Thanks to couple learning, CTFN conducts bi-directional cross-modality intercorrelation in parallel. Taking CTFN as a basis, a hierarchical architecture is established to exploit modality-guidance translation. A convolutional fusion block is then presented to further explore the explicit correlations among these translations. Importantly, owing to the parallel fusion strategy, our CTFN model remains flexible and robust when only one input modality is available.
For evaluation, CTFN was verified on two multimodal sentiment benchmarks, CMU-MOSI (Zadeh et al., 2016) and MELD (Poria et al., 2019). The experiments demonstrate that CTFN achieves state-of-the-art or better performance compared to the baseline models. We also provide several extended ablation studies to investigate the intrinsic properties of the proposed model.

Related Work
Off-the-shelf multimodal sentiment fusion architectures comprise two leading groups: translation-based and non-translation-based models.
Non-translation-based: Recently, RNN-based models, including GRU and LSTM variants, have made significant advances in exploiting context-aware information across data (Yang et al., 2016; Agarwal et al., 2019). bc-LSTM (Poria et al., 2017b) and GME-LSTM (Chung et al., 2014) presented LSTM-based models to retrieve contextual information, where the unimodal features are concatenated into a single input. Similarly, MELD-base (Poria et al., 2019) concatenated audio and textual features at the input layer and employed a GRU to model sentimental context. In contrast, CHFusion (Majumder et al., 2018) employed an RNN-based hierarchical structure to draw fine-grained local correlations among the modalities, with empirical evidence of clear gains over simple concatenation of unimodal representations. On the basis of RNNs, MMMU-BA (Ghosal et al., 2018) further employed a multimodal attention block to absorb the contributions of all neighboring utterances, which demonstrates that the attention mechanism

Figure 2: CTFN: $X_a$ and $X_v$ refer to the features of the audio and video modalities, respectively. The blue line indicates the primal process, and the yellow line indicates the dual procedure. Note that the cyclic consistency constraint improves translation performance, allowing us to discard the decoder and embrace only the encoder of the Transformer. Thanks to couple learning, CTFN combines the primal and dual processes into a coupled structure, ensuring robustness to missing modalities.
can utilize the neighborhood contributions to integrate contextual information. However, all these methods operate on low-level representations within a single modality in a non-translation manner, which may make them sensitive to noisy terms and missing information in the sources.
Translation-based: Inspired by the recent success of sequence-to-sequence (Seq2Seq) models (Lin et al., 2019) in machine translation, (Pham et al., 2019) and (Pham et al., 2018) presented multimodal fusion models built on the insight that translating from a source modality to a target modality captures much more robust associations across modalities. The MCTN model incorporated a cyclic translation module to retrieve robust joint representations between modalities in a sequential manner, e.g., the language information is first associated with the visual modality and then translated into the acoustic modality. Compared with MCTN, Seq2Seq2Sent introduced a hierarchical fusion model using Seq2Seq methods: in the first layer, the joint representation of a modality pair is treated as the input sequence for the next Seq2Seq layer, which attempts to decode the third modality. Inspired by the success of Transformer-based models, Tsai et al. introduced a directional cross-modality attention module that extends the standard Transformer network. Following the basic idea of Tsai et al., Wang et al. provided a novel multimodal fusion cell comprising two standard Transformers, which implicitly embraces the associations within a modality pair during forward and backward translation. However, all existing models adopt a sequential multimodal fusion architecture that requires all modalities as input, making them sensitive to multiple missing modalities. Moreover, explicit interactions among cross-modality translations were not considered.

Methodology
In this section, we first present CTFN (Figure 2), which explores bi-directional cross-modality translation via couple learning. On the basis of CTFN, a hierarchical architecture is established to exploit multiple bi-directional translations, leading to double the multimodal fusion embeddings (Figure 4). Then, the convolutional fusion block (Figure 3) is applied to further highlight the explicit correlations among cross-modality translations.

Preliminaries
The two benchmarks consist of three modalities: audio, video, and text. The utterance-level features are denoted as $X_a \in \mathbb{R}^{T_a \times d_a}$, $X_v \in \mathbb{R}^{T_v \times d_v}$, and $X_t \in \mathbb{R}^{T_t \times d_t}$, respectively, where $T_i$ ($i \in \{a, v, t\}$) is the number of utterances and $d_i$ ($i \in \{a, v, t\}$) is the dimension of the unimodal features.

Coupled-Translation Fusion Network
For simplicity, we consider two unimodal representations $X_a$ and $X_v$ extracted from audio (A) and video (V), respectively. In the primal process of CTFN, we learn a directional translator $\mathrm{Tran}_{A \to V}(X_a, X_v)$ that translates the audio modality into video. The dual process then learns an inverse directional translator $\mathrm{Tran}_{V \to A}(X_v, X_a)$, allowing translation from video into audio. Inspired by the success of the Transformer in natural language processing, the encoder of the Transformer is introduced into our model as the translation block, an efficient and adaptive way of retrieving long-range interplay along the temporal dimension. Importantly, a cyclic consistency constraint is presented to improve translation performance, and owing to couple learning, CTFN combines the primal and dual processes into a coupled structure, ensuring robustness to missing modalities.
For the primal task, $X_a \in \mathbb{R}^{T_a \times d_a}$ is first passed through a densely connected layer to obtain a linear transformation $X_a \in \mathbb{R}^{T_a \times L_a}$, where $L_a$ is the output dimension of the linear layer. The corresponding query, key, and value matrices are $Q_a = X_a W_{Q_a} \in \mathbb{R}^{T_a \times L_a}$, $K_a = X_a W_{K_a} \in \mathbb{R}^{T_a \times L_a}$, and $V_a = X_a W_{V_a} \in \mathbb{R}^{T_a \times L_a}$, and the attention output is $\mathrm{softmax}(Q_a K_a^{\top} / \sqrt{L_a}) V_a$, where $\hat{X}_v$ denotes the fake $X_v$ produced by the translator and $\sqrt{L_a}$ is the scale coefficient. Note that the input $X_a$ is delivered directly to the translation process, while the input $X_v$ is used only to measure the difference between the real $X_v$ and the fake output $\hat{X}_v$; likewise, $X_a$ is used only to compute the divergence between the real and reconstructed data.
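To make the attention computation above concrete, a single scaled dot-product attention head can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation; the projection matrices and dimensions are purely illustrative.

```python
import numpy as np

def attention_head(X, W_q, W_k, W_v):
    """One scaled dot-product self-attention head over utterance features X (T x L)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # query/key/value: T x L
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # scaled by sqrt(L)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)            # row-wise softmax
    return w @ V                                     # weighted values: T x L

rng = np.random.default_rng(0)
T_a, L_a = 4, 8                                      # illustrative sizes
X_a = rng.standard_normal((T_a, L_a))
W_q, W_k, W_v = (rng.standard_normal((L_a, L_a)) for _ in range(3))
out = attention_head(X_a, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

The encoder of the Transformer stacks such heads with feed-forward and normalization layers; here only the attention core is shown.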
Analogously, the dual process operates on $X_v$. Essentially, $\mathrm{Tran}_{A \to V}$ and $\mathrm{Tran}_{V \to A}$ are implemented as stacks of sequential encoder layers. During translation, we hypothesize that the intermediate encoder layer contains the cross-modality fusion information and effectively balances the contributions of the two modalities. Hence, the outputs of the middle encoder layers, $\mathrm{Tran}_{A \to V}[L/2]$ and $\mathrm{Tran}_{V \to A}[L/2]$, serve as the multimodal fusion knowledge, where $L$ is the number of layers (if $L$ is odd, we set $L = L + 1$). As for the model reward, the primal process has an immediate reward $r_p = \| X_a - \mathrm{Tran}_{V \to A}(\hat{X}_v) \|_F$, and the dual step has the related reward $r_d = \| X_v - \mathrm{Tran}_{A \to V}(\hat{X}_a) \|_F$, measuring the similarity between the real data and the reconstructed output of the translator. For simplicity, a linear combination merges the primal and dual rewards into a total model reward, $r_{all} = \alpha r_p + (1 - \alpha) r_d$, where $\alpha$ balances the contributions of the dual and primal blocks. Additionally, the losses of the coupled-translation multimodal fusion block are $l_{A \to V}(X_a, X_v)$ and $l_{V \to A}(X_v, X_a)$, the training losses of the primal and dual translators, respectively, and $l_{A \leftrightarrow V}$, the loss of the bi-directional translator unit. Essentially, once the training of all coupled-translation blocks

Figure 4: The hierarchical framework associated with three CTFNs during the training period. Each CTFN explores a specific bi-directional cross-modality interplay. On this basis, the three CTFNs are stacked into a unified model that exploits multiple bi-directional translations, leading to double the multimodal fusion embeddings. These embeddings are then delivered to the multimodal convolutional fusion block.
are finished, our model needs only one input modality at prediction time, without the help of the target modalities. Indeed, $l_{A \leftrightarrow V}$ denotes the cycle-consistency constraint in our couple learning model. Cycle-consistency is well known and refers to the combination of forward and backward cycle-consistency. However, our goal is to solve the missing-modality problem in multimodal learning, which cannot be achieved by applying cycle-consistency straightforwardly, because introducing this strict cycle-consistency to CTFN fails to effectively associate the primal task with the dual task of the couple learning model. To solve this problem, we relax the original cycle-consistency constraint by using a parameter $\alpha$ to balance the contributions of forward and backward cycle-consistency, leading to a much more flexible cycle-consistency. Thanks to the flexibility of the proposed cycle-consistency, we can adaptively and adequately associate the primal task with the dual task, resulting in a much more balanced consistency among modalities.
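The relaxed cycle-consistency reward $r_{all} = \alpha r_p + (1-\alpha) r_d$ described above can be sketched as follows. This is a toy NumPy sketch under simplifying assumptions: the linear "translators" merely stand in for the trained Transformer encoders, and all names are illustrative.

```python
import numpy as np

def coupled_reward(X_a, X_v, tran_a2v, tran_v2a, alpha=0.5):
    """Relaxed cycle-consistency reward of the coupled translators.

    tran_a2v / tran_v2a stand in for the trained encoder translators;
    any callables mapping one modality's features to the other's work here.
    """
    X_v_fake = tran_a2v(X_a)                        # primal: audio -> fake video
    r_p = np.linalg.norm(X_a - tran_v2a(X_v_fake))  # reconstruct audio (Frobenius norm)
    X_a_fake = tran_v2a(X_v)                        # dual: video -> fake audio
    r_d = np.linalg.norm(X_v - tran_a2v(X_a_fake))  # reconstruct video
    return alpha * r_p + (1 - alpha) * r_d          # r_all = a*r_p + (1-a)*r_d

# toy translators: a well-conditioned linear map and its exact inverse,
# so the cycle reconstructs the input almost perfectly
rng = np.random.default_rng(1)
W = rng.standard_normal((8, 8)) + 8 * np.eye(8)
W_inv = np.linalg.inv(W)
X_a = rng.standard_normal((4, 8))
X_v = rng.standard_normal((4, 8))
r = coupled_reward(X_a, X_v, lambda x: x @ W, lambda x: x @ W_inv)
print(r < 1e-6)  # True: a perfect cycle reconstructs up to float error
```

In CTFN the translators are learned, so $r_{all}$ is driven toward zero during training rather than being zero by construction.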

Multimodal convolutional fusion block
Based on CTFN, each modality serves as the source modality $(M-1)$ times, which means each modality holds $(M-1)$ directional translations, $\{\mathrm{Tran}_{source \to m}\}_{m=1}^{M}$, where $M$ is the total number of modalities. For instance, given the audio modality, we can retrieve the following two modality-guidance translations:

$[\mathrm{Tran}_{a \to v}[L/2], \hat{X}_v] = \mathrm{Tran}_{a \to v}(\text{audio}, \text{video})$,
$[\mathrm{Tran}_{a \to t}[L/2], \hat{X}_t] = \mathrm{Tran}_{a \to t}(\text{audio}, \text{text})$. (3)

Note that audio plays a key role in both cross-modality translations and provides strong guidance for capturing various cross-modality interplay. To blend the contribution of the source modality (audio) effectively, a convolution fusion block is incorporated to explore explicit, local correlations among the modality-guidance translations.
Initially, the two cross-modality intermediate correlations $\mathrm{Tran}_{audio \to video}[L/2]$ and $\mathrm{Tran}_{audio \to text}[L/2]$ are concatenated along the feature domain into a unified representation; since the time-sequence sizes are equal ($T_a = T_v = T_t$), the concatenation is of size $T_a \times (L_v + L_t)$:

$C_a = \mathrm{Tran}_{audio \to video}[L/2] \oplus \mathrm{Tran}_{audio \to text}[L/2] \in \mathbb{R}^{T_a \times (L_v + L_t)}$.

Subsequently, a temporal convolution is employed to further retrieve explicit interactions among the cross-modality translations. Specifically, we adopt a 1D temporal convolutional layer to exploit local patterns in a lightweight manner:

$F_a = \mathrm{Conv1D}(C_a, K_{concat}) \in \mathbb{R}^{T_a \times L_d}$,

where $K_{concat}$ is the size of the convolutional kernel and $L_d$ is the length of the cross-modality integration dimension. The temporal kernel performs the convolution along the feature dimension, further exploiting the local interplay among cross-modality translations; that is, the local interplay fully exploits the contributions of the modality-guidance translations.
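The concatenate-then-convolve step can be sketched as follows. This is a minimal NumPy sketch with a single fixed 1D kernel and "valid" padding; CTFN uses a learned convolutional layer, and all sizes here are illustrative.

```python
import numpy as np

def conv_fusion(trans_av, trans_at, kernel):
    """Concatenate two modality-guidance translations along the feature
    axis, then slide a 1D kernel over that axis (valid padding)."""
    C = np.concatenate([trans_av, trans_at], axis=1)   # T_a x (L_v + L_t)
    T, L = C.shape
    k = kernel.shape[0]
    out = np.empty((T, L - k + 1))
    for j in range(L - k + 1):                         # convolve the feature dim
        out[:, j] = C[:, j:j + k] @ kernel
    return out

rng = np.random.default_rng(2)
T_a, L_v, L_t = 4, 6, 5                                # illustrative sizes
f = conv_fusion(rng.standard_normal((T_a, L_v)),
                rng.standard_normal((T_a, L_t)),
                kernel=np.ones(3) / 3)                 # toy averaging kernel
print(f.shape)  # (4, 9)
```

A learned kernel would play the same role as the fixed averaging kernel here, mixing adjacent features of the two translations into a joint embedding.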

Hierarchical Architecture
On the basis of CTFN and the convolutional multimodal fusion network, a hierarchical architecture is proposed to exploit multiple bi-directional translations, leading to double the multimodal fusion embeddings. For instance, given $M = 3$ modalities, our model obtains the six directional translations $[\mathrm{Tran}_{A \to V}, \mathrm{Tran}_{V \to A}, \mathrm{Tran}_{A \to T}, \mathrm{Tran}_{T \to A}, \mathrm{Tran}_{T \to V}, \mathrm{Tran}_{V \to T}]$, respectively. Subsequently, the convolutional fusion layer further exploits the explicit local interplay among modality-guidance translations associated with the same source/target modality, which fully leverages the contribution of the source/target modality.

Figure 5: We employ only a single input modality (audio) for the multimodal fusion task during the prediction period. Initially, the audio representation $X_a$ is sent to the pre-trained translators $\mathrm{Tran}_{A \to V}$ and $\mathrm{Tran}_{A \to T}$ to retrieve $\hat{X}_v$ and $\hat{X}_t$. Then, $\hat{X}_v$ is transmitted to $\mathrm{Tran}_{V \to A}$ and $\mathrm{Tran}_{V \to T}$, and $\hat{X}_t$ is sent to $\mathrm{Tran}_{T \to V}$ and $\mathrm{Tran}_{T \to A}$. Hence, the tree structure needs only one input modality to perform the multimodal fusion task.
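The prediction-time tree of Figure 5 can be sketched as follows. Identity functions stand in for the six pre-trained translators, and the pair keys are illustrative names, not the paper's API.

```python
def tree_inference(x_a, tran):
    """Fan out from a single audio input: first generate fake video and
    text, then translate each fake modality onward (as in Figure 5)."""
    x_v = tran[('a', 'v')](x_a)          # fake video from audio
    x_t = tran[('a', 't')](x_a)          # fake text from audio
    return {
        ('a', 'v'): x_v,
        ('a', 't'): x_t,
        ('v', 'a'): tran[('v', 'a')](x_v),
        ('v', 't'): tran[('v', 't')](x_v),
        ('t', 'a'): tran[('t', 'a')](x_t),
        ('t', 'v'): tran[('t', 'v')](x_t),
    }

# identity stubs in place of the six pre-trained translators
pairs = [('a', 'v'), ('a', 't'), ('v', 'a'), ('v', 't'), ('t', 'a'), ('t', 'v')]
tran = {p: (lambda x: x) for p in pairs}
streams = tree_inference([0.1, 0.2], tran)
print(len(streams))  # 6: all directional streams recovered from audio alone
```

This illustrates why the parallel design tolerates missing modalities: every downstream stream is reachable from whichever single modality is observed.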

Multimodal Convolution Fusion
Essentially, as shown in Figure 4, our model has "12+1" loss constraints in total: 3 CTFNs, each with 4 training losses (primal and dual translator losses), plus 1 classification loss. However, we do not need to balance these objectives jointly, because our training strategy trains the 3 CTFNs individually. For each CTFN, one hyper-parameter $\alpha$ balances the losses of the primal and dual translators, and this hyper-parameter is shared among the 3 CTFNs. Hence, the 3 CTFNs need only 1 hyper-parameter to balance the training losses, which is easy to tune. The classification loss is used to train the classifier on the outputs of the 3 CTFNs.
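The training strategy above can be sketched as follows. `ToyCTFN` and its loss numbers are illustrative stubs standing in for one coupled translator pair trained by gradient descent; only the control flow (individual training, one shared $\alpha$) mirrors the paper.

```python
class ToyCTFN:
    """Stand-in for one coupled translator pair (primal + dual)."""
    def __init__(self):
        self.scale = 1.0
    def step(self, alpha):
        l_primal, l_dual = self.scale, 0.9 * self.scale   # stub losses
        self.scale *= 0.5                                  # pretend optimisation
        return alpha * l_primal + (1 - alpha) * l_dual     # balanced by alpha

pairs = [('a', 'v'), ('a', 't'), ('v', 't')]
alpha = 0.5                        # single hyper-parameter, shared by all 3 CTFNs
ctfns = {p: ToyCTFN() for p in pairs}
for p in pairs:                    # the 3 CTFNs are trained individually,
    for _ in range(3):             # never balanced against each other
        ctfns[p].step(alpha)
print(all(c.scale < 1.0 for c in ctfns.values()))  # True
```

A separate classifier would then be fit on the frozen CTFN outputs, corresponding to the "+1" classification loss.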

Experimental setups
Datasets. CMU-MOSI consists of 2199 opinion video clips from online sharing websites (e.g., YouTube). Each utterance of a video clip is annotated with a sentiment label, positive or negative, on the scale [−3, +3]. The training, validation, and testing splits contain 1284, 229, and 686 utterances, respectively. Additionally, no speaker appears in both the training and testing sets, allowing speaker-independent joint representations to be exploited. The MELD dataset contains 13000 utterances from the famous TV series Friends. Each utterance is annotated with emotion and sentiment labels, covering 7 emotion tags (anger, disgust, fear, joy, neutral, sadness, and surprise) and 3 sentiment tendencies (positive, neutral, and negative). The dataset is accordingly denoted MELD (Sentiment) or MELD (Emotion) depending on the annotation; we verified our model only on MELD (Sentiment). Note that CMU-MOSI and MELD are public, widely used datasets that have already been aligned and segmented.
Features. For the CMU-MOSI dataset, we adopt the same preprocessing as MFN (Zadeh et al., 2018) to extract low-level representations of the multimodal data, synchronized at the utterance level in accordance with the text modality. For the MELD benchmark, we follow the related work of MELD: 300-dimensional GloVe (Pennington et al., 2014) text vectors are fed into a 1D-CNN (Chen et al., 2017) layer to extract textual representations, and audio descriptors are extracted with the popular openSMILE toolkit (Eyben et al., 2010); visual features were not taken into account for sentiment analysis.
Comparisons. We introduce translation-based and non-translation-based models as baselines. Translation-based: Multimodal Cyclic Translation Network (MCTN), Sequence to Sequence for Sentiment (Seq2Seq2Sent)

Experiment results and analysis
Performance comparison with state-of-the-art models. First, we compare the state-of-the-art baselines with our proposed model. The bottom rows of Table 1 indicate the effectiveness and superiority of our model. In particular, on the CMU-MOSI dataset, CTFN exceeded the previous best, TransModality, on (video, audio) by a margin of 4.51. Additionally, on MELD (Sentiment), the empirical improvement of CTFN was 0.78. It is interesting to note that the improvement on (video, audio) is more significant than on (text, video) and (text, audio). This implies that the coupled-translation structure decreases the risk of interference between video and audio efficiently and further leverages the explicit consistency between auxiliary features. As for (text, audio, video), CTFN exceeds the previous best, TransModality, by 0.06, a comparable performance. Indeed, for the same tri-modality fusion task, TransModality needs 4 encoders and 4 decoders, while CTFN requires only 6 encoders. It should be emphasized that the cyclic consistency mechanism contributes to a much lighter model as well as more effective bi-directional translation. In addition, compared to the bi-modality setting, the tri-modality case achieved an improvement of 0.61, indicating the benefits brought by the hierarchical architecture and convolution fusion. Note that in the tables, a setting such as (audio, video, text) with modalities struck out denotes that CTFN employs only the single remaining input modality (e.g., audio) for the multimodal fusion task, as shown in Figure 5.
Effect of CTFN with missing modalities. Existing translation-based methods focus only on the joint representation between modalities and ignore the potential occurrence of missing modalities. Therefore, we analyzed how missing modalities affect the final performance of CTFN and of the sequential translation-based model Seq2Seq2Sent. Note that Seq2Seq2Sent employs only an LSTM to analyze a single modality rather than the translation-based method. Specifically, we take the hierarchical architecture combining three CTFNs as the testing model. From Table 2, we observe that, compared to the full setting (text, audio, video), the text-based settings in which audio and/or video are missing reach comparable results with only a relatively small performance drop. In contrast, when text is missing, the model suffers a relatively large performance drop, which implies that the language modality contains much more discriminative sentiment information than audio and video, leading to significantly better performance. Essentially, the performance with only a single modality available demonstrates that hierarchical CTFN maintains robustness and consistency when considering only a single input modality. In other words, the cyclic consistency mechanism allows CTFN to fully exploit cross-modality interplay, so hierarchical CTFN can transmit the single modality through the various pre-trained CTFNs to retrieve the multimodal fusion message.

Figure 6: Effect of the translation direction (primal task vs. dual task).
Effect of the translation direction. In this paper, we propose a coupled-translation block that embraces fusion messages from the bi-directional translation process. Hence, we investigate the impact of the translation direction. Figure 6 depicts the performance of various translations, considering the (audio, text), (audio, video), and (text, video) pairs. For the (audio, text) pair, the translation text→audio achieves better performance than audio→text. Similarly, the translation text→video surpasses video→text. However, the performances of audio→video and video→audio are quite similar. The superiority of text→video and text→audio suggests that the text modality possesses much more sentiment information; this advantage allows text to serve as a strong backbone of the translation.

Effect of the translator layer. Each translator comprises several sequential encoder layers, so we assume that the output representation of a specific layer may affect the performance of the proposed model. For simplicity, we perform this experiment on CMU-MOSI with the (a, v, t) setting, as well as (t, a) on MELD (Sentiment). We retrieve the embedding from a specific layer, ranging from 1 to L (the total number of layers). In Figure 7, it is interesting to note that the model peaks at layer 5 on CMU-MOSI, which means the output of the fifth layer embraces the most discriminative fusion message. In comparison, on MELD (Sentiment) the model performs best at layer 1, which may imply that a simple translator with only one layer can capture the joint representation for the simple (text, audio) case. In conclusion, the lower encoder layers may capture low-level characteristics of the interplay, while the higher encoder layers may embrace more explicit messages.
Additionally, the best encoder layer depends on the corresponding task and dataset. We also tried (text, audio) on MOSI, where CTFN maximizes performance at layer 3. Compared to (text, audio, video), (text, audio) is a relatively simple case, so a lower encoder layer may be sufficient to capture the interaction between text and audio.
Effect of the concatenation strategy of translations. In our work, translations associated with the same guidance (source) modality are concatenated along the feature domain. Since each modality serves as the source and the target modality in turn, we analyze the impact of distinct concatenation strategies, e.g., concatenating translations that share the same source or the same target modality. As shown in Figure 8, audio-based target concatenation [(T→A) ⊕ (V→A)] performs significantly better than [(A→T) ⊕ (A→V)] by a large margin. Analogously, video-based target concatenation [(T→V) ⊕ (A→V)] works better than [(V→A) ⊕ (V→T)]. This performance may indicate that the joint representation achieves significantly improved benefits with the help of the guidance modality text. In conclusion, when text serves as the guidance modality, it may effectively leverage the contributions of audio and video and further boost task performance in a robust and consistent way.

Conclusion
In this paper, we presented a novel hierarchical multimodal fusion architecture using the coupled-translation fusion network (CTFN). CTFN exploits bi-directional interplay via couple learning, ensuring robustness to missing modalities. Specifically, the cyclic mechanism discards the decoder and embraces only the encoder of the Transformer, contributing to a much lighter model. Owing to couple learning, CTFN conducts bi-directional cross-modality intercorrelation in parallel. Based on CTFN, a hierarchical architecture is further established to exploit multiple bi-directional translations, leading to double the multimodal fusion embeddings compared with traditional translation methods. Additionally, a multimodal convolutional fusion block further explores the complementarity and consistency between cross-modality translations. Essentially, the parallel fusion strategy allows the model to maintain robustness and flexibility when considering only one input modality. CTFN was verified on two public multimodal sentiment benchmarks; the experiments demonstrate its effectiveness and flexibility, and CTFN achieves state-of-the-art or comparable performance on CMU-MOSI and MELD (Sentiment). For future work, we would like to evaluate CTFN on more multimodal fusion tasks. The source code can be obtained from https://github.com/deepsuperviser/CTFN.