Multimodal Phased Transformer for Sentiment Analysis

Multimodal Transformers achieve superior performance in multimodal learning tasks. However, the quadratic complexity of the self-attention mechanism in Transformers limits their deployment on low-resource devices and makes their inference and training computationally expensive. We propose the multimodal Sparse Phased Transformer (SPT) to alleviate the complexity and memory footprint of self-attention. SPT uses a sampling function to generate a sparse attention matrix and compress a long sequence to a shorter sequence of hidden states. SPT concurrently captures interactions between the hidden states of different modalities at every layer. To further improve the efficiency of our method, we use layer-wise parameter sharing and Factorized Co-Attention, which share parameters between Cross Attention blocks, with minimal impact on task performance. We evaluate our model on three sentiment analysis datasets and achieve comparable or superior performance compared with existing methods, with a 90% reduction in the number of parameters. We conclude that SPT, along with parameter sharing, can capture multimodal interactions with reduced model size and improved sample efficiency.


Introduction
The objective of multimodal sentiment analysis is to identify the polarity of one's attitude toward an entity through multimodal inputs such as audio, video, and text. For many applications, such as personal assistants, social robots, and virtual agents, the efficiency and scalability of a method are as important as accuracy. Such applications can have limited computational resources or large-scale deployment requirements. Multimodal understanding of constructs, such as sentiment, requires capturing the information available in every modality in addition to their potential interactions, e.g., an exaggerated smile combined with negative sentiment in language might signal irony. Modeling these interactions efficiently is still an open challenge (Baltrušaitis et al., 2018). Some work on this topic uses different networks for each modality followed by fusion methods, like concatenation (Tsai et al., 2019; Hazarika et al., 2020; Pan et al., 2020) and outer-product (Zadeh et al., 2017), to model the interaction of multimodal representations, which largely increases the dimensionality of the representations and thus the computational cost. Rahman et al. (2020) rely on large pre-trained models, like BERT (Devlin et al., 2018) and XLNet. The computational cost of such approaches is high due to the over-parameterization of the models, and Transformer-based methods (Tsai et al., 2019; Rahman et al., 2020) suffer from the quadratic complexity of self-attention.
The existing multimodal sentiment analysis datasets are rather small due to the laborious labeling process. The development of the existing datasets, such as (Zadeh et al., 2016), involves data curation and annotation by multiple annotators. The limited dataset size raises the risk of over-fitting for over-parameterized models, which motivates building models that can be trained with less data. Recent work, such as (Child et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020), improves the efficiency of Transformers through sparse attention. Compared with full attention, which calculates attention for all pairs of input elements, sparse attention only computes attention for a subset of element pairs. As a consequence, each element from one sequence attends only to a limited number of elements in the source sequence. Other work reduces the attention matrix size by iteratively processing only a shorter segment of the original long sequence at a time (Rae et al., 2020).
In this paper, we propose a Sparse Phased attention (SP) mechanism that uses a sampling function to compress a longer input sequence to a shorter sequence of hidden states and improve the efficiency of the attention computation. Multimodal SPT can capture multimodal interactions through a "Concurrent" network structure rather than the "Serial" structure of previous work (Tsai et al., 2019; Rahman et al., 2020; Huang et al., 2020; Pan et al., 2020). We improve the efficiency of SPT through parameter sharing and Factorized Co-Attention.
We perform extensive experiments evaluating the performance of SPT on multimodal sentiment analysis and an ablation study on its structure, sampling function, parameter sharing approaches, and SP-Block. We compare our method with the state-of-the-art efficient Transformers, i.e., Performer (Choromanski et al., 2020). Our experimental results show that our model is able to achieve minimal or no performance loss with a significant reduction in model size. Other efficient Transformer-based approaches with linear complexity result in a larger degradation of performance. In comparison with previous work on multimodal sentiment analysis, we reduce memory use in addition to training and inference time, with complexity decreasing from quadratic to linear, using only 10% of the parameters. The main contributions of this work are as follows.
• We introduce and evaluate an SP-Block that uses a sampling function and a short sequence of hidden states to attend to and compress a longer sequence. SP-Attention creates a sparse attention matrix that improves both computational and sample efficiency.
• We propose the Multimodal SP-Transformer that uses a concurrent structure of blocks in each sub-layer to allow multimodal signals to interact within every layer. SPT uses Input Attention on the source input sequence, Cross Attention on the hidden-state pairs of different modalities, and Self Attention on the hidden states of each modality.
• We leverage Factorized Co-Attention, which uses a factorized form of the attention computation based on an affinity matrix, to further reduce the number of parameters of the Cross Attention block (Co-SP). We further improve the efficiency of SPT by sharing parameters across all layers.
The code and data are publicly available at https://github.com/chengjunyan1/SP-Transformer.
Related work

Sentiment analysis Multimodal sentiment analysis involves leveraging the information from multiple modalities, e.g., text and vision, to recognize the polarity of expressed sentiment. Most of the existing work focuses on recognizing sentiment expressed in video recordings from social media reviewing products or movies (Zadeh et al., 2016; Kossaifi et al., 2019; Wöllmer et al., 2013).
Recent work on multimodal sentiment analysis has focused on the application of Transformer architectures. Tsai et al. (2019) introduce pairwise cross-modal attention on Transformers for multimodal sentiment analysis on audio, video, and text. Our model's architecture follows a similar design that first encodes unimodal inputs, then models cross-modal interactions, and finally fuses multimodal information. The main difference is that our model implements those steps in a concurrent way. Hazarika et al. (2020) project multimodal input into modality-invariant and modality-specific spaces, and use a Transformer encoder on the concatenated projected representations. Rahman et al. (2020) use Transformers like BERT (Devlin et al., 2018) and XLNet, pre-trained on a large corpus, and perform transfer learning for multimodal sentiment analysis.
Multimodal learning Previous work uses recurrent neural networks (RNN) or convolutional neural networks (CNN) on each modality and performs model-based fusion, e.g., kernel-based fusion with graphical models and neural networks, or model-agnostic fusion, e.g., early, late, or hybrid (Baltrušaitis et al., 2018). Wang et al. (2019) fuse multimodal representations with a Gated Modality-mixing Network, which models the fine-grained structure of non-verbal subword sequences, and a Multimodal Shifting mechanism, which dynamically shifts word representations based on non-verbal cues. Pham et al. (2019) train a sequence-to-sequence RNN to jointly perform inter-modality translation and sentiment analysis; the encoder output is a joint multimodal representation that is used for sentiment detection. Tensor fusion networks (TFN) use the outer product of the representations for each modality, concatenated with a constant value of "1", to generate a joint representation (Zadeh et al., 2017). Liu et al. (2018) propose decomposing the weights of the fusion layers into low-rank factors to reduce the large number of parameters in TFN. Other work uses a system of LSTMs to learn modality-specific interactions, learns the cross-modal interactions with an attention mechanism, and finally applies a Multi-view Gated Memory that fuses the multimodal representations through time. Alayrac et al. (2020) project multimodal features to fine-granularity and coarse-granularity "spaces" through Multi-Layer-Perceptron (MLP) networks. Audio and video are aligned with text features in the coarse-grained space, while audio and video features are aligned in the fine-grained space. Unlike the previous methods that directly apply transformations on the multimodal inputs, we use a small sequence of hidden states to capture the features from multimodal inputs, thus preserving raw input information while improving efficiency.
In the most similar work to ours, Jaegle et al. (2021) distill information from an input sequence into a fixed-length sequence of hidden states with an autoregressive Transformer.

Efficient Transformers
One of the drawbacks of the Transformer architecture is the computational cost and memory footprint of the self-attention mechanism. A number of efforts have been made to make Transformers more efficient (Child et al., 2019; Beltagy et al., 2020; Kitaev et al., 2020; Zhou et al., 2021; Zaheer et al., 2020). Such work uses sparse attention to selectively attend to pairs of elements, leading to a reduction in the memory use and computational complexity of the attention mechanism. A notable example, Performer (Choromanski et al., 2020), improves the efficiency of the attention computation through an unbiased, low-rank approximation of the attention matrix.
There are other attempts to reduce the sequence length to improve computational efficiency. Prior work introduces "segment-level recurrence" that recurrently processes the previous and current segments. Rae et al. (2020) extend this idea and compress multiple previous segments into memory vectors. The hidden states in our method are based on a similar idea: caching the information from the input sequence in a shorter sequence can improve efficiency. We also apply the idea of sparse attention to achieve further improvements.

Method
The proposed method is an extension of the Transformer architecture for improved efficiency in multimodal learning. In this section, we introduce the basic building block of our model, i.e., the Sparse Phased Block (SP-Block), and show how we extend it in the context of multimodal learning to define the SP-Transformer with Input Attention, Cross Attention, and Self Attention sub-layers. We also leverage Factorized Co-Attention and parameter sharing for further optimizations of the SP-Transformer.
Each SP-Block uses a sampling function to generate sparse attention patterns that guide a sequence to selectively compress information from elements of another sequence, as shown in Figure 2. We use the SP-Block to build our Multimodal SP-Transformer: in each layer, we first use an SP-Block that performs Input Attention for each modality and compresses information from each unimodal input sequence into the hidden states. The hidden-state sequences for each pair of modalities interact through Cross Attention in a Co-SP-Block using Factorized Co-Attention. Finally, the cross-modal information for each modality is fused by summation and distilled with Self Attention using an SP-Block. The full model is presented in Figure 3.
The Sparse Phased Block (SP-Block) accepts a sequence X and a sequence of hidden states h as input and uses a sampling function ψ to compress X into h by selective attention between the two. ψ creates a sparse attention mask for which h is used as the Query and X as the Key and Value.
Figure 2: SP-Block samples from a sequence X with hidden states h and a fixed sampling function with sampling length r = 2. The hidden states of h attend to at most five input states of X. In mixed sampling, the sampling interval for each hidden state shifts with a distance that depends on the layer (sliding), a periodic function with its index as parameter, and a random perturbation, which makes it sample in a "dynamic" way as opposed to static (fixed) sampling.

Sparse Phased Block
H is the number of heads, d_model is the dimension of the model and of the query sequence h, and d_X is the dimension of the key and value sequence X. FFN(x) = W_2 ReLU(W_1 x + b_1) + b_2 + x is a Feed-Forward Network (FFN) (Vaswani et al., 2017) with bias terms b_1, b_2 and a residual connection to x.
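As an illustration, the FFN above can be sketched in a few lines of NumPy (all dimensions and weights here are hypothetical toy values, not the model's):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = W2 ReLU(W1 x + b1) + b2 + x, i.e., a two-layer
    feed-forward network with a residual connection back to x."""
    hidden = np.maximum(W1 @ x + b1, 0.0)  # ReLU
    return W2 @ hidden + b2 + x            # residual connection

# Toy sizes: d_model = 4 for the input, 8 for the hidden layer.
rng = np.random.default_rng(0)
d_model, d_ff = 4, 8
W1, b1 = rng.normal(size=(d_ff, d_model)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_model, d_ff)), np.zeros(d_model)
x = rng.normal(size=d_model)
y = ffn(x, W1, b1, W2, b2)  # same dimensionality as x
```

Because of the residual term, zeroing all weights reduces the block to the identity, which is the usual sanity check for this layout.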
The SP-Block is illustrated in Figure 2. For each block, we apply layer normalization prior to the FFN and the attention function. The sampling function ψ can define a single interval or multiple intervals in X and is used to compute a boolean attention mask G. Every element h_i ∈ h attends to an element X_j ∈ X only if j ∈ ψ(i), i.e., G_ij = 1_ψ(i)(j). We experiment with four sampling functions that use Sliding, Periodic, Fixed, and Random attention patterns.
The sampling function maps every hidden state h_i to an interval in X with a sampling length r ∈ N; the offset is φ_f = 0 for the Fixed sampling function. A convolution operation is similar to the Fixed sampling shown in Figure 2. The Sliding sampling function shifts the interval at every layer λ such that φ_s(i) = αλ with magnitude α. The Periodic sampling function maps h_i to multiple intervals that periodically span X such that φ_p(i) = L_X sin(β × i) with magnitude β. The Random sampling function defines a random interval in X such that φ_r = U(−γ, +γ) with window γ. Mixed sampling combines the above patterns such that φ_m = φ_s + φ_p + φ_r and improves performance compared with each sampling function individually. An example of an attention mask generated by a sampling function is visualized in Figure 1.
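The way a sampling function induces the boolean mask G can be sketched as follows (the anchor-spacing rule and the offset functions φ are illustrative assumptions, not the exact definitions from the paper):

```python
import numpy as np

def make_mask(L_h, L_X, r, phi):
    """Boolean mask G of shape (L_h, L_X): hidden state h_i attends
    to X_j only inside an interval of length r + 1 whose start is
    shifted by the offset function phi."""
    G = np.zeros((L_h, L_X), dtype=bool)
    stride = L_X // L_h  # spread the anchors evenly over X (assumption)
    for i in range(L_h):
        start = int(np.clip(i * stride + phi(i), 0, L_X - 1))
        G[i, start:start + r + 1] = True
    return G

# Illustrative offsets mirroring the paper's phi_f, phi_s, phi_r:
phi_fixed = lambda i: 0                            # Fixed: no shift
phi_slide = lambda i, alpha=1, lam=2: alpha * lam  # Sliding: shift by layer lam
phi_rand = lambda i, gamma=2: np.random.randint(-gamma, gamma + 1)  # Random

G = make_mask(L_h=4, L_X=16, r=2, phi=phi_fixed)
# 12 active entries versus 64 for full attention between h and X.
```

Each row of G has at most r + 1 active entries, so the number of attention computations grows with L_h · r rather than L_h · L_X.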
Efficient sparse computation on GPUs is known to be challenging (Zaheer et al., 2020). The sparse pattern introduced here can be optimized for GPUs, similar to previous work (Beltagy et al., 2020; Child et al., 2019; Zaheer et al., 2020), with an additional speed-up from custom CUDA kernels (Beltagy et al., 2020; Child et al., 2019). In our experiments, we do not consider custom CUDA kernels, as discussed in Section 5.

Multimodal Sparse Phased Transformer
The Multimodal SP-Transformer is a stack of SP-Transformer layers composed of Input Attention, Cross Attention, and Self Attention sub-layers applied in that order. All blocks are identical in computation and are defined by the SP-Block. An SP-Transformer layer accepts multiple sequences X_m, where m ∈ M, the set of all modalities, as well as hidden states h_m^λ for each modality at layer λ, and outputs updated hidden states h_m^{λ+1}. The first layer uses learnable embeddings h_m^0. At every layer, Input Attention attends to the original signal X_m with the hidden states from the previous layer h_m^λ to compute updated hidden states for modality m, ĥ_m^{λ+1}. Cross Attention is applied to the outputs of the Input Attention blocks of different modalities. For every modality, Cross Attention attends to the hidden states between m → m′, ∀m′ ∈ M\{m}. We extend Cross Attention to the Co-SP-Block that shares the parameters for attention between the hidden states of modalities m → m′ and m′ → m. We describe the Co-SP-Block in detail later in this section. Finally, we sum the Cross Attention hidden states for modality m → ∀m′ and apply Self Attention to the resulting vector, which represents the hidden states learned for modality m, defined as h_m^{λ+1}. The output of this layer can be fed to another layer or used in a downstream task. The architecture is illustrated in Figure 3.
Input Attention, which compresses each unimodal input sequence into hidden states, is defined as follows: ĥ_m^{λ+1} = SP-Block(h_m^λ, X_m). Cross Attention models cross-modal interaction between the hidden-state sequences of two modalities, as follows.
Self Attention refines the representation, fusing the cross-modal information for each modality.
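The concurrent flow of one layer through these three sub-layers can be sketched schematically; `sp_block` below is a deliberately simplified stand-in (a mean-pool plus residual) for the real sparse-attention block, and all shapes are toy values:

```python
import numpy as np

def sp_block(h, X):
    """Stand-in for SP-Block(h, X): the real block compresses X into h
    with sparse attention + FFN; here we just pool X into h."""
    return h + X.mean(axis=0, keepdims=True)

def spt_layer(hidden, inputs):
    """One concurrent SP-Transformer layer; hidden/inputs map each
    modality name to an array of shape (length, d_model)."""
    # 1) Input Attention: compress each unimodal input into hidden states.
    h_hat = {m: sp_block(hidden[m], inputs[m]) for m in hidden}
    # 2) Cross Attention: each modality attends to every other modality.
    cross = {m: [sp_block(h_hat[m], h_hat[m2]) for m2 in h_hat if m2 != m]
             for m in h_hat}
    # 3) Fuse pair-wise states by summation, then apply Self Attention.
    out = {}
    for m in h_hat:
        fused = np.sum(cross[m], axis=0)
        out[m] = sp_block(fused, fused)
    return out

inputs = {m: np.ones((4, 8)) for m in ("audio", "video", "text")}
h0 = {m: np.zeros((2, 8)) for m in inputs}  # learnable h^0 in the paper
h1 = spt_layer(h0, inputs)  # hidden states stay at length 2, not 4
```

The point of the sketch is the data flow: every modality's hidden states are updated from its raw input and from all other modalities within a single layer, and the hidden sequences stay short regardless of the input length.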
The use of hidden states is the main difference between our method and previous Transformer-based methods (Tsai et al., 2019; Rahman et al., 2020; Hazarika et al., 2020). Information from longer input sequences is absorbed by a shorter hidden-state sequence iteratively at every layer, instead of only at the first layer (Rahman et al., 2020; Tsai et al., 2019; Huang et al., 2020; Pan et al., 2020) or recurrently on segments of the input sequence (Rae et al., 2020). In our experiments, we constrain the length of a hidden-state sequence to L_{h_m} = L_{X_m}/S, where S is a hyperparameter that controls the compression ratio.
Previous work (Tsai et al., 2019; Rahman et al., 2020; Huang et al., 2020; Pan et al., 2020) applies a set of sub-layers serially multiple times, with the output of one stack as input to the next. We perform experiments on a Serial Structure that applies Input Attention sub-layers ×N → Cross Attention sub-layers ×N → Self Attention sub-layers ×N, available in Appendix D. For each modality, a Concurrent Structure applies Input Attention blocks → Cross Attention blocks → Self Attention blocks, with interaction happening only in the Cross Attention sub-layers, and repeats the same process N times. SPT uses the concurrent structure, which fuses cross-modal information with summation. We also experiment with a variant of the concurrent structure that uses concatenation (see Appendix C).

Optimized Sparse Phased Blocks
The Cross Attention sub-layer models bimodal interaction. We would need two SP-Blocks to model the interactions between modalities A → B and B → A, so the number of SP-Blocks required to model pair-wise interactions grows quadratically with the number of modalities. Factorized Co-Attention reduces the number of parameters by half by sharing one SP-Block for a given pair of modalities, without accuracy loss. We extend the idea of Co-Attention (Lu et al., 2016) to Factorized Co-Attention, where an affinity matrix C represents the distance between two sequences, and the Co-SP-Block applies Factorized Co-Attention and shares the trainable parameters (FFN, W^Q, W^K, W^V, and W^O) between two SP-Blocks. We omit the multi-head notation for clarity.
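A hedged sketch of the sharing idea: a single bilinear affinity matrix C = h_a W h_b^T (in the spirit of Lu et al., 2016) can serve both attention directions, so one weight set covers m → m′ and m′ → m. The exact factorization inside the Co-SP-Block may differ; this only illustrates the parameter sharing:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(h_a, h_b, W):
    """One affinity matrix, two directions: row-softmax gives a->b
    attention, column-wise (via the transpose) gives b->a, and the
    weight matrix W is shared between the two."""
    C = h_a @ W @ h_b.T                   # (L_a, L_b) affinities
    a_reads_b = softmax(C, axis=1) @ h_b  # each a-state summarizes b
    b_reads_a = softmax(C.T, axis=1) @ h_a
    return a_reads_b, b_reads_a

d = 8
rng = np.random.default_rng(1)
W = rng.normal(size=(d, d)) / np.sqrt(d)  # toy shared parameters
h_a, h_b = rng.normal(size=(3, d)), rng.normal(size=(5, d))
out_ab, out_ba = co_attention(h_a, h_b, W)
```

Since W is reused in both directions, the parameter cost of a modality pair is that of a single block, matching the halving described above.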
The number of modality pairs in cross attention grows quadratically (i.e., there are |M|(|M| − 1) pair-wise hidden-state sequences in a Cross Attention sub-layer). Previous work (Tsai et al., 2019; Hazarika et al., 2020) concatenates the representations of modalities to fuse cross-modal information. In contrast, we add the pair-wise hidden states for each modality, which reduces both the size and the complexity of the model.
The SP-Transformer also shares parameters across layers, which has been shown to be effective by Lan et al. (2020) and Jaegle et al. (2021). We ran an ablation experiment for all parameter-sharing patterns; the results are presented in Section 4.3.
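The arithmetic behind layer-wise sharing can be illustrated with toy numbers (block counts and sizes here are hypothetical, not the paper's):

```python
# Each SP-Block owns one set of weights of size p (FFN + projections).
p_per_block = 10_000
blocks_per_layer = 9  # e.g., 3 input + 3 cross + 3 self for 3 modalities (assumption)
n_layers = 4

# Without sharing, every layer owns its own copies of every block.
unshared = p_per_block * blocks_per_layer * n_layers
# Layer-wise sharing reuses one set of blocks at every layer.
shared = p_per_block * blocks_per_layer

reduction = 1 - shared / unshared  # 75% fewer parameters in this toy setup
```

With N layers, layer-wise sharing divides the block-parameter count by N, which is why deeper configurations benefit the most.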

Experiment
We introduce the experimental setup, baseline methods and datasets. We present the results and additional evaluations through an ablation study.

Experimental setup
We evaluate our model on three multimodal sentiment and humor analysis datasets, namely, CMU-MOSI (Zadeh et al., 2016), CMU-MOSEI, and UR-FUNNY. For UR-FUNNY, we compare against the reported state-of-the-art for the GloVe features on text. The BERT features reported by the same work are not publicly available and require manual extraction and verification. There is work that achieves higher performance on the aforementioned datasets but does not publish the preprocessed data or code (Sun et al., 2020; Hasan et al., 2021), so it is not possible to directly compare with those methods.
We perform a grid search for some of the hyperparameters, consistent with previous work (Tsai et al., 2019;Rahman et al., 2020;Hazarika et al., 2020;Sun et al., 2020), and empirically select the remaining ones. Our hyper-parameter settings and optimization strategy are available in Appendix B.

Results
Our method achieves comparable results on the MOSI dataset and superior results on the MOSEI and UR-FUNNY datasets. Sample efficiency measures how well a model leverages the information in each training sample. We follow a methodology similar to previous work (Khandelwal et al., 2019) and use multiple identical training subsets from the unaligned CMU-MOSEI dataset to compare the sample efficiency of our model and MulT. Even though the improvement is marginal, our model uses a significantly lower number of parameters and has a higher sample efficiency.
Figure 4: Sample efficiency test on the unaligned CMU-MOSEI dataset in comparison with MulT. We gradually increase the size of the training set and use the same training set for both models for consistency.
We compare SPT ("Ours"), with layer-wise parameter sharing, mixed sampling, and summation for cross-modal interactions, with other state-of-the-art methods. Our model uses 154K trainable parameters, a reduction of 90% compared with Tsai et al. (2019) and 97% compared with Hazarika et al. (2020). Detailed results are listed in Table 1. The reduction in parameters can also explain the improved sample efficiency of our model, as shown in Figure 4.

Additional Experiments
We perform an ablation study on the SP-Transformer and quantitatively evaluate the memory use, inference time, and training time of our model. The results of the ablation study are available in Table 2.
Ablation experiment on network structure We modify and experiment with the structure of the SP-Transformer described in Section 3.2. We experiment on multimodal interactions with a Serial model and two variations of the Concurrent model, with summation ("Ours") and concatenation on the output of the Cross Attention sub-layer. Summation uses half the parameters of concatenation with nearly identical accuracy. The concurrent structure improves accuracy compared with the serial structure, which could be due to the richer multimodal interactions at every layer.
Parameter sharing We analyze the influence of parameter-sharing strategies on model performance. We perform experiments on models that do not share parameters across layers ("Layer NS") or within the cross attention sub-layer ("Cross NS").
Our results indicate that parameter sharing can decrease the model size by 71% with negligible impact on model accuracy. Layer-wise parameter sharing improves performance, which could be because sharing reduces the risk of over-fitting. This is in accordance with the results reported by Jaegle et al. (2021).
We test two additional sharing strategies: sharing parameters between identical block types for the same modality ("Modal S") and for all SP-Blocks ("All Share"), across all layers and sub-layers. Due to the difference in dimensionality between the modality sequences, we use a linear projection to map the audio, video, and text inputs to d_model. The results show that further sharing reduces the size of the model by 70% compared with our model, with a 1.5% relative reduction in accuracy. The trade-off between accuracy and model efficiency can be adjusted depending on the use case. These results demonstrate that, in our approach, parameter sharing has a small effect on model accuracy.

Sampling Function
We perform experiments on the five attention mask patterns introduced in Section 3.1. The "Ours" model uses a Mixed sampling function, and we experiment with the "Fixed", "Slide", "Period", and "Random" sampling functions applied independently, as well as "Full attn." as introduced by Vaswani et al. (2017). A full attention mask is significantly slower but outperforms the other sampling functions when used in isolation. The mixed sampling of "Ours" combines the Slide, Period, and Random masks and outperforms full attention. At every layer, multiple hidden states attend to the original input sequence over different intervals (sliding sampling) or overlapping intervals (periodic sampling), with a regularization effect (random sampling). The structured sparsity of mixed sampling can introduce an inductive bias that allows the hidden states to learn compressed representations from the longer input signal.
Unimodal experiment We train SPT on the CMU-MOSEI unaligned dataset on a single modality of text. Results for the unimodal setting use a suffix "U" and can be found in Table 2.
We compare SPT with the results reported by MulT ("U") in a unimodal setting. We also train a model that replaces the Transformer block in MulT with the SP-Block ("MulT-SP"). SPT ("U") uses an Input Attention block followed by a Self Attention block in each layer. The results show a substantial difference in performance for the SP-Block. The difference between MulT-SP and unimodal SPT is the downsampling by Conv1D as opposed to the compression by the Input Attention block. The advantages of the SP-Block lead to a 3.3% increase in performance and an 89% reduction in parameters.
Comparison with Performer To compare our method with state-of-the-art efficient Transformer-based architectures, we compare with Performer (Choromanski et al., 2020), using the same architecture as MulT with Performer layers, in both the multimodal ("MulP") and unimodal ("MulP (U)") settings. The results indicate that our method improves efficiency without loss of accuracy, unlike Performer.

Results show that our model achieves linear complexity O(rL/S) in both memory use and inference time with respect to the sequence length L, with a slope determined by the compression ratio S and the segment length r. The improvement is a result of the downsampling from Input Attention, the sparse attention from the sampling function, and a simplified model structure.
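The linear O(rL/S) scaling can be sanity-checked by counting active attention entries (the concrete r, S, and lengths below are illustrative):

```python
def sp_entries(L, S, r):
    """Active entries under SP-attention: about L/S hidden states,
    each attending to an interval of roughly r input elements."""
    return (L // S) * r

def full_entries(L):
    """Full self-attention touches every pair of positions."""
    return L * L

# Doubling L doubles the sparse cost but quadruples the full cost.
sparse_growth = sp_entries(2048, 8, 4) / sp_entries(1024, 8, 4)  # 2.0
full_growth = full_entries(2048) / full_entries(1024)            # 4.0
```

This is the slope behavior described above: the sparse cost grows linearly in L with coefficient r/S, while full attention grows quadratically.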
We test training time on the unaligned CMU-MOSI, CMU-MOSEI, and UR-FUNNY datasets. All experiments use the largest batch size that can be executed on a single NVIDIA Tesla V100 with 16GB vRAM. Results are listed in Table 3. With a compression factor S = 8, our method reduces

Future work
Sampling function Other work explores complex sparse patterns, like the dilated sliding window (Beltagy et al., 2020), which allows the segment to be "dilated" with gaps between sampled elements; routing-based sampling (Kitaev et al., 2020), which samples the nearest neighbors of the hidden states in the sequence; probabilistic sampling (Zhou et al., 2021), which samples based on KL divergence; global blocks (Zaheer et al., 2020), which allow a few elements to sample the entire sequence; or trainable parameters (Tay et al., 2020; Neil et al., 2016; Hu and Qi, 2017; Mei and Eisner, 2017), which enable the model to learn which elements to sample. We consider such sampling patterns for future work. Hidden states are randomly initialized; an inductive bias could be introduced to further improve performance and sample efficiency. Melis et al. (2020) warm up the hidden state in RNNs, allowing the initial hidden state to interact with the input before being used by the model.
Implementation Other methods rely on custom CUDA kernels (Beltagy et al., 2020; Child et al., 2019) to further optimize sparse computation on GPUs. Our work does not implement any custom CUDA kernels and thus only achieves memory advantages from the sparse pattern. We expect that specialized GPU optimization would further improve our efficiency. Moreover, further optimization could be achieved by incorporating the factorization method from Choromanski et al. (2020).

Conclusions
In this paper, we propose the multimodal SP-Transformer, which uses a sequence of hidden states to sample from longer multimodal input sequences. Compared with previous Transformer-based models, our model has reduced computational complexity through sparse attention. The concurrent structure also enables more effective capturing of multimodal interactions, resulting in higher performance. Optimization through parameter-sharing patterns leads to a significantly lighter model, with fewer parameters and improved sample efficiency. As a result, the proposed model's performance is superior or comparable to existing methods at a lower computational and memory cost. Our experiments show that our method strikes a good balance between accuracy and efficiency and has the potential to be deployed in real-world multimodal applications.
Table 4: Statistics for the pre-processed CMU-MOSI and CMU-MOSEI datasets from Tsai et al. (2019). For each column, "A" represents aligned and "UA" represents unaligned. "A", "T", "V" in each row respectively represent the average sequence lengths for these modalities. "Train", "Test", "Valid" represent the number of data points in each split. The sequences are pre-truncated and padded, which makes all samples from one modality have the same length.

We use the Optuna hyper-parameter optimization framework (Akiba et al., 2019) to perform a grid search over hyper-parameters. The optimized hyper-parameters, along with their search ranges, step sizes, and distributions, are listed in Table 6. r, α, β, γ for the three sub-layers are optimized independently with the same settings. Dropout rates for the attention layer, FFNs, and embedding layer are optimized independently with the same settings.
The number of attention heads is fixed at 8, and d_model is fixed at 32. The number of epochs for CMU-MOSI, CMU-MOSEI, and UR-FUNNY is 100, 50, and 100, respectively. We use the maximum batch size that fits in the memory of a single NVIDIA Tesla V100. Gradient clipping is applied at norms of 0.8, 1.0, and 1.0 for CMU-MOSI, CMU-MOSEI, and UR-FUNNY, respectively. We use Adam for optimization with the default hyper-parameters from PyTorch, and a plateau learning-rate scheduler that decreases the learning rate by a factor of 10 when the validation performance plateaus, with a patience of 20 epochs.