QAP: A Quantum-Inspired Adaptive-Priority-Learning Model for Multimodal Emotion Recognition

Multimodal emotion recognition for video has gained considerable attention in recent years, involving three modalities (i.e., textual, visual and acoustic). Because the modalities carry different amounts of emotion-related information, they typically contribute to emotion recognition to varying degrees. More seriously, the emotion of an individual modality may be inconsistent with that of the whole video. These challenges stem from the inherent uncertainty of emotion. Inspired by recent advances of quantum theory in modeling uncertainty, we make an initial attempt to design a quantum-inspired adaptive-priority-learning model (QAP) to address them. Specifically, quantum states are introduced to model modal features, which allows each modality to retain all emotional tendencies until the final classification. Additionally, we design Q-attention to integrate the three modalities in order, and QAP then learns modal priority adaptively so that modalities can provide different amounts of information based on priority. Experimental results on the IEMOCAP and CMU-MOSEI datasets show that QAP establishes new state-of-the-art results.


Introduction
Multimodal emotion recognition (MER) has attracted increasing interest due to the rapid growth of multimedia information. MER aims to recognize the emotions of the speaker in a video. Multiple modalities enrich human emotional expression, and all of them are closely related to emotion. Generally, the textual modality provides the basic semantic information, the visual modality provides emotional expressions, and the acoustic modality provides the changing tone.
The three modalities also bring greater challenges to emotion recognition. Due to the different amounts of emotion-related information, the priority of each modality varies from sample to sample. If modalities are not discriminated during fusion, emotion-related information cannot be fully extracted. In the example on the left of Figure 1 (a), the dejected expression, wrinkled eyebrows and drooping corners of the eyes all show anger and disgust, so the visual modality contributes more to emotion. In the example on the right, a rising tone shows the emotion of happiness, so the acoustic modality has higher priority than the visual modality. Most previous works (Tsai et al., 2019; Akhtar et al., 2019; Chauhan et al., 2020) treat modalities equally and do not pay attention to the important role of modal priority. Some other works (Li et al., 2022) integrate modalities in a certain order, but the order is not adaptively adjusted for different samples. In practical scenarios, a fixed order cannot fit all samples.
More seriously, the emotion of an individual modality may be inconsistent with that of the video. In the example in Figure 1 (b), happiness is expressed in the textual modality, but the emotion of the video is anger. Some previous methods (Sun et al., 2022; Yang et al., 2022; Yuan et al., 2021) do not consider this issue and still integrate the three modalities together, resulting in a negative impact on the final emotion. Some other methods (Mittal et al., 2019) remove the modality with inconsistent emotion and replace it with other features, which loses the semantic information contained in that modality.
As part of human cognition, emotion is always in an uncertain state and constantly evolving until the final decision is made (Busemeyer and Bruza, 2012). Specifically, the emotion of a video is considered to be uncertain until it is measured and collapses to an eigenstate, and so is the emotion of each individual modality. Conceptually, in non-quantum models, the emotions in a video are pre-defined values and a measurement (classification) merely records them. In other words, the three modalities are always aligned to a certain emotional label throughout the entire process before recognition. However, the generation of emotions is often spontaneous and intuitive, so the cognitive system is fundamentally uncertain and in an indefinite state. In quantum-like frameworks, the emotion of each modality is treated as an indefinite state (Busemeyer and Bruza, 2012). The final quantum measurement creates a definite state and changes the state of the system. Despite these advantages over previous models, it is challenging to complete the MER task in a quantum-like framework due to complex processes such as feature extraction and modal fusion. Technically, we must ensure that the model conforms to the evolution process of a quantum system and that the characteristics of the density matrix remain unchanged.
Inspired by the excellent performance of quantum-like networks in other tasks (Jiang et al., 2020; Liu et al., 2021; Li et al., 2019, 2021), we propose a quantum-inspired adaptive-priority-learning model (QAP) for MER. QAP uses quantum states instead of traditional feature vectors to represent modal features, from the initial feature extraction step to the final emotion classification step. Modal features in each step no longer correspond solely to the final emotional label, but remain in a state where emotion is uncertain. In this way, the opposite emotion of a single modality will not affect the final emotion, because all modalities are in an uncertain state with multiple emotions.
In MER, effectively extracting features from the raw modalities is an inherent problem. Previous works either use features pre-extracted with hand-crafted algorithms or extract end-to-end features with pre-trained models, but these two kinds of features have not been effectively combined. In QAP, the complex-valued density matrix is used as the unit of modal representation due to its stronger representation ability (Balkır, 2014). By this means, end-to-end features and pre-extracted features are effectively combined in a non-linear way.
For fusion in the quantum-like framework, Q-attention based on the density matrix is designed to integrate the three modalities in order. After that, since the three modalities can form several fusion orders, we use a quantum measurement operator to select the most appropriate order. In this way, QAP can learn modal priority adaptively. Finally, we use another quantum measurement operator to collapse all states in the density matrix to the pure states representing emotions, in order to recognize the emotion.
The main contributions of our paper are as follows:
• We propose QAP, a quantum-inspired adaptive-priority-learning model for multimodal emotion recognition, where each modality remains in a state of uncertain emotion, so that modalities with different emotions can be integrated.
• QAP utilizes the density matrix to represent modal features, so that the two kinds of features can be combined effectively. Based on the density matrix, we design Q-attention to integrate modalities in order of priority and utilize a quantum measurement operator to select the fusion order, so QAP can adaptively learn modal priority.

Related Work
MER has attracted more and more attention, and many methods have been proposed to integrate modalities. Direct concatenation and the outer product (Zadeh et al., 2017) were used as fusion methods in the early years. Zadeh et al. (2018) then proposes a method based on recurrent neural networks with a gate mechanism. In recent years, models based on the attention mechanism (Vaswani et al., 2017; Tsai et al., 2019) have been applied to MER and followed by later works. Rahman et al. (2020) proposes an attachment that enables pre-trained models to integrate multimodal information. Zhang et al. (2020) models the dependencies between labels and between each label and the modalities for multi-label MER. Hu et al. (2022) presents a graph-based network to capture multimodal features and contextual dependencies. These works treat modalities equally and do not pay attention to modal priority. Li et al. (2022) integrates the three modalities in a certain order, but cannot adaptively learn modal priority. In addition, end-to-end models (Dai et al., 2021; Wei et al., 2022; Wu et al., 2022) have been proposed to make better use of the raw modal information. However, they introduce noise irrelevant to emotion and also ignore the importance of modal priority. The issues of inconsistent emotions and differentiated contributions have not been resolved in the above works, which negatively affects model performance. In contrast, our approach can adaptively learn modal priority, and modalities with more emotional information make a greater contribution.

Quantum-inspired or quantum-like models have performed well in different tasks. Sordoni et al. (2013) first applies the quantum-like model to the field of information retrieval. Li et al. (2019) and Zhang et al. (2018) design quantum language models for the text matching task. Li and Hou (2021) combines a quantum-like model with a convolutional neural network and obtains promising results in sentiment analysis. Gkoumas et al. (2021b) proposes the first quantum-like model for multimodal sentiment analysis, which is a decision-level fusion framework. Liu et al. (2021) uses quantum interference to integrate the textual and visual modalities. Gkoumas et al. (2021a) introduces the concept of quantum entanglement to multimodal fusion, and Li et al. (2021) designs a quantum-like recurrent neural network to model context information. All these works show that quantum-inspired networks have advantages in modeling human cognitive uncertainty. However, their modal fusion modules are too simple to fully capture inter-modality information. Besides, integrating three modalities in a quantum-like framework is a challenging task, and we make an initial attempt in this field to enable modalities with opposite emotions to be integrated effectively.

Preliminaries on Quantum Theory
The construction of a quantum-inspired model is based on quantum theory (QT) (Fell et al., 2019; Busemeyer and Bruza, 2012). In this section, we briefly introduce the basic concepts of QT. A state vector in QT is defined on a Hilbert space H, a complete inner product space over the complex field. With Dirac notation, we denote a complex unit vector u as a ket |u⟩, and its conjugate transpose u^H as a bra ⟨u|. The inner product and outer product of two state vectors |u⟩ and |v⟩ are denoted as ⟨u|v⟩ and |u⟩⟨v|, respectively.

State
A quantum state |ψ⟩ is a complete description of a physical system and is a linear superposition of an orthonormal basis in the Hilbert space. The state of a system composed of a single particle is called a pure state. The mathematical form of |ψ⟩ is a complex column vector.
A pure state can also be expressed as a density matrix: ρ = |ψ⟩⟨ψ|. When several pure states are mixed together in the way of classical probability, we use a mixed state to describe the system. The density matrix can also represent a mixed state: ρ = Σ_{i=1}^{n} p_i |ψ_i⟩⟨ψ_i|, where p_i denotes the probability of each pure state and Σ_{i=1}^{n} p_i = 1. In MER, one modality is composed of several tokens, and each token can be regarded as a particle. Therefore, we use the density matrix to represent modal features, which can be viewed as mixed states.
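As a concrete illustration (not part of the model itself), a mixed state built from two hypothetical token states can be checked against the defining properties of a density matrix; the states, dimension and mixing weights below are illustrative:

```python
import numpy as np

# Two normalized pure states (complex column vectors) standing in for
# token features; the vectors and probabilities are illustrative only.
psi1 = np.array([1.0, 0.0], dtype=complex)
psi2 = np.array([1.0, 1.0], dtype=complex) / np.sqrt(2)

# Mixed state: rho = sum_i p_i |psi_i><psi_i| with the p_i summing to 1.
p = [0.7, 0.3]
rho = p[0] * np.outer(psi1, psi1.conj()) + p[1] * np.outer(psi2, psi2.conj())

# A density matrix is Hermitian, positive semi-definite, and has unit trace.
assert np.allclose(rho, rho.conj().T)
assert np.isclose(np.trace(rho).real, 1.0)
assert np.linalg.eigvalsh(rho).min() >= -1e-12
```

Any convex combination of pure-state projectors passes these checks, which is why the density matrix can represent a whole modality of tokens at once.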

Evolution
In QT, a state does not remain unchanged but evolves over time. The evolution is described by a unitary operator U, a complex unitary matrix satisfying UU^H = I. The evolution process is as follows: ρ' = UρU^H. It can be proved that ρ' is also a density matrix as long as ρ is a density matrix. We draw an analogy between this evolution process and the linear transformation of a density matrix.
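The claim that ρ' = UρU^H is again a density matrix can be verified numerically; the random unitary construction via QR decomposition and the example state below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random unitary U via QR decomposition (illustrative construction).
A = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
U, _ = np.linalg.qr(A)
assert np.allclose(U @ U.conj().T, np.eye(3))

# An example density matrix: a mixed state over the computational basis.
rho = np.diag([0.5, 0.3, 0.2]).astype(complex)

# Evolution rho' = U rho U^H preserves Hermiticity, positivity and trace.
rho_evolved = U @ rho @ U.conj().T
assert np.isclose(np.trace(rho_evolved).real, 1.0)
assert np.allclose(rho_evolved, rho_evolved.conj().T)
assert np.linalg.eigvalsh(rho_evolved).min() >= -1e-12
```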

Measurement
Quantum measurement causes a pure state to collapse onto a basis state with some probability. The measurement process is described by an observable M = Σ_j λ_j |m_j⟩⟨m_j|, where {|m_j⟩} are the eigenstates of the operator and form an orthonormal basis of the Hilbert space, and {λ_j} are the corresponding eigenvalues. According to Born's rule (Halmos, 2017), the probability of the pure state |ψ⟩ collapsing onto the basis state |m_j⟩ is p_j = ⟨m_j|ρ|m_j⟩ = tr(ρ|m_j⟩⟨m_j|), where ρ = |ψ⟩⟨ψ|. For a mixed state, the probability of collapsing onto an eigenstate is the weighted sum of the probabilities of all its pure states. We exploit quantum measurement to calculate the weights of different fusion orders and to recognize the final emotions.
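Born's rule for a density matrix can be sketched in a few lines; the 2-dimensional state and the standard measurement basis below are illustrative:

```python
import numpy as np

# Orthonormal measurement basis (eigenstates of the observable); here the
# standard basis of a 2-dimensional space for illustration.
basis = [np.array([1.0, 0.0], dtype=complex),
         np.array([0.0, 1.0], dtype=complex)]

# An example mixed state rho (Hermitian, PSD, unit trace).
rho = np.array([[0.7, 0.2], [0.2, 0.3]], dtype=complex)

# Born's rule for a density matrix: p_j = tr(rho |m_j><m_j|).
probs = [np.trace(rho @ np.outer(m, m.conj())).real for m in basis]
# probs == [0.7, 0.3]: the diagonal entries of rho in this basis.

# Probabilities over a complete orthonormal basis sum to tr(rho) = 1.
assert np.isclose(sum(probs), 1.0)
```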

Model
In this section, we will describe the details of QAP.
The overall architecture of QAP is shown in Figure 2. QAP consists of three modules: Unimodal Complex-valued Representation, Adaptive Priority Learning and Emotion Recognition. Firstly, the complex-valued density matrix of each single modality is constructed for modal representation (Section 4.1).
In the representation, end-to-end features and pre-extracted features are aligned to the amplitude and the phase of the complex values, respectively. Secondly, Q-attention is designed to integrate the three modalities in order, and a quantum measurement operator is used to select the appropriate order, so that QAP can learn modal priority adaptively (Section 4.2). Finally, another measurement operator is employed to recognize the final emotion (Section 4.3).

Unimodal Complex-valued Representation
Early works (Zadeh et al., 2018; Zeng et al., 2021) usually extract features with hand-crafted algorithms, but these pre-extracted features cannot be further fine-tuned on different tasks and generalize poorly. In recent years, some methods (Dai et al., 2021; Wei et al., 2022) utilize pre-trained models to extract more modal information and can be fine-tuned on different tasks. However, fully end-to-end models may introduce noise, such as the part of an image outside the face. Such noise causes semantic drift and affects the judgment of video emotion.
To alleviate this problem, we combine the two kinds of modal features through a complex-valued representation. A complex value can be expressed in polar form: z = re^{iθ}, where r is the amplitude and θ is the phase (argument). So a pure state can be expressed as |ψ⟩ = r ⊙ e^{iθ}, (4) where ⊙ is the element-wise product. By formula (4), a pure state can be decomposed from a complex vector into two real vectors: r = [r_1, r_2, ..., r_n]^T and θ = [θ_1, θ_2, ..., θ_n]^T. So we just need to construct these two real vectors. On the whole, the end-to-end feature is used as r, and the pre-extracted feature is used as θ.
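A small sketch of formula (4): an amplitude vector (standing in for the end-to-end feature) and a phase vector (standing in for the pre-extracted feature) combine into one complex pure state; the two-dimensional vectors are illustrative:

```python
import numpy as np

# End-to-end feature as unit-normalized amplitude r, pre-extracted feature
# as phase theta; the values and dimension are illustrative only.
r = np.array([0.6, 0.8])            # ||r||_2 = 1
theta = np.array([0.1, 1.2])

# Pure state |psi> = r ⊙ e^{i theta}, element-wise (formula (4)).
psi = r * np.exp(1j * theta)

# The phase does not change the norm, so |psi> is still a unit vector ...
assert np.isclose(np.linalg.norm(psi), 1.0)
# ... and the outer product rho = |psi><psi| has unit trace.
rho = np.outer(psi, psi.conj())
assert np.isclose(np.trace(rho).real, 1.0)
```

This is why the two feature types can share one representation without interfering: the amplitude carries the magnitude information while the phase is norm-preserving.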
We use pre-trained models to extract end-to-end features. ALBERT-base-v2 (Lan et al., 2019) is used for the textual modality. We obtain the last hidden layer representation and project it to the Hilbert space with a linear layer: r̃_t = W_t · ALBERT(T) + b_t, where W_t and b_t are parameters. Then we normalize the outputs: r_t = r̃_t / ||r̃_t||_2. VGG (Simonyan and Zisserman, 2014) is used for the visual and acoustic modalities. After the same processing as for the textual modality, we obtain r_v and r_a.
Pre-extracted features are obtained by hand-crafted algorithms for the visual (OpenFace2 (Baltrusaitis et al., 2018)) and acoustic (openSMILE (Eyben et al., 2010)) modalities. Motivated by previous work (Akhtar et al., 2019) showing that the sentiment polarity of words helps emotion recognition, we exploit a sentiment dictionary (Baccianella et al., 2010; Miller, 1995) to make use of sentiment polarity for the textual modality. Due to its advantage in capturing long-distance dependencies, a Transformer Encoder is used to encode these pre-extracted features.
The modal pure states |ψ_t⟩, |ψ_v⟩, |ψ_a⟩ are constructed by formula (4), and the density matrices ρ_t, ρ_v, ρ_a are obtained by the outer product.

Adaptive Priority Learning
There are six possible fusion orders for three modalities. Based on the experimental results of previous work, the textual modality usually contributes the most. Considering the computational cost, we only use two orders in our implementation: textual-visual-acoustic (t-v-a) and textual-acoustic-visual (t-a-v).
Taking the t-v-a order as an example, t and v are integrated first by Q-attention. The main process of Q-attention is shown in Figure 3. t is the basis, and the v modality is to be added. ρ_t is fed into two Q-Linear layers to output K and V respectively, and ρ_v is fed into another Q-Linear layer to output Q. Q-Linear is a linear layer designed for the density matrix, analogous to quantum evolution: K = U_1 ρ_t U_1^H, V = U_2 ρ_t U_2^H, Q = U_3 ρ_v U_3^H, where U_1, U_2, U_3 are unitary matrices, so K, V and Q are also density matrices. For pure states (vectors), attention scores can be calculated by the inner product, but this cannot be directly applied to mixed states (density matrices). To solve this problem, we calculate the trace of the product of two density matrices, tr(ρ_a ρ_b), which is the weighted sum of the inner products of the underlying pure states. In fact, this is a generalization of the inner product from vectors to density matrices, called the trace inner product (Balkır, 2014; Zhang et al., 2018). Therefore, we calculate the attention score between K and Q by the trace inner product. Then, the output ρ̃_{t-v} is obtained by the weighted summation of V, where ρ̃_{t-v} is the density matrix containing textual and visual information. Inspired by the Transformer (Vaswani et al., 2017), we also exploit the residual mechanism to obtain ρ_{t-v}, the fusion feature of the textual and visual modalities. In addition, Q-attention is a multi-layer module. In the second and later layers, ρ_t is still the basis and is used for K and V, while Q is the output of the previous layer and is continuously updated.
Similarly, the acoustic modality can then be integrated by Q-attention: ρ_{t-v} is taken as the source of K and V, and ρ_a as Q, yielding ρ_{t-v-a}, the modal fusion feature in the order t-v-a, which is also a density matrix. In the same way, we obtain the modal fusion feature ρ_{t-a-v} in the order t-a-v. Then, a quantum measurement operator M_1 = {|m_j^1⟩}_{j=1}^{n} is utilized to select the most appropriate order for the current sample. The operator has n eigenstates, so an n-dimensional probability distribution is obtained after the measurement of ρ_{t-v-a}. We use a fully connected neural network to map this probability distribution to the weight of the t-v-a order. ρ_{t-a-v} is also measured by M_1 to obtain the weight of the t-a-v order. We feed the two weights into a Softmax layer to get α and β, where α + β = 1. Finally, we sum the two weighted density matrices: ρ_f = α ρ_{t-v-a} + β ρ_{t-a-v}, where ρ_f is the multimodal fusion density matrix.
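The trace inner product at the heart of the Q-attention score can be sketched as follows; the dimension, the random mixed states and the helper name are illustrative, not from the paper:

```python
import numpy as np

def random_density_matrix(d, rng):
    """Helper: build a random mixed state (Hermitian, PSD, unit trace)."""
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = A @ A.conj().T          # PSD by construction
    return rho / np.trace(rho)    # normalize to unit trace

rng = np.random.default_rng(0)
K = random_density_matrix(4, rng)
Q = random_density_matrix(4, rng)

# Trace inner product tr(KQ): the generalization of the vector inner
# product to density matrices, used here as the attention score.
score = np.trace(K @ Q).real

# For Hermitian PSD matrices the trace inner product is real and >= 0,
# so it behaves like a similarity score.
assert score >= 0.0
```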

Emotion Recognition
We introduce another quantum measurement operator M_2 = {|m_j^2⟩}_{j=1}^{n} to recognize the emotions. The measurement of ρ_f yields T, an n-dimensional vector representing the probability distribution over the eigenstates, and p_e = FCN(T), where FCN is a fully connected neural network. p_e = [p_{e1}, p_{e2}, ..., p_{ek}]^T is the probability distribution over the emotions, and k is the number of emotion categories.
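A minimal sketch of this recognition step, assuming the measurement described in the Preliminaries; the fused state, the basis, and the linear "FCN" head (weights W) are all illustrative placeholders:

```python
import numpy as np

def measure(rho, basis):
    """Probability of collapsing onto each eigenstate: p_j = tr(rho |m_j><m_j|)."""
    return np.array([np.trace(rho @ np.outer(m, m.conj())).real for m in basis])

d = 3
# An illustrative fused density matrix and an orthonormal measurement basis.
rho_f = np.diag([0.5, 0.3, 0.2]).astype(complex)
basis = [np.eye(d, dtype=complex)[j] for j in range(d)]

T = measure(rho_f, basis)     # n-dimensional distribution over eigenstates
assert np.isclose(T.sum(), 1.0)

# A hypothetical fully connected head maps T to k emotion logits.
k = 2
W = np.full((k, d), 0.5)      # placeholder weights, not trained values
logits = W @ T
```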
During training, we use the BCEWithLogitsLoss function to calculate the loss.

Datasets and Metrics
We conduct experiments to verify the performance of QAP on two widely used datasets: IEMOCAP and CMU-MOSEI. Neither original dataset can be directly used for end-to-end training, so Dai et al. (2021) reconstructed both. After reconstruction, IEMOCAP contains 151 videos and 7,380 utterances. Each video is a dialogue between two professional actors following a script. IEMOCAP has 6 emotion labels: {angry, happy, excited, sad, frustrated, neutral}, and each utterance corresponds to exactly one label. CMU-MOSEI is collected from opinion videos on YouTube. The reorganized CMU-MOSEI contains 20,477 utterances and 6 emotion labels: {happy, sad, angry, fearful, disgusted, surprised}; utterances in CMU-MOSEI may correspond to multiple labels. Following Dai et al. (2020), we split the datasets; the statistics are shown in Appendix A.
To comprehensively and fairly evaluate the method, we follow previous work (Dai et al., 2021) and use different metrics for the two datasets. For IEMOCAP, we report the accuracy and F1-score of each emotion and their average values. For CMU-MOSEI, we report the weighted accuracy and F1-score of each emotion and their average values.

Training Details
We use two optimizers during training. For the unitary matrix parameters, we design an independent optimizer following Wisdom et al. (2016) so that these parameters remain unitary matrices throughout training. The optimization process is shown in Appendix B. For regular parameters, we use the Adam optimizer (Kingma and Ba, 2014). The experiments are run on a Tesla V100S GPU with 32GB of memory. There are about 58M parameters in our model, and one epoch takes less than one hour. We perform a grid search on the validation set to select the hyper-parameters, which are shown in Appendix C. Each experiment is run three times and the average is reported.
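One common way to implement such a unitarity-preserving update, in the spirit of Wisdom et al. (2016), is a Cayley-transform step along a skew-Hermitian direction. The sketch below uses an illustrative step size and a dummy gradient; it is not the paper's exact optimizer:

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr = 4, 0.01

# Start from a unitary matrix (via QR) and a dummy Euclidean gradient G.
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
U, _ = np.linalg.qr(A)
G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))

# Skew-Hermitian direction B = G U^H - U G^H, then a Cayley-transform step:
# U_new = (I + lr/2 B)^{-1} (I - lr/2 B) U, which is exactly unitary.
B = G @ U.conj().T - U @ G.conj().T
I = np.eye(d)
U_new = np.linalg.solve(I + (lr / 2) * B, (I - (lr / 2) * B) @ U)

assert np.allclose(B, -B.conj().T)             # B is skew-Hermitian
assert np.allclose(U_new @ U_new.conj().T, I)  # unitarity is preserved
```

Because the Cayley transform of a skew-Hermitian matrix is unitary, the parameters never drift off the unitary manifold, unlike an ordinary gradient step followed by re-orthogonalization.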

Baselines
We compare QAP with several advanced multimodal emotion recognition models:
LF-LSTM: LSTM, a classical neural network, is used to encode modal features. It is a late fusion (LF) model.

LF-TRANS:
The Transformer model is used to encode modal features and then the results are integrated.It is also a late fusion model.
EmoEmbs (Dai et al., 2020): This approach uses pre-trained word embeddings to represent emotion categories for textual data and transfers these embeddings into the visual and acoustic spaces. EmoEmbs can directly adapt to unseen emotions in any modality and performs well in zero-shot and few-shot scenarios.
MulT (Tsai et al., 2019): For unaligned modalities, MulT uses cross-modal attention to integrate modalities in pairs; like the above baselines, it does not pay attention to modal priority.
AMOA (Li et al., 2022): The three modalities are integrated in a fixed order, and a global acoustic feature is introduced to enhance learning.
FE2E (Dai et al., 2021): FE2E is the first end-to-end model for MER; it uses pre-trained models to extract unimodal features and then fuses them.
MESM (Dai et al., 2021): Cross-modal attention and sparse CNN are utilized to integrate modalities and reduce computation, based on FE2E.

Main Results
The experimental results on the IEMOCAP and CMU-MOSEI datasets are reported in Table 1 and Table 2, respectively. The results show that QAP outperforms all baseline models on average and in most emotion categories. In general, QAP attains an improvement of 1%-3% over the other models, which indicates the advantage of QAP in MER.
The baseline models ignore the issue of inconsistent emotions, so they perform poorly in this situation. LF-LSTM, LF-TRANS, EmoEmbs and MulT are classic multimodal emotion recognition models but only use pre-extracted features. Besides, they treat modalities equally and do not pay attention to the important role of modal priority, so their performance is relatively poor. AMOA notices the importance of the modal fusion order, so its performance improves over previous methods; however, AMOA cannot learn modal priority adaptively, so the order is fixed. FE2E and MESM use end-to-end frameworks and can extract richer modal features, so they also perform well, but these two models likewise do not focus on modal priority. QAP uses quantum states to model features so that modalities with inconsistent emotions can be effectively integrated. Besides, QAP learns modal priority adaptively and can adjust the modal fusion order based on priority, so it outperforms all baselines.

Analysis
To further analyze the performance of QAP, we conduct extensive experiments on the IEMOCAP and CMU-MOSEI datasets.

Effectiveness of complex-valued density matrix
To verify the role of the complex-valued density matrix, we change the unit of modal representation from the complex-valued density matrix to the pure state vector and conduct experiments. The results in Table 3 show that the performance of QAP decreases when the pure state is used. Besides, we try to directly concatenate end-to-end features and pre-extracted features rather than using the complex representation. Experimental results show that this also causes performance degradation.
We use the complex-valued representation to combine pre-extracted features and end-to-end features. To verify the role of pre-extracted features, we remove the phase from the complex representation, that is, we change the complex-valued matrix into a real-valued matrix with only end-to-end features. As shown in Table 3, the addition of pre-extracted features contributes greatly to the improvement of model performance. We introduce the sentiment dictionary into MER, which is not used by other models, so we conduct an ablation study on SenDic individually. Results in the last row of Table 3 illustrate that the introduction of SenDic improves model performance.

Effectiveness of adaptive-priority-learning fusion
Almost all baselines (except AMOA) do not integrate the three modalities in order, while QAP integrates the modalities in order of modal priority, so its performance is better than all baselines, as shown in Tables 1 and 2. In addition, compared with AMOA, our model adds a mechanism to adaptively adjust the fusion order and learn modal priority via a quantum measurement operator. To prove the effectiveness of this mechanism, we fix the modal fusion order in QAP and conduct experiments. The results are shown in Table 4: no matter which fusion order is fixed, the model performance decreases. Therefore, no single fusion order is suitable for all samples, and it is necessary to adaptively adjust the order for different samples.
For selecting between the two fusion orders, we adopt Soft selection by default, which utilizes the information of both fusion orders in a dynamic proportion. We also attempt Hard selection, that is, discarding the order with the lower score. The results in Table 4 show that QAP with Soft selection performs better. The reason is that in some samples there is little difference between the contributions of the acoustic and visual modalities, and both orders contribute positively to emotion recognition.
To reduce computation, we only keep two (t-v-a, t-a-v) of the six fusion orders, based on the experimental results of previous work. We also conduct experiments with more orders. The results in Table 5 show that adding more fusion orders does not significantly improve performance while increasing computation.
In addition, we also try keeping other pairs of orders, and the results are shown in Table 5. When we use the orders t-a-v and t-v-a, QAP achieves the best performance, which indicates that our initial selection of the two fusion orders is appropriate.

Effectiveness of quantum measurement
In QAP, we use quantum measurement operators to collapse the density matrix ρ_f for classification (order selection and emotion recognition). This unifies the entire classification process under a quantum-like framework and improves the interpretability of QAP. We also attempt two other non-quantum methods for classification to verify the effectiveness of quantum measurement. The first attempt is to use non-orthogonal eigenstates to form a measurement operator, which actually violates the concept of quantum measurement. The second attempt is to flatten ρ_f into a one-dimensional vector, followed by a softmax function. The results in Table 6 show that the performance of the non-quantum methods decreases, revealing the superiority of quantum measurement.

Role of single modality
In MER, each modality plays an important role.
To verify this role, we remove one modality at a time and conduct experiments. For example, when the textual modality is removed, the v-a and a-v orders are adopted and adaptively selected by a measurement operator. The results are shown in Table 7. When a modality is removed, the performance of QAP decreases to varying degrees. Specifically, removing the textual modality causes the most obvious drop, which is consistent with the results of previous work.

Conclusion
We propose QAP, a quantum-inspired adaptive-priority-learning model for multimodal emotion recognition. First, quantum states are introduced to model the uncertainty of human emotion, which allows modalities with inconsistent emotions to be effectively integrated. Second, a novel mechanism, Q-attention, is designed to integrate the three modalities in order within a quantum-like framework. While selecting the appropriate fusion order, QAP learns modal priority adaptively; in this way, modalities make varying degrees of contribution based on priority. Experiments on two widely used datasets show that QAP establishes a new state of the art.

Limitations
We use the density matrix to represent modal features, and one of its advantages is that the matrix contains more information. However, the requirements for memory and GPU resources also increase. Under the best hyper-parameter setting, the shape of the pure-state representation is 16×100×100, while the shape of the density-matrix representation is 16×100×100×100. The matrix also increases computation and time cost. In future work, we will explore how to reduce the computational expense; one idea is to build a sparse density matrix.

Figure 2 :
Figure 2: The overall architecture of QAP. ⊙ denotes the point-wise product and ⊕ denotes element-wise addition. QM stands for Quantum Measurement. The dashed parts are not optimized through training.

Figure 3 :
Figure 3: The main components of the Q-attention module.

Table 3 :
Results of the ablation study of the complex-valued density matrix. -pure state means that the pure state is used to represent modal features instead of the density matrix. -concat means directly concatenating the two features rather than combining them with the complex representation. w/o phase means removing pre-extracted features and using only the real density matrix.

Table 4 :
Experimental results of selection methods and fixed orders. Soft means using Soft selection and Hard means using Hard selection. -fixed(t-v-a) means that the fixed fusion order is t-v-a, and the others are similar. The average results are reported.

Table 5 :
Experimental results of reserved orders. The first three rows are cases where two different orders are reserved. -4 orders means using the four orders t-v-a, t-a-v, v-a-t and v-t-a. -6 orders means using all six orders.

Table 7 :
Results of the ablation study of single modalities. w/o means removing this modality and integrating only the other two.