What Does Your Smile Mean? Jointly Detecting Multi-Modal Sarcasm and Sentiment Using Quantum Probability

Sarcasm and sentiment embody the intrinsic uncertainty of human cognition, making joint detection of multi-modal sarcasm and sentiment a challenging task. In view of the advantages of quantum probability (QP) in modeling such uncertainty, this paper explores the potential of QP as a mathematical framework and proposes a QP driven multi-task (QPM) learning framework. The QPM framework involves a complex-valued multi-modal representation encoder, a quantum-like fusion network and a quantum measurement mechanism. Each multi-modal (e.g., textual, visual) utterance is first encoded as a quantum superposition of a set of basis terms using a complex-valued representation. Then, the quantum-like fusion network leverages quantum state composition and quantum interference to model the contextual interaction between adjacent utterances and the correlations across modalities respectively. Finally, quantum incompatible measurements are performed on the multi-modal representation of each utterance to yield the probabilistic outcomes of sarcasm and sentiment recognition. Experimental results show the state-of-the-art performance of our model.


Introduction
Multi-modal sarcasm and sentiment analysis, as a challenging problem, has attracted increasing attention in the recent literature (Cai et al., 2019; Pan et al., 2020). Sarcasm is a subtle form of human language that intends to express criticism, humor or mocking sentiments by means of hyperbole, figuration, etc. (Castro et al., 2019). The literal meaning of an ironic expression differs from its real implication, which can completely flip the polarity of sentiment. Hence, sentiment comes into view and tightly couples with sarcasm in that one helps the understanding of the other. Consequently, jointly detecting sarcasm and sentiment would bring benefits to each other.

* Yazhou Zhang and Yaochen Liu contribute equally and share the co-first authorship. † Corresponding author.
Judging the sarcasm and sentiment of human language, e.g., an utterance in a conversation, involves intrinsically uncertain human cognition processes (Carroll and Carroll, 1999). The uncertainty is rooted in the spontaneity of human subjective activities, where the generation of sarcasm and sentiment is often spontaneous and intuitive, without a rational reasoning process. Meanwhile, human language is multi-modal in nature, involving multi-modal (e.g., textual and visual) features that interact with each other and introduce extra cognitive complexity. Thus, it is essential to study sarcasm and sentiment from a general cognitive perspective.
Motivated by recent success in using quantum probability (QP) as a formal framework for modeling the intrinsic uncertainty in human cognition, we take the first step towards using QP to solve the joint multi-modal sarcasm and sentiment analysis problem. Originally the mathematical foundation of quantum mechanics that describes the behavior of particles, QP has been employed to formalize the uncertainty in various macro-level tasks such as semantic analysis (Bruza et al., 2009; Uprety et al., 2020), question answering (Zhang et al., 2018a) and sentiment classification, with verified effectiveness and advantages. Different from these existing approaches, at the heart of our work are the quantum inspired modeling of multi-modal fusion in conversational context and the exploration of inter-task correlations via quantum incompatible measurement. The reasons to use QP are fourfold: (1) QP is advantageous in modeling the uncertainty in human cognition because it introduces the concept of complex probability amplitude and models an utterance as a quantum superposition of basis words or pixels; (2) quantum interference embodies a non-linear fusion of multi-modal features, via an interference term that models two decision paths (e.g., textual and visual modalities) interfering with each other in reaching a final decision (e.g., a sarcasm judgment); (3) quantum contextuality reflects the intra-modality contextual interaction as quantum composition; (4) quantum incompatible measurement describes the correlations across multiple tasks. Since sarcasm and sentiment are tightly coupled, we argue that they are incompatible, i.e., judging one will affect the judgment of the other. To sum up, we can intuitively discover commonality between QP and multi-modal sarcasm and sentiment analysis, and benefit from the unified and principled mathematics of QP. A detailed formal explanation is provided in Sec. 3.
In this paper, we propose a QP driven multi-task (QPM) learning framework. Specifically, QPM involves a complex-valued multi-modal representation encoder, a quantum-like fusion network and a quantum measurement mechanism. First, each modality of an utterance is described as a quantum superposition of a set of basis semantic units and represented by a complex-valued embedding. Then, we propose a quantum-like fusion network that leverages quantum state composition and quantum interference to capture intra-modal contextuality and inter-modal incongruity. The contextuality is described as the contextual interaction between adjacent utterances, which is mathematically encapsulated in a density matrix. The inter-modal incongruity is handled at the feature level with a quantum interference-like fusion approach. Finally, since all the information contained in a system is represented by the probability distribution of quantum measurement outcomes, the final multi-modal features are extracted via quantum incompatible measurement and passed to a fully connected layer to yield the sarcasm and sentiment predictions.
Extensive empirical results on two benchmark datasets, MUStARD and Memotion, show the effectiveness of QPM over state-of-the-art baselines. The major innovations of this work are:
• The first QP driven multi-task learning framework for joint multi-modal sarcasm and sentiment analysis.
• A quantum-like fusion network for modelling intra-modality contextuality and inter-modality incongruity.
• A quantum incompatible measurement approach capturing inter-task dependency.

Quantum Probability Preliminaries
Quantum Superposition and Density Matrix. The mathematical basis of quantum probability is established on a complex Hilbert space, denoted as H. A quantum state vector u is expressed as a ket |u⟩, and its conjugate transpose as a bra ⟨u|. The inner product and outer product of two state vectors |u⟩ and |v⟩ are denoted as ⟨u|v⟩ and |u⟩⟨v| respectively. Quantum superposition states that a pure quantum state can be in multiple mutually exclusive basis states simultaneously, with a probability distribution, until it is measured. A mixture of pure states gives rise to a mixed state, represented by a density matrix ρ = Σ_i p_i |u_i⟩⟨u_i|, where p_i denotes the probability of each pure state |u_i⟩.
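As a concrete illustration (a toy NumPy sketch, not part of the model), a pure state over a two-dimensional basis and a mixed state built from two pure states can be written as:

```python
import numpy as np

# A hypothetical 2-basis "word" space: |w1> = (1,0), |w2> = (0,1).
w1 = np.array([1.0, 0.0], dtype=complex)
w2 = np.array([0.0, 1.0], dtype=complex)

# A pure state as a superposition with complex amplitudes (|a|^2 + |b|^2 = 1).
a = np.sqrt(0.5)
b = np.sqrt(0.5) * np.exp(1j * np.pi / 4)
u = a * w1 + b * w2

# Density matrix of the pure state: rho = |u><u|.
rho_pure = np.outer(u, u.conj())

# A mixed state over two pure states with classical weights p_i.
v = w1  # a second pure state
p = [0.6, 0.4]
rho_mixed = p[0] * np.outer(u, u.conj()) + p[1] * np.outer(v, v.conj())

# Valid density matrices have unit trace and are Hermitian.
assert np.isclose(np.trace(rho_mixed).real, 1.0)
assert np.allclose(rho_mixed, rho_mixed.conj().T)
```

The assertions check the two defining properties of a density matrix: unit trace and Hermitian symmetry.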
Quantum Interference. In the double-slit experiment, the two paths interfering with each other affect the probability distribution of a particle reaching a given position on the detection screen. We use the wave function ϕ(x) to describe this behavior: it represents the probability amplitude of a particle being at position x, and its modulus square represents the probability. The state of the photon is a quantum superposition of the states of path 1 and path 2, formulated as ϕ_p(x) = αϕ_1(x) + βϕ_2(x), where ϕ_1(x) and ϕ_2(x) are the wave functions of path 1 and path 2, and α and β are complex numbers. Its probability is:

|ϕ_p(x)|² = |αϕ_1(x)|² + |βϕ_2(x)|² + 2|αϕ_1(x)||βϕ_2(x)| cos φ,

where φ is the interference angle. I = 2|αϕ_1(x)||βϕ_2(x)| cos φ is the interference term, which describes the interaction between the two paths.
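The interference formula can be verified numerically; the amplitudes and angle below are illustrative values, not quantities from the paper:

```python
import numpy as np

# Wave-function magnitudes at a fixed screen position x (illustrative values).
phi1, phi2 = 0.6, 0.8               # |phi_1(x)|, |phi_2(x)|
alpha, beta = np.sqrt(0.5), np.sqrt(0.5)
phase = np.pi / 3                   # interference angle phi

# Superposed amplitude of the two paths (relative phase carried by path 2).
amp = alpha * phi1 + beta * phi2 * np.exp(1j * phase)

# Probability = modulus squared of the amplitude ...
p_quantum = np.abs(amp) ** 2

# ... which equals the classical sum of path probabilities plus the
# interference term I = 2|alpha phi1||beta phi2| cos(phi).
p_classical = np.abs(alpha * phi1) ** 2 + np.abs(beta * phi2) ** 2
interference = 2 * np.abs(alpha * phi1) * np.abs(beta * phi2) * np.cos(phase)
assert np.isclose(p_quantum, p_classical + interference)
```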
Quantum Measurement. A quantum measurement is described by a set of measurement operators {M_m} acting on the state space of the system being measured, where m indexes the possible measurement outcomes. Suppose the quantum system is in the state |u⟩; then the probability of obtaining the outcome m is p(m) = ⟨u|M_m†M_m|u⟩. Gleason's theorem (Sordoni et al., 2013) proves the existence of a mapping μ(|u⟩⟨u|) = tr(ρ|u⟩⟨u|) for any event |u⟩⟨u|.
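The measurement rule p(m) = ⟨u|M_m†M_m|u⟩ reduces to simple linear algebra; here is a minimal sketch with projective operators on a two-dimensional state:

```python
import numpy as np

# State |u> in a 2-dimensional space.
u = np.array([np.sqrt(0.5), np.sqrt(0.5)], dtype=complex)

# Projective measurement operators onto the two basis states.
M = [np.array([[1, 0], [0, 0]], dtype=complex),
     np.array([[0, 0], [0, 1]], dtype=complex)]

# p(m) = <u| M_m^dagger M_m |u> for each outcome m.
probs = [np.real(u.conj() @ (Mm.conj().T @ Mm) @ u) for Mm in M]

# Completeness (sum_m M_m^dagger M_m = I) makes the outcomes sum to 1.
assert np.isclose(sum(probs), 1.0)
```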

Theoretical Justification of the Proposed QPM Framework
Based on general QP theory and a few previous studies, this section provides the theoretical justification of our QPM framework in the form of four claims.
Claim 1 Quantum probability offers a more general way to capture the uncertainty in human language.
Assume z(x) represents a complex probability amplitude of an event x, where z(x) = re^{iθ}. QP defines the modulus square of this complex probability amplitude as the classical probability p(x) = |z(x)|² = r². This defines a many-to-one relationship between complex probability amplitudes and probabilities.
For example, if the probability of a word w is 0.5, i.e., p(x = w) = 1/2, then the corresponding probability amplitude may be z(x = w) = (√2/2)e^{iθ} for an arbitrary phase θ. The amplitude r links to the probability, while the phase θ may be associated with hidden sentiment or sarcasm orientations. The reasons are: (1) with this formulation, two antonymous words could have similar amplitudes but carry different sentiment polarities in the phase term; (2) words often carry multiple dimensions (e.g., semantic and sentiment) of information, so it is reasonable to use the amplitude-phase format to model semantics and sentiment jointly. An utterance can then be represented in an amplitude-phase manner.
Claim 2 Quantum interference embodies a nonlinear multi-modal fusion.
Quantum interference describes a phenomenon in which two propagation paths (e.g., textual and visual channels) interfere with each other and affect the probability distribution of a particle (e.g., the author's attitude). Assume z(x) represents a complex probability amplitude of the modality x. The probability amplitude of the multi-modality that consists of two modalities x_1, x_2 can be formalized as:

z(x) = αz_1(x) + βz_2(x),

where α and β are complex coefficients. The probabilities of x_1 and x_2 are measured as p(x_1) = |αz_1(x)|² and p(x_2) = |βz_2(x)|². We can then derive the probability of the multi-modality:

p(x) = |z(x)|² = p(x_1) + p(x_2) + 2√(p(x_1)p(x_2)) cos φ.

Hence, the probability of the multi-modality is a non-linear combination of the probabilities of the two uni-modalities, with an interference term determined by the relative phase φ. This provides a higher level of abstraction (Jiang et al., 2020).

Claim 3 Quantum composition captures the contextuality between utterances.

Quantum contextuality describes how the results of measurements on a particle depend on the measurement environment. This intuitively reflects the phenomenon that the sarcastic and sentimental states of an utterance are decided by its contexts.
Assume u_i and u_j represent two adjacent utterances in a conversation, each of which is made up of two basis words:

|u_i⟩ = α_1|w_1⟩ + α_2|w_2⟩,  |u_j⟩ = β_1|w_1⟩ + β_2|w_2⟩.

The contextual interaction between utterances u_i and u_j constructs the state space of a composite system H_{u_i,u_j}, which is defined as a tensor product of the individual state spaces |u_i⟩ and |u_j⟩:

|u_i⟩ ⊗ |u_j⟩ = α_1β_1|w_1w_1⟩ + α_1β_2|w_1w_2⟩ + α_2β_1|w_2w_1⟩ + α_2β_2|w_2w_2⟩.

This shows that the composite system of utterances embodies the correlations between words, which inspires us to model the contextuality in a "global to local" way (Zhang et al., 2018b).

Claim 4 Quantum incompatible measurement describes the correlations across multiple tasks.
Given two sets of G measurement operators for the sarcasm and sentiment observables, if every cross-task pair of measurement operators satisfies the commutation rule, i.e., [M_γ^sar, M_δ^sen] = 0 for all γ and δ, then the sarcasm and sentiment observables are called compatible; otherwise they are incompatible (Designolle et al., 2019). Here, sarcasm and sentiment are tightly intertwined and the judgment on one may affect the other. Thus we argue that they are incompatible, and check whether this hypothesis is tenable in the experiments (c.f. Sec. 5.8). We introduce quantum relative entropy to quantitatively analyze the inter-task correlation and measure the specific degree of correlation across tasks.
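Incompatibility can be checked directly by computing commutators. The sketch below uses random rank-1 projectors as stand-in measurement operators (the real operators are learned during training):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projector(dim: int) -> np.ndarray:
    """Rank-1 projector onto a random unit vector (a toy measurement operator)."""
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    v /= np.linalg.norm(v)
    return np.outer(v, v.conj())

M_sar = random_projector(4)   # hypothetical sarcasm operator
M_sen = random_projector(4)   # hypothetical sentiment operator

# Commutator [A, B] = AB - BA; a nonzero norm signals incompatibility.
commutator = M_sar @ M_sen - M_sen @ M_sar
comm_norm = np.linalg.norm(commutator)
```

Two random projectors almost never commute, so `comm_norm` is generically nonzero, mirroring the incompatibility hypothesis.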

Task Definition and Overall Network
Task Definition. Suppose the dataset has L multi-modal samples. The ξ-th sample X_ξ is represented as X_ξ = ({c_i}_{i=1}^k, u_ξ, Y_ξ), where c_i, u_ξ and Y_ξ denote the i-th conversational context, the multi-modal utterance and the label respectively, and i ∈ [1, 2, ..., k], ξ ∈ [1, 2, ..., L]. Both the context and the multi-modal utterance consist of textual and visual modalities, i.e., c_i = (c_i^t, c_i^v) and u_ξ = (u_ξ^t, u_ξ^v). The task of multi-modal sarcasm and sentiment detection can be formulated as:

Ŷ_ξ = F({c_i}_{i=1}^k, u_ξ; Θ),

where F is the prediction function and Θ represents the parameter set.
Overall Network. The overall architecture of the QPM framework is shown in Figure 1. (1) The ξ-th textual utterance and its visual counterpart are represented by complex-valued embeddings, denoted as |u_t^ξ⟩ and |u_v^ξ⟩. (2) |u_t^ξ⟩ and |u_v^ξ⟩ are fed into the quantum composition layer to capture the contextuality, with the results encapsulated in two density matrices ρ_text and ρ_img. (3) We then fuse ρ_text and ρ_img into a multi-modal representation via the quantum interference layer. (4) We extract the final sarcastic and sentimental features via quantum incompatible measurement, and feed them into a fully connected softmax layer to yield the sarcasm and sentiment predictions.

Complex-valued Textual and Visual Embedding
Inspired by Li and Wang's work, for the textual modality, an utterance can be seen as a collection of words. We assume that the textual Hilbert space H_t is spanned by a set of orthogonal basis states {|w_t^j⟩}_{j=1}^n. With words as the basic semantic units, the j-th word w_t^j serves as the basis state |w_t^j⟩, represented by one-hot encoding, i.e., the j-th element being 1 and 0s elsewhere.
Then, we regard the ξ-th target utterance u_t^ξ as a quantum superposition of the basis words |w_t^1⟩, |w_t^2⟩, ..., |w_t^n⟩:

|u_t^ξ⟩ = Σ_{j=1}^n r_j e^{iθ_j} |w_t^j⟩,

where, following Claim 1, the amplitudes r_j are non-negative real numbers with Σ_j r_j² = 1 and the θ_j are the corresponding phases.
For the visual modality, low-level visual features are taken as the basic units. We assume that the visual Hilbert space H_v is spanned by a set of orthogonal basis visual features {|w_v^j⟩}_{j=1}^n, and the visual part of the target utterance is represented as |u_v^ξ⟩. The textual and visual embeddings of the i-th contextual utterance, |c_t^i⟩ and |c_v^i⟩, are calculated in the same way.
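Putting Claim 1 and the encoder together, a toy amplitude-phase embedding of a four-word utterance looks as follows; the vocabulary, amplitudes and phases are all illustrative assumptions, not learned values:

```python
import numpy as np

vocab = ["what", "a", "great", "idea"]   # hypothetical toy vocabulary
n = len(vocab)

# One-hot basis states |w_t^j>.
basis = np.eye(n, dtype=complex)

# Amplitudes r_j (word weight, with sum r_j^2 = 1) and phases theta_j
# (e.g., hidden sentiment orientation) -- both learned in the real model.
r = np.sqrt(np.array([0.1, 0.1, 0.5, 0.3]))
theta = np.array([0.0, 0.0, 2.5, 1.0])

# |u> = sum_j r_j e^{i theta_j} |w_t^j>
u = (r * np.exp(1j * theta)) @ basis

# The state is normalized: the squared amplitudes sum to 1.
assert np.isclose(np.sum(np.abs(u) ** 2), 1.0)
```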

Learning Intra-modality Contextuality with the Quantum Composition Layer
Treating the target multi-modal utterance as a quantum system and its contexts as the surrounding environment, we propose a quantum composition layer to learn the intra-modality contextuality. For text, given the target utterance |u_t^ξ⟩ and its contexts |c_t^1⟩, ..., |c_t^k⟩, the contextual interaction between them constructs a textual composite system, given by the tensor product of the individual utterance embeddings. We aim to learn both long- and short-range contextual interactions by constructing multiple composite systems with a variable number of contexts. The λ-th composite system is computed as:

|Ψ_t^λ⟩ = |c_t^1⟩ ⊗ ... ⊗ |c_t^λ⟩ ⊗ |u_t^ξ⟩,

where λ ∈ [1, k]. We can build k composite systems for the k context utterances, i.e., {|Ψ_t^λ⟩}_{λ=1}^k. These k composite systems are mathematically encapsulated in a textual density matrix ρ_text to obtain the representation of the target utterance u_t^ξ:

ρ_text = Σ_{λ=1}^k p_λ |Ψ_t^λ⟩⟨Ψ_t^λ|,

where p_λ represents the weights to be learned during training. The density matrix unifies the target utterance and its contexts. For the visual part, we also build k composite systems for the k visual contexts, i.e., {|Ψ_v^λ⟩}_{λ=1}^k, and obtain the visual density matrix ρ_img in the same way.

Figure 1: The architecture of the QPM framework. ⊗ denotes the tensor product operation and ⊕ element-wise addition; M refers to the quantum measurement operation; the remaining symbols denote outer product, point-wise multiplication, matrix multiplication and the square operation.
Then, the textual and visual density matrices ρ_text and ρ_img are flattened into two vectors |f_t⟩ and |f_v⟩ for multi-modal fusion via quantum interference.
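A minimal sketch of the composition layer, under two simplifying assumptions: each composite system pairs a single context with the target utterance (so all systems share one dimension), and the learnable weights p_λ are replaced by a softmax over random logits:

```python
import numpy as np

rng = np.random.default_rng(1)

def unit_state(dim: int) -> np.ndarray:
    """Random complex unit vector standing in for an utterance embedding."""
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

dim, k = 2, 3
target = unit_state(dim)                         # |u_t^xi>
contexts = [unit_state(dim) for _ in range(k)]   # |c_t^1> ... |c_t^k>

# Composite systems: here each context is composed with the target utterance.
systems = [np.kron(c, target) for c in contexts]

# Mixture weights p_lambda (learned in training; softmax stand-in here).
logits = rng.normal(size=k)
p = np.exp(logits) / np.exp(logits).sum()

# rho_text = sum_lambda p_lambda |Psi^lambda><Psi^lambda|
rho_text = sum(pl * np.outer(psi, psi.conj()) for pl, psi in zip(p, systems))
assert np.isclose(np.trace(rho_text).real, 1.0)   # valid density matrix

# Flatten the density matrix into a feature vector |f_t> for the fusion layer.
f_t = rho_text.flatten()
```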

Quantum Interference-like Fusion Layer
Based on the derivation in Claim 2, we argue that the subjective attitude of a speaker is in a quantum superposition-like state of the textual and visual representations, expressed as:

z_p(x) = αz_t(x) + βz_v(x),

where z_t(x) and z_v(x) represent the complex probability amplitudes of the textual and visual representations. f_t(x) = |α|²|z_t(x)|² and f_v(x) = |β|²|z_v(x)|² represent the corresponding probability distributions. The probability distribution of the multi-modal representation is then written as:

f_p(x) = f_t(x) + f_v(x) + 2√(f_t(x)f_v(x)) cos φ,

where x is the x-th feature component of the multi-modal representation |f_p⟩. The vector |f_p⟩ = (f_p(x_1), f_p(x_2), ..., f_p(x_n))^T represents the multi-modal fused features.
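The interference-like fusion can be sketched per feature component; the unimodal distributions and phases below are random placeholders for the learned quantities:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8                                         # feature dimension (illustrative)

# Unimodal probability-like features from the flattened density matrices.
f_t = np.abs(rng.normal(size=n)); f_t /= f_t.sum()
f_v = np.abs(rng.normal(size=n)); f_v /= f_v.sum()

# Learnable relative phase per feature component.
phi = rng.uniform(-np.pi, np.pi, size=n)

# Interference-like fusion: classical mixture plus a cos(phi) interference term.
# Each component is non-negative since f_t + f_v - 2 sqrt(f_t f_v) >= 0.
f_p = f_t + f_v + 2 * np.sqrt(f_t * f_v) * np.cos(phi)

# Renormalize so |f_p> again behaves like a probability distribution.
f_p /= f_p.sum()
```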

Quantum Measurement Layer
In QP, the properties of a system (e.g., an utterance's sarcastic information) can be depicted by the probability distribution of the measurement outcomes. The multi-modal representation |f p is shared across the two branches of our proposed QPM, and we propose to perform a sequence of quantum incompatible measurements on |f p , for obtaining the sarcastic and sentimental probabilistic features m sar and m sen .
Specifically, two sets of trainable measurement operators M^sar = {M_γ^sar}_{γ=1}^G and M^sen = {M_δ^sen}_{δ=1}^G are applied to |f_p⟩, yielding the probabilistic outcomes m_sar(γ) = ⟨f_p|(M_γ^sar)†M_γ^sar|f_p⟩ and m_sen(δ) = ⟨f_p|(M_δ^sen)†M_δ^sen|f_p⟩ for the two tasks.
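A toy version of the incompatible measurement layer follows; the sequential scheme (measuring sarcasm first, then sentiment on the post-measurement state) is one simple way to make the first judgment influence the second, and the operators are random stand-ins for the trained ones:

```python
import numpy as np

rng = np.random.default_rng(3)
dim, G = 4, 3                       # feature dimension and operator count (toy)

def projector(dim: int) -> np.ndarray:
    """Rank-1 projector from a random unit vector (trainable in the model)."""
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    v /= np.linalg.norm(v)
    return np.outer(v, v.conj())

M_sar = [projector(dim) for _ in range(G)]   # sarcasm operators
M_sen = [projector(dim) for _ in range(G)]   # sentiment operators

# Shared multi-modal state |f_p>.
f_p = rng.normal(size=dim) + 1j * rng.normal(size=dim)
f_p /= np.linalg.norm(f_p)

# For a projector M, M^dag M = M, so <f|M^dag M|f> = <f|M|f>.
m_sar = np.array([np.real(f_p.conj() @ M @ f_p) for M in M_sar])

# Sentiment is measured on the post-measurement state, so the sarcasm
# outcome influences the sentiment outcome (incompatibility).
post = M_sar[int(np.argmax(m_sar))] @ f_p
post /= np.linalg.norm(post)
m_sen = np.array([np.real(post.conj() @ M @ post) for M in M_sen])
```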

Dense Layer
The sarcastic and sentimental outcomes m_sar and m_sen are forwarded through a fully connected layer and the softmax function to yield the sarcasm and sentiment predictions. We use cross entropy with L2 regularization as the loss functions ζ_sar and ζ_sen, and jointly minimize them with different weights, i.e., ζ = w_sar ζ_sar + w_sen ζ_sen. We receive gradients of error from the two branches and adjust the weights accordingly.
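The weighted joint objective ζ = w_sar ζ_sar + w_sen ζ_sen can be sketched as follows (L2 regularization omitted; the logits, labels and task weights are illustrative):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(probs: np.ndarray, label: int) -> float:
    """Negative log-likelihood of the true class."""
    return float(-np.log(probs[label] + 1e-12))

# Logits produced by the two task branches (illustrative values).
logits_sar = np.array([1.2, -0.3])        # sarcastic vs. non-sarcastic
logits_sen = np.array([0.1, 0.8, -0.5])   # negative / neutral / positive

w_sar, w_sen = 0.5, 0.5                   # assumed task weights

# Joint loss: weighted sum of the two task losses.
loss = (w_sar * cross_entropy(softmax(logits_sar), 1)
        + w_sen * cross_entropy(softmax(logits_sen), 1))
```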

Experiment Settings
Datasets. We choose benchmark datasets that have textual and visual modalities with both sarcasm and sentiment labels. Only the extended version of MUStARD (MUStARD_ext for short) (Chauhan et al., 2020) and Memotion (Sharma et al., 2020) meet these criteria. MUStARD_ext: The utterance in each dialogue is annotated with a sarcastic or non-sarcastic label. As an extended version of MUStARD, MUStARD_ext additionally annotates sentiment and emotion labels. Memotion: It consists of 6992 training samples and 1879 testing samples. Each meme is labelled along several semantic dimensions, e.g., sentiment, sarcasm, humor, etc. Table 1 shows the detailed statistics of the two datasets. Evaluation metrics. We adopt precision (P), recall (R) and micro-F1 (Mi-F1) as the evaluation metrics in our experiments. We also introduce a balanced accuracy metric for the ablation test.
Hyper-parameter Setup. The textual and visual amplitudes are initialized with BERT and ResNet-152 respectively. The phases are initialized with the pre-assigned sentiments using BERT. The quantum measurement operators are randomly initialized with unit vectors and set to be trainable. The optimal hyper-parameters are listed in Table 2.

Baselines
A wide range of state-of-the-art baselines are included for comparison. They are:

SVM+BERT (Devlin et al., 2019): It represents the textual utterances using BERT vectors with standard hyper-parameter settings. We also concatenate the contextual features.

RCNN-RoBERTa: It feeds RoBERTa embeddings into a RCNN in order to capture contextual information.
EfficientNet (Tan and Le, 2019): It uses a compound scaling method to create different models, which has achieved state-of-the-art performance on the ImageNet challenge.
UPB-MTL (Vlad et al., 2020): It is a multimodal multi-task learning architecture that combines ALBERT for text encoding with VGG-16 for image representation.
QMSA (Zhang et al., 2018c): It first extracts visual and textual features using density matrices, and feeds them into the SVM classifier.
A-MTL framework (Chauhan et al., 2020): It proposes an attention based multi-task model to simultaneously analyse sentiment, emotion and detect sarcasm.

Comparative Analysis
The experimental results are summarized in Table 3. Text-QPM and Image-QPM, which are single-modality variants of QPM, do not perform well, demonstrating that the textual and visual modalities cannot be treated independently for multi-modal sarcasm and sentiment detection. The proposed QPM model achieves the best micro-F1 of 77.53%, compared to 76.57% for the state-of-the-art system (i.e., A-MTL) on MUStARD_ext. QPM achieves a micro-F1 of 61.39%, compared to 59.85% for A-MTL on Memotion. The results show that the proposed QPM framework leverages the advantages of QP in modeling the uncertainty in human language. We attribute the main improvements to the quantum-like fusion network and the quantum measurement mechanism, which ensure that QPM can model intra-modality contextuality and inter-modality incongruity, and refine the final features.

Table 4: Comparison with single-task learning (STL) and multi-task learning (MTL) frameworks. T: Text, V: Visual, T+V: QPM.

STL v/s MTL Framework
We outline the comparison between the multi-task (MTL) and single-task (STL) learning frameworks in Table 4. The bi-modal setup (T+V) performs better than the unimodal setups. For sarcasm detection, MTL outperforms STL by a large margin in the text modality and the bi-modal setting. The reason is that visual sarcasm detection involves a higher level of abstraction and more subjectivity. For sentiment analysis, MTL together with sarcasm achieves better performance than STL on all modalities. This indicates that sarcasm assists sentiment analysis through the sharing of knowledge, and vice versa. Our QP-based MTL framework can learn the inter-dependence between the two related tasks and improve performance.

Effect of Context Range
Since the Memotion dataset does not involve contexts, we only report results on MUStARD_ext in Table 5 with different context scopes. "Zero context" means that we only use the target utterance, ignoring its context. "One context" denotes that we use one previous utterance to construct the density matrix. "Two contexts" means the use of the previous two utterances as context.
The performance steadily increases as context range increases (with F1 scores of 66.03%, 68.75%, 72.54% and 77.53%), showing the importance of incorporating conversational context. QPM with zero context unsurprisingly performs worst. QPM with all contexts achieves the best F1 score, implying that incorporating all conversational contexts would be the best way to reach an optimal performance.

Ablation Study
We perform an ablation study to further examine the effectiveness of different components of QPM: (1) QPM-Real, which does not use the complex embedding, i.e., it replaces the utterance embeddings with their real counterparts; (2) QPM-Speaker Independent, which does not model contextuality; (3) QPM-Concat, which replaces the quantum interference-like fusion layer with multi-modal concatenation; (4) QPM-Trad, which replaces the quantum incompatible measurements with traditional softmax layers.
The results in Table 6 show that quantum incompatible measurement contributes the most to overall performance, as it effectively captures the interdependencies between tasks and extracts refined features. It is followed by the quantum-interference based fusion of multi-modalities and the modelling of contextuality. The complex-valued representation, which captures the uncertainty in human language, also plays an important role.

Error Analysis
We perform an error analysis and show a few misclassified cases (utterance+image), including cases where MTL predicts correctly while STL fails, and cases where both setups fail. From Table 7 and Figure 2, we notice that misclassification for STL often happens when the literal meaning of an ironic expression differs from its real sentiment. By utilizing sentiment knowledge, MTL obtains a significant improvement. Moreover, we observe that MTL may struggle in intricate cases requiring external information.

Discussion on Inter-Task Incompatibility
For a more detailed exploration of the incompatible measurement, we train 1000 and 800 pairs of sentiment and sarcasm measurement operators for MUStARD and Memotion respectively, and calculate the commutation relation for each pair.
The results are visualized in Figures 3a and 3b. We observe a violation of the commutation law, i.e., [M_γ^sar, M_δ^sen] ≠ 0 for all pairs, implying that sentiment and sarcasm are incompatible. To further validate this observation, we introduce quantum relative entropy, a kind of "distance" measure between quantum states: the smaller the quantum relative entropy, the closer the correlation between the sentiment and sarcasm operators. Average correlation and sample correlation scores are presented in Table 8 and Figures 3c and 3d, showing that the two tasks are correlated. The result justifies the need for incompatible measurement and explains its effectiveness over the traditional multi-task learning setting in Table 6.
Quantum relative entropy is defined as D(σ‖ρ) = Tr σ log σ − Tr σ log ρ, where σ and ρ are two measurement operators and Tr denotes the trace operation.

Furthermore, an analysis of the data shows that 84% of the sarcasm samples in MUStARD express explicit sentiments, while the proportion in Memotion is 74%. In MUStARD, 38% of the ironic utterances are also positive, and in Memotion it is 36%. These results support our hypothesis that sarcasm and sentiment are closely related.
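The quantum relative entropy D(σ‖ρ) can be computed via an eigendecomposition-based matrix logarithm; the two diagonal matrices below are toy full-rank operators, not the trained ones:

```python
import numpy as np

def mat_log(rho: np.ndarray) -> np.ndarray:
    """Matrix logarithm of a full-rank Hermitian matrix via eigendecomposition."""
    w, V = np.linalg.eigh(rho)
    return (V * np.log(w)) @ V.conj().T

def relative_entropy(sigma: np.ndarray, rho: np.ndarray) -> float:
    """D(sigma||rho) = Tr[sigma log sigma] - Tr[sigma log rho]."""
    return float(np.real(np.trace(sigma @ mat_log(sigma))
                         - np.trace(sigma @ mat_log(rho))))

# Two full-rank toy operators in diagonal form.
sigma = np.diag([0.7, 0.3]).astype(complex)
rho = np.diag([0.4, 0.6]).astype(complex)

d = relative_entropy(sigma, rho)   # strictly positive when sigma != rho
```

Smaller values of `d` indicate operators that are "closer", matching the correlation interpretation used above.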
Related Work

Remarkable progress has been made in the current state of the art. However, there is still a lack of mechanisms for capturing the inherent uncertainty in multi-modal human language for sarcasm and sentiment detection. Different from existing studies, we tackle the problem from a general cognitive perspective with a quantum probabilistic framework.

Conclusions
We have proposed a quantum probability driven multi-task learning framework. The main idea is to treat each utterance as a complex-valued vector. The contextual interaction between utterances and the correlations across modalities are modeled via quantum composition and quantum interference. Quantum incompatible measurement is performed to yield the probabilistic outcomes. The experimental results verify the effectiveness of the QPM.