Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement

Sarcasm is a linguistic phenomenon indicating a discrepancy between literal meanings and implied intentions. Due to its sophisticated nature, it is usually difficult to detect from the text alone. As a result, multi-modal sarcasm detection has received increasing attention in both academia and industry. However, most existing techniques model only the atomic-level inconsistencies between the text input and its accompanying image, ignoring more complex compositions in both modalities. Moreover, they neglect the rich information contained in external knowledge, e.g., image captions. In this paper, we propose a novel hierarchical framework for sarcasm detection that explores both the atomic-level congruity based on multi-head cross attention and the composition-level congruity based on graph neural networks, where a post with low congruity can be identified as sarcastic. In addition, we examine the effect of various knowledge resources on sarcasm detection. Evaluation results on a public Twitter-based multi-modal sarcasm detection dataset demonstrate the superiority of our proposed model.


Introduction
Sarcasm refers to satirical or ironic statements in which the literal meaning of the words is contrary to the authentic intention of the speaker, typically to insult someone or humorously criticize something. Sarcasm detection has received considerable attention because sarcastic utterances are ubiquitous on today's social media platforms such as Twitter and Reddit. However, distinguishing sarcastic posts remains challenging in light of their highly figurative nature and intricate linguistic synonymy (Pan et al., 2020; Tay et al., 2018).
Early sarcasm detection methods mainly relied on fixed textual patterns, e.g., lexical indicators, syntactic rules, specific hashtag labels and emoji occurrences (Davidov et al., 2010; Maynard and Greenwood, 2014; Felbo et al., 2017), which usually performed and generalized poorly because they failed to exploit contextual information. To resolve this issue, Tay et al. (2018), Joshi et al. (2015), Ghosh and Veale (2017) and Xiong et al. (2019) considered sarcasm contexts or the sentiments of sarcasm makers as useful clues to model the congruity level within texts, gaining consistent improvements. However, purely text-based sarcasm detection methods may fail to discriminate certain sarcastic utterances, as shown in Figure 1. In this case, it is hard to identify the actual sentiment of the text in the absence of the image forecasting severe weather. As text-image pairs are commonly observed on current social platforms, multi-modal methods become more effective for sarcasm prediction by capturing congruity information between textual and visual modalities (Pan et al., 2020; Xu et al., 2020a; Schifanella et al., 2016; Liu et al., 2021; Liang et al., 2021; Cai et al., 2019).

Figure 1: An example of sarcasm along with the corresponding image and different types of external knowledge extracted from the image. The sarcastic sentence expresses the need for some good news. However, the image of the TV program is switched to bad news depicting severe storms (bad weather), which contradicts the sentence.
However, most of the existing multi-modal techniques only considered the congruity level between each token and image patch (Xu et al., 2020a; Tay et al., 2018) and ignored the importance of multi-granularity alignments (e.g., at the granularity of objects and of relations between objects), which have been proved to be effective in other related tasks, such as cross-modal retrieval (Li et al., 2021b) and image-sentence matching (Xu et al., 2020b; Liu et al., 2020). In fact, the hierarchical structures of both texts and images advocate for composition-level modeling besides single tokens or image patches (Socher et al., 2014). Exploring compositional semantics for sarcasm detection helps to identify more complex inconsistencies, e.g., the inconsistency between a pair of related entities and a group of image patches.

arXiv:2210.03501v1 [cs.CL] 7 Oct 2022
Moreover, as the figurativeness and subtlety inherent in sarcastic utterances may negatively affect sarcasm detection, some works (Li et al., 2021a; Veale and Hao, 2010) found that the identification of sarcasm also relies on external knowledge of the world beyond the input texts and images as new contextual information. Indeed, several studies extracted image attributes (Cai et al., 2019) or adjective-noun pairs (ANPs) (Xu et al., 2020a) from images as visual semantic information to bridge the gap between texts and images. However, constrained by limited training data, such external knowledge may not be sufficient or accurate enough to represent the images (as shown in Figure 1), which may harm sarcasm detection. Therefore, how to choose and leverage external knowledge for sarcasm detection is also worth investigating.
To tackle the limitations mentioned above, in this work, we propose a novel hierarchical framework for sarcasm detection. Specifically, our proposed method considers both the atomic-level congruity between independent image objects and tokens and the composition-level congruity based on object relations and semantic dependencies to promote multi-modal sarcasm identification. To obtain atomic-level congruity, we first adopt the multi-head cross attention mechanism (Vaswani et al., 2017) to project features from different modalities into the same space and then compute a similarity score for each token-object pair via inner products. Next, we obtain composition-level congruity based on the output features of both the textual modality and the visual modality acquired in the previous step. Concretely, we construct textual graphs and visual graphs using semantic dependencies among words and spatial dependencies among regions of objects, respectively, to capture composition-level features for each modality using graph attention networks (Veličković et al., 2018). Our model concatenates both atomic-level and composition-level congruity features, so that semantic mismatches between the texts and images at different levels are jointly considered. To be precise about the terminology used in this paper: congruity denotes the semantic consistency between image and text. If the meanings of the image and the text contradict each other, the pair receives a lower congruity score. Atomic-level congruity is measured between a token and an image patch, and composition-level congruity between a group of tokens (a phrase) and a group of patches (a visual object).
Last but not least, we propose to adopt pre-trained transferable foundation models (e.g., CLIP (Radford et al., 2021) and GPT-2 (Radford et al., 2019)) to extract text information from the visual modality as external knowledge to assist sarcasm detection. The rationale for applying transferable foundation models is their effectiveness on a comprehensive set of tasks (e.g., descriptive and objective caption generation) in the zero-shot setting. As such, the extracted text contains ample information about the image, which can be used to construct additional discriminative features for sarcasm detection. Similar to the original textual input, the generated external knowledge also contains hierarchical information for sarcasm detection, which can be consistently incorporated into our proposed framework to compute multi-granularity congruity against the original text input.
The main contributions of this paper are summarized as follows: 1) To the best of our knowledge, we are the first to exploit hierarchical semantic interactions between textual and visual modalities to jointly model the atomic-level and composition-level congruities for sarcasm detection; 2) We propose a novel kind of external knowledge for sarcasm detection by using a pre-trained foundation model to generate image captions, which can be naturally adopted as the input of our proposed framework; 3) We conduct extensive experiments on a publicly available multi-modal sarcasm detection benchmark dataset, showing the superiority of our method over state-of-the-art methods, with additional improvement when using external knowledge.

Multi-modality Sarcasm Detection
With the rapid growth of multi-modal posts on modern social media, detecting sarcasm across text and image modalities has attracted increasing research attention. Schifanella et al. (2016) first defined the multi-modal sarcasm detection task. Cai et al. (2019) created a multi-modal sarcasm detection dataset based on Twitter and proposed a powerful baseline fusing features extracted from both modalities. Xu et al. (2020a) modeled both cross-modality contrast and semantic associations by constructing the Decomposition and Relation Network to capture commonalities and discrepancies between images and texts. Pan et al. (2020) and Liang et al. (2021) modeled intra-modality and inter-modality incongruities utilizing transformers (Vaswani et al., 2017) and graph neural networks, respectively. However, these works neglect the important role played by hierarchical or multi-level cross-modality mismatches. To address this limitation, we propose to capture multi-level associations between modalities with cross attention and graph neural networks to identify sarcasm.

Knowledge Enhanced Sarcasm Detection
Li et al. (2021a) and Veale and Hao (2010) pointed out that commonsense knowledge is crucial for sarcasm detection. For multi-modal sarcasm detection, Cai et al. (2019) proposed to predict five attributes for each image based on the pre-trained ResNet model (He et al., 2016) as the third modality for sarcasm detection. In a similar fashion, Xu et al. (2020a) extracted adjective-noun pairs (ANPs) from every image to reason about discrepancies between texts and ANPs. In addition, as some images themselves contain text, Pan et al. (2020) and Liang et al. (2021) proposed to apply Optical Character Recognition (OCR) to acquire the text on the images. More recently, Liang et al. (2022) proposed to incorporate an object detection framework and the label information of detected visual objects to mitigate the modality gap. However, the knowledge extracted by these methods is either not expressive enough to convey the information of the images or is restricted to a fixed set, e.g., nearly one thousand classes for image attributes or ANPs. Moreover, it should be noted that not every sarcastic post has text on its image. To this end, in this paper, we propose to generate a descriptive caption with rich semantic information for each image based on the pre-trained Clipcap model (Mokady et al., 2021), which uses the CLIP (Radford et al., 2021) encoding as a prefix to the caption by employing a simple mapping network and then fine-tunes GPT-2 (Radford et al., 2019) to generate the image captions.

Methodology
Our proposed framework contains four main components: Feature Extraction, Atomic-Level Cross-Modal Congruity, Composition-Level Cross-Modal Congruity and Knowledge Enhancement. Given an input text-image pair, the feature extraction module generates text features and image features via a pre-trained text encoder and an image encoder, respectively. These features are then fed to the atomic-level cross-modal congruity module to obtain congruity scores via a multi-head cross attention model (MCA). To produce composition-level congruity scores, we construct a textual graph and a visual graph and adopt graph attention networks (GAT) to exploit complex compositions of different tokens as well as image objects. The input features to the GAT are taken from the output of the atomic-level module. Due to the page limit, the illustration of our framework is placed in Figure 6. Our model can flexibly incorporate external knowledge as a "virtual" modality, which can be used to generate complementary features, analogous to the image modality, for congruity score computation.

Task Definition & Motivation
Multi-modal sarcasm detection aims to identify whether a given text associated with an image has a sarcastic meaning. Formally, given a multi-modal text-image pair $(X_T, X_I)$, where $X_T$ corresponds to a textual tweet and $X_I$ is the corresponding image, the goal is to produce an output label $y \in \{0, 1\}$, where 1 indicates a sarcastic tweet and 0 otherwise. Our goal is to learn a hierarchical multi-modal sarcasm detection model (taking both atomic-level and composition-level congruity into consideration) based on the textual modality, the image modality and, if chosen, the external knowledge.
The reason to use composition-level modeling is to cope with the complex structures inherent in two modalities. For example, as shown in Figure 2, the semantic meaning of the sentence depends on composing your life, awesome and pretend to reflect a negative position, which could be reflected via the dependency graph. The composed representation for text could then be compared with the image modality for more accurate alignment detection.

Feature Extraction
Given an input text-image pair $(X_T, X_I)$, where $X_T = \{w_1, w_2, \ldots, w_n\}$ consists of $n$ tokens, we utilize the pre-trained BERT model (Devlin et al., 2019) with an additional multi-layer perceptron (MLP) to produce a feature representation for each token, denoted as $T = [t_1, t_2, \ldots, t_n]$, where $T \in \mathbb{R}^{n \times d}$. As for image processing, given the image $X_I$ with size $L_h \times L_w$, following existing methods (Xu et al., 2020a; Cai et al., 2019; Liang et al., 2021; Pan et al., 2020), we first resize the image to $224 \times 224$. Then we divide each image into $r$ patches and reshape these patches into a sequence, denoted as $\{p_1, p_2, \ldots, p_r\}$, in the same way as tokens in the text domain. Next, we feed the sequence of $r$ image patches into an image encoder to get a visual representation for each patch. Specifically, in this paper, we choose two kinds of image encoders, the pre-trained Vision Transformer (ViT) (Dosovitskiy et al., 2020) and a ResNet model (He et al., 2016), both of which are trained for image classification on ImageNet. Hence, the embeddings of image patches derived by ViT or ResNet contain rich image label information. Here we adopt the features before the final classification layer to initialize the embeddings for the visual modality. We further use a two-layer MLP to obtain the feature representations for $\{p_1, p_2, \ldots, p_r\}$ as $I = [i_1, i_2, \ldots, i_r]$, where $I \in \mathbb{R}^{r \times d}$.
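The patch sequence described above can be illustrated with a minimal NumPy sketch. It assumes a patch size of 32 × 32 on a 224 × 224 image (the setting reported in the implementation details), and the function name `split_into_patches` is hypothetical. It only shows how an image becomes a row-major sequence of r = 49 patches; the actual encoders (ViT or ResNet) are not reproduced here.

```python
import numpy as np

def split_into_patches(image: np.ndarray, patch_size: int = 32) -> np.ndarray:
    """Split an (H, W, C) image into a row-major sequence of flattened square patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    rows, cols = height // patch_size, width // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch_size, cols, patch_size, channels)
    grid = grid.transpose(0, 2, 1, 3, 4)
    # One flattened vector per patch: (rows * cols, patch * patch * C)
    return grid.reshape(rows * cols, patch_size * patch_size * channels)

image = np.zeros((224, 224, 3), dtype=np.float32)  # stand-in for a resized image
patches = split_into_patches(image)
print(patches.shape)  # (49, 3072)
```

Each row of `patches` plays the role of one element p_j of the patch sequence before it is passed to the image encoder.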

Atomic-Level Congruity Modeling
To measure atomic-level congruity between a text sequence and an image, an intuitive solution is to compute a similarity score between each token and each visual patch directly. However, due to the large gap between the two modalities, we propose to use a cross attention mechanism with $h$ heads to first align the two modalities in the same space, which can be computed as

$$\mathrm{head}_i = \mathrm{softmax}\left(\frac{(T W_q^i)(I W_k^i)^\top}{\sqrt{d/h}}\right)(I W_v^i), \tag{1}$$

where $T \in \mathbb{R}^{n \times d}$ and $I \in \mathbb{R}^{r \times d}$ are the feature representations of the given text and image, respectively, $W_q^i, W_k^i, W_v^i \in \mathbb{R}^{d \times \frac{d}{h}}$ are the query, key and value projection matrices, respectively, and $\mathrm{head}_i \in \mathbb{R}^{n \times \frac{d}{h}}$. It is worth noting that we also considered taking the image as query and the text as key and value in Equation (1). However, we empirically find that the performance is unsatisfactory in this case. We conjecture that the visual modality may not contain sufficient information and is less expressive than the textual modality for providing attentive guidance, which can harm the final performance.

Then, by concatenating all heads followed by a two-layer MLP and a residual connection, we obtain the updated text representations $\tilde{T} \in \mathbb{R}^{n \times d}$ after aligning with the visual modality as

$$\tilde{T} = \mathrm{norm}\big(T + \mathrm{MLP}([\mathrm{head}_1 \,\|\, \mathrm{head}_2 \,\|\, \cdots \,\|\, \mathrm{head}_h])\big), \tag{2}$$

where "norm" denotes the layer normalization operation and "$\|$" denotes the concatenation operation. Next, to perform atomic-level cross-modal congruity detection, we adopt the inner product as

$$Q_a = \frac{1}{\sqrt{d}}\left(\tilde{T} I^\top\right), \tag{3}$$

where $Q_a \in \mathbb{R}^{n \times r}$ and the entry in its $i$-th row and $j$-th column represents the similarity score between the $i$-th token of the text and the $j$-th patch of the image. Intuitively, different words can have different influence on the sarcasm detection task. For example, nouns, verbs and adjectives are usually more important for understanding sarcastic utterances. As such, we feed the features of words to a fully-connected (FC) layer with a softmax activation function to model the token importance for sarcasm detection. The final atomic-level congruity score $s_a$ is obtained as a weighted sum of $Q_a$ with the importance score of each token:

$$s_a = \mathrm{softmax}(\tilde{T} W_a + b_a)^\top Q_a, \tag{4}$$

where $W_a \in \mathbb{R}^{d \times 1}$ and $b_a \in \mathbb{R}^{n}$ are trainable parameters in the FC layer for token importance score computation. $s_a \in \mathbb{R}^{r}$ contains the predicted atomic-level congruity score corresponding to each of the $r$ patches.
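To make the shapes in the atomic-level module concrete, the following NumPy sketch runs one cross-attention layer with text as query and image as key and value, then computes the token-patch similarity matrix and the importance-weighted score per patch. It is an illustrative sketch under stated assumptions rather than the released implementation: the parameters are random, the MLP activation (ReLU) and the layer normalization without learned scale are assumptions, and all function and dictionary names are hypothetical. The sizes d = 200 and h = 5 follow the values reported in the implementation details.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Layer normalization over the feature axis (no learned scale or shift)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def atomic_congruity(T, I, p, num_heads):
    """T: (n, d) token features, I: (r, d) patch features -> (T_tilde, Q_a, s_a)."""
    n, d = T.shape
    d_head = d // num_heads
    heads = []
    for h in range(num_heads):
        q = T @ p["W_q"][h]                                  # (n, d_head), text as query
        k = I @ p["W_k"][h]                                  # (r, d_head), image as key
        v = I @ p["W_v"][h]                                  # (r, d_head), image as value
        attn = softmax(q @ k.T / np.sqrt(d_head), axis=-1)   # (n, r)
        heads.append(attn @ v)                               # (n, d_head)
    concat = np.concatenate(heads, axis=-1)                  # (n, d)
    # Two-layer MLP (ReLU is an assumption), residual connection, layer norm.
    hidden = np.maximum(concat @ p["W_1"] + p["b_1"], 0.0)
    T_tilde = layer_norm(T + hidden @ p["W_2"] + p["b_2"])
    Q_a = T_tilde @ I.T / np.sqrt(d)                         # (n, r) token-patch similarity
    # Token importance: softmax over the n tokens.
    importance = softmax(T_tilde @ p["W_a"] + p["b_a"][:, None], axis=0)  # (n, 1)
    s_a = (importance.T @ Q_a).ravel()                       # (r,) one score per patch
    return T_tilde, Q_a, s_a

def init_params(rng, n, d, num_heads, scale=0.05):
    """Random stand-ins for the trainable parameters (hypothetical initialization)."""
    d_head = d // num_heads
    return {
        "W_q": rng.normal(0, scale, (num_heads, d, d_head)),
        "W_k": rng.normal(0, scale, (num_heads, d, d_head)),
        "W_v": rng.normal(0, scale, (num_heads, d, d_head)),
        "W_1": rng.normal(0, scale, (d, d)), "b_1": np.zeros(d),
        "W_2": rng.normal(0, scale, (d, d)), "b_2": np.zeros(d),
        "W_a": rng.normal(0, scale, (d, 1)), "b_a": np.zeros(n),
    }

rng = np.random.default_rng(0)
n, r, d, h = 6, 49, 200, 5
T = rng.normal(size=(n, d))
I = rng.normal(size=(r, d))
T_tilde, Q_a, s_a = atomic_congruity(T, I, init_params(rng, n, d, h), h)
print(T_tilde.shape, Q_a.shape, s_a.shape)  # (6, 200) (6, 49) (49,)
```

Because the token importance weights sum to one, each entry of `s_a` is a convex combination of the corresponding column of `Q_a`.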

Composition-Level Congruity Modeling
The composition-level congruity detection considers the more complex structure of both the text and image modalities, compared to the atomic-level computations. To achieve that, we propose to first construct a textual graph and a visual graph for the input text-image pair. For the textual graph, we consider tokens in the input text as graph nodes and use dependency relations between words extracted by spaCy¹ as edges, which have been proved to be effective for various graph-related tasks (Liu et al., 2020; Liang et al., 2021). Concretely, if there exists a dependency relation between two words, there will be an edge between them in the textual graph. For the visual graph, given $r$ image patches, we take each patch as a graph node and connect adjacent nodes according to their geometrical adjacency. Additionally, both kinds of graphs are undirected and contain self-loops for expressiveness. Then, we model the graphs in the text and visual modalities with graph attention networks (GAT) (Veličković et al., 2018). GAT leverages self-attention layers to weigh the extent of information propagated from corresponding nodes. By using GAT, atomic-level semantic information propagates along the graph edges to learn composition-level representations for both the textual modality and the image modality. Here, we take the textual graph for illustration, given as

$$\alpha^{l}_{i,j} = \frac{\exp\left(\mathrm{LeakyReLU}\left(v_l^\top [\Theta_l t^{l}_i \,\|\, \Theta_l t^{l}_j]\right)\right)}{\sum_{k \in \mathcal{N}(i) \cup \{i\}} \exp\left(\mathrm{LeakyReLU}\left(v_l^\top [\Theta_l t^{l}_i \,\|\, \Theta_l t^{l}_k]\right)\right)}, \qquad t^{l+1}_i = \alpha^{l}_{i,i} \Theta_l t^{l}_i + \sum_{j \in \mathcal{N}(i)} \alpha^{l}_{i,j} \Theta_l t^{l}_j, \tag{5}$$

where $k \in \mathcal{N}(i) \cup \{i\}$, and $\Theta_l \in \mathbb{R}^{d \times d}$ and $v_l \in \mathbb{R}^{2d}$ are learnable parameters of the $l$-th textual GAT layer. $\alpha^{l}_{i,j}$ is a scalar indicating the attention score between node $i$ and its neighboring node $j$. $t^{l}_i$ represents the feature of node $i$ in the $l$-th layer, with $t^{0}_i = \tilde{t}_i$ initialized from the atomic-level features $\tilde{T}$. We use $\hat{T} = [t^{L_T}_1, t^{L_T}_2, \ldots, t^{L_T}_n]$ with $\hat{T} \in \mathbb{R}^{n \times d}$ to represent the composition-level embeddings of the textual modality after $L_T$ GAT layers that incorporate complex dependencies among related tokens.
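The graph construction and a single attention-based propagation step can be sketched as follows. This is a minimal single-head GAT layer in NumPy, written under stated assumptions: the visual graph uses 4-neighbour adjacency on the patch grid (the text only says "geometrical adjacency"), the dependency edges are hard-coded hypothetical pairs rather than real parser output, the LeakyReLU slope of 0.2 is the common GAT default, and the parameters are random. All function names are illustrative.

```python
import numpy as np

def grid_adjacency(rows, cols):
    """Undirected 4-neighbour adjacency with self-loops for a rows x cols patch grid."""
    n = rows * cols
    A = np.eye(n, dtype=bool)
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:                      # right neighbour
                A[i, i + 1] = A[i + 1, i] = True
            if r + 1 < rows:                      # neighbour below
                A[i, i + cols] = A[i + cols, i] = True
    return A

def edges_to_adjacency(num_nodes, edges):
    """Undirected adjacency with self-loops from (head, dependent) index pairs."""
    A = np.eye(num_nodes, dtype=bool)
    for i, j in edges:
        A[i, j] = A[j, i] = True
    return A

def gat_layer(X, A, Theta, v, slope=0.2):
    """Single-head GAT layer. X: (N, d), A: (N, N) bool, Theta: (d, d), v: (2d,)."""
    H = X @ Theta.T                               # row i holds Theta x_i
    d = H.shape[1]
    # v^T [h_i || h_j] decomposes into v[:d].h_i + v[d:].h_j
    logits = (H @ v[:d])[:, None] + (H @ v[d:])[None, :]
    logits = np.where(logits > 0, logits, slope * logits)   # LeakyReLU
    logits = np.where(A, logits, -np.inf)                   # keep edges and self-loops only
    logits = logits - logits.max(axis=1, keepdims=True)
    alpha = np.exp(logits)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)        # softmax over N(i) and i
    return alpha @ H, alpha

rng = np.random.default_rng(0)
d = 8
dep_edges = [(0, 1), (1, 3), (2, 3)]              # hypothetical dependency pairs
A_text = edges_to_adjacency(4, dep_edges)
A_img = grid_adjacency(7, 7)                      # 49 patch nodes

X = rng.normal(size=(4, d))                       # stand-in for atomic-level token features
for _ in range(2):                                # stack two layers
    Theta = rng.normal(0, 0.3, (d, d))
    v = rng.normal(0, 0.3, 2 * d)
    X, alpha = gat_layer(X, A_text, Theta, v)
print(X.shape, A_img.shape)  # (4, 8) (49, 49)
```

The same `gat_layer` can be applied to patch features with `A_img` to obtain composition-level visual representations.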
In some cases, we may not be able to construct a reliable textual graph due to the lack of sufficient words in a sentence or errors from the parser. Hence, we further propose to concatenate $\hat{T}$ with a sentence embedding $c \in \mathbb{R}^{d}$, which is computed as a weighted sum of the word embeddings in $\tilde{T}$:

$$c = \mathrm{softmax}(\tilde{T} W_c + b_c)^\top \tilde{T}, \tag{6}$$

with learnable $W_c \in \mathbb{R}^{d \times 1}$ and $b_c \in \mathbb{R}^{n}$. Likewise, we can obtain $\hat{I} = [i^{L_I}_1, i^{L_I}_2, \ldots, i^{L_I}_r]$, $\hat{I} \in \mathbb{R}^{r \times d}$, as the composition-level representations in the visual modality. At last, we compute the composition-level alignment scores $s_p$ between $\hat{T}$ and $\hat{I}$ in a similar way to the atomic-level congruity as

$$s_p = \mathrm{softmax}([\hat{T} \,\|\, c] W_p + b_p)^\top Q_p, \tag{7}$$

where $Q_p = \frac{1}{\sqrt{d}}\left([\hat{T} \,\|\, c]\, \hat{I}^\top\right) \in \mathbb{R}^{(n+1) \times r}$ is the matrix of composition-level congruity between the textual modality and the visual modality, and $W_p \in \mathbb{R}^{d \times 1}$ and $b_p \in \mathbb{R}^{n+1}$ are trainable parameters. $s_p \in \mathbb{R}^{r}$ contains the final predicted composition-level congruity score for each of the $r$ image patches.

¹ https://spacy.io/
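The composition-level score can be sketched in the same style. The NumPy snippet below assumes that the sentence embedding is pooled from the atomic-level token features and then appended as an extra row to the GAT output before the similarity with the visual features is computed; inputs and parameters are random stand-ins, and the function name `composition_congruity` is hypothetical.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def composition_congruity(T_tilde, T_hat, I_hat, W_c, b_c, W_p, b_p):
    """T_tilde, T_hat: (n, d); I_hat: (r, d). Returns (c, Q_p, s_p)."""
    n, d = T_hat.shape
    # Sentence embedding: importance-weighted sum of token features (assumed atomic-level).
    c = softmax(T_tilde @ W_c + b_c[:, None], axis=0).T @ T_tilde   # (1, d)
    T_aug = np.concatenate([T_hat, c], axis=0)                      # (n + 1, d)
    Q_p = T_aug @ I_hat.T / np.sqrt(d)                              # (n + 1, r)
    weights = softmax(T_aug @ W_p + b_p[:, None], axis=0)           # (n + 1, 1)
    s_p = (weights.T @ Q_p).ravel()                                 # (r,)
    return c.ravel(), Q_p, s_p

rng = np.random.default_rng(0)
n, r, d = 6, 49, 200
T_tilde = rng.normal(size=(n, d))
T_hat = rng.normal(size=(n, d))
I_hat = rng.normal(size=(r, d))
W_c, b_c = rng.normal(0, 0.05, (d, 1)), np.zeros(n)
W_p, b_p = rng.normal(0, 0.05, (d, 1)), np.zeros(n + 1)

c, Q_p, s_p = composition_congruity(T_tilde, T_hat, I_hat, W_c, b_c, W_p, b_p)
print(c.shape, Q_p.shape, s_p.shape)  # (200,) (7, 49) (49,)
```

With `W_c` and `b_c` set to zero the pooling weights become uniform, so `c` reduces to a plain average of the token features.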

Knowledge Enhancement
While using text-image pairs can benefit sarcasm detection compared with using only a single modality, recent works have shown that it might still be challenging to detect sarcasm solely from a text-image pair (Li et al., 2021a; Veale and Hao, 2010). To this end, we explore the effect of fusing various kinds of external knowledge extracted from an image for sarcasm detection. For example, the knowledge could be image attributes (Cai et al., 2019) or ANPs (Xu et al., 2020a), as they provide more information on the key concepts delivered by the image. However, such information lacks the coherency and semantic integrity needed to describe an image and may introduce unexpected noise, as indicated in Figure 1. To address this limitation, we propose to generate image captions as the external knowledge to assist sarcasm detection. We further compare the effect of each knowledge form in the experiments.
To fuse external knowledge into our model, we treat knowledge $X_K$ as another "virtual" modality besides texts and images. Then the augmented input to the model becomes $(X_T, X_I, X_K)$. As the knowledge is given in textual form, we follow the process of generating text representations to attain the knowledge features. Specifically, we first obtain the input knowledge representations as $K = [k_1, k_2, \ldots, k_m]$ using BERT with an MLP, which is analogous to $T$. Then, we propose to reason about the congruity score between the text and knowledge modalities at the atomic level by following the procedure of computing the atomic-level congruity score between the text and image modalities (as shown in Equations (1)-(3)) with another set of parameters. Concretely, for cross-modality attention between texts and knowledge, we replace $I$ in Equation (1) with $K$ and $T$ in Equation (1) with $\tilde{T}$, which is the updated text representation after aligning with the visual modality. Since $\tilde{T}$ inherits information from the image modality, using it as the query to attend to the knowledge enables deeper interactions across all three modalities. By further replacing $T$ in Equation (2) with $\tilde{T}$, we denote the atomic-level text representations after aligning with the knowledge by $\tilde{T}_k$. The similarity matrix between texts and knowledge becomes $Q^k_a = \frac{1}{\sqrt{d}}\left(\tilde{T}_k K^\top\right)$. Then the atomic-level congruity score, denoted as $s^k_a \in \mathbb{R}^{m}$, can be obtained as

$$s^k_a = \mathrm{softmax}(\tilde{T}_k W^k_a + b^k_a)^\top Q^k_a, \tag{8}$$

where $W^k_a$ and $b^k_a$ are the trainable parameters of the text-knowledge branch. By adopting the dependency graph for $X_K$, we further generate the updated knowledge representations $\hat{K}$ via GAT and obtain the composition-level congruity score $s^k_p \in \mathbb{R}^{m}$ between the text and knowledge modalities, following the same procedure for the text-image composition-level congruity score as described in Section 3.4.

Training & Inference
Given both the atomic-level and composition-level congruity scores $s_a$ and $s_p$, respectively, the final prediction can be produced by considering the importance of each image patch for sarcasm detection:

$$p_v = \mathrm{softmax}(I W_v + b_v), \tag{9}$$

$$y = \mathrm{softmax}(W_y [p_v \odot s_a \,\|\, p_v \odot s_p] + b_y), \tag{10}$$

where $W_v \in \mathbb{R}^{d \times 1}$, $b_v \in \mathbb{R}^{r}$, $W_y \in \mathbb{R}^{2 \times 2r}$ and $b_y \in \mathbb{R}^{2}$ are trainable parameters, $p_v \in \mathbb{R}^{r}$ is an $r$-dimensional attention vector, and $\odot$ is the element-wise vector product. It is flexible to further incorporate external knowledge by reformulating Equation (10) to

$$y = \mathrm{softmax}(W_y [p_v \odot s_a \,\|\, p_v \odot s_p \,\|\, p_k \odot s^k_a \,\|\, p_k \odot s^k_p] + b_y), \tag{11}$$

where $s^k_a$ and $s^k_p$ are the atomic-level and composition-level congruity scores between the post and the external knowledge, and $W_y$ is enlarged to $\mathbb{R}^{2 \times (2r + 2m)}$. $p_k \in \mathbb{R}^{m}$ measures the importance of each word in the knowledge and is obtained by $p_k = \mathrm{softmax}(K W^k_v + b^k_v)$. The entire model can be trained in an end-to-end fashion by minimizing the cross-entropy loss given the ground-truth label $y$.

We evaluate our model on a publicly available multi-modal sarcasm detection dataset in English constructed by Cai et al. (2019). The statistics are shown in Table 1. Based on our preliminary analysis, the average numbers of tokens and entities in a text are approximately 17 and 4, respectively, so complex compositions among atomic units have a higher chance of being involved. This finding provides the basis for our framework, which uses atomic-level and composition-level information to capture hierarchical cross-modality semantic congruity.
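The fusion step and the training loss can be illustrated with a short NumPy sketch. It assumes the congruity scores are already computed, uses random or zero stand-in parameters, and treats the optional knowledge branch as a tuple of hypothetical inputs; `predict` and `cross_entropy` are illustrative names, not functions from the released code.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax for a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def predict(I, s_a, s_p, W_v, b_v, W_y, b_y, knowledge=None):
    """Fuse per-patch (and optionally per-knowledge-token) congruity scores into class probabilities."""
    p_v = softmax((I @ W_v).ravel() + b_v)              # (r,) patch importance
    feats = [p_v * s_a, p_v * s_p]                      # element-wise products
    if knowledge is not None:
        K, s_ka, s_kp, W_kv, b_kv = knowledge
        p_k = softmax((K @ W_kv).ravel() + b_kv)        # (m,) knowledge-token importance
        feats += [p_k * s_ka, p_k * s_kp]
    z = np.concatenate(feats)                           # (2r,) or (2r + 2m,)
    return softmax(W_y @ z + b_y)                       # (2,) non-sarcastic / sarcastic

def cross_entropy(y_prob, label):
    """Negative log-likelihood of the ground-truth class."""
    return -np.log(y_prob[label])

rng = np.random.default_rng(0)
r, m, d = 49, 20, 200
I, K = rng.normal(size=(r, d)), rng.normal(size=(m, d))
s_a, s_p = rng.normal(size=r), rng.normal(size=r)
s_ka, s_kp = rng.normal(size=m), rng.normal(size=m)
W_v, b_v = rng.normal(0, 0.05, (d, 1)), np.zeros(r)
W_kv, b_kv = rng.normal(0, 0.05, (d, 1)), np.zeros(m)

# Text-image only: the classifier sees 2r features.
y = predict(I, s_a, s_p, W_v, b_v, rng.normal(0, 0.05, (2, 2 * r)), np.zeros(2))
# With knowledge: the classifier sees 2r + 2m features.
y_k = predict(I, s_a, s_p, W_v, b_v, rng.normal(0, 0.05, (2, 2 * r + 2 * m)), np.zeros(2),
              knowledge=(K, s_ka, s_kp, W_kv, b_kv))
print(y.shape, y_k.shape, cross_entropy(y_k, 1) > 0)
```

Note that the classifier weight matrix must grow from 2r to 2r + 2m input columns once the knowledge scores are concatenated.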

Implementation
For a fair comparison, following the pre-processing in Cai et al. (2019), Liang et al. (2021) and Xu et al. (2020a), we remove samples containing words that frequently co-occur with sarcastic utterances (e.g., sarcasm, sarcastic, irony and ironic) to avoid introducing external information. The dependencies among tokens are extracted using the spaCy toolkit.
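The sample-filtering step can be sketched as follows. This is a minimal illustration that assumes each sample is a dictionary with a `text` field and that matching is done on lower-cased alphabetic tokens; the cue list contains only the four example words given above, and the helper names are hypothetical.

```python
import re

# Example cue words taken from the text; the full list used in practice may differ.
SARCASM_CUES = {"sarcasm", "sarcastic", "irony", "ironic"}

def contains_cue(text: str) -> bool:
    """True if any lower-cased alphabetic token of the text is a cue word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(token in SARCASM_CUES for token in tokens)

def filter_samples(samples):
    """Keep only samples whose text does not contain a cue word."""
    return [sample for sample in samples if not contains_cue(sample["text"])]

samples = [
    {"text": "Great, more rain #sarcasm"},
    {"text": "What a lovely Monday"},
    {"text": "So IRONIC."},
]
kept = filter_samples(samples)
print([sample["text"] for sample in kept])  # ['What a lovely Monday']
```

Matching on whole tokens means that hashtags such as `#sarcasm` are caught, while derived forms such as "ironically" are kept unless they are added to the cue set.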
For image preprocessing, we resize the image to 224 × 224 and divide it into patches of size 32 × 32 (i.e., p = 7 patches per side and r = 49 patches in total). For knowledge extraction, we extract image attributes following Cai et al. (2019), ANPs following Xu et al. (2020a) and image captions via Clipcap (Mokady et al., 2021). Next, we employ a pre-trained BERT-base-uncased model as the textual backbone network to obtain initial embeddings for texts and knowledge, and choose the pre-trained ResNet and ViT modules as visual backbone networks to extract initial embeddings for images. These textual and visual representations are mapped to 200-dim vectors by the corresponding MLPs. We use the Adam optimizer to train the model. Dropout and early stopping are adopted to avoid overfitting. The implementation details are listed in Table 6 in the Appendix. Our code is available at https://github.com/less-and-less-bugs/HKEmodel.

Results without External Knowledge
We first evaluate the effectiveness of our proposed framework by comparing with the baseline models, as shown in Table 2. Our proposed model achieves state-of-the-art performance. Text-based models perform far better than image-based methods, which implies that text is more comprehensible and more informative than images. This supports our intuition of extracting textual knowledge from images as additional clues. On the other hand, multi-modal methods outperform all single-modality models. This illustrates that considering information from both modalities contributes to the task by providing additional cues on modality associations. Note that compared with multi-modal methods using ResNet as the visual backbone network, our model achieves a 0.97% improvement in Accuracy and a 1.00% improvement in F1-score over the state-of-the-art method Att-BERT. Besides, using ViT as the image feature extractor, our model outperforms the InCrossMGs model with a 1.26% improvement in Accuracy and a 1.25% improvement in F1-score. Our method also improves Accuracy by 0.82% compared with the recently proposed CMGCN. The results demonstrate the effectiveness and superiority of our framework for sarcasm detection by modeling both atomic-level and composition-level cross-modality congruities in a hierarchical manner.

Results with External Knowledge
We then evaluate the effectiveness of our method when external knowledge is considered. Table 3 reports the accuracy and F1-score of our proposed sarcasm detection method enhanced with different types of knowledge. By incorporating image captions, the performance further improves compared with the original model (w/o external knowledge). On the contrary, Image Attributes and ANPs bring negative effects and deteriorate the performance. We conjecture two possible reasons: 1) image attributes and ANPs can sometimes be meaningless or even noisy for identifying sarcasm; 2) image attributes and ANPs are rather short, lacking rich compositional information for our hierarchical model. Last but not least, it is worth mentioning that exploiting only texts and captions in the textual modality (Image Captions w/o image), without the visual modality, also achieves superior performance compared with all multi-modal baselines in Table 2. This observation illustrates that pre-trained models such as CLIP and GPT-2 can provide meaningful external information for sarcasm detection.

Ablation Study
Impact of Different Components. We conduct ablation studies using ViT as the visual backbone network without external knowledge to further understand the impact of different components of our proposed method. To be specific, we consider three variants, which remove the atomic-level congruity score $s_a$, the composition-level congruity score $s_p$, and the MCA module, respectively. The results are shown in Table 4. It is clear that our model achieves the best performance when composing all these components. It is worth noting that the removal of the composition-level score $s_p$ leads to a significant performance drop, compared to the atomic-level removal. This indicates that the composition-level congruity plays a vital role in discovering inconsistencies between visual and textual modalities by exploiting complex structures through propagating atomic representations along the semantic or spatial dependencies. Moreover, the removal of MCA leads to slightly lower performance, which indicates that cross attention is beneficial for modeling cross-modality interactions and reducing the modality gap in the representation space.

Impact of MCA Layers. We measure the performance change without external knowledge when varying the number of MCA layers from 1 to 8 in Figure 3a. As can be seen, the performance first increases with the number of layers and then decreases after 6 layers. This suggests that excessive MCA layers in the atomic-level congruity module may overfit to textual and visual modality alignment instead of sarcasm detection.

Impact of GAT Layers. We analyse the impact of the number of GAT layers on our proposed model and report Accuracy and F1 scores in Figure 3b. The results show that the best performance is achieved when using a two-layer GAT model and that the performance drops when the number of layers is increased further. We conjecture the reason to be the over-smoothing issue of graph neural networks when increasing the number of propagation layers, which makes different nodes indistinguishable.

Impact of Different Sentence Embeddings. We perform experiments using Universal Sentence Encoder (USE) (Cer et al., 2018), CLIP (Radford et al., 2021), the CLS token of BERT (Devlin et al., 2019), and Word Averaging to extract the sentence embedding (i.e., $c$ in Equation (6)), as shown in Table 5. Although CLIP outperforms the other methods by a small margin, we prefer Word Averaging to keep our model concise and reduce the number of parameters.

Case Study
To further justify the effectiveness of external knowledge for the sarcasm detection task, we provide case studies on samples that are incorrectly predicted by our proposed framework with only the text-image pair but can be accurately classified with the help of image captions. For example, the intravenous injection depicted in Figure 4a can indicate a flu or hospitalization via image modeling, which aligns with flu or hospital in the text input. However, by generating an image caption expressing a bad mood indicated by suffered, it becomes easy to detect the sarcastic nature of this sample by contrasting fun in the text description and suffer in the image caption. As another example shown in Figure 4b, the image encoder only detects a human holding a torch without any context and wrongly predicts the sample as sarcasm because of the misalignment between the image and the text description. By generating an image caption describing an Olympic athlete, the knowledge-fused model is able to detect the alignment and correctly classifies this sample. This reflects that by further utilizing the CLIP (Radford et al., 2021) and GPT-2 (Radford et al., 2019) models, pre-trained on large-scale data, as an external knowledge source, the generated image captions are more expressive, helping to capture sophisticated visual concepts and to mitigate the figurativeness and subtlety of sarcasm.
We further illustrate the effectiveness of our hierarchical modeling by showing the congruity score maps in Figure 5. Given a sarcastic sample in Figure 5a, we visualize the congruity scores between the text and image in both the atomic-level module $s_a$ (left side of Figure 5b) and the composition-level module $s_p$ (right side of Figure 5b). The smaller the values, the less similar the text and image are (i.e., the more likely the sample is to be detected as sarcasm). It can be seen that the atomic-level module attends to furniture in the image, whereas the composition-level module down-weighs those patches, making the text and image less similar for sarcasm prediction. Correspondingly, our proposed hierarchical structure is able to refine atomic congruity to identify more complex mismatches for multi-modal sarcasm detection using graph neural networks.

Conclusion
In this paper, we propose to tackle the problem of sarcasm detection by reasoning about atomic-level congruity and composition-level congruity in a hierarchical manner. Specifically, we propose to model the atomic-level congruity based on the multi-head cross attention mechanism and the composition-level congruity based on graph attention networks. In addition, we propose to exploit the effect of various knowledge resources on enhancing the discriminative power of the model. Evaluation results demonstrate the superiority of our proposed model and the benefit of image captions as external knowledge for sarcasm detection.

Limitations
We present two possible limitations: 1) we only use the Twitter dataset for evaluation. However, to the best of our knowledge, this dataset is the only benchmark for the evaluation of multi-modal sarcasm detection in our community. Nevertheless, we conduct extensive experiments with various metrics to show the superiority of our proposed method. We leave the construction of more high-quality benchmarks to future work; 2) our knowledge enhancement strategy in Section 3.5 may not be suitable for ANPs and Image Attributes. We analyze the results in Section 4.5. Consequently, there is a pressing need for a more general and elegant knowledge integration method in view of the importance of external knowledge for multi-modal sarcasm detection.

Ethics Statement
This paper is informed by the ACM Code of Ethics and Professional Conduct. Firstly, we respect valuable and creative works in sarcasm detection and other related research domains. In particular, we cite the relevant papers and the sources of the pre-trained models and toolkits exploited by this work as thoroughly as possible. Besides, we will release our code in accordance with the licenses of any used artifacts. Secondly, our adopted dataset does not include sensitive personal information and will not introduce any information disorder to society. As a precaution against re-identification, we mask facial information in Figure 4b. At last, as our proposed sarcasm detection method benefits the identification of authentic intentions in multi-modal posts on social media, we expect that it can also have a positive impact on related problems, such as opinion mining, recommendation systems, and information forensics in the future.

A Model Overview
For illustration, we give a figure of the text-image branch, which captures atomic-level and composition-level congruity between textual and visual modalities for multi-modal sarcasm detection.

For our model, the maximum length of the sarcasm text is set to 100 and the maximum length of the generated image caption is set to 20. For the architecture, the number of multi-head cross attention layers is set to 6 for the text-image branch and 3 for the text-knowledge branch to capture the atomic-level congruity scores. The number of heads is set to 5. The number of graph attention layers is set to 2 to obtain the composition-level congruity scores for both branches. We use the Adam optimizer with a learning rate of 2e−5, a weight decay of 5e−3, a batch size of 32 and a dropout rate of 0.5 to train the model.
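For reference, the hyperparameters listed above can be gathered in a small configuration object. The sketch below is only an illustration: the class name `HKEConfig` and its field names are hypothetical and are not taken from the released code, while the values are those stated in the text (the hidden size of 200 and the 224/32 image settings come from the implementation section).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HKEConfig:
    """Hyperparameters as reported in the paper; names are illustrative only."""
    max_text_len: int = 100          # maximum length of the sarcasm text
    max_caption_len: int = 20        # maximum length of the generated caption
    mca_layers_image: int = 6        # cross-attention layers, text-image branch
    mca_layers_knowledge: int = 3    # cross-attention layers, text-knowledge branch
    num_heads: int = 5
    gat_layers: int = 2
    hidden_dim: int = 200            # size of the projected text/image features
    image_size: int = 224
    patch_size: int = 32
    learning_rate: float = 2e-5
    weight_decay: float = 5e-3
    batch_size: int = 32
    dropout: float = 0.5

    def head_dim(self) -> int:
        """Per-head feature size; the hidden size must divide evenly."""
        if self.hidden_dim % self.num_heads != 0:
            raise ValueError("hidden_dim must be divisible by num_heads")
        return self.hidden_dim // self.num_heads

    def num_patches(self) -> int:
        """Number of image patches r for a square image and square patches."""
        per_side = self.image_size // self.patch_size
        return per_side * per_side

config = HKEConfig()
print(config.head_dim(), config.num_patches())  # 40 49
```

Keeping the derived quantities as methods makes it easy to check that a changed hidden size or head count still divides evenly.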