MMSD2.0: Towards a Reliable Multi-modal Sarcasm Detection System

Multi-modal sarcasm detection has attracted much recent attention. Nevertheless, the existing benchmark (MMSD) has some shortcomings that hinder the development of reliable multi-modal sarcasm detection systems: (1) there are spurious cues in MMSD, which lead to biased model learning; (2) the negative samples in MMSD are not always reasonable. To solve these issues, we introduce MMSD2.0, a corrected dataset that fixes the shortcomings of MMSD by removing the spurious cues and re-annotating the unreasonable samples. Meanwhile, we present a novel framework called multi-view CLIP that is capable of leveraging multi-grained cues from multiple perspectives (i.e., the text, image, and text-image interaction views) for multi-modal sarcasm detection. Extensive experiments show that MMSD2.0 is a valuable benchmark for building reliable multi-modal sarcasm detection systems and that multi-view CLIP significantly outperforms the previous best baselines.


Introduction
Sarcasm detection is used to identify the real sentiment of a user, which is beneficial for sentiment analysis and opinion mining tasks (Pang and Lee, 2008). Recently, due to the rapid progress of social media platforms, multi-modal sarcasm detection, which aims to recognize sarcastic sentiment in multi-modal scenarios (e.g., text and image modalities), has attracted increasing research attention. Specifically, as illustrated in Figure 1, given the text-image pair, a multi-modal sarcasm detection system predicts the sarcasm label because the image shows a traffic jam that contradicts the text "love the traffic". Unlike the traditional sarcasm detection task, the uniqueness of multi-modal sarcasm detection lies in effectively modeling the consistency and sarcasm relationships among different modalities. Thanks to the rapid development of deep neural networks, remarkable success has been witnessed in multi-modal sarcasm detection. Specifically, Schifanella et al. (2016) made the first attempt, explicitly concatenating textual and visual features for multi-modal sarcasm detection. A series of works have employed attention mechanisms to implicitly incorporate features from different modalities (Cai et al., 2019; Xu et al., 2020; Pan et al., 2020). More recently, graph-based approaches have emerged to identify significant cues in sarcasm detection (Liang et al., 2021, 2022; Liu et al., 2022); they are capable of better capturing relationships across different modalities and thereby dominate the performance in the literature.
While current multi-modal sarcasm detection systems have achieved promising results, it is unclear whether these results faithfully reflect the multi-modal understanding ability of models. In fact, when a text-modality-only model, RoBERTa, is applied to multi-modal sarcasm detection, its performance significantly surpasses the state-of-the-art multi-modal model HKE (Liu et al., 2022) by 6.6% (see Detailed Analysis §4.3). This observation suggests that the performance of current models may heavily depend on spurious cues in textual data, rather than truly capturing the relationship among different modalities, resulting in low reliability. Further exploration reveals that the characteristics of the MMSD benchmark (Cai et al., 2019) may be the cause of this phenomenon: (1) Spurious Cues: the MMSD benchmark has some spurious cues (e.g., hashtag and emoji words) occurring in an unbalanced distribution across positive and negative examples, which leads to biased model learning; (2) Unreasonable Annotation: the MMSD benchmark simply assigns the text without special hashtags (e.g., #sarcasm)
as negative examples (i.e., the not-sarcastic label). We argue that this construction is unreasonable because a sentence without the #sarcasm tag can also express sarcastic intention. Take the utterance in Figure 2 as an example: without the #sarcasm tag, the utterance is still a sarcastic sample. Therefore, further chasing performance on the current MMSD benchmark may hinder the development of reliable multi-modal sarcasm detection systems. Motivated by the above observations, we shift our eyes from traditional complex network design (Cai et al., 2019; Liang et al., 2021, 2022) to the establishment of a reasonable multi-modal sarcasm detection benchmark. Specifically, we introduce MMSD2.0 to address these problems. To solve the first drawback, MMSD2.0 removes the spurious cues (e.g., hashtag and emoji words) from the text in MMSD, which encourages models to truly capture the relationship across different modalities rather than merely memorize spurious correlations. This operation can benefit the development of bias mitigation in multi-modal sarcasm detection studies. To address the second problem, we directly re-annotate the unreasonable data. Specifically, for each utterance labeled "not sarcastic" in MMSD, we ask crowdsourced workers to check and re-annotate the label. This correction process results in changes to over 50% of the samples in the original MMSD.
In addition to the dataset contribution, we propose a novel framework called multi-view CLIP, which can naturally inherit multi-modal knowledge from the pre-trained CLIP model. Specifically, multi-view CLIP utilizes different sarcasm cues captured from multiple perspectives (i.e., the text, image and text-image interaction views), and aggregates the multi-view information for the final sarcasm detection. Compared with previous superior graph-based approaches, multi-view CLIP has the following advantages: (1) it does not require any image pre-processing step for graph building (e.g., object detection); (2) it does not require any complex network architecture and can naturally make full use of the knowledge in a vision-language pre-trained model for multi-modal sarcasm detection.
Contributions of this work can be summarized as follows:
• To the best of our knowledge, we make the first attempt to point out the underlying issues in the current multi-modal sarcasm benchmark, which motivates researchers to rethink the progress of multi-modal sarcasm detection;
• We introduce MMSD2.0, a corrected dataset for multi-modal sarcasm detection that removes the spurious cues and fixes the unreasonable annotations, taking a meaningful step towards building a reliable multi-modal sarcasm system;
• We propose a novel multi-view CLIP framework to capture the different perspectives of image, text and image-text interaction, which attains state-of-the-art performance.

Spurious Cues Removal
In our in-depth analysis, we observe that the spurious cues come from two sources: (1) hashtag words and (2) emoji words. We therefore remove each of them in turn.
Hashtag Word Removal. In MMSD, as shown in Figure 3(a), we observe that the distribution of the number of hashtag words in positive and negative samples is obviously unbalanced. The number of hashtag words per positive sample is on average more than 1, while it is less than 1 for negative samples in the train, validation and test sets. In other words, a model only needs to learn a spurious correlation (the hashtag word count) to make a correct prediction, rather than truly understanding the multi-modal correlation in sarcasm detection.
To address this issue, we remove hashtag words from the text in the MMSD dataset. This forces the model to capture image features and use them to guide the final prediction, rather than relying on the hashtag word count as a spurious cue.
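As a concrete illustration, the hashtag stripping can be done with a simple regex pass. This is a minimal sketch; the exact tokenization pipeline used to build MMSD2.0 is not specified in the text, so the `#\w+` pattern is an assumption:

```python
import re

def remove_hashtag_words(text: str) -> str:
    """Strip hashtag tokens such as '#sarcasm' or '#traffic_jam' from a tweet."""
    # Drop every '#word' token, then collapse leftover whitespace.
    cleaned = re.sub(r"#\w+", "", text)
    return re.sub(r"\s+", " ", cleaned).strip()
```

For example, `remove_hashtag_words("what a great monday #sarcasm")` yields `"what a great monday"`, removing the cue that would otherwise let a text-only model shortcut the task.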
Emoji Word Removal. Similarly, we find that the distribution of emoji words between positive and negative samples is also unbalanced. Specifically, as shown in Figure 3(b), only 19.3% of emoji words appear in both positive and negative samples, while the remaining 80.7% appear in only one type of sample (i.e., only positive or only negative). This indicates that a model can simply use the emoji word distribution as a shortcut for prediction rather than truly capturing multi-modal cues.
To tackle this issue, we remove all emoji words from the text to force the model to learn truly multi-modal sarcasm features rather than relying on spurious textual cues.

Unreasonable Samples Re-Annotation via Crowdsourcing
This section describes the two stages of re-annotation: (1) a sample selection stage that chooses the unreasonable samples and (2) a re-annotation-via-crowdsourcing stage that fixes the samples chosen in the selection stage.
Sample Selection Stage. MMSD simply considers samples without special hashtags like "#sarcasm" as negative samples (i.e., not sarcastic). In this work, we argue that this process is unreasonable because samples without the #sarcasm tag can also express sarcastic intention. Therefore, we select all negative samples in the MMSD dataset (over 50% of the data) as potentially unreasonable samples for further processing.
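The selection rule above amounts to a one-line filter over the dataset. A sketch, assuming a hypothetical per-sample dict with a binary `label` field (0 = not sarcastic, 1 = sarcastic):

```python
def select_for_reannotation(samples):
    """Pick every MMSD negative (label 0, i.e., 'not sarcastic') as a candidate
    for human re-annotation; positive samples keep their original labels."""
    return [s for s in samples if s["label"] == 0]
```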
Re-annotation via Crowdsourcing. For all samples chosen in the sample selection stage, we directly re-annotate them via crowdsourcing by hiring human experts. Given a sample, each annotator is required to choose one of the following labels: (i) Sarcasm, if the sample expresses sarcastic intention; (ii) Not Sarcasm, if the sample does not express sarcastic intention; (iii) Undecided, if the sample is hard for the annotator to decide.
After the whole annotation is done, the samples labeled Undecided are re-annotated by three experts to decide the final annotation. To control the quality of the annotated dataset, we conduct two verification methods.
Onboarding Test. Before the main annotation work, we require all annotators to annotate 100 pre-selected samples, and the annotation results are checked by 3 experts. Only those who achieve 85% annotation accuracy can join the annotation process.
Double Check. We randomly sample 1,000 annotated samples and ask a new annotator to re-annotate the sarcasm label. We then calculate Cohen's Kappa (McHugh, 2012) between the previous and new labels, obtaining a kappa of 0.811, which indicates almost perfect agreement (Landis and Koch, 1977).
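For reference, Cohen's kappa compares observed agreement against the agreement expected by chance from each annotator's marginal label frequencies. A self-contained sketch (in practice one would typically use `sklearn.metrics.cohen_kappa_score`):

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators labeling the same samples."""
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    # Observed agreement: fraction of samples with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(
        (labels_a.count(lab) / n) * (labels_b.count(lab) / n)
        for lab in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)
```

On the Landis and Koch (1977) scale, values above 0.8 (like the 0.811 reported here) count as almost perfect agreement.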

Data Statistics
Table 1 provides a detailed comparison of the statistics of the MMSD and MMSD2.0 datasets. We can observe that the distribution of positive and negative samples in MMSD2.0 is more balanced.

Approach
This section first describes the basic architecture of CLIP (§3.1) and then illustrates the proposed multi-view CLIP framework (§3.2).

Multi-View CLIP
We introduce a novel multi-view CLIP framework for multi-modal sarcasm detection. Our framework consists of a text view (§3.2.1), an image view (§3.2.2) and an image-text interaction view (§3.2.3), which explicitly utilize different cues from different views of the CLIP model, thereby better capturing rich sarcasm cues for multi-modal sarcasm detection. The overall framework is shown in Figure 5.

Text View
A series of works (Xiong et al., 2019; Babanejad et al., 2020) have shown that textual information can be directly used for sarcasm detection. Therefore, we introduce a text view module to judge sarcasm from the text perspective.
Given samples $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N_{\mathcal{D}}}$, where $\{x, y\}$ is a text-image pair, the CLIP text encoder $\mathcal{T}$ is used to output the encoding representation $\boldsymbol{T}$:
$$\boldsymbol{T} = (\boldsymbol{t}_1, \dots, \boldsymbol{t}_n, \boldsymbol{t}_{\texttt{CLS}}) = \mathcal{T}(x),$$
where $n$ stands for the sequence length of $x$.
Then, the text-view decoder directly employs $\boldsymbol{t}_{\texttt{CLS}}$ for multi-modal sarcasm detection:
$$\boldsymbol{y}_t = \mathrm{softmax}(\boldsymbol{W}_t \boldsymbol{t}_{\texttt{CLS}} + \boldsymbol{b}_t),$$
where $\boldsymbol{y}_t$ is the output distribution and $\boldsymbol{W}_t$, $\boldsymbol{b}_t$ are trainable parameters.
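The decoder head above is just a linear projection followed by softmax over the two classes. A pure-Python sketch with toy dimensions (a real implementation would use the CLIP text encoder and a PyTorch linear layer; the `weight`/`bias` layout here is a hypothetical 2 x d row-matrix):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def view_decoder(cls_feature, weight, bias):
    """One view's decoder: y = softmax(W h + b) over {not sarcastic, sarcastic}.
    `weight` is a 2 x d matrix (list of rows), `bias` a length-2 vector."""
    logits = [
        sum(w * h for w, h in zip(row, cls_feature)) + b
        for row, b in zip(weight, bias)
    ]
    return softmax(logits)
```

The image view (§3.2.2) reuses the same head form on the visual CLS feature.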

Image View
Image information can serve as a supplementary feature to text for sarcasm detection (Schifanella et al., 2016), which motivates us to propose an image view module to detect sarcasm from the image perspective.


Specifically, the image view first leverages the CLIP visual encoder $\mathcal{V}$ to generate the image representation $\boldsymbol{I}$:
$$\boldsymbol{I} = (\boldsymbol{v}_{\texttt{CLS}}, \boldsymbol{v}_1, \dots, \boldsymbol{v}_m) = \mathcal{V}(y),$$
where $m$ denotes the number of image patches.

Similarly, $\boldsymbol{v}_{\texttt{CLS}}$ is used for prediction:
$$\boldsymbol{y}_v = \mathrm{softmax}(\boldsymbol{W}_v \boldsymbol{v}_{\texttt{CLS}} + \boldsymbol{b}_v),$$
where $\boldsymbol{y}_v$ is the output distribution.

Image-text Interaction View
Modeling the relationship across the text and image modalities is the key step in multi-modal sarcasm detection. We follow Fu et al. (2022) in using a transformer encoder (Vaswani et al., 2017) to sufficiently capture the interaction across different modalities. Specifically, given the yielded text representation $\boldsymbol{T}$ and image representation $\boldsymbol{I}$, we first concatenate them:
$$\boldsymbol{F} = (\boldsymbol{v}_{\texttt{CLS}}, \boldsymbol{v}_1, \dots, \boldsymbol{v}_m, \boldsymbol{t}_1, \dots, \boldsymbol{t}_n, \boldsymbol{t}_{\texttt{CLS}}) = \mathrm{Concat}(\boldsymbol{I}, \boldsymbol{T}).$$
Then, we apply different linear functions to obtain the corresponding queries $\boldsymbol{Q}$, keys $\boldsymbol{K}$ and values $\boldsymbol{V}$, and the updated representation $\tilde{\boldsymbol{F}}$ can be denoted as
$$\tilde{\boldsymbol{F}} = \mathrm{softmax}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^{\top}}{\sqrt{d}}\right)\boldsymbol{V}.$$
Given the updated image-text representation $\tilde{\boldsymbol{F}} = (\tilde{\boldsymbol{v}}_{\texttt{CLS}}, \tilde{\boldsymbol{v}}_1, \dots, \tilde{\boldsymbol{v}}_m, \tilde{\boldsymbol{t}}_1, \dots, \tilde{\boldsymbol{t}}_n, \tilde{\boldsymbol{t}}_{\texttt{CLS}})$, we use the key-less attention mechanism (Long et al., 2018) to further fuse the image-text interaction feature $\boldsymbol{f}$:
$$\alpha_i = \frac{\exp(\boldsymbol{w}^{\top}\tilde{\boldsymbol{f}}_i)}{\sum_j \exp(\boldsymbol{w}^{\top}\tilde{\boldsymbol{f}}_j)}, \qquad \boldsymbol{f} = \sum_i \alpha_i \tilde{\boldsymbol{f}}_i.$$
Finally, $\boldsymbol{f}$ is utilized for prediction:
$$\boldsymbol{y}_f = \mathrm{softmax}(\boldsymbol{W}_f \boldsymbol{f} + \boldsymbol{b}_f),$$
where $\boldsymbol{y}_f$ is the output distribution.
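The key-less attention step can be sketched in a few lines: each fused token gets a scalar score from a shared learned vector, the scores are softmax-normalized, and the output is the weighted sum of the token vectors. A toy pure-Python version (the real model operates on transformer-updated CLIP features; `score_w` stands in for the learned scoring vector):

```python
import math

def keyless_attention(features, score_w):
    """Key-less attention (Long et al., 2018) sketch: score each token with a
    shared vector, softmax the scores, and return the weighted sum of tokens."""
    scores = [sum(w * x for w, x in zip(score_w, f)) for f in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(features[0])
    # Weighted sum over the sequence yields a single fused interaction vector f.
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
```

Unlike full query-key attention, no pairwise token interactions are computed here, which keeps the fusion step cheap.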

Multi-view Aggregation
Given the obtained $\boldsymbol{y}_t$, $\boldsymbol{y}_v$ and $\boldsymbol{y}_f$, we adopt late fusion (Baltrusaitis et al., 2019) to yield the final prediction $\boldsymbol{y}_o$:
$$\boldsymbol{y}_o = \boldsymbol{y}_t + \boldsymbol{y}_v + \boldsymbol{y}_f,$$
where $\boldsymbol{y}_o$ can be regarded as leveraging rich features from the different perspectives of the text view, image view and image-text interaction view.
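Late fusion here is an element-wise combination of the three per-view distributions. A minimal sketch, assuming a plain sum followed by renormalization (the exact aggregation, e.g. weighted vs. unweighted, is an assumption; the argmax is unaffected by the renormalization):

```python
def late_fusion(y_t, y_v, y_f):
    """Combine the text-, image- and interaction-view distributions into the
    final prediction by element-wise sum, renormalized to a distribution."""
    summed = [a + b + c for a, b, c in zip(y_t, y_v, y_f)]
    total = sum(summed)
    return [s / total for s in summed]
```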

Model Training
We use a standard binary cross-entropy loss for the image, text, and image-text interaction views, and jointly optimize them to train the entire framework:
$$\mathcal{L} = \mathcal{L}_T + \mathcal{L}_V + \mathcal{L}_F, \qquad \mathcal{L}_* = -\sum_{i=1}^{N_{\mathcal{D}}} \hat{\boldsymbol{y}}^{(i)} \log \boldsymbol{y}_*^{(i)},$$
where $\hat{\boldsymbol{y}}^{(i)}$ is the gold label. It is worth noting that we can directly use the final gold label to train the image-view and text-view branches, which does not bring any extra annotation burden.
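The joint objective can be sketched per sample: the same gold label supervises every view, and the three per-view cross-entropies are simply summed (a toy pure-Python version; a real implementation would use PyTorch losses over batches):

```python
import math

def view_nll(y_pred, gold):
    """Cross-entropy of one view's distribution [p_not_sarcastic, p_sarcastic]
    against the shared gold label (0 or 1); clamped to avoid log(0)."""
    return -math.log(max(y_pred[gold], 1e-12))

def joint_loss(y_t, y_v, y_f, gold):
    """Joint objective L = L_T + L_V + L_F: one gold label supervises all
    three views, so no extra annotation is needed."""
    return view_nll(y_t, gold) + view_nll(y_v, gold) + view_nll(y_f, gold)
```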
All experiments are conducted on Tesla V100 GPUs.
(iii) For multi-modality methods, we compare multi-view CLIP with the following state-of-the-art baselines: (1) HFM (Cai et al., 2019) is a hierarchical fusion model for multi-modal sarcasm detection; (2) D&R Net (Xu et al., 2020) proposes a decomposition and relation network to model cross-modality features; (3) Att-BERT (Pan et al., 2020) applies two attention mechanisms to model the text-only and cross-modal incongruity, respectively; (4) InCrossMGs (Liang et al., 2021) is a graph-based model using in-modal and cross-modal graphs to capture sarcasm cues; (5) CMGCN (Liang et al., 2022) proposes a fine-grained cross-modal graph architecture to model the cross-modality information; (6) HKE (Liu et al., 2022) is a hierarchical graph-based framework to model atomic-level and composition-level congruity. For a fair comparison, CMGCN and HKE are the versions without external knowledge.
Table 2 (left part) illustrates the results on MMSD. We have the following observations: (1) Text-modality methods achieve promising performance, and RoBERTa even surpasses the multi-modality approaches, which indicates that a sarcasm detection model can rely on text features alone to make correct predictions, supporting the motivation of our work; (2) Among the multi-modality approaches, multi-view CLIP attains the best results, demonstrating the effectiveness of integrating features from different modality views.

Performance on MMSD2.0
On MMSD2.0, we observe: (1) The performance of text-modality models drops substantially compared with MMSD, which suggests that MMSD2.0 successfully removes the spurious cues and can be used as a more reliable benchmark; (2) The previous graph-based baselines such as CMGCN and HKE do not perform well on MMSD2.0, even worse than Att-BERT. We attribute this to the fact that such graph-based approaches rely heavily on the spurious cues (e.g., hashtag and emoji words) when constructing the text semantic dependency graph, and these cues are no longer available in MMSD2.0; (3) Lastly, multi-view CLIP attains the best results not only among the multi-modality approaches but also against the text- and image-modality approaches, which further verifies the effectiveness of our framework.

Analysis
To understand the multi-view CLIP in more depth, we answer the following research questions: (1) Does each modality view contribute to the overall performance of the model? (2) What effect do the training strategies of CLIP have on the model performance? (3) What impact do different interaction approaches have on the model performance? (4) Does the multi-view CLIP approach remain effective in low-resource scenarios? (5) Why does the multi-view CLIP work?

Answer1: All Views Contribute to the Final Performance
To analyze the effect of the different modality views in our framework, we remove the training objectives of the text view $\mathcal{L}_T$, image view $\mathcal{L}_V$ and image-text interaction view $\mathcal{L}_F$ separately. Table 3 illustrates the results. We observe that accuracy decreases by 1.46%, 1.95% and 3.20% when removing $\mathcal{L}_T$, $\mathcal{L}_V$ and $\mathcal{L}_F$, respectively. As seen, our framework attains the best performance when combining all views, which suggests that every view contributes to the final performance. It is worth noting that removing $\mathcal{L}_F$ leads to a significant performance drop, which suggests that modeling the multi-modal interaction features is crucial for multi-modal sarcasm detection.

Figure 7: Different interaction methods. Cross attention is based on self-attention but obtains the query from one modality and the key and value from another modality. MLP denotes applying a feed-forward layer after concatenating the information from different modalities.

Answer2: Full Finetuning Gains the Best Performance
To investigate the influence of training methods for the backbone CLIP, we conduct experiments on different training combinations of $\mathcal{V}$ and $\mathcal{T}$.
The results are shown in Figure 6 and reveal that full finetuning of CLIP leads to the best performance. An interesting observation is that freezing all of CLIP leads to almost the same performance as freezing only $\mathcal{T}$ or $\mathcal{V}$. This can be attributed to the fact that the text and image representations in CLIP are aligned, and training only one part can break this alignment property.

Answer3: Transformer Interaction Fuses Cross-modal Information More Deeply
To further verify the effectiveness of our framework, we explore different interaction methods for the text-image interaction view, including cross attention (Pan et al., 2020) and an MLP approach that concatenates information from different modalities and uses a feed-forward layer to fuse them. Figure 7 illustrates the results. We observe that our transformer interaction is more effective than the other two interactions. We attribute this to the fact that the transformer interaction is able to fuse the information from different modalities more deeply, hence attaining the best performance.

Answer4: Multi-view CLIP Transfers to Low-resource Scenarios
To explore the effectiveness of multi-view CLIP in low-resource scenarios, we experiment with different training set sizes of 10%, 20% and 50%.
The results are shown in Figure 8, indicating that the multi-view CLIP approach still outperforms the other baselines in low-resource settings. In particular, with 10% of the data, multi-view CLIP outperforms the baselines in the same setting by a large margin, and even surpasses Att-BERT trained on the full dataset. We attribute this to the fact that the knowledge learned during CLIP pre-training transfers to low-resource settings, indicating that our framework is able to extract sarcasm cues even when the training corpus is limited in size.

Answer5: Multi-view CLIP Utilizes Correct Cues
To further explain why our framework works, we visualize the attention distribution of the visual encoder $\mathcal{V}$ to show why our framework is effective for multi-modal sarcasm detection. As shown in Figure 9, our model successfully focuses on the parts of the image containing sarcasm cues. For example, our framework pays more attention to the non-fresh beans in Figure 9(a), which are the key cues contradicting "fresh" in the text. Meanwhile, our framework focuses on the bad-weather regions in Figure 9(b), which are likewise important cues incongruent with "amazing" in the text. This demonstrates that our framework can successfully focus on the correct cues.

Related Work
Sarcasm detection identifies the incongruity of sentiment from the context, and first attracted attention in the text modality. Early studies focused on feature engineering approaches to detect incongruity in text (Lunando and Purwarianti, 2013; Bamman and Smith, 2015). A series of works (Poria et al., 2016; Zhang et al., 2016; Xiong et al., 2019) In contrast to their work, we make the first attempt to address the bias issues in the traditional MMSD dataset, towards building a reliable multi-modal sarcasm detection system. In addition, we introduce MMSD2.0 to this end, aiming to facilitate the research. To the best of our knowledge, this is the first work to reveal the spurious cues in the current multi-modal sarcasm detection dataset.

Conclusion
This paper first analyzed the underlying issues in the current multi-modal sarcasm detection dataset and then introduced the MMSD2.0 benchmark, which takes the first meaningful step towards building a reliable multi-modal sarcasm detection system. Furthermore, we proposed a novel framework, multi-view CLIP, to capture sarcasm cues from different perspectives, including the image view, text view, and image-text interaction view. Experimental results show that multi-view CLIP attains state-of-the-art performance.

Limitations
This work contributes a debiased benchmark, MMSD2.0, for building reliable multi-modal sarcasm detection systems. While appealing, MMSD2.0 is built on the available MMSD benchmark. In the future, we can consider annotating more data to go beyond the scale and diversity of the original MMSD.


Figure 1 :
Figure 1: Multi-modal sarcasm example. The box and words in red denote the correlated sarcastic cues. The word #traffic_jam with the hash symbol # is a hashtag.

Figure 2 :
Figure 2: Overall process of constructing the MMSD2.0 dataset. Given the example in (a), the Spurious Cues Removal stage first removes the spurious cues in the text, including the hashtag word (#terrible_food) and emoji word (emoji_39), to acquire (b); then the unreasonable-samples re-annotation via crowdsourcing (human re-annotation) stage re-annotates the unreasonable samples to get the final reasonable example (c).

Figure 5 :
Figure 5: The overall framework of our multi-view CLIP. A pre-trained CLIP model encodes the input texts and images. The image view and text view utilize image-only and text-only information to capture sarcasm cues. The image-text interaction view fuses the cross-modal information. The three views are aggregated for the final prediction.
Figure 9: Attention visualization examples. (a) "<user> good to see you keep your beans fresh ..." (b) "weather 's looking amazing today ..."
explored deep learning networks (e.g., CNN, LSTM and self-attention) for sarcasm detection. Babanejad et al. (2020) extended BERT by incorporating effective features for sarcasm detection. Compared with their work, we aim to solve sarcasm detection in the multi-modal scenario, while their work focuses on text sarcasm detection. With the rapid popularization of social media platforms, multi-modal sarcasm detection has attracted increasing research attention in recent years. Schifanella et al. (2016) explored the multi-modal sarcasm detection task for the first time and tackled it by concatenating the features from the text and image modalities. Cai et al. (2019) proposed a hierarchical fusion model to fuse the information among different modalities and released a new public dataset. Xu et al. (2020) suggested representing the commonality and discrepancy between image and text via a decomposition and relation network. Pan et al. (2020) applied a BERT-based model to consider the incongruity character of sarcasm. Liang et al. (2021) explored constructing interactive in-modal and cross-modal graphs to learn sarcastic features. Liang et al. (2022) proposed a cross-modal graph architecture to model the fine-grained relationship between the text and image modalities. Liu et al. (2022) proposed a hierarchical framework to model both atomic-level congruity and composition-level congruity.


Table 2:
Baseline results on the MMSD dataset are taken from Liu et al. (2022). Results with - denote that the code is not released. Results with † stand for models we re-implemented.
Figure 6: Different fine-tuning methods. Freeze All stands for both $\mathcal{V}$ and $\mathcal{T}$ being frozen, Freeze VE and Freeze TE indicate that $\mathcal{V}$ and $\mathcal{T}$ are frozen respectively, and Full Finetuned means the whole CLIP is trainable.