Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models

The age of social media is rife with memes. Understanding and detecting harmful memes pose a significant challenge due to their implicit meaning, which is not explicitly conveyed through the surface text and image. However, existing harmful meme detection approaches only recognize superficial harm-indicative signals in an end-to-end classification manner, ignoring in-depth cognition of the meme text and image. In this paper, we attempt to detect harmful memes based on advanced reasoning over the interplay of multimodal information in memes. Inspired by the success of Large Language Models (LLMs) on complex reasoning, we first conduct abductive reasoning with LLMs. Then we propose a novel generative framework to learn reasonable thoughts from LLMs for better multimodal fusion and lightweight fine-tuning, which consists of two training stages: 1) distill multimodal reasoning knowledge from LLMs; and 2) fine-tune the generative framework to infer harmfulness. Extensive experiments conducted on three meme datasets demonstrate that our proposed approach achieves superior performance to state-of-the-art methods on the harmful meme detection task.


Introduction
The development of social media platforms has given rise to a new form of multimodal content known as the meme. A meme typically comprises a picture combined with a concise text component. Memes possess the capacity to quickly spread across the internet, especially on social media platforms, due to their ease of dissemination. While memes are often seen as humorous, there is a potential for harm when the combination of images and texts is strategically used to promote political and sociocultural divisions. For instance, as in Figure 1(a), during the COVID-19 pandemic, a widely circulated meme falsely claimed that the mRNA vaccine would alter human genetic code (DNA). Such multimodal disinformation spread caused vaccine safety and effectiveness concerns, hindering the formation of strong immune defenses in impacted areas globally (Basch et al., 2021; Lin et al., 2022). Besides, another meme example shown in Figure 1(b) perpetuates harmful stereotypes and generalizations about Asians. Therefore, it is necessary to develop automatic approaches to facilitate harmful meme detection for unveiling the dark side of memes.
Harmful memes are generally defined as "multimodal units consisting of an image and accompanying text that has the potential to cause harm to an individual, an organization, a community, or the whole society" (Sharma et al., 2022). Previous studies (Kiela et al., 2020; Pramanick et al., 2021a,b) attempted to straightforwardly utilize pre-trained vision-language models (Li et al., 2019; Lu et al., 2019) for harmful meme detection by training additional task-specific classification layers. More recently, Cao et al. (2022) proposed a prompt-tuning method with the meme text and image caption as the prompt for masked language modeling (Devlin et al., 2019; Liu et al., 2019).
However, existing harmful meme detection approaches oversimplify the problem as an end-to-end classification paradigm, which only recognizes the superficial signals conveyed through the surface text and image. More in-depth investigation and cognition of the implicit meaning is required, especially when the image and text are not obviously correlated (Pramanick et al., 2021b). Intuitively, the key to harmful meme detection is to excavate the rich correlations beneath the surface of the seemingly uncorrelated text and image in the meme: 1) For example, as in Figure 1(b), the image and the text are not harmful when considered in isolation, but are harmful when taken as a whole.
A human checker should cognize that the "biting" action in the image of a young woman with her pet dog ridicules Asians' "dog-eating" behavior, which corresponds to the word "asian" in the text.
2) In contrast, some harmful signals (e.g., "die" or "villain") are observed in the text of Figure 1(c), but the meme itself actually does not promote hate or discrimination against a particular group of people.
This is because the text is a quote from a popular movie and is often used as a philosophical statement about the choices people make in life, and the image further adds a celebratory and joyful tone to the overall message. In comparison, conventional detection methods just focus on recognizing shallow harm-indicative signals without such multimodal reasoning and essential background-knowledge consideration, so the social dynamics of different races or the origin of the meme text from classic movie lines may not be well-cognized. Unlike such recognition-level detection models, we argue that establishing reasonable thought between textual and visual information can further improve meme understanding with background knowledge for better harmful meme detection.
Inspired by the success of LLMs for reasoning at the cognition level with contextual background knowledge (Wei et al., 2022; Kojima et al., 2022; Zhang et al., 2022), we propose a novel approach, MR.HARM, which leverages the Multimodal Reasoning knowledge distilled from LLMs for Harmful meme detection. To this end, we first prompt LLMs for abductive reasoning, and then propose a two-stage generative framework based on smaller language models to learn reasonable thoughts from LLMs for better multimodal fusion and lightweight fine-tuning. More specifically, we incorporate the meme text and image into a two-stage training paradigm: 1) Reasoning Distillation: in the first stage, we fine-tune our smaller language models with the interaction of language and vision features to distill multimodal reasoning knowledge from LLMs, which empowers our framework with the ability to conduct cognitive reasoning for the harmfulness prediction. 2) Harmfulness Inference: in the second stage, we exploit the fine-tuned small language models to infer the final harmfulness prediction. In this manner, we augment the harmful meme detection model with multimodal reasoning knowledge to unmask the implicit meaning hidden in the holistic multimodal information of memes.
We evaluate our proposed approach on three public meme datasets. The results not only show that our method outperforms strong harmful meme detection baselines by a large margin, but also provide fine-grained analysis for interpreting how our approach works. Our contributions are summarized as follows: • To the best of our knowledge, we are the first to alleviate the issue of superficial understanding in harmful meme detection by explicitly utilizing commonsense knowledge, from a fresh perspective on harnessing advanced LLMs. • We propose a novel generative framework to fine-tune smaller language models augmented with the multimodal reasoning knowledge distilled from LLMs, which facilitates better multimodal fusion and lightweight fine-tuning for harmfulness prediction.
• Extensive ablations on three meme datasets confirm that our method yields superior performance over state-of-the-art baselines for the harmful meme detection task.
Related Work

Harmful Meme Detection
Harmful meme detection is a rapidly growing area in the research community, driven by the recent availability of large meme benchmarks (Kiela et al., 2019; Suryawanshi et al., 2020; Pramanick et al., 2021a). The Hateful Memes Challenge organized by Facebook (Kiela et al., 2020) further encouraged researchers to develop solutions for detecting harmful memes in hate speech (Das et al., 2020).
More recently, Pramanick et al. (2021a) first defined the harmful meme concept and demonstrated its dependence on contextual factors. The complex nature of memes, which often rely on multiple modalities, makes it challenging for unimodal detection methods (Simonyan and Zisserman, 2014; He et al., 2016; Devlin et al., 2019) to yield good performance. Therefore, recent studies in this area attempted to apply multimodal approaches to the harmful meme detection task. Previous studies employed classical two-stream models that integrate text and vision features, learned from text and image encoders, using attention-based mechanisms and multimodal fusion techniques to classify harmful memes (Kiela et al., 2019, 2020; Suryawanshi et al., 2020). Another branch was to fine-tune pre-trained multimodal models specifically for the task (Lippe et al., 2020; Muennighoff, 2020; Velioglu and Rose, 2020; Hee et al., 2022). Recent related efforts have also sought to explore the use of data augmentation techniques (Zhou et al., 2021; Zhu et al., 2022), ensemble methods (Zhu, 2020; Velioglu and Rose, 2020; Sandulescu, 2020), and harmful target disentanglement (Lee et al., 2021). More recently, Pramanick et al. (2021b) proposed a multimodal framework using global and local perspectives to detect harmful memes, which achieves state-of-the-art performance. A follow-up prompt-based approach (Cao et al., 2022) attempted to concatenate the meme text and extracted image captions to fine-tune masked language models (Liu et al., 2019) for harmful meme detection. However, existing solutions only capture the superficial signals of different modalities in memes in an end-to-end manner, largely ignoring explicit deductive reasoning to guide the model in understanding background knowledge about the complex and diverse relations between the visual and textual elements.

Large Language Models
Recently, LLMs have demonstrated remarkable capability in complex reasoning (Brown et al., 2020; Thoppilan et al., 2022; Rae et al., 2021; Chowdhery et al., 2022), such as generating intermediate inference procedures before the final output (Nye et al., 2021; Wei et al., 2022; Kojima et al., 2022; Zhang et al., 2022). Unfortunately, the large size of LLMs restricts their deployment for detecting harmful memes with different modalities, regardless of how they are enhanced with strategic text prompting. Knowledge distillation has been successfully used to transfer knowledge from larger, more competent teacher models into smaller student models affordable for practical applications (Buciluǎ et al., 2006; Hinton et al., 2015; Beyer et al., 2022). However, existing research on knowledge distillation from LLMs (Wang et al., 2022; Ho et al., 2022; Magister et al., 2022) only considers the language modality, so it is not suitable for harmful meme detection, because harmful memes convey holistic synergistic information through multimodal features. In this work, we conduct abductive reasoning with LLMs, which further advocates a multimodal reasoning paradigm to fine-tune smaller language models (LMs) for harmful meme detection.

Our Approach
Problem Statement We define a harmful meme detection dataset as a set of memes where each meme M = {y, I, T} is a triplet representing an image I associated with a text T, and a ground-truth harmfulness label y ∈ {harmful, harmless}. In this work, to investigate multimodal reasoning distilled from LLMs, we convert the harmful meme detection task into a natural language generation paradigm, where our model takes the text T and image I as the input and generates a text sequence containing the label y to clearly express whether the meme is harmful.
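As an illustrative sketch of this generative reformulation, each labeled triplet can be cast into a source/target text pair for sequence-to-sequence training; the field names and the exact target wording below are our assumptions, not the paper's prescribed format:

```python
def to_generative_example(meme):
    """Cast a labeled meme {y, I, T} into a seq2seq training pair.

    `meme` is a dict with keys "text" (T), "caption" (a textual stand-in
    for the image I), and "label" (y, "harmful" or "harmless"); the key
    names and the target phrasing are illustrative assumptions.
    """
    source = f"Text: {meme['text']} Image: {meme['caption']}"
    target = f"The meme is {meme['label']}"
    return source, target


src, tgt = to_generative_example(
    {"text": "my black boy friend",
     "caption": "a woman holds a baby gorilla",
     "label": "harmful"})
# tgt is a label-bearing sentence rather than a class index
```

The harmfulness label is thus recovered from the generated sequence rather than from a classification head.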
Our core idea is to reason and evolve with the cognition-level rationale beyond recognition-level perception (Davis and Marcus, 2015) by capturing the inter-relationship between visual and textual elements in memes. The overview of our framework is shown in Figure 2.

[Figure 2 example — prompt to the LLM: "Given a text: my black boy friend, which is embedded in an image: a woman holds a baby gorilla, please provide a rationale for how the meme is reasoned as the harmfulness label: harmful". Generated rationale: "The text could be seen as objectifying or reducing a person to their race. The image of a woman holding a baby gorilla could be interpreted as a comparison between the black boyfriend and an animal, reinforcing harmful stereotypes about race. The overall message of the meme has the potential to spread harmful or offensive content about race and relationships."]
Figure 2: The overall pipeline of our method. We first conduct abductive reasoning with LLMs to extract harmfulness rationales (pink) with a prompt consisting of the meme text (green), the image caption (blue), and the label (orange). We then use the generated rationales to train small task-specific models with multimodal inputs as the first fine-tuning stage, and feed the same inputs to the updated model for harmfulness inference as the second fine-tuning stage.

Abductive Reasoning with LLMs
In this paper, we propose to utilize abductive reasoning with multimodal inputs to train smaller downstream models. LLMs can produce natural language rationales unveiling the implicit meaning beneath the surface of memes, to justify why a meme is harmful or not. This shares a similar intuition with heuristic teaching (Pintrich and Schunk, 2002), where a teacher with rich experience and knowledge can impart to students the correct way of thinking and reasoning, based on questions with corresponding answers. The students then learn how to deduce their own ways to the correct answers from questions accordingly. Thus we aim to activate explicit reasoning knowledge in LLMs as a teacher model, e.g., contextual and cultural information related to memes, to guide our model to strengthen harmfulness prediction.
Given a meme sample M = {y, I, T} from the training data, to prompt large language models in the uniform language modality, we first extract the text caption Ĩ of the image I with off-the-shelf captioning models (Mokady et al., 2021). Then we curate a template p consisting of the triplet {y, Ĩ, T} as observed attributes, to prompt the LLMs to generate a rationale r that elicits the reasoning knowledge about how to infer the harmfulness label y based on the interplay of the meme text T and the image caption Ĩ, as illustrated in Figure 2. Specifically, we design p as: "Given a Text: [T], which is embedded in an Image: [Ĩ]; and a harmfulness label [y], please give me a streamlined rationale associated with the meme, without explicitly indicating the label, for how it is reasoned as [y]." As we clarify the ground-truth harmfulness label in the observed attributes of the prompt, the hallucination issue (Bang et al., 2023) of LLMs can be effectively alleviated, because the rich contextual background knowledge is activated by abductive reasoning based on the ground truth, and invalid rationales are naturally filtered out.
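Assembling the template p is plain string formatting; a minimal sketch following the wording above:

```python
def build_abductive_prompt(meme_text, image_caption, label):
    """Fill the template p with the observed triplet {y, I~, T}.

    The wording follows the template quoted in the text; the function
    name and argument names are our own.
    """
    return (
        f"Given a Text: [{meme_text}], which is embedded in an Image: "
        f"[{image_caption}]; and a harmfulness label [{label}], please give "
        f"me a streamlined rationale associated with the meme, without "
        f"explicitly indicating the label, for how it is reasoned as [{label}]."
    )


p = build_abductive_prompt("my black boy friend",
                           "a woman holds a baby gorilla",
                           "harmful")
```

Note that the ground-truth label appears twice: once as an observed attribute and once as the reasoning target, which is what anchors the abductive rationale to the gold answer.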

Reasoning Distillation with Small LMs
Since we utilize image captions to represent the meme images, we can perform abductive reasoning with large language models pre-trained on the language modality. However, using only the captions, as opposed to the original vision features, may suffer from a lack of mutual synergy in the representation space of different modalities in memes, due to the inductive bias of possible information loss in the captioning process. On the other hand, LLMs can be used to conduct abductive reasoning only for the training data, whose harmfulness label is given a priori, but are challenging to fine-tune for this task due to the huge number of model parameters. To facilitate the interactions between the meme text and the image, we propose to fine-tune a smaller language model for the harmful meme detection task, which allows flexibility in adjusting model architectures to incorporate multimodal features and is more lightweight for task-specific fine-tuning.
In this section, we train a small language model as a student model distilled from the LLMs with multimodal reasoning knowledge. Specifically, we leverage the generated rationales from LLMs as informative supervision to fine-tune a smaller pre-trained language model to excavate the rich inter-relationship between the language and vision modalities of memes.
Encoding For a meme sample M from the training data, we first encode the text T and the image I to obtain their embedding vectors as follows:

H^0_T = TE(T),    H_I = VE(I) W_P,

where TE(·) denotes the text embedding layer of the LM Encoder, and H^0_T ∈ R^{m×d} is the token embeddings in the Transformer encoder (Vaswani et al., 2017), where m is the text token length and d is the dimension of the hidden states. VE(·) is the Vision Extractor, implemented as frozen pre-trained vision Transformers (Radford et al., 2021), which fetches the patch-level features of the image with n patches, projected (via W_P) into the visual representations H_I ∈ R^{n×d}. Next, to support semantic alignment between the text and the image for better context understanding, we exploit a cross-attention mechanism (Luo et al., 2022) for multimodal fusion of the textual and visual information:

H^i_I = Softmax((H^i_T W^i_Q)(H_I W^i_K)^T / √d) (H_I W^i_V),

where H^i_T is the input hidden states of each LM Encoder layer and H^i_I is the attended visual features. Then we can fuse H^i_I with H^i_T to attain the interplay representations for a meme:

λ^i = σ(W^i_λ [H^i_T ; H^i_I] + b^i_λ),
H^{i+1}_T = LME^i(H^i_T + λ^i ⊙ H^i_I),

where LME^i(·) is the i-th layer of the LM Encoder, W^i_* denotes the linear projection, b^i_* is the bias, and Ĥ = H^L_T is the final interplay representations after going through an L-layer LM Encoder fused with the visual features.
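A NumPy sketch of the cross-attention step and a gated residual fusion (single head, no layer stacking; the specific projection matrices and the sigmoid gate are illustrative assumptions about the fusion described above, not the exact parameterization):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(H_T, H_I, W_Q, W_K, W_V):
    """Attend text hidden states over visual patch features.

    H_T: (m, d) text hidden states; H_I: (n, d) visual patch features.
    Returns attended visual features aligned to the m text tokens.
    """
    Q, K, V = H_T @ W_Q, H_I @ W_K, H_I @ W_V
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)  # (m, n)
    return attn @ V                                          # (m, d)

def gated_fuse(H_T, H_I_att, W_g, b_g):
    """Sigmoid gate on the concatenated states controls how much visual
    signal flows into each text position (illustrative fusion)."""
    z = np.concatenate([H_T, H_I_att], axis=-1) @ W_g + b_g  # (m, d)
    gate = 1.0 / (1.0 + np.exp(-z))
    return H_T + gate * H_I_att                              # (m, d)
```

Each attention row is a distribution over the n image patches, so every text token receives a patch-weighted visual summary before fusion.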
Decoding Finally, we feed the interplay representations Ĥ ∈ R^{m×d} into the LM Decoder, implemented as a Transformer-based decoder, to generate the reasonable rationale. Overall, the smaller language model f is trained by minimizing the following distillation loss:

L_RD = CE(f(T, I), r),

where CE(·) denotes the cross-entropy loss (Sutskever et al., 2014) between the predicted text and the target rationale r generated by LLMs. In this way, multimodal reasoning knowledge about the meme can be explicitly distilled from LLMs and injected into the smaller language model specific to harmful meme detection.
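The distillation objective reduces to token-level cross-entropy against the LLM-generated rationale; a NumPy sketch assuming teacher-forced decoding (the averaging over time steps is our simplification):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(decoder_logits, rationale_ids):
    """Mean negative log-likelihood of the target rationale tokens.

    decoder_logits: (T, V) scores over the vocabulary at each decoding step.
    rationale_ids:  (T,) token ids of the LLM-generated rationale r.
    """
    probs = softmax(decoder_logits, axis=-1)
    T = len(rationale_ids)
    nll = -np.log(probs[np.arange(T), rationale_ids] + 1e-12)
    return nll.mean()
```

With uniform logits over a vocabulary of size V, the loss is log(V), the usual sanity check for an untrained decoder.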

Harmfulness Inference
During the first fine-tuning stage, we conducted explicit deductive reasoning to empower our model with the capability of multimodal reasoning distilled from LLMs. As the goal of this task is to determine whether the meme is harmful or not, we conduct the second fine-tuning stage for Harmfulness Inference, which shares the same model architecture, parameters, and encoding procedure as Sec. 3.2 but differs in the decoding output. To make the output consistent with harmfulness prediction, the smaller model f is further trained by minimizing the following inference loss:

L_HI = CE(f(T, I), y),

where the cross-entropy loss is computed between the generated text and the ground-truth harmfulness label y, with the generative objective (Raffel et al., 2020). [...] MaskPrompt (Cao et al., 2022). We use the accuracy and macro-averaged F1 score as the evaluation metrics. More implementation details and baseline descriptions are provided in the Appendix.
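For reference, the two reported metrics can be computed from scratch as follows; this plain-Python sketch is equivalent to scikit-learn's accuracy_score and macro-averaged f1_score on the binary harmfulness labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of memes whose predicted label matches the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels=("harmful", "harmless")):
    """Unweighted mean of the per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Macro-averaging weights both classes equally, which matters because harmful memes are the minority class in these benchmarks.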

Harmful Meme Detection Performance
Table 1 shows the performance of our proposed method versus all the compared methods on the Harm-C, Harm-P, and FHM datasets. It is observed that: 1) The performance of the baselines in the first group is obviously poor because only unimodal features, i.e., text-only or image-only, are captured. In comparison, the other baselines exploit the multimodal features from both the text and image in memes. 2) The multimodal models in the second group outperform the unimodal ones. The early-fusion models with multimodal pre-training (i.e., VisualBERT COCO and ViLBERT CC) outperform those with simple fusion and unimodal pre-training (i.e., Late Fusion and MMBT) on the Harm-C/P datasets, while MOMENTA performs best in the second group by considering global and local information of memes. 3) However, as the images in the FHM dataset are more informative and of higher quality, MaskPrompt yields the best performance among all the baselines by incorporating additional extracted entities and demographic information of the image into the masked language models, besides just captioning the image into the prompt. Our proposed MR.HARM improves over the best baselines by 2.63%, 1.31%, and 9.86% in terms of Macro-F1 score on the Harm-C, Harm-P, and FHM datasets, respectively. We observe that: 1) the improvement on the Harm-P dataset is relatively milder than that on the other two datasets. Meanwhile, all the baselines show only tiny differences in their performance on Harm-P. We speculate the reason lies in the smaller scale of Harm-P, which only contains politics-related harmful memes. 2) A similar trend can also be observed on the Harm-C and FHM datasets: the more challenging the dataset is, the greater the performance improvement MR.HARM achieves. Our model performs flexibly and stably across all datasets with its keen judgment on harmful memes. This is because all the baselines are designed only at the recognition level, while MR.HARM is further empowered with multimodal reasoning knowledge distilled from LLMs to unearth harmful content from the seemingly uncorrelated text and image modalities of memes.

Ablative Study
We perform ablative studies on several variants of MR.HARM: 1) w/o Reasoning Distillation: simply fine-tune the smaller language models in the Harmfulness Inference stage without the Reasoning Distillation stage based on LLMs; 2) w/o Visual Features: discard the features from the meme image while keeping those from the meme text; 3) w/o Multimodal Fusion: instead of the fusion mechanism on the multimodal features in our language model, we only append the lingual features from image captioning together with the meme text.

As demonstrated in Table 2, the ablative models suffer different degrees of performance degradation, indicating the effectiveness of our proposed components for harmful meme detection with multimodal reasoning distilled from LLMs. Specifically, the performance of MR.HARM significantly decreases in the 'w/o Reasoning Distillation' setting due to the lack of multimodal reasoning knowledge transferred from LLMs about the seemingly uncorrelated modalities in memes. The 'w/o Visual Features' setting also achieves worse performance than MR.HARM, suggesting that the visual representations are complementary to the meme text for harm-indicative pattern extraction in the language model. MR.HARM makes improvements over 'w/o Multimodal Fusion', which implies the promoting role of our fusion mechanism that incorporates original vision features into the language model, hardly compromised when there could be severe information loss in the captioning process. Moreover, the 'w/o Two-stage Training' setting leads to large-margin performance degradation, which verifies the effectiveness of our two-stage training paradigm. This is because this setting causes mutual interference between intermediate reasoning and final prediction, which affects the convergence of harmfulness inference and damages the model's performance and stability. Compared with MR.HARM, the performance of 'w/o Fine-tuning Small LMs' also significantly decreases, highlighting the importance of abductive reasoning with LLMs to alleviate the hallucination issue during deductive reasoning for harmfulness prediction.

[Figure 3: Intermediate reasoning elicited from our model on three test memes. (a) Intermediate Reasoning: the meme contains a political message that promotes a political agenda. the text suggests that the coronavirus is nothing compared to windmill cancer, which is a harmful and offensive message. the image of a politician in a suit and tie, possibly related to the inauguration of president trump, further reinforces the harmful message. the use of a political message in this context could be seen as insensitive and disrespectful to those who have been affected by the pandemic. (b) Intermediate Reasoning: based on the text and image it appears that the meme is making an inflammatory statement about the democrat party. the image shows a close-up of a confederate flag with stars, which could be interpreted as an implication of the democrat party. the text also mentions the oldest hate group in the united states, which could be interpreted as a reference to the democrat party being associated with racism and white supremacy. (c) Meme text: it's the goat humper, looks like a bunch of pigs just walked over your grave. Intermediate Reasoning: the text in the meme contains a derogatory and offensive statement "goat humper" towards a particular group of Muslim individuals, associated with an image of a Muslim man praying in front of a mosque, which is highly offensive and discriminatory. The phrase "looks like a bunch of pigs just walked over your grave" can be interpreted as a disrespectful reference to the Muslim practice of abstaining from pork.]

Cognition-view Reasoning Analysis
Note that our smaller language model is explicitly trained in the Reasoning Distillation stage for rationale generation to distill multimodal reasoning knowledge from LLMs. Although intermediate reasoning is not the final target output for harmful meme detection, after the first fine-tuning stage we elicit reasonable thoughts from our smaller language model with the test samples as input, in order to understand the cognition view of our proposed MR.HARM on the test meme samples more transparently and intuitively, as exemplified in Figure 3.
From the visualized intermediate reasoning, we observe that: 1) our model can understand the multimodal information related to the meme text (in green) and image (in blue) with commonsense knowledge. For example, in Figure 3(a), the recognized "politician" in the image could be related to "president trump", which could be linked to the "AMERICA" in the text; in Figure 3(b), the recognized "flag" in the image could be cognized as satirizing "the democrat party" in the text; and in Figure 3(c), the "goat humper" and "pigs" in the text could be associated with the attacks on "a Muslim man" recognized in the image. 2) Furthermore, our model learns to cognize the interplay (in pink) of multimodal information with advanced reasoning. Benefiting from the rich multimodal understanding of the memes, the perpetuated harmful stereotypes could be reasoned over the context toward the target, like "those who have been affected by the pandemic" in Figure 3(a), "the democrat party" in Figure 3(b), and "the Muslim" in Figure 3(c). In this way, the rich correlation beneath the surface of the meme text and image can be excavated to facilitate harmfulness inference with better reasoning knowledge by harnessing advanced LLMs. Such readable pieces of rationales are also potentially valuable for aiding human checkers in verifying the final answer predicted by our model.

Error Analysis
To better understand the behavior of our model and facilitate future studies, we conduct an error analysis on the memes wrongly predicted by our proposed framework. We found that the major error lies in that our framework still cannot fully recognize the images that require rich background knowledge, even though we exploited the advanced cross-attention mechanism to incorporate visual features into the language model. Figure 4 shows two examples of memes wrongly classified by MR.HARM. For the harmful meme in Figure 4(a), the phrase "and for my next class project!" suggests that the image is being used for an academic or educational purpose, which can be seen as glorifying or normalizing the behavior depicted in the image. The image features "a group of Ku Klux Klan members walking on a beach", which is a symbol of white supremacy and racism. The combination of the phrase in the text and the use of imagery associated with hate groups can contribute to the glorification of harmful behaviors and the perpetuation of negative stereotypes, which makes the meme harmful. However, due to the lack of related background knowledge about the Ku Klux Klan members and their wear, our framework cannot recognize the image correctly during the original vision feature extraction, which leads to error propagation for wrongly concluding that the meme is harmless. Also, in terms of the harmless meme in Figure 4(b), the image of "Jimmy Carter with a smile on his face" is mistakenly recognized as "an older man with a funny expression on his face"; furthermore, the model hallucinates that the meme text "can be considered harmful and offensive to individuals who identify with the politician", resulting in the wrong prediction that the meme is harmful. Therefore, it is possible to improve MR.HARM by incorporating more informative vision features and improving language-vision interaction to be capable of understanding images with more complex background knowledge.

[Figure 4: Two memes wrongly classified by MR.HARM. (a) Meme text: and for my next class project! Intermediate Reasoning: based on the image and text presented in the meme, it appears to be a harmless and relatable scenario. the image shows a group of women in traditional dress walking on a beach, which is a common and harmless activity. the text, "and for my next class project," seems to be a playful and lighthearted comment about the topic of the class project. there is no indication of any harmful or offensive content in the image or text. (b) Intermediate Reasoning: the text in the meme contains a derogatory and offensive statement about a politician, jimmy carter, who created the department of education in 1979. the use of such language and imagery can be considered harmful and offensive to individuals who identify with the politician. additionally, the image of an older man with a funny expression on his face can be seen as promoting a negative and harmful attitude towards the politician.]

Discussion
As our two-stage training paradigm requires distilling the reasoning knowledge and leveraging original vision features, we utilize the T5 encoder-decoder architecture (Raffel et al., 2020; Chung et al., 2022) to initialize our generative framework. To test the generality of the benefits of our approach across different versions of the backbone, we alter the underlying LMs to other variants of different sizes.

As shown in Table 3, one interesting phenomenon is that our model already achieves outstanding performance on the three benchmarks with the Small (about 60M parameters) or Base (about 220M parameters) version as the backbone, which is smaller than the state-of-the-art baseline MaskPrompt (over 300M parameters). The Large version of our backbone generally achieves better performance than the other two backbone versions, because larger fine-tuned LMs better alleviate the hallucination issue (Ji et al., 2023). Overall, the results show that our framework does not rely excessively on the size of the backbone to improve performance and is generally effective with different versions of the backbone model.

Conclusion and Future Work
In this paper, we propose to capture the implicit meaning that is not explicitly conveyed through the surface of the text and image in memes for harmful meme detection. We first conduct abductive reasoning with LLMs. Then we present a novel generative framework to distill multimodal reasoning knowledge from LLMs, which includes two training stages: 1) reasoning distillation and 2) harmfulness inference. Results on three meme benchmarks confirm the advantages of our proposed framework. For future work, since it is hard to judge the quality of the intermediate reasoning, where the evaluation is necessarily qualitative, we plan to conduct a systematic study toward explainable harmful meme detection, claiming explainability through a human-subjects study for evaluation.

Limitations
There are multiple ways to further improve this work: • Although this work focuses on performance improvement of harmful meme detection, it is hard to judge the quality of the intermediate reasoning, where the evaluation is necessarily qualitative. Considering that our framework can generate readable snippets for cognition-view reasoning, we plan to conduct a systematic study to claim explainability for the evaluation, which would be another, more targeted line of research.
• New benchmarks to evaluate the reasoning ability of our framework are in demand. We are going to further exploit LLMs toward explainable harmful meme detection, from perspectives such as dataset construction and automatic evaluation.
• We only use textual prompts to conduct abductive reasoning with accessible LLMs pre-trained on the language modality. In the future, we will update our framework with visual LLMs, once accessible, to improve visual feature extraction for better multimodal reasoning knowledge distillation, and to mitigate common deficiencies of existing language models, such as hallucination and limited generalization, as much as possible.

A Datasets
The detailed statistics of the three datasets are shown in Table 4.

B Implementation Details
To separate the text and the image in each meme, we first in-paint the meme by combining MMOCR (Kuang et al., 2021) with SAM (Kirillov et al., 2023) to extract the text and the pure image.
Then, during the captioning process, since the focus of this work is primarily on multimodal reasoning for harmful meme detection from a fresh perspective of harnessing LLMs, we apply a pre-trained image captioning model, ClipCap (Mokady et al., 2021), used in recent work (Cao et al., 2022), to generate textual descriptions of the dominant objects or events in each meme's image; these descriptions are used as inputs to the LLMs for abductive reasoning. To generate the rationale for each meme, we employ ChatGPT (Ouyang et al., 2022), a widely used LLM developed by OpenAI, specifically the "gpt-3.5-turbo" version. To make our results reproducible, we set the temperature to 0 and the maximum length to 256.
For the system prompt to the "gpt-3.5-turbo" model, we design the message as: "You have been specially designed to perform abductive reasoning for the harmful meme detection task. Your primary function is that, according to a harmfulness label about an image with a text embedded, please provide a streamlined rationale, without explicitly indicating the label, for how it is reasoned as the given harmfulness label. The image and the textual content in the meme are often uncorrelated, but its overall semantics is presented holistically. Thus it is important to note that you are prohibited from relying on your own imagination, as your goal is to provide the most accurate and reliable rationale possible so that people can infer the harmfulness according to your reasoning about the background context and relationship between the given text and image."
Moreover, to prompt the LLMs to generate reasonable rationales with the triplet {y, Ĩ, T} as observed attributes, we design the template p for the user prompt as: "Given a Text: [T], which is embedded in an Image: [Ĩ]; and a harmfulness label [y], please give me a streamlined rationale associated with the meme, without explicitly indicating the label, for how it is reasoned as [y]."
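Under the settings above, the rationale-generation call can be sketched as follows. This is a minimal illustration, not our exact implementation: the system prompt is abbreviated, and the helper name `build_user_prompt` is ours for exposition; the commented API call assumes the OpenAI Python client.

```python
# Abbreviated version of the full system prompt given above.
SYSTEM_PROMPT = (
    "You have been specially designed to perform abductive reasoning "
    "for the harmful meme detection task. ..."
)

def build_user_prompt(meme_text, image_caption, label):
    # Fill the user-prompt template p with the triplet {y, I, T}.
    return (
        f"Given a Text: [{meme_text}], which is embedded in an Image: "
        f"[{image_caption}]; and a harmfulness label [{label}], please give me "
        f"a streamlined rationale associated with the meme, without explicitly "
        f"indicating the label, for how it is reasoned as [{label}]."
    )

# Hypothetical call (client setup omitted); temperature=0 and
# max_tokens=256 match the reproducibility settings stated above.
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[
#         {"role": "system", "content": SYSTEM_PROMPT},
#         {"role": "user",
#          "content": build_user_prompt(text, caption, label)},
#     ],
#     temperature=0,
#     max_tokens=256,
# )
```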
Our MR.HARM model utilizes the T5 encoder-decoder architecture (Raffel et al., 2020; Chung et al., 2022) as its foundational framework, specifically the "flan-t5-base" version. For the extraction of image features, following previous work (Pramanick et al., 2021b), we adopt the state-of-the-art vision Transformer CLIP-ViT-B/32 (Radford et al., 2021); this module remains frozen throughout training. To effectively integrate the multimodal information, we incorporate a simple one-head cross-attention mechanism in each layer of the T5 encoder. During fusion, the text features serve as the query, while the image features act as the key and value. Note that these fusion modules are randomly initialized. For the fine-tuning phase, we provide a comprehensive list of the hyper-parameters in Table 5. Results are averaged over ten random runs. All experiments were conducted on a single V100 32GiB GPU.
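The one-head cross-attention fusion described above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: the projection weights are randomly initialized (as our fusion modules are), the residual connection is an assumption about how the fused output re-enters the encoder layer, and dimensions are illustrative.

```python
import numpy as np

def cross_attention_fusion(text_feats, image_feats, d_model=512, seed=0):
    """One-head cross-attention: text features are the query,
    CLIP image features serve as the key and value."""
    rng = np.random.default_rng(seed)
    # Randomly initialized projections, as in the fusion modules above.
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(3))
    Q = text_feats @ Wq                      # (T_text, d)
    K = image_feats @ Wk                     # (T_img, d)
    V = image_feats @ Wv                     # (T_img, d)
    scores = Q @ K.T / np.sqrt(d_model)      # (T_text, T_img)
    # Numerically stable softmax over the image tokens.
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    # Assumed residual connection back into the text stream.
    return text_feats + attn @ V

# Example: 16 text tokens attend over 50 image patch features.
fused = cross_attention_fusion(np.zeros((16, 512)), np.zeros((50, 512)))
```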

C Baselines
We compare our model MR.HARM with several state-of-the-art harmful meme detection systems, described below. While LLMs offer strong zero/few-shot performance, as shown in the "w/o Fine-tuning Small LMs" setting in Table 2, they are challenging to serve in practice, requiring at least 350GB of GPU memory on specialized infrastructure for a single 175-billion-parameter LLM. This work presents a novel paradigm that leverages the reasoning ability and rich background knowledge of LLMs for better harmful meme detection, while only fine-tuning a small language model that is even smaller than the state-of-the-art baseline. Table 6 reports the detailed results on the three meme datasets with different versions of our fine-tuned backbone model. Table 7 compares the characteristics of MR.HARM with state-of-the-art baselines such as MOMENTA and MaskPrompt.

E Discussion about One-stage Training
We further investigate one-stage training to probe the intrinsic properties of chain-of-thought reasoning. We compare two proposed variants for one-stage training: 1) Explanation, where the rationale is used to explain the harmfulness inference; and 2) Reasoning, where the harmfulness inference is conditioned on the rationale. As shown in Table 8, the reasoning setting performs worse than the explanation setting by a large margin. We conjecture that this is because the reasoning setting in one-stage training can lead to error propagation if our small language model generates hallucinated rationales that mislead the harmfulness inference, which is well avoided by the two-stage training paradigm. Meanwhile, as there is mutual interference between rationale generation and harmfulness prediction, the explanation setting gives the harmfulness inference higher priority in the generated sequence, so it performs better than the reasoning setting. We argue that the one-stage training paradigm could be improved in the future by applying a filtering mechanism, e.g., using only the effective chain-of-thought reasoning to infer the harmfulness of memes and discarding irrelevant rationales. In summary, both settings in the one-stage training paradigm suffer different degrees of performance degradation, which reaffirms the necessity of our two-stage training paradigm.
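The difference between the two one-stage variants is just the ordering within the single generation target. A schematic sketch (the connective words and helper name are illustrative, not the exact target format):

```python
def build_target(label, rationale, setting):
    """Construct the single-sequence training target for one-stage training.

    'explanation': the label comes first, so harmfulness inference gets
    higher priority and a hallucinated rationale cannot mislead it.
    'reasoning':   the label is conditioned on the rationale, so a
    hallucinated rationale propagates errors into the prediction.
    """
    if setting == "explanation":
        return f"{label} because {rationale}"
    if setting == "reasoning":
        return f"{rationale} so {label}"
    raise ValueError(f"unknown setting: {setting}")
```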

F Future Work
We will explore the following directions in the future: • Considering that our framework can generate readable snippets for cognition-view reasoning, we plan a systematic study to substantiate its explainability, possibly through a human subjects study.
• In this work we explore the underlying reasoning process to empower the harmful meme detection model with the ability of explicit reasoning, so as to arrive at correct harmfulness predictions. We plan to further exploit LLMs for explainable harmful meme detection from perspectives such as dataset construction on social media with propagation structure (Lin et al., 2021; Ma and Gao, 2020; Ma et al., 2020), automatic evaluation, and human evaluation.
• We will further update our framework with visual LLMs, once accessible, to improve visual feature extraction for better multimodal reasoning, and to mitigate common deficiencies of existing language models, such as hallucination and limited generalization, as much as possible.

Figure 1 :
Figure 1: Examples of harmful and harmless memes. Meme text: (a) Chance a virus with a 99.97% recovery rate; Alter my DNA from an experimental vaccine, with NO liability, from a corrupt industry. (b) when you date an asian boy and you trynna get his family to accept you. (c) you either die a hero, or live long enough to become the villain.
during encoding; 4) w/o Two-stage Training: Concatenate the rationales generated from LLMs and golden harmfulness label as the target for model training, to replace the two-stage training paradigm; 5) w/o Fine-tuning Small LMs: Directly prompt the representative large language model ChatGPT based on InstructGPT (Ouyang et al., 2022) for harmful meme detection.

Figure 4 :
Figure 4: Examples of wrongly predicted memes by our proposed framework with the ground truth (a) harmful and (b) harmless.

1) Text BERT: BERT (Devlin et al., 2019) is utilized as the unimodal text-only model; 2) Image-Region: a unimodal vision-only model that processes meme images with Faster R-CNN (Ren et al., 2016) and ResNet-152 (He et al., 2016), feeding the features into a classification layer; 3) Late Fusion: a multimodal model that averages the prediction scores of BERT and ResNet-152 for harmful meme detection (Pramanick et al., 2021a); 4) MMBT: a multimodal Bi-Transformer (Kiela et al., 2019) that captures the intra-modal and inter-modal dynamics of the two modalities; 5) VisualBERT COCO: VisualBERT (Li et al., 2019) pre-trained on the COCO dataset (Lin et al., 2014); 6) ViLBERT CC: Vision and Language BERT (Lu et al., 2019) trained with an intermediate multimodal objective (Sharma et al., 2018) for task-agnostic joint representations of image and text; 7) MOMENTA: a multimodal harmful meme detection system (Pramanick et al., 2021b) that takes the global and local information of both meme modalities into account; 8) MaskPrompt: a prompt learning approach (Cao et al., 2022) that casts harmful meme detection as a masked language modeling problem based on RoBERTa-large (Liu et al., 2019). We use accuracy and macro-averaged F1 score as the evaluation metrics; macro-averaged F1 is the more important metric owing to the imbalanced class prevalence (see Table 4), as it captures competitive performance beyond the majority class.
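To illustrate why macro-averaged F1 is the more informative metric under class imbalance, consider a majority-class predictor. This is a minimal sketch with illustrative numbers, not figures from the benchmarks:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of the per-class F1 scores,
    so the minority class counts as much as the majority class."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# 90% harmless, 10% harmful; always predicting "harmless" looks accurate,
# but macro-F1 exposes the total failure on the minority class.
y_true = ["harmless"] * 9 + ["harmful"]
y_pred = ["harmless"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.9
# macro_f1(y_true, y_pred) averages F1(harmless) ~ 0.947 with F1(harmful) = 0.
```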

Figure 5 :
Figure 5: The details of our Multimodal Fusion module.

Figure 5
Figure 5 illustrates the details of our multimodal fusion module in the encoding phase of MR.HARM.

Table 1 :
adapted to the previous Reasoning Distillation stage, the prior reasoning knowledge absorbed in Reasoning Distillation could be well induced for Harmfulness Inference. Harmful meme detection results on the three datasets. The accuracy and macro-averaged F1 score (%) are reported as the metrics. The best and second-best results are in bold and underlined, respectively. Statistics of the three datasets are given in the Appendix.

Table 2 :
Ablation studies on our proposed framework.

Table 4 :
Statistics of Datasets.

Table 6 :
The detailed results with different sizes of our fine-tuned LMs.

Table 7 :
Comparison of characteristics between our MR.HARM and state-of-the-art models for harmful meme detection.

Table 8 :
Effects of the one-stage training. Accuracy and macro-averaged F1 (%) are reported.