Detecting Harmful Memes and Their Targets

Among the various modes of communication in social media, the use of Internet memes has emerged as a powerful means to convey political, psychological, and socio-cultural opinions. Although memes are typically humorous in nature, recent times have witnessed a proliferation of harmful memes targeting and abusing various social entities. As most harmful memes are highly satirical and abstruse without the appropriate context, off-the-shelf multimodal models may not be adequate to understand their underlying semantics. In this work, we propose two novel problem formulations: detecting harmful memes and identifying the social entities they target. To this end, we present HarMeme, the first benchmark dataset, containing 3,544 memes related to COVID-19. Each meme went through a rigorous two-stage annotation process. In the first stage, we labeled each meme as very harmful, partially harmful, or harmless; in the second stage, we further annotated the type of target(s) that each harmful meme points to: individual, organization, community, or society/general public/other. The evaluation results using ten unimodal and multimodal models highlight the importance of using multimodal signals for both tasks. We further discuss the limitations of these models, and we argue that more research is needed to address these problems.


Introduction
The growing popularity of social media has led to the rise of multimodal content as a way to express ideas and emotions. As a result, a new type of message was born: the meme. A meme is typically formed by an image and a short piece of text on top of it, embedded as part of the image. Memes are typically innocent and designed to look funny. Over time, however, memes started being used for harmful purposes in the context of contemporary political and socio-cultural events, targeting individuals, groups, businesses, and society as a whole. At the same time, their multimodal nature and often camouflaged semantics make their analysis highly challenging (Sabat et al., 2019).
Meme analysis. The proliferation of memes online and their increasing importance have led to a growing body of research on meme analysis (Sharma et al., 2020a; Reis et al., 2020). It has also been shown that off-the-shelf multimodal tools may be inadequate to unfold the underlying semantics of a meme, as (i) memes are often context-dependent, (ii) the visual and the textual content are often uncorrelated, and (iii) meme images are mostly morphed, and the embedded text is sometimes hard to extract using standard OCR tools (Bonheme and Grzes, 2020).
The dark side of memes. Recently, there has been a lot of effort to explore the dark side of memes, e.g., focusing on hateful (Kiela et al., 2020) and offensive (Suryawanshi et al., 2020) memes. However, the harm a meme can cause can be much broader. For instance, the meme in Figure 1c is neither hateful nor offensive, but it is harmful to the media outlets shown on the top left (ABC, CNN, etc.), as it compares them to China, suggesting that they adopt strong censorship policies. In short, the scope of harmful meme detection is much broader, and it may encompass other aspects such as cyberbullying, fake news, etc. Moreover, harmful memes have a target (e.g., news organizations such as ABC and CNN in our previous example), which requires separate analysis, not only to decipher their underlying semantics, but also to help with the explainability of the detection models.

Our contributions. In this paper, we study harmful memes, and we formulate two problems. Problem 1 (Harmful meme detection): Given a meme, detect whether it is very harmful, partially harmful, or harmless. Problem 2 (Target identification of harmful memes): Given a harmful meme, identify whether it targets an individual, an organization, a community/country, or the society/general public/others. To this end, we develop a novel dataset, HarMeme, containing 3,544 real memes related to COVID-19, which we collected from the web and carefully annotated. Figure 1 shows several examples of memes from our collection, whether they are harmful, as well as the types of their targets. We prepared detailed annotation guidelines for both tasks. We further experimented with ten state-of-the-art unimodal and multimodal models to benchmark the two problems. Our experiments demonstrate that a systematic combination of multimodal signals is needed to tackle these problems. Interpreting the models further reveals some of the biases of the best multimodal model, which lead to drops in performance.
Finally, we argue that off-the-shelf models are inadequate in this context and that there is a need for specialized models. Our contributions can be summarized as follows:
• We study two new problems: (i) detecting harmful memes and (ii) detecting their targets.
• We release a new benchmark dataset, HarMeme, developed based on comprehensive annotation guidelines.
• We perform initial experiments with state-of-the-art textual, visual, and multimodal models to establish baselines. We further discuss the limitations of these models.
Reproducibility. The full dataset and the source code of the baseline models are publicly available. The appendix contains the values of the hyper-parameters and the detailed annotation guidelines.

Related Work
Below, we present an overview of the datasets and the methods used for multimodal meme analysis.
Meme sentiment/emotion analysis. Hu and Flaxman (2018) developed the TUMBLR dataset for emotion analysis, consisting of image-text pairs along with associated tags, by collecting posts from the Tumblr platform. Thang Duong et al. (2017) prepared a multimodal dataset containing images, titles, upvotes, downvotes, number of comments, etc., all collected from Reddit. Recently, SemEval-2020 Task 8 on Memotion Analysis (Sharma et al., 2020a) introduced a dataset of 10k memes, annotated with sentiment, emotions, and emotion intensity. Most participating systems in this challenge used a fusion of visual and textual features, computed using models such as Inception, ResNet, CNN, VGG-16, and DenseNet for image representation (Morishita et al., 2020; Sharma et al., 2020b; Yuan et al., 2020), and BERT, XLNet, LSTM, GRU, and DistilBERT for text representation (Liu et al., 2020; Gundapu and Mamidi, 2020). Due to class imbalance in the dataset, approaches such as GMM and Training Signal Annealing (TSA) were also found useful. Morishita et al. (2020), Bonheme and Grzes (2020), Guo et al. (2020), and Sharma et al. (2020b) proposed ensemble learning, whereas Gundapu and Mamidi (2020), De la Peña Sarracén et al. (2020), and several others used multimodal approaches. A few others leveraged transfer learning using pre-trained models such as BERT (Devlin et al., 2019), VGG-16 (Simonyan and Zisserman, 2015), and ResNet (He et al., 2016). Finally, state-of-the-art results for all three tasks on this dataset (sentiment classification, emotion classification, and emotion quantification) were reported by a deep neural model that combines sentence demarcation and multi-hop attention, whose interpretability was also studied using the LIME framework (Ribeiro et al., 2016).
Meme propagation. Dupuis and Williams (2019) surveyed personality traits of social media users who are more active in spreading misinformation in the form of memes. Crovitz and Moran (2020) studied the characteristics of memes as a vehicle for spreading potential misinformation and disinformation. Zannettou et al. (2020a) discussed the quantitative aspects of large-scale dissemination of racist and hateful memes among polarized communities on platforms such as 4chan's /pol/. Ling et al. (2021) examined the artistic composition and the aesthetics of memes, the subjects they communicate, and the potential for virality.
Based on this analysis, they manually annotated 50 memes as viral vs. non-viral. Zannettou et al. (2020b) analyzed the "Happy merchant" memes and showed how online fringe communities influence their spread to mainstream social networking platforms. They reported reasonable agreement for most manually annotated labels, and established a characterization for meme virality.
Other studies on memes. Reis et al. (2020) built a dataset of memes related to the 2018 and 2019 elections in Brazil (34k images, 17k users) and India (810k images, 63k users), with a focus on misinformation. Another dataset of 950 memes targeted the propaganda techniques used in memes (Dimitrov et al., 2021a), and was also featured in a shared task at SemEval-2021 (Dimitrov et al., 2021b). Leskovec et al. (2009) introduced a dataset of 96 million memes collected from various links and blog posts between August 2008 and April 2009 for tracking the most frequently appearing stories, phrases, and information. Topic modeling of the textual and visual cues of hateful and racially abusive multimodal content on sites such as 4chan was studied for scenarios that leverage genetic testing to claim superiority over minorities (Mittos et al., 2020). Zannettou et al. (2020a) examined the content of meme images and online posting activity using Hawkes processes (Hawkes, 1971), which model how the occurrence of one event in a background process affects the occurrence of subsequent events in other processes, within the context of trolls' online posting. Another study observed that fauxtographic content tends to attract more attention, and established how such content becomes a meme on social media. Finally, there is a recent survey on multimodal disinformation detection (Alam et al., 2021).
Differences with existing studies. Hate speech detection in multimodal memes (Kiela et al., 2020) is the closest work to ours. However, our work differs substantially from it and from other related studies: (i) we deal with harmful meme detection, which is a more general problem than hateful meme detection; (ii) along with harmful meme detection, we also identify the entities that a harmful meme targets; (iii) HarMeme comprises real-world memes posted on the web, as opposed to the synthetic memes used by Kiela et al. (2020); and (iv) we present a unique dataset and benchmark results both for harmful meme detection and for identifying the targets of harmful memes.
Defining Harmful Memes
Here, we define harmful memes as follows: multimodal units consisting of an image and a piece of embedded text that have the potential to cause harm to an individual, an organization, a community, or society more generally. Here, harm includes mental abuse, defamation, psycho-physiological injury, proprietary damage, emotional disturbance, and damage to public image.
Harmful vs. hateful/offensive. Harmful is a more general term than offensive and hateful: offensive and hateful memes are harmful, but not all harmful memes are offensive or hateful. For instance, the memes in Figures 1b and 1c are neither offensive nor hateful, but harmful to Donald Trump and to news media such as CNN, respectively. Offensive memes typically aim to mock or to bully a social entity. A hateful meme contains offensive content that targets an entity (e.g., an individual, a community, or an organization) based on its personal/sensitive attributes such as gender, ethnicity, religion, nationality, sexual orientation, color, race, country of origin, and/or immigration status. The harmful content in a harmful meme is often camouflaged and might require critical judgment to establish its potential to do harm. Moreover, the social entities attacked or targeted by harmful memes can be any individual, organization, or community, as opposed to hateful memes, where entities are attacked based on personal attributes.

Dataset
Below, we describe the data collection, the annotation process and the guidelines, and we give detailed statistics about the HarMeme dataset.

Data Collection and Deduplication
To collect potentially harmful memes in the context of COVID-19, we searched using different services, mainly Google Image Search, with keywords such as Wuhan Virus Memes, US Election and COVID Memes, COVID Vaccine Memes, Work From Home Memes, and Trump Not Wearing Mask Memes. We then used a Google Chrome extension to download the memes. We further scraped various publicly available groups on Instagram. Note that, adhering to the terms of service of the platforms, we did not use content from any private/restricted pages. Unlike the Hateful Memes Challenge (Kiela et al., 2020), which used synthetically generated memes, our HarMeme dataset contains original memes that were actually shared on social media. As all memes were gathered from real sources, we applied strict filtering criteria on the resolution of the meme images and on the readability of the meme text during the collection process. We ended up collecting 5,027 memes. However, as we collected memes from independent sources, we had some duplicates. We thus applied two efficient de-duplication tools sequentially, and we preserved the meme with the highest resolution from each group of duplicates. We removed 1,483 duplicate memes, thus ending up with a dataset of 3,544 memes. Although we tried to collect only harmful memes, the dataset contains memes with various levels of harmfulness, which we manually labeled during the annotation process, as discussed in Section 4.3. We further used Google's OCR Vision API to extract the textual content of each meme.
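The de-duplication step can be illustrated with a minimal average-hash sketch. This is a simplification, not the exact tools we used (which implement more robust perceptual hashing); the `meme` dictionaries and the assumption that each image has already been downsampled to a small fixed grayscale grid are ours for illustration:

```python
def average_hash(pixels):
    """Perceptual hash: one bit per pixel, set if the pixel is brighter
    than the mean intensity of the (pre-resized) grayscale grid."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(h1, h2):
    return sum(a != b for a, b in zip(h1, h2))

def deduplicate(memes, max_dist=2):
    """Greedily keep the highest-resolution meme from each group of
    near-duplicates (hash distance <= max_dist)."""
    kept = []
    for meme in sorted(memes, key=lambda m: -m["width"] * m["height"]):
        h = average_hash(meme["pixels"])
        if all(hamming(h, average_hash(k["pixels"])) > max_dist for k in kept):
            kept.append(meme)
    return kept
```

Sorting by resolution first guarantees that, within each duplicate group, the copy that survives is the sharpest one.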

Annotation Guidelines
As discussed in Section 3, we consider a meme as harmful only if it is implicitly or explicitly intended to cause harm to an entity, depending on the personal, political, social, educational or industrial background of that entity. The intended harm can be expressed in an obvious manner such as by abusing, offending, disrespecting, insulting, demeaning, or disregarding the entity or any sociocultural or political ideology, belief, principle, or doctrine associated with that entity. Likewise, the harm can also be in the form of a more subtle attack such as mocking or ridiculing a person or an idea.
We asked the annotators to label the intensity of the harm as very harmful or partially harmful, depending on the context and the ingrained explication of the meme. Moreover, we formally defined four different classes of targets and compiled well-defined guidelines that the annotators adhered to while manually annotating the memes. The four target entities are as follows (cf. Figure 1): 1. Individual: A person, usually a celebrity (e.g., a well-known politician, an actor, an artist, a scientist, or an environmentalist, such as Donald Trump, Joe Biden, Vladimir Putin, Hillary Clinton, Barack Obama, Chuck Norris, Greta Thunberg, or Michelle Obama).

2. Organization: A group of people with a particular purpose, such as a business, a governmental department, a company, an institution, or an association; examples include research organizations (e.g., WTO, Google) and political organizations (e.g., the Democratic Party).

3. Community: A social unit with commonalities based on personal, professional, social, cultural, or political attributes, such as religious views, country of origin, or gender identity. Communities may share a sense of place situated in a given geographical area (e.g., a country, a village, a town, or a neighborhood) or in virtual space through communication platforms (e.g., online forums based on religion, country of origin, or gender).
4. Society: When a meme promotes conspiracies or hate crimes, it becomes harmful to the general public, i.e., to the entire society.
During the process of collection and annotation, we rejected memes based on the following four criteria: (i) the meme text is in a code-mixed or non-English language; (ii) the meme text is not readable (e.g., blurry text, incomplete text, etc.); (iii) the meme is unimodal, containing only textual or visual content; (iv) the meme contains cartoons (we added this last criterion as cartoons can be hard for AI systems to analyze).

Annotation Process
For the annotation process, we had 15 annotators, including professional linguists and researchers in Natural Language Processing (NLP): 10 of them were male and 5 were female, and their ages ranged between 24 and 45 years. We used the PyBossa crowdsourcing framework for our annotations (cf. Figure 3). We split the annotators into five groups of three people, and each group annotated a different subset of the data. Each annotator spent about 8.5 minutes on average to annotate one meme. At first, we trained our annotators on the definition of harmful memes and their targets, along with the annotation guidelines. To achieve quality annotations, our main focus was to make sure that the annotators understood well what harmful content is and how to differentiate it from humorous, satirical, hateful, and non-harmful content.

Dry run. We conducted a dry run on a subset of 200 memes, which helped the annotators understand well the definitions of harmful memes and targets, as well as to eliminate uncertainties about the annotation guidelines. For this preliminary data, we computed the inter-annotator agreement in terms of Cohen's κ (Bobicev and Sokolova, 2017) between three randomly chosen annotators α1, α2, α3 for each meme, for both tasks. The results are shown in Table 1. We can see that the score is low for both tasks (0.295 and 0.373), which is expected for an initial dry run. As the annotation phases progressed, we observed much higher agreement, thus confirming that the dry run helped to train the annotators.
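Pairwise Cohen's κ corrects the observed agreement for the agreement expected by chance. A minimal sketch (the label values in the usage below are illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the two annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

For example, two annotators agreeing on 3 of 4 items with balanced marginals yield κ = 0.5, well below the raw 75% agreement, which is the point of the chance correction.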
Final annotation. After the dry run, we started the final annotation process. Figure 3a shows an example annotation in the PyBossa annotation platform. We asked the annotators to check whether a given meme falls under the four rejection criteria given in the annotation guidelines. After confirming the validity of a meme, three annotators rated it for both tasks.
Consolidation. In the consolidation phase, in cases of high agreement, we used majority voting to decide the final label, and we added a fourth annotator otherwise. Table 2 shows statistics about the labels and the data splits. After the final annotation, Cohen's κ increased to 0.695 and 0.797 for the two tasks, which corresponds to moderate and high agreement, respectively. These scores reflect the difficulty and the variability in gauging harmfulness, even for human experts. For example, we found memes where two annotators independently chose partially harmful, but the third annotator chose very harmful. Figure 4 shows the length distribution of the meme text for both tasks, and Table 3 shows the top-5 most frequent words in the union of the validation and the test sets. We can see that the names of politicians and words related to COVID-19 are frequent in very harmful and partially harmful memes. For the targets of the harmful memes, we notice the presence of various class-specific words such as president, trump, obama, and china. Such words can introduce bias into machine learning models, which makes the dataset more challenging and difficult to learn from (see Section 6.4 for more detail).
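The consolidation logic can be sketched as follows. This is a simplified illustration; the callback standing in for the fourth annotator is hypothetical:

```python
from collections import Counter

def consolidate(votes, extra_annotator=None):
    """Majority vote over three labels; when all three annotators
    disagree, fall back on a fourth annotator's judgment."""
    label, count = Counter(votes).most_common(1)[0]
    if count >= 2:
        return label
    return extra_annotator(votes) if extra_annotator else None
```

With three annotators and three (or four) classes, a two-vote majority always exists unless all three labels differ, which is exactly the case that triggers the fourth annotator.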

Benchmarking the HarMeme Dataset
We provide benchmark evaluations on HarMeme with a variety of state-of-the-art unimodal textual models, unimodal visual models, and multimodal models. Except for the unimodal visual models, we use the MMF (Multimodal Framework) to conduct our experiments.

Multimodal Models
• Late Fusion: This model uses the mean score of pre-trained unimodal ResNet-152 and BERT.
• Concat BERT: It concatenates the features extracted by pre-trained unimodal ResNet-152 and BERT, and uses a simple MLP as the classifier.
• MMBT: Supervised Multimodal Bitransformers (Kiela et al., 2019) is a multimodal architecture that inherently captures the intra-modal and inter-modal dynamics within the input modalities.
• ViLBERT CC: Vision and Language BERT (ViLBERT) (Lu et al., 2019), trained on an intermediate multimodal objective over Conceptual Captions (Sharma et al., 2018), is a strong model with task-agnostic joint representations of image + text.
• Visual BERT COCO: Visual BERT (V-BERT) (Li et al., 2019), pre-trained on the multimodal COCO dataset (Lin et al., 2014), is another strong multimodal model used for a broad range of vision and language tasks.

Table 3: Top-5 most frequent words per class. The tf-idf score of each word is given in parentheses.
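Of the models above, Late Fusion is the simplest to make concrete: it averages the class probabilities of the two unimodal classifiers. A minimal sketch, where the probability vectors stand in for the outputs of BERT and ResNet-152 and the class names are ours for illustration:

```python
def late_fusion(text_probs, image_probs):
    """Average the class probabilities of the text and image classifiers."""
    return [(t + i) / 2 for t, i in zip(text_probs, image_probs)]

def predict(text_probs, image_probs,
            classes=("harmless", "partially harmful", "very harmful")):
    """Fuse the two unimodal distributions and return the argmax class."""
    fused = late_fusion(text_probs, image_probs)
    return classes[max(range(len(fused)), key=fused.__getitem__)]
```

Concat BERT differs in that the fusion happens at the feature level (the two embeddings are concatenated before a jointly trained MLP), rather than at the score level as here.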

Experimental Results
Below, we report the performance of the models described in the previous section for each of the two tasks. We further discuss some biases that negatively impact performance. Appendix A gives additional details about training and the values of the hyper-parameters we used in our experiments.

Evaluation measures
We used six evaluation measures: Accuracy, Precision, Recall, Macro-averaged F1, Mean Absolute Error (MAE), and Macro-Averaged Mean Absolute Error (MMAE) (Baccianella et al., 2009). For the first four measures, higher values are better, while for the last two, lower values are better. Since the test set is imbalanced, measures such as Macro F1 and MMAE are more relevant.

Table 4 shows the results for the harmful meme detection task. We start our experiments by merging the very harmful and the partially harmful classes, thus turning the problem into an easier binary classification. Afterwards, we perform the 3-class classification task. Since the test set is imbalanced, the majority class baseline achieves 64.76% accuracy. We observe that the unimodal visual models perform only marginally better than the majority class baseline, which indicates that they are insufficient to learn the underlying semantics of the memes. Moving down the table, we see that the unimodal text model is marginally better than the visual models. Then, for the multimodal models, the performance improves noticeably, and more sophisticated fusion techniques yield better results. We also notice the effectiveness of multimodal pre-training over unimodal pre-training, which supports recent findings in the literature. While ViLBERT CC and V-BERT COCO perform similarly, the latter achieves better Macro F1 and MMAE, which are the most relevant measures.
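Since MMAE is less common than the other measures, a minimal sketch of Macro F1 and MMAE may help clarify them (class names are illustrative). MMAE treats the labels as an ordinal scale and averages the absolute error per true class before averaging over classes, so it is robust to class imbalance:

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

def mmae(y_true, y_pred, ordered_classes):
    """Macro-averaged MAE (Baccianella et al., 2009): absolute error on
    the ordinal label scale, averaged per true class, then over classes."""
    rank = {c: i for i, c in enumerate(ordered_classes)}
    errors = {c: [] for c in ordered_classes}
    for t, p in zip(y_true, y_pred):
        errors[t].append(abs(rank[t] - rank[p]))
    per_class = [sum(e) / len(e) for e in errors.values() if e]
    return sum(per_class) / len(per_class)
```

Note that MMAE also penalizes mistaking very harmful for harmless (distance 2) more than mistaking it for partially harmful (distance 1), which plain accuracy and F1 do not.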

Table 5 shows the results for the target identification task. This is an imbalanced 4-class classification problem, and the majority class baseline yields 46.60% accuracy. The unimodal models perform relatively better here, achieving 63%-70% accuracy; their Macro F1 and MMAE scores also improve over the majority class baseline. However, the overall performance of the unimodal models remains poor. Incorporating multimodal signals with fine-grained fusion improves the results substantially, and advanced multimodal fusion techniques with multimodal pre-training perform much better than simple late fusion with unimodal pre-training. Moreover, V-BERT COCO outperforms ViLBERT CC by 8% absolute in terms of Macro F1 and by nearly 0.3 in terms of MMAE.

Human Evaluation
To understand how human subjects perceive these tasks, we further hired a different set of experts (not the annotators) to label the test set. We observed 86%-91% accuracy on average for both tasks, which is much higher than for V-BERT, the best-performing model. This shows that there is potential for enriched multimodal models that better understand the ingrained semantics of memes.

Side-by-side Diagnostics and Anecdotes
Since the HarMeme dataset is composed of memes related to COVID-19, we expected that models with enriched contextual knowledge and sophisticated techniques would perform better. Thus, to assess the interpretability of V-BERT (the best model), we used LIME (Locally Interpretable Model-Agnostic Explanations) (Ribeiro et al., 2016), a consistent model-agnostic explainer, to interpret its predictions.
We chose two memes from the test set to analyze the potential explainability of V-BERT. The first meme, shown in Figure 5a, was manually labeled as very harmful, and V-BERT classified it correctly, with prediction probabilities of 0.651, 0.260, and 0.089 for the very harmful, partially harmful, and harmless classes, respectively. Figure 5b highlights the super-pixels contributing most to the very harmful class (in green). As expected, the face of Donald Trump, as highlighted by the green pixels, contributed prominently to the prediction. Figure 5c shows the contribution of individual meme words to the model's prediction. We can see that words such as CORONA and MASK contribute significantly to the very harmful class, thus supporting the lexical analysis of HarMeme shown in Table 3.
The second meme, shown in Figure 5d, was manually labeled as harmless, but V-BERT incorrectly predicted it to be very harmful. Figure 5e shows that, similarly to the previous example, the face of Donald Trump contributed to the model's prediction. We looked closer into our dataset and found that it contains many memes with the image of Donald Trump, and that the majority of these memes fall under the very harmful category and target an individual. Therefore, instead of learning the underlying semantics of each particular meme, the model got biased by the presence of Donald Trump's image and blindly classified the meme as very harmful.
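The word-level part of such an explanation can be approximated with a much simpler, LIME-style perturbation test: mask each token in turn and measure the drop in the target-class probability. This is a simplification of LIME proper (which fits a local linear surrogate over many random perturbations), and the stand-in classifier in the usage below is hypothetical:

```python
def word_importance(tokens, prob_fn, target_class):
    """Leave-one-out importance: the drop in the target-class
    probability when a single token is replaced by a mask symbol."""
    base = prob_fn(tokens)[target_class]
    scores = {}
    for i, tok in enumerate(tokens):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        scores[tok] = base - prob_fn(masked)[target_class]
    return scores
```

A token whose masking barely changes the prediction (score near zero) is unimportant to the model; a large positive score marks a token the model relies on, mirroring the per-word contributions LIME visualizes in Figure 5c.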

Conclusion and Future Work
We presented HarMeme, the first large-scale benchmark dataset, containing 3,544 memes related to COVID-19, with annotations for the degree of harmfulness (very harmful, partially harmful, or harmless), as well as for the target of the harm (an individual, an organization, a community, or society). The evaluation results using several unimodal and multimodal models highlighted the importance of modeling the multimodal signal for both tasks, (i) detecting harmful memes and (ii) detecting their targets, and indicated the need for more sophisticated methods. We also analyzed the best model and identified its limitations.
In future work, we plan to design new multimodal models and to extend HarMeme with examples from other topics, as well as to other languages. Alleviating the biases in the dataset and in the models is another important research direction.

Ethics and Broader Impact
User Privacy. Our dataset only includes memes and it does not contain any user information.
Biases. Any biases found in the dataset are unintentional, and we do not intend to do harm to any group or individual. We note that determining whether a meme is harmful can be subjective, and thus it is inevitable that there would be biases in our gold-labeled data or in the label distribution. We address these concerns by collecting examples using general keywords about COVID-19, and also by following a well-defined schema, which sets explicit definitions during annotation. Our high inter-annotator agreement makes us confident that the assignment of the schema to the data is correct most of the time.
Misuse Potential. We ask researchers to be aware that our dataset can be maliciously used to unfairly moderate memes based on biases that may or may not be related to demographics and other information within the text. Intervention with human moderation would be required in order to ensure that this does not occur.
Intended Use. We present our dataset to encourage research in studying harmful memes on the web. We distribute the dataset for research purposes only, without a license for commercial use. We believe that it represents a useful resource when used in the appropriate manner.
Environmental Impact. Finally, we would also like to note that the use of large-scale Transformers requires a lot of computation and the use of GPUs/TPUs for training, which contributes to global warming (Strubell et al., 2019). This is a bit less of an issue in our case, as we do not train such models from scratch; rather, we fine-tune them on relatively small datasets. Moreover, running inference on a CPU, once the model has been fine-tuned, is perfectly feasible, and CPUs contribute much less to global warming.

Figure B.1: Examples of memes that we rejected during the process of data collection and annotation.