DISARM: Detecting the Victims Targeted by Harmful Memes

Internet memes have emerged as an increasingly popular means of communication on the Web. Although typically intended to elicit humour, they have been increasingly used to spread hatred, trolling, and cyberbullying, as well as to target specific individuals, communities, or society on political, socio-cultural, and psychological grounds. While previous work has focused on detecting harmful, hateful, and offensive memes, identifying whom they attack remains a challenging and underexplored area. Here we aim to bridge this gap. In particular, we create a dataset where we annotate each meme with its victim(s) such as the name of the targeted person(s), organization(s), and community(ies). We then propose DISARM (Detecting vIctimS targeted by hARmful Memes), a framework that uses named entity recognition and person identification to detect all entities a meme is referring to, and then, incorporates a novel contextualized multimodal deep neural network to classify whether the meme intends to harm these entities. We perform several systematic experiments on three test setups, corresponding to entities that are (a) all seen while training, (b) not seen as a harmful target on training, and (c) not seen at all on training. The evaluation results show that DISARM significantly outperforms ten unimodal and multimodal systems. Finally, we show that DISARM is interpretable and comparatively more generalizable and that it can reduce the relative error rate for harmful target identification by up to 9 points absolute over several strong multimodal rivals.


Introduction
Social media offer the freedom and the means to express deeply ingrained sentiments, which can be done using diverse and multimodal content such as memes. Besides being popularly used to express benign humour, Internet memes have also been misused to incite extreme reactions, hatred, and to spread disinformation on a massive scale. Numerous recent efforts have attempted to characterize harmfulness (Pramanick et al., 2021b), hate speech (Kiela et al., 2020), and offensiveness (Suryawanshi et al., 2020) within memes. Most of these efforts have been directed towards detecting malicious influence within memes, but there has been little work on identifying whom the memes target. Besides detecting whether a meme is harmful, it is often important to know whether the meme contains an entity that is particularly targeted in a harmful way. This is the task we are addressing here: detecting the entities that a meme targets in a harmful way.
Harmful targeting in memes is often done using satire, sarcasm, or humour in an explicit or an implicit way, aiming at attacking an individual, an organization, a community, or society in general. For example, Fig. 1a depicts Justin Trudeau, the Prime Minister of Canada, as communally biased against Canadians, while favoring alleged killings by Muslims, whereas Fig. 1b shows an arguably benign meme of the same person expressing subtle humour. Essentially, the meme in Fig. 1a harmfully targets Justin Trudeau directly, while causing indirect harm to Canadians and to Muslims as well. Note that in many cases interpreting memes and their harmful intent requires some additional background knowledge for the meme to be understood properly.
Hence, an automated system for detecting the entities targeted by harmful memes faces three major challenges: (i) insufficient background context, (ii) complexity posed by the implicit harm, and (iii) keyword bias in a supervised setting.
To address these challenges, we formulate harmful target detection in memes as an open-ended task, where a meme can target an entity not seen during training. An end-to-end solution requires (i) identifying the entities referred to in the meme, and (ii) deciding whether each of these entities is being targeted in a harmful way. To address these two sub-tasks, we systematically contextualize the multimodal information presented within the meme. We first perform intra-modal fusion between an external knowledge-based contextualized-entity representation and the textually-embedded harmfulness in the meme. We then perform cross-modal fusion of the contextualized textual and visual modalities using low-rank bi-linear pooling, which results in an enriched multimodal representation. We evaluate our model using three-level stress-testing to better assess its generalizability to unseen targets.
We create a dataset, and we propose an experimental setup and a model to address the aforementioned requirements, making the following contributions:
1. We introduce the novel task of detecting the entities targeted by harmful memes.
2. We create a new dataset for this new task, Ext-Harm-P, by extending Harm-P (Pramanick et al., 2021b) via re-annotating each harmful meme with the entity it targets.
3. We propose DISARM, a novel multimodal neural architecture that uses an expressive contextualized representation for detecting harmful targeting in memes.
4. We empirically showcase that DISARM outperforms ten unimodal and multimodal models by several points absolute in terms of macro-F1 scores in three different evaluation setups.

Related Work
Misconduct on Social Media. The rise in misconduct on social media is a prominent research topic. Some forms of online misconduct include rumours (Zhou et al., 2019), fake news (Aldwairi and Alwahedi, 2018; Shu et al., 2017; Nguyen et al., 2020), misinformation (Ribeiro et al., 2021; Shaar et al., 2022), disinformation (Alam et al., 2021; Hardalov et al., 2022), hate speech (MacAvaney et al., 2019; Zhang and Luo, 2019; Zampieri et al., 2020), trolling (Cook et al., 2018), and cyber-bullying (Kowalski et al., 2014; Kim et al., 2021). Some notable work in this direction includes stance (Graells-Garrido et al., 2020) and rumour veracity prediction in a multi-task learning framework (Kumar and Carley, 2019), wherein the authors proposed a Tree LSTM for characterizing online conversations. Wu and Liu (2018) explored user and social network representations for classifying a message as genuine vs. fake. Cheng et al. (2017) studied users' mood along with the online contextual discourse and demonstrated that it helps for trolling behaviour prediction on top of users' behavioural history. Relia et al. (2019) studied the synergy between discrimination based on race, ethnicity, and national origin in the physical and in the virtual space.
Kiela et al. (2020) introduced the task of classifying a meme as hateful vs. non-hateful. Different approaches such as feature augmentation, attention mechanism, and multimodal loss re-weighting were attempted (Das et al., 2020; Sandulescu, 2020; Zhou et al., 2021; Lippe et al., 2020). Web-entity detection along with fair face classification (Karkkainen and Joo, 2021) and semi-supervised learning-based classification (Zhong, 2020) were also used for the hateful meme classification task. Other noteworthy research includes using implicit models, e.g., topic modelling and multimodal cues, for detecting offensive analogy (Shang et al., 2021b) and hateful discrimination (Mittos et al., 2020) in memes. Wang et al. (2021) argued that online attention can be garnered immensely via fauxtography, which could eventually evolve towards turning into memes that potentially go viral. To support research on these topics, several datasets for offensiveness, hate speech, and harmfulness detection have been created (Suryawanshi et al., 2020; Kiela et al., 2020; Pramanick et al., 2021a,b; Gomez et al., 2020; Dimitrov et al., 2021; Sharma et al., 2022).
Most of the above studies attempted to address classification tasks in a constrained setting. However, to the best of our knowledge, none of them addressed the task of detecting the specific entities that are being targeted. Here, we aim to bridge this gap with a focus on detecting the specific entities targeted by a given harmful meme.

Dataset
The Harm-P dataset (Pramanick et al., 2021b) consists of 3,552 memes about US politics. Each meme is annotated with its harmful label and the social entity that it targets. The targeted entities are coarsely classified into four social groups: individual, organization, community, and the general public. While these coarse classes provide an abstract view of the targets, identifying the specific targeted person, organization, or community in a fine-grained fashion is also crucial, and this is our focus here. All the memes in this dataset broadly pertain to the US Politics domain, and they target well-known personalities or organizations. To this end, we manually re-annotated the memes in this dataset with the specific people, organizations, and communities that they target.
Extending Harm-P (Ext-Harm-P). Towards generalizability, we extend Harm-P by redesigning the existing data splits as shown in Table 1. We call the resulting dataset Ext-Harm-P. It contains a total of 4,446 examples, including 1,594 harmful and 2,852 non-harmful ones; both categories have references to a number of entities. For training, we use the harmful memes provided as part of the original dataset (Pramanick et al., 2021b), which we re-annotate for the fine-grained entities that are being targeted harmfully as positive samples (harmful targets). This is matched with twice as many negative samples (not-harmful targets). For negative targets, we use the top-2 entities from the original entity lexicon, which are not labeled for harmfulness and have the highest lexical similarity with the meme text (Ferreira et al., 2016). This at least ensures lexical similarity with the entities referenced within a meme, thereby facilitating a confounding effect (Kiela et al., 2020) as well. For the test set, all the entities are first extracted automatically using named entity recognition (NER) and person identification (PID). This is followed by manual annotation of the test set.
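The negative-target selection described above can be sketched in a few lines. This is only an illustration: the similarity measure here is plain Jaccard token overlap, which is an assumption standing in for the lexical similarity of Ferreira et al. (2016), and the function names and the toy lexicon are hypothetical rather than taken from the released code or from Ext-Harm-P.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap between token sets (assumed stand-in for lexical similarity)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def pick_negative_targets(meme_text, lexicon, harmful_targets, k=2):
    """Return the k lexicon entities closest to the meme text that are not harmful targets."""
    harmful = {h.lower() for h in harmful_targets}
    candidates = [e for e in lexicon if e.lower() not in harmful]
    # sorted() is stable, so entities with equal similarity keep their lexicon order.
    ranked = sorted(candidates, key=lambda e: jaccard(meme_text, e), reverse=True)
    return ranked[:k]


# Toy data, made up for illustration only.
lexicon = ["Democratic Party", "Republican Party", "Joe Biden", "Donald Trump"]
meme_text = "the democratic party and joe biden said no"
negatives = pick_negative_targets(meme_text, lexicon, harmful_targets=["Donald Trump"])
print(negatives)
```

With k=2, each harmful (positive) target is paired with two negatives, which matches the 1:2 ratio of positive to negative training samples stated above.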
Dataset Annotation Process. Since assessing the harmfulness of memes is a highly subjective task, our annotators were requested to follow four key steps when annotating each meme, aiming to ensure label consistency. The example in Fig. 2 demonstrates the steps taken while annotating: we first identify the candidate entities, and then we decide whether a given entity is targeted in a harmful way. We asked our annotators to do the following (additional details about the annotation process are given in Appendix D):
We had three annotators and a consolidator. The inter-annotator agreement before consolidation had a Fleiss Kappa of 0.48 (moderate agreement), and after consolidation it increased to 0.64 (substantial agreement).
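For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of the standard Fleiss' kappa computation. The rating table is made up for illustration (three annotators, two labels) and is not the annotation data behind the reported 0.48 and 0.64 values.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number of
    annotators who assigned item i to category j.
    Every item must be rated by the same number of annotators."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Observed agreement for each item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the marginal category proportions.
    p_cats = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cats)
    # Undefined when p_e == 1 (all ratings fall in a single category).
    return (p_bar - p_e) / (1 - p_e)


# Columns: [harmful, not harmful]; rows: four hypothetical (meme, entity) pairs.
toy_ratings = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(toy_ratings)
print(round(kappa, 3))
```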
Analyzing Harmful Targeting in Memes. The memes in Ext-Harm-P are about US Politics, and thus they prominently feature entities such as Joe Biden and Donald Trump, both harmfully and harmlessly. The ratio between these types of referencing varies across individuals, organizations, and communities. We can see in Fig. 3 that the top-5 harmfully referenced individuals and organizations are subjected to relatively more harm (normalized by the number of occurrences of these entities in memes). However, the stacked plots for the top-5 harmfully targeted communities (Mexicans, Black, Muslim, Islam, and Russian) in Fig. 3c show relatively less harm targeting these communities.

Proposed Approach
Our proposed model DISARM, as depicted in Fig. 4, is based on a fusion of the textual and the visual modalities, explicitly enriched via contextualised representations by leveraging CLIP (Radford et al., 2021). We chose CLIP as the encoder module for contextualization due to its impressive zero-shot multimodal embedding capabilities. At first, valid entities are extracted automatically, as part of the process of creating the training/validation sets. Then, for each meme, we first obtain the contextualized-entity (CE) representation by fusing the CLIP-encoded context and the entity representation. CE is then fused with the BERT-based (Devlin et al., 2019) embedded-harmfulness (EH) encoding, fine-tuned on the OCR-extracted text and entities as inputs. We call the resulting fusion output a contextualized-text (CT) representation. CT is then fused with the contextualized-image (CI) representation, obtained using the CLIP encoder for the image. We henceforth refer to the resulting enriched representation as the contextualized multimodal (CMM) representation. We modify the multimodal low-rank bi-linear pooling (Kim et al., 2017) to fuse the input representations into a joint space.
This approach, as can be seen in the subsequent sections, not only captures complex cross-modal interactions, but also provides an efficient fusion mechanism towards obtaining a context-enriched representation. Finally, we use this representation to train a classifier for our task. We describe each module in detail below.
Low-rank Bi-linear Pooling (LRBP). We begin by revisiting low-rank bi-linear pooling to set the necessary background. Due to the many parameters in bi-linear models, Pirsiavash et al. (2009) suggested a low-rank bi-linear (LRB) approach to reduce the rank of the weight matrix $W_i$. Consequently, the number of parameters, and hence the complexity, are reduced. The weight matrix $W_i$ is re-written as $W_i = U_i V_i^T$, where $U_i \in \mathbb{R}^{N \times d}$ and $V_i \in \mathbb{R}^{M \times d}$, effectively putting an upper bound of $\min(N, M)$ on the value of $d$. Therefore, the low-rank bi-linear models can be expressed as follows:

$$f_i = x^T W_i y = x^T U_i V_i^T y = \mathbf{1}^T (U_i^T x \circ V_i^T y) \quad (1)$$

where $\mathbf{1} \in \mathbb{R}^d$ is a column vector of ones, and $\circ$ is the Hadamard product. $f_i$ in Equation (1) can be further re-written to obtain $f$ as follows:

$$f = P^T (U^T x \circ V^T y) + b \quad (2)$$

where $f \in \{f_i\}$, $P \in \mathbb{R}^{d \times c}$, $b \in \mathbb{R}^c$, $d$ is an output, and $c$ is an LRB hyper-parameter. We further introduce a non-linear activation formulation for LRBP, following Kim et al. (2017), who argued that non-linearity both before and after the Hadamard product complicates the gradient computation. This addition to Equation (2) can be represented as follows:
We slightly modify the multimodal low-rank bi-linear pooling (MMLRBP). Instead of directly projecting the input $x \in \mathbb{R}^N$ and $y \in \mathbb{R}^M$ into a lower dimension $d$, we first project the input modalities in a joint space $N$. We then perform LRBP as expressed in Equation (3), by using jointly embedded representations $x_{mm} \in \mathbb{R}^{N \times d}$ and $y_{mm} \in \mathbb{R}^{N \times d}$ to obtain a multimodal fused representation $f_{mm}$, as expressed below:
Structured Context. Towards modelling auxiliary knowledge, we curate contexts for the memes in Ext-Harm-P. First, we use the meme text as a search query to retrieve relevant contexts, using the title and the first paragraph of the resulting top document as a context, which we call $con$.
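Returning to the low-rank factorization in the LRBP paragraph above, the sketch below checks numerically that the bi-linear form with the full matrix U V^T equals the sum over the Hadamard product of the two projections, and then shows one possible pooled output. It uses only the standard library and toy sizes. The choice of tanh, and applying it after the Hadamard product, is an assumption made here for illustration; the helper names are hypothetical and this is not the authors' implementation.

```python
import math
import random


def matvec_t(mat, vec):
    """Compute mat^T @ vec for a row-major matrix with len(vec) rows."""
    cols = len(mat[0])
    return [sum(mat[r][c] * vec[r] for r in range(len(vec))) for c in range(cols)]


def full_bilinear(x, u, v, y):
    """x^T (U V^T) y, computed with the explicit N x M weight entries."""
    d = len(u[0])
    total = 0.0
    for n in range(len(x)):
        for m in range(len(y)):
            w_nm = sum(u[n][k] * v[m][k] for k in range(d))
            total += x[n] * w_nm * y[m]
    return total


def low_rank_bilinear(x, u, v, y):
    """1^T (U^T x o V^T y): the same scalar without forming the N x M matrix."""
    ux, vy = matvec_t(u, x), matvec_t(v, y)
    return sum(a * b for a, b in zip(ux, vy))


def lrbp(x, y, u, v, p, b, act=math.tanh):
    """P^T act(U^T x o V^T y) + b; activation type and placement are assumptions."""
    ux, vy = matvec_t(u, x), matvec_t(v, y)
    hidden = [act(a * c) for a, c in zip(ux, vy)]
    return [z + bias for z, bias in zip(matvec_t(p, hidden), b)]


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
N, M, d, c = 6, 5, 3, 4  # toy sizes with d <= min(N, M)
x = [random.uniform(-1, 1) for _ in range(N)]
y = [random.uniform(-1, 1) for _ in range(M)]
U, V, P = rand_matrix(N, d), rand_matrix(M, d), rand_matrix(d, c)
b = [0.0] * c

gap = abs(full_bilinear(x, U, V, y) - low_rank_bilinear(x, U, V, y))
fused = lrbp(x, y, U, V, P, b)
print(gap < 1e-9, len(fused))
```

The factored form stores (N + M) * d weights per output instead of N * M, which is the parameter saving the paragraph refers to.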

Contextualized-entity Representation (CE).
Towards modelling the context-enriched entity, we first obtain the embedding of the input entity $ent$. Since we have a finite set of entities referenced in the memes in our training dataset, we perform a lookup in the embedding matrix from $\mathbb{R}^{V \times H}$ to obtain the corresponding entity embedding $ent \in \mathbb{R}^H$, with $H = 300$ being the embedding dimension and $V$ the vocabulary size. We train the embedding matrix from scratch as part of the overall training of our model. We project the obtained entity representation $ent$ into a 512-dimensional space, which we call $e$. To augment a given entity with relevant contextual information, we fuse it with a contextual representation $c \in \mathbb{R}^{512}$ obtained by encoding the associated context ($con$) using CLIP. We perform this fusion using our adaptation of the multimodal low-rank bi-linear pooling as defined by Equation (4). This yields the following contextualized-entity (CE) representation $c_{ent}$:

Contextualized-Text (CT) Representation.
Once we obtain the contextualized-entity embedding $c_{ent}$, we concatenate it with the BERT encoding for the combined representation of the OCR-extracted text and the entity ($o_{ent} \in \mathbb{R}^{768}$). We call this encoding an embedded-harmfulness (EH) representation. The concatenated representation from $\mathbb{R}^{1280}$ is then projected non-linearly into a lower dimension using a dense layer of size 512. We call the resulting vector $c_{txt}$ a contextualized-text (CT) representation: where $W \in \mathbb{R}^{1280 \times 512}$.

Contextualized Multimodal (CMM) Representation.
Once we obtain the contextualized-text representation $c_{txt} \in \mathbb{R}^{512}$, we again perform multimodal low-rank bi-linear pooling using Equation (4) to fuse it with the contextualized-image representation $c_{img} \in \mathbb{R}^{512}$, obtained using the CLIP image-encoder. The operation is expressed as follows: where $c_{mm} \in \mathbb{R}^{512}$, $P_2 \in \mathbb{R}^{256 \times 512}$, $U_2 \in \mathbb{R}^{512 \times 256}$, and $V_2 \in \mathbb{R}^{512 \times 256}$. Notably, we learn two different projection matrices $P_1$ and $P_2$ for the two fusion operations performed as part of Equations (5) and (7), respectively, since the fused representations at the respective steps are obtained using different modality-specific interactions.
Classification Head. Towards modelling the binary classification for a given meme and a corresponding entity as either harmful or non-harmful, we use a shallow multi-layer perceptron with a single dense layer of size 256, which represents a condensed representation for classification. We finally map this layer to a single dimension output via a sigmoid activation. We use binary cross-entropy for the back-propagated loss.
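As a compact recap of the sizes stated in this section and of the loss used by the classification head, the sketch below records the representation dimensions and evaluates the sigmoid and binary cross-entropy for a single (meme, entity) pair. The dictionary keys and function names are hypothetical labels chosen here; only the numeric sizes come from the text.

```python
import math

# Representation sizes as stated in the text (key names are illustrative).
DIMS = {
    "c_ent": 512,   # contextualized-entity (CE)
    "o_ent": 768,   # BERT-based embedded-harmfulness (EH)
    "c_txt": 512,   # contextualized-text (CT), after the dense projection
    "c_img": 512,   # contextualized-image (CI) from the CLIP image encoder
    "c_mm": 512,    # contextualized multimodal (CMM)
    "hidden": 256,  # single dense layer in the classification head
    "out": 1,       # sigmoid output
}
concat_dim = DIMS["c_ent"] + DIMS["o_ent"]  # input size of the CT projection


def sigmoid(z):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for one prediction; label is 1 for harmful, 0 otherwise."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


# A confident correct prediction costs less than a confident wrong one.
loss_good = bce(sigmoid(2.0), 1)
loss_bad = bce(sigmoid(-2.0), 1)
print(concat_dim, loss_good < loss_bad)
```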

Experiments
We experiment with various unimodal (image/text-only) and multimodal models, including models pre-trained on multimodal datasets such as MS COCO (Lin et al., 2014) and CC (Sharma et al., 2018). We train DISARM and all unimodal baselines using PyTorch, while for the multimodal baselines, we use the MMF framework.

Evaluation Measures
For evaluation, we use the commonly used macro-average versions of accuracy, precision, recall, and F1 score. In particular, we discuss the harmful class recall, which is relevant for our study as it characterizes the model's performance at detecting harmfully targeting memes. All results we report are averaged over five independent runs.
Evaluation Strategy. With the aim of having a realistic setting, we pose our evaluation strategy as an open-class one. We train all systems using under-sampling of the entities that were not targeted in a harmful way: using all positive (harmful) examples and twice as many negative (non-harmful) ones. We then perform open-class testing for all referenced entities (some possibly unseen during training) per meme, effectively making the evaluation more realistic. To this end, we formulate three testing scenarios as follows, with their Harmful (H) and Non-harmful (N) counts:
Baseline Models. Our baselines include both unimodal and multimodal models as follows:
- Unimodal Systems: VGG16, VIT: For the unimodal (image-only) systems, we use two well-known models: VGG16 (Simonyan and Zisserman, 2015) and VIT (Vision Transformers), which emulates a Transformer-based application jointly over textual tokens and image patches (Dosovitskiy et al., 2021). GRU, XLNet: For the unimodal (text-only) systems, we use GRU (Cho et al., 2014), which adaptively captures temporal dependencies, and XLNet (Yang et al., 2019), which implements a generalized auto-regressive pre-training strategy.
- Multimodal Systems: MMF Transformer: This is a multimodal Transformer model that uses visual and language tokens with self-attention. MMBT: Multimodal Bitransformer (Kiela et al., 2019) captures the intra-modal and the inter-modal dynamics.
Experimental Results. We compare the performance of several unimodal and multimodal systems (pre-trained or trained from scratch) vs. DISARM and its variants. All systems are evaluated using the 3-way testing strategy described above. We then perform ablation studies on representations that use the contextualized-entity, its fusion with embedded-harmfulness resulting in contextualized-text, and the final fusion with contextualized-image yielding the contextualized-multimodal modules of DISARM (see Appendix B for a detailed ablation study). We use the abbreviations CE, CT, CI, CMM, EH, and MMLRBP for the contextualized representations of the entity, the text, the image, the multimodal representation, the embedded-harmfulness, and the multimodal low-rank bi-linear pooling, respectively. This is followed by interpretability analysis. Finally, we discuss the limitations of DISARM by performing error analysis (details in Appendix C).
All Entities Seen During Training: In our unimodal text-only baseline experiments, the GRU-based system yields a relatively lower harmful recall of 0.74 compared to XLNet's 0.82, but a better overall F1 score of 0.75 vs. 0.67 for XLNet, as shown in Table 2. The lower harmful precision of 0.65 and the not-harmful recall of 0.52 contribute to the lower F1 score for XLNet.
Among the image-only unimodal systems, VGG performs better with a non-harmful recall of 0.81, but its poor performance for detecting harmful memes yields a lower harmful recall of 0.68. At the same time, VIT has a relatively better harmful recall of 0.74. Overall, the unimodal results (see Table 2) indicate the efficacy of self-attention over convolution for images and RNN (GRU) sequence modeling for text.
Multimodally pre-trained models such as VisualBERT and ViLBERT yield moderate F1 scores of 0.70 and 0.68, and harmful recall of 0.78 and 0.77, respectively (see Table 2). Fresh training facilitates more meaningful results in favour of non-harmful precision of 0.78 for both models, and harmful recall of 0.84 and 0.82 for VisualBERT and ViLBERT, respectively. Overall, ViLBERT yields the most balanced performance of 0.75 in terms of F1 score. It can be inferred from these results (see Table 2) that multimodal pre-training leverages domain relevance.
We can see in Table 2 that multimodal low-rank bi-linear pooling distinctly enhances the performance in terms of F1 score. The improvements can be attributed to the fusion of the CE and EH representations, respectively, with CI, instead of a simple concatenation. This is more prominent for CE with an F1 score of 0.78, which shows the importance of modeling the background context. Finally, DISARM yields a balanced F1 score of 0.78, with a reasonable precision of 0.74 for the non-harmful category, and the best recall of 0.86 for the harmful category.
All Entities Unseen as Harmful Targets During Training: With Test Set B, the evaluation is slightly more challenging in terms of the entities to be assessed, as these were never seen at training time as harmful. Unimodal systems perform poorly on the harmful class, with the exception of XLNet (see Table 2), where the harmful class recall is 0.56. For the multimodal baselines, systems pre-trained using COCO (VisualBERT) and CC (ViLBERT) yield a moderate recall of 0.64 and 0.71 for the harmful class in contrast to what we saw for Test Set A in Table 2. This could be due to additional common-sense reasoning helping such systems, on a test set that is more open-ended compared to Test Set A. Their non-pre-trained versions along with the MM Transformer and MMBT achieve better F1 scores, but with low harmful recall.
Multimodal fusion using MMLRBP improves the harmful class recall for CE to 0.52, but yields lower values of 0.37 for EH fusion with CI (see Table 2). This reconfirms the utility of the context. In comparison, DISARM yields a balanced F1 score of 0.65 with the best precision of 0.83 and 0.38, along with decent recall of 0.79 and 0.69 for non-harmful and harmful memes, respectively.
All Entities Unseen During Training: The results decline in this scenario (similarly to Test Set B), except for the harmful class recall of 0.62 for XLNet, as shown in Table 3. In the current scenario (Test Set C), none of the entities being assessed at testing is seen during the training phase. For multimodal baselines, we see a similar trend for VisualBERT (COCO) and ViLBERT (CC), with the harmful class recall of 0.72 for ViLBERT (CC) being significantly better than the 0.12 for VisualBERT (COCO). This again emphasizes the need for the affinity between the pre-training dataset and the downstream task at hand. In general, the precision for the harmful class is very low.
We observe (see Table 3) a sizable boost for the harmful class recall for MMLRBP-based multimodal fusion of CI with CE (0.69), against a decrease with EH (0.31). In comparison, DISARM yields a low, yet the best harmful precision of 0.36, and a moderate recall of 0.70 (see Table 3). Moreover, besides yielding reasonable precision and recall of 0.86 and 0.76 for the non-harmful class, DISARM achieves better average precision, recall, and F1 scores of 0.61, 0.73, and 0.64, respectively.
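The macro-averaged figures quoted above can be checked directly from the per-class precision and recall. The sketch below assumes that macro-F1 is the unweighted mean of the per-class F1 scores (an assumption about the averaging convention, though it reproduces the reported values); the function names are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_scores(per_class):
    """per_class maps a class name to (precision, recall).
    Returns the unweighted class means of precision, recall, and F1."""
    n = len(per_class)
    macro_p = sum(p for p, _ in per_class.values()) / n
    macro_r = sum(r for _, r in per_class.values()) / n
    macro_f1 = sum(f1(p, r) for p, r in per_class.values()) / n
    return macro_p, macro_r, macro_f1


# Per-class scores for DISARM as reported in the text.
test_b = {"non-harmful": (0.83, 0.79), "harmful": (0.38, 0.69)}
test_c = {"non-harmful": (0.86, 0.76), "harmful": (0.36, 0.70)}

p_c, r_c, f_c = macro_scores(test_c)
_, _, f_b = macro_scores(test_b)
print(round(p_c, 2), round(r_c, 2), round(f_c, 2))  # 0.61 0.73 0.64
print(round(f_b, 2))  # 0.65
```

Both results agree with the macro-F1 scores of 0.64 (Test Set C) and 0.65 (Test Set B) reported for DISARM.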
Generalizability of DISARM. The generalizability of DISARM follows from its characteristic modelling and context-based fusion. DISARM demonstrates an ability to detect harmful targeting for a diverse set of entities. Specifically, the three-way testing setup inherently captures the efficacy with which DISARM can detect unseen harmful targets. The prediction for entities completely unseen on training yields better results (see Tables 2 and 3), and suggests possibly induced bias in the former scenario. Moreover, it is a direct consequence of the fact that we were able to incorporate only a limited set of the 246 potential targets. Overall, we argue that DISARM generalizes well for unseen entities with 0.65 and 0.64 macro-F1 scores, as compared to ViLBERT's 0.58 and MMBT's 0.51, for Test Sets B and C, respectively.
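One way to picture the three-way testing setup is as a bucketing rule over test entities. The sketch below is an entity-level reading of the three scenarios, which is an assumption: the actual test sets may be built at the meme level, and the function name, labels, and toy entity sets are hypothetical.

```python
def scenario(entity, seen_entities, seen_harmful_entities):
    """Assign a test entity to scenario A, B, or C (entity-level reading, assumed)."""
    if entity not in seen_entities:
        return "C"  # not seen at all during training
    if entity not in seen_harmful_entities:
        return "B"  # seen during training, but never as a harmful target
    return "A"  # seen during training, including as a harmful target


# Made-up training statistics, for illustration only.
seen = {"Joe Biden", "Donald Trump", "Democratic Party"}
seen_harmful = {"Joe Biden", "Donald Trump"}
labels = {
    e: scenario(e, seen, seen_harmful)
    for e in ["Joe Biden", "Democratic Party", "Green Party"]
}
print(labels)
```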

Comparative Diagnosis. Despite the marginally better harmful recall for ViLBERT (CC) on Test Set B (see Table 2) and Test Set C (see Table 3), the overall balanced performance of DISARM appears to be reasonably justified based on the comparative interpretability analysis between the attention maps for the two systems. Fig. 5 shows the attention maps for an example meme. It depicts a meme that is correctly predicted to harmfully target the Democratic Party by DISARM and incorrectly by ViLBERT. As visualised in Fig. 5a, the harmfully-inclined word killing effectively attends not only to baby, but also to Democrats and racist. The relevance is depicted via different color schemes and intensities, respectively. Interestingly, killing also attends to the Democratic Party, both as part of the OCR-extracted text and the target-candidate, jointly encoded by BERT. The multimodal attention leveraged by DISARM is depicted (via the CLIP encoder) in Fig. 5b, demonstrating the utility of contextualised attention over the male figure that represents an attack on the Democratic Party. Also, DISARM has a relatively focused field of vision, as shown in Fig. 5c, as compared to a relatively scattered one for ViLBERT (see Fig. 5d). This suggests a better multimodal modelling capacity for DISARM as compared to ViLBERT.

Conclusion and Future Work
We introduced the novel task of detecting the targeted entities within harmful memes and we highlighted the inherent challenges involved. Towards addressing this open-ended task, we extended Harm-P with target entities for each harmful meme. We then proposed a novel multimodal deep neural framework, called DISARM, which uses an adaptation of multimodal low-rank bi-linear pooling-based fusion strategy at different levels of representation abstraction. We showed that DISARM outperforms various uni/multi-modal baselines in three different scenarios by 4%, 7%, and 13% increments in terms of macro-F1 score, respectively. Moreover, DISARM achieved a relative error rate reduction of 9% over the best baseline. We further emphasized the utility of different components of DISARM through ablation studies. We also elaborated on the generalizability of DISARM, thus confirming its modelling superiority over ViLBERT via interpretability analysis. We finally analysed the shortcomings in DISARM that lead to incorrect harmful target predictions.
In the present work, we made an attempt to elicit some inherent challenges pertaining to the task at hand: augmenting the relevant context, effectively fusing multiple modalities, and pretraining. Yet, we also leave a lot of space for future research for this novel task formulation.

Ethics and Broader Impact
Reproducibility. We present detailed hyperparameter configurations in Appendix A and Table 4. The source code and the dataset Ext-Harm-P are available at https://github.com/LCS2-IIITD/DISARM.
User Privacy. The information depicted/used does not include any personal information. Copyright aspects are attributed to the dataset source.
Annotation. The annotation was conducted by NLP experts or linguists in India, who were fairly treated and were duly compensated. We conducted several discussion sessions to make sure all annotators could understand the distinction between harmful vs. non-harmful referencing.
Biases. Any biases found in the dataset are unintentional, and we do not intend to cause harm to any group or individual. We acknowledge that detecting harmfulness can be subjective, and thus it is inevitable that there would be biases in our gold-labelled data or in the label distribution. This is addressed by working on a dataset that is created using general keywords about US Politics, and also by following a well-defined schema, which sets explicit definitions for annotation.
Misuse Potential. Our dataset can be potentially used for ill-intended purposes, such as biased targeting of individuals/communities/organizations, etc. that may or may not be related to demographics and other information within the text. Intervention with human moderation would be required to ensure that this does not occur.
Intended Use. We make use of the existing dataset in our work in line with the intended usage prescribed by its creators and solely for research purposes. This applies in its entirety to its further usage as well. We commit to releasing our dataset aiming to encourage research in studying harmful targeting in memes on the web. We distribute the dataset for research purposes only, without a license for commercial use. We believe that it represents a useful resource when used appropriately.
Environmental Impact. Finally, large-scale models require a lot of computations, which contribute to global warming (Strubell et al., 2019). However, in our case, we do not train such models from scratch; rather, we fine-tune them on a relatively small dataset.

Acknowledgments
The work was partially supported by a Wipro research grant, Ramanujan Fellowship, the Infosys Centre for AI, IIIT Delhi, and ihub-Anubhuti-iiitd Foundation, set up under the NM-ICPS scheme of the Department of Science and Technology, India. It is also part of the Tanbih mega-project, developed at the Qatar Computing Research Institute, HBKU, which aims to limit the impact of "fake news," propaganda, and media bias by making users aware of what they are reading, thus promoting media literacy and critical thinking.

A Implementation Details
We trained all our models using PyTorch on NVIDIA Tesla V100 GPU, with 32 GB dedicated memory, CUDA-11.2 and cuDNN-8.1.1 installed. For the unimodal models, we imported all the pre-trained weights from TORCHVISION.MODELS, a sub-package of the PyTorch framework. We initialized the remaining weights randomly using a zero-mean Gaussian distribution with a standard deviation of 0.02. We train DISARM in a setup considering only harmful class data from Harm-P (Pramanick et al., 2021b). We extended it by manually annotating for harmful targets, followed by including non-harmful examples using automated entity extraction (textual and visual) strategies for training/validation splits and manual annotation (for both harmful and non-harmful) for the test split.
When training our models and exploring various values for the different model hyperparameters, we experimented with using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-4, a weight decay of 1e-5, and a Binary Cross-Entropy (BCE) loss as the objective function. We extensively fine-tuned our experimental setups based upon different architectural requirements to select the best hyper-parameter values. We also used early stopping for saving the best intermediate checkpoints.

B Ablation Study
In this section, we present some ablation studies for sub-modules of DISARM based on CE, EH, CT, and CI, examined in isolation and in combinations, and finally for DISARM using CMM.

B.1 Test Set A
As observed in the comparisons with the baseline systems on Test Set A in Table 2, the overall range of F1 scores is relatively high, with the lowest value being 0.66 for the XLNet (text-only) model. The results for the unimodal systems, as can be observed in Table 5, are satisfactory, with values of 0.74, 0.73, and 0.77 for the CE, EH, and CI unimodal systems, respectively. For multimodal systems, we observe a distinct lead for the MMLRBP-based fusion strategy over the concatenation-based approach for both the CE and EH systems, except for a drop of 7 points in recall for EH. Finally, DISARM yields the best overall F1 score of 0.78.

B.2 Test Set B
Since the context carries no harmfulness cues for a given meme when considered in isolation, the unimodal CE module performs the worst in the open-ended setting of Test Set B, with an F1 score of 0.48 and a recall of 0.07 for the harmful class. In contrast, EH yields an F1 score of 0.55 and a harmful recall of 0.41. This gain of 7 points absolute in F1 score could be due to the presence of explicit harmfulness cues. The complementary effect of contextual information can be inferred from the joint modeling of CE and EH to obtain CT, which improves the F1 score and the harmful recall by 2 and 3 points, respectively (see Table 5). Unimodal assessment of CI performs moderately, with an F1 score of 0.51 but a poor harmful recall of 0.15. MMLRBP-based joint modeling of CE and CI yields a significant boost in harmful recall, to 0.52 (see Table 5). On the other hand, MMLRBP-based fusion of EH and CI yields an F1 score of 0.54, which is 1 point below that of the unimodal EH system. This emphasizes the importance of accurately modeling the embedded harmfulness, besides augmenting it with additional context. A complementary impact of CE, EH, and CI is observed for DISARM, with a balanced F1 score of 0.6 and a competitive harmful recall of 0.69.
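The two quantities contrasted above, the F1 score and the recall for the harmful class, can be computed from predictions as in the following sketch. It assumes binary labels (1 for harmful, 0 for non-harmful) and a macro-averaged F1, which is an assumption about how the reported F1 is aggregated; the function names and the toy labels are hypothetical and not taken from the dataset.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores (assumed aggregation)."""
    return sum(class_metrics(y_true, y_pred, c)[2] for c in labels) / len(labels)


def harmful_recall(y_true, y_pred, harmful_label=1):
    """Fraction of truly harmful targets that the system flags as harmful."""
    return class_metrics(y_true, y_pred, harmful_label)[1]


# Toy example with hypothetical labels: 1 = harmful target, 0 = non-harmful reference.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
print(round(macro_f1(y_true, y_pred), 4), round(harmful_recall(y_true, y_pred), 4))
```

Under macro averaging, a system that rarely predicts the harmful class can still obtain a moderate F1 score from the non-harmful class alone, which is one possible reading of why a low harmful recall can coexist with an F1 score near 0.5.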

B.3 Test Set C
As observed in the previous scenario, the unimodal model for CE yields a low F1 score of 0.48 and the worst harmful recall of 0.06. Much better performance is observed for the unimodal setups including EH and its joint modelling with CE, with improved F1 scores of 0.56 and 0.58, respectively, along with harmful recall scores of 0.56 and 0.57, respectively. CI-based unimodal evaluation again yields a moderate F1 score of 0.53 (see Table 5), along with a poor harmful recall of 0.19, which shows its inadequacy for modelling harmful targeting on its own. For multimodal setups, the joint modelling of CE and CI benefits from MMLRBP-based fusion, yielding gains of 7 and 13 points in terms of F1 score and harmful recall, respectively. This confirms the importance of contextual multimodal semantic alignment. In contrast, joint multimodal modelling of EH and CI degrades the performance obtained with unimodal EH. Finally, DISARM outperforms all other systems in this category, with the best F1 score of 0.64 and a decent harmful recall of 0.7. The experimental results here are for comparison and analysis of the optimal set of design and baseline choices. We should note that we performed extensive experiments as part of our preliminary investigation, with different contextual modelling strategies, attention mechanisms, modelling choices, etc., to reach a conclusive architectural configuration that shows promise for addressing the task of target detection in harmful memes.

C Error Analysis
It is evident from the results shown in Tables 2 and 3 that DISARM still has shortcomings. Examples like the one shown in Fig. 6 are seemingly harmless, both textually and visually, but imply serious harm to a person of color in an implicit way.

D.2 Characteristics of Harmful Targeting
There are several factors that collectively facilitate the characterisation of harmful targeting in memes. Here are some:
1. A prominent way of harmfully targeting an entity in a meme is by leveraging sarcastically harmful analogies, framed via either textual or visual instruments (see Fig. 7a).
2. There could be multiple entities being harmfully targeted within a meme as depicted in Fig. 2. Hence, annotators were asked to provide all such targets as harmful, with no exceptions.
3. A harmful targeting within a meme could have visual depictions that are either gory, violent, graphically sensitive, or pornographic (see Fig. 7b).
4. Any meme that makes insinuations about an entity on social, political, professional, or religious grounds can cause harm (see Fig. 7c and 7d).
5. Any meme that implies an explicit/implicit threat to an individual, a community, a national or an international entity is harmful (see Fig. 7d and 7e).
6. Whenever there is any ambiguity regarding the harmfulness of any reference being made, we requested the annotators to proceed following the best of their understanding.

E Ext-Harm-P Characteristics
Below, we analyze the lexical content and the length of the meme text.

E.1 Lexical Analysis
Interestingly, a significant number of memes are disseminated making references to popular individuals such as Joe Biden, Donald Trump, etc., as can be observed for individual sub-categories (for both harmful and non-harmful memes) in Table 6.
We can see in Table 6 that for harmful-organization, the top-5 harmfully targeted organizations include the top-2 leading political organizations in the USA (the Democratic Party and the Republican Party), which are of significant political relevance, followed by the Libertarian Party, a media outlet (CNN), and finally the generic government. At the same time, non-harmfully referenced organizations include the Biden camp and the Trump administration, which are mostly leveraged to target (harmfully or otherwise) the associated public figure. Finally, communities such as Mexicans, Black, Muslim, Islam, and Russian are frequent targets of strong prejudice online, and thus also in our meme dataset. At the same time, non-harmfully targeted communities such as the Trump supporters and the African Americans are not targeted as often as the aforementioned ones, as we can see in Table 6.
The above analysis of the lexical content of the memes in our datasets largely emphasizes the inherent bias that multimodal content such as memes can exhibit, which in turn can have a direct influence on the efficacy of machine/deep learning-based systems for detecting the entities targeted by harmful memes. The reasons for this bias are mostly linked to societal behaviour at the organic level, and to the limitations posed by current techniques for processing such data. The mutual exclusion between the harmful and non-harmful categories for community shows the inherent bias that could pose a challenge, even for the best multi-modal deep neural systems. The high pervasiveness of a few prominent keywords could effectively lead to increasing bias towards them for specific cases. At the same time, the significant overlap observed in Table 6 for the listed entities, between harmful and not-harmful individuals, highlights the need for sophisticated multi-modal systems that can effectively reason towards making a complex decision like detecting harmful targeting within memes, rather than exploit the biases towards certain entities in the training data.
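The kind of overlap and mutual exclusion discussed above can be checked with a simple frequency count. The sketch below is only illustrative: the record layout `(entity, category, is_harmful)`, the function names, and the placeholder entities are assumptions made for this example and do not reflect the actual annotation format or the counts behind Table 6.

```python
from collections import Counter


def top_entities(records, category, is_harmful, k=5):
    """Return the k most frequent entities for one category and one harmfulness label."""
    counts = Counter(
        entity
        for entity, cat, harmful in records
        if cat == category and harmful == is_harmful
    )
    return [entity for entity, _ in counts.most_common(k)]


def harmful_overlap(records, category, k=5):
    """Entities that appear in both the harmful and the non-harmful top-k lists."""
    harmful = set(top_entities(records, category, True, k))
    non_harmful = set(top_entities(records, category, False, k))
    return harmful & non_harmful


# Hypothetical records with placeholder names, not drawn from the dataset.
records = [
    ("person_a", "individual", True),
    ("person_a", "individual", False),
    ("person_b", "individual", True),
    ("person_c", "individual", False),
    ("community_x", "community", True),
    ("community_y", "community", False),
]

print(harmful_overlap(records, "individual"))  # entities seen in both roles
print(harmful_overlap(records, "community"))   # empty set: mutually exclusive
```

A large overlap means that an entity's name alone cannot decide the label, while an empty overlap means that a model could reach the right answer for the wrong reason by memorizing entity names.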

E.2 Meme-Message Length Analysis
Most of the harmful memes are observed to be created using texts of length 16-18 (see Fig. 8).
At the same time, not-harmful meme-text lengths have a relatively higher standard deviation, possibly due to the diversity of non-harmful messages. Trump and the Republican Party have meme-text length distributions similar to the non-harmful category: skewing left, but gradually decreasing towards the right. This suggests a varying content generation pattern amongst meme creators (see Fig. 8). The meme-text length distribution for Biden closely approximates a normal distribution with a low standard deviation. Both categories predominantly involve memes with shorter text lengths, possibly due to the popularity of Biden amongst humorous content creators. A similar trend can be seen for the Democratic Party as well, where most of the instances fall within the 50-75 meme-text length range. For Mexican, the harmful and non-harmful meme-text lengths are spread fairly evenly across the different length values. At the same time, the amount of harm intended towards the Black community is observed to be significantly higher than the moderately distributed non-harmful memes, as depicted by the corresponding meme-text length distribution in Fig. 8.
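Summary statistics of the kind discussed above (typical length, spread, and a binned distribution) can be obtained as in the following sketch. The unit of length is not specified in this section, so measuring it in whitespace-separated tokens is an assumption, as are the bin width, the function names, and the toy texts.

```python
from collections import Counter
from statistics import mean, pstdev


def text_length(text):
    """Length of a meme text in whitespace-separated tokens (assumed unit)."""
    return len(text.split())


def length_summary(texts, bin_width=5):
    """Mean, population standard deviation, and binned histogram of text lengths."""
    lengths = [text_length(t) for t in texts]
    # Map each length to the lower edge of its bin, e.g. 7 -> 5 for bin_width=5.
    histogram = Counter((n // bin_width) * bin_width for n in lengths)
    return {
        "mean": mean(lengths),
        "std": pstdev(lengths),
        "histogram": dict(sorted(histogram.items())),
    }


# Hypothetical meme texts grouped by label, for illustration only.
texts_by_label = {
    "harmful": ["a b c", "a b c d e", "a b c d e f g"],
    "non-harmful": ["a", "a b c d e f g h i j k l"],
}

for label, texts in texts_by_label.items():
    print(label, length_summary(texts))
```

Comparing the `std` values across labels or entities gives a quick numerical counterpart to the visual comparison of the distributions in Fig. 8.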