MEMEX: Detecting Explanatory Evidence for Memes via Knowledge-Enriched Contextualization

Memes are a powerful tool for communication over social media. Their affinity for evolving across politics, history, and sociocultural phenomena renders them an ideal vehicle for communication. To comprehend the subtle message conveyed within a meme, one must understand the relevant background that facilitates its holistic assimilation. Besides digital archiving of memes and their metadata by a few websites like knowyourmeme.com, currently, there is no efficient way to deduce a meme’s context dynamically. In this work, we propose a novel task, MEMEX - given a meme and a related document, the aim is to mine the context that succinctly explains the background of the meme. At first, we develop MCC (Meme Context Corpus), a novel dataset for MEMEX. Further, to benchmark MCC, we propose MIME (MultImodal Meme Explainer), a multimodal neural framework that uses external knowledge-enriched meme representation and a multi-level approach to capture the cross-modal semantic dependencies between the meme and the context. MIME surpasses several unimodal and multimodal systems and yields an absolute improvement of 4% F1-score over the best baseline. Lastly, we conduct detailed analyses of MIME’s performance, highlighting the aspects that could lead to optimal modeling of cross-modal contextual associations.


Introduction
Social media has become a mainstream communication medium for the masses, redefining how we interact within society. The information shared on social media takes diverse forms, like text, audio, and visual messages, or combinations thereof. A meme is a typical example of such a social media artifact that is usually disseminated with a flair of sarcasm or humor. While memes facilitate a convenient means for propagating complex social, cultural, or political ideas via visual-linguistic semiotics, they often abstract away the contextual details that would typically be necessary for the uninitiated. Such contextual knowledge is critical for human understanding and computational analysis alike. We aim to address this requirement by contemplating solutions that facilitate the automated derivation of contextual evidence towards making memes more accessible. To this end, we formulate a novel task, MEMEX, which, given a meme and a related context, aims to detect the sentences from within the context that can potentially explain the meme. Table 1 visually explains MEMEX. Memes often camouflage their intended meaning, suggesting MEMEX's utility for a broader set of multimodal applications exhibiting visual-linguistic dissociation. Other use cases include context retrieval for various art forms, news images, abstract graphics for digital media marketing, etc.

Table 1 primarily showcases a meme's figure (left) and an excerpt from the related context (right). This meme is about the revenge killing of an Ottoman Sultan by the Janissaries (infantry units), in reaction to their disbanding by the Sultan. The first line conveys the supporting evidence for the meme from the related context, emboldened and highlighted in Table 1. The aim is to model the required cross-modal association that facilitates the detection of such supporting pieces of evidence from a given related contextual document.

Table 1: MEMEX - given a meme and a relevant context, the aim is to identify the evidence in the context that can succinctly explain the background of the meme, depicted via the emboldened and highlighted excerpt. Meme source: Reddit r/historymemes; context source: Wikipedia.
Context excerpt: "The result was a palace uprising by the Janissaries, who promptly imprisoned the young sultan in Yedikule Fortress in Istanbul, where Osman II was strangled to death. After Osman's death, his ear was cut off and represented to Halime Sultan and Sultan Mustafa I to confirm his death and Mustafa would no longer need to fear his nephew. It was the first time in the Ottoman history that a Sultan was executed by the Janissaries."
The recent surge in the dissemination of memes has led to an evolving body of studies on meme analysis, in which the primary focus has been on tasks such as emotion analysis (Sharma et al., 2020), visual-semantic role labeling (Sharma et al., 2022c), and the detection of phenomena like sarcasm, hate speech (Kiela et al., 2020), trolling (Hegde et al., 2021), and harmfulness (Pramanick et al., 2021; Sharma et al., 2022b).
These studies indicate that off-the-shelf multimodal models, which perform well on several traditional visual-linguistic tasks, struggle when applied to memes (Kiela et al., 2020; Baltrušaitis et al., 2017; Sharma et al., 2022b). The primary reason behind this is the contextual dependency of memes for their accurate assimilation and analysis. Websites like knowyourmeme.com (KYM) provide important yet restricted information. MEMEX requires the model to learn the cross-modal analogies shared by the contextual evidence and the meme at various levels of information abstraction towards detecting the crucial explanatory evidence1. The critical challenge is to represent the abstraction granularity aptly. Therefore, we formulate MEMEX as an "evidence detection" task, which can help deduce the pieces of contextual evidence that bridge the abstraction gap. However, besides including the image and text modalities, there is a critical need to inject contextual signals that compensate for the constraints of the visual-linguistic grounding offered by conventional approaches.
Although memes are effective and convenient to design and disseminate strategically over social media, they are often hard to understand or easily misinterpreted by the uninitiated, typically without the proper context. This underscores the importance of addressing a task like MEMEX. Governments or organizations involved in content moderation on social media platforms could use such a utility; a context-deduction solution would make it considerably easier to assimilate harmful memes and adjudicate their social implications during critical periods such as elections or a pandemic.
Motivated by this, we first curate MCC, a new dataset that captures various memes and related contextual documents. We also systematically experiment with various multimodal solutions to address MEMEX, which culminates in a novel framework named MIME (MultImodal Meme Explainer). Our model primarily addresses the challenges posed by the knowledge gap and multimodal abstraction, and delivers optimal detection of contextual evidence for a given pair of a meme and its related context. In doing so, MIME surpasses several competitive and conventional baselines.

1 A comparative analysis for KYM and MIME is presented in Appendix C.
To summarize, we make the following main contributions2:
• A novel task, MEMEX, aimed at identifying explanatory evidence for memes from their related contexts.
• A novel dataset, MCC, containing 3400 memes and related contexts, along with gold-standard, human-annotated evidence sentence subsets.
• A novel method, MIME, that uses common sense-enriched meme representations to identify evidence from the given context.
• Empirical analysis establishing MIME's superiority over various unimodal and multimodal baselines adapted for the MEMEX task.

Related Work
This section briefly discusses relevant studies on meme analysis that primarily attempt to capture a meme's affective aspects, such as hostility and emotions. Besides these, we also review other popular tasks to suitably position our work alongside different related research dimensions being explored.
Visual Question Answering (VQA): Early prominent work on VQA with a framework encouraging open-ended questions and candidate answers was done by Antol et al. (2015). Since then, multiple variations have been observed. Antol et al. (2015) classified the answers by jointly representing images and questions. Others followed by examining cross-modal interactions via attention types not restricted to co/soft/hard-attention mechanisms (Lu et al., 2016; Anderson et al., 2018; Malinowski et al., 2018), effectively learning the explicit correlations between question tokens and localised image regions. Notably, there was a series of attempts toward incorporating common-sense reasoning (Zellers et al., 2019; Wu et al., 2017; Marino et al., 2019). Many of these studies also leveraged information from external knowledge bases for addressing VQA tasks. General models like UpDn (Anderson et al., 2018) and LXMERT (Tan and Bansal, 2019) explicitly leverage non-linear transformations and Transformers for the VQA task, while others like LMH (Clark et al., 2019) and SSL (Zhu et al., 2021) addressed the critical language priors constraining VQA performance, albeit with marginal enhancements.
Cross-modal Association: Due to an increased influx of multimodal data, cross-modal association has recently received much attention. For cross-modal retrieval and vision-language pre-training, accurate measurement of cross-modal similarity is imperative. Traditional techniques primarily used the concatenation of modalities, followed by self-attention, towards learning cross-modal alignments. Following the object-centric approaches, Zeng et al. (2021), among others, proposed a multi-grained alignment approach, which captures the relations between visual concepts of multiple objects while simultaneously aligning them with text and additional meta-data. On the other hand, several methods also learned alignments between coarse-grained features of images and texts while disregarding object detection in their approaches (Huang et al., 2020). Later approaches attempted diverse methodologies, including cross-modal semantic learning from visuals and contrastive loss formulations (Yuan et al., 2021; Jia et al., 2021; Radford et al., 2021). Despite comprehensive coverage of cross-modal and meme-related applications in general, there are still several fine-grained aspects of memes, like memetic contextualization, that are yet to be studied. Here, we attempt to address one such novel task, MEMEX.

MCC: Meme Context Corpus
Due to the scarcity of publicly available large-scale datasets that capture memes and their contextual information, we build a new dataset, MCC (Meme Context Corpus). The overall dataset curation was conducted in three stages: (i) meme collection, (ii) context document curation, and (iii) dataset annotation. These stages are detailed in the remainder of this section.

Meme Collection
In this work, we primarily focus on political and historical English-language memes. The reason for this choice is the higher presence of online memes based on these topics, complemented by the availability of systematic and detailed information documented in well-curated digital archives. In addition to these categories, we also extend our search space to other themes pertaining to movies, geo-politics, and entertainment. For scraping the meme images, we mainly leverage Google Images3 and Reddit4, for their extensive search functionality and diverse multimedia presence.

Context Document Curation
We curate a contextual corpus corresponding to the memes collected in the first step. This context typically constitutes pieces of evidence for the meme's background, towards which we consider Wikipedia5 (Wiki) as the primary source. We use a Python-based wrapper API6 to obtain text from Wikipedia pages. For example, for Trump, we crawl his Wiki page7. For scenarios wherein sufficient details are not available on a page, we look for fine-grained Wiki topics or related non-Wiki news articles. For several other topics, we explore community-based discussion forums and question-answering websites like Quora8, or other general-purpose websites.
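For illustration, the following is a minimal sketch of this crawling step, assuming the community wikipedia wrapper package; the paper only references a Python-based wrapper via a footnote, so the package choice and the page title shown are assumptions.

```python
# Minimal sketch of the context-crawling step, assuming the community
# "wikipedia" wrapper package; page titles here are illustrative.
import wikipedia

def fetch_context(topic: str, max_sentences: int = 250) -> list[str]:
    """Fetch the plain text of a Wikipedia page and split it into sentences."""
    page = wikipedia.page(topic, auto_suggest=False)  # e.g., "Donald Trump"
    text = page.content
    # Naive sentence split; the actual pipeline may use a proper tokenizer.
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    return sentences[:max_sentences]

if __name__ == "__main__":
    print(fetch_context("Donald Trump")[:3])
```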

Annotation Process
Towards curating MCC, we employed two annotators, one male and one female (both of Indian origin), aged between 24 and 35 years, who were duly paid for their services as per Indian standards. Moreover, both were professional lexicographers and social media savvy, well versed in the urban social media vernacular. A set of prescribed guidelines for the annotation task, shown in Table 2, was shared with the annotators. Once the annotators were sure that they understood the meme's background, they were asked to identify the sentences in the context document that succinctly provided the background for the meme. We call these sentences "evidence sentences", as they constitute (sub-)optimal pieces of evidence capturing the likely background information. The annotation quality was assessed using Cohen's Kappa after an initial dry-run and after the final annotation. The first stage yielded a moderate agreement score of 0.55, which, after several rounds of discussion, improved to a substantial agreement score of 0.72.

Table 2: Annotation guidelines.
1. The meme and the associated context should be understood before annotation.
2. The meme's semantics must steer the annotation.
3. Self-contained, minimal units of information can constitute evidence.
4. Valid evidence may or may not occur contiguously.
5. Cases not supported by a contextual document should be searched on other established sources.
6. Ambiguous cases should be skipped.
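For reference, sentence-level agreement of this kind can be computed directly over the two annotators' binary evidence labels; the sketch below uses illustrative labels, not the actual MCC annotations.

```python
# Illustrative computation of inter-annotator agreement over binary
# per-sentence evidence labels (the labels below are made up, not MCC data).
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]  # 1 = evidence sentence
annotator_b = [1, 0, 1, 1, 1, 0, 0, 0, 0, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```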

Dataset Description
The topic-wise distribution of the memes reflects their corresponding availability on the web. Consequently, MCC proportionately includes History (38.59%), Entertainment (15.44%), Joe Biden (12.17%), Barack Obama (9.29%), Coronavirus (7.80%), Donald Trump (6.61%), Hillary Clinton (6.33%), US Elections (1.78%), Elon Musk (1.05%), and Brexit (0.95%). Since the contextual documents corresponding to the memes were significantly large (on average, each document consists of 250 sentences), we ensured tractability within the experimental setup by limiting the scope of the meme's related context to a subset of the entire document. Upon analyzing the token distribution for the ground-truth pieces of evidence, we observe a maximum token length of 312 (c.f. Fig. 1b for the evidence token distribution). Therefore, we set the maximum context length threshold to 512 tokens. This leads to the consideration of an average of ≈128 tokens and a maximum of over 350 tokens (spanning 2-3 paragraphs) within contextual documents (c.f. Fig. 1a for the context token distribution), which corresponds to a maximum of 10 sentences per contextual document. We split the dataset into an 80:10:10 ratio for the train/validation/test sets, resulting in 3003 memes in the train set and 200 memes each in the validation and test sets. Moreover, we ensure proportionate distributions across the train, validation, and test sets. Each sample in MCC consists of a meme image, the context document, the OCR-extracted meme text, and a set of ground-truth evidence sentences.9

Methodology
In this section, we describe our proposed model, MIME. It takes a meme (an image with overlaid text) and a related context as inputs, and outputs a sequence of labels indicating whether the context's constituent sentences, either individually or collectively, explain the given meme.

Figure 2: The architecture of our proposed model, MIME. We obtain an external knowledge-enriched multimodal meme representation using the Knowledge-enriched Meme Encoder (KME). We make use of a Meme-Aware Transformer (MAT) and a Meme-Aware LSTM layer (MA-LSTM) to incorporate the meme information while processing the context.
As depicted in Fig. 2, MIME consists of a text encoder to encode the context and a multimodal encoder to encode the meme (image and text). To address the complex abstraction requirements, we design a Knowledge-enriched Meme Encoder (KME) that augments the joint multimodal representation of the meme with external common-sense knowledge via a gating mechanism. On the other hand, we use a pre-trained BERT model to encode the sentences from the candidate context.
We then set up a Meme-Aware Transformer (MAT) to integrate meme-based information into the context representation for designing a multilayered contextual-enrichment pipeline. Next, we design a Meme-Aware LSTM (MA-LSTM) that sequentially processes the context representations conditioned upon the meme-based representation. Lastly, we concatenate the last hidden context representations from MA-LSTM and the meme representation and use this jointly-contextualized meme representation for evidence detection. Below, we describe each component of MIME in detail.
Context Representation: Given a related context C consisting of sentences [c_1, c_2, ..., c_n], we encode each sentence in C individually using a pre-trained BERT encoder, and the pooled output corresponding to the [CLS] token is used as the sentence representation. Finally, we concatenate the individual sentence representations to get a unified context representation H_c ∈ R^{n×d} for the total of n sentences.
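A minimal sketch of this sentence-wise encoding, assuming HuggingFace Transformers and the bert-base-uncased checkpoint (the exact BERT variant is an assumption):

```python
# Minimal sketch of the sentence-wise context encoding described above.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_context(sentences: list[str]) -> torch.Tensor:
    """Encode each sentence with BERT and stack the pooled [CLS] outputs.

    Returns H_c with shape (n, d), one row per context sentence.
    """
    reps = []
    with torch.no_grad():
        for sent in sentences:
            inputs = tokenizer(sent, return_tensors="pt",
                               truncation=True, max_length=128)
            outputs = bert(**inputs)
            reps.append(outputs.pooler_output.squeeze(0))  # (d,)
    return torch.stack(reps, dim=0)  # (n, d)
```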
Knowledge-enriched Meme Encoder: Since memes encapsulate a complex interplay of linguistic elements in a contextualized setting, it is necessary to facilitate a primary understanding of linguistic abstraction besides factual knowledge. In our scenario, the required contextual mapping is implicitly facilitated across the contents of the meme and the context documents. Therefore, to supplement the feature integration with the required common sense knowledge, we employ ConceptNet (Speer et al., 2017): a semantic network designed to help machines comprehend the meanings and semantic relations of the words and specific facts people use. Using a GCN pre-trained on ConceptNet, we aim to incorporate these semantic characteristics by extracting the averaged GCN-computed representations corresponding to the meme's text. In this way, the representations obtained are common sense-enriched and are further integrated with the rest of the proposed solution.
To incorporate external knowledge, we use the ConceptNet (Speer et al., 2017) knowledge graph (KG) as a source of external commonsense knowledge. To take full advantage of the KG while avoiding the query computation cost, we use the last layer of a pre-trained graph convolutional network (GCN) trained over ConceptNet (Malaviya et al., 2020).
We first encode the meme M by passing the meme image M_i and the meme text M_t 10 to an empirically designated pre-trained MMBT model (Kiela et al., 2019) to obtain a multimodal representation of the meme, H_m ∈ R^d. Next, to get the external knowledge representation, we obtain the GCN node representations corresponding to the words in the meme text M_t. This is followed by average-pooling these embeddings to obtain the unified knowledge representation H_k ∈ R^d.
To learn a knowledge-enriched meme representation Ĥ_m, we design a Gated Multimodal Fusion (GMF) block. As part of this, we employ a meme gate (g_m) and a knowledge gate (g_k) to modulate and fuse the corresponding representations.
Here, W_m and W_k ∈ R^{2d×d} are trainable parameters.
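The following PyTorch sketch shows one gated-fusion formulation consistent with the stated parameter shapes (W_m, W_k ∈ R^{2d×d}); it is an assumed illustration rather than MIME's exact equations.

```python
# Hedged sketch of a Gated Multimodal Fusion (GMF) block consistent with the
# parameter shapes stated above (W_m, W_k map the concatenated 2d-dim input
# to d dims). The exact gating equations of MIME may differ.
import torch
import torch.nn as nn

class GatedMultimodalFusion(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.W_m = nn.Linear(2 * d, d)  # meme gate parameters
        self.W_k = nn.Linear(2 * d, d)  # knowledge gate parameters

    def forward(self, h_m: torch.Tensor, h_k: torch.Tensor) -> torch.Tensor:
        """h_m: (d,) MMBT meme representation; h_k: (d,) ConceptNet-GCN rep."""
        joint = torch.cat([h_m, h_k], dim=-1)      # (2d,)
        g_m = torch.sigmoid(self.W_m(joint))       # meme gate
        g_k = torch.sigmoid(self.W_k(joint))       # knowledge gate
        return g_m * h_m + g_k * h_k               # knowledge-enriched meme rep.

# Example with d = 768-dimensional representations
fusion = GatedMultimodalFusion(d=768)
h_hat_m = fusion(torch.randn(768), torch.randn(768))
```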
Meme-Aware Transformer: A conventional Transformer encoder (Vaswani et al., 2017a) uses self-attention, which facilitates the learning of inter-token contextual semantics. However, it does not consider any additional contextual information helpful in generating the query, key, and value representations. Inspired by the context-aware self-attention proposed by Yang et al. (2019), in which the authors proposed several ways to incorporate global, deep, and deep-global contexts while computing self-attention over embedded textual tokens, we propose a meme-aware multi-headed attention (MHA). This facilitates the integration of multimodal meme information while computing the self-attention over context representations. We call the resulting encoder the Meme-Aware Transformer (MAT) encoder, which is aimed at computing the cross-modal affinity for H_c, conditioned upon the knowledge-enriched meme representation Ĥ_m. Conventional self-attention uses query, key, and value vectors from the same modality. In contrast, as part of meme-aware MHA, we first generate the key and value vectors conditioned upon the meme information and then use these vectors via conventional multi-headed attention-based aggregation. We elaborate on the process below.
Given the context representation H_c, we first calculate the conventional query, key, and value vectors Q, K, V ∈ R^{n×d}, respectively. Here, n is the maximum sequence length, d is the embedding dimension, and W_Q, W_K, and W_V ∈ R^{d×d} are learnable parameters. We then generate new key and value vectors, K̃ and Ṽ, respectively, which are conditioned on the meme representation Ĥ_m ∈ R^{1×d} (broadcast to the context size). We use a gating parameter λ ∈ R^{n×1} to regulate the memetic and contextual interaction; here, U_k and U_v constitute learnable parameters.
We learn the parameters λ_k and λ_v using a sigmoid-based gating mechanism instead of treating them as hyperparameters; here, W_{k1}, W_{v1}, W_{k2}, and W_{v2} ∈ R^{d×1} are learnable parameters. Finally, we use the query vector Q against K̃ and Ṽ, conditioned on the meme information, in conventional scaled dot-product attention. This is extrapolated via multi-headed attention to materialize the Meme-Aware Transformer (MAT) encoder, which yields meme-aware context representations H_{c/m} ∈ R^{n×d}.
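The single-head PyTorch sketch below illustrates one plausible realization of the meme-aware attention described above, in the spirit of context-aware self-attention (Yang et al., 2019); the exact parameterization used in MIME may differ.

```python
# Hedged, single-head sketch of meme-aware attention: key/value vectors are
# re-generated conditioned on the meme representation via learned sigmoid
# gates (lambda_k, lambda_v), then used in scaled dot-product attention.
import math
import torch
import torch.nn as nn

class MemeAwareAttention(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)
        self.U_k = nn.Linear(d, d, bias=False)   # meme -> key space
        self.U_v = nn.Linear(d, d, bias=False)   # meme -> value space
        self.gate_k = nn.Linear(2 * d, 1)        # produces lambda_k per token
        self.gate_v = nn.Linear(2 * d, 1)        # produces lambda_v per token

    def forward(self, H_c: torch.Tensor, h_m: torch.Tensor) -> torch.Tensor:
        """H_c: (n, d) context sentences; h_m: (d,) knowledge-enriched meme."""
        n, d = H_c.shape
        Q, K, V = self.W_q(H_c), self.W_k(H_c), self.W_v(H_c)      # (n, d)
        m = h_m.unsqueeze(0).expand(n, d)                           # broadcast meme
        lam_k = torch.sigmoid(self.gate_k(torch.cat([K, m], -1)))   # (n, 1)
        lam_v = torch.sigmoid(self.gate_v(torch.cat([V, m], -1)))   # (n, 1)
        K_tilde = lam_k * self.U_k(m) + (1 - lam_k) * K             # meme-aware keys
        V_tilde = lam_v * self.U_v(m) + (1 - lam_v) * V             # meme-aware values
        attn = torch.softmax(Q @ K_tilde.T / math.sqrt(d), dim=-1)  # (n, n)
        return attn @ V_tilde                                       # (n, d) meme-aware context
```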
Meme-Aware LSTM: Prior studies have indicated that coupling a recurrent neural network such as an LSTM with a Transformer encoder like BERT is advantageous. Rather than directly using a standard LSTM in MIME, we aim to incorporate the meme information into the sequential recurrence-based learning. Towards this objective, we introduce the Meme-Aware LSTM (MA-LSTM) in MIME. MA-LSTM is a recurrent neural network, inspired by Xu et al. (2021), that can incorporate the meme representation Ĥ_m while computing cell and hidden states. The gating mechanism in MA-LSTM allows it to assess how much information it needs to consider from the hidden states of the enriched context and meme representations, H_{c/m} and Ĥ_m, respectively. Fig. 2 shows the architecture of MA-LSTM. We elaborate on the working of the MA-LSTM cell below. It takes as input the previous cell state c_{t-1}, the previous hidden representation h_{t-1}, the current cell input H_{c,t}, and the additional meme representation Ĥ_m. Besides the conventional steps involved in computing the input, forget, output, and gate values w.r.t. the input H_{c,t}, the input and gate values are also computed w.r.t. the additional input Ĥ_m, from which the final cell state and hidden state outputs are obtained. The hidden states from each time step are then concatenated to produce the unified context representation Ĥ_{c/m} ∈ R^{n×d}.
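The sketch below shows one way a meme-conditioned gate and candidate could augment a standard LSTM cell state; it is an assumed illustration of the MA-LSTM idea rather than its exact formulation.

```python
# Hedged sketch of a Meme-Aware LSTM (MA-LSTM) cell: alongside the standard
# input/forget/output gates over the context input, an extra gated path lets
# the meme representation contribute to the cell state.
import torch
import torch.nn as nn

class MemeAwareLSTMCell(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.gates = nn.Linear(2 * d, 4 * d)     # standard gates over [x_t, h_{t-1}]
        self.meme_gate = nn.Linear(2 * d, d)     # meme input gate over [h_m, h_{t-1}]
        self.meme_cand = nn.Linear(2 * d, d)     # meme-conditioned candidate

    def forward(self, x_t, h_prev, c_prev, h_m):
        """x_t: (d,) context input at step t; h_m: (d,) meme representation."""
        z = self.gates(torch.cat([x_t, h_prev], dim=-1))
        i, f, o, g = z.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        # Meme path: decide how much meme information enters the cell state.
        m_in = torch.cat([h_m, h_prev], dim=-1)
        i_m = torch.sigmoid(self.meme_gate(m_in))
        g_m = torch.tanh(self.meme_cand(m_in))
        c_t = f * c_prev + i * g + i_m * g_m     # cell state with meme contribution
        h_t = o * torch.tanh(c_t)
        return h_t, c_t
```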
Prediction and Training Objective: Finally, we concatenate Ĥ_m and Ĥ_{c/m} to obtain a joint context-meme representation, which we then pass through a feed-forward layer to obtain the final classification. The model outputs the likelihood of each sentence being valid evidence for the given meme. We use the cross-entropy loss to optimize our model.
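A minimal sketch of this prediction head; the layer sizes and the two-class formulation are illustrative assumptions.

```python
# Minimal sketch of the prediction head: concatenate the meme representation
# with each sentence's contextualized representation and classify each
# sentence as evidence / not evidence.
import torch
import torch.nn as nn

d = 768
classifier = nn.Linear(2 * d, 2)             # feed-forward evidence classifier
criterion = nn.CrossEntropyLoss()

def evidence_logits(H_cm: torch.Tensor, h_m: torch.Tensor) -> torch.Tensor:
    """H_cm: (n, d) MA-LSTM hidden states; h_m: (d,) meme representation."""
    n = H_cm.size(0)
    joint = torch.cat([H_cm, h_m.unsqueeze(0).expand(n, d)], dim=-1)  # (n, 2d)
    return classifier(joint)                                           # (n, 2)

# Training step (labels: 1 if the sentence is valid evidence, else 0)
logits = evidence_logits(torch.randn(10, d), torch.randn(d))
loss = criterion(logits, torch.randint(0, 2, (10,)))
```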

Baseline Models
We experiment with various unimodal and multimodal encoders for systematically encoding meme and context representations to establish comparative baselines. The details are presented below.
• CLIP (Radford et al., 2021): To obtain multimodal representations from memes using the CLIP image and text encoders, and the CLIP text encoder for the context representation.
• BAN (Kim et al., 2018): To obtain a joint representation using low-rank bilinear pooling while leveraging the dependencies among two groups of input channels.
• VisualBERT: To obtain multimodal pooled representations for memes, using a Transformer-based visual-linguistic model.

Experimental Results
This section presents the results (averaged over five independent runs) on our thematically diversified test set and performs a comparison, followed by qualitative and error analyses. For comparison, we use the following standard metrics: accuracy (Acc.), macro-averaged F1, precision (Prec.), recall (Rec.), and exact match (E-M) score11. To compute the scores corresponding to partial-match scenarios, we compute the precision/recall/F1 separately for each case before averaging across the test set. Additionally, as observed in (Beskow et al., 2020), we perform some basic image-editing operations, like adjusting contrast, tint, temperature, shadowing, and highlights, on the meme images in MCC for (i) optimal OCR extraction of meme text, and (ii) noise-resistant feature learning from images12.
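To make the partial-match scoring concrete, the sketch below computes per-example precision/recall/F1 over predicted vs. gold evidence-sentence indices, plus exact match, and averages them over examples; the actual evaluation script may differ in details.

```python
# Illustrative sentence-level evaluation over sets of evidence-sentence
# indices, mirroring the partial-match scoring described above.
def example_scores(pred: set[int], gold: set[int]) -> tuple[float, float, float, int]:
    tp = len(pred & gold)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    exact = int(pred == gold)
    return prec, rec, f1, exact

def corpus_scores(all_pred: list[set[int]], all_gold: list[set[int]]) -> dict[str, float]:
    rows = [example_scores(p, g) for p, g in zip(all_pred, all_gold)]
    n = len(rows)
    return {name: sum(r[i] for r in rows) / n
            for i, name in enumerate(["precision", "recall", "f1", "exact_match"])}

print(corpus_scores([{0, 2}, {1}], [{0, 2}, {1, 3}]))
```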
Meme-evidence Detection (MEMEX): As part of the performance analysis, we observe from Table 3 that unimodal systems, in general, perform only moderately, with the BERT-based model yielding a relatively better F1 score of 0.7641, compared to the worst score of 0.6985 by the ViT-based model. It can be reasoned that textual cues are pivotal for modeling the association when the target modality is also text-based. On the contrary, purely image-based conditioning is not sufficient for deriving the fine-grained correlations needed to accurately detect the correct evidence. Also, the lower precision against the higher recall scores suggests that inherent noise is additionally being modeled.
On the other hand, multimodal models either strongly compete with or outperform unimodal ones, with CLIP being an exception. With an impressive F1 score of 0.7725, MMBT fares optimally compared to the other multimodal baselines. This is followed by the early-fusion-based approach and VisualBERT, with 0.7721 and 0.7658 F1 scores, respectively. BAN (Bilinear Attention Network) performs better than early fusion and CLIP, but falls short by a 1-2% F1 score. Models like MMBT and VisualBERT leverage pre-trained unimodal encoders like BERT and ResNet and project a systematic joint-modeling scheme for multiple modalities. Although this has proven to be beneficial towards addressing tasks that leverage visual-linguistic grounding, especially when pre-trained using large-scale datasets like MSCOCO (VisualBERT), their limitations can be ascertained from Table 3, wherein MIME yields absolute improvements of 5.34%, 3.97%, 4.26%, 2.31% and 8.00% in accuracy, F1 score, precision, recall, and exact match scores, respectively, over the best baseline, MMBT. This suggests the potential improvement that a systematic and optimal contextualization-based approach like MIME can offer.

11 Additional experimental details are available in Appendix A.
12 See Section 7 for details on Terms and Conditions for Data Usage.

Table 4: Related context with the evidence detected by MMBT and MIME.
MMBT: John Paul Jones was a Scottish-American naval captain who was the United States' first well-known naval commander in the American Revolutionary War. He made many friends among U.S political elites, as well as enemies (who accused him of piracy). His actions in British waters during the Revolution earned him an international reputation which persists to this day.
MIME: John Paul Jones was a Scottish-American naval captain who was the United States' first well-known naval commander in the American Revolutionary War. He made many friends among U.S political elites, as well as enemies (who accused him of piracy). His actions in British waters during the Revolution earned him an international reputation which persists to this day.
Analysing Detected Evidence: We analyze the detected evidence by contrasting MIME's prediction quality with MMBT's. The meme depicted in Table 4 does not explicitly convey much information and only mentions two entities, "John Paul Jones" and "The British Isles". The MMBT baseline predicts the first sentence, which contains the phrase "John Paul Jones", as the explanation, whereas MIME correctly predicts the last sentence, which explains the meme. It is interesting to observe the plausible multimodal analogy that might have led MIME to correctly detect the relevant evidence in this case. In general, we observe that the evidence predicted by MMBT does not fully explain the meme, whereas that predicted by MIME is often more fitting.

Ablation Analysis: The incremental assessment of MIME's components (KME, MAT, and MA-LSTM), over MMBT as a base model, can be observed from Table 5. Adding external knowledge-based cues along with the MMBT representation via KME leads to an enhancement of 0.98%-2.91% across the first four metrics and 5% on exact match. Similar enhancements are observed with MAT and MA-LSTM, with increments of 0.91-2.25% and 0.06-2.25%, respectively. Therefore, it can be reasonably inferred that KME, MAT, and MA-LSTM distinctly contribute towards establishing the efficacy of MIME.
On removing MA-LSTM, we notice a distinct performance drop of 0.47-2.50% across all five metrics. Dropping MAT from MIME downgrades the performance by 1.67-5.38% for the first four metrics and by 7.5% for the exact match score.
Finally, we examine the influence of replacement by employing a standard Transformer encoder instead of MAT and a BiLSTM layer instead of MA-LSTM in MIME. The former results in a drop of 1.45-3.28% across all five metrics, whereas the drop for the latter is 0.21-2.00%. This suggests the utility of systematic memetic contextualization while addressing MEMEX.
Error Analysis: Here, we analyze different types of errors incurred by the model. As observed from the first example in Table 6, the ground-truth evidence contains abstract concepts like power dynamics and morality, along with various novel facts, which induce non-triviality. On the other hand, the second example depicts a partial prediction, wherein the extra excerpt detected by MIME is likely due to inductive biases based on the concepts of the presidential race, Jimmy Carter, and the visual description of the peanut statue. Finally, in the third example (#3), the model mapped its prediction based solely on the embedded meme text, while remaining partly oblivious to the meme's visuals. Overall, MIME obtains an exact match for 58.50% of the test-set cases. At the same time, it cannot predict any explanation for 12.5% of the cases. The model obtains partial matches for about 14% of the cases, and for the remaining 14%, it makes wrong predictions.13

Table 6: Related Context for the error-analysis examples.
1. Heart of Darkness (1899) is a novella by Polish-English novelist Joseph Conrad. It tells the story of Charles Marlow, a sailor who takes on an assignment from a Belgian trading company as a ferry-boat captain in the African interior. The novel is widely regarded as a critique of European colonial rule in Africa, whilst also examining the themes of power dynamics and morality. Although Conrad does not name the river where the narrative takes place, at the time of writing the Congo Free State, the location of the large and economically important Congo River, was a private colony of Belgium's King Leopold II.
2. The Jimmy Carter Peanut Statue is a monument located in Plains, Georgia, United States. Built in 1976, the roadside attraction depicts a large peanut with a toothy grin, and was built to support Jimmy Carter during the 1976 United States presidential election. The statue was commissioned by the Indiana Democratic Party during the 1976 United States presidential election as a form of support for Democratic candidate Jimmy Carter's campaign through that state. The statue, a 13-foot (4.0 m) peanut, references Carter's previous career as a peanut farmer.
3. On February 26, 1815, Napoleon managed to sneak past his guards and somehow escape from Elba, slip past interception by a British ship, and return to France. Immediately, people and troops began to rally to the returned Emperor. French police forces were sent to arrest him, but upon arriving in his presence, they kneeled before him. Triumphantly, Napoleon returned to Paris on March 20, 1815. Paris welcomed him with celebration, and Louis XVIII, the new king, fled to Belgium. With Louis only just gone, Napoleon moved back into the Tuileries. The period known as the Hundred Days had begun.

13 Further discussion is available in Appendix 7.

Discussion: As part of this study, we examine MIME's efficacy over other variants when the constituting components are considered both incrementally and decrementally (c.f. Table 5). Notably, we observe that adding external common sense knowledge-based signals, and attending over the meme while processing the context evidence sentences using the MAT and MA-LSTM modules, distinctly increases the performance. These components are empirically demonstrated to enhance performance, establishing the efficacy of their respective hypotheses: augmenting representation learning with common sense-based multimodal feature enrichment, self-attention-based multimodal Transformer encoding of the pieces of evidence, and, finally, sequence modeling of the derived multimodal Transformer representations, which captures the temporal entailment embedded in their contextual arrangement.
To further delineate the scope of this study: it does not aim to deduce every possible piece of contextual evidence that can comprehensively contextualize a given meme; instead, its aim is to derive the evidence pieces, given closely related raw information (which can be conveniently obtained via directed query searches), that provide the necessary contextual impetus towards adjudicating various memetic phenomena (like hate, offense, etc.). The fact that such a pipeline is not constrained to a particular topic, domain, or information source makes it reasonably scalable.

Conclusion
This work proposed a new task, MEMEX, that aims to identify evidence from a given context to explain a meme. To support this task, we curated MCC, a novel manually-annotated multimodal dataset encompassing a broad range of topics. We then benchmarked MCC on several competitive systems and proposed MIME, a novel modeling framework that utilizes a knowledge-enriched meme representation and integrates it with the context via a unique multi-layered fusion mechanism. The empirical examination and an extensive ablation study suggest the efficacy of MIME and its constituents. We also analyzed MIME's correct contextual-mapping heuristics, juxtaposed with its limitations, suggesting possible scope for improvement.

Limitations
Although our approach, MIME, is empirically observed to outperform several competitive baselines, we do observe some limitations in its modeling capacity towards MEMEX. As depicted in Table 6, there are three possible scenarios of ineffective detection: (a) no predictions, (b) partial match, and (c) incorrect predictions. The key challenges stem from the limitations in modeling the complex levels of abstraction that a meme exhibits. These are primarily encountered in the following potential scenarios:
• A critical yet cryptic piece of information within memes comes from the visuals, which typically requires some systematic integration of factual knowledge that MIME currently lacks.
• Insufficient textual cues pose challenges for MIME in learning the required contextual associativity.
• Potentially spurious pieces of evidence are picked up due to lexical biasing within the related context.

Ethics and Broader Impact
Reproducibility. We present detailed hyperparameter configurations in Appendix A and Table 7. The source code and MCC dataset are publicly shared at https://github.com/ LCS2-IIITD/MEMEX_Meme_Evidence.git.
Data Collection. The data collection protocol was duly approved by an ethics review board.
User Privacy. The information depicted/used does not include any personal information.
Terms and Conditions for data usage. We performed basic image editing (c.f. Section 6) on the meme images downloaded from the Internet and used for our research. This ensures non-usage of the artwork/content in its original form. Moreover, we have already included details of the subreddits and keywords used to collect meme content and the sources used for obtaining contextual document information as part of Appendix B.1, Section 3.2, and Figure 4d. Since our dataset (MCC) contains material collected from various web-based sources in the public domain, the applicable copyright and privacy guidelines are as specified by these corresponding sources, a few of which are as follows:
• Wikipedia: Text of Creative Commons Attribution-ShareAlike 3.0.14
• Quora: License and Permission to Use Your Content, Section 3(c).15
• Reddit Privacy Policy: Personal information usage and protection.16
• Reddit Content Policy.17
Future adaptations or continuations of this work would be required to adhere to the policies prescribed herein.
Biases. Any biases found in the dataset are unintentional, and we do not intend to cause harm to any group or individual. We acknowledge that memes can be subjective, and thus it is inevitable that there would be biases in our gold-labeled data or the label distribution. This is addressed by working on a dataset created using a diverse set of topics and following a well-defined annotation scheme, which explicitly characterizes meme-evidence association.
Misuse Potential. The possibility of deducing relevant contextual, fact-oriented evidence might enable miscreants to modulate the expression of harm against a social entity and convey the intended message within a meme in an implicit manner. This could be aimed at fooling regulatory moderators, who could potentially be utilizing a solution like the one proposed to contextualize memes, as such intelligently designed memes might not derive suitable contextual evidence that easily. As a consequence, the miscreants could end up successfully hindering the overall moderation process. Additionally, our dataset could potentially be used for ill-intended purposes, such as biased targeting of individuals/communities/organizations, etc., that may or may not be related to demographics and other information within the text. Intervention via human moderation would be required to ensure this does not occur.
Intended Use. We curated MCC solely for research purposes, in-line with the associated usage policies prescribed by various sources/platforms. This applies in its entirety to its further usage as well. We will distribute the dataset for research purposes only, without a license for commercial use. We believe that it represents a valuable resource when used appropriately.
Environmental Impact. Finally, large-scale models require a lot of computations, which contribute to global warming (Strubell et al., 2019). However, in our case, we do not train such models from scratch; instead, we fine-tune them on a relatively small dataset.

A Implementation Details and Hyperparameter values
We train all the models using Pytorch on an NVIDIA Tesla V100 GPU with 32 GB dedicated memory, CUDA-11.2, and cuDNN-8.1.1 installed.
For the unimodal models, we import all the pretrained weights from the torchvision.models subpackage of the PyTorch framework. We randomly initialize the remaining weights using a zeromean Gaussian distribution with a standard deviation of 0.02. We primarily perform manual fine-tuning, over five independent runs, towards establishing an optimal configuration of the hyper-parameters involved. Finally, we train all models we experiment with using the Adam optimizer and a binary cross entropy loss as the objective function.

B.1 Meme Collection
We use carefully constructed search queries for every category to obtain relevant memes from the Google Images search engine18. Towards searching variants for topics related to Joe Biden, some of the search queries used were "Joe Biden Political Memes", "Joe Biden Sexual Allegation Memes", "Joe Biden Gaffe Memes", and "Joe Biden Ukraine Memes", among others; for memes related to Hillary Clinton, we had "Hillary Clinton Email Memes", "Hillary Clinton Bill Clinton Memes", "Hillary Clinton US Election Memes", "Hillary Clinton President Memes", etc. For crawling and downloading these images, we use Selenium19, a Python framework for web browser automation.
Additionally, for certain categories, we also crawl memes off Reddit. Specifically, we focus on the r/CoronavirusMemes, r/PoliticalHumor, and r/PresidentialRace subreddits. Instead of using the Python Reddit API Wrapper (PRAW), we use the Pushshift API20, which has no limit on the number of memes crawled. We crawl all memes for coronavirus from 1st November 2019 to 9th March 2021. For Biden, Trump, etc., we crawl memes from the other two subreddits and use a subset of the overall set of search queries. After scraping all possible memes, we perform de-duplication using dupeGuru21, a cross-platform GUI tool to find duplicate files in a specified directory. This eliminates intra- and inter-category overlaps. We then remove any meme which is unimodal, i.e., memes having only images (c.f. Fig. 3 (c)) or text-only blocks (c.f. Fig. 3 (a)). Additionally, to ensure further tractability of our setup, we manually filter out code-mixed (c.f. Fig. 3 (b)) and code-switched memes, as well as memes in languages other than English; annotating multilingual memes is a natural extension of our work. We further segregate memes that contain cartoons/animations (c.f. Fig. 3 (d)), and also filter out memes with poor image quality, low resolution, etc.

18 https://images.google.com/
19 https://github.com/SeleniumHQ/selenium
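A minimal sketch of how such a Pushshift-based crawl might look; the endpoint and parameters follow the public Pushshift submission-search API, and everything beyond what is stated above (pagination strategy, filtering by file extension) is an assumption.

```python
# Hedged sketch of a Pushshift-based crawl for meme submissions from a
# subreddit within a date range; the actual crawler script is an assumption.
import time
import requests

PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"

def crawl_subreddit(subreddit: str, after: int, before: int, size: int = 100):
    """Yield image URLs from `subreddit` between two epoch timestamps."""
    while after < before:
        params = {"subreddit": subreddit, "after": after,
                  "before": before, "size": size, "sort": "asc"}
        resp = requests.get(PUSHSHIFT_URL, params=params, timeout=30)
        posts = resp.json().get("data", [])
        if not posts:
            break
        for post in posts:
            url = post.get("url", "")
            if url.endswith((".jpg", ".jpeg", ".png")):
                yield url
        after = posts[-1]["created_utc"]   # paginate by timestamp
        time.sleep(1)                      # be polite to the API
```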

B.2 Context Document Curation
There might be scenarios where a Wiki page might not contain valid evidence about the information being conveyed within the meme. Since the primary objective of this study is to investigate and model multimodal contextualization for memes, we initially mine Wiki documents for topics like 'politics' or 'history', for which memes are present online in abundance, thereby leveraging the diversity and comprehensiveness facilitated by both the availability of memes and the exhaustive information on a corresponding valid Wiki page. In order to induce generalizability across topics, types of memes, and context sources, we consider various topics (c.f. Appendix B.1) and associated memes, and mine the relevant (standard) online information sources (c.f. Fig. 4d) towards curating the corresponding context document by performing a Google search for the scenarios where a valid meme-Wiki combination did not hold.

B.3 Annotation Process
Two annotators annotated the dataset. One of the annotators was male, while the other was female, and their ages ranged from 24 to 35. Moreover, both of them were professional lexicographers, researchers and social media savvy. Before starting the annotation process, they were briefed on the task using detailed guidelines.
For performing annotations, we build an annotation platform using JQuery 22 and Flask 23 . A screenshot of the platform is given in Fig. 5. The status of the annotation is displayed at the top. It shows a "nan" since the image has not been saved yet; after saving, the status is updated to "updated". Below the status, the meme is displayed. There are three text boxes: the first interactive text box is for the OCR text (the annotators can correct and edit the text returned by the OCR pipeline). The other two text boxes are for the offsets and the selected text.

B.4 Analysis and description of MCC
It can be observed from Fig. 4d that the highest proportion is from Wikipedia-based sources, followed by smaller proportions for the alternatives explored, like Quora, Britannica, Times of India, etc. Additionally, the word cloud depicted in Fig. 4c suggests that most memes are about prominent US politicians, history, and elections. Also, the context length distribution, as depicted in Fig. 4a, suggests an almost normally distributed context length (in characters), with very few contexts having lengths less than ≈100 or more than ≈800 characters. Fig. 4b, in turn, depicts the evidence length distribution, according to which most pieces of evidence contain fewer than 400 characters. This corroborates the brevity of the annotated pieces of evidence from diverse contexts.

B.5 Thematic Analysis from Meme Text
We perform a thematic analysis of the memetic content using just the text embedded within memes. We take the OCR-extracted meme text and project the top-20 topics using BERTopic (Grootendorst, 2022), a neural topic modeling approach with a class-based TF-IDF procedure. We depict 0-based topic indexes and thematic keywords, such as 0-History, 1-Covid-19, 2-Politics, 3-War with Japan, etc., in Fig. 6. These topics are collectively referenced and described via the most likely keywords appearing for that particular topic. This depiction also highlights how generalizable our proposed approach is in optimally detecting accurate evidence for various topics within a given related context. Besides the different high-level topics, MCC also captures the diversity of sub-topics: except for a few topics, like Topics 15 and 18, reasonably diverse memes can be found in MCC.
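A minimal sketch of this analysis, assuming the bertopic package; the input list stands in for the full set of OCR-extracted meme texts, since BERTopic needs a reasonably large corpus to fit meaningfully.

```python
# Minimal sketch of the BERTopic-based thematic analysis described above.
from bertopic import BERTopic

def thematic_analysis(meme_texts: list[str]):
    """meme_texts: OCR-extracted meme texts for the whole corpus."""
    topic_model = BERTopic(nr_topics=20)          # keep roughly the top-20 topics
    topics, probs = topic_model.fit_transform(meme_texts)
    return topic_model.get_topic_info()           # topic indexes and c-TF-IDF keywords

# Usage: thematic_analysis(all_ocr_texts), where all_ocr_texts holds MCC's meme texts.
```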

C Comparing contexts from KYM and MIME
Here, we compare the insights available on knowyourmeme.com (also referred to as KYM) and the ones generated by our proposed modeling framework, MIME, about a particular meme. For