Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset

Hateful memes pose a unique challenge for current machine learning systems because their message is derived from both textual and visual modalities. To this end, Facebook released the Hateful Memes Challenge, a dataset of memes with pre-extracted text captions, but it is unclear whether these synthetic examples generalize to ‘memes in the wild’. In this paper, we collect hateful and non-hateful memes from Pinterest to evaluate the out-of-sample performance of models pre-trained on the Facebook dataset. We find that ‘memes in the wild’ differ in two key aspects: 1) captions must be extracted via OCR, injecting noise and diminishing the performance of multimodal models, and 2) memes are more diverse than ‘traditional memes’, including screenshots of conversations or text on a plain background. This paper thus serves as a reality check for the current benchmark of hateful meme detection and its applicability to detecting real-world hate.


Introduction
Hate speech is becoming increasingly difficult to monitor due to an increase in volume and a diversification of type (MacAvaney et al., 2019). To facilitate the development of multimodal hate detection algorithms, Facebook introduced the Hateful Memes Challenge, a dataset synthetically constructed by pairing text and images (Kiela et al., 2020). Crucially, a meme's hatefulness is determined by the combined meaning of image and text. The question of likeness between synthetically created content and naturally occurring memes is both an ethical and a technical one: any features of this benchmark dataset which are not representative of reality may cause models to overfit to 'clean' memes and generalize poorly to memes in the wild. Thus, we ask: how well do Facebook's synthetic examples (FB) represent memes found in the real world? We use Pinterest memes (Pin) as our example of memes in the wild and explore differences across three aspects:
1. OCR. While FB memes have their text pre-extracted, memes in the wild do not. Therefore, we test the performance of several Optical Character Recognition (OCR) algorithms on Pin and FB memes.
2. Text content. To compare text modality content, we examine the most frequent n-grams and train a classifier to predict a meme's dataset membership based on its text.
3. Image content and style. To compare image modality, we evaluate meme types (traditional memes, text, screenshots) and attributes contained within memes (number of faces and estimated demographic characteristics).
After characterizing these differences, we evaluate a number of unimodal and multimodal hate classifiers pre-trained on FB memes to assess how well they generalize to memes in the wild.

Background
The majority of hate speech research focuses on text, mostly from Twitter (Waseem and Hovy, 2016; Davidson et al., 2017; Founta et al., 2018; Zampieri et al., 2019). Text-based studies face challenges such as distinguishing hate speech from offensive speech (Davidson et al., 2017) and counter speech (Mathew et al., 2018), as well as avoiding racial bias (Sap et al., 2019). Some studies focus on multimodal forms of hate, such as sexist advertisements (Gasparini et al., 2018), YouTube videos (Poria et al., 2016), and memes (Suryawanshi et al., 2020; Zhou and Chen, 2020; Das et al., 2020). While the Hateful Memes Challenge (Kiela et al., 2020) encouraged innovative research on multimodal hate, many of the solutions may not generalize to detecting hateful memes at large. For example, the winning team (Zhong, 2020) exploits a simple statistical bias resulting from the dataset generation process. While the original dataset has since been re-annotated with fine-grained labels regarding the target and type of hate (Nie et al., 2021), this paper focuses on the binary distinction between hate and non-hate.

Pinterest Data Collection Process
Pinterest is a social media site which groups images into collections based on similar themes. The search function returns images based on user-defined descriptions and tags. Therefore, we collect memes from Pinterest using keyword search terms as noisy labels for whether the returned images are likely hateful or non-hateful (see Appendix A). For hate, we sample based on two heuristics: synonyms of hatefulness or specific hate directed towards protected groups (e.g., 'offensive memes', 'sexist memes') and slurs associated with these types of hate (e.g., 'sl*t memes', 'wh*re memes'). For non-hate, we again draw on two heuristics: positive-sentiment words (e.g., 'funny', 'wholesome', 'cute') and memes relating to entities excluded from the definition of hate speech because they are not a protected category (e.g., 'food', 'maths'). Memes are collected between March 13 and April 1, 2021. We drop duplicate memes, leaving 2,840 images, of which 37% belong to the hateful category.

Extracting Text and Image Modalities (OCR)
We evaluate the following OCR algorithms on the Pin and FB datasets: Tesseract (Smith, 2007), EasyOCR (Jaided AI), and East (Zhou et al., 2017). Previous research has shown the importance of prefiltering images before applying OCR algorithms (Bieniecki et al., 2007). Therefore, we consider two prefiltering methods fine-tuned to the specific characteristics of each dataset (see Appendix B).

Unimodal Text Differences
After OCR text extraction, we retain words with a probability of correct identification ≥ 0.5, and remove stopwords. A text-based classification task using a unigram Naïve-Bayes model is employed to discriminate between hateful and non-hateful memes of both Pin and FB datasets.
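The Naïve-Bayes setup can be sketched with scikit-learn. The captions and labels below are purely illustrative stand-ins for the OCR-extracted meme texts (with low-confidence words and stopwords already removed); the pipeline shape, not the data, is the point.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative captions; the real inputs are OCR-extracted meme texts
# with low-confidence words and stopwords already removed.
captions = ["love cute puppy", "happy friendship day",
            "group insult slur", "degrading stereotype joke"]
labels = [0, 0, 1, 1]  # 0 = non-hateful, 1 = hateful (noisy labels)

# Unigram bag-of-words features feeding a multinomial Naive-Bayes model.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(captions, labels)
print(clf.predict(["wholesome cute puppy meme"]))  # -> [0]
```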

Unimodal Image Differences
To investigate the distribution of types of memes, we train a linear classifier on image features from the penultimate layer of CLIP (see Appendix C) (Radford et al., 2021). From the 100 manually examined Pin memes, we find three broad categories: 1) traditional memes; 2) memes consisting of just text; and 3) screenshots. Examples of each are shown in Appendix C. Further, to detect (potentially several) human faces contained within memes and their relationship with hatefulness, we use a pre-trained FaceNet model (Schroff et al., 2015) to locate faces and apply a pre-trained DEX model (Rothe et al., 2015) to estimate their ages, genders, and races. We compare the distributions of these features between the hateful and non-hateful samples. We note that these models are controversial and may suffer from algorithmic bias due to differential accuracy rates across subgroups. Alvi et al. (2018) show that DEX contains erroneous age information, and Terhorst et al. (2021) show that FaceNet has lower recognition rates for female faces than for male faces. These are larger issues discussed within the computer vision community (Buolamwini and Gebru, 2018).

Comparison Across Baseline Models
To examine the consequences of differences between the FB and Pin datasets, we conduct a preliminary classification of memes into hate and non-hate using benchmark models. First, we take a subsample of the Pin dataset to match Facebook's dev dataset, which contains 540 memes, of which 37% are hateful. We compare performance across three samples: (1) FB memes with 'ground truth' text and labels; (2) FB memes with Tesseract OCR text and ground-truth labels; and (3) Pin memes with Tesseract OCR text and noisy labels. Next, we select several baseline models pre-trained on FB memes, provided in the original Hateful Memes Challenge (Kiela et al., 2020). Of the 11 pre-trained baseline models, we evaluate the performance of the five that do not require further preprocessing: Concat BERT, Late Fusion, MMBT-Grid, Unimodal Image, and Unimodal Text. We do not fine-tune these models on Pin memes but simply evaluate their transfer performance. Finally, we make zero-shot predictions using CLIP (Radford et al., 2021), and evaluate a linear model of visual features trained on the FB dataset (see Appendix D).

OCR Performance
Each of the three OCR engines is paired with one of the two prefiltering methods tuned to each dataset, forming a total of six pairs for evaluation. For both datasets, the methods are tested on 100 random images with manually annotated text. For each method, we compute the average cosine similarity of the joint TF-IDF vectors between the labelled and cleaned predicted text, shown in Tab. 1. Tesseract with FB tuning performs best on the FB dataset, while Easy with Pin tuning performs best on the Pin dataset. We evaluate transferability by comparing how a given pair performs on both datasets. OCR transferability is generally low, but greater from the FB dataset to the Pin dataset, despite the latter being more general than the former. This may be explained by the fact that the dominant form of Pin memes (i.e., text on a uniform background outside of the image) is not present in the FB dataset, so any method specifically optimized for Pin memes performs poorly on FB memes.
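The evaluation metric can be sketched as follows: fit a TF-IDF vectorizer jointly on the annotated and predicted text for a meme, then take the cosine similarity of the two vectors. This is a minimal version; the cleaning steps applied to the predicted text are omitted.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def ocr_similarity(gold: str, predicted: str) -> float:
    """Cosine similarity of joint TF-IDF vectors of gold and OCR text."""
    vecs = TfidfVectorizer().fit_transform([gold, predicted]).toarray()
    denom = np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1])
    return float(vecs[0] @ vecs[1] / denom) if denom else 0.0

# Identical texts score 1.0; texts with disjoint vocabularies score 0.0.
print(ocr_similarity("when you see it", "when you see it"))  # -> 1.0
```

Per-method scores are then averaged over the 100 annotated images of each dataset.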

Unimodal Text Differences
We compare unigrams and bigrams across datasets after removing stop words, numbers, and URLs. The bigrams are topically different (refer to Appendix E). A unigram Naïve-Bayes classifier is trained on each dataset separately to distinguish between the hateful and non-hateful classes. The model achieves an accuracy of 60.7% on FB memes and 68.2% on Pin memes (random guessing is 50%), indicating mildly different text distributions between hate and non-hate. To understand the differences in the type of language used in the two datasets, a classifier is trained to discriminate between FB and Pin memes (regardless of whether they are hateful) based on the extracted tokens. The accuracy is 77.4% on a balanced test set. This high classification performance might be explained by OCR-generated junk text in the Pin memes, which can be observed in a t-SNE plot (see Appendix F).

Unimodal Image Differences
While the FB dataset contains only "traditional memes", we find this definition of 'a meme' to be too narrow: the Pin memes are more diverse, with 15% containing only text and 7% being screenshots (see Tab. 2). Tab. 3 shows the facial recognition results. We find that Pin memes contain fewer faces than FB memes, while other demographic factors broadly match. The DEX model identifies similar age distributions by hate and non-hate and by dataset, with an average age of 30 and a gender distribution heavily skewed towards male faces (see Appendix G for additional demographics).

Surprisingly, we find that the CLIP Linear Probe generalizes very well, performing best on all three samples, with superior performance on Pin memes as compared to FB memes. Because CLIP has been pre-trained on around 400M image-text pairs from the Internet, its learned features generalize better to the Pin dataset, even though it was fine-tuned on the FB dataset. Of the multimodal models, Late Fusion performs best on all three samples. When comparing the performance of Late Fusion on the FB and Pin OCR samples, we find a significant drop in model performance of 12 percentage points. The unimodal text model performs significantly better on FB with the ground-truth annotations than on either sample with OCR-extracted text. This may be explained by the 'clean' captions, which do not generalize to real-world meme instances without pre-extracted text.

Discussion
The key difference in text modalities derives from the efficacy of the OCR extraction, where messier captions result in performance losses in Text BERT classification. This forms a critique of the way in which the Hateful Memes Challenge is constructed, in which researchers are incentivized to rely on the pre-extracted text rather than using OCR; thus, the reported performance overestimates success in the real world. Further, the Challenge defines a meme as 'a traditional meme' but we question whether this definition is too narrow to encompass the diversity of real memes found in the wild, such as screenshots of text conversations.
When comparing the performance of unimodal and multimodal models, we find that multimodal models have superior classification capabilities, which may be because the combination of multiple modes creates meaning beyond the text and image alone (Kruk et al., 2019). For all three multimodal models (Concat BERT, Late Fusion, and MMBT-Grid), the score for FB memes with ground-truth captions is higher than that for FB memes with OCR-extracted text, which in turn is higher than that for Pin memes. Finally, we note that CLIP's performance, for both zero-shot prediction and linear probing, surpasses the other models and is stable across both datasets.
Limitations

Despite presenting a preliminary investigation of the generalizability of the FB dataset to memes in the wild, this paper has several limitations. Firstly, the errors introduced by OCR text extraction resulted in 'messy' captions for Pin memes. This may explain why Pin memes could be distinguished from FB memes by a Naïve-Bayes classifier using text alone. However, these errors underline our key conclusion: the pre-extracted captions of FB memes are not representative of the pipelines required for real-world hateful meme detection.
Secondly, our Pin dataset relies on noisy labels of hate/non-hate based on keyword searches, but this chosen heuristic may not catch subtler forms of hate. Further, user-defined labels introduce normative value judgements of whether something is 'offensive' versus 'funny', and such judgements may differ from how Facebook's community standards define hate (Facebook, 2021). In future work, we aim to annotate the Pin dataset with multiple manual annotators for greater comparability to the FB dataset. These ground-truth annotations will allow us to pre-train models on Pin memes and also assess transferability to FB memes.

Conclusion
We conduct a reality check of the Hateful Memes Challenge. Our results indicate that there are differences between the synthetic Facebook memes and 'in-the-wild' Pinterest memes, with regard to both text and image modalities. Training and testing unimodal text models on Facebook's pre-extracted captions discounts the potential errors introduced by OCR extraction, which is required for real-world hateful meme detection. We hope to repeat this work once we have annotations for the Pinterest dataset and to expand the analysis from comparing between the binary categories of hate versus non-hate to include a comparison across different types and targets of hate.

A Details on Pinterest Data Collection
Tab. 5 shows the keywords we use to search for memes on Pinterest. The search function returns images based on user-defined tags and descriptions aligning with the search term (Pinterest, 2021). Each keyword search returns several hundred images on the first few pages of results. Note that Pinterest bans searches for 'racist' memes or slurs associated with racial hatred, so these could not be collected. We prefer this method of 'noisy' labelling over classifying the memes with existing hate speech classifiers, using the text as input, because users likely take the multimodal content of the meme into account when adding tags or writing descriptions. However, we recognize that user-defined labelling comes with its own limitation of introducing noise into the dataset through idiosyncratic interpretations of tags. We also recognize that the memes we collect from Pinterest do not represent all Pinterest memes, nor all memes on the Internet generally; rather, they reflect a sample of instances. Further, we over-sample non-hateful memes as compared to hateful memes because this distribution is one that is reflected in the real world; for example, the FB dev set is composed of 37% hateful memes. Lastly, while we manually confirm the noisy labels of 50 hateful and 50 non-hateful memes (see Tab. 6), we recognize that not all of the images accurately match the associated noisy label, especially for hateful memes, which must match the definition of hate speech as directed towards a protected category. Table 5: Keywords used to produce noisily-labelled samples of hateful and non-hateful memes from Pinterest.

Noisy Label | Keywords
Hate | "sexist", "offensive", "vulgar", "wh*re", "sl*t", "prostitute"
Non-Hate | "funny", "wholesome", "happy", "friendship", "cute", "phd", "student", "food", "exercise"

B.1 OCR Engines

East (Zhou et al., 2017) is an efficient deep learning algorithm for text detection in natural scenes. In this paper, East is used to isolate regions of interest in the image, in combination with Tesseract for text recognition. Figure 4 shows the dominant text patterns in the FB (a) and Pin (b) datasets, respectively. We use a specific prefiltering adapted to each pattern, as follows.

B.2 OCR Pre-filtering
FB Tuning: FB memes always have a black-edged white Impact font. The most efficient prefiltering sequence consists of applying an RGB-to-gray conversion, followed by binary thresholding, closing, and inversion.

Pin Tuning: Pin memes are less structured than FB memes, but a commonly observed meme type is text placed outside of the image on a uniform background. For this pattern, the most efficient prefiltering sequence consists of an RGB-to-gray conversion followed by Otsu's thresholding.
The optimal thresholds used to classify pixels in the binary and Otsu's thresholding operations are found so as to maximize the average cosine similarity of the joint TF-IDF vectors between the labelled and predicted text from a sample of 30 annotated images from both datasets.
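The Pin-tuned sequence (RGB-to-gray conversion followed by Otsu's thresholding) can be sketched in pure NumPy; this is a simplified stand-in for our actual implementation, which uses standard image-processing libraries.

```python
import numpy as np

def rgb_to_gray(img: np.ndarray) -> np.ndarray:
    """Luminance-weighted RGB-to-gray conversion."""
    return (img @ np.array([0.299, 0.587, 0.114])).astype(np.uint8)

def otsu_threshold(gray: np.ndarray) -> int:
    """Find the threshold maximizing between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = (np.arange(256) * hist).sum()
    w0 = sum0 = 0.0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0 += hist[t]              # weight of class 0 (pixels <= t)
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def prefilter_pin(img: np.ndarray) -> np.ndarray:
    """Binarize a meme image before OCR: grayscale, then Otsu."""
    gray = rgb_to_gray(img)
    return np.where(gray > otsu_threshold(gray), 255, 0).astype(np.uint8)
```

On a meme with dark text over a light uniform background, this yields a clean black-and-white image for the OCR engine.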

C.1 Data Preparation
To prepare the data needed for training the ternary classifier (i.e., over traditional memes, memes consisting purely of text, and screenshots), we manually annotate the Pin dataset to create a balanced set of 400 images. We split the set randomly, using 70% as training data and the remaining 30% as validation data. Figure 2 shows the main types of memes encountered. The FB dataset contains only traditional meme types.

C.2 Training Process
We use image features taken from the penultimate layer of CLIP. We train a neural network with two hidden layers of 64 and 12 neurons respectively with ReLU activations, using Adam optimizer, for 50 epochs. The model achieves 93.3% accuracy on the validation set.
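The classifier architecture (two hidden layers of 64 and 12 units, ReLU activations, Adam, 50 epochs) can be sketched with scikit-learn's MLPClassifier. The features below are random stand-ins for CLIP's 512-dimensional ViT-B/32 image features; in our pipeline they come from the frozen CLIP encoder.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Stand-ins for CLIP penultimate-layer features of the 400 annotated memes.
X = rng.normal(size=(400, 512)).astype(np.float32)
y = rng.integers(0, 3, size=400)  # 0=traditional, 1=text-only, 2=screenshot

clf = MLPClassifier(hidden_layer_sizes=(64, 12), activation="relu",
                    solver="adam", max_iter=50, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```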

D Classification Using CLIP

D.1 Zero-shot Classification
To perform zero-shot classification using CLIP (Radford et al., 2021), we use two prompts for every meme: "a meme" and "a hatespeech meme". We measure the similarity score between the image and text embeddings and use the most similar text prompt as the label. Note that we regard this method as neither multimodal nor unimodal: the text is not explicitly given to the model, but as shown by Radford et al. (2021), CLIP has some OCR capabilities. In future work we would like to explore how to modify the text prompts to improve performance.
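The decision rule reduces to comparing cosine similarities between the image embedding and the embeddings of the two prompts. A minimal sketch with placeholder 2-d embeddings follows; in practice the embeddings come from CLIP's image and text encoders.

```python
import numpy as np

PROMPTS = ["a meme", "a hatespeech meme"]

def zero_shot_label(image_emb: np.ndarray, prompt_embs: np.ndarray) -> str:
    """Return the prompt whose embedding is most similar to the image's."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    return PROMPTS[int(np.argmax(txt @ img))]

# Placeholder embeddings: this image is closer to the second prompt.
prompt_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
print(zero_shot_label(np.array([0.2, 0.9]), prompt_embs))  # -> a hatespeech meme
```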

D.2 Linear Probing
We train a binary linear classifier on the image features of CLIP on the FB train set, following the procedure outlined by Radford et al. (2021). Finally, we evaluate the binary classifier on the FB dev set and the Pin dataset.
In all experiments above we use the pretrained ViT-B/32 model.
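Linear probing amounts to fitting a logistic-regression classifier on frozen CLIP image features. A sketch with synthetic features standing in for CLIP's (the class-dependent shift is an artificial signal added only so the toy data is learnable):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic 512-d "CLIP features"; hateful memes shifted along all dims.
X_train = rng.normal(size=(200, 512))
y_train = rng.integers(0, 2, size=200)   # 0 = non-hateful, 1 = hateful
X_train[y_train == 1] += 0.5             # artificial separable signal

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(probe.score(X_train, y_train))     # training accuracy, close to 1.0
```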

F T-SNE Text Embeddings
The meme-level embeddings are calculated by (i) extracting a 300-dimensional embedding for each word in the meme, using fastText embeddings trained on Wikipedia and Common Crawl, and (ii) averaging all the embeddings along each dimension. A t-SNE transformation (Van Der Maaten and Hinton, 2008) is then applied to the full dataset, reducing it to a two-dimensional space. After this reduction, 1,000 text embeddings from each category (FB and Pin) are extracted and visualized, using the default perplexity parameter of 50. Fig. 3 presents the t-SNE plot, which shows a concentration of Pin meme embeddings in a region at the bottom of the figure. These memes are those containing nonsensical word tokens from OCR errors.

G Separating Multiple Faces

To evaluate memes with multiple faces, we develop a self-adaptive algorithm to separate faces. For each meme, we enumerate the position of a cutting line (either horizontal or vertical) with fixed granularity, and run facial detection models on both parts separately. If both parts have a high probability of containing faces, we decide that each part has at least one face. Hence, we cut the meme along the line and run this algorithm recursively on both parts. If no enumerated cutting line satisfies the condition above, we decide that there is only one face in the meme and terminate the algorithm.
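The face-separation step can be sketched as a recursion over candidate cutting lines. Here `detect_face_prob` is a hypothetical stand-in for the pre-trained detector (FaceNet in our pipeline), returning the probability that a region contains a face; the step size and threshold are illustrative.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right)

def split_faces(box: Box, detect_face_prob: Callable[[Box], float],
                step: int = 20, thresh: float = 0.9) -> List[Box]:
    """Recursively cut a region wherever both halves likely contain a face."""
    top, left, bottom, right = box
    # Enumerate horizontal cutting lines with fixed granularity.
    for y in range(top + step, bottom, step):
        a, b = (top, left, y, right), (y, left, bottom, right)
        if detect_face_prob(a) > thresh and detect_face_prob(b) > thresh:
            return (split_faces(a, detect_face_prob, step, thresh)
                    + split_faces(b, detect_face_prob, step, thresh))
    # Then enumerate vertical cutting lines.
    for x in range(left + step, right, step):
        a, b = (top, left, bottom, x), (top, x, bottom, right)
        if detect_face_prob(a) > thresh and detect_face_prob(b) > thresh:
            return (split_faces(a, detect_face_prob, step, thresh)
                    + split_faces(b, detect_face_prob, step, thresh))
    return [box]  # no valid cut: assume a single face and terminate
```

With a detector stub that fires whenever a region contains a known face center, a 200x100 region holding two vertically stacked faces is split into two sub-regions, each passed on to the demographic models separately.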