VL-BERT+: Detecting Protected Groups in Hateful Multimodal Memes

This paper describes our submission (winning solution for Task A) to the Shared Task on Hateful Meme Detection at WOAH 2021. We build our system on top of a state-of-the-art system for binary hateful meme classification that already uses image tags such as race, gender, and web entities. We add further metadata such as emotions and experiment with data augmentation techniques, as hateful instances are underrepresented in the data set.


Introduction
In this work, we present our submission to the Shared Task on Hateful Memes at WOAH 2021: Workshop on Online Abuse and Harms.[1] Detecting hateful memes that combine visual and textual elements is a relatively new task (Kiela et al., 2020). However, research can build on earlier work on the classification of hateful, abusive, or offensive textual statements targeting individuals or groups based on gender, nationality, or sexual orientation (Basile et al., 2019; Burnap and Williams, 2014).

Shared Task Description
We only tackle Task A: predicting fine-grained labels for the protected categories attacked in the memes, namely RACE, DISABILITY, RELIGION, NATIONALITY, and SEX. The memes are provided in a multi-label setting. Table 1 (columns: Labels, Train, Dev, %) shows the label distribution of the provided data set.[2]

[*] Equal contribution of the first two authors
[1] https://www.workshopononlineabuse.com/cfp/shared-task-on-hateful-memes
[2] In the data set, memes are labeled as PC EMPTY if they are not hateful and none of the protected categories applies. In this paper, we use NONE instead of PC EMPTY for better intuition.

Our System

Our system is built on top of the winning system (Zhu, 2020) of the Hateful Memes Challenge (Kiela et al., 2020), which was a binary hateful meme detection task. Zhu (2020) fine-tuned a visual-linguistic transformer-based pre-trained model called VL-BERT LARGE and showed that metadata of meme images such as race, gender, and web entity tags (textual tags recommended for the image based on data collected from the web) improved the performance of the hateful meme classification system. We replicate this system for a more fine-grained categorization of hateful memes, as proposed by the current shared task. Considering the data scarcity in this novel task, we also propose several data augmentation strategies and examine their effects on our classification problem. The evaluation metric used by the shared task is the (micro-averaged) area under the receiver operating characteristic curve (AUROC).
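As a hedged illustration of the shared task metric, the micro-averaged AUROC over the five protected-category labels can be computed with scikit-learn as follows; the label names follow the paper, while the toy scores below are made-up examples, not data from the task.

```python
# Micro-averaged AUROC over multi-label predictions (toy illustration).
import numpy as np
from sklearn.metrics import roc_auc_score

LABELS = ["RACE", "DISABILITY", "RELIGION", "NATIONALITY", "SEX"]

def micro_auroc(y_true, y_score):
    """y_true: (n_memes, 5) binary matrix; y_score: (n_memes, 5) probabilities.

    Micro-averaging flattens all label decisions before computing the curve,
    so frequent labels contribute proportionally more.
    """
    return roc_auc_score(np.asarray(y_true), np.asarray(y_score), average="micro")

# made-up example: three memes, one gold label each
y_true = [[1, 0, 0, 0, 0],
          [0, 0, 1, 0, 0],
          [0, 0, 0, 0, 1]]
y_score = [[0.9, 0.1, 0.2, 0.1, 0.1],
           [0.2, 0.1, 0.8, 0.3, 0.1],
           [0.1, 0.2, 0.1, 0.2, 0.7]]
print(micro_auroc(y_true, y_score))  # -> 1.0 (every positive outranks every negative)
```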
In addition, we consider emotion tags, which are extracted from facial expressions in the meme images. Based on experimental results and the shared task leaderboard scores, the inclusion of emotion tags along with the VL-BERT LARGE model equipped with race, gender, and web entity tags exhibits the best performance for Task A. We make our source code publicly available.[3]

Figure 1: Image pre-processing: recovering the original image of the meme. (a) Original meme image (b) EasyOCR masking (c) Image inpainting

Related Work
Multi-modal hateful meme detection is the task of identifying hate in the combination of textual and visual information.
Textual Information In most previous work, hate speech detection has been performed on text alone. Despite many challenges (Vidgen et al., 2019), several automatic detection systems have been developed to filter hateful statements (Waseem et al., 2017; Benikova et al., 2017; Wiegand et al., 2018; Kumar et al., 2018; Nobata et al., 2016; Aggarwal et al., 2019). One state-of-the-art model is BERT (Devlin et al., 2019), a contextualized, transformer-based (Vaswani et al., 2017) pre-trained language model that can be fine-tuned for downstream applications such as hate speech classification.
Visual Information For hateful meme classification, the Facebook challenge team[4] proposed unimodal training, where a ResNet (He et al., 2015) encoder is used for image feature extraction. Apart from this, there has been a plenitude of work on extracting information from images that is potentially useful for hateful meme detection. Image processing systems such as Faster R-CNN or Inception V3 (Ren et al., 2016; Szegedy et al., 2015) are useful for detecting objects in images. Smith (2007) and EasyOCR[5] can optically recognize the text embedded in an image. Recent work (2021) extracted visual-linguistic relationships by introducing cross-attention networks between textual transformers and transformers trained on visual features. Such networks deliver promising results on a variety of visual-linguistic tasks such as Image Captioning, Visual Question Reasoning (VQR), and Visual Commonsense Reasoning (VCR). Zhu (2020) and Lippe et al. (2020) exploited these networks for the binary classification of memes as hateful or non-hateful. The incorporation of additional metadata such as race, gender, and web entity tags, extracted from meme images, increased performance significantly in hateful meme classification (Zhu, 2020). Hitherto, meme classification, introduced only recently, has been treated as a binary task. Except for the VisualBERT-based baseline[6] provided by the WOAH 2021 Shared Task, to our knowledge, there has been no work on detecting protected groups in hateful memes.

System Description
In this paper, we exploit the analysis proposed by Zhu (2020) for the fine-grained categorization of hateful memes.

Pre-processing
Both the visual and the textual parts of the memes are pre-processed. The data provided by the shared task consist of memes with their corresponding meme text. In this paper, we follow the steps proposed by Zhu (2020) to pre-process the provided input memes.
Text Pre-processing For text pre-processing, a BERT-based tokenizer (Devlin et al., 2019) is applied. This is also an integral part of the VL-BERT LARGE system (Su et al., 2020) (see Section 3.3).
Image Pre-processing The image part of the memes poses several challenges. First, meme images may consist of multiple sub-images, so-called patches. In this case, we segregate these patches using an image processing toolkit (Chen et al., 2019). Second, the text embedded in the images may add noise to the image features. Therefore, we aim to recover the original meme image before the text was added. To do so, we first apply EasyOCR-based Optical Character Recognition, which results in an image with black masked regions corresponding to the meme text, as shown in Figure 1b. Then, inpainting, a process in which damaged, deteriorating, or missing parts are filled in to present a complete image, is applied to these regions using the MMEditing Tool (Contributors, 2020) (see Figure 1c).

Metadata
Understanding memes often requires implicit knowledge (e.g. cultural prejudice, clichés, historical knowledge) that human readers must have to understand the content. Such knowledge might be of great help to the classifier if provided explicitly. Zhu (2020) used meme image metadata, such as race, gender, and web entity tags, to enhance binary classification performance on hateful memes. We utilize the same metadata and, in addition, emotion tags for the fine-grained categorization into protected groups.

[6] https://github.com/facebookresearch/mmf/tree/master/projects/hateful_memes/fine_grained

Race and Gender We apply the pre-trained FairFace (Karkkainen and Joo, 2021) model to the provided meme images to extract the bounding boxes of detected faces with their corresponding race and gender metadata.
Web Entities Web entities are web-recommended textual tags associated with an image. They add contextual information to the images, making it easier for the model to establish the relationship between the meme text and image. We use Google's Web Entity Detection service[7] to extract these web entities.
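A possible realization of this step with the Google Cloud Vision API is sketched below. The API call requires credentials and network access, so it is shown but not exercised; `fetch_web_entities` and `top_entities` are hypothetical helper names, and the score threshold is an assumed parameter, not one from the paper.

```python
# Sketch: extract web-entity tags for a meme image via Google Cloud Vision.

def fetch_web_entities(image_path):
    """Return (description, score) pairs for one meme image.

    Requires `pip install google-cloud-vision` and configured credentials.
    """
    from google.cloud import vision  # imported lazily; needs credentials
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.web_detection(image=image)
    return [(e.description, e.score) for e in response.web_detection.web_entities]

def top_entities(entities, k=5, min_score=0.5):
    """Keep the k highest-scoring non-empty tags above a confidence threshold."""
    kept = [(d, s) for d, s in entities if d and s >= min_score]
    return [d for d, _ in sorted(kept, key=lambda x: -x[1])[:k]]
```

The filtered tags would then be appended to the textual input of the classifier alongside the meme text.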
Emotion Emotions are promising features for hate speech detection (Martins et al., 2018). Awal et al. (2021) investigated the positive impact of emotions in textual hate speech detection, where emotion features are shared using a multi-task learning network. We exploit this in our system by extracting emotions based on facial expressions in the meme image, together with their corresponding bounding boxes. For this purpose, we use a Python-based emotion detection API[8] which classifies a face into the seven universal emotions described by Ekman (1992): ANGER, FEAR, DISGUST, HAPPINESS, SADNESS, SURPRISE, and CONTEMPT.

VL-BERT

VL-BERT LARGE (Su et al., 2020) demonstrates state-of-the-art performance on binary hateful meme classification (Zhu, 2020). Therefore, we investigate it for the detection of protected groups in hateful memes. VL-BERT LARGE is a transformer-backboned (Vaswani et al., 2017) visual-linguistic model pre-trained on the Conceptual Captions data set (Sharma et al., 2018) and further text corpora (Zhu et al., 2015). It provides generic representations for visual-linguistic downstream tasks. To extract visual region-of-interest features, we use Google's Inception V2 Object Detection model.[9] We extract features from both modalities (image and text) in the provided data set to fine-tune the pre-trained VL-BERT LARGE representation. Afterward, these features are used to train a multi-layer feedforward network (also called a downstream network) that yields the final classifier. We train the model for a maximum of 10 epochs with the other default hyperparameters provided by Su et al. (2020).
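The per-face emotion tagging described in the Emotion paragraph can be sketched as follows. Since the concrete API (footnote 8) is not named here, the detector itself is left abstract; only the tag-selection step is shown, with `dominant_emotion` and `emotion_tags` as hypothetical helper names.

```python
# Sketch: turn per-face emotion scores into a single tag per detected face.
EMOTIONS = ["ANGER", "FEAR", "DISGUST", "HAPPINESS", "SADNESS", "SURPRISE", "CONTEMPT"]

def dominant_emotion(scores):
    """Map a face's emotion-score dict to the single highest-scoring tag."""
    return max(scores, key=scores.get)

def emotion_tags(faces):
    """faces: list of (bounding_box, scores) pairs from any face-emotion
    detector; returns one (bounding_box, tag) pair per face, which is then
    fed to VL-BERT alongside the race/gender tags."""
    return [(box, dominant_emotion(scores)) for box, scores in faces]
```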

Data Augmentation
Data scarcity often leads to model overfitting. As shown in the training set distribution in Table 1, non-hateful memes comprise the majority of the data set. This non-uniform label distribution leaves comparatively few training samples for the protected categories. Therefore, we artificially augment the samples labeled with the protected groups. For image augmentation, we use the image augmentation toolkit by Jung et al. (2020), which alters images by adding effects like blur, noise, hue/saturation changes, etc. Additionally, we use Google's Web Entity Detection service to obtain visually similar images. For text augmentation, we generate semantically related statements using nlpaug (Ma, 2019). Furthermore, since we have original and augmented versions of images and texts, we combine them in three different ways: i) the original image with augmented text, ii) the augmented image with the original text, and iii) the augmented image with augmented text (see Figure 2).

[9] https://tfhub.dev/google/faster_rcnn/openimages_v4/inception_resnet_v2/1
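The three combinations i)-iii) can be sketched as a small pure function, with the imgaug/nlpaug calls shown only as usage comments (their exact augmenter configuration is an assumption, not taken from the paper); `combine_pairs` is a hypothetical helper name.

```python
# Sketch of the combination step from Figure 2.
def combine_pairs(image, text, aug_image, aug_text):
    """Yield the three original/augmented combinations used to extend
    an under-represented class."""
    return [
        (image, aug_text),      # i)   original image + augmented text
        (aug_image, text),      # ii)  augmented image + original text
        (aug_image, aug_text),  # iii) augmented image + augmented text
    ]

# producing the augmented inputs (library downloads may be needed):
#   import imgaug.augmenters as iaa
#   import nlpaug.augmenter.word as naw
#   aug_image = iaa.Sequential([iaa.GaussianBlur((0, 1.0)),
#                               iaa.AddToHueAndSaturation((-20, 20))])(image=image)
#   aug_text = naw.SynonymAug(aug_src="wordnet").augment(text)[0]
```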

Ensemble
The predictions of a single system may not generalize to unseen data due to high variance, bias, etc. Relying on multiple systems can mitigate these issues. Therefore, we choose our best three systems based on their AUROC scores and apply a majority voting scheme to the prediction labels provided by each system. The label with the highest number of votes is selected as the final prediction of the ensemble system. In cases where all systems disagree, we choose the label with the highest prediction probability.

Results and Discussion

Table 2 shows the results for Task A on the provided development data set. We also compare our results with the VisualBERT-based baseline provided by the shared task organizers. Among the different configurations of our system, the VL-BERT LARGE model with race, gender, emotion, and web entity tags (called +W,RG,E in the table) achieves the best AUROC score. We find that the inclusion of emotion tags has a positive effect on the overall performance compared to the other systems. To analyze statistical significance among the approaches, we apply the Bowker test (Bowker, 1948) to contingency matrices built from the number of agreements and disagreements between the systems. To compensate for chance significance, we apply the Bonferroni correction (Abdi, 2007) to the p-value. Approaches marked with * differ statistically significantly from the best-performing solution.
When the model is trained on the train set together with the augmented data, hardly any performance improvement is observed, contrary to our expectations. Analyzing the approaches with image and text augmentation (IT|+W and IT|+W,RG; statistically significantly different from the best-performing system), we find a notable increase in False Negative errors, especially for RELIGION. In post-experiment analysis, we find that the predictions for the DISABILITY and RELIGION labels are better than for the others when the model operates at a low False Positive rate, whereas NATIONALITY performs relatively well at a high False Positive rate (see Figure 3). From the confusion matrices (Table 3), we find that False Negatives dominate in all classes. We believe that class imbalance is responsible for this behavior. To verify this, we train models on an undersampled training set and find significant improvement on labels with low sample size, but also a large performance drop on the NONE label.
For the final submission, we generate predictions on the test set using our two best-performing models based on their AUROC scores: VL-BERT LARGE +W,RG,E (winning solution) and +W,RG (2nd rank); see Table 2 for the shared task leaderboard scores.
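As a concrete illustration of the voting scheme from the Ensemble section, the majority vote with its probability tie-break can be sketched as follows (shown per label for simplicity, even though the task is multi-label; `ensemble_vote` is a hypothetical helper name):

```python
# Sketch: three-system majority vote with highest-probability fallback.
from collections import Counter

def ensemble_vote(predictions):
    """predictions: list of (label, probability) pairs, one per system."""
    counts = Counter(label for label, _ in predictions)
    label, votes = counts.most_common(1)[0]
    if votes > 1:  # at least two systems agree
        return label
    # all systems disagree: fall back to the most confident prediction
    return max(predictions, key=lambda p: p[1])[0]

print(ensemble_vote([("RACE", 0.8), ("RACE", 0.6), ("SEX", 0.9)]))      # -> RACE
print(ensemble_vote([("RACE", 0.4), ("SEX", 0.7), ("RELIGION", 0.5)]))  # -> SEX
```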

Summary
In this paper, we presented our approach to identify and categorize attacked protected groups in hateful memes. We performed experiments using a visual-linguistic pre-trained model called VL-BERT LARGE along with metadata information extracted from the meme image and text. Results show that the inclusion of metadata helps to improve system performance. However, the final system still lacks a robust understanding of hateful memes targeting protected groups.