Findings of the WOAH 5 Shared Task on Fine Grained Hateful Memes Detection

We present the results and main findings of the shared task at WOAH 5 on hateful memes detection. The task includes two subtasks relating to distinct challenges in the fine-grained detection of hateful memes: (1) the protected category attacked by the meme and (2) the attack type. Three teams submitted system description papers. This shared task builds on the hateful memes detection task created by Facebook AI Research in 2020.


Introduction
The spread and impact of online hate is a growing concern across societies, and increasingly there is consensus that social media companies must do more to counter such content (League, 2020; Vidgen et al., 2021). At the same time, any interventions must be balanced with protecting people's freedom of expression and ability to engage in open discussions. Ensuring that online spaces are both open and safe requires being able to reliably and accurately find, rate and remove harmful content such as hate. Scalable machine learning based solutions offer a powerful way of solving this problem, reducing the burden on human moderators.
To date, detecting online hate has proven remarkably difficult and concerns have been raised about the performance, robustness, generalizability and fairness of even state-of-the-art models (Waseem et al., 2018; Vidgen et al., 2019; Caselli et al., 2020b; Mishra et al., 2019; Davidson et al., 2019). To advance the field, and develop models which can be used in real-world settings, research needs to go beyond simple binary classifications of textual content. To this end, we have used trained professional moderators to reannotate the hateful memes dataset from Kiela et al. (2020). It contains two sets of labels, which correspond to our two sub-tasks: the protected category that has been attacked (e.g., women, black people, immigrants) as well as the type of attack (e.g., inciting violence, dehumanizing, mocking the group).
Detecting hateful memes is a particularly challenging task because the content is multi-modal rather than uni-modal (text or images alone). When humans look at memes they do not think about the words and photos independently but, instead, combine the two. In contrast, most AI detection systems analyze text and image separately and do not learn a joint representation. This is inefficient and limits the performance of systems. They are likely to fail when an image that by itself is non-hateful is combined with non-hateful text to produce content that expresses hate through the interaction of the image and text. For AI to detect hate communicated through multiple modalities, it must learn to understand content the way that people do: holistically. In this paper we present the results of the WOAH 5 shared task on fine-grained hateful memes detection.

Dataset Size
The dataset we present for the shared task is from phase 1 of the hateful memes challenge (Kiela et al., 2020). Table 1 shows the distribution and data splits associated with the released dataset. We reannotated the hateful memes for the two fine-grained categories (protected category and attack type). For the non-hateful memes we assigned a label of 'none' for both categories.

Dataset Labels
Each meme was originally labelled as 'Hateful' or 'Not Hateful' by Kiela et al. (2020). Hate is a contested concept and there is no generally agreed upon definition or taxonomy in the field (Caselli et al., 2020a; Waseem et al., 2017; Zampieri et al., 2019). For the purposes of this work, hate is defined as a direct attack against people based on 'protected characteristics'. Protected characteristics are core aspects of a person's social identity which are generally fixed or immutable. Table 2 provides the set of fine-grained labels for protected classes and attack types.

Annotations
Each hateful meme was annotated by three annotators for the protected characteristic and the attack type (from the set defined in Table 2). If no clear protected group or attack type could be identified the annotator could select "not sure". Annotators were allowed to select multiple labels for both the protected characteristic and attack type.
Since our annotation is multi-label, we computed Krippendorff's α, which supports multiple annotators as well as multi-label agreement computation (Krippendorff, 2018). We obtain Krippendorff's α = 0.77 for the protected categories and α = 0.66 for attack types, indicating that while there is some uncertainty, agreement is within the usable range, i.e., α ≥ 0.66 (Krippendorff, 2004). This indicates 'moderate' to 'strong' agreement (McHugh, 2012) and compares favourably with other abusive content datasets (Gomez et al., 2020; Fortuna and Nunes, 2018; Wulczyn et al., 2017), especially given that our labels contain five and seven levels respectively. We used a majority voting scheme to decide the final labels from the annotations.
(Pedregosa et al., 2011). 5 . We used the same splits from the original dataset as described in Table 1. Participants had access to the train, dev seen and dev unseen splits for developing and tuning their models. The final evaluation was done on the test seen split. The ground truth labels were not provided at the time of submission, and each participant was expected to submit their predictions with model scores. Each participant was limited to a maximum of two submissions per task.
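The majority voting over multi-label annotations mentioned above can be illustrated with a minimal sketch. It assumes that each annotator supplies a list of labels per meme and that a label is kept when more than half of the annotators selected it. The function name `majority_vote` and the choice to return an empty list when no label reaches a majority are assumptions made for illustration; how such cases were actually resolved is not specified here.

```python
from collections import Counter


def majority_vote(annotations):
    """Return the labels chosen by more than half of the annotators.

    `annotations` holds one list of labels per annotator for a single meme.
    Returning an empty list when nothing reaches a majority is an assumption.
    """
    # Count each label at most once per annotator.
    counts = Counter(label for labels in annotations for label in set(labels))
    threshold = len(annotations) / 2
    return sorted(label for label, n in counts.items() if n > threshold)


# Three annotators, multi-label selections for one meme.
example = [["religion"], ["religion", "race"], ["race", "religion"]]
final_labels = majority_vote(example)
```

With three annotators, a label survives when at least two of them selected it, so both labels in the example are kept.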

System Descriptions
Majority Baseline A simple majority decision rule, applied over the entire dataset. We predict the majority class for all instances, i.e. "pc empty" for Task A and "attack empty" for Task B.
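A majority baseline of this kind can be sketched in a few lines. The sketch below assumes that each instance carries a set of labels and that the most frequent label set is predicted for every instance. The helper names `fit_majority` and `predict_majority`, the toy data, and the label spelling `pc_empty` are illustrative assumptions, not taken from the released evaluation code.

```python
from collections import Counter


def fit_majority(label_sets):
    """Find the most frequent label set in the data."""
    counts = Counter(frozenset(labels) for labels in label_sets)
    majority, _ = counts.most_common(1)[0]
    return sorted(majority)


def predict_majority(majority, n_items):
    """Predict the same majority label set for every instance."""
    return [list(majority) for _ in range(n_items)]


# Toy Task A labels; "pc_empty" stands in for the empty protected category.
labels = [["pc_empty"], ["pc_empty"], ["religion"], ["race", "sex"], ["pc_empty"]]
majority = fit_majority(labels)
predictions = predict_majority(majority, 3)
```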
VisualBERT Baseline A VisualBERT multimodal model that has been pre-trained on MS COCO (Lin et al., 2014). We use the setup in MMF (Singh et al., 2020) to pre-train the models. Each task is trained and evaluated independently (see footnote 6). VisualBERT was also used in the original hateful memes paper by Kiela et al. (2020), although here we set it up for multilabel detection.

Table 2: Fine-grained labels and definitions for protected categories and attack types.

Protected Category: Definition
Religion: A group defined by a shared belief system
Race: A group defined by similar, distinct racialised physical characteristics
Sex: A group defined by their physical sexual attributes or sexual identifications
Nationality: A group defined by the country/region they belong to
Disability: A group defined by conditions that generally lead to permanent dependencies (on people, medical treatments or equipment)

Attack Type: Definition
Dehumanizing: Explicitly or implicitly describing or presenting a group as subhuman
Inferiority: Claiming that a group is inferior, less worthy or less important than either society in general or another group
Inciting violence: Explicitly or implicitly calling for harm to be inflicted on a group, including physical attacks
Mocking: Making jokes about, undermining, belittling, or disparaging a group
Contempt: Expressing intensely negative feelings or emotions about a group
Slurs: Using prejudicial terms to refer to, describe or characterise a group
Exclusion: Advocating, planning or justifying the exclusion or segregation of a group from all of society or certain parts

4 Note that the characterisation and definition of some protected categories, such as race, is highly contested. For further analysis of the concept of 'race' see Omi and Winant (2005).
5 The evaluation script and fine-grained labels are available at https://github.com/facebookresearch/fine_grained_hateful_memes
Duisburg-Essen System 1 (LTL-UDE1) The solution builds on the multimodal approach used for the winning entry in the hateful memes challenge (Zhu, 2020): a VLBERT multimodal model with image-specific metadata. It was fine-tuned on the fine-grained data. The system was only submitted for Task A.
Duisburg-Essen System 2 (LTL-UDE2) An additional emotion tags are added to DE1 which are extracted from the facial expressions of persons objects available in the meme image.
The system was only submitted for Task A.

Queen Mary University London (QMUL) The submitted system is a multimodal model that uses the CLIP image encoder (Radford et al., 2021) to embed the meme images, and the CLIP text encoder, LASER (Artetxe and Schwenk, 2019) and LaBSE (Feng et al., 2020) to embed the meme text. All the representations are concatenated, and a multi-label logistic regression classifier is trained, one for each task, to predict the labels.
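The concatenate-then-classify design described for the QMUL system can be sketched as follows. This is a toy illustration under stated assumptions, not the team's implementation: the tiny hand-written vectors stand in for CLIP, LASER and LaBSE embeddings, the multi-label classifier is assumed to be one independent binary logistic regression per label, and all function names are hypothetical. A real system would rely on a library implementation of logistic regression; the plain-Python training loop only keeps the example self-contained.

```python
import math


def concat_features(*views):
    """Concatenate per-meme feature vectors coming from several encoders."""
    return [sum((list(vec) for vec in row), []) for row in zip(*views)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_binary_logreg(X, y, lr=0.5, epochs=500):
    """Fit one binary logistic regression with stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - target  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def train_multilabel(X, Y):
    """Train one independent binary classifier per label column (assumption)."""
    return [train_binary_logreg(X, [row[j] for row in Y]) for j in range(len(Y[0]))]


def predict_scores(models, x):
    """Return one probability per label for a single feature vector."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for w, b in models]


# Placeholder embeddings for four memes (real ones have hundreds of dimensions).
image_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clip_text = [[0.0], [0.0], [1.0], [1.0]]
laser_emb = [[1.0], [0.0], [1.0], [0.0]]
labse_emb = [[0.0], [1.0], [0.0], [1.0]]

X = concat_features(image_emb, clip_text, laser_emb, labse_emb)
Y = [[1, 0], [1, 1], [0, 0], [0, 1]]  # two toy labels per meme
models = train_multilabel(X, Y)
scores = [predict_scores(models, x) for x in X]
```

Because each label gets its own classifier over the same concatenated vector, adding another text or image encoder only widens the input and leaves the classification stage unchanged.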
Stockholm University System 1 (SU1) A BERT-base model that only uses the text of the meme as input. The BERT model was fine-tuned independently for each task.
Stockholm University System 2 (SU2) A multimodal model (ImgBERT) which combines SU1 with image embeddings. The image embeddings were extracted using DenseNet-121 convolutional neural networks (CNNs), pretrained on ImageNet (Deng et al., 2009). The input to the multi-label classification layer is the concatenation of the text representation from the [CLS] token of SU1 and the image embedding. The final classifier is an ensemble of the ImgBERT model and the text-only model from SU1. The scores that the two models assigned to each label were averaged to decide the final label.
6 See https://github.com/facebookresearch/mmf/tree/master/projects/hateful_memes/fine_grained for training configuration.
Table 4 shows the performance on the two tasks across all the participants. All the systems used some variant of pre-trained multimodal representations fine-tuned on the shared task datasets. None of the submissions exploited the correlation across the tasks; instead, the systems were trained independently on each task. LTL-UDE1 and LTL-UDE2 were the only systems to exploit image-level metadata, an additional signal that was not part of the provided training data, and they showed the best performance on Task A. Moreover, the LTL-UDE1 and LTL-UDE2 submissions were the only ones to leverage state-of-the-art multimodal representations from VLBERT (Su et al., 2019), while all other submissions encoded the image and text channels independently. Interestingly, SU1, a text-only BERT system fine-tuned on the tasks, performed remarkably strongly, even outperforming the team's multimodal system and the provided baselines. It is unclear whether the model is picking up some unintended biases in the data, considering the relatively small size of the datasets provided for the shared task.
The QMUL system encoded the text using multiple pre-trained representations concatenated with the image representation, further supporting the hypothesis that a stronger encoding of the text might be sufficient to achieve strong performance on this dataset.