LT3 at SemEval-2021 Task 6: Using Multi-Modal Compact Bilinear Pooling to Combine Visual and Textual Understanding in Memes

Internet memes have become ubiquitous in social media networks today. Due to their popularity, they are also a widely used mode of expression to spread disinformation online. As memes consist of a mixture of text and image, they require a multi-modal approach for automatic analysis. In this paper, we describe our contribution to the SemEval-2021 Detection of Persuasion Techniques in Texts and Images Task. We propose a multi-modal learning system, which incorporates “memebeddings”, viz. joint text and vision features combined with compact bilinear pooling, to automatically identify rhetorical and psychological disinformation techniques. The experimental results show that the proposed system consistently outperforms the competition’s baseline, and achieves the 2nd best Macro F1-score and 14th best Micro F1-score out of all participants.


Introduction
Propaganda is a mode of communication by which the interested party pursues the aim of influencing public opinion in favour of a specific agenda or set of ideas. This is achieved by disseminating one-sided, biased or even fake news. With the advent of social media networks, propagandist text can reach an enormous audience. Given the overload of online text produced on a daily basis, it is not feasible to monitor this manually, and researchers have started to investigate automatic methods to detect propaganda in text.
In the SemEval-2020 Task 11 on Detection of Propaganda Techniques in News Articles (Da San Martino et al., 2020), participants were asked to identify 14 different propagandist techniques in news articles. The task attracted a large interest, with 44 teams participating in the task. The best systems all used pre-trained transformers and ensemble techniques. We further refer to Da San Martino et al. (2020b) for a comprehensive review of computational propaganda detection techniques.
More recently, internet memes have emerged as a very popular mode of expression on social networks. While memes initially seemed to be used mainly within specific online communities, they have gained popularity very rapidly and are today used by a very large and varied user base. Memes can serve very different purposes: they can be used as a form of visual rhetoric (Huntington, 2013), for online bullying or trolling (Leaver, 2013), but they can also function as a persuasive device, with the intended message wrapped in humour (Shifman, 2013). As a result, they form an interesting object of study for automatically detecting propaganda techniques.
The goal of the SemEval-2021 shared task on the detection of persuasion techniques (Dimitrov et al., 2021) is to build models for identifying rhetorical and psychological techniques that are used to influence social media users in online disinformation campaigns. This paper reports on our participation in Subtask 3, which is a multi-modal task conceived as a multi-label classification problem: given a meme, the system has to identify which of the 22 techniques are used in the textual and visual content of the meme. To solve Subtask 3, we propose a multi-modal learning system, which incorporates “memebeddings”, viz. joint text and vision features combined by means of compact bilinear pooling, to automatically identify rhetorical and psychological disinformation techniques in memes. These 22 techniques cover a wide array of phenomena such as Causal Oversimplification, Exaggeration/Minimisation, Name Calling/Labeling, or Presenting Irrelevant Data (Red Herring). For a full list of all categories, we refer to the task description paper (Dimitrov et al., 2021).
While some techniques in certain contexts may be accurately identified by processing the textual modality alone, it is very difficult to consistently identify all of the techniques without the complete visual and textual context. Figure 1 shows examples of the two cases: in the first meme, the text alone contains all the information necessary to predict the propaganda techniques accurately, whereas in the second meme, the textual modality does not provide all the information required to correctly predict the label. To tackle the task at hand, our approach incorporates information from both domain-related textual and visual pre-training, and finally combines the two modalities using Multi-modal Compact Bilinear (MCB) Pooling (Fukui et al., 2016).

Proposed Models
Our multi-modal system is composed of three submodules: 1. The visual pre-processor: a ResNet-51 architecture (He et al., 2016) which is pre-trained to identify subreddits (e.g., /r/motivation, /r/pets, /r/politics) from around 6,200 Reddit memes. 2. The textual pre-processor: a pre-trained BERT model which is fine-tuned for the task of predicting propaganda techniques in text. 3. The integration network: the two sets of embeddings from the first two modules are combined with MCB pooling.

Visual Embeddings
We decided to use the ResNet-51 architecture for the visual pre-processor. This model was trained to predict which of the 18 subreddits the memes were scraped from. We hypothesized that the learned embeddings are able to distinguish certain elements of a meme, since the model is forced to encode the subreddit the meme comes from, and the subreddits represent the genre (e.g., /r/politics, /r/sports) or the emotion associated with the meme (e.g., /r/motivation, /r/dankmemes).

Textual Embeddings
For the text pre-processor we used a pre-trained bert-large-uncased model from the HuggingFace transformers package. We fine-tuned the model with additional linear layers for the multi-label task of predicting propaganda techniques in the PTC Corpus. Fine-tuned BERT-based models are widely used for text classification and obtain state-of-the-art performance on a large number of NLP benchmarks such as GLUE (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016).

Combined Embeddings
We train the final model with the combined embeddings from the visual and textual pre-processors, for the final task of multi-label prediction of the 22 propaganda techniques. While the visual pre-processor and textual pre-processor become excellent feature generators individually, combining embeddings from two modalities with different dimensions (768d for text and 1024d for images) is not straightforward. While a dot product would be simple and efficient to compute, it only encodes a linear mapping of features, i.e., first-order interactions where every visual feature interacts with just one textual feature rather than with multiple features. Moreover, a dot product cannot even be computed between two vectors of different dimensions. An outer product, on the other hand, relates every feature from one modality to every feature from the second modality. While this is closer to the representation we need, the outer product of two vectors with 768 and 1024 dimensions respectively is 786,432-dimensional. To use a vector this large, the classification model would need billions of parameters, making it almost impossible to train in practice.
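The dimensionality argument above can be checked directly. A minimal sketch (the 2048-unit hidden layer here is a hypothetical choice for illustration, not part of our architecture):

```python
# Embedding sizes from the two pre-processors.
text_dim, vis_dim = 768, 1024

# A dot product requires equal dimensions, so it is not even defined here;
# an outer product is defined, but its size explodes.
outer_dim = text_dim * vis_dim
print(outer_dim)  # 786432 features per meme

# A single (hypothetical) 2048-unit hidden layer on top of that vector
# would already need over 1.6 billion weights.
hidden_units = 2048
weights = outer_dim * hidden_units
print(weights)  # 1610612736
```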
MCB Pooling combines the computational efficiency of the dot product with the higher-order representation of the outer product. It was first implemented for the task of Visual Question Answering (Antol et al., 2015), which also focuses on jointly encoding textual and visual content. It centers around projecting the vectors to lower-dimensional count sketches, which lose little information, so that combining them becomes feasible again. Pham and Pagh (2013) demonstrated that the count sketch of the outer product of two vectors equals the convolution of their individual count sketch projections, as shown in Equation 1:

Ψ(x ⊗ q, h, s) = Ψ(x, h, s) ∗ Ψ(q, h, s)    (1)

where x and q are the embeddings from the textual and visual modality respectively, ⊗ is the outer product, Ψ(x, h, s) represents the count sketch projection of a vector x with hash function h and sign function s, and ∗ represents the (circular) convolution operator, which can be computed efficiently via the Fast Fourier Transform (FFT). Figure 2 summarizes the proposed system architecture.
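As an illustration, the identity in Equation 1 can be sketched in a few lines of plain Python (function names are ours; a practical implementation such as Fukui et al.'s computes the convolution via FFT on the GPU rather than with explicit loops):

```python
import random

def count_sketch(v, h, s, d):
    # Project vector v into d dimensions: entry i is added to bucket h[i]
    # with random sign s[i].
    out = [0.0] * d
    for i, x in enumerate(v):
        out[h[i]] += s[i] * x
    return out

def circular_conv(a, b):
    # Circular convolution of two d-dimensional vectors; equivalent to
    # elementwise multiplication in the FFT domain followed by inverse FFT.
    d = len(a)
    return [sum(a[j] * b[(i - j) % d] for j in range(d)) for i in range(d)]

def mcb(text_emb, vis_emb, d, seed=0):
    # Multi-modal compact bilinear pooling of two embeddings into a single
    # d-dimensional "memebedding", using fixed random hashes and signs.
    rng = random.Random(seed)
    h1 = [rng.randrange(d) for _ in text_emb]
    s1 = [rng.choice((-1, 1)) for _ in text_emb]
    h2 = [rng.randrange(d) for _ in vis_emb]
    s2 = [rng.choice((-1, 1)) for _ in vis_emb]
    return circular_conv(count_sketch(text_emb, h1, s1, d),
                         count_sketch(vis_emb, h2, s2, d))
```

The convolution of the two sketches thus matches the sketch of the full outer product exactly, while only ever materializing d-dimensional vectors.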

Experimental Setup
The ResNet-51 model for the visual pre-processor was trained with Stochastic Gradient Descent and penalized with the cross-entropy loss. For the initial state we used the model pre-trained on ImageNet, and fine-tuned it after replacing the classification layer.
The BERT model for the text embeddings was fine-tuned by freezing the pre-trained model and adding a linear layer as well as a classification layer. The model was trained with the AdamW optimizer, with a weight decay of 0.01, and penalized with cross-entropy as well.
The final MCB model uses embeddings from both models and combines them into a single 8,000-dimensional vector with MCB, then passes it through two linear layers of sizes 2048 and 1024 respectively, followed by a classification layer for the multi-label output over the 22 techniques. This final model is optimized with Adam and penalized with cross-entropy as well. We used the train set released by the task organizers for training, and the development set as a validation set for optimizing hyper-parameters.

Table 1 summarizes the key results of our multi-modal approach. While combining the visual and textual embeddings with simple weighted averaging consistently beats the task baselines by a significant margin, using MCB Pooling results in a considerable performance increase over weighted averaging, both for Macro and Micro F1-scores. In addition, when analyzing samples that are misclassified by the weighted averaging approach but correctly predicted by MCB Pooling, we noticed that around 40 percent of the examples required combining information from both the visual and textual modalities, as in Figure 1(b).

While the MCB model is better able to pool the understanding of both modalities, it still fails when a complex concept or inference is involved. Figure 3a represents a common meme format frequently found in the memes we were able to obtain from Reddit; consequently, we expect the model has sufficiently learnt the corresponding visual features to come to a correct prediction. Figure 3b, however, requires some complex visual ideas the model has to infer in combination with the text, which it frequently fails to do. We believe that jointly training visual and textual embeddings (making the learning coherent rather than disjointed between the two modalities), instead of simply attempting to combine independent visual and textual information, would solve this issue. However, joint training can get computationally expensive and would require a much larger dataset.
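For comparison with the parameter blow-up of a full outer product, the classification head described in this section stays small. A quick sanity check, counting only the weights of the layers listed above (biases ignored):

```python
# Layer sizes of the final classifier: MCB output -> 2048 -> 1024 -> 22 labels.
mcb_dim, hidden1, hidden2, n_labels = 8000, 2048, 1024, 22

# Weight count of the three linear layers (biases ignored).
params = mcb_dim * hidden1 + hidden1 * hidden2 + hidden2 * n_labels
print(params)  # 18503680 -- about 18.5M weights, versus billions for a
               # classifier over the full 786,432-dimensional outer product
```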

Conclusion and Future Work
This paper presents the multi-modal approach we proposed for automatically detecting persuasion techniques in memes. As memes combine text and images to obtain the desired effect, we built a system where visual and textual embeddings are combined to classify 22 different propaganda techniques. The experimental results show that combining textual and visual embeddings by weighted averaging already beats the baseline. These results, however, are considerably improved by combining both embedding sets by means of MCB Pooling.
In future work, we will investigate how we can incorporate additional semantic information in our model. A first step could consist of integrating more explicit argumentation information into our model. As these propaganda techniques rely on psychological and rhetorical devices, we believe it might be interesting to include argumentation structures such as logical fallacies, where the reasoning is flawed and, by consequence, the conclusion cannot be drawn from the premise(s) in the text. To this end, we will build on recent work on automatic fallacy detection (Habernal et al., 2017). In addition, we also aim to include automatic emotion detection features, as writers of propagandist text often use emotional language to convince their readers. Finally, we will investigate an approach to jointly train visual and textual embeddings, rather than combining separate embedding sets, as our error analysis showed that effective analysis of memes often requires a combined approach.