1213Li at SemEval-2021 Task 6: Detection of Propaganda with Multi-modal Attention and Pre-trained Models

This paper presents the solution proposed by the 1213Li team for subtask 3 in SemEval-2021 Task 6: identifying the multiple persuasion techniques used in the multi-modal content of memes. We explored various approaches to feature extraction and the detection of persuasion labels. Our final model employs the pre-trained models RoBERTa and ResNet-50 as feature extractors for text and images, respectively, and adopts a label embedding layer with a multi-modal attention mechanism to measure the similarity between labels and the multi-modal information and to fuse features for label prediction. Our proposed method outperforms the provided baseline and ranks 3rd out of 16 participants with Micro/Macro F1 scores of 0.54860/0.22830.


Introduction
The development of the Internet and information technology promotes the generation and dissemination of information, but also fuels the proliferation of disinformation. As one of the most popular types of disinformation content, memes attract readers easily and bring further challenges to the detection of disinformation (Martino et al., 2020; Dimitrov et al., 2021).
Specifically, memes employ a number of techniques to influence users, which can be divided into the use of logical fallacies and appeals to the emotions of the audience (Dimitrov et al., 2021). In practice, the former misuses logical rules to disguise wrong conclusions as correct and objective, while the latter uses emotional language to induce the audience to agree with the speaker emotionally and to prevent their rational analysis of the argumentation.

Figure 1: Examples of multi-modal samples. We rewrote the sentences ourselves and collected the images from ‡ , § , ¶ , respectively. The first two rows illustrate the visual and textual content, and each line of the last row gives a label (technique) of the sample.
Identifying the techniques used in memes contributes to the understanding of user-generated content and further helps the detection of disinformation. Subtask 3 of SemEval-2021 Task 6 (Dimitrov et al., 2021) was organized to stimulate the study of computational methods for detecting the persuasion techniques that inhere in the texts and images of memes.
As shown in Figure 1, each sample consists of a set of textual sentences and an attached image. According to the task description (Dimitrov et al., 2021), the image and the sentences can each convey modality-specific persuasion techniques, and at the same time the image can combine with the sentences to express further techniques, which we call global techniques. Based on this understanding of the task, we attribute the main challenges of subtask 3 to the following three aspects: 1) extracting essential features from each modality to predict modality-specific labels, 2) fusing multi-modal features to understand the content fully for predicting global labels, and 3) capturing the connections among multiple labels.
Correspondingly, we present methods to handle these challenges. Specifically, our method employs powerful feature extractors, the pre-trained RoBERTa (Liu et al., 2019) and ResNet-50 (He et al., 2016), to extract textual and visual features, respectively. Besides, inspired by the work of Augenstein et al. (2019), our method adopts a label embedding layer to implicitly learn how semantically close the labels are to one another; the embedding layer maps each label to a learnable fixed-size vector. Before label prediction, the multi-modal features are fused according to their relevance to each label using an attention mechanism (Bahdanau et al., 2015), and the final prediction is based on the fused features.

Related Work
Subtask 3 is a multi-label classification task based on multi-modal data. Multi-label classification was initially handled by classical machine learning methods. Zhang and Zhou (2005) used a k-nearest-neighbor-based algorithm to conduct experiments on real-world multi-label bioinformatic data. Vens et al. (2008) proposed a hierarchical multi-label classification method based on decision trees. With the rapid development of deep learning, multi-label classification methods based on deep neural networks have become mainstream. Wang et al. (2016) introduced a multi-label image classification network that fuses a CNN and an RNN.
Chernyavskiy et al. (2020) used a RoBERTa-based network combined with additional CRF (Lafferty et al., 2001) layers and a transfer learning mechanism (Pan and Yang, 2010) to address a multi-label classification task in SemEval-2020. However, these previous multi-label classification tasks were often based on single-modality data, and the corresponding approaches fall short when a task requires the use of multi-modal data.
Moreover, in the field of multi-modal tasks, we focus on multi-modal fake news detection. Recent works (Jin et al., 2017; Wang et al., 2018; Khattar et al., 2019) mainly concern the fusion of multi-modal features and adopt a binary classifier, which is not applicable to the current multi-label classification scenario.

Task Formulation
The task of identifying the techniques used in memes is defined as a multi-label classification problem over a given multi-modal sample. We refer to the textual sentences as S and the attached image as I, and use M to denote the multi-modal model, which maps the inputs S and I to a set of N binary values representing the corresponding labels. The task is formulated as follows:

ŷ = M(S, I) = F(φ(S), φ(I)) ∈ {0, 1}^N    (1)

In Equation 1, φ denotes the feature extractor for the textual and visual content, respectively, and F denotes the fusion of the multi-modal features. The length of the predicted vector equals the number of labels N, and a value of 1 indicates that the corresponding label is predicted.

Method
In this section, we demonstrate the method used by our team for subtask 3. As shown in Figure 2, our method consists of three main layers: the Extraction Layer, the Fusion Layer, and the final Classifier. In the rest of this section, we describe each layer in detail.

Extraction Layer
In the Extraction Layer, the pre-trained RoBERTa is used to extract textual features. Specifically, given that RoBERTa accepts at most two sentences as input while some samples contain multiple sentences, we splice all sentences into a single sequence, retaining the character sequence "\n\n" as the separator. From the outputs of RoBERTa, we retain only the representation of each token as the sequential features T for post-processing.
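The splicing step can be sketched as follows (the helper name is ours; the real pipeline then tokenizes the spliced text and feeds it to RoBERTa):

```python
def splice_sentences(sentences):
    """Join all sentences of a sample into a single sequence,
    keeping "\n\n" as the separator, as described above."""
    return "\n\n".join(s.strip() for s in sentences)

# Hypothetical sample with multiple sentences:
text = splice_sentences(["MAKE MEMES GREAT AGAIN", "SHARE THIS NOW"])
# The spliced text is tokenized once; RoBERTa's per-token hidden
# states then serve as the sequential features T.
```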
For the image input, we use a ResNet-50 pre-trained on ImageNet to extract visual features. Before the image is input to the ResNet-50 network, it is normalized and cropped to 3 × 224 × 224. Afterward, we select the last convolution layer's feature maps of size 2048 × 7 × 7 as visual features and transform them into sequential features V of size 49 × 2048.
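The reshape from the 2048 × 7 × 7 feature map to the 49 × 2048 sequence can be illustrated with random data standing in for the real ResNet-50 output (shapes are the only thing this sketch asserts):

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(2048, 7, 7))  # C x H x W from the last conv layer

# Flatten the 7x7 spatial grid into 49 positions,
# each described by a 2048-dimensional channel vector.
V = feature_map.reshape(2048, 49).T          # -> 49 x 2048

assert V.shape == (49, 2048)
# Row p of V is the channel vector at spatial position p:
assert np.allclose(V[3], feature_map[:, 0, 3])
```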

Fusion Layer
The Fusion Layer selects the features used for label prediction. As mentioned earlier, the labels implied in memes include both modality-specific labels and global labels. To promote the prediction of modality-specific labels, we perform average-pooling on both the textual and visual features to extract the modality-specific features T_avg and V_avg ("avg" in Figure 2).
Meanwhile, to promote the prediction of global labels, we adopt the attention mechanism to fuse the multi-modal features. Particularly, as depicted in Equations 4-6, we first calculate the similarity between the i-th label embedding l_i and the textual features:

e_i,j = l_i · t_j    (4)
α_i,j = exp(e_i,j) / Σ_k exp(e_i,k)    (5)
T_i,att = Σ_j α_i,j t_j    (6)

We then weighted-sum the textual features according to the similarity scores to obtain the label-related representation T_i,att ("att" in Figure 2). A similar operation is applied to the visual features to produce V_i,att.
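A minimal numpy sketch of this label-wise attention, assuming (as an illustration) a plain dot-product similarity and that label embeddings and sequence features share one dimension after projection:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def label_attention(label_emb, feats):
    """Weighted-sum the sequence features by their similarity
    with one label embedding."""
    scores = feats @ label_emb   # similarity of each position with the label
    alpha = softmax(scores)      # attention weights, summing to 1
    return alpha @ feats         # fused label-related representation

rng = np.random.default_rng(0)
T = rng.normal(size=(12, 256))   # textual features (12 tokens)
l_i = rng.normal(size=256)       # i-th label embedding
T_i_att = label_attention(l_i, T)
assert T_i_att.shape == (256,)
```

The same function applied to the 49 × 2048 visual sequence would yield the visual counterpart.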
Finally, we concatenate the features obtained above as the final representation of the input and pass it into the Classifier.

Classifier
We adopt a three-layer fully connected network as the classifier, which maps the final representation R_i obtained above to a scalar. We then employ a sigmoid function to squash the scalar into the interval (0, 1). Notably, this process is required for each label and is performed in parallel; hence, our model finally outputs a vector whose length equals the number of labels.
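The per-label scoring can be sketched as below; the hidden sizes and ReLU activations are our assumptions, since the text only specifies a three-layer fully connected network followed by a sigmoid:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_score(r, layers):
    """Three-layer fully connected network mapping the fused
    representation r to a scalar squashed into (0, 1)."""
    h = r
    for W, b in layers[:-1]:
        h = np.maximum(0.0, h @ W + b)   # ReLU hidden layers (assumed)
    W, b = layers[-1]
    return sigmoid((h @ W + b).item())   # final scalar in (0, 1)

rng = np.random.default_rng(0)
dims = [1024, 512, 128, 1]               # illustrative layer sizes
layers = [(rng.normal(scale=0.01, size=(dims[i], dims[i + 1])),
           np.zeros(dims[i + 1])) for i in range(3)]
p_i = mlp_score(rng.normal(size=1024), layers)   # probability for label i
assert 0.0 < p_i < 1.0
```

Running this scorer once per label, with label-specific fused representations, yields the output vector described above.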

Dataset
The dataset was provided by SemEval-2021 Task 6 subtask 3; the training, development, and test sets contain 687, 63, and 200 samples, respectively. Each sample consists of an image-text pair, an id, and labels.

Evaluation Measures
The official evaluation measure for this technique classification task is Micro-F1. Macro-F1 is also reported, and we consider both Micro-F1 and Macro-F1 during the experiments.
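The two measures differ in where the averaging happens: Micro-F1 pools true/false positive counts over all labels, while Macro-F1 averages per-label F1 scores, so rare labels weigh equally. A self-contained sketch:

```python
import numpy as np

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(y_true, y_pred):
    """y_true, y_pred: (samples x labels) binary matrices."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = (y_true & y_pred).sum(axis=0)
    fp = (~y_true & y_pred).sum(axis=0)
    fn = (y_true & ~y_pred).sum(axis=0)
    micro = f1(tp.sum(), fp.sum(), fn.sum())                   # pooled counts
    macro = float(np.mean([f1(*c) for c in zip(tp, fp, fn)]))  # per-label mean
    return micro, macro

micro, macro = micro_macro_f1([[1, 0], [1, 1]], [[1, 0], [0, 1]])
```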

Parameter Settings
To train the model, we adopt the binary cross-entropy loss as the objective function and employ the Adam method (Kingma and Ba, 2015) with a learning rate of 0.0001 to optimize it. We set the minibatch size to 64 and the dimension of the label embeddings to 256. Based on experimental verification, we freeze the parameters of ResNet-50 while fine-tuning the parameters of RoBERTa during training. Our methods are implemented with PyTorch and run on a single Nvidia 1080ti graphics card.
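For reference, the binary cross-entropy objective over the sigmoid outputs can be written out as follows (a numpy sketch with hypothetical values; training itself uses PyTorch's built-in loss):

```python
import numpy as np

def bce_loss(y_true, p_pred, eps=1e-7):
    """Binary cross-entropy averaged over the label outputs."""
    p = np.clip(p_pred, eps, 1 - eps)   # guard against log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y = np.array([1.0, 0.0, 1.0])     # hypothetical gold labels
p = np.array([0.9, 0.1, 0.8])     # hypothetical sigmoid outputs
loss = bce_loss(y, p)
```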

Ensemble
We use an ensemble of 5 models trained with different development sets to produce the final predictions. Among the five ensembled models, one uses the original training and development sets; each of the remaining four uses 64 samples randomly drawn from the combined training and development data as its development set and the rest as its training set.
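The text does not spell out how the five models' outputs are merged; one common rule, averaging the sigmoid outputs and thresholding at 0.5, would look like this:

```python
import numpy as np

def ensemble_predict(model_probs, threshold=0.5):
    """Average per-model sigmoid outputs, then threshold to get
    the final multi-label prediction (averaging is an assumption)."""
    mean = np.mean(model_probs, axis=0)
    return (mean >= threshold).astype(int)

# Hypothetical outputs of 3 models over 3 labels:
probs = [[0.9, 0.2, 0.60],
         [0.8, 0.1, 0.40],
         [0.7, 0.3, 0.55]]
pred = ensemble_predict(probs)
```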

Results and Discussion
The result of the ablation study is shown in Table 1.
As we can see, the baseline method is quite ineffective since it utilizes only the average-pooled visual and textual features, indicating that the lack of interaction between the modality-specific features and the label information hinders the model from selecting vital features for prediction and leads to poor performance. We therefore introduce the attention mechanism to selectively extract valid information from the visual and textual features, respectively. As shown in the second group of Table 1, the attention mechanism significantly improves the model's performance, especially the Micro-F1 score.
Finally, the model that combines both visual and textual features with the attention mechanism achieves the best performance. During the test stage, we chose the model that performed best on the development set and obtained the final result through ensembling. The final evaluation results are reported in Table 2.

Conclusion
This paper demonstrates the method proposed by our team for subtask 3 in SemEval-2021 Task 6, which aims to identify which of 22 persuasion techniques are used in the textual and visual content of a given meme. Our method uses RoBERTa and ResNet-50 to extract multi-modal features, introduces an attention mechanism to fuse the multi-modal features, and adopts label embeddings to learn the representations of labels. Our proposed model achieves noticeable improvements over the baseline method, and in the official evaluation our submission ranked 3rd out of 16 teams.