MinD at SemEval-2021 Task 6: Propaganda Detection using Transfer Learning and Multimodal Fusion

We describe our systems for subtask1 and subtask3 of SemEval-2021 Task 6 on Detection of Persuasion Techniques in Texts and Images. The purpose of subtask1 is to identify propaganda techniques given textual content, and the goal of subtask3 is to detect them given both textual and visual content. For subtask1, we investigate transfer learning based on pre-trained language models (PLMs) such as BERT and RoBERTa to address data sparsity. For subtask3, we extract heterogeneous visual representations (i.e., face features, OCR features, and multimodal representations) and explore various multimodal fusion strategies to combine the textual and visual representations. The official evaluation shows that our ensemble model ranks 1st for subtask1 and 2nd for subtask3.


Introduction
With the recent interest in "fake news", the detection of propaganda or highly biased texts has emerged as an active research area (Da San Martino et al., 2020, 2019; Chernyavskiy et al., 2020).
SemEval-2021 Task 6 (Dimitrov et al., 2021) provides three subtasks aiming to detect persuasion techniques in texts and images. We participate in subtask1 and subtask3, which are defined as follows:
• subtask1: Given only the "textual content" of a meme, identify which of the 20 techniques are used in it. This is a multilabel classification problem.
• subtask3: Given a meme, identify which of the 22 techniques are used both in the textual and visual content of the meme (multimodal task).
For subtask1, we focus on using transfer learning to tackle data scarcity, since deep learning models require large amounts of data while labeled data is difficult to obtain in quantity. Specifically, we first fine-tune the pre-trained language models on an external dataset from SemEval-2020 Task 11 (Da San Martino et al., 2020) and then continue to fine-tune them on the training dataset of SemEval-2021 Task 6. The probabilities of these tuned models are averaged to make the final prediction.
For subtask3, we concentrate on multimodal fusion to combine textual and visual representations. Heterogeneous visual representations are extracted, including face, OCR and multimodal representations. The face representation consists of recognized human faces and facial expressions. The OCR representation can capture the relations among snippets in an image. A multimodal pre-trained model can process inputs from both modalities simultaneously for joint visual and textual understanding. After that, we explore three multimodal fusion strategies (i.e., Average, Concat and MLP) to combine the textual and visual representations.
The experimental results show that transfer learning can leverage knowledge from source data to tackle problems related to the scarcity of data, and heterogeneous visual representation (i.e., face, OCR, and multimodal representation) can be used as complementary features to better detect persuasion techniques. Our ensemble model ranks 1st for subtask1 and 2nd for subtask3.

System Overview
In this section, we provide a general overview of our systems for the two subtasks. We treat propaganda detection as a multimodal, multiclass, multi-label classification task: predicting one or more labels given an input text and an input image.

Model
Various pre-trained models are explored to extract textual and visual features, and these textual and visual features are fused to predict labels.
Textual Representation In this paper, five pretrained language models (PLMs) are used. Representations of the special token [CLS] are passed to the classification layer. We briefly describe each PLM:
• BERT (Devlin et al., 2019) is a powerful transformer-based PLM that enables bidirectional training using a "masked language model" (MLM) pre-training objective. The masked language model randomly masks some input tokens and aims to predict the masked tokens. BERT also uses a next sentence prediction (NSP) objective during pretraining, which is a binary classification loss for predicting whether two segments follow each other in the original text. With tailored fine-tuning objectives, BERT can improve performance on downstream tasks such as classification.
• RoBERTa (Liu et al., 2019) proposes an improved recipe for training BERT models and boosts the performance on GLUE (Wang et al., 2019), RACE (Lai et al., 2017) and SQuAD (Rajpurkar et al., 2016). It shares the same model architecture with BERT, and mainly improves BERT by dynamic masking and a larger byte-level Byte-Pair Encoding (BPE) (Sennrich et al., 2016).
• XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL into pretraining via reparameterization. It can capture the dependency between the masked positions and alleviates the pretrain-finetune discrepancy.
• DeBERTa (He et al., 2020) disentangles the attention mechanism and encodes each word with two vectors representing content and position, respectively. An enhanced mask decoder is also used to incorporate absolute positions in the decoding layer to predict the masked tokens during pre-training. These methods enable DeBERTa to obtain competitive performance on both natural language understanding (NLU) and natural language generation (NLG) downstream tasks.
• ALBERT (Lan et al., 2019) replaces the next sentence prediction (NSP) loss with a sentence order prediction (SOP) loss to better model inter-sentence coherence. In addition, it adopts two parameter reduction techniques to lower memory consumption and increase the training speed of BERT. With fewer parameters than BERT-large, ALBERT establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks.
Visual Representation Three visual representations are adopted, including face representation, OCR representation and single-stream multimodal representation.
• Face Representation It is important to recognize faces and facial expressions for propaganda detection. We use a state-of-the-art face recognition model, which is a ResNet-34 network (He et al., 2016) with 29 conv layers. Each face in an image is encoded as a 128-dimensional vector using the published toolkit 1, and we apply mean pooling over these vectors to obtain the final face representation.
• OCR Representation For text in an image, the 2-D position of the text can capture the font size and the relationship among tokens within the image. Therefore, we use a 2-D position embedding to jointly model interactions between text and layout information across the image. We extract the 2-D positions of the bounding boxes using the Microsoft OCR 2.
• Multimodal Representation Recent studies on vision-language pre-training have pushed the limits of a variety of Vision-and-Language (V+L) tasks, and both the image and text content can help understand the semantics of the meme for propaganda detection. Therefore, we also extract region-based image features with Faster R-CNN (Ren et al., 2015) to represent the image. Then, we follow Li et al. (2021) and use the pre-trained multimodal model SemVLP to better learn the fusion between the image and text.
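The mean pooling step for the face representation can be illustrated with a minimal sketch. The function name `mean_pool_faces`, the toy vectors, and the all-zero fallback for images without any detected face are assumptions made for illustration; the text above does not say how face-free images are handled.

```python
from typing import List

FACE_DIM = 128  # dimensionality of each face vector, as stated in the text


def mean_pool_faces(face_vectors: List[List[float]], dim: int = FACE_DIM) -> List[float]:
    """Average per-face vectors into one image-level face representation."""
    if not face_vectors:
        # Assumption: an image with no detected face gets an all-zero vector.
        return [0.0] * dim
    for vec in face_vectors:
        if len(vec) != dim:
            raise ValueError(f"expected {dim}-dimensional vectors, got {len(vec)}")
    count = len(face_vectors)
    return [sum(vec[i] for vec in face_vectors) / count for i in range(dim)]


# Two hypothetical faces detected in one meme.
faces = [[1.0] * FACE_DIM, [3.0] * FACE_DIM]
pooled = mean_pool_faces(faces)
```

Mean pooling keeps the representation at a fixed size regardless of how many faces appear in a meme, at the cost of discarding which face contributed which feature.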
Multimodal Fusion For multimodal propaganda detection, we employ three fusion methods to combine the textual and visual features. We denote the textual and visual representations by $h_t$ and $h_v$, respectively.
• Average The probabilities predicted from the text features and from the image features are averaged for prediction.
• Concat The text and image features are concatenated to predict probabilities.
• MLP Before making a prediction, we map the text and image features to the same semantic space.
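The three fusion strategies can be sketched as follows. This is an illustration under stated assumptions rather than the submitted implementation: the classifier is assumed to be a single linear layer followed by a sigmoid, and for MLP the projected features are assumed to pass through a tanh and then be concatenated, since the text only says that both modalities are mapped to the same semantic space. All function names and weights are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Layer = Tuple[Matrix, Vector]  # (weight, bias) of a linear layer


def linear(layer: Layer, x: Vector) -> Vector:
    """Compute weight @ x + bias for an (out_dim x in_dim) weight matrix."""
    weight, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def classify(layer: Layer, x: Vector) -> Vector:
    """Per-label probabilities: linear layer followed by a sigmoid (assumed)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in linear(layer, x)]


def fuse_average(h_t: Vector, h_v: Vector, clf_t: Layer, clf_v: Layer) -> Vector:
    """Average: classify each modality separately, then average probabilities."""
    p_t = classify(clf_t, h_t)
    p_v = classify(clf_v, h_v)
    return [(a + b) / 2.0 for a, b in zip(p_t, p_v)]


def fuse_concat(h_t: Vector, h_v: Vector, clf: Layer) -> Vector:
    """Concat: classify the concatenation [h_t; h_v]."""
    return classify(clf, h_t + h_v)


def fuse_mlp(h_t: Vector, h_v: Vector, proj_t: Layer, proj_v: Layer, clf: Layer) -> Vector:
    """MLP: project both modalities into a shared space before classifying.

    Assumption: tanh projections whose outputs are concatenated.
    """
    z_t = [math.tanh(z) for z in linear(proj_t, h_t)]
    z_v = [math.tanh(z) for z in linear(proj_v, h_v)]
    return classify(clf, z_t + z_v)


# Toy example: 2-d text features, 1-d visual features, 2 labels.
h_t, h_v = [1.0, -1.0], [0.5]
clf_t = ([[0.5, 0.0], [0.0, 0.5]], [0.0, 0.0])
clf_v = ([[1.0], [-1.0]], [0.0, 0.0])
clf_cat = ([[0.5, 0.0, 1.0], [0.0, 0.5, -1.0]], [0.0, 0.0])
proj_t = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
proj_v = ([[1.0], [1.0]], [0.0, 0.0])
clf_mlp = ([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]], [0.0, 0.0])

p_average = fuse_average(h_t, h_v, clf_t, clf_v)
p_concat = fuse_concat(h_t, h_v, clf_cat)
p_mlp = fuse_mlp(h_t, h_v, proj_t, proj_v, clf_mlp)
```

Average fuses at the decision level, while Concat and MLP fuse at the feature level, so only the latter two let the classifier weigh textual and visual evidence jointly.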

Training
Multilabel Classifier We add a label-wise feed-forward network (FFN) and a linear layer to predict the labels. At training time, we minimize the binary cross-entropy (BCE) objective $L = -\sum_{c} \left[ y_c \log \hat{y}_c + (1 - y_c) \log (1 - \hat{y}_c) \right]$, where $y_c$ is the ground truth of class $c$ and $\hat{y}_c$ is the predicted value. At test time, we predict the label as $\tilde{y}_c = I(\hat{y}_c > T)$, where $T$ is a probability threshold and $I$ is the indicator function.
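As a minimal sketch of the training objective and the test-time decision rule, the code below computes the BCE loss for one example and thresholds the predicted probabilities. Summing over classes (rather than averaging) and the default threshold of 0.5 are assumptions for illustration; the value of T used in the system is not given here.

```python
import math
from typing import List


def bce_loss(y_true: List[int], y_prob: List[float], eps: float = 1e-12) -> float:
    """Binary cross-entropy summed over the classes of one example."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def predict_labels(y_prob: List[float], threshold: float = 0.5) -> List[int]:
    """Indicator function I(p > T) applied to each class probability."""
    return [1 if p > threshold else 0 for p in y_prob]


loss = bce_loss([1, 0, 0], [0.9, 0.2, 0.6])
labels = predict_labels([0.9, 0.2, 0.6])
```

Because each class is thresholded independently, an example can receive zero, one, or several labels, which is what makes the classifier multi-label.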
To address the label imbalance problem, we adopt focal loss (FL) (Lin et al., 2017) during training, which down-weights easy examples and focuses training on hard negatives.
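The down-weighting behaviour of focal loss can be seen in a short sketch of its binary, per-label form from Lin et al. (2017). The focusing parameter `gamma=2.0` is the default suggested in that paper and is an assumption here, since the value used in this system is not stated; the class-balancing factor is omitted for brevity.

```python
import math
from typing import List


def focal_loss(y_true: List[int], y_prob: List[float],
               gamma: float = 2.0, eps: float = 1e-12) -> float:
    """Binary focal loss summed over classes: -(1 - p_t)^gamma * log(p_t)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)    # clip to avoid log(0)
        p_t = p if y == 1 else 1.0 - p     # probability of the true outcome
        total -= (1.0 - p_t) ** gamma * math.log(p_t)
    return total


easy = focal_loss([1], [0.9])  # confident and correct: heavily down-weighted
hard = focal_loss([1], [0.1])  # confident and wrong: close to plain BCE
```

With `gamma=0` the modulating factor is 1 and the expression reduces to binary cross-entropy, so `gamma` controls how strongly well-classified labels are discounted.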
Transfer Learning It is difficult to get vast amounts of labeled data for supervised models. Transfer learning enables us to utilize knowledge from previously learned tasks and apply it to newer, related ones. We use transfer learning from the news articles domain: we first train the model on the news data, and then we continue training on the data for this task. In preliminary experiments, we find that fine-tuning the layers in this process works better than freezing them and using the model as a feature extractor.
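The two-stage procedure can be sketched with a toy model. A two-weight logistic regression stands in for the PLM here, and the data, learning rate, and epoch counts are all hypothetical; the point is only that the second stage starts from the weights produced by the first and keeps updating every weight rather than freezing any.

```python
import math
from typing import List, Tuple

Example = Tuple[List[float], int]  # (features, binary label)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune(weights: List[float], data: List[Example],
              epochs: int, lr: float) -> List[float]:
    """Continue training all weights with plain SGD on the logistic loss."""
    w = list(weights)  # copy so the starting checkpoint is left untouched
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    return w


# Hypothetical toy data: a large source set (standing in for news sentences)
# and a small target set (standing in for meme texts).
source_data = [([1.0, 0.2], 1), ([-1.0, 0.1], 0)] * 50
target_data = [([0.8, -0.5], 1), ([-0.9, 0.4], 0)]

initial = [0.0, 0.0]                                         # stands in for the PLM weights
stage1 = fine_tune(initial, source_data, epochs=3, lr=0.1)   # source (news) stage
stage2 = fine_tune(stage1, target_data, epochs=10, lr=0.1)   # target (meme) stage
```

Starting the second stage from `stage1` instead of `initial` is what carries knowledge from the source domain over to the small target set.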

Dataset
We conduct experiments with the train, dev and test datasets provided by SemEval-2021 Task 6 (Dimitrov et al., 2021), which contain 687, 63 and 200 memes, respectively, for both subtask1 and subtask3.

External Resources
We use the annotations of the PTC corpus (more than 20,000 sentences) from SemEval-2020 Task 11 (Da San Martino et al., 2020) as an external resource. Although its domain is news articles and fewer techniques are considered, the annotations are made using the same guidelines as SemEval-2021 Task 6.

Evaluation Measures
Subtask1 and subtask3 are multi-label classification tasks. The official evaluation measure for both tasks is micro-F1. We also report macro-F1.
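To make the difference between the two measures concrete, the sketch below computes micro-F1 and macro-F1 for multi-label predictions. It is not the official scorer: the function names and the toy gold and predicted label sets are hypothetical, and assigning an F1 of 0 to a label with no gold and no predicted instances is an assumed convention.

```python
from typing import Dict, List, Set, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold: List[Set[str]], pred: List[Set[str]],
                   labels: List[str]) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) over multi-label predictions."""
    counts: Dict[str, List[int]] = {label: [0, 0, 0] for label in labels}  # tp, fp, fn
    for g, p in zip(gold, pred):
        for label in labels:
            if label in p and label in g:
                counts[label][0] += 1
            elif label in p:
                counts[label][1] += 1
            elif label in g:
                counts[label][2] += 1
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1_from_counts(*(sum(c[i] for c in counts.values()) for i in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(labels)
    return micro, macro


labels = ["Loaded Language", "Name Calling", "Smears"]
gold = [{"Loaded Language", "Smears"}, {"Name Calling"}, {"Loaded Language"}]
pred = [{"Loaded Language"}, {"Name Calling", "Smears"}, {"Loaded Language"}]
micro, macro = micro_macro_f1(gold, pred, labels)
```

In this toy case the frequent labels are predicted perfectly and the one Smears prediction is wrong, so micro-F1 (0.75) is higher than macro-F1 (about 0.67): macro-F1 weights every label equally and is therefore more sensitive to rare labels.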

Parameter Settings
We adopt the large models and select hyperparameters using validation on a subsample of the training data. The cased models are used because uppercase text carries strong emotional signals in this task. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with 500 warm-up steps and train for 10 epochs with a learning rate of 2e-5 and a batch size of 8. The last checkpoint is used for evaluation.
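For a sense of scale, the sketch below turns these settings into a step count and a warm-up schedule. The linear shape of the warm-up, holding the rate constant afterwards, and training on all 687 memes are assumptions; only the peak rate, the number of warm-up steps, the batch size, and the epoch count come from the settings above.

```python
import math

BASE_LR = 2e-5       # peak learning rate
WARMUP_STEPS = 500   # number of warm-up steps
EPOCHS = 10
BATCH_SIZE = 8
TRAIN_SIZE = 687     # assumption: the full training set is used


def learning_rate(step: int) -> float:
    """Linear warm-up to BASE_LR, then constant (post-warm-up shape assumed)."""
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    return BASE_LR


steps_per_epoch = math.ceil(TRAIN_SIZE / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_fraction = WARMUP_STEPS / total_steps
```

Under these assumptions the warm-up covers more than half of the updates on the task data, which suggests that most of the run uses a learning rate below the peak.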

Submitted Systems
Post-processing Repetition means repeating the same message over and over again so that the audience will eventually accept it. Therefore, we assign the Repetition label if any bigram appears more than 3 times.
Ensemble We use a model ensemble for the final submission. In particular, for subtask1 we explore 5 pre-trained models (using BCE Loss, Focal Loss and Transfer Learning, respectively), and for subtask3 we additionally explore face, OCR, multimodal representations and the fusion strategies. We take the probabilities of these settings and average them to make the final prediction.
Table 1 and Table 2 list the results of the top-performing teams for subtask1 and subtask3. We can see that our proposed model is ranked 1st for subtask1 and 2nd for subtask3 among all teams.
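The Repetition rule and the probability averaging used for the ensemble can be sketched as follows. Lowercasing and whitespace tokenization are assumptions, as the tokenizer behind the bigram rule is not specified, and the function names and probability values are hypothetical.

```python
from collections import Counter
from typing import List


def has_repeated_bigram(text: str, min_count: int = 4) -> bool:
    """True if some word bigram occurs more than 3 times (at least min_count)."""
    tokens = text.lower().split()  # assumption: lowercase, whitespace tokens
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return any(count >= min_count for count in bigram_counts.values())


def ensemble_average(prob_lists: List[List[float]]) -> List[float]:
    """Average per-label probabilities across models."""
    if not prob_lists:
        raise ValueError("need at least one model's probabilities")
    n_models = len(prob_lists)
    return [sum(column) / n_models for column in zip(*prob_lists)]


repeated = has_repeated_bigram("buy now buy now buy now buy now")
# Hypothetical per-label probabilities from three models for one meme.
averaged = ensemble_average([[0.9, 0.2], [0.7, 0.4], [0.8, 0.3]])
```

The averaged probabilities would then be thresholded to obtain the final label set, with the Repetition label added whenever the bigram rule fires.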

Discussion
In this section, we conduct more thorough analyses to answer two questions: (1) How well does transfer learning perform when little data is available? (2) How well does multimodal fusion perform on multimodal data? Moreover, we give error analyses on the test dataset to provide an overview of problematic labels.

Transfer Learning
We perform an ablation study for each PLM (row) and each learning method (column) in Table 3 for subtask1. It shows that: First, RoBERTa and DeBERTa are generally the best performing models. Given that RoBERTa and DeBERTa are carefully tuned models based on BERT, this result is reasonable.
Second, both Focal Loss and Transfer Learning help to alleviate data sparsity problems. Focal Loss improves DeBERTa and ALBERT by 0.7 and 0.6 points, respectively, because it assigns higher weights to sparse samples and lower weights to frequent ones. Transfer Learning improves BERT, RoBERTa and XLNet by 1.0, 4.0 and 5.7 points, respectively. RoBERTa with Transfer Learning achieves the best single model score. Transfer Learning helps transfer the parameters trained on related data or tasks to the newer model. Instead of learning from scratch, the newer model can leverage this knowledge to tackle problems related to the scarcity of data.

Multimodal Fusion
For subtask3, we compare different multimodal representations in Table 4 and fusion strategies in Table 5. We find that:
(1) Both the OCR Representation and the Multimodal Representation models outperform the Text Representation model. The OCR Representation can additionally capture the relative spatial relationships, rather than only the sequential information, among texts in an image.
(2) The Multimodal Representation model achieves the best single model performance since it jointly aligns the semantics between image and text and thus is effective for the vision-language understanding task.
(3) Table 5 lists the results of different fusion strategies. We combine the text and face representations since they are the minimal semantic elements in the image. Concat obtains the best result on both Macro-F1 and Micro-F1 metrics, though it is the simplest strategy for fusion.

Error Analysis
To provide an overview of problematic labels, we give an error analysis in Table 6 and Table 7. We find that:
(1) Loaded Language and Name Calling, which are the most frequent labels, show reasonably good performance (0.8190 and 0.6667 F1 score).
(2) On the other hand, for labels with fewer training samples (less than 20), the system tends not to predict them at all. Additionally, we find that the rules for Repetition do not work: all the predicted Repetition labels are wrong.
(3) Slogans, Glittering generalities and Smears are relatively hard to identify. Meanwhile, the Recall values of Transfer and Strong Emotions for subtask3 are less than 0.1, because there are too few training samples to fit the network parameters well.

Conclusion
In this paper, we adopt transfer learning to handle data sparsity problems for subtask1, and fuse heterogeneous multimodal representations for subtask3. The experimental results show that transfer learning can leverage knowledge from source data to tackle problems related to the scarcity of data, and that heterogeneous visual representations (i.e., face, OCR, and multimodal representations) provide complementary features.
In future work, we plan to explore fine-grained multimodal fusion with token representations in text and object features in images.