LIIR at SemEval-2021 task 6: Detection of Persuasion Techniques In Texts and Images using CLIP features

We describe our approach for SemEval-2021 Task 6 on the detection of persuasion techniques in multimodal content (memes). Our system combines pretrained multimodal models (CLIP) and chained classifiers. In addition, we enrich the data with a back-translation-based augmentation technique. Our submission ranks 8/16 in terms of micro-F1 and 9/16 in terms of macro-F1 on the test set.


Introduction
Online propaganda is potentially harmful to society, and automated propaganda detection has been suggested as a way to alleviate its risks (Martino et al., 2020b). In particular, providing a justification when performing propaganda detection is important for the acceptability and application of the decisions. Previous challenges have focused on the detection of propaganda techniques in news articles (Martino et al., 2020a). However, many use cases do not solely involve text, but can also involve other modalities, notably images. Task 6 of SemEval-2021 proposes a shared task on the detection of persuasion techniques in memes, where both images and text are involved. Subtasks 1 and 2 deal with text in isolation, but we focus on Subtask 3: visuolinguistic persuasion technique detection.
This article presents the system behind our submission for Subtask 3 (Dimitrov et al., 2021). To handle this problem, we use a model with three components: data augmentation, image and text feature extraction, and a chained classifier. First, given an image-text pair as input, we paraphrase the text part using back-translation and pair it again with the corresponding image to enrich the data. Then, we extract visual and textual features using the CLIP (Radford et al., 2021) image encoder and text encoder, respectively. Finally, we use a chained classifier to model the relations between labels for the final prediction. Our proposed method, named LIIR, achieves performance competitive with the best-performing methods in the competition. Empirical results also show that the augmentation approach is effective in improving the results.
The rest of the article is organized as follows. The next section reviews related work. Section 3 describes our methodology. We discuss experiments and evaluation results in Sections 4 and 5, respectively. Finally, the last section concludes our work.

Related work
This work is related to computational techniques for automated propaganda detection (Martino et al., 2020b) and is the continuation of a previous shared task (Martino et al., 2020a).
Task 11 of SemEval-2020 proposed a more fine-grained analysis by also identifying the underlying techniques behind propaganda in news text, with annotations derived from previously proposed typologies of propaganda techniques (Miller, 1939; Robinson, 2019).
The current iteration of the task tackles a more challenging domain by including multimodal content, notably memes. The subtle interaction between text and image is an open challenge for state-of-the-art multimodal models. For instance, the Hateful Memes challenge (Kiela et al., 2020) was recently proposed as a binary task for the detection of hateful content. Recent advances in the pretraining of visuolinguistic representations (Chen et al., 2020) have brought models closer to human accuracy (Sandulescu, 2020).

Methodology
In this section, we introduce the design of our proposed method. The overall architecture is depicted in Figure 1. Our model consists of several components: a data augmentation component (back-translation), a feature extraction component (CLIP), and a chained classifier. Details of each component are described in the following subsections.

Augmentation Method
One of the challenges in this subtask is the small amount of training data: the organizers provided just 290 training samples. To enrich the training set, we use the back-translation technique (Sennrich et al., 2016), which paraphrases a given sentence by translating it to a specific target language and translating it back to the original language. To this end, we use four translation models provided by Ng et al. (2019): English-to-German, German-to-English, English-to-Russian, and Russian-to-English. For each training sentence, we thus obtain two paraphrased versions. At test time, we average the probability distributions over the original and paraphrased sentence-image pairs.
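A minimal sketch of this procedure follows. The back-translation step is shown only in comments (it relies on the fairseq torch.hub interface to the Ng et al. (2019) WMT19 models and is too heavy to run inline); the `average_predictions` helper is an illustrative name for the test-time averaging described above.

```python
import numpy as np

# Back-translation sketch (not run here): the WMT19 single models of
# Ng et al. (2019) are available through the fairseq torch.hub interface:
#   en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de.single_model')
#   de2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.de-en.single_model')
#   paraphrase = de2en.translate(en2de.translate(sentence))
# The English-Russian round trip yields the second paraphrase.

def average_predictions(prob_original, prob_paraphrases):
    """Test-time augmentation: average the label-probability vectors
    predicted for the original and the back-translated sentence-image pairs."""
    stacked = np.vstack([prob_original] + list(prob_paraphrases))
    return stacked.mean(axis=0)
```

With the two round trips above, `prob_paraphrases` holds two probability vectors, so each final prediction averages three distributions.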

Feature Extraction
Our system is built on a combination of pretrained visuolinguistic and linguistic models.
We use CLIP (Radford et al., 2021) as a pretrained visuolinguistic model. CLIP provides an image encoder f_i and a text encoder f_t, pretrained to predict matching image/text pairs. The training objective incentivizes high values of the dot product f_i(I) · f_t(T) if I and T match in the training corpus, and low values if they do not. Instead of using the dot product, we create features with the element-wise product f_i(I) ⊙ f_t(T) of the image and text encodings. This enables aspect-based representations of the matching between image and text. We experimented with other compositions (Sileo et al., 2019b), which did not lead to significant improvement.
We then use a classifier C on top of the element-wise product f_i(I) ⊙ f_t(T) to predict the labels.
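A minimal sketch of this composition, with the embeddings as plain vectors (the comments show how they would be obtained with OpenAI's `clip` package; `compose_features` is an illustrative name, not from the paper):

```python
import numpy as np

# Obtaining the embeddings (not run here) with OpenAI's clip package:
#   model, preprocess = clip.load("ViT-B/32")
#   image_emb = model.encode_image(preprocess(img).unsqueeze(0))
#   text_emb = model.encode_text(clip.tokenize([caption]))

def compose_features(image_emb, text_emb):
    """Element-wise product of the CLIP image and text embeddings.
    Summing its entries recovers the CLIP dot-product matching score,
    so each dimension acts as an aspect-wise image/text match feature."""
    return image_emb * text_emb
```

The design choice: the dot product collapses the match into one scalar, while the element-wise product keeps one match feature per embedding dimension for the downstream classifier.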

Chained Classifier
In this task, we are dealing with a multilabel classification problem: we need to predict a subset of labels for a given image-text pair as input. We noticed that label co-occurrences were not uniformly distributed, as shown in Figure 2. To further address the data sparsity, we use another inductive bias at the classifier level with a chained classifier (Read et al., 2009), using the scikit-learn implementation (Pedregosa et al., 2011). Instead of considering each classification task independently, a chained classifier begins with the training of one classifier for each of the L labels, but we also sequentially train L other classifier instances thereafter, each of them using the outputs of the previous classifier as input. This allows our model to capture the correlations between labels. We use a logistic regression with default parameters as our base classifier.
Our chained classifier uses the combined image and text features as input. We pass the classifier's predicted scores through a sigmoid activation function to make the probability values more discriminative (Ghadery et al., 2018). Then we apply thresholding to the L label probabilities, since the task requires a discrete set of labels as output: we predict a label when the associated probability is above a given threshold. We optimize the threshold on the validation set with a simple grid search over values between 0.0 and 0.9 with a step of 0.005.
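The threshold search can be sketched as follows (`tune_threshold` is an illustrative name; micro-F1 is assumed as the selection criterion, matching the task's primary metric):

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(probs, y_true):
    """Grid-search the decision threshold on validation data over
    [0.0, 0.9) with step 0.005, keeping the value that maximizes micro-F1."""
    best_t, best_f1 = 0.0, -1.0
    for t in np.arange(0.0, 0.9, 0.005):
        pred = (probs >= t).astype(int)
        f1 = f1_score(y_true, pred, average="micro", zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```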

Datasets
We use the dataset provided by the SemEval-2021 organizers for Task 6. The dataset consists of 687 training samples (290 in the first release), where each sample is an image and its corresponding text. We use 10% of the training set as the validation set for hyperparameter tuning.

Results
In this section, we present the results obtained by our model on the test set for Subtask 3. Table 2 shows the results obtained by the submitted final model on the test set. All results are reported in terms of macro-F1 and micro-F1. Furthermore, we provide the results obtained by the random baseline, the best-performing method in the competition, and the median result for the sake of comparison. Note that we used the first released training set at the time of final submission, which contained just 290 training samples. Therefore, we also provide results obtained by our model after using all 687 provided training samples. The results show that LIIR achieves a good performance compared to the random baseline and the median result, which demonstrates that our model can effectively identify persuasion techniques in text and images. We can also observe that LIIR achieves a performance competitive with the best result obtained by the best team in the competition when it uses all the training samples.

Ablation Analysis
In this part, we provide an ablation study of the effect of the different components of our proposed method on the dev set. First, we show the effect of using just visual features, just textual features, and both. Furthermore, we examine how the final results of our model were influenced by the augmentation method. Table 3 shows the ablation study on the effect of using different features. The first observation is that the image features contain more information than the textual features. Also, the best micro-F1 score is obtained when we combine both visual and textual features. These results show the effectiveness of our method in making use of both visual and textual information. In Table 4, the effect of the augmentation technique is shown. As the results show, the augmentation approach improves model performance by a large margin.

System                   Macro-F1   Micro-F1
LIIR w/o Augmentation    0.25090    0.54952
LIIR w/ Augmentation     0.29972    0.58312

Table 4: Ablation analysis for the effect of the augmentation method on the dev set.

Negative Results
We also tried to use CLIP as a zero-shot classifier for propaganda technique detection. To do so, we constructed a prompt for each label: for each input image/text pair, we used CLIP to estimate an affinity score between each prompt and the image. Since CLIP is designed to predict the relatedness between an input image and text, we expected that a prompt mentioning the relevant propaganda technique would be associated with a higher probability than the others.
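The prompt texts themselves are not reproduced here; the scoring step can be sketched as follows, with the embeddings as plain vectors (CLIP would produce them via `encode_image`/`encode_text`; `zero_shot_scores` is an illustrative name):

```python
import numpy as np

def zero_shot_scores(image_emb, prompt_embs):
    """Cosine-similarity affinities between one image embedding and one
    prompt embedding per label, softmax-normalized as in CLIP's
    zero-shot classification setup."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    prompt_embs = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    logits = prompt_embs @ image_emb
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```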
However, this method did not perform better than chance. This suggests that the propaganda technique detection task might be too abstract for CLIP in zero-shot settings.

Conclusion
We described our submission for the shared task of multimodal propaganda technique detection at SemEval-2021. Our system achieves performance competitive with other systems, even though we use a simple architecture with no ensembling, by leveraging unsupervised learning. We believe that further work on zero-shot learning would be a valuable way to improve propaganda technique detection for the least frequent labels.

Acknowledgments
This research was funded by the CELSA project from the KU Leuven with grant number CELSA/19/018.