UMSIForeseer at SemEval-2020 Task 11: Propaganda Detection by Fine-Tuning BERT with Resampling and Ensemble Learning

We describe our participation at the SemEval 2020 “Detection of Propaganda Techniques in News Articles” - Techniques Classification (TC) task, designed to categorize textual fragments into one of the 14 given propaganda techniques. Our solution leverages pre-trained BERT models. We present our model implementations, evaluation results and analysis of these results. We also investigate the potential of combining language models with resampling and ensemble learning methods to deal with data imbalance and improve performance.


Introduction
Propaganda techniques are used to promote or publicize a particular political cause or point of view, especially of a biased or misleading nature (Orlov and Litvak, 2018). To achieve the desired outcome, propagandists frequently rely on psychological and rhetorical techniques. While people initially tend to agree with propaganda messages because of their misuse of logic and/or arousal of an emotional response, they later change their opinion and realize that the arguments are not convincing. Indeed, for maximum effect, propaganda techniques are intended to go unnoticed. It is therefore important to detect propaganda at an early stage and to identify the specific propaganda techniques used. When propaganda is successfully detected and classified, people can look at information more rationally and logically.
This paper describes our solution to the Techniques Classification (TC) task of the SemEval 2020 Task 11 "Detection of Propaganda Techniques in News Articles" competition. The task requires classifying textual fragments that relate to at least one of the 14 given propaganda techniques. Our solution leverages a pre-trained BERT (Devlin et al., 2018) based classifier. Fine-tuning allows us to conveniently adjust its pre-trained bottom-layer weights on the propaganda detection corpus shared by the organizers. Among the 46 participating teams, our submission ranks 17th.
The rest of the paper is organized as follows. In Section 2 we give the problem definition. In Section 3 we present the propaganda dataset. In Section 4 we describe our solution and elaborate on the architectural and implementation details of the model used. In Section 5 we present the results achieved by our best submission and an analysis of these results. Finally, in Section 6 we conclude with directions for future work.

Problem Definition
Given a document and a propaganda-related text excerpt from the document, the task is to identify the specific propaganda technique present in the text excerpt from the 14 propaganda classes available (Da San Martino et al., 2020). Text excerpts can overlap and can reflect multiple propaganda techniques simultaneously, so in principle the propaganda identification task needs to be approached as a multi-label, multi-class classification problem. In practice, when multiple propaganda techniques are used in the same text fragment, the organizers created as many copies of that fragment as there are propaganda techniques present in it. This allows us to approach the problem as a single-label multi-class classification problem.

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
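The fragment duplication described above can be illustrated with a minimal sketch. The record layout (article id, start offset, end offset, list of techniques) and the function name `expand_to_single_label` are hypothetical and chosen only for illustration; they are not the organizers' actual file format.

```python
from typing import List, Tuple

# Hypothetical record layouts, used only for this illustration.
MultiLabelSpan = Tuple[str, int, int, List[str]]
SingleLabelSpan = Tuple[str, int, int, str]


def expand_to_single_label(spans: List[MultiLabelSpan]) -> List[SingleLabelSpan]:
    """Emit one copy of each fragment per technique attached to it."""
    expanded = []
    for article_id, start, end, techniques in spans:
        for technique in techniques:
            expanded.append((article_id, start, end, technique))
    return expanded


# Toy input: the second fragment carries two techniques.
spans = [
    ("a1", 0, 6, ["Name Calling, Labeling"]),
    ("a1", 40, 56, ["Loaded Language", "Exaggeration and Minimisation"]),
]
single = expand_to_single_label(spans)
print(len(single))  # 3
```

After expansion every record carries exactly one label, so a standard multi-class classifier can be trained on it.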

Dataset
The data shared with us by the competition organizers contains 8,982 plain text articles retrieved with the Newspaper3k library and divided into training, validation, and testing sets. The training set contains propaganda labels, while for the validation and testing articles we need to predict the correct labels. In Table 1 we provide an overview of the propaganda training dataset used in our experiments, and in Table 2 we list the 14 propaganda techniques and their training set frequencies (the validation and testing set propaganda labels are not shared with competition participants). Each article in the dataset consists of a title on the first line, followed by an empty second line, and the article content from the third line onwards, one sentence per line. In Figure 1 we include an example of a training set article from the Propaganda Detection corpus for which the propaganda labels are given. For instance, the term "babies" on the first line is an instance of the Name Calling, Labeling technique. On the fourth line, "stupid and petty" is Loaded Language and "not looking as though Trump killed his grandma" is an instance of Exaggeration and Minimisation. In Figure 2 we include an example of a test article from the Propaganda Detection corpus for which we need to identify the propaganda techniques present.
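The article layout described above (title, empty line, then one sentence per line) can be read with a few lines of code. The sketch below is an illustration under stated assumptions: the helper names `parse_article` and `extract_fragment` are hypothetical, the toy article is invented, and fragment offsets are assumed to be character positions into the raw article text with an exclusive end index.

```python
from typing import List, Tuple


def parse_article(text: str) -> Tuple[str, List[str]]:
    """Split a raw article into its title and its list of sentences."""
    lines = text.split("\n")
    title = lines[0].strip()
    # Line 2 is expected to be empty; the content starts on line 3.
    sentences = [line for line in lines[2:] if line.strip()]
    return title, sentences


def extract_fragment(text: str, start: int, end: int) -> str:
    """Return the fragment delimited by character offsets [start, end)."""
    return text[start:end]


# Invented toy article, not taken from the corpus.
article = "Example title\n\nFirst sentence.\nSecond sentence."
title, sentences = parse_article(article)
print(title)                            # Example title
print(sentences)                        # ['First sentence.', 'Second sentence.']
print(extract_fragment(article, 0, 7))  # Example
```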

Method
We address propaganda detection as a multi-class text classification problem, and our solution relies on the pre-trained BERT (Devlin et al., 2018) model. We choose BERT because it is already pre-trained on large amounts of data and has demonstrated strong performance on a multitude of natural language processing tasks, including text classification. We fine-tune BERT-base on the textual fragments delimited by start and end indices in the training set articles, for which we know the correct labels. We apply basic text pre-processing techniques, such as lowercasing and tokenization. We also convert the textual fragments into a format compatible with BERT: we add special tokens to mark the start ([CLS]) and end ([SEP]) of each sentence, pad sentences with the special token [PAD] so that all have the same length of 64 tokens, and differentiate real tokens from padding tokens using a mask list which contains 0 for padding tokens and 1 for all other tokens. In Figure 3 we include an example of a propaganda textual fragment after pre-processing.
Not all 14 classes are equally represented in the propaganda dataset, and to alleviate data imbalance issues we use oversampling and undersampling techniques. We oversample minority classes and undersample majority classes; please see Figure 4. When oversampling, we resample each class containing fewer than 400 samples to 400 samples by bootstrap sampling, i.e., random sampling with replacement. When undersampling, we resample every class containing more than 400 samples down to 400 samples, also by bootstrapping. We choose 400 as the number of samples per class because there are 6,129 propaganda samples in total in the training data (roughly 438 per class on average) and the goal is to have all 14 propaganda classes equally represented.
The data obtained after the resampling process is used in combination with bagging (Quinlan et al., 1996). We implement our solution in PyTorch and use the HuggingFace Transformers library (Wolf et al., 2019). We train the BERT-base-uncased model, which has 12 Transformer (Vaswani et al., 2017) layers and 110 million parameters. We choose model hyper-parameters by grid search, and for our submission we use the following values: a maximum sequence length of 64, a batch size of 32, a learning rate of 2e-5, and Adam (Kingma and Ba, 2014) as the optimization algorithm. We fine-tune for 4 epochs with 10-fold cross validation on the training dataset provided by the competition organizers. We optimize for the F1 score in our evaluation.
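The two data-preparation steps described in this section, converting a fragment to BERT's input format and resampling every class to 400 samples, can be sketched as follows. This is an illustration under stated assumptions rather than the submitted implementation: whitespace splitting stands in for BERT's WordPiece tokenizer, [CLS] and [SEP] are assumed as the start and end markers, truncation of long fragments is an assumption, and the function names and toy class names are hypothetical.

```python
import random
from collections import Counter, defaultdict

MAX_LEN = 64   # sequence length reported in the text
TARGET = 400   # samples per class reported in the text


def to_bert_input(fragment, max_len=MAX_LEN):
    """Lowercase, tokenize, add special tokens, pad, and build the mask."""
    # Whitespace splitting is a stand-in for the WordPiece tokenizer.
    tokens = fragment.lower().split()
    # Truncating to leave room for [CLS] and [SEP] is an assumption.
    tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
    n_pad = max_len - len(tokens)
    mask = [1] * len(tokens) + [0] * n_pad  # 1 = real token, 0 = padding
    return tokens + ["[PAD]"] * n_pad, mask


def resample_to_fixed_size(samples, labels, target=TARGET, seed=0):
    """Bootstrap every class (sampling with replacement) to `target` items."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # The same call oversamples small classes and undersamples large ones.
        out_samples.extend(rng.choices(items, k=target))
        out_labels.extend([label] * target)
    return out_samples, out_labels


tokens, mask = to_bert_input("stupid and petty")
print(tokens[:6], sum(mask))

# Toy data: one over-represented and one under-represented class.
samples = [f"fragment {i}" for i in range(1050)]
labels = ["majority"] * 1000 + ["minority"] * 50
_, balanced_labels = resample_to_fixed_size(samples, labels)
print(Counter(balanced_labels))
```

Because sampling is done with replacement in both directions, a minority class will contain repeated fragments, while a majority class loses some of its original fragments.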
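To make the training setup concrete, the reported hyper-parameters and the 10-fold split can be written down as a small sketch. The dictionary keys and the `k_fold_indices` helper are hypothetical names, and shuffling before splitting (without stratification) is an assumption; the text does not say how the folds were formed.

```python
import random

# Values reported in the text; the key names are illustrative.
HYPERPARAMS = {
    "model_name": "bert-base-uncased",
    "max_seq_length": 64,
    "batch_size": 32,
    "learning_rate": 2e-5,
    "epochs": 4,
    "n_folds": 10,
}


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffling is an assumption
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


# 6,129 is the number of training samples reported in the text.
splits = list(k_fold_indices(6129, HYPERPARAMS["n_folds"]))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each sample appears in exactly one validation fold, so every training fragment is used for validation once across the ten runs.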

Results and Analysis
In what follows we present results for our final BERT-based submission, which ranks 17th among 46 participating teams. Our final run achieves a validation F1-score of 0.5729 and a testing F1-score of 0.5514. Comparing our submission with those of other teams in the competition, we observe that submissions ranked 4-17 achieve F1-scores comparable to ours (the team ranked 4th reports an F1-score of 0.589), and that teams ranked 1-3 achieve higher F1-scores (0.617 and above).
Next we assess the performance of our run at a finer granularity, for each propaganda technique. In Table 3 we present the F1-scores for each propaganda class on the development and testing sets. On the one hand, we observe that the propaganda classes that are most represented in the training data achieve the best F1-scores on the validation and testing sets. Propaganda techniques such as Loaded Language and Name Calling, Labeling, which have the most samples in the training dataset, also rank highest on the validation and testing sets. On the other hand, propaganda techniques with fewer than 200 samples in the training data have validation F1-scores close to or equal to 0. This illustrates the severe class imbalance present in the training dataset.
In Table 3 we also present the performance of the BERT-based model when bagging is combined with oversampling and undersampling and the majority vote of all nine classifiers is selected as the final model prediction. We observe that training on balanced data samples with bagging alleviates the data scarcity problem for the propaganda classes on which we previously obtained very low F1-scores. The overall validation F1-score for BERT with bagging is 0.5093, which is lower than that of BERT without bagging (0.5729). Because we did not perform extensive hyperparameter tuning for BERT with bagging due to time constraints, we believe its performance could be improved considerably with carefully selected hyperparameters. The results also suggest that ensemble methods are a promising approach to reduce variance between propaganda classes and provide better model stability.
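The per-class F1-score discussed above is computed from the true positives, false positives, and false negatives of each class as F1 = 2TP / (2TP + FP + FN). The sketch below is a minimal illustration with invented toy predictions, not the official scorer; the function name `per_class_f1` is hypothetical, and a class with no true positives, false positives, or false negatives is assigned 0 by convention.

```python
def per_class_f1(y_true, y_pred):
    """Return a dict mapping each class to its F1-score."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Invented toy labels: the rare class is never predicted.
y_true = ["Loaded Language", "Loaded Language", "Loaded Language",
          "Name Calling, Labeling", "Exaggeration and Minimisation"]
y_pred = ["Loaded Language", "Loaded Language", "Name Calling, Labeling",
          "Name Calling, Labeling", "Loaded Language"]
scores = per_class_f1(y_true, y_pred)
for label, f1 in scores.items():
    print(f"{label}: {f1:.3f}")
```

In this toy case the rare class is never predicted and therefore scores 0, which mirrors the behaviour reported for under-represented techniques.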
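The majority-vote step over the nine classifiers can be sketched as follows. The function name `majority_vote` and the toy predictions are hypothetical, and the tie-breaking rule (the label that reaches the top count first in model order wins) is an assumption, since the text does not specify how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per fragment."""
    n_fragments = len(predictions_per_model[0])
    final = []
    for i in range(n_fragments):
        votes = Counter(preds[i] for preds in predictions_per_model)
        # most_common breaks ties by first occurrence (an assumption here).
        final.append(votes.most_common(1)[0][0])
    return final


# Nine toy classifiers voting on two fragments with labels "A" and "B".
predictions = [["A", "B"]] * 5 + [["B", "B"]] + [["B", "A"]] * 3
print(majority_vote(predictions))  # ['A', 'B']
```

With nine voters and 14 possible labels, ties are possible, so any real implementation would need an explicit tie-breaking policy.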