DUTH at SemEval-2020 Task 11: BERT with Entity Mapping for Propaganda Classification

This report describes the methods employed by the Democritus University of Thrace (DUTH) team for participating in SemEval-2020 Task 11: Detection of Propaganda Techniques in News Articles. Our team dealt with Subtask 2: Technique Classification. We used shallow Natural Language Processing (NLP) preprocessing techniques to reduce the noise in the dataset, feature selection methods, and common supervised machine learning algorithms. Our final model is based on the BERT system with entity mapping. To improve our model’s accuracy, we mapped certain words into five distinct categories by employing word-classes and entity recognition.


Introduction
According to the Institute for Propaganda Analysis, propaganda is an expression of an opinion or an action by individuals or groups deliberately designed to influence the opinions or the actions of other individuals or groups concerning predetermined ends. With the rapid changes brought about by the World Wide Web, the means available for spreading propaganda are more numerous than ever before. The fact that news outlets can nowadays reach millions of people through their websites or social media demonstrates how easy it is to manipulate people with propaganda techniques or fake news. For example, political forecasts severely underperformed in predicting the results of the 2016 US presidential election and the United Kingdom European Union membership referendum (Brexit) as opposed to the consensus in social media, which is indicative of the new challenges that are upon us (Hall et al., 2018).
The SemEval-2020 Task 11: Detection of Propaganda Techniques in News Articles aims to produce models that can identify text fragments with various propaganda techniques. The first subtask is a binary sequence tagging task in which a model has to return the spans that contain at least one propaganda technique. The second subtask is a multi-class classification task in which, given a text fragment and the article it occurs in, participants must classify the fragment into one of 14 different propaganda classes. More details on the task can be found in the task description paper (Da San Martino et al., 2020).
The rest of this paper is structured as follows. Section 2 outlines some previous studies of propaganda identification. Section 3 describes our approach, while Sections 4 and 5 present experiments and results, respectively. Conclusions are summarized in Section 6.

Background
Proppy is a publicly available real-time propaganda detection system for online news. The system uses four modules: article retrieval, event identification, deduplication, and propaganda index computation. To organize the news based on their propagandistic content, its authors showed that, when identifying propaganda, approaches that use word n-grams are less effective than those that use character n-grams and other style features. Additionally, other work has focused on detecting the text fragments that contain propaganda techniques as well as their type, as opposed to addressing propaganda detection at the document level. Rashkin et al. (2017) described the need for examining lexical features when trying to understand the differences between more and less reliable digital news sources. They studied the usefulness of linguistic morphology in different types of fake news such as propaganda, satire, and hoaxes. They also created a corpus of categorized news articles with labels such as propaganda, trusted, hoax, or satire. In another major study, Rashkin et al. (2019) noted the importance of discovering relationships between different propaganda techniques. They hypothesized that finding common traits could prove helpful in classification tasks. In our approach, we investigated the effects of entity mapping in certain classes, and our conclusions are in line with Rashkin et al. (2017) concerning the existence of conceptual and linguistic relationships between propaganda techniques.
In recent years, there have been some significant landmarks in the NLP field. ELMo (Peters et al., 2018), ULMFiT (Howard and Ruder, 2018), OpenAI GPT (Radford et al., 2018), and BERT (Devlin et al., 2018) are large-scale models that have massively improved the results in many NLP tasks. These systems provide models that have been pre-trained on massive corpora of unlabeled data and require fine-tuning on task-specific data. Although these systems offer excellent results, there is a need for further experimentation, as noted by Hua (2019), who highlights BERT's shortcomings in real-world scenarios.

Approach
This section describes our approach to mapping certain words into five distinct categories by employing word-classes and entity recognition. It also introduces the BERT model which was employed for our final submission.

Mapping the Dataset
The main idea of our method was to investigate the relationship between different entities and whether they are used in similar ways. This is demonstrated with the examples in Table 1. The Flag Waving technique is an example of how words that bear no similarity in a bag-of-words representation can have exactly the same semantic value for propaganda technique classification. In this example, 'Soviet Union' and 'Iran' have the same value (both being countries) for propaganda classification.

Propaganda Technique    | Propaganda Extract
Flag Waving             | 'This is not the Soviet Union, this is not Iran or Riyadh this is America.'
Name Calling,Labeling   | 'fascist propaganda tropes.'
Slogans                 | 'Make America Great Again'
Table 1: Samples from different labels
The same applies for entities such as 'communists' and 'fascist' (political ideologies) and 'Christians' and 'Muslims' (religious groups). The hypothesis is that, for propaganda classification, when someone wants to attack another nation or a certain group through propaganda, it is less important which group or nation initiates or receives the attack. Thus, we made three lists that aim to reduce the noise in data that is produced from various countries, religious or political groups. We also made a list that contained different slogans to help with the Slogans category.
The lists we created are the following and can be found on GitHub:
• List Countries: The names of 255 countries as well as some variations such as 'America' or 'UK'.
• List Religion: 35 words that relate to religion such as 'Catholic' and 'Muslim'.
• List Politics: 23 words that relate to politics such as 'Democrat' or 'Republicans'.
• List Slogans: 41 slogans such as 'War on Terror' or 'Build the wall'.
We scanned the dataset for those instances and replaced them with the following tags: NATION, RELIGION, POLITICS, and SLOGANS. The final results showed that this approach significantly improved on the basic BERT model.
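The list-based replacement described above can be sketched as follows. This is a minimal illustration, not our exact implementation: the shortened lists, the `TAG_LISTS` dictionary, and the function names are hypothetical, and we assume case-insensitive whole-word matching with slogans replaced first so that a slogan containing a country name is tagged as a slogan.

```python
import re

# Hypothetical, heavily shortened stand-ins for the four lists described above;
# the real lists are much longer (255 countries, 35 religion terms, and so on).
TAG_LISTS = {
    "SLOGANS": ["War on Terror", "Build the wall"],
    "NATION": ["Soviet Union", "Iran", "America", "UK"],
    "RELIGION": ["Catholic", "Muslim"],
    "POLITICS": ["Democrat", "Republicans"],
}


def build_pattern(phrases):
    """Compile one case-insensitive regex matching any phrase as a whole word."""
    # Longest phrases first so multi-word entries win over shorter ones.
    ordered = sorted(phrases, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", flags=re.IGNORECASE)


# Dictionaries keep insertion order, so SLOGANS is applied before NATION.
PATTERNS = {tag: build_pattern(phrases) for tag, phrases in TAG_LISTS.items()}


def map_entities(text):
    """Replace every listed word or phrase in `text` with its category tag."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(tag, text)
    return text
```

Applied to the Flag Waving extract from Table 1, `map_entities` turns 'Soviet Union', 'Iran', and 'America' into NATION and leaves 'Riyadh' untouched, because it is not in the shortened list used here. Whole-word matching also means that inflected forms such as 'Muslims' are only replaced if they are listed explicitly.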

Named Entity Recognition
Named Entity Recognition is the process of identifying proper names and classifying them into categories such as persons, organizations, locations, etc. This process is vital for many NLP applications (Petasis et al., 2001). Carrying on with our previous hypothesis, we also experimented with entity recognition. We noticed that in many instances of propaganda, there was a use of names of politicians that could be grouped to help the accuracy of our model. Although we experimented with many different entity groups/types such as Nationalities and Organisations, the best results came with the People entities.
To achieve this, we used spaCy's named entity recognizer, which has been trained on the OntoNotes 5 corpus (Pradhan et al., 2007). After the recognition, we replaced each recognized person entity with the PERSON tag. This approach yielded our best results in the Flag Waving category.
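The replacement step can be sketched independently of the recognizer. The function below is a hypothetical helper, not our exact code: it assumes the entities arrive as (start, end, label) character spans, which is the information spaCy exposes on `doc.ents` through `ent.start_char`, `ent.end_char`, and `ent.label_`.

```python
def replace_entities(text, entities, target_label="PERSON"):
    """Replace every span labelled `target_label` with the label itself.

    `entities` is an iterable of (start_char, end_char, label) tuples.
    Spans with other labels, and spans overlapping an already replaced
    span, are left untouched.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        if label != target_label or start < cursor:
            continue  # skip other entity types and overlapping spans
        pieces.append(text[cursor:start])  # text before the entity
        pieces.append(target_label)        # the tag replaces the entity
        cursor = end
    pieces.append(text[cursor:])  # remaining text after the last entity
    return "".join(pieces)


# With spaCy installed, the spans could be obtained roughly like this
# (the model name is an assumption; the text above does not name one):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   doc = nlp(text)
#   entities = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
```

Working on character spans rather than on token strings keeps the rest of the fragment byte-for-byte intact, so the same helper could be reused for other entity types such as NORP or ORG by changing `target_label`.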

BERT - Bidirectional Encoder Representations from Transformers
BERT is a language representation model that was introduced by Devlin et al. (2018). It stands for Bidirectional Encoder Representations from Transformers. BERT pre-trains deep bidirectional representations from text that has not been labeled. The fact that BERT is deeply bidirectional allows it to learn information during training from both sides of a token's context. The following two steps are involved in BERT.
The BERT model has been pre-trained on the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). In the first step, we fine-tuned the BERT model on different versions of the dataset that was provided by the organizers. BERT requires input data to be in a specific format: the special [CLS] token marks the beginning of the input, and the [SEP] token marks the separation or end of sentences. The input is represented as: [CLS] propaganda extract [SEP]. The next step was to tokenize the propaganda extracts into tokens that match BERT's vocabulary. For tokenization, we used BERT's BertTokenizer. For fine-tuning, we used BertForSequenceClassification, a BERT transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output). Following the recommendations of Devlin et al. (2018), we trained with a batch size of 32, a learning rate of 2e-5, and 4 epochs.
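The input layout and the training hyperparameters above can be summarized in a small sketch. It is illustrative only: `FineTuneConfig`, `format_for_bert`, and `total_steps` are hypothetical names, and the token list stands in for the WordPiece tokens that BertTokenizer would actually produce.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters reported above for fine-tuning."""
    batch_size: int = 32
    learning_rate: float = 2e-5
    epochs: int = 4


def format_for_bert(tokens):
    """Wrap a tokenized extract with the [CLS] and [SEP] special tokens."""
    return ["[CLS]"] + list(tokens) + ["[SEP]"]


def total_steps(num_examples, config):
    """Number of optimizer updates: batches per epoch times epochs."""
    batches_per_epoch = math.ceil(num_examples / config.batch_size)
    return batches_per_epoch * config.epochs
```

For example, if all 6,129 labelled fragments were used for training (an assumption made only for this calculation), one epoch would consist of 192 batches, and 4 epochs would amount to 768 updates.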

Experimental Setup
In this section, we describe the experimental setup of this study, providing information for the dataset and the parameters of machine learning algorithms, respectively.

Dataset
The organizers provided three datasets: Training, Development, and Test. The training dataset consisted of 357 articles in text format, retrieved with Python's newspaper3k. For the second subtask, the organizers provided a text file with 6,129 propaganda text fragments, belonging to 13 categories, alongside their respective article id and the spans in which the technique was located in the article. The 13 categories/labels are shown in Table 4. The dataset is imbalanced, since the Name Calling,Labeling and Loaded Language labels jointly constitute 50% of the dataset.

Pre-Processing
We tested various pre-processing techniques and, following the conclusions of Symeonidis et al. (2018), applied the following: Remove Numbers, Remove Punctuation, Remove Symbols, Lowercase, and Replace all URL addresses, normalizing them to 'URL'.
We chose not to remove stopwords due to the results of our previous work on SemEval-2019 Task 8: Fact Checking in Community Question Answering Forums (Bairaktaris et al., 2019). In that work, we concluded that stopwords can prove important for certain tasks. For example, a common word such as 'believe' can strongly indicate opinion and as such is useful.
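The pre-processing steps listed above can be sketched with regular expressions. The patterns and the order of operations are assumptions made for illustration: URLs are normalized before punctuation is stripped (otherwise they would be broken apart), and lowercasing is applied first so that the 'URL' placeholder keeps its capitalization.

```python
import re

# Assumed URL pattern: anything starting with http(s):// or www.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", flags=re.IGNORECASE)
NUMBER_RE = re.compile(r"\d+")
# Anything that is neither a word character nor whitespace, plus underscores:
# this covers both punctuation and symbols.
NON_WORD_RE = re.compile(r"[^\w\s]|_")
SPACES_RE = re.compile(r"\s+")


def preprocess(text):
    """Lowercase, normalize URLs to 'URL', drop numbers, punctuation, symbols."""
    text = text.lower()
    text = URL_RE.sub(" URL ", text)   # must run before punctuation removal
    text = NUMBER_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    return SPACES_RE.sub(" ", text).strip()  # collapse leftover whitespace
```

Stopwords are deliberately kept, in line with the choice described above.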

Machine Learning Model
Before using the BERT model for our final submission, we used standard machine learning methods for our experiments. We briefly present these methods, which, in just two classes (Oversimplification and Flag Waving), performed better than the BERT model techniques on the development set.
For the training of our classifiers, we used Python's Scikit-Learn library (Pedregosa et al., 2011). We split the given pre-labelled data into 2/3 training and 1/3 development set (2:1 ratio). After the split, the training set was shuffled, and we tested a sequence of tuning parameters on the development set. When the test set was provided by the task organizers, we re-trained the classifiers on the full training set and tested on the organizers' test set.
Vectorizer: We compared three common vectorizers: CountVectorizer, HashingVectorizer, and TfidfVectorizer. We selected the TfidfVectorizer since it yielded the best results.
Classifiers: We tested various classifiers and decided to use the following three: SGDClassifier, RidgeClassifier, and LinearSVC, as they yielded the best micro-averaged F1 results.
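A minimal sketch of this setup with Scikit-Learn is given below. The toy fragments, the default hyperparameters, and the function name `train_and_score` are assumptions for illustration; the tuned parameter values are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy fragments and labels, invented purely for illustration.
TOY_TEXTS = [
    "this is our great nation", "stand up for the flag",
    "our country comes first", "defend the homeland now",
    "build the wall", "make it great again",
    "yes we can", "war on terror",
    "those fascist liars", "the corrupt elite",
    "radical extremist thugs", "a bunch of traitors",
]
TOY_LABELS = (
    ["Flag Waving"] * 4 + ["Slogans"] * 4 + ["Name Calling,Labeling"] * 4
)


def train_and_score(texts, labels, seed=0):
    """Split 2:1, fit TF-IDF + RidgeClassifier, return micro-F1 and predictions."""
    train_x, dev_x, train_y, dev_y = train_test_split(
        texts, labels, test_size=1 / 3, shuffle=True, random_state=seed
    )
    # Default settings are used here; in practice they would be tuned
    # on the development set.
    model = make_pipeline(TfidfVectorizer(), RidgeClassifier())
    model.fit(train_x, train_y)
    predictions = model.predict(dev_x)
    return f1_score(dev_y, predictions, average="micro"), predictions
```

Swapping `RidgeClassifier()` for `SGDClassifier()` or `LinearSVC()` in the pipeline is enough to compare the three classifiers under the same split. Note that for single-label multi-class data such as this, the micro-averaged F1 equals accuracy.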

Results
This section summarizes our experimental results. Before our officially submitted run, we present some additional experiments.

Machine Learning Model results
In Table 2, we present the results of our machine learning Baseline Model. The Baseline Model uses the RidgeClassifier, as described in Section 4.3, since it yielded the best results during training. We show the F1-score of the classifier when the Baseline Model was trained with the mapped datasets that we described in Sections 3.1 and 3.2. For the entity recognition, we used a variety of entities such as persons, nationalities, organisations, countries, cities, and locations. As we mentioned in Section 3.2, the PERSON entity achieved the best results.
The Baseline Model achieved some notable results on the development set for two labels. In the Oversimplification label, the Baseline Model yielded a micro-averaged F1 of 29%, as opposed to the basic fine-tuned BERT, which failed to recognize this class. Furthermore, for the Flag Waving label, the Baseline Model scored 1% higher than our best BERT model. However, as we can see in Table 3, the BERT model achieved better overall results and was selected for our final submission.

BERT Model Results
When fine-tuning the BERT model, we tried various approaches with the dataset. We tried using the raw dataset as well as a pre-processed one. Although pre-processing (with the techniques that we mentioned in Section 4.2) improved results over the raw dataset, when we applied the mapping and the named entity recognition techniques we observed that pre-processing did not help achieve better results. The results are presented in Table 3.

Final Submission Results
By examining the results of our BERT models, we concluded that the best results came from mapping the dataset with the NATION, RELIGION, and POLITICS labels. The second best approach used the PERSON tag, which outperformed our best model in the Bandwagon, Flag Waving, Labeling, and Cliches categories. Our official submission to the competition ranked 10th out of 32 teams.
The results of our model are shown in Table 4.

Conclusions
We presented a supervised learning model for classifying text fragments from news articles into thirteen propaganda categories. We used standard classification techniques as well as modern NLP models such as BERT. We examined the task from a sociological point of view and tested the hypothesis that different entities of the same type can have the same value for propaganda classification. The results were promising, and further experiments could improve them.