JUST at SemEval-2020 Task 11: Detecting Propaganda Techniques Using BERT Pre-trained Model

This paper presents our submission to SemEval-2020 Task 11, Detection of Propaganda Techniques in News Articles. Of the two subtasks in this competition, we participated in the Technique Classification (TC) subtask, which aims to identify the propaganda technique used in a given propaganda span. We implemented various models to detect propaganda. Our proposed model is based on the BERT uncased pre-trained language model, as BERT has achieved state-of-the-art performance on multiple NLP benchmarks. Our model scored an F1-score of 0.55307, outperforming the baseline model provided by the organizers (0.2519 F1-score), and is within 0.07 of the best-performing team. Compared to the other participating systems, our submission ranked 15th out of 31 participants.


Introduction
The high rate of social media use and the spread of digital news and online blogs have enabled massive amounts of data to reach a wide audience, and have also allowed non-journalists to disseminate false news, misinformation, hoaxes, and propaganda to mislead and deceive people (Tandoc Jr et al., 2018; Rubin et al., 2015; Baisa et al., 2017). Moreover, mass media and digital news are the main channels through which society receives this news (Gavrilenko et al., 2019). The reasons behind disseminating false information could be financial or political, or to mislead readers and influence their opinions negatively; it has also been argued to influence elections and threaten democracies (Shao et al., 2017; Abedalla et al., 2019). Propaganda is defined as "efforts by special interests to win over the public covertly by infiltrating messages into various channels of public expression ordinarily viewed as politically neutral" (Sproule, 1994). It aims to mislead audiences by influencing them toward a particular political or social agenda in news media (Volkova and Jang, 2018; Barrón-Cedeño et al., 2019). Therefore, several propaganda techniques and tools are designed to propagate certain ideologies. These techniques usually appeal to the audience's emotions and desires. In this work, we experimented with several models with different word embeddings to detect propaganda techniques. However, the final submission was based on BERT's uncased pre-trained language model, which achieved a significant performance. The rest of the paper is organized as follows. Section 2 describes the related work. Section 3 presents the dataset and preprocessing. Section 4 presents the model and architecture. Section 5 presents the experiments, and Section 6 discusses the results. Finally, Section 7 draws conclusions and sketches future work.

Related Work
It is not surprising that people have known and used propaganda for centuries (Shu et al., 2017). For example, during World War One in 1914, propaganda was used on a global scale, and later with the rise of the Nazi propaganda machine (Jewett, 1940) to mobilize hatred against the enemy. Detection of propaganda has gained massive interest in the research community in recent years. The automatic detection of propaganda was studied as part of the propaganda analysis project in (Barrón-Cedeño et al., 2019). The researchers provided the first publicly available propaganda detection system, called Proppy. This system is a real-world, real-time monitoring system to unmask propagandistic articles in online news. In (Gavrilenko et al., 2019), the researchers discussed the problem of identifying propaganda in online news content. They evaluated several neural network architectures, such as Long Short-Term Memory (LSTM), hierarchical bidirectional LSTM (H-LSTM), and Convolutional Neural Network (CNN), to classify text into propaganda and non-propaganda. Effective text preprocessing techniques were used, along with different word representation models including word2vec, Global Vectors (GloVe), and TF-IDF. The models were applied to Twitter data related to the Internet Research Agency (IRA), covering the relevant activities of the IRA from September 1 to November 15, 2016. The results showed that CNN with word2vec representation outperformed the other models, with an accuracy of 88.2%. The researchers in (Da San Martino et al., 2019) released the shared task on Fine-Grained Propaganda Detection as part of the NLP4IF workshop at EMNLP-IJCNLP 2019, which focused on detecting propaganda and specific propagandistic techniques in news articles at the sentence and fragment level (the SLC and FLC tasks, respectively).
The winning system in the SLC task (Mapes et al., 2019) was based on an attention transformer using the BERT language model, where the final layer of the model was replaced with a linear softmax layer. To obtain multi-head results, they used an ensemble of attention neural networks with 12 attention heads and 12 transformer blocks. The teams (Hua, 2019; Hou and Chen, 2019) also fine-tuned BERT to tackle the SLC task. Another team (Al-Omari et al., 2019) presented their proposed model to detect propaganda in the SLC task. They experimented with various combinations of deep learning models including XGBoost (Chen and Guestrin, 2016), BiLSTM, and BERT cased and uncased, with a set of features including affective words, lexical features, and word embeddings based on GloVe (Pennington et al., 2014). Their final model was an ensemble of XGBoost, BiLSTM, and BERT cased and uncased, which achieved a 0.6112 F1-score. For the FLC task, the team (Yoosuf and Yang, 2019) achieved the best results by applying 20-way token-level classification, where each token in the input article is classified into one of 20 token types. They fine-tuned the BERT uncased base model to classify the tokens by adding a linear head to the last layer of BERT.

Dataset and Preprocessing
The corpus provided by the task organizers includes 350 articles for training and 75 articles for development. The training articles contain 6,129 propaganda spans, and the development set contains 1,063 propaganda spans. Figure 1 shows the distribution of classes in the training set; as can be seen, the classes are imbalanced: Loaded_Language is the most frequent class with 2,123 samples, whereas Bandwagon,Reductio_ad_hitlerum is the least frequent class with 72 samples. Text preprocessing was performed on each span of the training and development sets, including removing punctuation, tokenization, cleaning the text from special symbols, and expanding contractions. All preprocessing steps were performed using the NLTK library.
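The preprocessing steps above can be sketched as follows. This is a minimal illustration using only the Python standard library (the actual system uses NLTK), and the abbreviated contraction map is a hypothetical stand-in for a full one:

```python
import re

# Hypothetical, abbreviated contraction map; a real one would be much larger.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "they're": "they are"}

def clean_span(span: str) -> list:
    """Clean a propaganda span: expand contractions, strip punctuation
    and special symbols, then tokenize on whitespace."""
    text = span.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    # Remove anything that is not a letter, digit, or whitespace
    # (covers both punctuation and special symbols).
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()

print(clean_span("They're the REAL enemy!"))
# ['they', 'are', 'the', 'real', 'enemy']
```

Each cleaned span, rather than the whole article, is what the downstream classifiers receive as input.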

System Overview
We used Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), one of the most powerful pre-trained language models, which has achieved state-of-the-art results in a wide variety of NLP tasks. BERT comes in two sizes, BERT-Base and BERT-Large; we used the base model as it needs fewer computational resources. Of the available pre-trained base models, we used the Uncased and Cased variants. We fine-tuned the pre-trained BERT models for multi-class classification: both bert-base-uncased and bert-base-cased have 12 transformer layers, a hidden size of 768, 12 attention heads, and 110 million parameters.
The BERT models were downloaded from TensorFlow Hub 1 . The architecture of the model is described in Figure 2. The cleaned span is represented using a special token [CLS] added in front of every span and another token [SEP] added at the end. The input is fed into the transformer layers to generate the prediction for each class. The models are trained for 15 epochs using a learning rate of 2e-5, a maximum sequence length of 128, and a batch size of 128.
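The [CLS]/[SEP] input layout and the fixed sequence length of 128 can be illustrated with a small sketch; working directly with token strings (rather than BERT's WordPiece vocabulary and ids) is a simplification for clarity:

```python
def format_bert_input(span_tokens, max_seq_length=128):
    """Wrap a cleaned span in BERT's special tokens and pad/truncate
    it to a fixed sequence length before feeding the model."""
    # Reserve two positions for [CLS] and [SEP], truncating if needed.
    tokens = ["[CLS]"] + list(span_tokens)[: max_seq_length - 2] + ["[SEP]"]
    # The input mask marks real tokens (1) versus padding (0).
    input_mask = [1] * len(tokens)
    while len(tokens) < max_seq_length:
        tokens.append("[PAD]")
        input_mask.append(0)
    return tokens, input_mask

tokens, mask = format_bert_input(["a", "loaded", "phrase"], max_seq_length=8)
print(tokens)
# ['[CLS]', 'a', 'loaded', 'phrase', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
print(sum(mask))  # 5
```

The hidden state at the [CLS] position is what the classification head uses to predict the technique label for the span.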

Experiments
In addition to BERT, we experimented with several deep learning models based on learned representations (embeddings) to detect propaganda techniques. The input to these models is encoded into embedding vectors using pre-trained word-level and sentence-level embeddings. We used the word2vec embeddings trained on Google News (Mikolov et al., 2013), where the sentence is encoded through an embedding layer, a lookup table of 300-dimensional pre-trained vectors representing each word. It is worth mentioning that we also experimented with GloVe (Pennington et al., 2014) and fastText (Joulin et al., 2016) embeddings, but the results were not promising. For sentence embeddings, we used the Universal Sentence Encoder (USE) (Cer et al., 2018), which generates a 512-dimensional representation for each sentence. The following deep learning models were applied to the dataset: • Neural Network (NN) model: the input is the 512-dimensional representation of each span encoded using USE. This input is passed to a fully connected neural network with three dense hidden layers of 128, 128, and 75 neurons, respectively. Dropout of 0.4 and 0.2 was added to avoid overfitting. The activation function for each hidden layer is ReLU, and the output layer is a dense layer with the softmax activation function.
• Convolutional Neural Network (CNN) model, following the (Kim, 2014) architecture. The first input to the CNN is a 300-dimensional representation of each token encoded using word2vec, where the embedding is kept static. The second input is the 512-dimensional USE representation of the span, which is passed to a neural network with two dense hidden layers of 256 and 128 nodes, respectively, and dropout of 0.4. The output of the CNN branch is then concatenated with the output of the neural network branch and passed to the dense output layer.
• Bidirectional Long Short-Term Memory (BiLSTM) model (Hochreiter and Schmidhuber, 1997) with two different inputs that are fed into BiLSTM layers and a fully connected neural network. The inputs are encoded using word2vec embeddings and USE.
The models are implemented using the Keras framework 2 , and USE is loaded from TensorFlow Hub 3 . For training, we used the Adam optimizer 4 with a learning rate of 0.001 and categorical cross-entropy as the loss function. The batch size is set to 32 and the number of epochs to 25.
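The NN model described above can be sketched in Keras. The layer sizes, dropout rates, optimizer, and loss follow the description in the text; the placement of the two dropout layers and the 14-class output size (the number of technique labels in the TC subtask) are our assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 14  # assumed: one output per propaganda technique label

def build_nn_model():
    """Fully connected network over 512-d USE span embeddings."""
    inputs = keras.Input(shape=(512,))  # USE sentence embedding of the span
    x = layers.Dense(128, activation="relu")(inputs)
    x = layers.Dropout(0.4)(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(75, activation="relu")(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_nn_model()
```

Training would then call `model.fit` on the USE-encoded spans with a batch size of 32 for 25 epochs, as stated above.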

Results and Discussion
We evaluated all the models directly on the development set, and the best model was chosen to generate predictions for the test data. Table 1 shows the results of the deep learning models, BERT-cased, and BERT-uncased on the development set. The uncased BERT model gives the best predictions, which implies that it works better than the cased model and outperforms the other deep learning models; hence, BERT is able to capture the propaganda techniques better than the other models. We also tested the deep learning models with only word2vec embeddings and noticed that text classification using sentence embeddings (USE) outperforms word-level embeddings and provides strong performance with minimal amounts of training data. This is expected, since word2vec embeddings are context-independent and do not encode the semantic relationships between the words in the input sequence.
In the test stage, since we were only allowed to submit a single run on the test set, we chose the model with the highest F1-score on the development set (0.5766) to generate predictions, which is BERT-uncased. The evaluation results on the test set are listed in Table 2; the model yielded a test F1-score of 0.55307.
Table 3 shows the class-wise scores. The model performs well on propaganda techniques that appear frequently in the articles, such as "Loaded_Language" and "Name_Calling,Labeling", which achieved 0.71958 and 0.64727 F1-score respectively on the test set, while BERT performs poorly on some propaganda types, such as "Exaggeration,Minimisation", "Bandwagon", "Reductio_ad_hitlerum", and "Repetition". Some propaganda techniques are challenging due to the way the span is shaped or the number of words in the span; for example, "Flag-Waving" can be shaped in different ways, and "Repetition" depends on the occurrence of a word (or more) and its repetition across the article. Therefore, it is not enough to look only at the span to make the prediction; the model needs more information, such as the article context and the count of each span word across the article, so including the whole article as context is needed. Due to the limited time, we did not experiment with the effect of adding these features. Finally, we noticed that BERT has strong performance in detecting some complex propaganda techniques while performing poorly on others, because of the class imbalance problem, which hurts performance on the minority classes and leads the model to favor predicting the majority class.
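One standard mitigation for the class imbalance discussed above, which we did not apply in the submitted system, is to weight the loss by inverse class frequency so that rare techniques such as "Bandwagon,Reductio_ad_hitlerum" contribute more per sample. A minimal sketch, with a toy label distribution standing in for the real one:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Compute a per-class weight proportional to the inverse of the
    class frequency; a perfectly balanced class gets weight 1.0."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    # weight_c = total / (n_classes * count_c)
    return {c: total / (n_classes * n) for c, n in counts.items()}

# Toy distribution mimicking the imbalance in the training set.
labels = ["Loaded_Language"] * 8 + ["Bandwagon,Reductio_ad_hitlerum"] * 2
print(inverse_frequency_weights(labels))
# {'Loaded_Language': 0.625, 'Bandwagon,Reductio_ad_hitlerum': 2.5}
```

Such a weight dictionary could be passed, for example, as the `class_weight` argument of Keras's `fit` during training.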

Conclusion
In this paper, we described our solution for the Technique Classification subtask of SemEval-2020 Task 11. We investigated several models, such as CNN, BiLSTM, and NN, with word and sentence embeddings including word2vec and USE, to detect propaganda techniques. However, our final solution was based on the BERT-uncased language model, which showed significant performance. The evaluations were performed using the dataset provided by the SemEval-2020 Task 11 organizers.
Our proposed model ranked in 15th place among 31 teams, achieving an F1-score of 0.55307, which outperformed the baseline model (0.25196). The results confirm that pre-trained language models (like BERT) are clearly a step forward for NLP and have strong performance in detecting propaganda techniques.