UAIC1860 at SemEval-2020 Task 11: Detection of Propaganda Techniques in News Articles

The “Detection of Propaganda Techniques in News Articles” task at the SemEval 2020 competition focuses on detecting and classifying propaganda, which is pervasive in news articles. In this paper, we present a system that evaluates, at the sentence level, three traditional text representation techniques: tf*idf, word n-grams and character n-grams. First, we built a binary classifier that labels each sentence as propaganda or non-propaganda. Second, we built a multilabel multiclass model to identify the propaganda technique applied.


Introduction
Propaganda is a strong component of media discourse and can readily affect the reputation of high-profile people and organizations (Thota et al., 2018; Gifu et al., 2014). Research on detecting propaganda has focused especially on news articles (Fitzmaurice, 2018; Gifu and Dima, 2014). Using a range of psychological and rhetorical techniques, propaganda deliberately aims to manipulate people's beliefs, attitudes or actions (Al-Hindawi and Kamil, 2017). Consequently, the automatic detection and classification of propaganda in news articles is a challenging task.
The goal of this paper is to implement an automatic system that performs two fragment-level classifications for the presence of propaganda in news articles. The first is a binary classification model that assigns each fragment one of two labels, propaganda or non-propaganda. The second is a multilabel multiclass model that identifies the propaganda technique applied.
The rest of the paper is structured as follows: Section 2 describes other work related to propaganda identification, Section 3 presents the dataset and methodology of this study, Section 4 briefly analyzes the results we obtained, and Section 5 concludes.

Related Work
This topic has attracted significant attention in recent years, as evidenced by the increasing number of related shared tasks and workshops (e.g. Fake News Challenge Stance Detection Task 2018, SemEval-2019 Task 4: Propaganda Analysis Meets Hyperpartisan News Detection). However, work on this topic has rarely achieved high scores, because the problem is highly subjective and text classification is controversial and biased even for humans. Most authors used Bag of Words features, usually normalized with tf*idf (Saleh et al., 2019; Barrón-Cedeño et al., 2019a), or character n-gram features for stylistic purposes (Stamatos, 2009).
Because some propaganda techniques had too few instances in the SemEval-2020 Task 11 data to be informative, some of them were merged (Whataboutism with Straw Men and Red Herring; Bandwagon with Reductio ad Hitlerum) and others were eliminated (Obfuscation, Intentional Vagueness, Confusion).

Dataset and Methods
This section describes the datasets built as part of SemEval-2020 Task 11 "Detection of Propaganda Techniques in News Articles" and the methodology we used to address both sub-tasks.

Dataset
The dataset consists of news articles in plain text format, retrieved with the newspaper3k library and split into two parts. The first part has two folders, train-articles and dev-articles, and the second part has a third folder for the test set. Each article has the following structure: a title, followed by an empty row, and then the body content starting on the third row, one sentence per line. NLTK was used for automated sentence splitting. For the binary classification problem, we trained four models using 370 news articles manually annotated by six annotators. The offsets of the fragments containing propaganda were provided in a separate .tsv file. Retrieving the information from each article based on the given offsets, we identified 5468 sentences containing at least one of the propaganda techniques and 10577 sentences that do not contain propagandistic content (see Figure 1). The class distribution in the dataset is therefore highly imbalanced, which may lead to poor results when training the model. To reduce this disproportion, we oversampled the minority class (see Figure 2). For the multinomial classification problem, we used 6129 propagandistic fragments distributed as in Table 1.
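The oversampling step mentioned above can be illustrated with a minimal sketch. The paper does not say which oversampling method was used, so random duplication of minority-class sentences is an assumption here, and the function name `oversample_minority` is hypothetical.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest one.
    Assumption: plain random duplication (not the paper's confirmed method)."""
    rng = random.Random(seed)
    target = max(Counter(labels).values())

    # Group the samples by their class label.
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)

    out_samples, out_labels = [], []
    for label, items in by_label.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * (len(items) + len(extra)))
    return out_samples, out_labels
```

If the classes were balanced fully in this way, which the paper does not state explicitly, the 5468 propaganda sentences would be brought up to the 10577 of the majority class. Oversampling is usually applied to the training split only, so that duplicated sentences do not leak into the evaluation data.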

Methodology
For the sub-task Span Identification (SI), the first objective was to retrieve the fragments from the articles and classify them into two categories: those containing propaganda, labeled 'propaganda', and all the other fragments, labeled 'non-propaganda'. For the sub-task Technique Classification (TC), the first objective was to retrieve the fragments from the articles and classify them into multiple categories, corresponding to the 14 propaganda techniques. Once our data frame was created, we proceeded to pre-process the dataset. To create a reliable dataset, we automatically stripped redundant information, such as stop words and special characters, using the NLTK library. In addition, we created a custom transformer that removes leading and trailing white space and converts the text to lower case. Once the dataset was cleaned, we moved on to feature engineering. Features are generally designed by examining the training set with an eye to linguistic intuitions and the linguistic literature on the domain (Jurafsky and Martin, 2019). Given the consistent use of linguistic attributes for training machine learning models, and the results of previous papers on propaganda detection, we considered bag of words and tf*idf scores appropriate for this task.
Using the bag-of-words model, the text was converted into a matrix of token occurrences within the given fragments. The model focuses on whether given tokens appear in the text or not, and generates a document-term matrix. Applying scikit-learn's CountVectorizer with character n-grams in the range (1, 6), we obtained the numerical representation of the texts. In addition, we included the statistical measure TF*IDF (Term Frequency-Inverse Document Frequency) in order to evaluate how relevant a token is to a document in a collection of news articles. This is a simple way of normalizing the Bag of Words by comparing each token's frequency in a document with its document frequency. The reason for using tf*idf instead of the raw number of occurrences of a token in a given text is to scale down the influence of tokens that appear very often in the provided corpus and are thus generally less informative than features that occur in a smaller fraction of the training corpus.
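The cleaning step described above might look roughly like the following sketch. It mirrors the fit/transform convention of scikit-learn transformers without importing the library. The class name `TextCleaner` and the tiny stop-word set are hypothetical stand-ins; the paper used NLTK's stop-word list, which is not reproduced here.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the paper relied on NLTK's stop-word list instead.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "to", "in"}


class TextCleaner:
    """Strip surrounding white space, lower-case the text, and drop
    special characters and stop words."""

    def __init__(self, stop_words=STOP_WORDS):
        self.stop_words = set(stop_words)

    def fit(self, X, y=None):
        # Nothing is learned from the data; returned for chaining.
        return self

    def transform(self, X):
        return [self._clean(text) for text in X]

    def _clean(self, text):
        text = text.strip().lower()
        # Simplification: anything other than ASCII letters, digits and
        # white space is treated as a special character.
        text = re.sub(r"[^a-z0-9\s]", " ", text)
        tokens = [t for t in text.split() if t not in self.stop_words]
        return " ".join(tokens)
```

For example, `TextCleaner().transform(["  The END of the World!! "])` yields `["end world"]`. Whether stop words should be removed before extracting character n-grams is a design choice the paper does not discuss; the sketch simply follows the order given in the text.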
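To make the character n-gram and tf*idf weighting concrete, here is a small standard-library sketch. It is not scikit-learn's implementation: it uses the textbook weight tf * ln(N / df), whereas scikit-learn's TfidfTransformer applies a smoothed idf and L2 normalization by default. The function names `char_ngrams` and `tfidf` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=6):
    """Return all character n-grams of length n_min..n_max in text."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def tfidf(docs, n_min=1, n_max=6):
    """Weight each n-gram count by ln(N / df), where N is the number of
    documents and df the number of documents containing the n-gram."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]

    # Document frequency: in how many documents does each n-gram occur?
    df = Counter()
    for c in counts:
        df.update(c.keys())

    n_docs = len(docs)
    return [
        {gram: tf * math.log(n_docs / df[gram]) for gram, tf in c.items()}
        for c in counts
    ]
```

With this weighting, an n-gram that occurs in every fragment receives a weight of zero, which is the scaling-down effect described above, while an n-gram confined to a few fragments keeps a weight proportional to its count.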

Sub-Task 1: Span Identification (SI).
We analyzed the training dataset to identify the propaganda fragments, each described by three distinct fields: id (the identifier of the article), begin offset (the character where the covered span begins, inclusive) and end offset (the character where the covered span ends, exclusive). Based on these labels, we crafted a set of rules to identify propaganda, and sentences of news articles were randomly generated. The application code was written in the Python programming language, and the results are presented in Table 2.
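The offset convention above (begin included, end not included) matches Python's slice semantics, so a span can be read with `text[begin:end]`. The sketch below shows one way sentence-level labels could be derived from such spans, assuming one sentence per line as in the dataset description. The function name `label_sentences` and the overlap rule are assumptions, not the authors' confirmed procedure.

```python
def label_sentences(article_text, spans):
    """Label every non-empty line as 'propaganda' if it overlaps at least
    one half-open [begin, end) character span, else 'non-propaganda'."""
    labeled = []
    offset = 0
    for line in article_text.split("\n"):
        start, end = offset, offset + len(line)
        offset = end + 1  # move past the newline character
        if not line.strip():
            continue  # skip the empty row after the title
        # Two half-open intervals overlap when each starts before the
        # other one ends.
        hit = any(b < end and e > start for b, e in spans)
        labeled.append((line, "propaganda" if hit else "non-propaganda"))
    return labeled


# Hypothetical mini-article: title, empty row, then one sentence per line.
article = "Title\n\nThey are traitors.\nIt rained today."
labels = label_sentences(article, [(16, 24)])  # span covers "traitors"
```

Here `article[16:24]` is the annotated fragment, and only the sentence containing it receives the 'propaganda' label.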

Sub-Task 2: Technique Classification (TC).
We analyzed the dataset in order to classify each fragment into one of the 14 classes. The label file contained four columns: the article id, the propaganda technique, the begin offset, which is the character where the covered span begins (included), and the end offset, which is the character where the covered span ends (not included). Working within these constraints, we made the second submission; the results are presented in Table 3.
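As an illustration of the four-column label format described above, the following sketch parses tab-separated rows and pairs each fragment's text with its technique. The function names, the article id and the sample row are hypothetical, and the column order is taken from the description in the text.

```python
import csv
import io


def read_tc_labels(tsv_text):
    """Parse rows of: article id, technique, begin offset, end offset."""
    rows = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # tolerate blank lines
        article_id, technique, begin, end = row
        rows.append((article_id, technique, int(begin), int(end)))
    return rows


def extract_fragments(articles, rows):
    """Return (fragment text, technique) pairs; the end offset is exclusive."""
    return [
        (articles[article_id][begin:end], technique)
        for article_id, technique, begin, end in rows
    ]


# Hypothetical data for illustration only.
articles = {"111": "They are traitors."}
rows = read_tc_labels("111\tLoaded_Language\t9\t17\n")
fragments = extract_fragments(articles, rows)
```

The resulting pairs can then serve as training examples for the multiclass model, with the fragment text as input and the technique as the target class.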

Results
Below, we present the results for each individual task on the development and test sets. We report Precision (P), Recall (R) and F1-score (F1) for each baseline on all classes. Our official submission scored 0.33 on the SI task and 0.43 on the TC task. In Table 2 we see that the Random Forest model has the best performance on the development set for the detection of propaganda in news, with an F1 of 0.30 and a Recall of 0.56, while the highest precision, 0.22, was achieved using Logistic Regression. The final submission on the test set was nevertheless made with the Random Forest algorithm. Analyzing the individual features, we observed that word and character n-grams perform almost identically: character n-gram features perform slightly better in recall for propaganda sentences and in precision for non-propaganda sentences, while word n-grams achieve higher results in all the other cases. These two features also correlate well with each other, reporting high results for both classes.
Table 3. Technique classification results
Table 3 shows the results for the second sub-task, where performance improves, especially for the classes with many instances. The overall F1 is 0.43 on the development set and 0.41 on the test set. The system did not appear to encounter any issues predicting Loaded Language or Name Calling. However, it struggled to classify under-represented techniques such as Black-and-White Fallacy or Whataboutism, Straw Men, Red Herring. Taking a closer look at the misclassified examples can support the development of machine learning models, by pointing out instances that proved difficult to classify and that can be analyzed for future improvements.
Based on a short analysis of the sentences, we assume that some of the model's errors are due to the low number of examples in these classes or to poor annotation, as it can be challenging to find specific patterns in highly biased sentences, and even a human might find it difficult to classify them correctly.
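For reference, the Precision, Recall and F1 values reported in this section follow the standard definitions, which can be sketched as below for the binary case. The function name is hypothetical, and counting at the sentence level is an assumption made for illustration; the task's official scorer is not reproduced here.

```python
def precision_recall_f1(y_true, y_pred, positive="propaganda"):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    # Guard against division by zero when a class is never predicted
    # or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Because F1 is the harmonic mean of precision and recall, a model with high recall but low precision, as in the SI results above, still obtains a modest F1.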

Conclusion
This paper presents a system that participated in SemEval-2020 Task 11. We investigated propaganda detection at the fragment level, and our experimental results show that linguistic features such as character and word n-grams are effective for both tasks. The overall results are satisfactory and exceed the baseline; however, there is still room for improvement in predicting the techniques. A larger and well-annotated dataset would provide more opportunities for exploring propaganda detection in news articles. In addition, building a dataset sufficient in size and diversity would allow experiments with deep learning methods.