UTMN at SemEval-2020 Task 11: A Kitchen Solution to Automatic Propaganda Detection

The article describes a fast solution to propaganda detection at SemEval-2020 Task 11, based on feature adjustment. We use per-token vectorization of features and a simple Logistic Regression classifier to quickly test different hypotheses about our data. We arrive at what seems to us the best solution; however, we are unable to align it with the results of the metric suggested by the organizers of the task. We test how our system handles class and feature imbalance by varying the number of samples of the two classes (Propaganda and None) in the training set, the size of the context window in which a token is vectorized, and the combination of vectorization means. The result of our system at SemEval-2020 Task 11 is an F-score of 0.37.


Introduction
Propaganda is a complex phenomenon that has been studied in psychology (Childs, 1936), sociology (Jackall, 1995; Klaehn and Mullen, 2010), theory of communication (Jowett and O'Donnell, 2018; Severin et al., 1997), pedagogy (Hobbs and McGee, 2014; Smith, 1974), history (Taylor, 2013), linguistics (Lukin, 2013), and other sciences. Naturally, the task of automatic propaganda detection has been set and approached in different ways. Proppy, an on-line service, detects propaganda in news articles and clusters them according to an index of propaganda (Barrón-Cedeno et al., 2019). Every day it analyzes emerging texts, identifies events described in them, discards near-duplicates, and, lastly, computes a propaganda index on the basis of n-gram features, vocabulary and its richness, style, readability, and NEws LAndscape (NELA) features (Horne et al., 2018). In the Propaganda Analysis Project, propaganda detection is more focused on locating propaganda within a text. At the Hack News Datathon 1 , Task 3 was to detect a fragment containing propaganda. However, it was paired with classifying the fragment according to 18 techniques 2 , the same as the Fragment Level Classification at the shared task on "Fine-Grained Propaganda Detection" (EMNLP 2019). The competition of Da San Martino et al. (2020) evaluates these two stages separately. Although it is hard to map the results of previous competitions onto the current one, the former revealed some life hacks that can be of use in the latter. For example, as noted by H.T. Madabushi (2019), extracting fragments of propaganda is similar to the task of Named Entity Recognition, in that they are both span extraction tasks. Their system ProperGander was partially based on the BERT (Devlin et al., 2019) solution for NER. However, probably the main lesson learned from the first two events was about the necessity of using BERT (or at least a deep-learning neural network) to achieve a state-of-the-art result.
In the current article, we proudly present a no-BERT and even no-deep-learning solution for the task of Span Identification. Our choice not to utilize contextualized vectors such as BERT is grounded in the following reasons. (1) In the annotation to "Poor Man's BERT: Smaller and Faster Transformer Models" (Sajjad et al., 2020), it is noted that BERT-based models require certain laboratory facilities, and our team is not from a lab. (2) Even in a lab, BERT takes time to train. So, it is also a slow man's choice that deprives one of quickly testing many hypotheses. (3) We were unable to find a neural classifier that would replace a simpler classifier in our best solution before the end of the competition. Even a pre-trained BERT model of T. Wolf et al. (2019) takes about 40 minutes per epoch to train in Google Colab, whereas our model takes 2-3 minutes to train and predict.

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.

1 https://www.datasciencesociety.net/hack-news-datathon/
2 These 18 techniques have grown from the 1930s American Institute of Propaganda Analysis materials (Miller, 1939) and more recent investigations both into the tools of propaganda (Torok, 2015; Teninbaum, 2009, and others) and into the rules of good argument (Weston, 2018).
The paper is organized as follows: we first describe our idea of class and feature imbalance in propaganda; we then outline the basics of our classifier; we devote a section to testing hypotheses about our data; and we conclude with remarks on the usability of our system.

Propaganda in News Articles
Verbal propaganda in a news article is a stretch of text written with an intent to influence the reader's opinion 3 with arguments that lead to conclusions advantageous to the author. The communicative intent is to control and manipulate the reader. The tools are cognitive, logical, emotional and other means of confusing the reader (verbal fouling), and the verbal representation adds information on top of what a news article, as a genre, has to state as fact: the known facts and the believed-to-be truth 4 . Hence, we expect that propaganda enlarges the size of the textual unit where it appears. This unit is usually a sentence: according to Da San Martino et al. (2019), propaganda takes up about half a sentence. And sentences with propaganda tend to be longer, as reported by the creators of the PIG system at the Hack News Datathon 5 .
The attribute of secrecy requires that propaganda look like its non-propagandist context. So, it should not differ much from the context in its lexical, grammatical or stylistic expression: there should be a difference, but a slight one. Rashkin et al. (2017) look at four kinds of news articles (reliable, hoax, propagandistic and satiric) and find linguistic features that are more frequent in fake news (the difference is statistically significant): the pronouns 'I' and 'you' and their forms; modal, action and manner adverbs; words semantically related to swear, sexual, see, negation 6 , strong and weak subjective, etc. 7 At the same time, they list some features which are absent from fake news, but these features are fewer. As mentioned, propaganda should look like news, but with a bit of additional "foul" content. Hence, the method of semantic vectors (with due regard for other methods) must be useful in propaganda detection for extracting semantic features of the class "propaganda", but it will not be as useful in defining the class "non-propaganda".
To sum up, there are linguistic features that are more common in propaganda than in non-propaganda, and there are nearly no linguistic features of non-propaganda that are uncommon in propaganda (feature imbalance). Further, propaganda is less common than non-propaganda (class imbalance). Finally, propaganda takes about half a sentence; sentences with propaganda tend to be longer; propaganda is emotionally colored.

System Architecture
Detection of propaganda as a fragment of an article (fragment or span identification) presupposes that the minimum textual unit that holds it is a word form. However, if it neighbors punctuation marks, numbers, etc., the latter can also be attributed to the fragment. Hence, the per-token approach is very common in solutions to tasks organized by the Propaganda Analysis Project: Yoosuf and Yang (2019) use not only word but also character embeddings, as some morphemes, e.g. "-ist", can be frequent in propaganda.
As mentioned, in this project we decided to focus on a fast solution that would allow us to test many hypotheses about our data. As spans may contain more than just words, we chose the token as the minimum unit of a span, regardless of whether it is a word form, a punctuation mark, etc. With tokens that are not actual words, it is important to rely on the context of each token: this is what we base our model on. A context window is a chunk of text that is within n tokens to the left and n tokens to the right of the given token; the adjustable parameter of the context window is its size n. If the token has too few tokens to the left or right, the context window simply shrinks. We tokenize texts with SpaCy (Honnibal and Johnson, 2015): the EnCoreWebLg model 8 . The machine learning algorithm that we finally chose is Logistic Regression (LR) as implemented in Scikit-Learn (Pedregosa et al., 2011). The reasons for that were, first, the speed of LR and, second, that it proved to be one of the most efficient classifiers for the data models we tested at the preliminary stage of research. Also, LR can be considered a shallow neural network and, hence, a baseline to test how neural networks treat the data model.
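The shrinking context window can be sketched in a few lines. This is a minimal illustration, not our released code: in the real system the tokens come from SpaCy's EnCoreWebLg tokenizer, while here plain strings stand in for them.

```python
# Minimal sketch of the shrinking context window described above.
# The function name and the sample tokens are illustrative; in practice
# the tokens come from SpaCy.
def context_window(tokens, i, n):
    """Return the tokens within n positions to the left and right of token i.

    Near the edges of the text the window simply shrinks.
    """
    left = max(0, i - n)
    right = min(len(tokens), i + n + 1)
    return tokens[left:right]

tokens = ["We", "proudly", "present", "a", "no-BERT", "solution", "."]
print(context_window(tokens, 0, 2))  # shrinks at the left edge
print(context_window(tokens, 3, 2))  # a full window of 2 + 1 + 2 tokens
```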

Data Model, Adjustable Parameters, Evaluation
As mentioned, the unit of classification is a token extracted with SpaCy. The token is considered within a context window of length n tokens to the right and left of the given token. Our data model is grounded in our approach. For each token it combines:

1. An embedding vector of the token: size 200, window = 7, acquired with the Word2vec model of Gensim (Řehůřek and Sojka, 2010). This is a safety model in case SpaCy does not know the token. The model is trained on the sentences of all articles in the training, development and test sets. Preprocessing, lemmatization and sentence splitting are done with SpaCy 9 .
2. An embedding vector of the token: size 300, acquired with the SpaCy "vector" command.
3. An embedding vector of the context window: size 300, acquired with the SpaCy "vector" command. To get it and the two next vectors, we cut out the part of the text that belongs to the context window and only then vectorize it.
4. A sentiment vector of the context window: size 4, acquired with the NLTK Vader Sentiment Intensity Analyzer (Hutto and Gilbert, 2014; Bird et al., 2009).
The final length of the vector is 843. At the competition, we normalized the vectors with Scikit-Learn "preprocessing.normalize".
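The assembly of the final vector can be sketched as follows. This is a toy illustration: random arrays stand in for the Word2vec, SpaCy and VADER components, and the four sizes listed above sum to 804, so the full 843-dimensional vector includes further components not reproduced here.

```python
# Sketch of the per-token feature vector assembly. Random arrays stand in
# for the real components (Gensim Word2vec, SpaCy vectors, VADER scores).
import numpy as np

rng = np.random.default_rng(0)
w2v_token = rng.normal(size=200)     # 1. Word2vec embedding of the token
spacy_token = rng.normal(size=300)   # 2. SpaCy vector of the token
spacy_window = rng.normal(size=300)  # 3. SpaCy vector of the context window
sentiment = rng.normal(size=4)       # 4. VADER sentiment of the context window

features = np.concatenate([w2v_token, spacy_token, spacy_window, sentiment])

# L2-normalize, as Scikit-Learn's preprocessing.normalize does by default
features = features / np.linalg.norm(features)
print(features.shape)  # (804,)
```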
We do not add such a parameter as the length of the sentence in which the token is found (although we mentioned it as a characteristic one) as, when added to the vector, even normalized, it significantly decreases result of the Logistic Regression.
Given the feature and class imbalance, we have the following parameters to tune in our data model:

1. Size of context window n.
2. Number of propaganda and non-propaganda samples in the training set.

As concerns the probability of the same tokens belonging to more than one fragment of propaganda (due to the difference in classes), we do not consider it in our model. The heat-map in Figure 1, the left-most square, shows that propaganda fragments do not co-occur very often: most of the field is blue. This co-occurrence seems to depend on how frequent the class is (see class "Loaded Language", for example), with one exception: class "Doubt", although it is not among the most frequent, tends to co-occur with Loaded Language and Name Calling more often than its neighbors. When we enlarge the interval to include not only overlaps but also neighboring within n characters (the three other squares), this trend holds.
To test our algorithm before the competition, we applied the F-score for per-token classification: true positives are tokens correctly classified as propaganda and true negatives are tokens correctly classified as non-propaganda. During the competition, we had difficulty aligning our results with the results we got for the development set in the system offered by the organizers. The organizers' metric is a per-character F-score that also takes into account the number of fragments (spans). For now, it is hard to say whether our results are hard to align due to the per-token vs. per-character approach or because the organizers' approach punishes too few or too many fragments. In this article, to evaluate the results we will simply be using the percentage of correctly and wrongly classified tokens for the two classes: Propaganda and None.
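Our internal per-token F-score can be computed as below. This is a minimal sketch; it does not reproduce the organizers' per-character, span-aware metric.

```python
# Per-token F-score over the class Propaganda (label 1), as used for
# pre-competition testing.
def per_token_f1(gold, pred):
    """gold, pred: per-token labels, 1 = Propaganda, 0 = None."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(per_token_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # 2 tp, 1 fp, 1 fn
```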

Testing
To tune our model, we experimented with some of the adjustable parameters mentioned above (each parameter is a hypothesis about our data): the number of features of each class and the size of the context window. The ML classifier is LogisticRegression(solver='liblinear', penalty='l2', C=0.1). Given two sets of token vectors (10,000 from fragments with propaganda and 10,000 without propaganda 10 ) from the competition's official train set, and the vectors of the test set, let us first check how the number of vectors of each class influences the mentioned percentages. The radar chart in Figure 2, left, shows that an equal number of vectors of each class (Propaganda and None) in the train set gives an even performance, but when we create an imbalance, e.g. put 7,000 vectors of class Propaganda and 10,000 of None in the train set, it can lead to a more favorable treatment of one class. For further testing we stick with the combination of 7,000 Propaganda and 10,000 None, as we chose this combination at the competition. It gives a very good performance for the class None and a fairly good one for Propaganda.
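The sampling experiment can be sketched as follows. The data here are synthetic stand-ins for our token vectors, and the sample sizes are scaled down by a factor of ten for speed (the real experiment used 7,000 Propaganda and 10,000 None vectors); Scikit-Learn is assumed to be installed.

```python
# Sketch of the class-imbalance experiment: subsample a fixed number of
# Propaganda (1) and None (0) token vectors and train the same classifier.
# Random arrays stand in for our 843-dimensional token vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

def subsample(X, y, n_prop, n_none):
    """Pick n_prop rows of class 1 and n_none rows of class 0."""
    prop_idx = rng.choice(np.flatnonzero(y == 1), n_prop, replace=False)
    none_idx = rng.choice(np.flatnonzero(y == 0), n_none, replace=False)
    idx = np.concatenate([prop_idx, none_idx])
    return X[idx], y[idx]

X = rng.normal(size=(3000, 843))          # stand-in token vectors
y = (rng.random(3000) < 0.5).astype(int)  # stand-in labels

X_train, y_train = subsample(X, y, 700, 1000)  # 7,000 / 10,000 in the paper
clf = LogisticRegression(solver='liblinear', penalty='l2', C=0.1)
clf.fit(X_train, y_train)
print(X_train.shape)  # (1700, 843)
```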
Now, let us test the size of the window. Figure 3 shows that there are two peaks where classification of None is best: 5 and 7. And although correct classification of Propaganda is lower than usual at these peaks, it is quite even compared to the performance at other context windows. For the competition we chose the context window of size 7.
As for the inclusion of different vector types in the final vector, the radar chart in Figure 2, right, shows how adding these types (going clockwise, starting with just the Word2vec vector) balances the proportion of False Positive and True Negative cases and increases the number of correctly classified Propaganda tokens. However, normalizing the final vector now seems to us to have been a mistake, as it tangibly decreases the number of correctly classified None vectors.

Conclusion
We have described our system for automatic propaganda detection at SemEval-2020. The system is designed so as to allow us to quickly test many feature parameters. We call it a kitchen solution in the title because it comes in small but handy, reusable blocks. In Google Colab, our classifier takes 141 seconds (without an accelerator) and 128 seconds (with a GPU accelerator) to train and compute the result on the test data.
The idea of feature imbalance has led us to use an unbalanced combination of Propaganda and None token vectors in the data set. Also, we have experimented with the size of the context window in which propaganda becomes most discernible and ended up with size 7, a number which peculiarly correlates with data on human short-term memory: "the 7-item limit" (Coon and Mitterer, 2008). It seems to us that this correlation is suggestive of the idea that, in the end, what we, as readers, are able to call propaganda, is propaganda.