Team DiSaster at SemEval-2020 Task 11: Combining BERT and Hand-crafted Features for Identifying Propaganda Techniques in News

The identification of communication techniques in news articles such as propaganda is important, as such techniques can influence the opinions of large numbers of people. Most work so far has focused on identification at the news article level. Recently, a new dataset and shared task have been proposed for the identification of propaganda techniques at the finer-grained span level. This paper describes our system submission to the subtask of technique classification (TC) for the SemEval 2020 shared task on detection of propaganda techniques in news articles. We propose a method of combining neural BERT representations with hand-crafted features via stacked generalization. Our model has the added advantage that it combines the power of contextual representations from BERT with simple span-based and article-based global features. We present an ablation study which shows that even though BERT representations are also very powerful for this task, BERT still benefits from being combined with carefully designed task-specific features.


Introduction
The purpose of propaganda is to use communication to foster predetermined agendas, or to achieve a response that furthers a desired outcome (Jowett and O'Donnell, 2018).
Prior research has focused on creating machine learning models that label whole news articles or even entire news outlets as propagandistic (Rashkin et al., 2017). To increase the granularity of these coarse models, Da San Martino et al. (2019) developed a new data set that enables models to jointly identify fragments of propaganda within a document and classify their respective propaganda techniques. This paper presents our solution (DiSaster, finishing in 11th place) to the technique classification (TC) sub-task of SemEval 2020 Task 11, "Detection of propaganda techniques in news articles". TC is a multi-class classification problem in which a system needs to identify the propaganda technique of a given span of an article. For instance, when given the span "stupid and petty", the system should classify it as Loaded Language. Our system is an ensemble model based on stacked generalization (Wolpert, 1992), which enables the incorporation of both traditional engineered features (Nalini and Sheela, 2014) and the Transformer-based (Vaswani et al., 2017) language model BERT (Devlin et al., 2019).

Related work
In addition to formulating the original problem of fine-grained propaganda identification and creating the corpus needed to solve the task, Da San Martino et al. (2019) also designed a multi-granularity neural network. This model outperformed several strong BERT baseline models in the high granularity fragment-level classification by using information from low granularity classification (e.g. document-level) to drive higher-granularity classification (e.g. paragraph-level).
As the TC sub-task of this competition does not require span detection, a multi-granularity approach is not necessary. Instead, our model is inspired by a project by Zhang and Li (2019), in which they outperformed BERT baseline models by combining a BERT model with linguistic features.

Data
The provided training data for this competition contains 371 articles in which all fragments of propaganda are annotated with one of the 18 different propaganda techniques described in Da San Martino et al. (2019). However, due to the low frequency of some of the techniques, similar underrepresented techniques were merged into superclasses, while one of the techniques was eliminated completely. Thus, the TC task was a 14-class classification problem, where two of the classes were superclasses representing several techniques each (Da San Martino et al., 2020). Table 1 contains a list of all the labels along with their respective IDs that we defined.
The class distribution in the training data was very skewed (as the support in Table 1 shows): four of the labels accounted for more than 70% of the training data. As the score for the competition was calculated as the micro-average F1 over all the labels, it was crucial to get good predictions on these four classes. In order to create hand-crafted features that would increase the model's performance for these techniques, we performed a thorough data analysis whose main results are summarized in Table 1.
Table 1 shows that there is a considerable spread in the average number of words per span among the different techniques. In particular, the spans from the Repetition category were much shorter than those of other techniques, and more than 40% of its spans contained only a single word. By examining the instances of Repetition in which the span contained only a single word, we found that the Porter-stemmed version of the word (Porter, 1980) often occurred several times within the article. This effect is displayed as Avg one word counter in Table 1, which is the average number of times the stem of a single-word span occurs within an article. This value cannot be calculated for classes in which every span contains more than one word. A similar effect for Repetition was discovered when spans with more than one word were examined: the average number of times an entire span with more than one word was repeated within an article was generally much higher for Repetition than for any other technique. This effect is shown in Table 1 as Avg span sentence counter. Additionally, we found that if a label was in an article, there was a much higher probability of finding another span with the same label elsewhere in the article (see Figure 2).

Model overview
Our simplest baseline model always predicts the label with the highest prior probability in the training set, which was Loaded Language. We compare the results of this baseline model with our final model in Section 5, Table 2.
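A minimal sketch of such a majority-class baseline, together with the micro-average F1 used for scoring, is given below. The function names and the toy labels are illustrative assumptions and are not taken from our code or data. In a single-label setting every error counts as one false positive and one false negative, so micro-average F1 coincides with accuracy.

```python
from collections import Counter


def fit_majority_baseline(train_labels):
    """Return the most frequent label in the training labels."""
    return Counter(train_labels).most_common(1)[0][0]


def micro_f1(gold, predicted):
    """Micro-averaged F1 when each span has exactly one gold and one predicted label."""
    tp = sum(g == p for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    # Each wrong prediction is a false positive for the predicted class
    # and a false negative for the gold class.
    fp = fn = len(gold) - tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration only.
train = ["Loaded_Language"] * 5 + ["Repetition"] * 2 + ["Doubt"]
test = ["Loaded_Language", "Repetition", "Loaded_Language", "Doubt"]

majority = fit_majority_baseline(train)
predictions = [majority] * len(test)
score = micro_f1(test, predictions)
```

On the toy data the baseline predicts the majority label for every span, so its score is simply the share of test spans that carry that label.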
We tackle the problem of propaganda technique identification as a classification task where we combine three components using stacked generalization (Wolpert, 1992). Our model is illustrated in Figure 1 and consists of: (1) a contextualized embedding representation of the span using BERT, (2) hand-crafted features extracted from both the span and the global article structure, and (3) the scores of a traditional logistic regression model trained on the hand-crafted features. These components are combined using a feed-forward neural network as the topmost stacking classifier. All the components are described in the next subsections.

BERT fine-tuning
The BERT component of the pipeline consists of BERT-large with a single linear layer on top of the output, similar to the approach described in Devlin et al. (2019). This component was only used at the span level (i.e. on the actual propaganda fragments), in order to get a 14-dimensional vector of logits corresponding to the 14 propaganda technique classes.
To obtain the logits from BERT for all of the training set spans, a 10-fold stratified learning strategy was used: 10 stratified train/test splits were created from the training set. BERT-large was then initialized and fine-tuned, as suggested by Devlin et al. (2019), on each of the 10 training sets and made to predict the logits for the corresponding test sets. This method ensured that logits were predicted for the whole training set without the model ever predicting on data that it was trained on. The logits for the development and test sets were created by fine-tuning on a stratified 90% subset of the training data and stopping early when the loss on the remaining 10% stopped decreasing.
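The out-of-fold strategy can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the round-robin fold assignment, the function names, and the `fit`/`predict` callables are hypothetical stand-ins, and a real pipeline would plug BERT fine-tuning in as `fit` and logit prediction in as `predict`.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Assign example indices to folds, spreading each class round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for label in sorted(by_class):
        for index in by_class[label]:
            folds[position % n_folds].append(index)
            position += 1
    return folds


def out_of_fold_predictions(examples, labels, n_folds, fit, predict):
    """Predict every example with a model that never saw it during training."""
    outputs = [None] * len(examples)
    for held_out in stratified_folds(labels, n_folds):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(examples)) if i not in held_out_set]
        model = fit([examples[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        for i in held_out:
            outputs[i] = predict(model, examples[i])
    return outputs


# Stub model (assumption): it only remembers which spans it was trained on,
# so that "predict" can report whether a held-out span leaked into training.
def fit_stub(train_examples, train_labels):
    return {"seen": set(train_examples)}


def predict_stub(model, example):
    return example in model["seen"]


spans = [f"span-{i}" for i in range(20)]
labels = ["A"] * 10 + ["B"] * 10
leaked = out_of_fold_predictions(spans, labels, 10, fit_stub, predict_stub)
```

With the stub model, every entry of `leaked` is `False`, which mirrors the property stated above: each span receives a prediction from a model that was not trained on it.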
BERT was optimized on the cross-entropy loss using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of $2 \times 10^{-5}$ and an epsilon value of $1 \times 10^{-8}$.
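For reference, the cross-entropy loss for a single span can be computed from its 14 logits as in the sketch below. The function name and the example logits are illustrative assumptions; the log-sum-exp shift by the maximum logit is a standard numerical-stability measure and not something specific to our setup.

```python
import math


def cross_entropy(logits, gold_index):
    """Negative log-probability of the gold class under a softmax over logits."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return log_sum_exp - logits[gold_index]


# An uninformed model with identical logits for all 14 classes.
uniform_loss = cross_entropy([0.0] * 14, gold_index=3)

# A model that is confident in the correct class.
confident_loss = cross_entropy([10.0] + [0.0] * 13, gold_index=0)
```

With identical logits the loss equals the natural logarithm of 14 (about 2.64), which gives a useful reference point: a fine-tuned model should fall well below this value on held-out spans.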

Feature extraction
In addition to the BERT logits, we extracted several additional features from the data. In total we extracted 54 features. We found that the following five improved the performance of the model the most:
• If the span is only one word, article one word counter (aowc) is a count of how many times the Porter stem of that word appeared in the article. Otherwise it is 0.
• If the span is more than one word long, article span sentence counter (assc) is a count of how many times that span appeared elsewhere in the article. Otherwise it is 0.
• span word length (swl) is a count of the number of words in the span.
• word count span sent (wcss) is the number of times that a span appears within the sentence it is presented in. E.g. the span "fake news" appears twice in the sentence "it is fake news about a fake news story."
• word resemble factor (wrf) is the inverse uniqueness of words in a span and is calculated as $\text{wrf} = \frac{\text{number of words in span}}{\text{number of unique words in span}}$.
Furthermore, a logistic regression was performed over the hand-crafted features alone using a stratified learning strategy similar to the one used for BERT, and the resulting 14-dimensional output was used downstream in our pipeline. We compare and discuss the importance of the features in Section 6.
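The five features listed above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not our original feature code: the regex tokenizer is a simplification, "elsewhere in the article" is read as the total number of occurrences minus one, phrase matching ignores sentence boundaries, and the default `stem` argument is an identity-like stand-in for a Porter stemmer (with NLTK installed, `nltk.stem.PorterStemmer().stem` could be passed instead).

```python
import re


def tokenize(text):
    """Lowercased word tokens; a simplification of a real tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def count_phrase(phrase_tokens, text_tokens):
    """Count (possibly overlapping) occurrences of a token sequence."""
    n = len(phrase_tokens)
    if n == 0:
        return 0
    return sum(text_tokens[i:i + n] == phrase_tokens
               for i in range(len(text_tokens) - n + 1))


def span_features(span, sentence, article, stem=str.lower):
    """Compute aowc, assc, swl, wcss and wrf for one span."""
    span_tokens = tokenize(span)
    article_tokens = tokenize(article)
    features = {"aowc": 0, "assc": 0}
    if len(span_tokens) == 1:
        # Occurrences of the span word's stem anywhere in the article.
        target = stem(span_tokens[0])
        features["aowc"] = sum(stem(t) == target for t in article_tokens)
    elif len(span_tokens) > 1:
        # Occurrences of the whole span, not counting the span itself.
        features["assc"] = max(count_phrase(span_tokens, article_tokens) - 1, 0)
    features["swl"] = len(span_tokens)
    features["wcss"] = count_phrase(span_tokens, tokenize(sentence))
    features["wrf"] = (len(span_tokens) / len(set(span_tokens))
                       if span_tokens else 0.0)
    return features


sentence = "it is fake news about a fake news story."
article = sentence + " Nobody believes fake news."
features = span_features("fake news", sentence, article)
```

For the example sentence from the list above, `features["wcss"]` is 2, matching the two occurrences of "fake news" in that sentence.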

Feed-forward network
The last component of our model is a fully connected neural network with three hidden layers each consisting of 500 neurons. The network was optimized using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of $2 \times 10^{-5}$ and an epsilon value of $1 \times 10^{-8}$.
As illustrated in Figure 1, the network was fed a concatenation of the BERT logits, the hand-crafted features and the output of the logistic regression over the features. The resulting output was a 14-dimensional vector of logits corresponding to the 14 propaganda classes.
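The stacking step can be written compactly as below. Here $\mathbf{b}$ denotes the 14 BERT logits, $\mathbf{f}$ the $d$ hand-crafted features and $\mathbf{r}$ the 14-dimensional logistic regression output. The symbols are introduced only for this illustration, and the activation function $\phi$ and the final argmax decision rule are assumptions, since the description above fixes only the number and width of the hidden layers.

```latex
\begin{aligned}
\mathbf{x}   &= [\,\mathbf{b};\ \mathbf{f};\ \mathbf{r}\,] \in \mathbb{R}^{14 + d + 14},\\
\mathbf{h}_1 &= \phi(W_1 \mathbf{x} + \mathbf{c}_1),   \quad W_1 \in \mathbb{R}^{500 \times (28 + d)},\\
\mathbf{h}_2 &= \phi(W_2 \mathbf{h}_1 + \mathbf{c}_2), \quad W_2 \in \mathbb{R}^{500 \times 500},\\
\mathbf{h}_3 &= \phi(W_3 \mathbf{h}_2 + \mathbf{c}_3), \quad W_3 \in \mathbb{R}^{500 \times 500},\\
\mathbf{z}   &= W_4 \mathbf{h}_3 + \mathbf{c}_4 \in \mathbb{R}^{14},
\qquad \hat{y} = \arg\max_{k} z_k .
\end{aligned}
```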

Model performance
The most compute-intensive step of our model is extracting the BERT task-specific representations for the entire dataset (as outlined in Section 4.1). Fine-tuning BERT and obtaining the representations took roughly 12 hours using the Tesla K80 GPU available on Google Colab. However, once the model is trained, new representations can be obtained in seconds. The extraction of global features is quicker, taking less than 30 minutes for the training, development and test sets combined on a 2017 MacBook Pro with a 3.1 GHz Quad-Core Intel Core i7 processor. A clear advantage of our model is its simplicity. Once the BERT features are extracted, our stacked model can be trained in about five minutes.

Experiments
To test the importance of the different components, a feature ablation study was performed. For this study, a 10-fold stratified cross-validation was performed on the training set and the micro-average F1 score was recorded. The models used in this experiment were implemented in Python using the PyTorch framework (Paszke et al., 2019). The features were created using a mixture of spaCy and NLTK (Honnibal and Montani, 2017; Bird et al., 2009), whereas for BERT we used the Huggingface library (Wolf et al., 2019).
All results obtained from the ablation study are summarized in Table 2 along with the model's scores from a 10-fold cross-validation on the training set. Furthermore, Table 2 also shows the final micro-average F1 score we obtained on the official competition development and test sets.
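The structure of such an ablation study can be sketched as follows, assuming a leave-one-out design in which each component is removed in turn. The `evaluate` callable is a hypothetical placeholder for a full cross-validation run, and the toy contribution values are invented for illustration only; they are not results from Table 2.

```python
def ablation_study(feature_names, evaluate):
    """Score the full feature set, then each set with one feature removed."""
    results = {"all features": evaluate(tuple(feature_names))}
    for dropped in feature_names:
        kept = tuple(name for name in feature_names if name != dropped)
        results[f"without {dropped}"] = evaluate(kept)
    return results


# Toy scoring function (assumption): each feature adds a fixed, made-up amount.
TOY_CONTRIBUTION = {"aowc": 3, "assc": 2, "swl": 1}


def toy_evaluate(kept_features):
    return sum(TOY_CONTRIBUTION[name] for name in kept_features)


ablation = ablation_study(["aowc", "assc", "swl"], toy_evaluate)

# The largest drop relative to the full model marks the most important feature.
most_important = min(
    (name for name in ablation if name != "all features"),
    key=ablation.get,
)
```

In a real run, `evaluate` would retrain the stacked model on the kept components and return the cross-validated micro-average F1.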

Discussion
As evident from the ablation study (Table 2), the most important component of our learning setup was BERT. However, as BERT is used at the span level, it is only able to predict a label based on the tokens in a given span. Due to this local behavior, BERT alone struggled to correctly predict the Repetition class. This was most likely because the words or phrases that were repeated were not necessarily in the span, but spread throughout the article, which the data exploration in Section 3 also supports. This was a problem, as Repetition was the third most frequent class in the training set. It was for this reason that we decided to extract and use additional global (article-level) features from the data set. The most important extracted feature for Repetition was the article one word counter. This feature directly tells the final neural network whether a word has been repeated in the article, and removing this global feature shows its importance, as the F1 score for the class Repetition drops from 0.646 to 0.621 (Table 2). This is also supported by our data analysis (Section 3, Table 1), which shows that the Avg one word counter is much higher for Repetition than for the other labels.
The fact that we obtained better quality predictions by augmenting the BERT predictions with additional information about the text shows that feature engineering is still a relevant discipline, as other recent research also suggests (Wu et al., 2018; Zhang and Li, 2019).
The augmented BERT approach worked well on both the training set and the development set, but our score dropped significantly when predicting on the test set (0.628 dev set micro F1 → 0.566 test set micro F1). As we do not have access to the test set labels, a detailed error analysis is difficult for now and is left for future work. However, by comparing the F1 score from the official test set with the cross-validation scores in Table 2, we do see a particularly large drop in F1 for the Repetition category (from 0.646 in cross-validation → 0.204 on the test set). This drop in Repetition F1 can also be observed for the other participants in the competition. It may be due to overfitting the model to the training and development sets, or to the test set having a slightly different distribution than the training and development sets.
Finally, we explored several approaches to exploit the phenomenon of labels co-occurring in articles (Section 3, Figure 2), namely using an RNN and an attention-based model over all the spans in an article. Additionally, we tried feeding a neural network the average BERT prediction for all spans in an article in addition to the other features we included. Unfortunately, we were unable to improve the performance using these approaches.

Conclusion
In this paper we have presented the model we used in the SemEval 2020 competition, Task 11: "Detection of Propaganda Techniques in News Articles". The model consists of several components, the most important one being the BERT component. We combined BERT with valuable global and local features extracted from the articles and spans, which improved the predictive power of our model, especially in the Repetition category. We ended up with a micro-average F1 score of 0.56648 on the official test set, earning us 11th place (out of 32 teams) overall in the competition.
As visualized in Figure 2, the labels were not distributed uniformly throughout the articles. In particular, if a technique was used in an article, there was a much higher chance than expected of finding it elsewhere in the article. We still believe that the model can be improved by including new features that contain information about the different labels' trends. Furthermore, we would like to exploit the tendency that a label within an article has a higher chance of occurring later in the same article.