NLFIIT at SemEval-2020 Task 11: Neural Network Architectures for Detection of Propaganda Techniques in News Articles

As propaganda has become a more common technique in news, it is important to explore possibilities for its automatic detection. In this paper, we present the neural model architecture we submitted to the SemEval-2020 Task 11 competition: “Detection of Propaganda Techniques in News Articles”. We participated in both subtasks: propaganda span identification and propaganda technique classification. Our model utilizes recurrent Bi-LSTM layers with pre-trained word representations and also takes advantage of a self-attention mechanism. It achieved an F1 score of 0.405 for subtask 1 and 0.553 for subtask 2 on the test set, resulting in 17th and 16th place, respectively.


Introduction
Nowadays, many publishers provide articles online to reach their audience faster. This shift away from traditional printed-press journalism has enabled more extensive misuse of objective news for one's own agenda using various propaganda techniques. Propaganda is the spread of purposefully shaped information in media to promote some sort of agenda. This kind of information is produced by an increasing number of biased news outlets in an attempt to sway or manipulate public opinion. A closely related problem is the online spreading of fake news and misinformation in general.
However, the widespread use of the internet and access to multiple sources of information have caused traditional propaganda techniques, such as spreading disinformation, to become far less effective. Along with this trend, propaganda techniques are naturally becoming more complex to maximize their impact. The use of logical fallacies and appeals to the emotions of the audience is becoming much more prominent, because these techniques are significantly harder to spot at first glance and require much deeper analysis to reveal.
Research on detecting propaganda has focused primarily on news articles (Rashkin et al., 2017; Barrón-Cedeno et al., 2019). However, analysis of articles produced by propagandistic news outlets indicates that these sources often publish objective articles in an attempt to increase their credibility (Horne et al., 2018). Recently, attention has moved towards fragment- and sentence-level propaganda detection. A new problem has been introduced: detecting the use of specific propaganda techniques in articles, along with large corpora for assessing this problem. To tackle it, many different neural model architectures have been proposed, such as a fine-tuned pre-trained BERT model (Yoosuf and Yang, 2019), an ensemble of BERT, Bi-LSTM, and XGBoost (Al-Omari et al., 2019), or an ensemble of logistic regression, CNN, and BERT (Gupta et al., 2019).
The main goal of this SemEval task was to develop tools for the automatic detection of propaganda, and it consisted of two subtasks (Da San Martino et al., 2020):
• Subtask 1 - Span Identification was focused on identifying the spans of text in which propaganda techniques are used.
• Subtask 2 - Technique Classification was focused on identifying the propaganda technique used in a given propagandistic text fragment. The data was annotated with 18 techniques, but some of the classes were merged together due to their low frequency, which resulted in 14 classification classes for this subtask.
In this paper, we present our proposed neural network architecture, which consists of several types of neural layers. We experimented with different word representations, such as ELMo and GloVe, with different encoder layers, such as GRU and LSTM, and with the transformer model BERT. We describe and report results for multiple model architectures for both subtasks in the following sections.

Model
We experimented with multiple model setups for each subtask. We tried out multiple pre-trained embeddings, which fed Bi-LSTM layers with word representations (Pecar et al., 2019). For subtask 2, we also experimented with a fairly large pre-trained transformer model. The general architecture portraying the structure of our proposed model is shown in Figure 1.

Preprocessing Preprocessing can be considered an important step in any natural language processing task. The dataset for this task consists of news articles, so the input text is not as noisy as text from social media, but it still contains a significant number of tweet footers, timestamps, web surveys, hyperlinks, and advertisements. In an attempt to reduce the number of input tokens that are very likely not propaganda, we decided to partly or completely remove these samples. In addition, we removed all non-Latin characters, such as Hebrew and emoji characters, as well as Twitter mentions. We also substituted Unicode quotation marks, apostrophes, and hyphens with their ASCII equivalents (Pecar et al., 2018).
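The normalization steps described above can be sketched as follows. This is a minimal illustration, not the exact pipeline used for the submission; the concrete regular expressions and the Latin-character cutoff are our own assumptions.

```python
import re

def normalize_text(text: str) -> str:
    """Illustrative sketch of the described preprocessing: map Unicode
    punctuation to ASCII, strip Twitter mentions and non-Latin
    characters (Hebrew, emoji, ...), and collapse leftover whitespace."""
    # Substitute Unicode quotation marks, apostrophes, and dashes with ASCII.
    for src, dst in {"\u201c": '"', "\u201d": '"', "\u2018": "'",
                     "\u2019": "'", "\u2013": "-", "\u2014": "-"}.items():
        text = text.replace(src, dst)
    # Remove Twitter mentions such as "@user".
    text = re.sub(r"@\w+", "", text)
    # Keep only (roughly) Latin characters; the 0x250 cutoff is an assumption.
    text = "".join(ch for ch in text if ord(ch) < 0x250)
    # Collapse whitespace left over from the removals.
    return re.sub(r"\s+", " ", text).strip()
```

For example, a headline containing curly quotes, a mention, and Hebrew text is reduced to plain ASCII content only.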
Word Representation To feed input samples into our models, we need appropriate word representations. In our experiments, we used several different ones. First, we used the deep contextualized word representations known as ELMo (Peters et al., 2018) along with its available pre-trained model. We also experimented with GloVe (Pennington et al., 2014), an unsupervised learning algorithm for obtaining vector representations of words. In subtask 2, we also experimented with the transformer model BERT (Devlin et al., 2019) and its pre-trained model (Wolf et al., 2019).
Encoder Layer Word representations produced by the pre-trained embeddings are fed into the encoder layer. After initial experiments with both unidirectional and bidirectional GRU (Cho et al., 2014) and LSTM (Hochreiter and Schmidhuber, 1997) layers, we decided to continue only with bidirectional LSTMs (Schuster and Paliwal, 1997). We also tried to tweak the size of the hidden state (sizes of 128, 256, 512, and 1024 were used), but this resulted in negligible performance changes. For both subtasks, we also experimented with different numbers of stacked encoder layers (2 and 3 stacked layers were considered), but the most stable results were produced by a single encoder layer. To enhance the performance of the LSTM layers on long input samples, we also experimented with a self-attention mechanism on top of the encoder layer.
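One plausible form of the self-attention step over the encoder outputs is scaled dot-product attention; the paper does not fix the exact variant, so the formulation below (NumPy, no learned projections) is an assumption for illustration only.

```python
import numpy as np

def self_attention(H: np.ndarray) -> np.ndarray:
    """Illustrative scaled dot-product self-attention over Bi-LSTM
    outputs H of shape (seq_len, 2 * hidden_size): each position
    attends to every other position, producing context-enriched
    states of the same shape."""
    d = H.shape[-1]
    scores = H @ H.T / np.sqrt(d)                    # (seq_len, seq_len)
    # Numerically stable softmax over the key axis.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ H                                     # (seq_len, 2 * hidden_size)
```

The output keeps the sequence shape, so it can be decoded token-by-token exactly like the raw Bi-LSTM states.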
Decoder Layer We used a standard linear layer to decode the output representations of both recurrent-layer setups (with and without the self-attention mechanism) for both tasks. We also experimented with conditional random fields (CRF) for subtask 1, but the results, especially in combination with GloVe, varied highly from run to run, so we decided not to continue further experiments with CRF.
Loss Function As a loss function, we used the standard cross-entropy loss. In addition, we experimented with class weights in both subtasks, but unfortunately this yielded slightly worse results, especially in subtask 1, due to the massive class imbalance between propaganda and non-propaganda tokens.
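The class-weighted cross-entropy can be sketched as below. This is a minimal NumPy illustration of the idea, not the training code; the averaging convention (plain mean over tokens) is an assumption.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """Per-token cross-entropy where each token's loss is scaled by the
    weight of its gold class; up-weighting the minority (propaganda)
    class is the setup discussed above.
    logits: (n_tokens, n_classes), labels: (n_tokens,)."""
    # Stable log-softmax.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Pick the log-probability of each gold label and scale by its class weight.
    w = class_weights[labels]
    return -(w * log_probs[np.arange(len(labels)), labels]).mean()
```

With uniform weights this reduces to ordinary cross-entropy; raising the weight of the minority class increases its contribution to the gradient.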
Model Ensemble To further improve performance, we decided to use a model ensemble. Models were trained for up to 6 epochs; we summed the predictions from multiple epochs (all epochs after the first two) and picked the class with the maximum value. This method balanced the precision and recall metrics in subtask 1; however, it did not seem to have an impact on performance in subtask 2, so we did not investigate this option further for subtask 2.
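The epoch-ensembling step above amounts to summing per-class scores saved after each kept epoch and taking the argmax, roughly as follows (a sketch; the function name and array layout are ours):

```python
import numpy as np

def ensemble_predict(epoch_scores):
    """Sum the per-class prediction scores collected after each kept
    epoch (e.g. epochs 3-6) and pick, for every token, the class with
    the maximum summed score.
    epoch_scores: list of arrays, each (n_tokens, n_classes)."""
    total = np.sum(epoch_scores, axis=0)   # (n_tokens, n_classes)
    return total.argmax(axis=-1)           # (n_tokens,)
```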
Regularization In an attempt to prevent overfitting on the training dataset, we used dropout regularization, which we applied to the embedding and encoder layers. We also experimented with different learning rate values, which proved beneficial: lowering the learning rate improved the performance of models using ELMo word representations (specific values are reported in the Experimental Settings section).

Evaluation
In this section, we briefly provide information about the dataset, the experimental settings used for our models, and the performance of the proposed models.

Dataset
The dataset for this task consisted of around 550 news articles containing propaganda fragments that had been annotated with one of 18 propaganda techniques. As mentioned, the dataset was noisy, making the task more difficult. Table 1 shows basic information about the dataset; for more information, see the task description paper (Da San Martino et al., 2020).

The original dataset annotation contained only the identified spans (offsets of the starting and ending characters of each fragment) for the propaganda techniques. In order to feed inputs to the network, we can encode each character according to whether it is part of propaganda or not. In Table 2, we depict an example of an annotated string from the dataset. Characters are considered propaganda and labeled 1 when their position lies inside one of the propaganda spans; characters not belonging to any span are labeled 0. A character's position is determined by its offset from the beginning of the article.
Character   h o w   s t u p i d   a n d   p e t t y   t h i n g s
Propaganda  0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0

However, in order to use pre-trained word representations trained on large-scale datasets and to transfer knowledge about input words, we decided to transform the annotations from the character level to the word level (see Table 3). For every word, we saved the offset of its beginning character (inclusive) and the offset of its ending character (exclusive) and moved the annotation to the word level. A word is then labeled as propaganda or not based on whether its span intersects one of the propaganda spans. After the final prediction of which words are part of propaganda, the character offsets were reconstructed.
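The character-to-word label transfer described above can be sketched as follows. This is a minimal illustration assuming whitespace tokenization; the actual pipeline's tokenization and offset handling were more involved.

```python
def word_labels(text, spans):
    """Label each word 1 (propaganda) if its [start, end) character
    span intersects any annotated propaganda span, else 0.
    spans: list of (start, end) character offsets, end exclusive."""
    labels, offset = [], 0
    for word in text.split(" "):
        start, end = offset, offset + len(word)
        # Half-open interval intersection test against every gold span.
        hit = any(start < s_end and s_start < end for s_start, s_end in spans)
        labels.append((word, start, end, int(hit)))
        offset = end + 1  # skip the separating space
    return labels
```

On the Table 2 example, with the span covering "stupid and petty" (characters 4-20), the words "stupid", "and", and "petty" are labeled 1 and the rest 0.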

Experimental Settings
For our proposed models, we experimented with many hyperparameters during development. For both subtasks, we trained the models for 6 epochs and took the best-performing model as measured by F1 score on the dev set. For the encoder, a single bidirectional LSTM layer with a hidden size of 128 was used for subtask 1 and 512 for subtask 2. We also experimented with sizes of 128, 256, 512, and 1024, along with multiple stacked layers (2 and 3), for both subtasks. As the optimizer, we used Adam (Kingma and Ba, 2015) for both tasks; the learning rate was set to 0.0005 for subtask 1 and 10^-3 for subtask 2, while experiments with the default value of 10^-4 were also conducted. The batch size was set to 64 for subtask 1 and 32 for subtask 2.
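The reported settings can be collected into a simple configuration, e.g. (key names are our own, not from the original code):

```python
# Submitted-model hyperparameters as reported above.
CONFIG = {
    "subtask1": {"hidden_size": 128, "learning_rate": 5e-4,
                 "batch_size": 64, "epochs": 6, "optimizer": "Adam"},
    "subtask2": {"hidden_size": 512, "learning_rate": 1e-3,
                 "batch_size": 32, "epochs": 6, "optimizer": "Adam"},
}
```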
Dropout was used as a regularization technique for both tasks, with different values in different model architectures. After the ELMo layer, the default value of 0.5 was used; after the GloVe layer, a dropout value of 0.2 was used. A value of 0.2 was also applied to the output of the LSTM layers.

Results
For both subtasks, we experimented with several different architectures. Table 4 shows basic information about the architectures and a performance comparison of the models for subtask 1. The evaluation metric for this subtask is computed by a function that gives credit to partial matches between two spans; for a detailed description, see the task description paper (Da San Martino et al., 2020). We experimented with both GloVe and ELMo word representations; the latter proved to perform significantly better. We also tried to make use of a self-attention mechanism on top of the Bi-LSTM layers, which performed significantly worse for subtask 1. We also experimented with a model using conditional random fields in combination with ELMo and Bi-LSTM.

Table 5 shows that ELMo also performed better than GloVe in subtask 2. Due to class imbalance, the evaluation measure for this subtask is micro-averaged F1. In contrast to subtask 1, the self-attention mechanism on top of the Bi-LSTM significantly improved performance in subtask 2. We also experimented with a pre-trained BERT model, but due to high memory requirements we could not fully utilize the largest pre-trained model, which could potentially outperform our proposed model for subtask 2.

Submitted Models
In Table 6, we provide information about our submitted models and their performance on the test set for both subtasks. For both subtasks, we chose ELMo word representations as the embedding layer, followed by a single Bi-LSTM encoder layer. As mentioned, the LSTM hidden size seemed not to have an impact on performance, and we settled on 128 units for subtask 1 and 512 units for subtask 2. On top of the Bi-LSTM, we also used a self-attention layer for subtask 2. For both subtasks, we chose a standard linear layer as the decoder.

Further Results
In addition, we provide results for selected models with several modifications, such as changed class weights in the loss function, and give insight into model ensemble performance. In Tables 7 and 8, we report the performance of the best model for each task with different class weights in the loss function. For subtask 1, the ELMo-BiLSTM model was considered, while for subtask 2 the ELMo-BiLSTM-Att model was used. As we can observe, a higher weight for the minority class slightly improved recall but worsened precision by several points, which resulted in a lower F1 score. For subtask 2, using the training set distribution for the class weights showed the best performance.

Conclusions
We proposed a neural model architecture for the automatic detection of propaganda techniques in news articles. For both subtasks of SemEval-2020 Task 11, we used ELMo embeddings, which fed word representations to a Bi-LSTM encoder. For subtask 2, we also utilized a self-attention mechanism on top of the Bi-LSTM layer. We experimented with multiple stacked Bi-LSTM layers; however, a single layer proved to perform best for both subtasks. Our experiments with embeddings showed significantly better performance of ELMo over GloVe for both subtasks. For subtask 1, we tried to employ a CRF on top of the Bi-LSTM, with its performance coming closest to our proposed model. For subtask 2, we experimented with a pre-trained BERT model that showed the potential to outperform our proposed model, but due to high memory requirements it could not be fully utilized. We also tried to improve the performance of our proposed models by using a model ensemble, but unfortunately our strategy did not have a significant impact on performance. Code for this submission is available via GitHub repository 1 . To obtain better results for subtask 1, we would try to fine-tune the Bi-LSTM model with CRF and further experiment with an ensemble of this model and BERT. As mentioned, the largest pre-trained BERT model, utilized to its full potential, could outperform our model for subtask 2. Another interesting possibility would be an ensemble of BERT and a Bi-LSTM with attention.