NTUAAILS at SemEval-2020 Task 11: Propaganda Detection and Classification with biLSTMs and ELMo

This paper describes the NTUAAILS submission for SemEval 2020 Task 11, Detection of Propaganda Techniques in News Articles. The task comprises two sub-tasks: A, Span Identification (SI), and B, Technique Classification (TC). The goal of the SI sub-task is to identify specific fragments of a given plain text that contain at least one propaganda technique. The TC sub-task aims to identify the propaganda technique applied in a given text fragment. A different model was trained for each sub-task. Our best performing system for the SI sub-task consists of pre-trained ELMo word embeddings followed by a residual bidirectional LSTM network. For the TC sub-task, pre-trained GloVe word embeddings are fed to a bidirectional LSTM neural network. The models achieved rank 28 among 36 teams with an F1 score of 0.335 on the SI sub-task, and rank 25 among 31 teams with an F1 score of 0.463 on the TC sub-task. Our results indicate that the proposed deep learning models, although relatively simple in architecture and fast to train, achieve satisfactory results on the tasks at hand.


Introduction
Propaganda is the expression of an opinion or an action by individuals or groups deliberately designed to influence the opinions or the actions of other individuals or groups with reference to predetermined ends (Miller, 1937). It is often combined with misinformation and fake news, and together they have the potential to polarise public opinion and to promote violent extremism and hate speech. To deal with this problem, automatic identification and categorisation of propaganda, fake news and hyperpartisan content has been heavily addressed in recent research, mainly at the article level (Da San Martino et al., 2019a; Rashkin et al., 2017; Kiesel et al., 2019). This task is a follow-up to the NLP4IF shared task on fine-grained propaganda detection (Da San Martino et al., 2019c) and aims to produce models capable of spotting propaganda techniques in text fragments and then categorising them into one or more of 14 propaganda techniques. The first sub-task, Span Identification (SI), asks systems to spot propaganda fragments in a given plain-text document. The evaluation metric for this sub-task is a modified F1 measure that takes partial matching between spans into account; detailed information about this metric can be found in (Da San Martino et al., 2019b; Da San Martino et al., 2020). The second sub-task, Technique Classification (TC), takes as input text fragments labelled with one or more of 14 propaganda techniques, such as Name Calling and Loaded Language, and the goal is to classify each fragment into one or more of those techniques. Because the distribution of techniques in the gold labels is rather imbalanced, results were evaluated with the micro-averaged F1 measure.
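The intuition behind a partial-match F1 can be sketched as follows. This is a simplified illustration only, not the official scorer (which, as described in Da San Martino et al. (2020), handles multiple overlapping spans and per-article normalisation differently); spans are assumed to be half-open (start, end) character offsets.

```python
def overlap(s, t):
    # Number of characters shared by two half-open spans (start, end).
    return max(0, min(s[1], t[1]) - max(s[0], t[0]))

def partial_match_f1(pred, gold):
    # Simplified partial-match F1: each predicted span is credited in
    # proportion to its best overlap with a gold span (precision side),
    # and each gold span in proportion to its best overlap with a
    # predicted span (recall side).
    if not pred or not gold:
        return 0.0
    precision = sum(max(overlap(s, t) for t in gold) / (s[1] - s[0])
                    for s in pred) / len(pred)
    recall = sum(max(overlap(s, t) for s in pred) / (t[1] - t[0])
                 for t in gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this scheme an exact match scores 1.0, while predicting only the first half of a gold span still earns partial credit rather than zero.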
In the current paper, we propose two novel deep learning architectures to deal with the two sub-tasks of the competition. In addition, we compare our deep learning methods with classic machine learning methods, such as Logistic Regression, provided with word embeddings pre-trained on large external corpora. Transfer learning architectures have shown promising results in various natural language processing (NLP) tasks (Jiang et al., 2019; Baris et al., 2019; Oberstrass et al., 2019), so there was a strong intuition to try this kind of architecture on this demanding task as well.
The rest of the paper is organised as follows. Section 2 describes previous related research in the field of propaganda and fake news detection. Section 3 describes the data and gives a system overview. Section 4 analyses the results and the errors of the proposed methods. Finally, Section 5 draws conclusions and suggests directions for future work.

Related work
A very similar task, Fine-Grained Propaganda Detection, was shared previously in 2019 as part of the Natural Language Processing for Internet Freedom (NLP4IF'19) workshop and included two sub-tasks, Sentence Level Classification (SLC) and Fragment Level Classification (FLC). The dataset used there is a subset of this task's dataset, and FLC was essentially a combination of SI and TC, in that it included both detection and classification of text fragments. This was the first time propaganda and fake news detection was addressed at the fragment level. Previous work on propaganda and hyperpartisan news has tackled the problem mostly at the article level (Kiesel et al., 2019). In the Hyperpartisan News Detection task at SemEval 2019, two datasets were used: one relatively small dataset labelled manually, and one large corpus, suitable for deep learning methods, that was labelled using distant supervision, under the assumption that all articles from a given news outlet share the label of that outlet. Rashkin et al. (2017) created a corpus of news articles labelled as propaganda, trusted, hoax or satire with distant supervision as well. This method inevitably introduces noise, and in the Hyperpartisan News Detection task many participating teams reported that using the large corpus resulted in worse performance than not using it at all.

Data
The input data for the SI sub-task consists of news articles in plain-text format. Specifically, the data for this sub-task includes 371, 75 and 90 articles for the training, development and test partitions respectively. More details about the data collection, the annotation process and corpus statistics can be found in (Da San Martino et al., 2019d). The TC input data consists of text fragments identified as propaganda within their document context. Table 1 lists the total number of instances per technique in the training set, the percentage with respect to the total number of instances, and the evaluation results achieved by our best performing model on the development and test sets.

Preprocessing
Given that the dataset had already been partially preprocessed (title and sentence splitting was performed automatically with the NLTK sentence splitter by the organisers), we only did minimal additional preprocessing, removing most punctuation marks, which do not carry useful information for text classification. Our solution for the SI task performs token-level classification, while the data labels are at the character level. The conversion from character-level labels to token-level labels (for model training), as well as the reverse process (for prediction), incurs a small information loss that affects the performance of the models used in this sub-task. In addition, our approach for the TC task does not consider overlapping labels, which occur when a token belongs to multiple propaganda techniques simultaneously. These limitations suggest that there is considerable room for future improvement.
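The character-to-token label conversion can be sketched as follows. This is a minimal illustration, not the paper's exact code: whitespace tokenisation is an assumption (the tokeniser is not stated), and labelling a token as propaganda whenever it overlaps an annotated span is one way the conversion loses information at span boundaries.

```python
def char_spans_to_token_labels(text, spans):
    """Convert character-level propaganda spans to token-level 0/1 labels.

    `spans` is a list of half-open (start, end) character offsets.
    A token is labelled 1 if it overlaps any annotated span; tokens that
    only partially overlap a span are fully labelled, which contributes
    to the small information loss described in the text.
    """
    labels, offset = [], 0
    for token in text.split():
        start = text.index(token, offset)  # character offset of this token
        end = start + len(token)
        offset = end
        hit = any(start < s_end and end > s_start for s_start, s_end in spans)
        labels.append(1 if hit else 0)
    return labels
```

The reverse mapping at prediction time (token labels back to character spans) incurs the same boundary imprecision in the other direction.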

Span Identification
During the competition, three model architectures were explored for this sequence labelling sub-task. The first is a very simple approach based on classic machine learning algorithms such as Logistic Regression with pre-trained word vectors from Word2Vec (Mikolov et al., 2013). This model was used as a baseline against which to compare more state-of-the-art deep learning methods. The biggest drawback of this approach is that many words of the corpus are lost, because only pre-trained vectors are used and about 10,000 words do not match any of the pre-trained embeddings; the main reason for this loss is that the vectors are not fine-tuned on our data. In addition, the sequential (Lipton et al., 2015) nature of textual data suggests the use of recurrent neural networks. Taking that into consideration, the second model utilises a bidirectional LSTM network architecture. Pre-trained word vectors from GloVe (Pennington et al., 2014) were used to encode words into a vector space. Results were slightly better when using 300-dimensional GloVe vectors (in comparison with experiments using other GloVe dimensions). These vectors were fine-tuned on our corpus through the embedding layer. The LSTM model outperforms the baseline significantly, even without any fine-tuning of its hyperparameters. The final model proposed in this paper consists of an embedding layer, two bidirectional LSTM layers, a residual connection to the first BiLSTM layer, and a final fully connected layer followed by a softmax activation function. Figure 1 illustrates the model's architecture. This model is supplied with contextualised word representations generated by the pre-trained ELMo model (Peters et al., 2018). These embeddings are a function not only of the word itself but also of its context, enabling disambiguation of a word into different semantic representations. The input sequences of the BiLSTM layers are the sentences of the corpus.
We set the size of the sentences to a maximum of 80 words, as a compromise between the representation's expressiveness and its computational cost (losing only a few longer samples); shorter sentences are padded. Architecture. As Figure 1 illustrates, our model is based on a bidirectional LSTM architecture. In particular, two BiLSTM layers are used in order to make predictions that take both past and future information into account, since the context covers past and future labels in a sequence. A bidirectional LSTM is a combination of two LSTMs, one running forwards and one backwards. In addition, the residual connection allows the network to skip training of layers that are not useful and do not add value to overall accuracy. ELMo (Embeddings from Language Models) is the key element of our approach. One of its biggest benefits is that no feature engineering is needed; only the sequences and the token-level labels are required. ELMo word representations have three properties: contextual, as each word's representation depends on the entire context; deep, as the representations combine all layers of a deep pre-trained neural network; and character-based, as they are computed purely from characters, allowing the network to use morphological clues to form representations for tokens not seen in training.
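The fixed-length preparation of input sentences can be sketched as a simple pad-or-truncate step (a minimal illustration; the padding token name is an assumption):

```python
MAX_LEN = 80  # maximum sentence length used in the SI model

def pad_or_truncate(tokens, pad_token="<PAD>", max_len=MAX_LEN):
    # Truncate sentences longer than max_len and right-pad shorter ones,
    # so every input sequence to the BiLSTM has the same fixed length.
    return tokens[:max_len] + [pad_token] * max(0, max_len - len(tokens))
```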
Implementation. A 10% dropout is used on all hidden layers. The BiLSTM layers use 512 units and 10% recurrent dropout. The ELMo contextualised embeddings are used with the default number of features (1024) for every token. The Adam optimiser is selected with default parameters.
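A minimal Keras sketch of the SI architecture described above, under the assumption that the 1024-dimensional ELMo vectors are pre-computed and fed in as features (the actual system may wire ELMo into the graph differently, and the binary tag set is an assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, ELMO_DIM, UNITS, N_TAGS = 80, 1024, 512, 2

# Pre-computed ELMo vectors for each of the 80 tokens (an assumption).
inp = layers.Input(shape=(MAX_LEN, ELMO_DIM))
# Two stacked BiLSTM layers, 512 units each, 10% dropout / recurrent dropout.
h1 = layers.Bidirectional(
    layers.LSTM(UNITS, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(inp)
h2 = layers.Bidirectional(
    layers.LSTM(UNITS, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(h1)
# Residual connection back to the first BiLSTM's output.
merged = layers.Add()([h1, h2])
# Per-token softmax over propaganda / non-propaganda tags.
out = layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax"))(merged)

model = Model(inp, out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

The residual `Add` works directly because both BiLSTM layers emit 1024-dimensional outputs (512 units in each direction).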

Technique Classification
The algorithm for this multiclass classification task is based on components also used in the SI task. Our first approach consists of a Logistic Regression classifier with Word2Vec pre-trained embeddings; this model surpassed the given baseline by 15 per cent. In addition, we implemented one of the most widely used artificial neural networks, the Multi-layer Perceptron classifier from scikit-learn, which is suitable for a multiclass classification problem. The features of the MLP classifier are the pre-trained Word2Vec vectors. For the word representation, the best model used 300-dimensional embeddings from the GloVe project, fine-tuned on the input data through the embedding layer. Each sentence is padded to the length of the longest sentence, since the data size is very small. Each token (word) is replaced with its vector, and the resulting sequence is fed to the bidirectional LSTM layer. Architecture. Figure 2 illustrates the architecture of our best model for the Technique Classification task. The input sequences are fed into the embedding layer, which is initialised with the weights of the pre-trained GloVe model and then trained on our corpus. Due to the small size of the data, only one bidirectional layer is used, followed by a dense layer where the classification is done through the softmax function.
Implementation. Our implementation is based on the Keras framework with a TensorFlow back-end. To reduce training time, early stopping is used; the model is trained with accuracy as the evaluation metric and the following hyperparameters: batch size 32, 128 units, recurrent dropout 0.1, dropout 0.1, and sequence length 799. The best model is trained for 4 epochs. The Adam optimiser is used with a learning rate of 0.003. During the competition, participants had access only to the training set labels for both sub-tasks.
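A minimal Keras sketch of the TC architecture with the stated hyperparameters. The vocabulary size and the embedding matrix are placeholders (in the real system the matrix holds the pre-trained 300-dimensional GloVe vectors and is fine-tuned during training):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE, EMB_DIM, MAX_LEN, UNITS, N_CLASSES = 20000, 300, 799, 128, 14

# Placeholder for the GloVe embedding matrix (random here; the real
# system initialises this from pre-trained GloVe vectors).
glove_weights = np.random.normal(size=(VOCAB_SIZE, EMB_DIM)).astype("float32")

inp = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(
    VOCAB_SIZE, EMB_DIM,
    embeddings_initializer=tf.keras.initializers.Constant(glove_weights),
    trainable=True)(inp)  # fine-tuned on the task corpus
# A single BiLSTM layer, given the small dataset size.
h = layers.Bidirectional(
    layers.LSTM(UNITS, dropout=0.1, recurrent_dropout=0.1))(emb)
# One softmax output per propaganda technique.
out = layers.Dense(N_CLASSES, activation="softmax")(h)

model = Model(inp, out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.003),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```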

SI Task
Multiple experiments were made in order to maximise the performance of our models for this task. Table 2 shows the results of the three main algorithms on the development set. As already mentioned, the deep learning models show a dramatic increase in precision and a significant improvement in F1 score. The first algorithm, consisting of a Logistic Regression classifier with Word2Vec, achieved a precision of only 0.09 and a relatively high recall of 0.47. The low precision is mainly caused by the fact that the model classifies one word at a time, so no context information is taken into account in the classification. The LSTM algorithm, on the other hand, solves this problem even without any fine-tuning of its hyperparameters. Its increase in precision of at least +0.10 is accompanied by a simultaneous decrease in recall from 0.47 to 0.37. Despite this trade-off, the F1 score is significantly higher with the LSTM algorithm. After fine-tuning the LSTM, experimenting with different word vectors, and using techniques to avoid overfitting (such as early stopping and a small number (2) of LSTM layers), the LSTM achieves 0.30, 0.25 and 0.37 for F1, precision and recall respectively. Finally, the last model, which combines ELMo contextualised embeddings with the LSTM, has the best F1 score, 0.31213, on the development set, so we used this model for our final submission on the test set. There is a fluctuation between the (best) development and test results for almost all participating teams' submissions in terms of precision and recall. Our team achieved 0.33596, 0.46052 and 0.26444 in F1, precision and recall respectively, and placed 28th in the SI task. Table 4 shows our results in comparison with the first-placed team.

TC Task
In this task the goal is to identify the propaganda technique of a given fragment. Initially, 18 propaganda techniques were defined, but because some classes had very few samples they were reduced to 14. Even so, the data remained very imbalanced between classes and the dataset very small; that was the main reason only one BiLSTM layer was used in our final model. In addition, it was observed that almost all teams had lower results on the test set than on the development set. That lack of robustness very likely arises from the small size and imbalance of the data. Table 1 shows the performance of our final algorithm for every class and the number of samples in each class; the results of our models on the development set are shown in Table 3. Our team achieved a 0.46 micro-averaged F1 score in the final test set submission and placed 25th among 31 teams. The conclusion is that classes with more samples are easier to predict.

Conclusion and Future Work
In this paper, we presented methods combining transfer learning and recurrent neural networks, capable of detecting and classifying propaganda fragments in news articles. During the competition, various techniques and architectures were explored, and the results show that deep learning methods improve performance, as initially expected. For future work, we plan to explore different methods, such as a BERT-based sequence tagger, and to experiment with techniques to tackle the class imbalance problem, mainly in the TC task.