PsuedoProp at SemEval-2020 Task 11: Propaganda Span Detection Using BERT-CRF and Ensemble Sentence Level Classifier

This paper explains our team's submission to the Shared Task of Fine-Grained Propaganda Detection, in which we propose a sequential BERT-CRF based Span Identification (SI) model where fine-grained detection is carried out only on the articles that are flagged as containing propaganda by an ensemble SLC model. We propose this setup bearing in mind the practicality of this approach for identifying propaganda spans in an exponentially increasing content base, where fine-grained analysis of the entire data repository may not be the optimal choice because of its massive computational resource requirements. We present our analysis of different voting ensembles for the SLC model. Our system ranks 14th on the test set and 22nd on the development set, with F1 scores of 0.41 and 0.39 respectively.


Introduction
In contemporary times, fake news and propaganda have gained a lot of traction. A contributing factor to these problems is the easy dissemination of information on social media and the various alternative news outlets on the Internet, which house a vast repository of content that is difficult to moderate effectively. Propaganda is often used to promulgate news articles or content that is misleading. In conjunction, fake news not only contrives hysteria and spreads lies; in extreme cases it leads to physical violence (Kang and Goldman, 2016). Most of the work around propaganda detection has been limited to document-level classification (Shu et al., 2017; Barrón-Cedeno et al., 2019; Rashkin et al., 2017). In the past, shared tasks such as NLP4IF 2019 have dealt with Sentence Level Classification (SLC) and Fragment Level Classification (FLC) of propaganda. Fine-grained propaganda techniques provide a more suitable basis for detecting propaganda because their classification provides the reasoning behind why an instance has been flagged as propaganda. The SemEval Shared Task 11 makes progress in this respect with its two tasks: Span Identification (SI), which has the objective of finding propaganda spans, and Technique Classification (TC), which labels the propaganda technique employed in a propaganda span (Da San Martino et al., 2020).
In this paper, we focus on the Span Identification task, which involves character-level tagging of propaganda spans in text. To achieve this, we use ensemble transformer-based architectures to first perform SLC, followed by token-level tagging of only the sentences flagged as propaganda. These sentences are fed to the BERT-CRF span identification model, whose predictions are then post-processed to obtain the character-level tagging of propaganda fragments. In addition, we carry out various experiments to deal with the class imbalance in the provided data corpus and to obtain a generic model that can be employed to detect propaganda fragments in any text.
The remainder of the paper is organised as follows. Section 2 gives the background on existing work on propaganda detection, the task we worked on, and the novelty of our approach. Section 3 provides the rationale behind the system setup and describes the system's working. Section 4 then describes the experimental setup and expatiates on the basis on which we conducted our experiments. Section 5 provides the results of all our enlisted experiments and our final system, along with the final model configurations that can be used to replicate our results. Section 6 concludes the paper and presents our error analysis.

Related Work
The Span Identification Task aims at providing a more fine-grained analysis of propaganda in text. Owing to the massive importance of regulating the quality of content being circulated among the populace, researchers have explored several techniques for achieving sequence labelling and sentence level classification, both of which provide an important background for the SI task.
The task organizers provide a corpus of 550 news articles in which propaganda spans are labelled with character-level offsets within each article; span identification is thus a binary sequence tagging task, and the annotation was done manually. For example, in one annotated article, the (end-exclusive) character offsets 19 to 40 mark the propaganda span "nefarious connections". For sentence-level classification, Rashkin et al. (2017) used an LSTM model and presented a comparison of its performance with Naive Bayes and Maximum Entropy models. Da San Martino et al. (2019) used multi-granularity BERT for fine-grained propaganda detection. Work by Graves and Schmidhuber (2005) demonstrated the use of LSTMs in sequence tagging. To further this work and leverage learning from both future and past inputs in a sequence, Graves (2013) discussed the use of Bi-LSTMs. More recently, Huang et al. (2015) proposed the BiLSTM-CRF, which offers bidirectional comprehension while making use of the sentence-level tag information mapped by the CRF layer. The CRF's efficacy was further demonstrated by Lample et al. (2016), who reported higher F1 scores in NER across four different languages without leveraging any knowledge specific to those languages.
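To make the annotation format concrete, the following minimal sketch shows how such a labelled span is simply an end-exclusive slice of the raw article string (the article text here is a hypothetical stand-in, not from the corpus):

```python
# Hypothetical illustration of the character-offset annotation format.
article = "The report exposed nefarious connections between the two groups."
start, end = 19, 40  # end-exclusive character offsets of the labelled span

span = article[start:end]
print(span)  # -> "nefarious connections"
```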

System Overview
We adopt a two-step method of detecting propaganda spans: we first perform SLC and then detect spans only in the sentences that have been flagged as propaganda. Comparing the results of span identification with and without SLC, we observe an F1 score improvement of nearly 0.13 with the former method.

Classification of Sentences
One of the most recent strides in NLP has been transfer learning. Transformers like BERT, RoBERTa, XLNet and ALBERT are pretrained on large corpora, and these language models can be fine-tuned on different downstream tasks (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019; Lan et al., 2019) to achieve state-of-the-art results. One advantage of these large language models is that they tend to generalize well on smaller datasets like ours, which was one of the deciding factors in choosing them. Upon experimentation, we concluded that an ensemble model with XLNet and RoBERTa as the base models performed better for the classification of sentences than ALBERT, BERT, and several other ensemble permutations of these models. We primarily credit RoBERTa's superior performance to the fact that Liu et al. (2019) pretrained it on a larger corpus that includes the CC-News dataset of public news articles, which is similar to the data provided in our task. Secondly, RoBERTa is trained with larger mini-batches and learning rates. As far as XLNet is concerned, Yang et al. (2019) use a novel permutation language modelling objective that helps the autoregressive (AR) model capture bidirectional context, which is otherwise lost in AR models. Besides, XLNet's and RoBERTa's superior performance over BERT on several downstream tasks was an indication that they might perform better on the task given to us.
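As an illustration, fine-tuning such a pretrained encoder for binary SLC with the HuggingFace transformers library might look like the following minimal sketch (the example sentences, labels, and single training step are assumptions for illustration, not our final training configuration):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Sketch: fine-tune a pretrained encoder for binary propaganda/non-propaganda SLC.
# "roberta-base" could be swapped for "xlnet-base-cased" or another checkpoint.
model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical training examples: 1 = propaganda, 0 = non-propaganda.
sentences = ["He is a tyrant bent on destroying us.", "The committee met on Tuesday."]
labels = torch.tensor([1, 0])

batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One illustrative optimisation step; a real run loops over epochs and batches.
model.train()
optimizer.zero_grad()
out = model(**batch, labels=labels)  # cross-entropy loss over the two classes
out.loss.backward()
optimizer.step()
```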

Ensemble of Transformers
Different base models were used to analyze which ensemble configuration worked best. While we explored conventional ensemble criteria for the SLC model (the results of which can be reviewed in the results section), we also considered two other criteria. In the OR-based ensemble, a sentence is flagged as propaganda if either of the base models predicts it to be propagandistic in nature. Along similar lines, in the AND-based ensemble, a sentence is deemed a propaganda instance only if both base models predict it to be one.
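Both criteria reduce to simple boolean combinations of the base models' binary predictions, as in the sketch below (variable names and the example predictions are illustrative):

```python
from typing import List

def or_ensemble(pred_a: List[int], pred_b: List[int]) -> List[int]:
    """Flag a sentence as propaganda if EITHER base model flags it."""
    return [int(a == 1 or b == 1) for a, b in zip(pred_a, pred_b)]

def and_ensemble(pred_a: List[int], pred_b: List[int]) -> List[int]:
    """Flag a sentence as propaganda only if BOTH base models flag it."""
    return [int(a == 1 and b == 1) for a, b in zip(pred_a, pred_b)]

# e.g. hypothetical XLNet and RoBERTa predictions over four sentences:
xlnet_preds, roberta_preds = [1, 0, 1, 0], [1, 1, 0, 0]
print(or_ensemble(xlnet_preds, roberta_preds))   # [1, 1, 1, 0]
print(and_ensemble(xlnet_preds, roberta_preds))  # [1, 0, 0, 0]
```

The OR criterion trades precision for recall, which matters in our pipeline because any sentence the SLC stage misses is never seen by the SI model; the AND criterion does the reverse.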

Span Identification
Span identification is a binary sequence labelling task, which we tackle using BERT-CRF. In the context of our task, transfer learning proves to be a powerful and efficient approach because of the lack of training data; we use fine-tuning as the transfer learning method. BERT is used as the encoder and is fine-tuned, while a CRF layer is used to decode and obtain the sequence predictions. The architecture can be observed in Figure 2, where the BERT language model is connected to a fully connected layer that is in turn connected to the CRF layer.
We use a linear-chain CRF as the decoder. Every token of the input sequence x is converted into a vector w. The posterior probability of a tag sequence y given x is

P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{k=1}^{n} h_k(y_k; x) + \sum_{k=1}^{n-1} A_{y_k, y_{k+1}} \Big)

where Z(x) is the normalization factor for x, h_k(y_k; x) is the output of the layer preceding the softmax and scores tag y_k at position k, and n is the sequence length. The transition score matrix A can be learned by the model or set manually; we let the model learn this parameter itself, with A_{y_k, y_{k+1}} giving the transition score from tag y_k to tag y_{k+1}. The most probable tag sequence for x is given by (Sutton et al., 2012)

\hat{y} = \arg\max_{y} P(y \mid x).
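A minimal sketch of this architecture in PyTorch, assuming the third-party pytorch-crf package for the linear-chain CRF layer (the class name and hyperparameters here are illustrative, not our exact implementation):

```python
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # third-party: pip install pytorch-crf

class BertCrfTagger(nn.Module):
    """BERT encoder -> linear emission layer -> linear-chain CRF decoder."""

    def __init__(self, num_tags: int = 2, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.emission = nn.Linear(self.bert.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)  # learns the transition matrix A

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.emission(hidden)  # h_k(y_k; x) scores per token
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence.
            return -self.crf(emissions, tags, mask=mask)
        # Inference: Viterbi decoding of the most probable tag sequence y-hat.
        return self.crf.decode(emissions, mask=mask)
```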

Experimental Setup
This section describes our train-test setup which includes our analysis of the dataset, the data processing steps and the various models that we use to achieve our results.

Data
Preliminary data analysis on the provided corpus revealed a class imbalance between the propaganda and non-propaganda samples, with only 3211 of the 15275 training sentences containing propaganda spans. To generate balanced classes, we explored two techniques (a minimal sketch of the first appears after this list):

(a) Minority class oversampling: The sentences containing propaganda spans were a clear minority in the provided dataset. We therefore oversampled this class and concluded that the resulting oversampled training corpus produced a higher F1 score. Results from our experiments with both the oversampled and non-oversampled corpora can be found in Table 3.1 in the results section.

(b) Paraphrasing: Wei et al. (2019) propose data augmentation techniques such as synonym replacement, random insertion, random swap, and random deletion, which were explored to compensate for the lack of propaganda samples 1.
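The oversampling of point (a) can be sketched as random duplication of minority samples with replacement (an illustrative reconstruction, not our exact sampling code):

```python
import random

def oversample_minority(sentences, labels, minority_label=1, seed=42):
    """Randomly duplicate minority-class samples until the two classes balance."""
    rng = random.Random(seed)
    minority = [(s, l) for s, l in zip(sentences, labels) if l == minority_label]
    majority = [(s, l) for s, l in zip(sentences, labels) if l != minority_label]
    # Draw with replacement until the minority count matches the majority count.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = minority + majority + extra
    rng.shuffle(balanced)
    return [s for s, _ in balanced], [l for _, l in balanced]
```

On this corpus, such a scheme would roughly grow the 3211 propaganda sentences to match the remaining 12064 non-propaganda sentences.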

Models
For sentence classification, we experimented with the BERT, RoBERTa, XLNet and ALBERT transformer architectures, and our experiments led us to choose RoBERTa and XLNet as the base models for the final ensemble. All sentences in the training corpus are fed to RoBERTa and XLNet to train the SLC model in the proposed pipeline. While the ensemble model is trained on the entire training corpus, the SI model is trained only on the set of sentences flagged as containing a propaganda span. In the prediction cycle, SLC is carried out on the entire test set, following which we feed only the sentences flagged as containing a propaganda element by the SLC model to the BERT-CRF model.
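The resulting two-stage prediction flow can be summarised in a short sketch (slc_predict and si_tag are hypothetical stand-ins for the ensemble classifier and the BERT-CRF tagger described above; this is an outline, not our exact pipeline code):

```python
def detect_spans(sentences, slc_predict, si_tag):
    """Two-stage pipeline: SLC filters sentences, SI tags spans in flagged ones."""
    flags = slc_predict(sentences)  # 1 = propaganda, 0 = non-propaganda
    spans = {}
    for idx, (sentence, flag) in enumerate(zip(sentences, flags)):
        if flag == 1:
            # The BERT-CRF tagger runs only on the (much smaller) flagged subset.
            spans[idx] = si_tag(sentence)
    return spans  # sentence index -> predicted spans
```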

Results
In this section, we present the results of all the experiments discussed in Section 4, which justify our choice of the proposed model and its configurations. The metrics used here are the same as those used to evaluate the task results for the leaderboard 2. In Table 1 we discuss the results that inspired our choice of a sequential approach with an SLC model. The training configuration for these results comprised a learning rate of 1.00e-05, a batch size of 16, 10 epochs, and RoBERTa as the SLC model. All the results presented here are evaluated on the dev set provided by the organizers. As observed, the SI model's performance improved by nearly 0.15 when used with the SLC model. A notable difference was also noticed with the use of oversampling, after which we sought to explore some data augmentation techniques to create a more balanced training corpus. The most relevant results are summarised in Table 2.
As observed in Table 2, the model's performance was much better with oversampling than with paraphrasing. Since Wei et al. (2019) had already discussed that EDA may not be as effective with pre-trained models, no further experiments were conducted along these lines. As observed, the XLNet-RoBERTa ensemble produced the best F1 score for the SI task and was hence employed in our final model pipeline. We also attempted to use BERT-Large for the task but achieved an F1 of only 0.35. Additionally, we analyzed several hyperparameters, including higher learning rates such as 1e-4 (which produced an F1 of 0.38), more training epochs such as 20 (F1 of 0.372), and smaller batch sizes such as 8 (F1 of 0.377). Having analysed all these results, we fixed our proposed model's configuration as shown in Table 4.

Table 4: Final model configuration. Model used: BERT-CRF (BERT-Base, uncased, L=12, H=768, A=12).

Conclusion

This paper explains our team's submission to the Shared Task of Fine-Grained Propaganda Detection, in which we propose a sequential BERT-CRF based Span Identification model where fine-grained detection is carried out only on the articles that are flagged as containing propaganda by an ensemble SLC model. We propose this setup bearing in mind the practicality of this approach for identifying propaganda spans in an exponentially increasing content base, where fine-grained analysis of the entire data repository may not be the optimal choice because of its massive computational resource requirements.
In the future, we intend to explore more advanced and efficient transformer models, including T5 and Reformer.

Error Analysis
We identify two possible sources of error, both arising from assumptions made during data processing. First, we suspect programming errors in the data post-processing steps in which the BERT token-based predictions are mapped back to their original tokens (where each token is a word from the sentence, against which we have the character offsets) to recover the character offsets of the predicted spans; errant handling of punctuation may have corrupted the computed character offsets. Second, some assumptions are made during this post-processing, such as treating an 'X' token as a 'p' token whenever it is both preceded and followed by a 'p' token, which may not necessarily be correct.
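The second heuristic can be sketched as follows (tag names follow the description above; this is a simplified reconstruction of the idea, not our exact post-processing code):

```python
def smooth_x_tags(tags):
    """Relabel an 'X' (WordPiece continuation) tag as 'p' when it sits between
    two 'p' tags -- the assumption that can introduce span-boundary errors."""
    smoothed = list(tags)
    for i in range(1, len(tags) - 1):
        if tags[i] == "X" and tags[i - 1] == "p" and tags[i + 1] == "p":
            smoothed[i] = "p"
    return smoothed

print(smooth_x_tags(["p", "X", "p", "X", "O"]))  # ['p', 'p', 'p', 'X', 'O']
```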