LTIatCMU at SemEval-2020 Task 11: Incorporating Multi-Level Features for Multi-Granular Propaganda Span Identification

In this paper we describe our submission for the task of Propaganda Span Identification in news articles. We introduce a BERT-BiLSTM based span-level propaganda classification model that identifies which token spans within the sentence are indicative of propaganda. The "multi-granular" model incorporates linguistic knowledge at various levels of text granularity, including word, sentence and document level syntactic, semantic and pragmatic affect features, which significantly improve model performance compared to its language-agnostic variant. To facilitate better representation learning, we also collect a corpus of 10k news articles, and use it for fine-tuning the model. The final model is a majority-voting ensemble which learns different propaganda class boundaries by leveraging different subsets of incorporated knowledge.


Introduction
Propaganda (Bernays, 1928) is "the deliberate and systematic attempt to shape perceptions, manipulate cognition, and direct behavior to achieve a response that furthers the desired intent of the propagandist" (Jowett and O'Donnell, 2018). The rise of digital and social media has enabled propaganda campaigns to reach vast audiences (Glowacki et al., 2018; Tardáguila et al., 2018), influencing or undermining democratic processes, amplifying the effects of misinformation, and creating echo-chambers.
As a tactic of manipulation, propaganda is considered most effective when it is veiled: the readers are not able to identify it, but their opinions are shaped according to the propagandist's hidden agenda; it is thus very difficult to identify propaganda automatically. However, automatic identification and analysis of propaganda in news articles and on social media are essential to understand propaganda at scale and develop approaches to countering it (King et al., 2017; Starbird, 2018; Field et al., 2018).
Prior research on propaganda detection focused primarily on identifying propaganda at a document level, due to the dearth of finer-grained labelled data (Rashkin et al., 2017). This has resulted in classifying all news articles from a particular source as propaganda (or hoax, disinformation, etc.), which is often not the case, and which can obfuscate a finer-grained analysis of propagandistic strategies (Horne et al., 2018).
Recently, Da San Martino et al. (2019) carried out a seminal task of fine-grained propaganda detection and curated a dataset consisting of about 550 news articles. The dataset contains word-span level annotations provided by professional annotators, along with additional information about the propaganda technique employed in each span. The Span Identification (SI) sub-task of SemEval-2020 Task 11 (Da San Martino et al., 2020) employs this aforementioned corpus and requires participants to detect propaganda spans in news articles.
In this paper, we describe our solution to the SI task. We propose a BERT-BiLSTM based multi-granularity model. The model's parameters are jointly optimized on the tasks of sentence-level and word-level propaganda detection. We fine-tune BERT-BiLSTM end-to-end on the joint loss to obtain competitive scores on the SI task. Furthermore, we explore the benefits of incorporating word, sentence and document-level syntactic and affective features extracted from dictionaries like LIWC (Tausczik and Pennebaker, 2010) and Empath (Fast et al., 2016). We find that incorporating these features leads to significant improvements in performance (discussed in Section 3). We also leverage articles from multiple well-known propaganda news sources to fine-tune BERT in an unsupervised manner. Furthermore, in order to tackle the problem of unbalanced data, we use a weighted cross-entropy loss. We present our model and additional features in Section 2, and show our results and ablation analysis in Section 3.

Figure 1: Overview of our multi-granular BERT-BiLSTM model. The sentence is passed into BERT and the token representations are concatenated with token features before being passed to a BiLSTM layer. The pooled representations are concatenated with sentence- and document-level features before computing the logits of the sentence containing propaganda. This gating value is then multiplied with the BiLSTM outputs of each token before being used to predict token labels. The model is jointly trained using loss from both token and sentence classification (details in Section 2).

Model
In this section, we describe the modeling aspects we have adopted for the task. The overview of our model is given in Figure 1.

Multi-granular BERT-BiLSTM
We implement a model that leverages sentence-level propaganda detection to guide the task of detecting propaganda word-spans. Both stages have their separate classification layers L_sent and L_tok, and the feature representation H_sent learnt at the sentence level influences H_tok.
To classify each sentence, we mean-pool the token representations output from BERT, concatenate the result with the sentence (F_sent) and document (F_doc) level features, and pass it through a fully-connected layer to obtain the logits p_sent:

p_sent = FC([mean-pool(H_BERT); F_sent; F_doc]),    (1)

where [;] denotes the concatenation operator. For word-span level predictions, we take the token representations output from BERT and concatenate them with word-level features for each token (F_word) before passing them into a BiLSTM:

H_tok = BiLSTM(H_BERT ⊕ F_word),    (2)

where ⊕ represents the token-wise concatenation operator and H_tok is the list of contextual representations for each token output by the BiLSTM. F_word, F_sent, and F_doc are described in Section 2.2.
The sentence-level representation is passed through a masking gate that projects p_sent to a single dimension and applies a sigmoid non-linearity. To control the information flow, each element of the token-level representation H^i_tok is multiplied with the gate to obtain gated token representations G_tok:

g = σ(w_g · p_sent),   G^i_tok = g · H^i_tok.    (3)

This biases the token-level model to ignore samples strongly classified as negative in the sentence-level task, allowing it to selectively learn additional information to improve performance on token-level prediction. We use a cross-entropy loss with sigmoid activation for sentence-level classification, whereas the token-level classification is trained using a cross-entropy loss with softmax activation. The losses L_sent and L_tok are jointly optimized using a hyper-parameter α that controls the strength of both losses:

L = α L_sent + (1 − α) L_tok.    (4)
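The gating and joint-loss computation above can be sketched in pure Python. This is a simplified stand-in for the actual neural model (which operates on tensors); the function names and the treatment of p_sent as a scalar logit are illustrative assumptions, not the paper's implementation.

```python
import math

def sigmoid(x):
    # Standard logistic function used by the masking gate.
    return 1.0 / (1.0 + math.exp(-x))

def gate_tokens(p_sent, h_tok, w_gate=1.0):
    """Project the sentence logit to a scalar gate g = sigmoid(w_gate * p_sent)
    and scale every token representation by it (Eq. 3)."""
    g = sigmoid(w_gate * p_sent)
    return [[g * v for v in tok] for tok in h_tok]

def joint_loss(loss_sent, loss_tok, alpha=0.5):
    """Combine the two losses: L = alpha * L_sent + (1 - alpha) * L_tok (Eq. 4)."""
    return alpha * loss_sent + (1 - alpha) * loss_tok
```

With a neutral sentence logit of 0.0 the gate is 0.5, so every token vector is halved; a strongly negative sentence logit drives the gate, and hence the token representations, toward zero.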

Additional Features
We incorporate additional lexical, syntactic, linguistic, and topical features at word, sentence and document levels to better inform the model.

Syntactic Features: Inspired by prior work that leveraged syntactic features effectively (Vashishth et al., 2018; Kumar et al., 2020), and motivated by the observation that many propaganda spans are persuasive or idiomatic phrases, we extract phrasal features from constituency parse trees to explicitly incorporate structural syntactic information. We encode the path from a word to the root in the parse tree as a d_c-dimensional embedding. Stanford CoreNLP (Manning et al., 2014) was used to extract the constituency parses as well as part-of-speech tags, which were also used as features.

Affective and Semantic Features: In addition to structural cues, prior work has shown that propaganda is marked with affective and emotional words and phrases (Gupta et al., 2019; Alhindi et al., 2019). Motivated by this, we append to the word embedding of the i-th token (output from BERT) features extracted using affective lexicons, including the LIWC (Tausczik and Pennebaker, 2010) dictionary and the NRC lexicons of Word Emotion, VAD and Affect Intensity (Mohammad and Turney, 2013; Mohammad, 2018a; Mohammad, 2018b). We also assign a score to each token that corresponds to the frequency of the word in the propaganda spans as opposed to the non-propaganda ones. For example, a word like 'invader' will have a high score, as it is salient to propaganda spans. Furthermore, we incorporate semantic class features. These include named entities (such as Person, Place, Temporal Expressions) as identified by the CoreNLP NER tagger (Manning et al., 2014) and finer-grained topical categories from Empath (Fast et al., 2016) (such as Government, War, Violence).
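The per-token propaganda-salience score described above can be sketched as a smoothed relative-frequency estimate. The exact scoring formula is not given in the paper, so the add-one smoothing and function name below are assumptions for illustration.

```python
from collections import Counter

def propaganda_salience(prop_tokens, nonprop_tokens, smoothing=1.0):
    """Score each word by its relative frequency inside propaganda spans
    versus outside them; smoothing keeps unseen words away from 0 or 1."""
    prop = Counter(prop_tokens)
    nonprop = Counter(nonprop_tokens)
    vocab = set(prop) | set(nonprop)
    return {
        w: (prop[w] + smoothing) / (prop[w] + nonprop[w] + 2 * smoothing)
        for w in vocab
    }
```

A word like 'invader' that occurs mostly inside annotated propaganda spans receives a score close to 1, while function words that occur everywhere score near 0.5 or below.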
These word-level features (F_word) are concatenated to the token-level BERT representations.

Sentence-level Features (F_sent): We encode a sentence using the BERT-large-cased model fine-tuned on news articles from different sources to get a 1024-dimensional vector.

Document-level Features (F_doc): Similarly, we obtain a 1024-dimensional embedding of the document by averaging the BERT embeddings obtained for each sentence in the document.
The motivation for incorporating sentence and document features is to inform the model about the overall topical content of the article. We hypothesize that these features are especially helpful in detecting a specific type of propaganda called "repetition" where certain events are mentioned several times in the document. We append these to the pooled BERT representation to improve sentence-level classification.
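The document-level feature is simply the mean of the sentence vectors. A minimal sketch of that pooling step, with hypothetical function and variable names (the real vectors are 1024-dimensional BERT embeddings):

```python
def document_embedding(sentence_embeddings):
    """Average a list of equal-length sentence vectors into one
    document-level vector, component by component."""
    dim = len(sentence_embeddings[0])
    n = len(sentence_embeddings)
    return [sum(vec[i] for vec in sentence_embeddings) / n for i in range(dim)]
```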

Unsupervised Fine-tuning
Neural language models like BERT (Devlin et al., 2019) unlock their real power from the large amounts of data they are pre-trained on. The original BERT models are trained on the BookCorpus and Wikipedia datasets, which are essentially text-based knowledge sources. Therefore, vanilla BERT learns language properties from an objective dataset and thus misses the nuances of persuasive and metaphorical language extensively used in propaganda. To alleviate this issue, we continue pre-training the original BERT models on a large collection of news articles collected from different sources. To ensure equal representation, we scraped approximately 10k articles from propaganda websites mentioned in Da San Martino et al. (2019), e.g., Lew Rockwell and SHTFplan.com, and 10k articles from trusted non-propagandist sources like CNN and the New York Times. We scraped articles from 2018 to mid-2019 to ensure that the articles follow the same topical distribution as that of the training articles. We split the unlabeled data into sentences and train BERT on the masked LM and next sentence prediction losses (Devlin et al., 2019). We leverage the pipeline provided by the HuggingFace transformers library.

Class-Imbalance
Since our corpus suffers from class imbalance in both the sentence classification and word-level span identification tasks, we associate a higher weight with the loss incurred by samples of the minority class. Following Khosla (2018), we calculate these weights as the inverse of the normalized frequency of samples of each class and plug them into the cross-entropy loss, namely L = −(1/N) Σ_{n=1}^{N} w_n · loss_n, where w_n is the weight associated with the loss for sample n.
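The inverse-frequency weighting can be sketched as follows. This is a simplified pure-Python illustration of the weighted mean loss above; the function names are hypothetical, and in practice the weights would be passed to a framework's cross-entropy loss rather than applied by hand.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by the inverse of its normalized frequency
    (total / count), so the minority class contributes more to the loss."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / n for c, n in counts.items()}

def weighted_mean_loss(per_sample_losses, labels, weights):
    """L = (1/N) * sum_n w_n * loss_n, with w_n looked up from the
    sample's class label."""
    n = len(labels)
    return sum(weights[y] * l for l, y in zip(per_sample_losses, labels)) / n
```

With labels [0, 0, 0, 1] the minority class 1 gets weight 4.0 versus 4/3 for class 0, so a misclassified minority sample costs three times as much.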

Experimental Setup
We used the BERT model from the HuggingFace transformers library, which was then fine-tuned on the corpus. Since the test-set was not made available during the competition, we held out a small part of the training data as a dev-set, which was used to tune the models. The submitted model was chosen based on its span-level normalized F1-score on the validation set provided by the competition organizers. The hyperparameter choices are provided in Table A1 in the Appendix. We run all our experiments on Nvidia GeForce GTX 1080 Ti GPUs with 12GB memory.

Results
In this section, we present our method's results on the official validation set. Table 1 shows the performance of the different multi-granularity BERT and BERT-BiLSTM variants. We report the mean of 5 independent runs (with different seeds). We find that BERT-large variants perform significantly better than BERT-base, with BERT-large-cased performing the best. This suggests that case-based signals might be important for identifying spans, as the writer might use capitalization to put more emphasis on propaganda information.
We also analyze the importance of word, sentence and document-level features (as detailed in Section 2.2) to our model. For brevity, we only present the results on the multi-granularity BERT-large-uncased BiLSTM (MGU-LSTM) and BERT-large-cased BiLSTM (MGC-LSTM) models due to their superior performance. The first few rows in Table 3 depict the results for concatenating word-level features to word-level BERT representations. We find that concatenating Affective and LIWC (A) and Syntax (X) features increases the performance of the MGU-LSTM and MGC-LSTM models by 0.70 and 0.24 F1 respectively. Further adding NER and Empath (N) based features only seems to help MGC-LSTM, attaining 0.18 additional points. This suggests that the embeddings learnt by the two models might differ in the kinds of features they represent: MGU-LSTM might already encode most of the information present in N (dimensions highly correlated with N). Nevertheless, testing this hypothesis requires an in-depth analysis of individual features and the word-level representations, which is out of scope for this paper.
Our experiments with sentence (S) and document-level (D) features (Table 3) suggest no clear pattern: adding sentence-level features to MGC-LSTM-AXN increases F1 by 0.1 but degrades MGU-LSTM-AX's performance from 44.11 to 44.07 F1. However, adding document-level features seems to benefit MGU-LSTM-AX-S but shows no significant improvement (at p = 0.05) for MGC-LSTM-AXN.
Finally, we present the contributions of the weighted cross-entropy loss (W) and the unsupervised fine-tuning (F) performed on articles from propaganda news sources in Table 2. Adding inverse-frequency based weighting gives a boost of 0.66 F1 points to MGC-LSTM-AXN-S. This is expected, as the corpus is highly imbalanced towards non-propaganda words. Fine-tuning on news articles also provides a significant performance jump, indicating the advantages of learning domain-informed contextualized representations. Our best performing (single) model is a multi-granularity BiLSTM architecture with BERT-large-cased embeddings fine-tuned on news articles, incorporating Affect, LIWC, Syntax, NER, Empath and sentence features, and optimized using a weighted cross-entropy loss.

Ensemble: We created a majority-voting ensemble of a subset of the model variants discussed above. We observe that we obtain the best results when the ensembled models are the most dissimilar. This pattern has also been shown to be beneficial in stacking, where the same machine-learning problem is tackled with different types of learners. We find that an ensemble of 7 such models performs best. We observe that our model is not only able to detect short spans like loaded language or name calling, but also long spans corresponding to repetition and slogans that rely on contextual information.

Final Submission: We submit the results from the ensemble followed by post-processing: merging disjoint spans separated by 1-2 words, removing stop-words and stray characters like quotes from the beginning and end of spans, and labelling words frequently tagged with the 'loaded language' label. Our final submission scored 49.06 F1-score on the official dev-set and 47.66 on the official test-set.
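The majority vote and span-merging post-processing can be sketched as below. The exact merging rules are not fully specified in the paper, so the gap threshold, function names, and character-offset span format are illustrative assumptions.

```python
from collections import Counter

def majority_vote(model_predictions):
    """Per-token majority vote across models; each row is one model's
    0/1 labels for the same token sequence."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*model_predictions)]

def merge_close_spans(spans, text, max_gap_words=2):
    """Merge predicted (start, end) character spans whose intervening
    gap contains at most max_gap_words words."""
    if not spans:
        return []
    spans = sorted(spans)
    merged = [spans[0]]
    for start, end in spans[1:]:
        prev_start, prev_end = merged[-1]
        gap = text[prev_end:start]
        if start <= prev_end or len(gap.split()) <= max_gap_words:
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```

For example, two predicted spans separated only by " are " (one word) collapse into one span, while spans separated by four words stay distinct.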

Conclusion
This paper describes our 4th place submission to the Span Identification (SI) subtask of SemEval-2020 Task 11. Our approach is based on a multi-granularity BiLSTM with BERT embeddings. We explore the contributions of several word, sentence and document-level features concatenated with the token and pooled embeddings output from BERT. Our work also highlights the importance of tackling class imbalance in the corpus and learning domain-informed representations through unsupervised fine-tuning of BERT on recent news articles. We submit a majority-voting ensemble of multiple models with potentially dissimilar decision boundaries to make robust predictions on the official validation and test-sets.