Hitachi at SemEval-2020 Task 11: An Empirical Study of Pre-Trained Transformer Family for Propaganda Detection

In this paper, we present our system for SemEval-2020 Task 11, in which we tackle propaganda span identification (SI) and technique classification (TC). We investigate heterogeneous pre-trained language models (PLMs), namely BERT, GPT-2, XLNet, XLM, RoBERTa, and XLM-RoBERTa, fine-tuning each of them for SI and TC. In large-scale experiments, we found that each language model has characteristic properties and that ensembling them is promising. Our ensemble model was ranked 1st among 35 teams for SI and 3rd among 31 teams for TC.


Introduction
This paper presents our system for SemEval-2020 Task 11: Detection of Propaganda Techniques in News Articles (Da San Martino et al., 2020). The goal of the task was to design a model to detect and classify propaganda. To this end, there are two subtasks: span identification (SI) for predicting propaganda spans and technique classification (TC) for predicting the propaganda technique types used in a given span. SI can be framed as a sequence labeling problem and TC as a multi-label classification problem.
Recent studies (Yoosuf and Yang, 2019; Vlad et al., 2019) proposed employing BERT (Devlin et al., 2019), a pre-trained language model, for propaganda detection. Since propaganda detection requires a high level of semantic understanding, leveraging such strong pre-trained language models is promising. However, new state-of-the-art pre-trained language models, such as XLNet and RoBERTa (Liu et al., 2019), are being proposed rapidly, and they have not yet been sufficiently studied for propaganda detection. Revealing how well state-of-the-art models capture propaganda semantics could inform future applications. Therefore, we investigate state-of-the-art pre-trained language models (PLMs) for propaganda detection. We employ not only BERT (Devlin et al., 2019) but also other PLMs such as GPT-2 (Radford et al., 2019), RoBERTa (Liu et al., 2019), XLM-RoBERTa, XLNet, and XLM (Lample and Conneau, 2019). The PLMs are fine-tuned with our proposed SI and TC models under various hyperparameters. We also propose an effective ensemble method based on stacked generalization, which is generally better than a naive average ensemble.
Our ensemble model was ranked 1st in SI and 3rd in TC, showing that leveraging state-of-the-art PLMs is promising. We also empirically gained the following insights as described later.
1. RoBERTa and XLNet generally perform better for propaganda detection.
2. An ensemble model with all types of PLMs showed stable and better performance than employing a single PLM type.
3. Each PLM has a different optimal learning rate, and finding the optimal one is essential to elicit high performance.

Related Work
Propaganda generally intends to promote an agenda or point of view with specific information such as biased or misleading descriptions. Since detecting propaganda in text could be useful for further applications such as detecting fake news, research on propaganda detection has been attracting much attention. Earlier work proposed a model for assessing the "level" of propaganda in an article. Da San Martino et al. (2019) focused on more fine-grained analysis, proposing a model for detecting propaganda spans and classifying their techniques. Some recent studies (Yoosuf and Yang, 2019; Vlad et al., 2019; Hua, 2019; Fadel et al., 2019; Tayyar Madabushi et al., 2019) utilized PLMs such as BERT (Devlin et al., 2019) to detect fine-grained propaganda. Our work is related to the studies of Fadel et al. (2019) and Al-Omari et al. (2019), which employed ensemble models with PLMs. Unlike these studies, we further investigate a larger number of PLMs.

Pre-Trained Language Models (PLMs)
In this paper, six types of Transformer-based (Vaswani et al., 2017) PLMs were used. The reason behind this is to unify our implementation and evaluation. We provide a brief description of each PLM:
BERT (Devlin et al., 2019) is the epoch-making Transformer-based masked language model. We employ a pre-trained model called bert-large-cased-whole-word-masking.
GPT-2 (Radford et al., 2019) is a model that followed OpenAI GPT (Radford and Sutskever, 2018

Propaganda Detection Models
Given a sentence split from an article, our SI model predicts propaganda spans. The TC model in turn predicts propaganda techniques for given spans in a sentence. The technique labels include Loaded Language, which uses words with strong emotional implications; Name Calling or Labelling, which describes an object as something that the audience fears or sees as undesirable; Doubt, which questions the credibility of something; and so on. Refer to Da San Martino et al. (2020) for more details. The following subsections give details on the proposed SI and TC models. Figure 1a shows an overview of the proposed model for SI. Given a tokenized sentence, a PLM ∈ {BERT, GPT-2, RoBERTa, XLM-RoBERTa, XLNet, XLM} and a bi-directional long short-term memory (BiLSTM) (Graves et al., 2013) encode the sentence, and the model predicts propaganda spans with token-level BIO tags, which indicate whether a token begins (B), is inside (I), or is outside (O) a propaganda span. In addition, we provide two joint auxiliary tasks to train the model effectively. These auxiliary tasks are described later.
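To make the BIO tagging scheme concrete, the following minimal sketch converts a predicted tag sequence into token-index spans. It is an illustration only: the function name `bio_to_spans` is hypothetical, and the choice to tolerate an I tag that has no preceding B is an assumption, not something stated in this paper.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert token-level BIO tags to (start, end) token spans, end exclusive."""
    spans = []
    start = None  # start index of the span currently being read, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new span begins; close the previous one if it is still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I":
            # Assumption: an I without a preceding B opens a span as well.
            if start is None:
                start = i
        else:  # "O"
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


example_tags = ["O", "B", "I", "O", "B", "B", "I"]
print(bio_to_spans(example_tags))
```

Token spans obtained this way would still have to be mapped back to character offsets of the article for submission to the shared task.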

SI Model
The input representation of each token is built as follows. A PLM layer attention combines the layers of the PLM, where $s^{(si)}$ and $c^{(si)}$ are trainable parameters and $\mathrm{PLM}_{ij}$ is the embedding of the $i$-th word token in the $j$-th layer of a PLM; we denote the resulting embedding by $h^{(si)}_{PLM,i}$. We also concatenated part-of-speech (PoS) embeddings ($h^{(si)}_{PoS,i}$) and named entity (NE) embeddings ($h^{(si)}_{NE,i}$) for each token $i$. Therefore, the $i$-th word token is represented as
$$h^{(si)}_i = h^{(si)}_{PLM,i} \oplus h^{(si)}_{PoS,i} \oplus h^{(si)}_{NE,i},$$
where $\oplus$ is the concatenation operation.
BiLSTM-CRF: We employed a BiLSTM-CRF (Huang et al., 2015) because we found in preliminary experiments that stacking it on a PLM leads to better performance in SI. The input token representations $h^{(si)}_i$ are fed to a multi-layered BiLSTM to obtain further contextualized token representations. We then apply a feed-forward network (FFN) with one hidden layer, followed by a fully-connected layer with parameters $W^{(si\_bio)}$ and $b^{(si\_bio)}$, to the recurrent states before classification; the result $\hat{y}^{(si\_bio)}_i$ is the output for the B, I, and O labels. Finally, we employ a conditional random field (CRF) for training and prediction.
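The layer attention and the concatenation of token features can be sketched in a few lines of plain Python. The exact layer-attention equation is not available in this text, so the sketch assumes a common form: the per-layer scores $s$ are softmax-normalized, the layer embeddings are summed with these weights, and the result is scaled by $c$. The function names are hypothetical.

```python
import math
from typing import List


def layer_attention(layers: List[List[float]], s: List[float], c: float) -> List[float]:
    """Weighted sum of per-layer embeddings of one token.

    Assumption: weights are softmax(s) and the sum is scaled by the scalar c.
    """
    m = max(s)
    exps = [math.exp(x - m) for x in s]  # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(layers[0])
    return [c * sum(w * layer[d] for w, layer in zip(weights, layers)) for d in range(dim)]


def token_representation(h_plm: List[float], h_pos: List[float], h_ne: List[float]) -> List[float]:
    """Concatenate the PLM, PoS, and NE embeddings of one token."""
    return h_plm + h_pos + h_ne


# Two layers with equal scores: each layer receives weight 0.5.
h_plm = layer_attention([[1.0, 0.0], [0.0, 1.0]], s=[0.0, 0.0], c=2.0)
h = token_representation(h_plm, h_pos=[0.3], h_ne=[0.7])
print(h)
```

In a real model the inputs would be tensors and $s$ and $c$ would be learned jointly with the rest of the network; the lists here only show the arithmetic.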
[Auxiliary1] Token-Level Technique Classification: This auxiliary task predicts token-level propaganda technique classes to add more information for span identification. In fact, a previous study (Schulz et al., 2018) suggests that joint training with multiple token-level tasks helps improve performance in a low-resource setting. Our expectation is that spans for low-frequency propaganda techniques can be detected with this auxiliary task. We achieve this by simply providing another output layer with parameters $W^{(si\_tech)}$ and $b^{(si\_tech)}$, whose output $\hat{y}^{(si\_tech)}_i$ covers the 14 propaganda technique classes and a non-propaganda class. We also employ a CRF for training and prediction.
[Auxiliary2] Sentence-Level Classification: Given that sentences containing propaganda are far fewer than non-propaganda sentences, we introduce a sentence-level auxiliary task. This task predicts a lower-granularity, sentence-level class indicating whether the sentence contains propaganda, and the higher-granularity token-level tasks are backpropagated only when a sentence contains propaganda. Therefore, we do not learn much information from non-propaganda sentences for detecting spans. This idea is similar to the work of Da San Martino et al. (2019), who proposed incorporating a higher-granularity task on the basis of lower-granularity information (a sentence-level task) with a gating mechanism.
We provide another PLM layer attention and multi-layered BiLSTM to keep sentence-level information separate from token-level information. Finally, we use the output for the BoS token to predict the probability $\hat{y}^{(si\_sent)}$ that a sentence contains propaganda with a sigmoid output layer, where $v^{(si\_sent)}$ and $b^{(si\_sent)}$ are parameters, $e^{(si\_sent)}_{BoS}$ is the hidden state of the BoS token (normally $e^{(si\_sent)}_{BoS} = e^{(si\_sent)}_1$), and $\sigma$ is the sigmoid function. We also concatenated a structural feature vector $\phi$ that includes the sentence length and positional information within the article. The positional information consists of binary signals indicating whether the sentence is located in the upper, middle, or lower part of the article.
Objective: Given that we have distinct sentence- and token-level tasks, our objective is described as
$$L^{(si)} = L^{(si\_sent)} + \begin{cases} 0 & \text{if the sentence contains no propaganda spans} \\ L^{(si\_bio)} + \lambda L^{(si\_tech)} & \text{else} \end{cases}$$
where $L^{(si\_sent)}$ is a cross-entropy loss for the sentence-level task, and $L^{(si\_bio)}$ and $L^{(si\_tech)}$ are CRF losses (negative log-likelihood) for span identification and technique classification, respectively. $\lambda$ is a hyperparameter for controlling the auxiliary task. The objective means that if a sentence contains no propaganda spans, the token-level tasks are ignored. Also, we assign a weight to the sentence-level loss according to the inverse proportion of positive samples to deal with class imbalance.
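The gated objective can be illustrated with a small sketch. It follows the description above: the sentence-level loss is always counted, and the token-level losses are added only for sentences that contain propaganda. The function names, the `sent_weight` argument, and the way the inverse proportion of positives is turned into a weight are assumptions made for illustration.

```python
def si_objective(loss_sent: float, loss_bio: float, loss_tech: float,
                 has_propaganda: bool, lam: float, sent_weight: float = 1.0) -> float:
    """Combine sentence-level and token-level losses for one sentence."""
    total = sent_weight * loss_sent
    if has_propaganda:
        # Token-level losses contribute only for propaganda sentences.
        total += loss_bio + lam * loss_tech
    return total


def inverse_positive_proportion(n_positive: int, n_total: int) -> float:
    """Weight based on the inverse proportion of positive sentences.

    Assumption: the weight is simply n_total / n_positive.
    """
    return n_total / n_positive


print(si_objective(0.5, 1.0, 2.0, has_propaganda=False, lam=0.5))  # sentence loss only
print(si_objective(0.5, 1.0, 2.0, has_propaganda=True, lam=0.5))   # all three terms
print(inverse_positive_proportion(25, 100))
```

With real models the three inputs would be differentiable loss tensors, so skipping the token-level terms also skips their gradients for non-propaganda sentences.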
After training, we adjust the threshold of the sentence-level task on the basis of validation scores. At inference, if the output probability of the task is below the threshold, the sentence is regarded as a non-propaganda sentence. Propaganda spans are predicted only when the probability is above the threshold.

TC Model
Figure 1b shows an overview of the proposed TC model. In TC, given a propaganda span, we predict its propaganda technique(s).
Propaganda Span Representation: To produce a propaganda span representation, we provide two distinct FFNs and feed them the input representations $h^{(tc)}_i$, which are obtained in the same manner as in the SI model. One of the two FFNs is for the BoS token and produces the sentence representation $e^{(tc)}_{BoS}$, and the other is for the tokens in a propaganda span. The propaganda span representation is obtained by concatenating the representation of the BoS token ($e^{(tc)}_{BoS}$), those of the tokens located at the span start ($e^{(tc)}_{start}$) and end ($e^{(tc)}_{end}$), and the representations aggregated by attention ($e^{(tc)}_{att}$) and max-pooling ($e^{(tc)}_{maxp}$) over the span:
$$e^{(tc)}_{BoS} \oplus e^{(tc)}_{start} \oplus e^{(tc)}_{end} \oplus e^{(tc)}_{att} \oplus e^{(tc)}_{maxp}.$$
Classifier and Objective: We provide an additional label-wise FFN and linear layer to extract label-specific information for each propaganda technique before prediction, where $v^{(tc)}$ and $b^{(tc)}$ are the trainable parameters of the linear layer for a given technique label such as flag-waving.
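The span representation described above can be sketched with plain lists. How the attention scores are computed is not specified in this text, so the sketch takes them as a given input (`scores`, a hypothetical argument) and only shows the aggregation and concatenation steps.

```python
import math
from typing import List

Vector = List[float]


def max_pool(vectors: List[Vector]) -> Vector:
    """Element-wise maximum over the token vectors of a span."""
    return [max(column) for column in zip(*vectors)]


def attention_pool(vectors: List[Vector], scores: List[float]) -> Vector:
    """Softmax-weighted sum of token vectors (the scores are assumed to be given)."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def span_representation(e_bos: Vector, span_tokens: List[Vector], scores: List[float]) -> Vector:
    """Concatenate BoS, span start, span end, attention-pooled, and max-pooled vectors."""
    e_start, e_end = span_tokens[0], span_tokens[-1]
    return (e_bos + e_start + e_end
            + attention_pool(span_tokens, scores)
            + max_pool(span_tokens))


rep = span_representation(e_bos=[0.5], span_tokens=[[1.0], [3.0]], scores=[0.0, 0.0])
print(rep)
```

The resulting vector has five times the dimensionality of a single token vector, which is what the label-wise classifier would receive as input.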
Since TC is a multi-label problem, we use a binary cross-entropy loss for each class. Similar to Da San Martino et al. (2019), we weight the loss according to the proportion of positive samples to deal with class imbalance. After training, we automatically rescale the output probability of each label by a multiplicative factor chosen on the basis of validation scores. At inference, we sort the predicted labels for each sentence in descending order of probability and assign labels to a multi-label span according to that order.
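One possible reading of this inference step is sketched below: each label probability is multiplied by a per-label factor, the labels are ranked, and a span that needs n labels receives the n highest-ranked ones. This is an interpretation, not the confirmed procedure; the function name, the `scale` dictionary, and the example probabilities are hypothetical.

```python
from typing import Dict, List, Optional


def assign_labels(probabilities: Dict[str, float], n_labels: int,
                  scale: Optional[Dict[str, float]] = None) -> List[str]:
    """Return the n_labels techniques with the highest (rescaled) probability."""
    scale = scale or {}
    # Assumption: validation-based tuning yields one multiplicative factor per label.
    scaled = {label: p * scale.get(label, 1.0) for label, p in probabilities.items()}
    ranked = sorted(scaled, key=scaled.get, reverse=True)
    return ranked[:n_labels]


probs = {"Loaded Language": 0.7, "Doubt": 0.2, "Name Calling or Labelling": 0.6}
print(assign_labels(probs, n_labels=2))
print(assign_labels(probs, n_labels=1, scale={"Doubt": 4.0}))
```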

Ensemble with Stacking
We propose an ensemble strategy based on the concept of stacked generalization (Wolpert, 1992). Stacked generalization feeds prediction results (i.e., output probabilities from classifiers) into a meta-estimator and trains the estimator with gold labels. In this study, the keys are hyperparameter search and cross-validation. The simplified training procedure of the ensemble model can be found in Figure 2.
In the procedure for model training, for either SI or TC, assume we have $k$-fold cross-validation, $N_H$ hyperparameter sets, and $N_P$ PLMs. A hyperparameter set includes the dropout ratio and learning rate. For each hyperparameter set, we fine-tune the SI or TC models on the training folds without using the validation fold. Therefore, $N_P \times k$ models are generated for each hyperparameter set. To select better models, we use only the top $N_{HT}$ hyperparameter sets on the basis of the validation score, resulting in $N_{HT} \times N_P \times k$ models. For example, in Figure 2, $N_{HT} = 2$, $N_P = 2$ (i.e., BERT and XLM), and $k = 3$.
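As a worked instance of this count, the sketch below enumerates the fine-tuned models for the Figure 2 setting (two retained hyperparameter sets, two PLMs, three folds). The identifiers `set_1` and `set_2` are placeholders, not names used in the paper.

```python
from itertools import product
from typing import List, Tuple


def enumerate_models(top_sets: List[str], plms: List[str], k: int) -> List[Tuple[str, str, int]]:
    """List every (hyperparameter set, PLM, fold) combination that is fine-tuned."""
    return list(product(top_sets, plms, range(k)))


models = enumerate_models(["set_1", "set_2"], ["BERT", "XLM"], k=3)
print(len(models))  # 2 * 2 * 3 = 12
```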
In the meta-estimator training procedure, we train a linear model (i.e., a classifier or regressor) on the outputs of the fine-tuned SI or TC models. First, each fine-tuned model predicts the validation fold that was held out from its training data. By concatenating the predicted validation folds for each hyperparameter set, we obtain out-of-fold predictions that cover the whole training data for that set. This means that we have meta-features $D \in \mathbb{R}^{d \times N_{HT}}$ to train the meta-estimator, where $d$ is the size of the training data.
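The assembly of out-of-fold meta-features can be sketched as follows. Each fold model is represented by a plain callable that maps a sample index to a probability, which is a simplification for illustration; the helper names are hypothetical.

```python
from typing import Callable, List

Predictor = Callable[[int], float]


def out_of_fold_column(n_samples: int, folds: List[List[int]],
                       fold_models: List[Predictor]) -> List[float]:
    """Predict every sample with the model that did NOT see it during training."""
    column = [0.0] * n_samples
    for val_indices, model in zip(folds, fold_models):
        for i in val_indices:
            column[i] = model(i)
    return column


def build_meta_features(n_samples: int, folds: List[List[int]],
                        models_per_set: List[List[Predictor]]) -> List[List[float]]:
    """Stack one out-of-fold column per hyperparameter set into a d x N_HT matrix."""
    columns = [out_of_fold_column(n_samples, folds, fm) for fm in models_per_set]
    return [list(row) for row in zip(*columns)]


def constant_model(p: float) -> Predictor:
    """Toy stand-in for a fine-tuned model: always outputs probability p."""
    return lambda i: p


folds = [[0, 1], [2, 3], [4, 5]]  # validation indices of a 3-fold split
set_1 = [constant_model(0.1), constant_model(0.2), constant_model(0.3)]
set_2 = [constant_model(0.9), constant_model(0.8), constant_model(0.7)]
D = build_meta_features(6, folds, [set_1, set_2])
print(len(D), len(D[0]))  # d = 6 rows, N_HT = 2 columns
```

Because every row is predicted by a model that never trained on it, the meta-estimator is fitted on predictions that resemble what it will see on unseen test data.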
In the test procedure, we predict test labels with the fine-tuned models of the top hyperparameter sets that were selected in the training step. The predicted labels are then fed into the trained meta-estimator to obtain the final predictions.
Meta-Estimators: We employ the ridge classifier (Hoerl and Kennard, 1970) implemented in scikit-learn (Pedregosa et al., 2011) as the meta-estimator. We expect that, even with a naive linear model, the final outputs are generally more robust and accurate. We provided meta-estimators for the sentence-level task in SI, BIO classification in SI, and all TC labels. The meta-estimators of the sentence-level task and TC labels receive the output probabilities of the corresponding labels as input representations.
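To show what a ridge classifier does with such meta-features, the sketch below re-implements the idea in plain Python for a binary label: targets are mapped to -1 and +1, a ridge regression with an intercept is solved in closed form, and the sign of the output gives the class. This is a didactic stand-in rather than the scikit-learn implementation used in the paper; the function names, the value of `alpha`, and the toy data are assumptions.

```python
from typing import List, Tuple


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ridge_classifier(X: List[List[float]], y: List[int],
                         alpha: float = 1.0) -> Tuple[List[float], float]:
    """Fit ridge regression on targets in {-1, +1}; returns (weights, intercept)."""
    n, p = len(X), len(X[0])
    t = [1.0 if label == 1 else -1.0 for label in y]
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    t_mean = sum(t) / n
    xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered features
    tc = [v - t_mean for v in t]                                # centered targets
    gram = [[sum(r[i] * r[j] for r in xc) + (alpha if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * v for r, v in zip(xc, tc)) for i in range(p)]
    w = solve(gram, rhs)
    b = t_mean - sum(wi * mi for wi, mi in zip(w, x_mean))
    return w, b


def predict(X: List[List[float]], w: List[float], b: float) -> List[int]:
    """Class 1 if the linear score is positive, else class 0."""
    return [1 if sum(wi * xi for wi, xi in zip(w, row)) + b > 0 else 0 for row in X]


# Toy meta-features: two columns of out-of-fold probabilities, four samples.
D = [[0.9, 0.8], [0.8, 0.7], [0.2, 0.3], [0.1, 0.2]]
gold = [1, 1, 0, 0]
w, b = fit_ridge_classifier(D, gold)
print(predict([[0.7, 0.9], [0.3, 0.1]], w, b))
```

The regularization term keeps the weights small when base models are highly correlated, which is the typical situation when all inputs are predictions for the same label.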

Experiments
Implementation Detail: All SI and TC models were implemented with PyTorch (Paszke et al., 2019) and Hugging Face's Transformers library (Wolf et al., 2019). Layer attentions were applied to the last eight layers of all PLMs, with dropout. CRF classifiers were implemented using pytorch-crf. At training, we split the network into two parameter groups, one for the PLM parameters and one for all other (non-PLM) parameters, applying discriminative fine-tuning (Kondratyuk and Straka, 2019). We froze the PLM parameters for the first few epochs to improve training stability. The Adam optimizer (Kingma and Ba, 2015) was used. We applied a linear learning rate warm-up over 10% and 5% of all epochs for SI and TC, respectively.
Submitted System: We employed only the official training dataset to train our model. For the SI submission, we generated 24 hyperparameter sets for each PLM, and the top 3 sets for each PLM, chosen on the basis of the validation score, were used for the stacked generalization. For the TC submission, the top 11 sets among 60 hyperparameter sets were used.
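The interaction between the two parameter groups, the initial freezing of the PLM, and the linear warm-up can be sketched as a small schedule function. Several details are assumptions: the learning rate is held constant after warm-up, freezing is modelled as a learning rate of zero, and `frozen_epochs` is a hypothetical value. The base learning rates in the example are the BERT values for SI from Table 2.

```python
def warmup_lr(step: int, total_steps: int, base_lr: float, warmup_fraction: float) -> float:
    """Linear warm-up to base_lr, then constant (the constant part is an assumption)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


def group_learning_rates(epoch: int, step: int, total_steps: int,
                         plm_lr: float, non_plm_lr: float,
                         warmup_fraction: float, frozen_epochs: int) -> dict:
    """Learning rates of the two parameter groups at a given training step."""
    if epoch < frozen_epochs:
        plm = 0.0  # PLM parameters are frozen during the first epochs
    else:
        plm = warmup_lr(step, total_steps, plm_lr, warmup_fraction)
    non_plm = warmup_lr(step, total_steps, non_plm_lr, warmup_fraction)
    return {"plm": plm, "non_plm": non_plm}


# SI setting: 10% warm-up, BERT learning rates taken from Table 2.
print(group_learning_rates(epoch=0, step=10, total_steps=1000,
                           plm_lr=3.7e-6, non_plm_lr=3.5e-4,
                           warmup_fraction=0.10, frozen_epochs=1))
print(group_learning_rates(epoch=3, step=600, total_steps=1000,
                           plm_lr=3.7e-6, non_plm_lr=3.5e-4,
                           warmup_fraction=0.10, frozen_epochs=1))
```

In PyTorch, the same effect is usually obtained by passing two parameter groups with their own learning rates to the optimizer and attaching a scheduler.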
Fixed hyperparameter values are shown in Table 1. The tunable hyperparameter set included learning rates and the dropout ratio for the FFNs and BiLSTMs. Optuna (Akiba et al., 2019) was used to generate hyperparameter sets.
The hyperparameter search results for the optimal learning rates are shown in Table 2. Generally, the learning rates of non-PLM parameters are larger than those of PLM parameters. Interestingly, most of the learning rates for TC were lower than those for SI. This suggests that more complicated models, such as PLMs with BiLSTMs, require a larger learning rate to produce a better model.

Table 2: Optimal learning rates found by the hyperparameter search (SI / TC).
PLM          | PLM parameters  | Non-PLM parameters
BERT         | 3.7e-6 / 1.0e-6 | 3.5e-4 / 1.6e-4
GPT-2        | 1.8e-5 / 1.2e-5 | 5.4e-4 / 5.0e-5
RoBERTa      | 2.9e-6 / 1.9e-6 | 2.8e-4 / 1.0e-4
XLM-RoBERTa  | 4.4e-6 / 1.7e-6 | 7.3e-4 / 7.0e-5
XLNet        | 4.1e-6 / 2.7e-6 | 7.3e-4 / 1.5e-4
XLM          | 3.0e-6 / 6.9e-6 | 8.9e-5 / 1.5e-4

Metrics: Overlap-based F1 scores for SI and micro-averaged F1 scores for TC were employed in the shared task. Refer to Da San Martino et al. (2020) for more details.
Table 3 shows the official test results of the top-performing teams for SI and TC. Our proposed SI model outperformed all the other teams with an improvement of more than 2 points. Our team was ranked third for TC; however, the performance of the top three teams was almost the same.
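The two observations about Table 2 can be checked directly from its values. The short sketch below assumes that the first row of the table belongs to BERT, that the first column holds PLM learning rates and the second non-PLM learning rates, and that each cell lists the SI value followed by the TC value; the dictionary layout and function names are illustrative.

```python
# (SI, TC) learning rates per PLM, copied from Table 2.
TABLE_2 = {
    "BERT": {"plm": (3.7e-6, 1.0e-6), "non_plm": (3.5e-4, 1.6e-4)},
    "GPT-2": {"plm": (1.8e-5, 1.2e-5), "non_plm": (5.4e-4, 5.0e-5)},
    "RoBERTa": {"plm": (2.9e-6, 1.9e-6), "non_plm": (2.8e-4, 1.0e-4)},
    "XLM-RoBERTa": {"plm": (4.4e-6, 1.7e-6), "non_plm": (7.3e-4, 7.0e-5)},
    "XLNet": {"plm": (4.1e-6, 2.7e-6), "non_plm": (7.3e-4, 1.5e-4)},
    "XLM": {"plm": (3.0e-6, 6.9e-6), "non_plm": (8.9e-5, 1.5e-4)},
}


def count_tc_lower(table: dict) -> int:
    """Number of (PLM, parameter group) cells where the TC rate is below the SI rate."""
    return sum(1 for row in table.values() for si, tc in row.values() if tc < si)


def non_plm_always_larger(table: dict) -> bool:
    """True if the non-PLM rate exceeds the PLM rate for every PLM and both subtasks."""
    return all(row["non_plm"][task] > row["plm"][task]
               for row in table.values() for task in (0, 1))


print(count_tc_lower(TABLE_2), "of", 2 * len(TABLE_2))
print(non_plm_always_larger(TABLE_2))
```

Under these assumptions, XLM is the only PLM whose TC learning rates exceed its SI learning rates.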
On the Role of PLMs: To show the role of PLMs, we report ablation results for each PLM on the development data in Table 4. The table shows that RoBERTa and XLNet were generally the best performing models. Given that RoBERTa is a carefully tuned model based on BERT, this result is reasonable. Interestingly, the table also shows that using all PLMs in an ensemble was better in most cases. For example, while the GPT-2-based model itself is not among the better models for either SI or TC, excluding it results in worse performance. This result suggests that stacked generalization was effectively applied.
PLM Layer Weight: Overviews of the fine-tuned PLM states are shown in Figure 3, which visualizes the weights of the PLM layers (Clark et al., 2019). The figure illustrates that the last several layers generally received larger weights. Interestingly, GPT-2 and XLNet show different distributions, with weights ranging widely across layers.
On the Importance of Learning Rate Tuning: Through hyperparameter tuning, we found that tuning the PLM learning rate is essential to elicit better results. Figures 4 and 5 visualize the learning rate space for the two parameter groups, that is, PLM parameters and non-PLM parameters. We found that the SI models required tuning for both groups, while the TC models required tuning of the PLM parameters rather than the non-PLM parameters. We attribute this to the complexity of the SI models. SI models stack a BiLSTM-CRF on PLMs, and BiLSTMs are more complicated than the FFNs used in TC. Therefore, SI training requires a higher learning rate for both groups. TC models stack only FFNs on PLMs and therefore require a lower learning rate for non-PLM parameters.

Conclusion
We detected propaganda by leveraging heterogeneous pre-trained language models. The results suggested that employing heterogeneous pre-trained language models could result in better performance. Future work includes examining more effective methods for utilizing heterogeneous pre-trained language models.