ApplicaAI at SemEval-2020 Task 11: On RoBERTa-CRF, Span CLS and Whether Self-Training Helps Them

This paper presents the winning system for the propaganda Technique Classification (TC) task and the second-placed system for the propaganda Span Identification (SI) task. The purpose of the TC task was to identify the applied propaganda technique given a propaganda text fragment. The goal of the SI task was to find specific text fragments which contain at least one propaganda technique. Both of the developed solutions used the semi-supervised learning technique of self-training. Interestingly, although CRF is barely used with Transformer-based language models, the SI task was approached with the RoBERTa-CRF architecture. An ensemble of RoBERTa-based models was proposed for the TC task, with one of them making use of the Span CLS layers we introduce in the present paper. In addition to describing the submitted systems, the impact of architectural decisions and training schemes is investigated, along with remarks regarding training models of the same or better quality with a lower computational budget. Finally, the results of error analysis are presented.


Introduction
The idea of fine-grained propaganda detection was introduced by Da San Martino et al. (2019), whose intention was to facilitate research on this topic by publishing a corpus with detailed annotations of high reliability. As part of this year's SemEval series, participants had the opportunity to propose NLP systems solving this task automatically. Systems were expected to detect all fragments of news articles that contain propaganda techniques, and to identify the exact type of technique used (Da San Martino et al., 2020).
The authors decided to evaluate Technique Classification (TC) and Span Identification (SI) tasks separately. The purpose of the TC task was to identify an applied propaganda technique given the propaganda text fragment. In contrast, the goal of the SI task was to find specific text fragments that contain at least one propaganda technique. This paper presents the winning system for the propaganda Technique Classification task and the second-placed system for the propaganda Span Identification task.

Systems Description
Systems proposed for both SI and TC tasks were based on RoBERTa model (Liu et al., 2019) with task-specific modifications and training schemes applied.
The central motif behind our submissions is the commonly used semi-supervised learning technique of self-training (Yarowsky, 1995; Liao and Veeramachaneni, 2009; Liu et al., 2011; Wang et al., 2020), sometimes referred to as incremental semi-supervised training (Rosenberg et al., 2005) or self-learning (Lin et al., 2010). In general, these terms stand for a process of first training an initial model on a manually annotated dataset and then using it to extend the train set by automatically annotating another dataset. Usually, only a selected subset of the auto-annotated data is used; in our case, however, neither selection of high-confidence examples nor loss correction for noisy annotations is performed. This is why it can be considered a simplification of mainstream approaches: the naïve self-training.
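The naïve variant can be sketched as follows. The `train` function below is a toy stand-in (a majority-label "model") introduced purely for illustration; in our setting the model is RoBERTa-based and the unlabeled pool comes from OpenWebText.

```python
def train(examples):
    """Toy stand-in for fitting a model: predict the majority label seen."""
    labels = [label for _, label in examples]
    majority = max(set(labels), key=labels.count)
    return lambda text: majority

def self_train(gold, unlabeled, iterations=3):
    model = train(gold)                           # fit on manually labeled data
    for _ in range(iterations):
        # auto-annotate the unlabeled pool; note there is no high-confidence
        # filtering and no loss correction (hence "naive" self-training)
        silver = [(text, model(text)) for text in unlabeled]
        model = train(gold + silver)              # retrain on gold + silver data
    return model

model = self_train([("a", 1), ("b", 1), ("c", 0)], ["d", "e"])
```

The key simplification is in the `silver` line: every automatic annotation is kept, regardless of model confidence.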

Span Identification
The problem of span identification was treated as a sequence labeling task, which in the case of Transformer-based language models is often solved by means of classifying selected sub-tokens (e.g., the first BPE of each word considered), with or without applying an LSTM before the classification layer (Devlin et al., 2019). Although pre-Transformer sequence labeling solutions exploited a CRF layer in the output (Huang et al., 2015; Lample et al., 2016), this practice was abandoned by the authors of BERT (Devlin et al., 2019) and subsequent researchers developing the idea of bidirectional Transformers, with rare exceptions such as Souza et al. (2019), who used BERT-CRF for Portuguese NER. Contrary to the above, we approached the Span Identification task with a RoBERTa-CRF architecture.
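To make the role of the CRF layer concrete, the sketch below shows the Viterbi decoding it performs at inference time: instead of classifying each token independently, the globally best tag sequence is selected given per-token emission scores and learned tag-transition scores. This is a pure-Python illustration with hand-picked scores, not the actual RoBERTa-CRF implementation, which computes emissions with the neural encoder.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence for a linear-chain CRF.

    emissions: per-token tag scores, shape (seq_len, num_tags)
    transitions: transitions[i][j] = score of moving from tag i to tag j
    """
    num_tags = len(emissions[0])
    scores = list(emissions[0])          # best score ending in each tag so far
    backptr = []
    for emit in emissions[1:]:
        new_scores, ptrs = [], []
        for j in range(num_tags):
            best_i = max(range(num_tags),
                         key=lambda i: scores[i] + transitions[i][j])
            ptrs.append(best_i)
            new_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
        scores = new_scores
        backptr.append(ptrs)
    best = max(range(num_tags), key=lambda j: scores[j])
    path = [best]
    for ptrs in reversed(backptr):       # follow back-pointers to the start
        best = ptrs[best]
        path.append(best)
    return list(reversed(path))

# tags: 0 = outside, 1 = inside a propaganda span; the transition scores
# reward staying inside a span once it has been entered
path = viterbi_decode([[2, 0], [0, 1], [0, 1]], [[0, -1], [-1, 1]])
# path == [0, 1, 1]
```

Transition scores that reward staying inside a span let weak per-token evidence accumulate into contiguous predictions, which is exactly what independent per-token classification cannot do.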
The impact of this decision will be discussed in an orderly fashion in Section 3, along with remarks regarding training models of the same or better quality with a lower computational budget. In contrast, the following narrative aims to faithfully reflect how the model we actually used was trained.
Recipe Take one pretrained RoBERTa LARGE model, add a CRF layer, and train on the original (gold) dataset with Viterbi loss, the SGD optimizer, and hyperparameters defined in Table 1 until no further progress is achieved. Use the best-performing model to automatically annotate 500k random OpenWebText sentences. Train a second model on both the original (gold) dataset and the auto-tagged (silver) one with hyperparameters defined in Table 1. Repeat the procedure two more times with the best model from the previous step, hyperparameters from Table 2, and other OpenWebText sentences.
Note that hyperparameters were indeed not changed for the first self-training iteration. Scores achieved by the best-performing models were 50.91 without self-training, and 50.98, 51.45, and 52.24 in consecutive self-training iterations.
Many questions may arise regarding this procedure and the role of purely random factors. This is not a problem when the best score, rather than its explanation, is desired. In a leaderboard-driven exploration, one can simply conduct a broad set of experiments and choose the best-performing model without reflecting on whether it is a byproduct of training instability. What actually happened here was investigated afterward and will be discussed in Section 3.

Figure 2: Comparison of span classification by means of special tokens (left) and in the Span CLS approach (right). On the left, special [BOP] and [EOP] tokens are introduced, and the span is further classified as in the usual Transformer-based sentence classification task. On the right, an additional, small Transformer is stacked only over the selected tokens. It has no embeddings of its own apart from one for the [BOS] token, but uses representations provided by the host model instead.

Technique Classification
Transformer-based language models used in the sentence classification setting assume that representations of special tokens (such as [CLS] or [BOS]) are passed to the classification layer. Since the TC task is aimed at the classification of spans, it might be beneficial to introduce information about the text fragment to be classified. We experimented with two approaches addressing this requirement.
The first assumes an injection of special tokens indicating the beginning and the end of the text marked as propaganda, so that a sample sentence before BPE is applied appears as:

[BOS] Democrats acted like [BOP] babies [EOP] at the SOTU [EOS]
In this approach we continue with the representation of [BOS], as in the usual sentence classification task. The second approach is to stack a small Transformer only on the selected tokens. This one has no embeddings of its own apart from the one for [BOS], but uses the host model's representations instead. This technique is roughly equivalent to adding consecutive layers and masking attention outside the selected span, and will be referred to as Span CLS. Figure 2 summarizes the differences between Span CLS and classification using special [BOP] and [EOP] tokens.
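Both approaches reduce to simple input transformations, sketched below. The first wraps the span in marker tokens before BPE; the second selects which host-model representations the stacked Transformer sees (the function names are ours, for illustration only).

```python
def mark_span(tokens, start, end, bop="[BOP]", eop="[EOP]"):
    """First approach: wrap the propaganda span tokens[start:end] in markers."""
    return tokens[:start] + [bop] + tokens[start:end] + [eop] + tokens[end:]

def span_cls_inputs(host_states, start, end, bos_state):
    """Span CLS: the stacked Transformer sees only the [BOS] state followed
    by the host model's representations of the span tokens."""
    return [bos_state] + host_states[start:end]

marked = mark_span("Democrats acted like babies at the SOTU".split(), 3, 4)
# ['Democrats', 'acted', 'like', '[BOP]', 'babies', '[EOP]', 'at', 'the', 'SOTU']
```

Restricting the stacked Transformer's inputs in `span_cls_inputs` is what makes Span CLS equivalent to attention masking outside the selected span.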
The initial experiments showed that underrepresented classes achieve lower scores. To overcome this problem, we experimented with class-dependent rescaling applied to binary cross-entropy. In this setting (further referred to as re-weighting), the factor for each class was determined as its inverse frequency multiplied by the frequency of the most popular class. The modified loss is equal to:

\mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{d} p_k \left[ y_{n,k} \log x_{n,k} + (1 - y_{n,k}) \log (1 - x_{n,k}) \right], \qquad p_k = \frac{\max_{j} f_j}{f_k}

where N is the batch size, the index n denotes the nth batch element, d is the number of classes, f stands for a vector of absolute class frequencies calculated on the train set, x is the output vector from the last sigmoid layer, and y is a vector of multi-hot encoded ground truth labels. Note that the only difference from the original binary cross-entropy for multi-label classification is the addition of the p_k class weights.
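A plain-Python sketch of the re-weighted loss (in the actual system this would be computed with tensor operations inside the training loop):

```python
import math

def class_weights(freqs):
    """p_k: inverse class frequency times the frequency of the top class."""
    top = max(freqs)
    return [top / f for f in freqs]

def reweighted_bce(batch_probs, batch_labels, freqs):
    """Multi-label binary cross-entropy with per-class p_k weights."""
    p = class_weights(freqs)
    total = 0.0
    for probs, labels in zip(batch_probs, batch_labels):
        for k, (x, y) in enumerate(zip(probs, labels)):
            total -= p[k] * (y * math.log(x) + (1 - y) * math.log(1 - x))
    return total / len(batch_probs)

class_weights([100, 20, 50])   # → [1.0, 5.0, 2.0]
```

With uniform class frequencies all weights equal 1 and the loss reduces to the standard multi-label binary cross-entropy.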
In addition to the above, some of the tested models made use of the self-training approach. In the case of the TC task, one had to identify spans first and then predict their classes to generate a silver train set (Figure 1). We reused our best-performing model from the SI task to identify spans, and the TC model trained on ground truth to automatically annotate these spans.

Table 3: Best scores on the dev set achieved with the RoBERTa large model on the SI task. Mean, standard deviation and maximum across 10 runs with different random seeds. Numbers in brackets indicate how many self-training iterations were used.
Regardless of the approach taken, context as broad as possible within the 256-subword-unit limit was provided on both sides of the span to be classified. Note that this was a maximal equal extension of the span text in both directions, and we did not limit the extension to the sentence boundaries. The winning TC model (described in the recipe below) was an ensemble of three models. Each of them used a different mix of the previously described approaches, with hyperparameters defined in Table 1 for the first and second model, and those from Table 2 in the case of the third model.
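The context extension can be sketched as follows. This is a simplified version for illustration: the leftover budget is not redistributed to the other side when the span sits near a document edge.

```python
def extend_context(tokens, start, end, max_len=256):
    """Extend the span tokens[start:end] equally in both directions, up to
    max_len subword units in total, ignoring sentence boundaries."""
    half = (max_len - (end - start)) // 2
    left = max(0, start - half)
    right = min(len(tokens), end + half)
    # return the window plus the span's offsets within it
    return tokens[left:right], start - left, end - left

window, s, e = extend_context([str(i) for i in range(100)], 50, 54, max_len=10)
# window covers tokens 47..56; the span sits at positions 3..7 inside it
```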
Recipe Add a classification layer (described in Figure 2 on the left) to the pretrained RoBERTa LARGE model to obtain the first model, and train until no score gain is observed on the development set. Train the second model in the same manner, but this time using re-weighting. Combine the re-weighting, Span CLS and self-training approaches to get the third model, and again train until no score improvement on the development set is observed. Finally, ensemble all three models by averaging class probabilities from their final layers.
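The final ensembling step is plain probability averaging; a minimal sketch:

```python
def ensemble_predict(per_model_probs):
    """Average class probabilities from several models and pick the argmax."""
    n_models, n_classes = len(per_model_probs), len(per_model_probs[0])
    avg = [sum(probs[k] for probs in per_model_probs) / n_models
           for k in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg

label, avg = ensemble_predict([[0.6, 0.3, 0.1],
                               [0.2, 0.6, 0.2],
                               [0.3, 0.5, 0.2]])
# label == 1: the second class wins after averaging
```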
As shown later, the approach we took and reported above turned out to be sub-optimal. An in-depth analysis of this system and a better one is proposed in Section 3.2.

Span Identification
Models with different random seeds were trained for 60K steps, with an evaluation performed every 2K steps. This is equivalent to approximately 30 epochs with per-epoch validation in a scenario without data generated during the self-training procedure. Table 3 summarizes the best scores achieved across 10 runs for each configuration. CRF has a noticeable positive impact on the FLC-F1 (Da San Martino et al., 2020) scores achieved without self-training in the setting we consider. The presence of the CRF layer is positively correlated with the score (ρ = 0.27, p < 0.001), and the difference is significant (p < 0.001) according to the Kruskal-Wallis test (Kruskal and Wallis, 1952). Unless said otherwise, all further statistical statements within this section were confirmed with a statistically significant positive Spearman rank correlation and Kruskal-Wallis test results. Differences in variance were confirmed using Bartlett's test (Snedecor and Cochran, 1989). The 0.05 significance level was assumed.
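For reference, the correlation statistic used throughout this section can be computed as below: Spearman's ρ is the Pearson correlation of the rank vectors, with tied values receiving average ranks (a pure-Python sketch; in practice a library routine such as scipy.stats.spearmanr would be used).

```python
def _ranks(values):
    """1-based ranks, with tied values given the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1       # average of 1-based positions i+1..j+1
        for idx in order[i:j + 1]:
            ranks[idx] = avg_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Tie handling matters here because the "CRF present" variable is binary, so every run shares its rank with many others.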
The statistically significant influence of CRF disappears when self-training is investigated. In the case of the first self-training iteration, a considerable increase in the median score can be observed regardless of whether or not CRF was used. Self-trained models with and without the CRF layer, however, are indistinguishable.
The improvement offered by further self-training iterations is not as evident but is statistically significant. In particular, they slightly improve mean scores and decrease variance (see Table 3). As for the latter, CRF-extended models generally have higher variance in the scores achieved across the runs. Table 4 analyzes the importance of using different hyperparameters. Whereas the use of a smaller batch size and dropout is beneficial for the initial training without noisy data, it negatively impacts the self-training phase. The most substantial negative impact is observed when dropout is disabled during training on the small amount of manually annotated data. Figure 3 illustrates the scores achieved by models trained for the same number of steps on subsets or supersets of the manually annotated data. The CRF layer has a positive impact regardless of the percentage of the train set available. Once again, a large variance in the scores of CRF-equipped models can be observed; however, it is substantially reduced as the batch size increases. Interestingly, the figures suggest the proportion of automatically annotated data we used might be suboptimal, since it was equivalent to around 3000% in the chart's convention. One may hypothesize that better scores would be achieved by models trained with a 1:4 gold-to-silver proportion.

Technique Classification
A 6-fold cross-validation was conducted; the results are presented in Table 5. Folds were created by mixing the training and development datasets, then shuffling them and splitting into even folds. Parameters were set according to Table 1 and Table 2, and the experiments were carried out as follows. Each approach from Table 5 was separately evaluated on each fold using the micro-averaged F1 metric. Then, for each approach, the average score and the standard deviation were obtained using the six per-fold scores.
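The fold construction can be sketched as follows; the seed and the interleaved split are illustrative choices, not the exact procedure used.

```python
import random

def make_folds(examples, k=6, seed=0):
    """Shuffle the combined train+dev pool and split it into k near-even folds."""
    pool = list(examples)
    random.Random(seed).shuffle(pool)
    return [pool[i::k] for i in range(k)]

def cross_validate(examples, evaluate, k=6):
    """Hold out each fold in turn; evaluate(train, test) returns a fold score."""
    folds = make_folds(examples, k)
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(evaluate(train, held_out))
    return scores
```

The mean and standard deviation reported in Table 5 are then computed over the k returned scores.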
Moreover, all 247 possible ensembles were evaluated in the same fashion as in the experiments from Table 5. Table 6 shows the performance achieved by selected combinations when simple averaging of the probabilities returned by individual models was used as the final prediction.
Due to the large number of available results, it is beneficial to conduct a statistical analysis to formulate remarks regarding the general observed trends. Each component model of the ensemble was treated as a categorical variable with respect to the ensemble score. The Spearman rank correlation between the presence of an ensemble component (approaches from Table 5) and the achieved scores shows that adding a model to the ensemble correlates with a significant increase in score, except for model (6) (see Table 7). Boxplots from Figure 4 lead to the same conclusions. Re-weighting seems to be beneficial only when ensembled with other models. An interesting finding is that Span CLS offers a small but consistent increase in performance, both in the models from Table 5 and when used in ensembles. Bear in mind that we outperformed the second-placed team by ε, so an improvement of half a point or a point is not negligible.
What is most conspicuous, however, is that the self-training-based solutions from Table 5 seem to be detrimental in the case of the TC task. This damaging effect can potentially be attributed to the fact that automatically generated data accumulate errors from both Span Identification and Technique Classification. Another possible explanation is that much fewer data points are available for the span classification task than for span identification attempted as a sequence labeling task. The latter would be somewhat consistent with findings in the field of Neural Machine Translation, where the use of the back-translation technique in a low-resource setting was determined to be harmful (Edunov et al., 2018). On the other hand, self-training has a positive, statistically significant impact on the score when used in ensembles (see Figure 4 and Table 7). This is not surprising, as the beneficial impact of combining individual estimates has been observed in many disciplines and has been known since the times of Laplace (Clemen, 1989).

Error analysis
In addition to providing an overview of problematic classes, the question of which shallow features influence the score and worsen the results was addressed. This problem was analyzed in a no-box manner, as proposed by Graliński et al. (2019). The main idea is to create two dataset subsets for each feature considered (one for data points with the feature present and one for data points without it), rank the subsets by per-item scores, and use the Mann-Whitney U test (Mann and Whitney, 1947) to determine whether there is a non-accidental difference between the subsets. A low p-value indicates that the feature reduces the evaluation score of the model.
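The test itself is straightforward to sketch in pure Python (scipy.stats.mannwhitneyu is the practical choice; the normal approximation below ignores tie correction):

```python
import math

def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a vs. sample_b, with a two-sided p-value
    from the normal approximation (no tie correction)."""
    n1, n2 = len(sample_a), len(sample_b)
    # U counts how often an item of A outranks an item of B (ties count 0.5)
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0
            for a in sample_a for b in sample_b)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail probability
    return u, p
```

Here the two samples are the per-item scores with and without a given shallow feature; a low p flags the feature as worsening.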

Span Identification
Since the FLC-F1 metric used in the SI task gives non-zero scores for partial matches, it is interesting to analyze the proportion of entirely missed (versus partially identified) spans. Table 8 investigates this question broken down by the propaganda technique used.
Our system was unable to identify one-third of the expected spans, and a majority of those correctly identified were partial matches. The spans easiest to identify in the text represented the Flag-Waving, Appeal to fear/prejudice, and Slogans techniques. In contrast, Bandwagon, Doubt, and the group of {Whataboutism, Strawman, Red Herring} turned out to be the hardest. The highest proportion of fully identified spans was achieved for Flag-Waving, Repetition, and Loaded Language. Unfortunately, it is not possible to investigate precision in this manner without training separate models for each label or estimating one-to-one alignments between output and expected spans.
Further investigation of problematic cases in the paradigm of no-box debugging with the GEval tool (Graliński et al., 2019) revealed the most worsening features, that is, features whose presence impacts span identification evaluation metrics negatively (Table 9). It seems that our system tends to return ranges without adjacent punctuation. This is the case in sentences such as The new CIA Director Haspel, who 'tortured some folks,' probably can't travel to the EU, where only the quoted text was returned, whereas the annotation assumes it should be returned with apostrophes and commas. This remark can be used to slightly improve overall results with simple post-processing. The returned and conjunction refers to cases where it connects two propaganda spans; the system frequently returns them as a single span, contrary to what is expected in the gold standard.

Figure 5 presents the normalized confusion matrix of the submitted system's predictions. Interestingly, there are a few commonly confused pairs. Loaded Language and Black-and-white Fallacy were frequently misclassified as Appeal to fear/prejudice. Similarly, Causal Oversimplification was often predicted as Doubt, and Clichés as Loaded Language.

Technique Classification
The most worsening features are presented in Table 10. One of the frequent predictors of low accuracy is a comma character present within the span to be classified. This can probably be attributed to the fact that its presence is a good indicator of the span's linguistic complexity. Another determinant of inefficiency turned out to be negation: around half of the sentences containing the word not were misclassified by the system. The suggested features of a quotation mark before the span and the digram according to after the span are related to reported or indirect speech. The explanation of the worsening effect of other features is not as evident as in the cases mentioned above. Moreover, it seems there is no obvious way of improving the final results with these findings, and a more detailed analysis might be required.

Discussion and Summary
The winning system for the propaganda Technique Classification (TC) task and the second-placed system for the propaganda Span Identification (SI) task have been described. Both of the developed solutions used the semi-supervised learning technique of self-training. Although CRF is barely used with Transformer-based language models, the SI task was approached with the RoBERTa-CRF architecture. An ensemble of RoBERTa-based models has been proposed for the TC task, with one of them making use of the Span CLS layers we introduce in the present paper.
The analysis conducted afterward can be applied in a rather straightforward manner to further improve the scores for both the SI and TC tasks. This is because some of the decisions we made under missing or uncertain information turned out, during the post-hoc inquiry, to be sub-optimal. These include the proportion of data from self-training in the SI task, and the possibility of providing a better ensemble in the case of TC.
The ablation studies conducted, however, have some limitations. The same subset of OpenWebText was used in experiments conducted within one self-training iteration. This means a random seed did not impact which sentences were used during the first, second, and third self-training phases; in each, we manipulated only the data order. Moreover, the analysis we reported was limited to a few hyperparameter combinations, and no extensive hyperparameter space search was performed. Finally, only one rather simple method of cost-sensitive re-weighting was tested, and there is a great chance it was sub-optimal. It would be interesting to investigate other schemes, such as the one proposed by Cui et al. (2019).
The error analysis revealed propaganda techniques commonly confused in the TC task, and the techniques we were unable to detect effectively within the SI input articles. In addition to providing an overview of problematic classes, the question of which shallow features influence the score and worsen the results was addressed. A few of these were identified, and our remarks can be used to slightly improve results on the SI task with simple post-processing. This is not the case for the TC task, where we were unable to propose how to improve the final results with our findings.
An interesting future research direction seems to be the application of the CRF layer and Span CLS to Transformer-based language models when dealing with other tasks outside the propaganda detection problem. These may include Named Entity Recognition in the case of RoBERTa-CRF, and aspect-based sentiment analysis, which can be viewed through the lens of span classification with the Span CLS we proposed.

Outro
The developed systems were used to identify and classify spans in the present paper, in order to detect fragments one may suspect to represent one or more propaganda techniques. Unfortunately for the entertainment value of this work, no such fragments were identified by our SI model.