SemEval-2020 Task 11: Detection of Propaganda Techniques in News Articles

We present the results and the main findings of SemEval-2020 Task 11 on Detection of Propaganda Techniques in News Articles. The task featured two subtasks. Subtask SI is about Span Identification: given a plain-text document, spot the specific text fragments containing propaganda. Subtask TC is about Technique Classification: given a specific text fragment, in the context of a full document, determine the propaganda technique it uses, choosing from an inventory of 14 possible propaganda techniques. The task attracted a large number of participants: 250 teams signed up to participate and 44 made a submission on the test set. In this paper, we present the task, analyze the results, and discuss the system submissions and the methods they used. For both subtasks, the best systems used pre-trained Transformers and ensembles.


Introduction
Propaganda aims at influencing people's mindset with the purpose of advancing a specific agenda. It can hide in news published by both established and non-established outlets, and, in the Internet era, it has the potential of reaching very large audiences (Muller, 2018; Tardáguila et al., 2018; Glowacki et al., 2018). Propaganda is most successful when it goes unnoticed by the reader, and it often takes some training for people to be able to spot it. The task is far more difficult for inexperienced users, and the volume of text produced on a daily basis makes it difficult for experts to cope with it manually. With the recent interest in "fake news", the detection of propaganda or highly biased texts has emerged as an active research area. However, most previous work has performed analysis at the document level only (Rashkin et al., 2017) or has analyzed the general patterns of online propaganda (Garimella et al., 2015; Chatfield et al., 2015).
SemEval-2020 Task 11 offers a different perspective: a fine-grained analysis of the text that complements existing approaches and can, in principle, be combined with them. Propaganda in text (and in other channels) is conveyed through the use of diverse propaganda techniques (Miller, 1939), which range from leveraging the emotions of the audience, such as using loaded language or appeals to fear, to using logical fallacies, such as straw men (misrepresenting someone's opinion), hidden ad-hominem fallacies, and red herring (presenting irrelevant data). Some of these techniques have been studied in tasks such as hate speech detection (Gao et al., 2017) and computational argumentation (Habernal et al., 2018). Figure 1 shows the fine-grained propaganda identification pipeline, including the two targeted subtasks. Our goal is to facilitate the development of models capable of spotting text fragments where propaganda techniques are used. The task featured the following subtasks:
Subtask SI (Span Identification): Given a plain-text document, identify those specific fragments that contain at least one propaganda technique. (This is a binary sequence tagging task.)
Subtask TC (Technique Classification): Given a propagandistic text snippet and its document context, identify the propaganda technique used in that snippet. (This is a multi-class classification problem.)

Figure 1: The full propaganda identification pipeline, including the two subtasks: Span Identification and Technique Classification.
A total of 250 teams registered for the task, 44 of them made an official submission on the test set (66 submissions for both subtasks), and 32 of the participating teams submitted a system description paper.
The rest of the paper is organized as follows. Section 2 introduces the propaganda techniques we considered in this shared task. Section 3 describes the organization of the task, the corpus and the evaluation measures. An overview of the participating systems is given in Section 4, while Section 5 discusses the evaluation results. Related work is described in Section 6. Finally, Section 7 draws some conclusions, and discusses some directions for future work.

Propaganda and its Techniques
Propaganda comes in many forms, but it can be recognized by its persuasive function, sizable target audience, the representation of a specific group's agenda, and the use of faulty reasoning and/or emotional appeals (Miller, 1939). The term propaganda was coined in the 17th century, and initially referred to the propagation of the Catholic faith in the New World (Jowett and O'Donnell, 2012a, p. 2). It soon took a pejorative connotation, as its meaning was extended to also mean opposition to Protestantism. In more recent times, the Institute for Propaganda Analysis (Ins, 1938) proposed the following definition: Propaganda. Expression of opinion or action by individuals or groups deliberately designed to influence opinions or actions of other individuals or groups with reference to predetermined ends. Recently, Bolsover and Howard (2017) dug deeper into this definition identifying its two key elements: (i) trying to influence opinion, and (ii) doing so on purpose.
Propaganda is a broad concept, too broad to serve directly as a basis for annotating specific propaganda fragments. Yet, influencing opinions is achieved through a series of rhetorical and psychological techniques, and in the present task, we focus on identifying the use of such techniques in text. Whereas the definition of propaganda is widely accepted in the literature, the set of propaganda techniques considered, and to some extent their definition, differ between scholars (Torok, 2015). For instance, Miller (1939) considers seven propaganda techniques, whereas Weston (2000) lists at least 24 techniques, and the Wikipedia article on the topic includes 67. Below, we describe the propaganda techniques we consider in the task: a curated list of fourteen techniques derived from the aforementioned studies. We only include techniques that can be found in journalistic articles and can be judged intrinsically, without the need to retrieve supporting information from external resources. For example, we do not include techniques such as card stacking (Jowett and O'Donnell, 2012b, p. 237), since it would require comparing multiple sources. Note that our list of techniques was initially longer than fourteen, but we decided, after the annotation phase, to merge similar techniques with very low frequency in the corpus. A more detailed list with definitions and examples is available online and in Appendix C, and examples are shown in Table 1.
1. Loaded language. Using specific words and phrases with strong emotional implications (either positive or negative) to influence an audience (Weston, 2000, p. 6).
2. Name calling or labeling. Labeling the object of the propaganda campaign as either something the target audience fears, hates, or finds undesirable, or something it loves or praises (Miller, 1939).
3. Repetition. Repeating the same message over and over again, so that the audience will eventually accept it (Torok, 2015; Miller, 1939).
4. Exaggeration or minimization. Either representing something in an excessive manner (making things larger, better, or worse) or making something seem less important or smaller than it actually is (Jowett and O'Donnell, 2012b, p. 303).
5. Doubt. Questioning the credibility of someone or something.
6. Appeal to fear/prejudice. Seeking to build support for an idea by instilling anxiety and/or panic in the population towards an alternative, possibly based on preconceived judgments.
7. Flag-waving. Playing on strong national feeling (or with respect to any group, e.g., race, gender, political preference) to justify or promote an action or idea (Hobbs and Mcgee, 2008).
8. Causal oversimplification. Assuming a single cause or reason when there are multiple causes behind an issue. We include in the definition also scapegoating, e.g., transferring the blame to one person or group of people without investigating the complexities of an issue.
9. Slogans. A brief and striking phrase that may include labeling and stereotyping. Slogans tend to act as emotional appeals (Dan, 2015).
10. Appeal to authority. Stating that a claim is true simply because a valid authority or expert on the issue supports it, without any other supporting evidence (Goodwin, 2011). We include in this technique the special case in which the reference is not an authority or an expert, although it is referred to as testimonial in the literature (Jowett and O'Donnell, 2012b, p. 237).
11. Black-and-white fallacy, dictatorship. Presenting two alternative options as the only possibilities, when in fact more possibilities exist (Torok, 2015). Dictatorship is an extreme case: telling the audience exactly what actions to take, eliminating any other possible choice.
12. Thought-terminating cliché. Words or phrases that discourage critical thought and meaningful discussion on a topic. They are typically short, generic sentences that offer seemingly simple answers to complex questions or that distract attention away from other lines of thought (Hunter, 2015, p. 78).
13. Whataboutism, straw man, red herring. Here we merge together three techniques, which are relatively rare taken individually: (i) Whataboutism: Discrediting an opponent's position by charging them with hypocrisy without directly disproving their argument (Richter, 2017). (ii) Straw man: Substituting an opponent's proposition with a similar one, which is then refuted instead of the original (Walton, 2013). Weston (2000, p. 78) specifies the characteristics of the substituted proposition: "caricaturing an opposing view so that it is easy to refute". (iii) Red herring: Introducing irrelevant material to the issue being discussed, so that everyone's attention is diverted away from the points made (Weston, 2000, p. 78).
14. Bandwagon, reductio ad hitlerum. Here we merge together two techniques, which are relatively rare taken individually: (i) Bandwagon: Attempting to persuade the target audience to join in and take the course of action because "everyone else is taking the same action" (Hobbs and Mcgee, 2008). (ii) Reductio ad hitlerum: Persuading an audience to disapprove of an action or idea by suggesting that it is popular with groups held in contempt by the target audience. It can refer to any person or concept with a negative connotation (Teninbaum, 2009).
We provided the definitions, together with some examples and an annotation schema, to professional annotators, and we asked them to manually annotate selected news articles. The annotators worked with an earlier version of the annotation schema, which contained eighteen techniques. As some of these techniques were quite rare, which could cause data sparseness issues for the participating systems, for the purpose of the present SemEval-2020 Task 11, we decided to reduce the inventory by four techniques, by merging or dropping the rarest ones. In particular, we merged Red herring and Straw man with Whataboutism (under technique 13), since all three techniques try to divert the attention to an irrelevant topic and away from the actual argument. We further merged Bandwagon with Reductio ad hitlerum (under technique 14), since they both try to approve/disapprove an action or idea by pointing to what is popular/unpopular. Finally, we dropped one rare technique, which we could not easily merge with other techniques: Obfuscation, Intentional vagueness, Confusion. As a result, we reduced the eighteen original propaganda techniques to fourteen.

Evaluation Framework
The SemEval-2020 Task 11 evaluation framework consists of the PTC-SemEval20 corpus and the evaluation measures for both the span identification and the technique classification subtasks. This section first describes the dataset and the evaluation measures; the organization of the task is described in Section 3.3.

The PTC-SemEval20 Corpus
In order to build the PTC-SemEval20 corpus, we retrieved a sample of news articles from the period starting in mid-2017 and ending in early 2019. We selected 13 propaganda and 36 non-propaganda news media outlets, as labeled by Media Bias/Fact Check, and we retrieved articles from these sources. We deduplicated the articles on the basis of word n-gram matching (Barrón-Cedeño and Rosso, 2009), and we discarded faulty entries, e.g., empty entries from blocking websites.
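To make the deduplication step more concrete, the following is a minimal sketch of near-duplicate removal based on word n-gram overlap. It is not the procedure actually used to build the corpus: the function names, the trigram size, the Jaccard similarity, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from typing import List, Set, Tuple


def word_ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of lowercased word n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Jaccard similarity of two n-gram sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0  # texts too short to compare are never treated as duplicates
    return len(a & b) / len(union)


def deduplicate(articles: List[str], n: int = 3, threshold: float = 0.8) -> List[str]:
    """Keep the first copy of each article; drop empty and near-duplicate ones.

    The threshold is a hypothetical value, not one reported for the corpus.
    """
    kept: List[str] = []
    kept_ngrams: List[Set[Tuple[str, ...]]] = []
    for article in articles:
        if not article.strip():
            continue  # faulty entry, e.g., an empty page
        grams = word_ngrams(article, n)
        if any(jaccard(grams, seen) >= threshold for seen in kept_ngrams):
            continue  # near-duplicate of an article we already kept
        kept.append(article)
        kept_ngrams.append(grams)
    return kept
```

The pairwise comparison is quadratic in the number of articles, which is acceptable for a few hundred documents; a larger collection would likely need hashing or indexing of the n-grams.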
The annotation job consisted of both spotting a propaganda snippet and, at the same time, labeling it with a specific propaganda technique. The annotation guidelines are shown in Appendix C; they are also available online. We ran the annotation in two phases: (i) two annotators labeled an article independently, and (ii) the same two annotators met with a consolidator to discuss dubious instances, e.g., snippets spotted by only one annotator, boundary discrepancies, label mismatches, etc. This protocol was designed after a pilot annotation stage, in which a relatively large number of snippets had been spotted by one annotator only.
Figure 2: Example of a plain-text article (left) and its annotation file, whose columns are Article ID, Technique, Start, and End. Input article: "Manchin says Democrats acted like babies at the SOTU. In a glaring sign of just how stupid and petty things have become in Washington these days [...] State of the Union speech not looking as though Trump killed his grandma. [...]"

The annotation team consisted of six professional annotators from A Data Pro, trained to spot and to label the propaganda snippets in free text. The job was carried out on an instance of the Anafora annotation platform (Chen and Styler, 2013), which we tailored for our propaganda annotation task. Figure 2 shows an example of an article and its annotations.
We evaluated the quality of the annotation process in terms of γ agreement (Mathet et al., 2015) between each of the annotators and the final gold labels. The γ agreement on the annotated articles is on average 0.6; see Da San Martino et al. (2019b) for a more detailed discussion of inter-annotator agreement. The training and the development parts of the PTC-SemEval20 corpus are the same as the training and the testing datasets described in previous work. The test part of the PTC-SemEval20 corpus consists of 90 additional articles selected from the same sources as for training and development. For the test articles, we further extended the annotation process by adding one extra consolidation step: we revisited all the articles in that partition and we adjusted the spans and the labels as necessary, after a thorough discussion and convergence among at least three experts who were not involved in the initial annotations.
Table 2 shows some statistics about the corpus we use for the task. It is worth noting that a number of propaganda snippets of different classes overlap. Hence, the number of snippets for the span identification subtask is smaller (e.g., 1,405 for the span identification subtask vs. 1,790 for the technique classification subtask on the test set). The full collection of 536 articles contains 8,981 propaganda text snippets, each belonging to one of the fourteen classes described above. Figure 3 zooms into such snippets and shows the number of instances and the mean length for each class. We can see that, by a large margin, the most common propaganda technique in our news articles is Loaded Language, which is about twice as frequent as the second most frequent technique: Name Calling or Labeling. Whereas these two techniques are among the ones that are expressed in the shortest spans, other propaganda techniques such as Exaggeration, Causal Oversimplification, and Slogans tend to have the longest spans.

Evaluation Measures
Subtask SI. Evaluating subtask SI requires us to match text spans. Our SI evaluation function gives credit to partial matches between gold and predicted spans. Let $d$ be a news article in a set $D$. A gold span $t$ is a sequence of contiguous indices of the characters composing a text fragment $t \subseteq d$. For example, in Figure 4 (top-left) the gold fragment "stupid and petty" is represented by the set of indices $t_1 = [4, 19]$. We denote with $T_d = \{t_1, \ldots, t_n\}$ the set of all gold spans for an article $d$ and with $T = \{T_d\}_d$ the set of all gold annotated spans in $D$. Similarly, we define $S_d = \{s_1, \ldots, s_m\}$ and $S$ to be the set of predicted spans for an article $d$ and a dataset $D$, respectively. We compute precision $P$ and recall $R$ by adapting the formulas in (Potthast et al., 2010):

$$P(S, T) = \frac{1}{|S|} \sum_{d \in D} \sum_{s \in S_d,\, t \in T_d} \frac{|s \cap t|}{|s|} \qquad (1)$$

$$R(S, T) = \frac{1}{|T|} \sum_{d \in D} \sum_{s \in S_d,\, t \in T_d} \frac{|s \cap t|}{|t|} \qquad (2)$$

We define Eq. (1) to be zero when $|S| = 0$ and Eq. (2) to be zero when $|T| = 0$. Notice that the predicted spans may overlap, e.g., spans $s_3$ and $s_4$ in Figure 4. Therefore, in order for Eq. (1) and Eq. (2) to get values lower than or equal to 1, all overlapping annotations, independently of their techniques, are merged first. For example, $s_3$ and $s_4$ are merged into one single annotation, corresponding to $s_4$.

Figure 4: Examples of gold spans $t$ and predicted spans $s$ over the text "how stupid and petty things".

Finally, the evaluation measure for subtask SI is the $F_1$ score, defined as the harmonic mean between $P(S, T)$ and $R(S, T)$:

$$F_1(S, T) = \frac{2 \, P(S, T) \, R(S, T)}{P(S, T) + R(S, T)} \qquad (3)$$

Subtask TC. Given a propaganda snippet in an article, subtask TC asks to identify the technique in it.
Since there are identical spans annotated with different techniques (around 1.8% of the total annotations), formally this is a multi-label multi-class classification problem. However, we decided to consider the problem as a single-label multi-class one, by performing the following adjustments: (i) whenever a span is associated with multiple techniques, the input file will have multiple copies of such fragments and (ii) the evaluation function ensures that the best match between the predictions and the gold labels for identical spans is used for the evaluation. In other words, the evaluation score is not affected by the order in which the predictions for identical spans are submitted. The evaluation measure for subtask TC is micro-average F 1 . Note that as we have converted this into a single-label task, micro-average F 1 is equivalent to Accuracy (as well as to Precision and to Recall).
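As an illustration of the SI measure defined earlier in this section, the sketch below computes partial-match precision, recall, and F1 over character spans. It is not the official scorer. It assumes that precision normalizes each overlap by the length of the predicted span and recall by the length of the gold span, that both gold and predicted spans are merged when they overlap, and that spans are half-open `(start, end)` pairs, so the gold span written as [4, 19] in the text becomes `(4, 20)`. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open character offsets: (start, end)


def merge_overlapping(spans: List[Span]) -> List[Span]:
    """Merge overlapping spans so that each character is covered at most once."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap(a: Span, b: Span) -> int:
    """Number of characters shared by two spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def si_scores(pred: Dict[str, List[Span]],
              gold: Dict[str, List[Span]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over all articles, keyed by article id."""
    pred = {d: merge_overlapping(s) for d, s in pred.items()}
    gold = {d: merge_overlapping(t) for d, t in gold.items()}
    n_pred = sum(len(s) for s in pred.values())
    n_gold = sum(len(t) for t in gold.values())
    p_sum = r_sum = 0.0
    for d in set(pred) & set(gold):
        for s in pred[d]:
            for t in gold[d]:
                shared = overlap(s, t)
                p_sum += shared / (s[1] - s[0])  # credit relative to the prediction
                r_sum += shared / (t[1] - t[0])  # credit relative to the gold span
    precision = p_sum / n_pred if n_pred else 0.0
    recall = r_sum / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For instance, predicting only the first half of the gold span `(4, 20)` gives full precision but a recall of 0.5, which is the partial credit the measure is designed to award.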

Task Organization
We ran the shared task in two phases: Phase 1. Only training and development data were made available, and no gold labels were provided for the latter. The participants competed against each other to achieve the best performance on the development set. A live leaderboard was made available to keep track of all submissions.
Phase 2. The test set was released and the participants were given just a few days to submit their final predictions. The release of the test set was done task-by-task, since giving access to the input files for the TC subtask would have disclosed the gold spans for the SI subtask.
In phase 1, the participants could make an unlimited number of submissions on the development set, and they could see the outcomes in their private space. The best team score, regardless of the submission time, was also shown in a public leaderboard. As a result, not only could the participants observe the impact of various modifications in their own systems, but they could also compare against the results by other participating teams. In phase 2, the participants could again submit multiple runs, but they did not get any feedback on their performance. Only the last submission of each team was considered official and was used for the final team ranking.
In phase 1, a total of 47 teams made submissions on the development set for the SI subtask, and 46 teams submitted for the TC subtask. In phase 2, the number of teams who made official submissions on the test set for subtasks SI and TC was 35 and 31, respectively: this is a total of 66 submissions for the two subtasks, which were made by 44 different teams.
Note that we left the submission system open for submissions on the development set (phase 1) after the competition was over. The up-to-date leaderboards can be found on the website of the competition.

Participating Systems
In this section, we give a general description of the systems participating in both the SI and the TC subtasks. We pay special attention to the most successful approaches. The subindex on the right of each team name represents the team's official rank in the subtasks. Appendix A includes brief descriptions of all systems. Table 3 shows a quick overview of the systems that took part in the SI subtask. All systems in the top-10 positions relied on some kind of Transformer, in combination with an LSTM or a CRF. In most cases, the Transformer-generated representations were complemented by engineered features, such as named entities and the presence of sentiment and subjectivity clues.

Span Identification Subtask
Team Hitachi(SI:1) (Morio et al., 2020) achieved the top performance in this subtask. They used a BIO encoding, which is typical for related segmentation and labeling tasks (e.g., named entity recognition). They relied upon a complex heterogeneous multi-layer neural network, trained end-to-end. The network uses pre-trained language models, which generate a representation for each input token. They further added part-of-speech (PoS) and named entity (NE) embeddings. As a result, there are three representations for each token, which are concatenated and used as an input to Bi-LSTMs. At this point, the network branches, as it is trained with three objectives: (i) the main BIO tag prediction objective and two auxiliary ones, namely (ii) token-level technique classification, and (iii) sentence-level classification. There is one Bi-LSTM for objectives (i) and (ii), and there is another Bi-LSTM for objective (iii). For the former, they used an additional CRF layer, which helps improve the consistency of the output. A number of architectures were trained independently (using BERT, GPT-2, XLNet, XLM, RoBERTa, or XLM-RoBERTa), and the resulting models were combined in ensembles.
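As a small illustration of the BIO encoding mentioned above, the sketch below converts character-level propaganda spans into token-level B/I/O tags. It is a generic example, not the preprocessing of any participating team: the whitespace tokenizer, the half-open `(start, end)` span convention, and the function names are assumptions.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character offsets: (start, end)


def tokenize_with_offsets(text: str) -> List[Tuple[str, int, int]]:
    """Whitespace tokenizer that also returns each token's character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def bio_tags(text: str, spans: List[Span]) -> List[str]:
    """Tag each token as B (begins a span), I (inside a span), or O (outside)."""
    tags: List[str] = []
    previous = None  # index of the span that covered the previous token
    for _, tok_start, tok_end in tokenize_with_offsets(text):
        current = None
        for idx, (span_start, span_end) in enumerate(spans):
            if tok_start < span_end and tok_end > span_start:
                current = idx  # token overlaps this span
                break
        if current is None:
            tags.append("O")
        elif current == previous:
            tags.append("I")
        else:
            tags.append("B")
        previous = current
    return tags
```

With the running example, `bio_tags("how stupid and petty things", [(4, 20)])` tags "stupid" as B, "and" and "petty" as I, and the remaining tokens as O. A subword tokenizer, as used with Transformer models, would need the same offset bookkeeping at the subword level.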
Team ApplicaAI(SI:2) (Jurkiewicz et al., 2020) based its success on self-supervision using the RoBERTa model. They used a RoBERTa-CRF architecture trained on the provided data and used it to iteratively produce silver data by predicting on 500k sentences and retraining the model with both gold and silver data. The final classifier was an ensemble of models trained on the original corpus, re-weighting, and a model trained also on silver data. ApplicaAI was not the only team that reported a performance boost when using additional data. Team UPB(SI:5) (Paraschiv and Cercel, 2020) decided not to stick to the pre-trained BERT-base model alone and used masked language modeling to domain-adapt it using 9M fake, suspicious, and hyperpartisan news articles. Team DoNotDistribute(SI:22) also opted for generating silver data, but with a different strategy. They report a 5% performance boost when adding 3k new silver training instances. To produce them, they used a library to create near-paraphrases of the propaganda snippets by randomly substituting words with certain PoS tags. Team SkoltechNLP(SI:25) performed data augmentation based on distributional semantics. Finally, team WMD(SI:33) (Daval-Frerot and Yannick, 2020) applied multiple strategies to augment the data, such as back translation, synonym replacement, and TF.IDF replacement (replacing unimportant words, based on their TF.IDF score, by other unimportant words).
Closing the top-three submissions, Team aschern(SI:3) (Chernyavskiy et al., 2020) fine-tuned an ensemble of two differently initialized RoBERTa models, each with an attached CRF for sequence labeling and span character boundary post-processing.
There were several other promising strategies. Team LTIatCMU(SI:4) (Khosla et al., 2020) used a multi-granular BERT BiLSTM model with additional syntactic and semantic features at the word, sentence, and document level, including PoS, named entities, sentiment, and subjectivity. It was trained jointly for token and sentence propaganda classification, with class balancing. They further fine-tuned BERT on persuasive language using 10,000 articles from propaganda websites, which turned out to be important. Team PsuedoProp(SI:14) (Chauhan and Diddee, 2020) built a preliminary sentence-level classifier using an ensemble of XLNet and RoBERTa, before fine-tuning a BERT-based CRF sequence tagger to identify the exact spans. Team BPGC(SI:21) went beyond these multi-granularity approaches. Information both at the article and at the sentence level was considered when classifying each word as propaganda or not, by computing and concatenating vectorial representations for the three inputs.
A large number of the participating teams built systems that rely heavily on engineered features. For instance, Team CyberWallE(SI:8) (Blaschke et al., 2020) used features modeling sentiment, rhetorical structure, and PoS tags, while team UTMN(SI:23) injected the sentiment intensity from VADER and was one of the few teams that did not rely on deep learning architectures, in order to produce a computationally affordable model.

Technique Classification Subtask
The same trends as for the span identification subtask can be observed in the approaches used for the technique classification subtask: practically all top-performing approaches used representations produced by some kind of Transformer.
Team ApplicaAI(TC:1) (Jurkiewicz et al., 2020) achieved the top performance for this subtask. As in their approach to subtask SI, ApplicaAI produced additional silver data for training. This time, they ran their high-performing SI model to spot new propaganda snippets in free text and applied their preliminary TC model to produce extra silver-labeled instances. Their final classifier consisted of an ensemble of models trained on the original corpus, re-weighting, and a model trained also on silver data. In all cases, the input to the classifiers consisted of propaganda snippets and their context.
Table 4: Overview of the approaches to the technique classification subtask.
Team aschern(TC:2) (Chernyavskiy et al., 2020) was the second best, and it based its success on a RoBERTa ensemble with several interesting techniques. They treated the task as one of sequence classification, using an average embedding of the surrounding tokens and the length of the span as contextual features. They further incorporated knowledge from the span identification task, using transfer learning: namely, they first pre-trained their model for the SI task, and then continued training for the TC task. Finally, they performed task-specific postprocessing in order to increase the consistency for the repetition technique spans and to avoid insertions of techniques inside other techniques.
Team Hitachi(TC:3) (Morio et al., 2020) used two distinct feed-forward neural networks (FFNs). The first one is for sentence representation, whereas the second one is for the representation of the tokens in the propaganda span. The propaganda span representation is obtained by concatenating the representation of the begin-of-sentence token, the span start token, the span end token, and the aggregated representation obtained using attention and max-pooling. As for their winning approach to SI, team Hitachi trained on the TC subtask independently with different large-scale pre-trained state-of-the-art language models (BERT, GPT-2, XLNet, XLM, RoBERTa, or XLM-RoBERTa), and then combined the resulting models in an ensemble.
As the top-performing models for subtask TC show, while the two subtasks can be seen as fairly independent, combining them in a reasonable way pays off. Additionally, the context of a propaganda snippet is important to identify the specific propaganda technique it uses. Indeed, other teams tried to make context play a role in their models, with some success. For instance, team newsSweeper(TC:5) used RoBERTa to obtain and to concatenate representations for both the propaganda snippet and its context (i.e., the sentence). Team SocCogCom(TC:13) reduced the context to a window of three words before and after the propaganda snippet.
As in the SI subtask, a number of teams achieved sizable improvements when using various features. For instance, team BPGC(TC:18) included TF.IDF vectors of words and character n-grams, topic modeling, and sentence-level polarity, among others, in their ensemble model that used BERT and logistic regression. Team SocCogCom(TC:13) integrated semantic-level emotional salience features from CrystalFeel (Gupta and Yang, 2018) and word-level psycholinguistic features from LIWC (Pennebaker et al., 2015). Team CyberWallE(TC:8) (Blaschke et al., 2020) added named entities, rhetorical, and question features, while taking special care of repetitions as part of a complex ensemble architecture. According to team UNTLing(TC:27) (Petee and Palmer, 2020), considering NEs is particularly useful for propaganda techniques such as Loaded Language and Flag Waving (e.g., the latter usually includes references to idealized entities), and VAD features were useful for emotion-related propaganda techniques such as Appeal to fear/prejudice and Doubt. Team DiSaster(TC:11) (Kaas et al., 2020) combined BERT with features including the frequency of the fragment in the article and in the sentence it appears in, and the inverse uniqueness of words in a span. The goal of these features was to compensate for the inability of BERT to deal with distant context, specifically to target the technique Repetition. Team Solomon(TC:4) also targeted Repetition by using dynamic least common sub-sequence, which they used to score the similarity between the fragment and the context. The fragment was then considered to be a repetition if the score was greater than a threshold heuristically set with respect to the length of the fragment.
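The subsequence-based check for Repetition can be sketched roughly as follows. This is only an illustration and not team Solomon's implementation: it assumes that the similarity score is the length of the longest common subsequence of words, computed by dynamic programming, and that the threshold is a fixed fraction of the fragment length. The names `lcs_length`, `is_repetition`, and `ratio`, as well as the value 0.8, are hypothetical.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def is_repetition(fragment: str, context: str, ratio: float = 0.8) -> bool:
    """Flag a fragment as repeated if most of its words reappear, in order, in the context.

    The context is expected to exclude the fragment itself; the ratio is an
    assumed threshold that scales with the fragment length.
    """
    fragment_words = fragment.lower().split()
    context_words = context.lower().split()
    if not fragment_words:
        return False
    return lcs_length(fragment_words, context_words) >= ratio * len(fragment_words)
```

Tying the threshold to the fragment length keeps the check from firing on long fragments that merely share a few common words with the rest of the article.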
Some other teams decided to perform a normalization of the input texts, thus trying to reduce the representation diversity. This was the case of team DUTH(TC:10) (Bairaktaris et al., 2020), which mapped certain words into classes using named entity recognition with focus on person names and gazetteers containing names and variations of names of countries (255 entries), religions (35 entries), political ideologies (23 entries), and slogans (41 entries). The recognized categories were replaced by the category name in the input, before passing the input to BERT.
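The gazetteer-based part of such a normalization can be illustrated with a short sketch. It is not team DUTH's code: the tiny gazetteers, the category names, and the function name are made up for illustration, and person-name recognition is left out because it would require a trained named entity recognizer.

```python
import re
from typing import Dict, List

# Hypothetical miniature gazetteers; the real ones had tens to hundreds of entries.
GAZETTEERS: Dict[str, List[str]] = {
    "COUNTRY": ["United States", "USA", "Russia"],
    "RELIGION": ["Islam", "Christianity"],
    "IDEOLOGY": ["socialism", "fascism"],
}


def normalize(text: str, gazetteers: Dict[str, List[str]]) -> str:
    """Replace every gazetteer entry found in the text by its category name."""
    lookup = {name.lower(): category
              for category, names in gazetteers.items()
              for name in names}
    if not lookup:
        return text
    # Longest entries first, so that multi-word names win over their substrings.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in alternatives) + r")\b",
        re.IGNORECASE,
    )
    # A single pass avoids re-matching text that was already replaced.
    return pattern.sub(lambda m: lookup[m.group(1).lower()], text)
```

For example, `normalize("Russia and the USA reject fascism", GAZETTEERS)` yields "COUNTRY and the COUNTRY reject IDEOLOGY", which is the kind of reduced-diversity input that would then be passed to the classifier.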
As the class distribution for subtask TC is heavily skewed, some teams tried balancing techniques, for example, team Inno(TC:7) (Grigorev and Ivanov, 2020).

Evaluation Results

Results on the Span Identification Subtask
Table 5 shows the performance of the participating systems both on the testing and on the development partitions on the SI subtask. The baseline for subtask SI is a simple system that randomly generates spans, by first selecting the starting character of a span and then its length. As mentioned in Section 4, practically all approaches relied on Transformers to produce representations and then plugged their output into a sequential model, at the token level. It is worth observing that only three of the top-5 systems on the development set appear also among the top-5 systems on the test set. Indeed, teams syrapropa and PALI fell from positions 1 and 2 on the development set to positions 25 and 18 on the test set, which suggests possible overfitting. The performance of the final top-3 systems on the test partition (Hitachi, ApplicaAI, and aschern) reflects robust systems that seem to generalize much better.
Figure 5 shows the performance evolution when combining the output of the top-performing systems on the test set. All operations are carried out at the character level, i.e., we label characters as propagandistic or not, and then we combine into spans the longest possible sequences of neighboring characters that we labeled as propagandistic. The union and the intersection use the corresponding set operations. In union, a character is considered propagandistic if at least one of the participating systems has recognized it as part of a propaganda span. In intersection, a character is considered propagandistic if all systems have included it as part of a propaganda span. For majority voting, we consider a character propagandistic if more than 50% of the participating systems have included it as part of a propaganda span.
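The character-level combination of system outputs described above can be sketched as follows. The code is an illustrative reconstruction from that description rather than the script used to produce Figure 5; spans are assumed to be half-open `(start, end)` character offsets within a single article, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

Span = Tuple[int, int]  # half-open character offsets: (start, end)


def to_chars(spans: List[Span]) -> Set[int]:
    """Set of character positions covered by a list of spans."""
    return {i for start, end in spans for i in range(start, end)}


def to_spans(chars: Set[int]) -> List[Span]:
    """Group covered positions into the longest runs of neighboring characters."""
    spans: List[Span] = []
    for i in sorted(chars):
        if spans and spans[-1][1] == i:
            spans[-1] = (spans[-1][0], i + 1)  # extend the current run
        else:
            spans.append((i, i + 1))           # start a new run
    return spans


def combine(system_outputs: List[List[Span]], mode: str) -> List[Span]:
    """Combine the spans of several systems by union, intersection, or majority vote."""
    char_sets = [to_chars(spans) for spans in system_outputs]
    if not char_sets:
        return []
    if mode == "union":
        chars = set().union(*char_sets)
    elif mode == "intersection":
        chars = set.intersection(*char_sets)
    elif mode == "majority":
        votes = Counter(i for chars_of_system in char_sets for i in chars_of_system)
        chars = {i for i, count in votes.items() if count > len(char_sets) / 2}
    else:
        raise ValueError(f"unknown mode: {mode}")
    return to_spans(chars)
```

With three systems predicting `(0, 10)`, `(5, 15)`, and `(8, 20)`, the union is `(0, 20)`, the intersection is `(8, 10)`, and majority voting gives `(5, 15)`, which shows how the three operations trade recall against precision.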
We can see in Figure 5 that the precision and the recall trends behave just as expected: precision decreases and recall increases as more systems are combined with a union operation, and the opposite is true for the intersection. Despite the loss in precision, computing the union of the top-2 or the top-3 systems yields better performance than any of the top systems could achieve on its own. Such a combination gathers large ensembles of Transformer representations together with self-supervision to produce additional training data and boundary post-processing. If we are interested in a high-precision model, applying the intersection would make more sense, as it reaches a precision of 66.95 when combining the top-2 systems; however, this comes at the cost of a sizable loss of spans, which results in a considerable drop in recall. The majority voting combination lies somewhere in between, keeping reasonable levels for both precision and recall.

Results on the Technique Classification Subtask
Table 6 shows the performance of the participating systems on the test set for the TC subtask, and Table 7 reports the results on the development set. The baseline system for subtask TC is a logistic regression classifier using one feature only: the length of the fragment. A pattern similar to that for the SI subtask is observed: only two of the top-5 systems on the development set appear among the top-5 systems on the test set, which is a sign of possible overfitting for some of the systems. At the same time, systems that appeared to have a modest performance on the development set could eventually reach a higher position on the test set. For instance, team Hitachi, which was ranked 8th on the development set, ended up in the third position on the test set.
The tables further show the performance for each of the 14 propaganda techniques. In general, the systems show reasonably good performance when predicting Loaded Language and Name Calling or Labeling. These two classes are the most frequent ones, by a margin, and are also among the shortest ones on average (cf. Figure 3). On the other hand, techniques 13 (Straw man, red herring) and 14 (Bandwagon, reductio ad hitlerum, whataboutism) are among the hardest to identify. They are also among the least frequent ones.
Once again, we studied the performance when combining more approaches. Figure 6 shows the performance evolution when combining different numbers of top-performing systems on the test set. As this is a multi-class problem, we combine the systems only on the basis of majority voting. In case of a tie, we prefer the propaganda technique that is more frequent in the training set. When looking at the overall picture, the performance evolution when adding more systems is fairly flat, reaching the top performance when combining the top-3 systems: an F1 score of 63.63, which represents more than 1.5 points of absolute improvement over the top-1 system. When zooming into each of the fourteen propaganda techniques, we observe that in general the performance peak is indeed reached when considering three systems, e.g., for Appeal to fear-prejudice, Exaggeration, minimisation, or Causal oversimplification. Still, for Doubt, which is the hardest class to recognize, as many as 13 systems are necessary in order to reach a (still modest) peak performance of 17.78. Finally, note that there are other classes, such as Black-and-white fallacy or Whataboutism, straw men, red herring, for which system combinations do not help.
Table 7: Technique classification F1 performance on the development set. The systems are ordered based on the final ranking on the test set (cf. Table 6), whereas the ranking is the one on the development set. Columns 1 to 14 show the performance on each class (cf. Section 2). The best score for each class is in bold.
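The majority-voting combination for subtask TC, with ties broken in favor of the technique that is more frequent in the training set, can be sketched as below. This is an illustration under stated assumptions, not the code used for the analysis: the function name is hypothetical, the training counts are made-up placeholders, and the vote simply picks the most voted label.

```python
from collections import Counter


def majority_vote(predictions, train_frequency):
    """Pick the most voted technique for one snippet.

    predictions: one predicted label per system.
    train_frequency: label -> number of training instances, used only
    to break ties in favor of the more frequent technique.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    tied = [label for label, n in counts.items() if n == top]
    return max(tied, key=lambda label: train_frequency.get(label, 0))


# Placeholder counts for illustration only (not the real corpus statistics).
train_frequency = {"Loaded_Language": 2000, "Repetition": 600, "Doubt": 500}

print(majority_vote(["Doubt", "Loaded_Language", "Doubt"], train_frequency))
print(majority_vote(["Doubt", "Repetition"], train_frequency))
```

In the second call, the two systems disagree, so the technique with the higher training count is returned.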

Related Work
Propaganda is particularly visible in the context of "fake news" on social media, which have attracted a lot of research recently (Shu et al., 2017).  surveyed fact-checking approaches to fake news and related problems, and Li et al. (2016) focused on truth discovery in general. Two recent articles in Science offered a general discussion on the science of "fake news" (Lazer et al., 2018) and the process of proliferation of true and false news online (Vosoughi et al., 2018).
We are particularly interested here in how different forms of propaganda are manifested in text. So far, the computational identification of propaganda has been tackled mostly at the article level. Rashkin et al. (2017) created a corpus, where news articles are labeled as belonging to one of the following four categories: propaganda, trusted, hoax, or satire. The articles came from eight sources, two of which were propagandistic. The labels were obtained using distant supervision, assuming that all articles from a given news source share the label of that source, which introduces noise (Horne et al., 2018).  experimented with a binary version of the problem: propaganda vs. no propaganda. See (Da San Martino et al., 2020a) for a recent survey on computational propaganda detection.
In general, propaganda techniques serve as a means to persuade people, often in argumentative settings. While they may increase the rhetorical effectiveness of arguments, they naturally harm other aspects of argumentation quality (Wachsmuth et al., 2017). In particular, many of the span propaganda techniques considered in this shared task relate to the notion of fallacies, i.e. arguments whose reasoning is flawed in some way, often hidden and often on purpose (Tindale, 2007). Some recent work in computational argumentation has dealt with such fallacies. Among these, Habernal et al. (2018) presented and analyzed a corpus of web forum discussions with Ad hominem fallacies, and Habernal et al. (2017) introduced Argotario, a game that educates people to recognize fallacies. Argotario also had a corpus as a by-product, with 1.3k arguments annotated for five fallacies, including Ad hominem, Red herring, and Irrelevant authority, which are related to some of our propaganda techniques (cf. Section 2). Unlike these corpora, the news articles in our corpus are annotated with fourteen propaganda techniques. Moreover, instead of labeling entire arguments, our annotation aims at identifying the minimal text spans related to a technique.
In the present SemEval task, we started from the eighteen propaganda techniques and the corpus described in (Yu et al., 2019). We used the news articles included in that corpus in a pilot task that ran in January 2019, the Hack the News Datathon, as well as in a previous shared task, held as part of the 2019 Workshop on NLP4IF: Censorship, Disinformation, and Propaganda. Both the datathon and the shared task tackled the identification of propaganda techniques as one overall task (along with a binary sentence-level propaganda classification task), i.e., without splitting it into subtasks as we did here. As detailed in the overview paper of Da San Martino et al., the best-performing systems in the shared task used BERT-based contextual representations. Other systems used contextual representations based on RoBERTa, Grover, and ELMo, or context-independent representations based on lexical, sentiment, readability, and TF-IDF features. As in the task at hand, ensembles were also popular. Still, the most successful submissions achieved an F1-score of 24.88 only (and only 10.43 in the datathon). This is why we decided here to split the task into subtasks, in order to allow researchers to focus on one subtask at a time. Moreover, we merged some of the original 18 propaganda techniques to reduce data sparseness issues.
Other related shared tasks include the FEVER 2018 and 2019 tasks on Fact Extraction and VERification, the SemEval 2017 and 2019 tasks on determining the veracity of rumors (Derczynski et al., 2017; Gorrell et al., 2019), and the SemEval 2019 task on Fact-Checking in Community Question Answering Forums (Mihaylova et al., 2019). Also related are the CLEF 2018-2020 CheckThat! labs' shared tasks (Elsayed et al., 2019a; Elsayed et al., 2019b; Barrón-Cedeño et al., 2020b), which featured tasks on automatic identification and verification (Hasanain et al., 2019; Shaar et al., 2020) of claims in political debates and in social media.

Conclusion and Future Work
We have described SemEval-2020 Task 11 on Detection of Propaganda Techniques in News Articles. The task attracted the interest of a number of researchers: 250 teams signed up to participate, and 44 made submissions on the test dataset. We received 35 and 31 submissions for subtask SI and subtask TC, respectively. Overall, subtask SI (span identification) was easier and all systems managed to improve over the baseline. However, subtask TC (technique classification) proved to be much more challenging, and some teams could not improve over our baseline.
In future work, we plan to extend the dataset to cover more examples as well as more propaganda techniques. We further plan to develop similar datasets for other languages.

A Summary of all Submitted Systems
This appendix includes a brief summary of all systems for both subtasks. We present the teams in alphabetical order. The subindex on the right of each team represents its official test rank in the subtasks. The teams appearing in Tables 5, 6, or 7 but not here did not submit a paper describing their approach.
Team 3218IR (Dewantara et al., 2020)(SI:31) used a one-dimensional convolutional neural network (CNN) with word embeddings, whose number of layers and filters as well as kernel and pooling sizes were all tuned empirically.
Team ApplicaAI (SI: 2, TC: 1) applied self-supervision using the RoBERTa model. For the SI subtask, they used a RoBERTa-CRF architecture. The model trained using this architecture was then iteratively used to produce silver data by predicting on 500k sentences and retraining the model with both gold and silver data. As for subtask TC, ApplicaAI fed their models with propaganda snippets in context: the input consisted of full sentences containing the specific propaganda snippet. Once again, silver data was used, taking advantage of the spans detected by their SI model and labeling them with their preliminary TC model. The final classifier was an ensemble of models trained on the original corpus, re-weighting, and a model also trained on silver data.
Team aschern (SI: 3, TC: 2) tackled both subtasks. For SI, they fine-tuned an ensemble of two differently initialized RoBERTa models, each with an attached CRF for sequence labeling and simple span character boundary post-processing. A RoBERTa ensemble was also used for TC, treating the task as sequence classification but using an average embedding of the surrounding tokens and the length of a span as contextual features. They further used transfer learning to pass knowledge about the SI subtask to help the TC subtask. Finally, specific post-processing was done to increase the consistency of the repetition technique spans and to avoid insertions of techniques in other techniques.
Team BPGC (SI: 21, TC: 18) used a multi-granularity approach to address subtask SI. Information about the article and the sentence was considered when classifying each word as propaganda or not, by means of computing and concatenating vectorial representations for the three inputs. For subtask TC, they used an ensemble of BERT and logistic regression, complemented with engineered features which, as stated by the authors, were particularly useful for the smaller classes. Such features include TF.IDF vectors of words and character n-grams, topic modeling, and sentence-level polarity, among others. Different learning models were explored for both tasks, including LSTM and CNN, together with diverse Transformers to build ensembles of classifiers.
Team CyberWallE (Blaschke et al., 2020)(SI: 8, TC: 8) used BERT embeddings for subtask SI, as well as manual features modeling sentiment, rhetorical structure, and POS tags, which were eventually fed into a bi-LSTM to produce IO labels, followed by some post-processing to merge neighboring spans. For subtask TC, they extracted the pre-softmax layer of BERT and further added extra features (rhetorical, named entities, question), while taking special care of repetitions as part of a complex ensemble architecture, followed by label post-processing.
Team DiSaster (Kaas et al., 2020)(TC:11) used a combination of BERT and hand-crafted features, including the frequency of the fragment in the article and in the sentence it appears in, and the inverse uniqueness of words in a span. The goal of the features is to compensate for the inability of BERT to deal with distant context, specifically to target the technique Repetition.
Team DoNotDistribute (SI: 22, TC: 24) opted for a combination of BERT-based models and engineered features (e.g., PoS, NEs, frequency within propaganda snippets in the training set). A reported performance increase of 5% was obtained by producing 3k new silver training instances. A library was used to create near-paraphrases of the propaganda snippets by randomly substituting certain PoS words.
Team DUTH (Bairaktaris et al., 2020)(TC:10) pre-processed the input including URL normalization, number and punctuation removal, as well as lowercasing. They further mapped certain words into classes using named entity recognition with focus on person names and gazetteers containing names and variations of names of countries (255 entries), religions (35 entries), political ideologies (23 entries), and slogans (41 entries). The recognized categories were replaced by the category name in the input, before passing the input to BERT.
Team Hitachi (SI: 1, TC: 3) used BIO encoding for subtask SI, which is typical for related segmentation and labeling tasks such as named entity recognition. They used a complex heterogeneous multi-layer neural network, trained end-to-end. The network used a pre-trained language model, which generates a representation for each input token. To this were added part-of-speech (PoS) and named entity (NE) embeddings. As a result, there were three representations for each token, which were concatenated and used as input to bi-LSTMs. At this point, the network branches, as it is trained with three objectives: (i) the main BIO tag prediction objective, and two auxiliary objectives, namely (ii) token-level technique classification and (iii) sentence-level classification. There is one bi-LSTM for objectives (i) and (ii), and another bi-LSTM for objective (iii). For the former, there is an additional CRF layer, which helps improve the consistency of the output. For subtask TC, there are two distinct feed-forward networks (FFNs), fed with input representations that are obtained in the same manner as for subtask SI. One of the two FFNs is for the sentence representation, and the other one is for the representation of the tokens in the propaganda span. The propaganda span representation is obtained by concatenating the representations of the begin-of-sentence token, the span start token, and the span end token with aggregated representations obtained by attention and max-pooling. For both subtasks, these architectures were trained independently with different BERT, GPT-2, XLNet, XLM, RoBERTa, or XLM-RoBERTa Transformers, and the resulting models were combined in ensembles.
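As an illustration of the BIO encoding mentioned above, the following minimal sketch converts character-level propaganda spans into token-level B/I/O tags. It is not the Hitachi implementation: it assumes simple whitespace tokenization rather than subword tokenization, and the helper names are hypothetical.

```python
import re


def whitespace_offsets(text):
    """Return (start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def bio_tags(token_offsets, spans):
    """Tag each token as B (begins a span), I (inside a span), or O (outside).

    token_offsets and spans are lists of (start, end) pairs, end exclusive.
    A token belongs to the first span it overlaps with.
    """
    tags = []
    previous = None  # index of the span the previous token belonged to
    for tok_start, tok_end in token_offsets:
        current = None
        for index, (start, end) in enumerate(spans):
            if tok_start < end and tok_end > start:
                current = index
                break
        if current is None:
            tags.append("O")
        elif current == previous:
            tags.append("I")
        else:
            tags.append("B")
        previous = current
    return tags


text = "This is pure evil propaganda today"
print(bio_tags(whitespace_offsets(text), [(8, 17)]))
```

Here the span covering "pure evil" yields the tags O, O, B, I, O, O; two adjacent but distinct spans would each start with their own B tag.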
Team Inno (Grigorev and Ivanov, 2020)(TC:7) used RoBERTa with cost-sensitive learning for subtask TC. They experimented with undersampling, i.e. removing examples from the bigger classes, as well as with modeling the context. They also tried various pre-trained Transformers, but obtained worse results.
Team JUST (TC:15) based its approach on the uncased BERT pre-trained language model, which has 12 Transformer layers and was trained for 15 epochs.
Team LTIatCMU (Khosla et al., 2020)(SI:4) used a multi-granular BERT BiLSTM for subtask SI. It used additional syntactic, semantic and pragmatic affect features at the word, sentence and document level. It was jointly trained on token and sentence propaganda classification, with class balancing. In addition, BERT was fine-tuned to persuasive language on about 10,000 articles from propaganda websites, which turned out to be important in their experiments.
Team newsSweeper (Singh et al., 2020)(SI: 13, TC: 5) used BERT with BIOE encoding for subtask SI. For the TC subtask, their official run used RoBERTa to obtain representations for the span and for the sentence, which they concatenated. The team further experimented (i) with other Transformers (BERT, RoBERTa, SpanBERT, and GPT-2), (ii) with other sequence labeling schemes (P/NP, BIO, BIOES), (iii) with concatenating different hidden layers of BERT to obtain a token representation, and (iv) with POS tags, as well as (v) with different neural architectures.
Team NLFIIT (SI: 17, TC: 16) used various combinations of neural architectures and embeddings and found that ELMo combined with BiLSTM (and self-attention for subtask TC) yielded the best performance.
Team NoPropaganda (SI: 7, TC: 6) used the LaserTagger model with the BERT-base encoder for subtask SI. R-BERT was used for subtask TC.
Team NTUAAILS (Arsenos and Siolas, 2020)(SI: 27, TC: 25) used a residual biLSTM fed with pre-trained ELMo embeddings for subtask SI. A biLSTM was used for subtask TC as well, but this time fed with GloVe word embeddings.
Team PsuedoProp (Chauhan and Diddee, 2020)(SI:14) focused on subtask SI. They pre-classified sentences as propaganda or not using an ensemble of XLNet and RoBERTa, before fine-tuning a BERT-based CRF sequence tagger to identify the exact spans.
Team SkoltechNLP (SI: 25, TC: 26) fine-tuned BERT for SI, expanding the original training set through data augmentation techniques based on distributional semantics.
Team SocCogCom (TC:13) approached subtask TC using BERT/ALBERT together with (i) semantic-level emotional salience features from CrystalFeel (Gupta and Yang, 2018), and (ii) word-level psycholinguistic features from the LIWC lexicon (Pennebaker et al., 2015). They further modeled the context, i.e. three words before and after the target propaganda snippet.
Team Solomon (Raj et al., 2020)(TC:4) addressed subtask TC with a system that combines a transfer learning model based on fine-tuned RoBERTa (integrating fragment and context information), an ensemble of binary classifiers for the smaller classes and a novel system to specifically handle Repetition: they used dynamic least common sub-sequence to assess the similarity between the fragment and the context, and then the fragment was considered to be a repetition if the score was greater than a threshold heuristically set with respect to the length of the fragment.
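A repetition check of this kind can be sketched as follows. This is only an illustration and not the Solomon system: it assumes that the similarity score is the length of the longest common subsequence between the fragment and its context, computed by dynamic programming, and the `ratio` threshold relative to the fragment length is a hypothetical choice.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def is_repetition(fragment, context, ratio=0.8):
    """Flag the fragment as repeated if most of its tokens recur, in order,
    in the context. The context should exclude the fragment itself."""
    fragment_tokens = fragment.lower().split()
    context_tokens = context.lower().split()
    if not fragment_tokens:
        return False
    score = lcs_length(fragment_tokens, context_tokens)
    # Hypothetical threshold, set relative to the fragment length.
    return score >= ratio * len(fragment_tokens)


print(is_repetition("build the wall", "they said we must build the wall now"))
print(is_repetition("build the wall", "the economy is growing"))
```

Tying the threshold to the fragment length keeps short and long fragments comparable, since a fixed absolute score would favor long fragments.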
Team syrapropa (Li and Xiao, 2020)(SI: 25, TC: 20) fine-tuned SpanBERT, a variant of BERT for span detection, on the context of spans in terms of the surrounding non-propaganda text for subtask SI. For subtask TC, they used a hybrid model that consists of several submodels, each specializing in some of the relations. These models include (i) BERT, (ii) BERT with cost adjustment to address class imbalance, and (iii) feature-rich logistic regression. The latter uses features such as length, TF.IDF-weighted words, repetitions, superlatives, and lists of fixed phrases targeting specific propaganda techniques. The output from the hybrid model was further post-processed using some syntactic rules based on part of speech.
Team Transformers (Verma et al., 2020)(SI: 9, TC: 29) explored a variety of models to address the SI subtask. They considered residual biLSTMs fed with ELMo representations as well as different variations of BERT and RoBERTa with CNNs.
Team TTUI (Kim and Bethard, 2020)(SI: 20, TC: 14) proposed an ensemble of fine-tuned BERT and RoBERTa models. They observed that feeding as input to the neural network a chunk of multiple, possibly overlapping sentences yielded the best performance. Moreover, for subtask SI, they applied post-processing to remove gaps in the predictions between adjacent words. For subtask TC, they showed that modeling the context did not help in their experiments.
Team UAIC1860 (Ermurachi and Gifu, 2020)(SI: 28, TC: 26) used traditional text representation techniques: character n-grams, word2vec embeddings, and TF.IDF-weighted word-based features. For both subtasks, these features were used in a Random Forest classifier. Additional experiments with Naïve Bayes, Logistic Regression and SVMs yielded worse results.
Team UMSIForeseer (Jiang et al., 2020)(TC:17) focused on subtask TC. They fine-tuned BERT on the labeled training spans, using a mix of oversampling and undersampling that is leveraged using a bagging ensemble learner.
Team UNTLing (Petee and Palmer, 2020)(TC:27) used a logistic regression classifier for subtask TC with a number of features, including bag-of-words, embeddings, NE and VAD lexicon features. Their analysis highlights that NE are useful for Loaded Language and Flag Waving. The VAD features were useful for emotion-related techniques such as Appeal to fear/prejudice and Doubt. They performed some experiments on the development set for subtask SI after the deadline. They used CRF with a number of features including PoS, syntactic dependency between the token and the previous/next word, BoW of preceding/following tokens, and the GloVe embedding of the token.
Team UPB (Paraschiv and Cercel, 2020)(SI: 5, TC: 19) used models based on BERT-base. Rather than just using the pre-trained models, they used masked language modeling to domain-adapt them on 9M fake, suspicious, and hyperpartisan news articles. They used the same domain-adapted model for both subtasks. They further used a CRF for subtask SI, and a softmax for subtask TC.
Team UTMN (SI:23) addressed subtask SI by representing the texts as a concatenation of tokens and context embeddings, together with sentiment intensity from VADER. They avoided deep learning architectures in order to produce a computationally affordable model, namely logistic regression.
Team WMD (Daval-Frerot and Yannick, 2020)(SI: 33, TC: 21) used an ensemble of BERT-based models, LSTMs, SVMs, gradient boosting, and random forest together with character and word-level embeddings. In addition, they used a number of techniques for data augmentation: back-translation, synonym replacement and TF.IDF replacement, i.e., replacing unimportant words, according to their TF.IDF score, with other unimportant words.
Team YNUtaoxin (Tao and Zhou, 2020)(SI:11) used BERT, RoBERTa and XLNet on subtask SI focusing on determining the optimal input sentence length for the networks.

B Errata
After the shared task had ended, we found a bug in the code of our evaluation tools, which affected both subtasks. Overall, its impact was limited, and the ranking computed with the fixed code did not change substantially, in particular for the top-ranked teams. Tables 8 and 9 show the corrected scores on the test sets for subtasks SI and TC, respectively. Any reference to the task results should refer to these numbers.
Table 9: Technique classification F1 performance on the test set using the fixed scorer. The systems are ordered based on the final ranking. Columns 1 to 14 show the performance for each of the fourteen propaganda techniques (cf. Section 2). The best score for each technique is shown in bold.

C Annotation Instructions
Below, we show a series of snapshots of the actual annotation instructions, together with the definitions and examples of the propaganda techniques, that we showed to the human annotators. These are also available online:
• http://propaganda.qcri.org/annotations/
• https://propaganda.qcri.org/annotations/definitions.html