From the Stage to the Audience: Propaganda on Reddit

Political discussions revolve around ideological conflicts that often split the audience into two opposing camps. Both sides try to win the argument by bringing forward information. However, this information is often misleading, and its dissemination employs propaganda techniques. In this work, we analyze the impact of propaganda on six major political forums on Reddit that target a diverse audience in two countries, the US and the UK. We focus on three research questions: Who is posting propaganda? How does propaganda differ across the political spectrum? And how is propaganda received on political forums?


Introduction
Propaganda, translated from Latin as "things that must be disseminated", represents information intended to persuade an audience to accept a particular idea or cause by using specific strategies or by stirring up emotions. Our work is the first study that leverages a high-quality annotated dataset of propaganda techniques to understand the impact of propaganda on online conversations.
In this paper, we perform an in-depth and long-term analysis of propaganda on online forums. We focus on six subreddits from two English-speaking countries, the US and the UK, for one year. We select a popular subreddit for political news with no party affiliation and two subreddits dedicated to each country's dominant parties. In the US, the two main parties are the Democratic and the Republican Party. The Democratic Party is center-left; however, it contains several factions with ideologies varying from the center to the left. The Republican Party is a center-right party and has shifted in recent years towards national conservatism. In the UK, the most popular parties are the Labour Party and the Conservative Party. Similarly to the US, these parties represent the center-left and the center-right. The Labour Party has social democratic and socialist factions, while the Conservative Party has many factions, such as one-nation conservatism, liberal conservatism, or social conservatism. In recent years, both countries passed through significant political turmoil, such as Donald Trump's election in the US and the referendum on leaving the EU in the UK. However, a recent opinion piece in the Washington Post highlights an essential difference between the political discourse in the two countries. The journalist believes that the division between the left and the right in America is driven by the different interpretations the two parties give to the words "rights", "liberty", or "freedom", which carry a strong moral imperative. This difference is not present in the UK, hence political parties there might find it easier to reach common ground.
Our contribution to the study of propaganda in online discussions is in investigating the following research questions: i) Who is posting propaganda? ii) How does propaganda differ across the political spectrum or across countries? and iii) How is propaganda received on political forums? We believe we are the first to investigate these important questions in forums with different political leanings. For the first question, we find that a media source's political bias is a strong indicator of its tendency to use propaganda, and that a smaller community of users disproportionately spreads propagandistic articles. Regarding the second question, we find that forums dedicated to less popular parties in a country are more likely to post biased news, and that cultural differences might dictate which propaganda techniques are employed. Finally, we find that if a submission or comment has more propaganda content, it might receive more user engagement, measured either as the number of comments or as upvotes and downvotes.

Related Work
Analysis of political discussions. (Roozenbeek and Salvador Palau, 2017) explore the role of online communities in elections and how different types of news events impact their dynamics. In (Soliman et al., 2019), the authors analyze political communities (subreddits on Reddit), comparing them in terms of the content posted, their relationships to other subreddits, and the distribution of attention received in these subcommunities. They compare left-leaning with right-leaning communities, and significant differences emerge, such as higher use of derogatory language in the right-leaning communities, stronger connectivity between the US and the European right-leaning communities, and a more substantial focus on media sources reflecting their political leaning in the left-leaning subreddits. In (Guimaraes et al., 2019), the authors identify different conversation patterns that refine the notion of controversy into disputes, disruptions, and discrepancies, and perform a systematic analysis of discussion threads based on essential facets of a conversation, such as users, sentiments, and topics. Another study proposes an analytical template to explore the nature of political discussions by studying the interaction and linguistic patterns within and between politically homogeneous and heterogeneous communication spaces on Reddit. (Carman et al., 2018) analyze the effects of vote manipulation on article visibility and user engagement by comparing political threads on Reddit whose visibility is artificially increased.
Propaganda detection. Previous works on propaganda have focused on proposing datasets to foster further research, including document-level annotations (Rashkin et al., 2017) and fragment-level annotations (Da San Martino et al., 2019). Efforts for constructing annotated datasets have also been made for European languages other than English (Kmetty et al., 2020; Baisa et al., 2019). Automatic propaganda detection approaches are almost always proposed alongside new corpora. (Rashkin et al., 2017) define a four-class text classification task that detects propaganda, satire, hoaxes, and real news, while (Barrón-Cedeño et al., 2019) use binary classification to distinguish propaganda from non-propaganda articles. (Da San Martino et al., 2019) perform a fine-grained analysis of texts by detecting all fragments that contain propaganda techniques, as well as their type. In (Kellner et al., 2020), the authors quantify the influence of Twitter trolls that contribute to the spread of propaganda during political elections in online communities. Studies on the use of propaganda have also helped understand how terrorist organizations share their ideology and attract new members (Al-Rawi and Groshek, 2020; Bisgin et al., 2019). A recent survey reviews the state of the art of computational propaganda detection from both an NLP and a network analysis perspective, arguing for the need to combine these communities' efforts.
Bot detection in political discussions. Research on political discussions has mostly focused on specialized topics such as adversarial debates between two parties, like election campaigns and referendums. (Rizoiu et al., 2018; Davis et al., 2016) use machine learning approaches to study the influence of social bots in the diffusion of tweets containing partisan hashtags surrounding a political debate. (Hurtado et al., 2019) study political discussions on Reddit and use graph-based methods to reveal a fully connected community of users who exhibit bot-like behavior. Other work introduces a generative model based on users' temporal activity patterns to study abnormal posting behavior on both Twitter and Reddit data.
Journalistic efforts in studying online content. There have been some relevant initiatives by communities of expert journalists or volunteers to raise awareness of different online news issues by evaluating the content published by news outlets and social media. For instance, Media Bias/Fact Check (MBFC) is an independent organization that analyzes media in terms of their factual reporting, bias, and propagandist content, among other aspects. Full Fact, an independent fact-checking organization in the UK, provides free tools, information, and advice for checking claims by politicians and the media. Similar initiatives have been taken by US News and World Report and the European Union.

Propaganda Techniques
Propaganda is a communication technique primarily used to influence public opinion towards an a priori established agenda.
According to the Institute for Propaganda Analysis, propaganda was defined in 1938 as "the expression of an opinion or an action by individuals or groups deliberately designed to influence the opinions or the actions of other individuals or groups with reference to predetermined ends" (Institute for Propaganda Analysis, 1938).
In the past century, spreading propaganda required controlling traditional journalism media, such as newsprint, TV, and radio stations. It represented a form of communication that only large institutions and governments could afford. With the recent rise of the Internet and its use as online mass media, "computational propaganda" appeared (Bolsover and Howard, 2017) as a social and technical phenomenon that made propaganda campaigns easily accessible to a wide variety of small organizations and individuals that targeted audiences of unprecedented size. Recent striking examples include the propaganda allegedly set to influence the 2016 US presidential elections (Mueller, 2018) and the 2016 Brexit referendum (Howard and Kollanyi, 2016).
While the definition of propaganda has reached consensus in the literature, the complete list of techniques considered propagandistic is still under discussion, with Wikipedia mentioning 68 of them. We adhere to the previously made hypothesis that propaganda is a communication technique that does not depend on the document topic and its topic-specific vocabulary, and for which representations based on writing style, readability, and stylistic features generalize better than word-level representations. (Da San Martino et al., 2019) choose to investigate a curated list of eighteen propaganda techniques found in journalistic articles that can be judged intrinsically, without the need to retrieve supporting information from external resources. Many of these techniques are also fallacies, since propagandists use arguments that are sometimes convincing but not necessarily valid. A fallacy is an argument in which the evidence does not support the claim that is put forward. The other techniques employ emotional language or use rhetorical, psychological, and disinformation strategies to present an idea.
We leverage the list of eighteen propaganda techniques proposed by Da San Martino et al. (2019):
• Appeal to authority (fallacy) cites an expert's opinion to support an argument, without any other supporting evidence.
• Appeal to fear or prejudice (fallacy) supports a claim by increasing fear towards an alternative, possibly based on preconceived judgments.
• Bandwagon (argumentum ad populum fallacy) persuades the audience that a claim is true because many people believe so.
• Black and white fallacy presents only two choices out of many available, with the choice on the agenda as being the better one.
• Causal oversimplification (fallacy of the single cause) assumes only one cause for a complex issue out of many possible ones.
• Flag waving (fallacy) exploits strong patriotic feelings for a group or idea to justify an action or a claim.
• Name calling or labeling uses names, labels, or euphemisms to construct a good/bad image of a group or idea that is to be supported/denounced.
• Red herring (fallacy) presents an irrelevant, although possibly convincing, argument to divert attention from the matter at hand.
• Reductio ad Hitlerum (fallacy) persuades the target audience to disapprove of a claim by associating it with a group widely held in contempt.
• Straw man (fallacy) addresses and refutes a superficially similar claim instead of the real one.
• Whataboutism (fallacy) discredits the opponent's claim by accusing them of hypocrisy without directly addressing the original argument.
• Doubt questions the credibility of an idea by disseminating negative information about it.
• Exaggeration or minimization makes the reality look more meaningful or more insignificant than it is.
• Loaded language uses words and phrases with substantial emotional implications.
• Obfuscation, intentional vagueness, confusion (ambiguity fallacy) deliberately employs vague generalities leaving the audience to draw its interpretations.
• Repetition repeatedly uses the same symbol or idea to make it unforgettable.
• Slogans make use of brief and striking phrases to deliver the intended message.
• Thought terminating cliches take advantage of short, generic phrases that divert the attention or seem to offer simple answers to complex problems to stop an argument from proceeding further.

Reddit Dataset
We select six subreddits: Politics, Democrats, Republican, UKPolitics, LabourUK, and Tories. Politics is a subreddit for "current and explicitly political U.S. news" and does not claim any political affiliation. The Democrats subreddit description contains "We are here to get Democrats elected up and down the ballot.", and it is a partisan subreddit. Republican is "a partisan subreddit" and the place where "Republicans discuss issues with other Republicans"; hence, it is a subreddit for people supporting the US Republican party. UKPolitics is a forum for "political news and debate concerning the United Kingdom" and does not claim any political affiliation. LabourUK is a subreddit that discusses breaking news concerning the British Labour Party. Finally, Tories is a subreddit for news concerning the Conservative Party in the UK, also known as the Tories. When there are several subreddits on the same topic (for example, BritishPolitics is also a subreddit for politics in the United Kingdom), we select the subreddit with the largest number of members. We note that Reddit does not ask for or encourage users to share personal data, such as their location. Statistics on Reddit users are available only through data gathered from independent polls and surveys; for example, we know that the US and the UK are the best-represented countries among Reddit users. In light of these surveys, we hypothesize that many users from the US and the UK engage in political subreddits. We take all content posted over a period of one year, January 2019 to December 2019, from the PushShift dataset (Baumgartner et al., 2020). On Reddit, a discussion is started by a submission, e.g., a news article or a piece of text, and users engage by writing comments. A comment is described by, among others, its author, body (the content of the comment), and score (computed as upvotes minus downvotes).
We remove comments tagged as "[deleted]" or "[removed]", which are comments removed by the moderators or by the users themselves. A submission has several properties, including content (often linked via a URL), number of comments, score (upvotes minus downvotes), and author. For simplicity, we refer to both the submission and the article linked in the submission using the term submission. We retrieve the external articles by following the link in the submission. We filter out the submissions whose corresponding articles were not found by the crawler, either because cookie permissions cannot be given automatically or because the link is no longer valid. We also filter out submissions linking to articles with fewer than 200 words, as we want to focus on journalistic-like content: text long enough to develop an idea well. Overall, we keep around 43-71% of the original submissions, depending on the subreddit. An overview of our dataset is given in Table 1.
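The filtering steps above can be sketched as follows. This is a minimal sketch: the field names mirror the keys exposed by PushShift dumps (e.g., "body" for a comment's text), but the exact keys are an assumption.

```python
from typing import Optional

# Hypothetical filtering helpers mirroring the steps described above.

def keep_comment(comment: dict) -> bool:
    # Drop comments removed by moderators ("[removed]") or deleted by
    # their authors ("[deleted]").
    return comment.get("body") not in ("[deleted]", "[removed]")

def keep_submission(article_text: Optional[str]) -> bool:
    # Keep submissions whose linked article was successfully crawled
    # and contains at least 200 words.
    if article_text is None:  # crawler failed: cookie walls, dead links
        return False
    return len(article_text.split()) >= 200
```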
To further understand the subreddits' dynamics, we report the overlap between the users commenting or posting a submission in the forums over the period we study. In the US-related forums, there are 736K unique users, of which 730K are active in Politics, 8.5K in Democrats, and 7.7K in Republican. We find that 75% of users in Democrats and 57% of users in Republican also post in Politics, while only 5% of users posting in Republican also post in Democrats. In the UK forums, we have 46K unique users, of which 44K post in UKPolitics, 3.3K in LabourUK, and 1K in Tories. The overlap between the forums shows a more balanced dynamic, with 61% of the LabourUK users and 63% of the Tories users also posting in UKPolitics, and 23% of the Tories users posting in LabourUK.
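The overlap figures above amount to a simple set computation; a sketch (the user sets below are hypothetical):

```python
# Fraction of users active in forum A who are also active in forum B.
def overlap(users_a: set, users_b: set) -> float:
    if not users_a:
        return 0.0
    return len(users_a & users_b) / len(users_a)
```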
We define two classification tasks based on the propaganda dataset described in Section 4: i) propaganda identification, which predicts if a sentence contains any propaganda techniques, and ii) propaganda technique identification, which, given a sentence containing propaganda, predicts the type of technique.
For each task, we test the following classifiers: a random classifier that predicts a class uniformly at random; a suite of transformer classifiers, BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and XLNet (Yang et al., 2019); and an ensemble classifier that makes a prediction based on the most confident label given by one of the three classifiers (BERT, RoBERTa, or XLNet). Finally, we add the multi-granularity model proposed in (Da San Martino et al., 2019), MGN ReLU. To fine-tune the transformer models, we add a final linear layer. We use a sequence length of 210, a learning rate of 0.01, a mini-batch size of 16, an anneal factor of 0.5, a patience of 2, and a maximum of 20 epochs. To deal with dataset imbalance in both tasks, we weight the samples in the loss function according to the class weights.
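The class weighting can be sketched as follows. Inverse-frequency weighting is one common convention; the exact scheme used here is our assumption. The resulting per-class weights would then be passed to the loss function, e.g., torch.nn.CrossEntropyLoss(weight=...).

```python
from collections import Counter

def class_weights(labels):
    # weight_c = n / (k * count_c): rarer classes receive larger weights,
    # so the loss is not dominated by the majority class.
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}
```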
The first task, propaganda identification, is a binary classification task with classes propaganda and non-propaganda. We present the results in Table 2. We note that propaganda identification is a difficult task: all the classifiers obtain moderately good results, though much better than random selection.
The second task allowed us to understand whether we have enough instances of each propaganda technique to classify them. We ran an experimental study and observed that bandwagon, obfuscation, red herring, straw man, and thought-terminating cliches were never recognized in the test set by our classifiers. Given this, we removed them from the annotations and kept the remaining techniques for the first and second tasks. We present the results in Table 3.

Topical confounds. Finally, we study the effect of topical confounds in propaganda and technique classification. This analysis aims to understand if there are topical biases in the annotated dataset, which might bias our analysis. For example, if the data contains many articles on Trump, we might tend to label as propaganda any article referring to him. To identify topical biases, we use the approach presented in (Kumar et al., 2019). We first identify statistically overrepresented words in each propaganda technique in the training set and then replace them with a special token in the test set. The overrepresented words are computed using the log-odds ratio with a Dirichlet prior (Monroe et al., 2008), and we present the results in Table 4. We recall that we removed the techniques bandwagon, obfuscation, red herring, straw man, and thought-terminating cliches from our labeled dataset. As we can observe, for certain categories the words are very intuitive. For example, in reductio ad Hitlerum we have many words related to totalitarian regimes, and in flag waving we have many words around the notion of country. However, for most techniques the words do not form cohesive topics, which is expected, as propaganda is a communication technique and is not restricted to a topic.
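The word-scoring step can be sketched with the log-odds ratio with informative Dirichlet prior of Monroe et al. (2008). In this minimal sketch, the prior counts double as the background corpus counts, which is a simplifying assumption.

```python
import math
from collections import Counter

# z-scored log-odds with Dirichlet prior: large positive z means the word
# is overrepresented in corpus i relative to corpus j.
def log_odds_dirichlet(counts_i, counts_j, prior):
    n_i = sum(counts_i.values())
    n_j = sum(counts_j.values())
    a0 = sum(prior.values())
    z = {}
    for w in prior:
        yi, yj, aw = counts_i.get(w, 0), counts_j.get(w, 0), prior[w]
        delta = (math.log((yi + aw) / (n_i + a0 - yi - aw))
                 - math.log((yj + aw) / (n_j + a0 - yj - aw)))
        var = 1.0 / (yi + aw) + 1.0 / (yj + aw)  # estimated variance
        z[w] = delta / math.sqrt(var)
    return z
```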
To further verify that our classifiers learn style and not topic, we replace in the test set, with a special token, the top k words most strongly associated with each technique, computed from the training set. For both k = 10 and k = 20, we report a very small decrease in F1 score for the BERT classifier in the propaganda classification task, from 55.08 to 52.47 and from 55.08 to 53.08, respectively. For the technique classification task, for k = 10 we do not observe a drop in performance, while for k = 20 the F1-micro score drops from 29.75 to 27.26, and the F1-macro from 22.17 to 19.85. In addition, we note that the decrease in performance for this task is distributed among techniques. For flag waving and reductio ad Hitlerum, for which certain words were important with respect to their definition, we do not observe a large decrease in F1 score. For example, the F1 score for flag waving decreases from 43.98 to 39.57 for k = 10 and to 39.36 for k = 20. Given the small decrease in performance, we conclude that our classifiers do not learn topical confounds but the language patterns of propaganda techniques.

We leverage the propaganda identification classifier to define a propaganda score. The propaganda score of a document is the percentage of its sentences labelled as containing propaganda. We compute the propaganda score of each submission and, based on the distribution of score values in a subreddit, define two groups: least propaganda, the 25% of submissions with the lowest propaganda scores, and most propaganda, the 25% of submissions with the highest propaganda scores. Our aim in defining the two groups is to mitigate part of the classifier's imprecision and make our analysis more robust.
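The propaganda score and the two quartile groups can be sketched as follows; this is an illustration of the definition above, not the exact implementation.

```python
import statistics

def propaganda_score(sentence_labels):
    # Percentage of sentences flagged as containing propaganda
    # (sentence_labels: one boolean per sentence).
    return 100.0 * sum(sentence_labels) / len(sentence_labels)

def quartile_groups(scores):
    # Bottom 25% of submissions = "least propaganda",
    # top 25% = "most propaganda".
    q1, _, q3 = statistics.quantiles(scores, n=4)
    least = [s for s in scores if s <= q1]
    most = [s for s in scores if s >= q3]
    return least, most
```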

Propaganda on Reddit
In this section, we focus on several research questions around propaganda on online forums.

RQ1. Who is posting propaganda?
In the context of political forum discussions, this question targets two different groups: media outlets and social media users. The initial publishers are the media outlets, but users handpick which news to share on political forums. To study which media outlets are present and to what extent they are responsible for the propaganda content, we look at the groups defined in Section 5, the least propaganda submissions and the most propaganda submissions. We compute the domain of each submission's URL, which corresponds to the media outlet. We give each media outlet a label measuring its political leaning according to MediaBiasFactCheck: center, left-center, right-center, left, right, questionable, and others. The center label is interpreted as having no or little political bias, left-center and right-center have a slight bias, left and right have a moderate bias, and the questionable label indicates a strong bias. The others label is given to sites for which no MediaBiasFactCheck rating was found. MediaBiasFactCheck computes a media source's political bias taking into account bias by story selection, bias by omission, and bias by labeling, among others. In Table 5, we observe a strong relationship between the political bias of the media sources and the groups we computed using our propaganda score. Hence, we can infer that political bias often translates into the use of propaganda techniques.
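Mapping a submission's URL to its outlet and bias label can be sketched as follows; the bias table entries are invented placeholders, not actual MediaBiasFactCheck ratings.

```python
from urllib.parse import urlparse

# Illustrative bias table (placeholder domains and labels).
BIAS_LABELS = {
    "example-left.com": "left",
    "example-center.com": "center",
}

def outlet_domain(url: str) -> str:
    # Extract the registered host, dropping a leading "www.".
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc

def bias_label(url: str) -> str:
    # Outlets without a rating fall into the "others" bucket.
    return BIAS_LABELS.get(outlet_domain(url), "others")
```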
Concerning users posting propaganda content on Reddit, we cannot link them to real entities; however, we can observe them as a community. We find that the submissions in the most propaganda group are created by a smaller number of unique users than the submissions in the least propaganda group on all subreddits except LabourUK, where we observe the opposite trend. On Politics, the least propaganda group has 9% more unique submitters, on Democrats 5%, on Republican 32%, on UKPolitics 9%, and on Tories 10%, while on LabourUK the most propaganda group has 28% more unique users who created a submission. This trend might indicate that certain users are more active in publishing propaganda content.

Table 5: There is a strong relation between the political bias of the media sources and the groups we computed using our propaganda score. For example, on Politics, the majority of media sources are left-center in the least propaganda group, while the majority of sources are left in the most propaganda group.
One follow-up question that we ask is how many of these users are bots. The presence of bots could explain why fewer users post articles in one group. While there are several lists of Reddit bots, none of them is complete. Given this, we employ Rest-Sleep-and-Comment (RSC) (Ferraz Costa et al., 2015), a generative method that can distinguish human from bot posting activity. The method receives as input the intervals between two consecutive posts of a user, and these intervals are then compared with the aggregated distributions of intervals of all the users. The authors provide an initial training set of normal users and bots consisting of 37 bots and 999 users, to which we add 94 extra bots to make the model more robust. RSC has an average F1-score of 77.3 in cross-validation. The model requires at least 800 consecutive timestamps at which a user has written a comment. We retrieve from our subreddits all the users who posted a submission, and we keep the users for which we could retrieve the required number of timestamps. We note that the timestamps for a user are retrieved such that they represent consecutive chronological posts; hence, we do not restrict the subreddits in which the user might have posted. We find 748 possible bots on Politics, 91 on Democrats, 21 on Republican, 135 on UKPolitics, 23 on LabourUK, and 9 on Tories. We investigate whether these suspicious users posted a larger percentage of most propaganda articles in comparison with least propaganda articles. We find that this is the case on all subreddits except Republican. However, the results are not statistically significant (p > 0.05) and the differences are small, as seen in Table 6. Hence, we conclude that the bots' automatic activity in our dataset is not necessarily linked to posting propaganda content. Also, the small percentage of content in the most propaganda group published by bots shows that the majority of the propaganda content is published by real users.
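The input the RSC model consumes, the intervals between a user's consecutive posts, can be sketched as:

```python
def interpost_intervals(timestamps):
    # Sort a user's posting timestamps (e.g., Unix seconds) and return
    # the consecutive inter-arrival times.
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]
```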

RQ2. Does propaganda differ across the political spectrum?
For this analysis, we distinguish between US-based and UK-based subreddits. We compare these subreddits using our propaganda score. We find that there is a statistically significant difference between the median propaganda scores of articles on all subreddits in the US (p < 0.001), with the most propagandistic content being shared on the subreddit Republican (median = 0.307), followed by Democrats (median = 0.250), and finally Politics (median = 0.222). In the UK subreddits, UKPolitics (median = 0.214) and Tories (median = 0.217) contain less propaganda than LabourUK (median = 0.257); there is no statistical difference between UKPolitics and Tories. We test using the Kruskal-Wallis one-way analysis of variance, followed by Conover post-hoc tests. These results indicate that right-leaning forums are not more likely to post propaganda than left-leaning ones. However, the tendency to use propaganda could result from the popularity of the respective party in the country. The Conservative Party in the UK has been in government since 2010, and a 2019 survey showed the party 15 points ahead of the Labour Party. Even though the Republican Party in the US won the White House in 2016, it did not win the popular vote, and according to surveys more Americans identify as Democrats. We also note that the subreddits that do not claim any political affiliation, Politics and UKPolitics, have less propagandistic content, which is consistent with the results in Table 5.
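The Kruskal-Wallis test statistic used in this comparison can be sketched in pure Python as follows. Ties receive ordinal ranks here for brevity; a library routine such as scipy.stats.kruskal additionally applies tie correction and returns the p-value.

```python
def kruskal_wallis_h(*groups):
    # Pool all observations, rank them, and compare per-group rank sums.
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):
        rank_sums[gi] += rank
    # H = 12 / (n (n+1)) * sum_i R_i^2 / n_i - 3 (n+1)
    h = 12.0 / (n * (n + 1)) * sum(
        rs * rs / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3.0 * (n + 1)
    return h
```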
A second question is whether the propaganda techniques employed differ according to the subreddits' political leaning or according to the country. To test this, we use our propaganda technique identification classifier to annotate the sentences we previously labeled as propaganda. We restrict ourselves to articles in the most propaganda group, using the intuition that if many sentences in the same article raise flags in the classifier, it is more likely that the article contains propaganda. For each subreddit, we rank the propaganda techniques by their frequency. We find that the relative ranking of techniques does not differ much between subreddits from the same country. The top five most frequent techniques in the US are loaded language, name calling, exaggeration or minimization, flag waving, and doubt, while in the UK-based subreddits we have loaded language, name calling, doubt, appeal to fear or prejudice, and exaggeration or minimization. Given the low accuracy of our technique classifier, we cannot make any definitive claims. However, such differences between the subreddits discussing politics in the two countries are plausible when considering cultural differences. For example, Americans might be more susceptible to flag waving, the technique of using patriotic feelings to justify an action: in 2017, 67% of Americans believed that the US is the leader of the free world, according to a survey by the Public Broadcasting Service.

RQ3. How is propaganda received on political forums?
To answer this last question, we aim to understand whether more propaganda content creates more engagement. On Reddit, engagement is measured in the number of comments or the number of votes.
Firstly, we investigate whether users comment more on submissions with a higher propaganda score. We compare the median number of comments between the least propaganda group and the most propaganda group for each subreddit using the one-sided Mann-Whitney U test. We find that on Politics, Democrats, Republican, UKPolitics, and LabourUK, submissions in the most propaganda group receive more comments, while on Tories we observe the opposite effect.
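The Mann-Whitney U statistic behind this comparison can be sketched in pure Python; a library routine such as scipy.stats.mannwhitneyu additionally supplies the (one-sided) p-value.

```python
def mann_whitney_u(a, b):
    # U statistic for group a: number of pairs (x, y), x in a, y in b,
    # with x > y; ties are counted as 0.5.
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```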
We usually associate propaganda with media outlets; however, people can employ the same techniques to persuade an audience. We investigate how comments with propagandistic undertones are received on Reddit. For this, we look at the comment's score, which is the difference between the upvotes and downvotes that a comment received. We construct two groups of comments: positively received comments, with score ≥ t_pos (starting at t_pos = 10), and negatively received comments, with score ≤ t_neg (starting at t_neg = −5). We compute the average propaganda score in the positively and negatively received groups while increasing the absolute value of the thresholds (t_pos from 10 to 50 and t_neg from −5 to −50), as shown in Figure 1. We observe that the average propaganda score of a comment increases with the engagement it generates, measured as the number of upvotes or downvotes it receives. However, the trend is not observed on Republican and on Tories, one of the smaller subreddits, for which we have very few data points in the plot.
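The threshold sweep behind Figure 1 can be sketched as follows; the function and data layout are illustrative assumptions.

```python
def avg_score_by_threshold(comments, thresholds, positive=True, min_size=100):
    # comments: list of (vote_score, propaganda_score) pairs.
    # For each threshold, average the propaganda score over comments whose
    # vote score passes it, keeping a group only if it has > min_size comments.
    out = {}
    for t in thresholds:
        group = [p for s, p in comments
                 if (s >= t if positive else s <= t)]
        if len(group) > min_size:
            out[t] = sum(group) / len(group)
    return out
```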

Conclusion
In this work, we perform an extensive analysis of propaganda on online forums. We study, for one year, six subreddits from two English-speaking countries, the US and the UK. We find several interesting patterns that can be leveraged by Reddit users and moderators to create better online discussions. We have found trends that we believe were not observed before in the literature. For example: i) the parties which represent a minority in a country might tend to use more propaganda; ii) political bias (either towards the right or the left) might be an indication of propaganda; iii) users that post more biased content form smaller communities; iv) differences in the use of propaganda techniques across countries might be rooted in cultural differences; v) submissions and comments with more propaganda content tend to receive more engagement in the form of comments, upvotes, or downvotes. We note that while we have thoroughly tested all our hypotheses, our work is based on the automatic labelling of submissions and comments, with all the imprecision of such a method. We believe that understanding how propaganda affects us is of utmost importance for ensuring we live in democratic societies.

Figure 1: The average propaganda score in the positively received (blue) and negatively received (red) groups, while increasing the absolute value of the threshold. A data point represents a group with more than 100 comments.