Structurizing Misinformation Stories via Rationalizing Fact-Checks

Misinformation has recently become a well-documented matter of public concern. Existing studies on this topic have hitherto adopted a coarse concept of misinformation, which incorporates a broad spectrum of story types ranging from political conspiracies to misinterpreted pranks. This paper aims to structurize these misinformation stories by leveraging fact-check articles. Our intuition is that key phrases in a fact-check article that identify the misinformation type(s) (e.g., doctored images, urban legends) also act as rationales that determine the verdict of the fact-check (e.g., false). We experiment on rationalized models with domain knowledge as weak supervision to extract these phrases as rationales, and then cluster semantically similar rationales to summarize prevalent misinformation types. Using archived fact-checks from Snopes.com, we identify ten types of misinformation stories. We discuss how these types have evolved over the last ten years and compare their prevalence between the 2016/2020 US presidential elections and the H1N1/COVID-19 pandemics.


Introduction
Misinformation has raised increasing public concern globally, well-documented in Africa (Ahinkorah et al., 2020), Asia (Kaur et al., 2018), and Europe (Fletcher et al., 2018). In the US, "fake news" accounted for 6% of all news consumption during the 2016 US presidential election (Grinberg et al., 2019). Years later, 29% of US adults in a survey believed that the "exaggerated threat" of the COVID-19 pandemic purposefully damaged former US president Donald Trump (Uscinski et al., 2020), and 77% of Trump's supporters believed that "voter fraud" manipulated the 2020 US presidential election, despite a complete lack of evidence (Pennycook and Rand, 2021). As such misinformation continues to threaten society, researchers have started investigating this multifaceted problem, from understanding the socio-psychological foundations of susceptibility (Bakir and McStay, 2018) and measuring public responses (Jiang and Wilson, 2018; Jiang et al., 2020b), to designing detection algorithms (Shu et al., 2017) and auditing countermeasures for online platforms (Jiang et al., 2019, 2020c). These studies mostly adopted the term "misinformation" as a coarse concept for any false or inaccurate information, which incorporates a broad spectrum of misinformation stories, ranging from political conspiracies to misinterpreted pranks. Although misinformation types have been theorized and categorized by practitioners (Wardle, 2017), there is, to our knowledge, no empirical research that has systematically measured these prevalent types of misinformation stories.
This paper aims to unpack the coarse concept of misinformation and structurize it to fine-grained story types (as illustrated in Figure 1). We conduct this query through an empirical lens and ask the question: what are the prevalent types of misinformation stories in the US over the last ten years?
The answer to our question is buried in archived fact-checks, which are specialized news articles that verify factual information and debunk false claims by presenting contradictory evidence (Jiang et al., 2020a). As a critical component of their semi-structured journalistic style, fact-checks often embed the (mis)information type(s) within their steps of reasoning. For example, consider the following snippet from a Snopes.com fact-check with a verdict of false (Evon, 2019): "...For instance, some started sharing a doctored photograph of Thunberg with alt-right boogeyman George Soros (the original photograph featured former Vice President Al Gore)..." The key phrase doctored photograph in the snippet identifies the misinformation type of the fact-checked story. Additional example phrases are highlighted in Figure 1. With a large corpus of fact-checks, these phrases would accumulate and reveal prevalent types of misinformation stories.
Extracting these phrases is a computational task. Our intuition is that such phrases in a fact-check also act as rationales that determine the verdict of the fact-check. In the previous example, the verdict is false in part because the story contains a doctored photograph. Therefore, a neural model that predicts the verdict of a fact-check would also use the misinformation types as rationales.
To realize this intuition, we experiment with existing rationalized neural models to extract these phrases (Lei et al., 2016; Jain et al., 2020), and, to target specific kinds of rationales, we additionally propose to include domain knowledge as weak supervision in the rationalizing process. Using public datasets for validation (Zaidan et al., 2007; Carton et al., 2018), we evaluate the performance variation of different rationalized models, and show that including domain knowledge consistently improves the quality of extracted rationales.
After selecting the most appropriate method, we conduct an empirical investigation of prevalent misinformation types. Using archived fact-checks from Snopes.com, spanning from its founding in 1994 to 2021, we extract rationales by applying the selected model with theorized misinformation types as weak supervision (Wardle, 2017), and then cluster rationales based on their semantic similarity to summarize prevalent misinformation types. We identify ten types of misinformation stories, a preview of which is shown in Figure 1.
Using our derived lexicon of these clustered misinformation stories, we then explore the evolution of misinformation types over the last ten years.
Our key findings include: increased prevalence of conspiracy theories, fabricated content, and digital manipulation; and decreased prevalence of legends and tales, pranks and jokes, mistakes and errors, etc. We also conducted two case studies on notable events that involve grave misinformation. From the case study of US presidential elections, we observe that the most prevalent misinformation type for both the 2016 and 2020 elections is fabricated content, while the 2016 election has more hoaxes and satires. From the case study of pandemics, our results show that the H1N1 pandemic in 2009 has more legends and tales, while the COVID-19 pandemic attracts more conspiracy theories.
The code and data used in the paper are available at: https://factcheck.shanjiang.me.

Related Work
There is a rich literature that has studied the online misinformation ecosystem from multiple perspectives (Del Vicario et al., 2016; Lazer et al., 2018). Within the computational linguistics community, from an audiences' perspective, Jiang and Wilson (2018) found that social media users expressed different linguistic signals when responding to false claims, and the authors later used these signals to model and measure (dis)beliefs in (mis)information (Jiang et al., 2020b; Metzger et al., 2021). From a platforms' perspective, researchers have assisted platforms in designing novel misinformation detection methods (Wu et al., 2019; Lu and Li, 2020; Lee, 2018, 2020), as well as audited existing misinformation intervention practices (Robertson et al., 2018; Jiang et al., 2019, 2020c; Hussein et al., 2020).
In this work, we study another key player in the misinformation ecosystem, storytellers, and investigate the prevalent types of misinformation told to date. From the storytellers' perspective, Wardle (2017) theorized several potential misinformation types (e.g., satire or parody, misleading content, and false connection), yet no empirical evidence has been connected to this typology. Additionally, researchers have investigated specific types of misinformation as case studies, e.g., state-sponsored disinformation (Starbird et al., 2019; Wilson and Starbird, 2020), fauxtography (Zannettou et al., 2018; Wang et al., 2021), and conspiracy theories (Samory and Mitra, 2018; Phadke et al., 2021). In this paper, we aim to structurize these misinformation stories into theorized or novel types.

Rationalized Neural Models
Realizing our intuition (as described in § 1) requires neural models to (at least shallowly) reason about predictions. In this section, we introduce existing rationalized neural models and propose to include domain knowledge as weak supervision in the rationalizing process. We then experiment with public datasets and lexicons for evaluation.

Problem Formulation
In a standard text classification problem, each instance is of the form (x, y). x = [x_i] is the input token sequence of length l, where V_x is the vocabulary of the input and i indexes each token x_i. y ∈ {0, 1}^m is the binary label of length m. Rationalization requires a model to output the prediction ŷ together with a binary mask z = [z_i] ∈ {0, 1}^l over the input length l, indicating which tokens are used (i.e., z_i = 1) to make the decision. These tokens are called rationales.
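As a toy illustration of this formulation (with hypothetical tokens and a hypothetical mask, not output from any trained model), the binary mask z simply selects the rationale tokens from x:

```python
def extract_rationales(x, z):
    """Return the tokens x_i with z_i = 1 (the rationales)."""
    assert len(x) == len(z)
    return [tok for tok, zi in zip(x, z) if zi == 1]

# hypothetical input sequence and mask
x = ["a", "doctored", "photograph", "of", "Thunberg"]
z = [0, 1, 1, 0, 0]
print(extract_rationales(x, z))  # ['doctored', 'photograph']
```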
Hard rationalization requires a model to directly output z. Initially proposed by Lei et al. (2016), the model first passes the input x to a tagger module and samples a binary mask z from a Bernoulli distribution, i.e., z ∼ Tagger(x), and then uses only unmasked tokens to make a prediction of y, i.e., ŷ = Predictor(z, x). The loss function of this method contains two parts. The first part is a standard loss for the prediction, L_y(ŷ, y), which can be realized using a common classification loss, e.g., cross entropy. The second part is a loss L_z(z) that regularizes z and encourages conciseness and contiguity of rationale selection, formulated by Lei et al. (2016). Recent work proposed to improve the initial model with an adversarial component (Yu et al., 2019; Carton et al., 2018). Combining these parts, the model is trained end-to-end using reinforce-style estimation (Williams, 1992), as sampling rationales is a non-differentiable computation. The modules of hard rationalization are illustrated in Figure 2.

Figure 2: Hard rationalization is an end-to-end model that first uses input x to generate rationales z, and then uses unmasked tokens to predict y. Soft rationalization is a three-phased model that first uses input x to predict y and outputs importance scores s, then binarizes s to rationales z, and finally uses unmasked tokens to predict y again as an evaluation of faithfulness.
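The non-differentiable sampling step z ∼ Tagger(x) can be sketched as below; the per-token keep-probabilities here are made up for illustration, standing in for the output of a real neural tagger:

```python
import random

def sample_rationale_mask(keep_probs, rng):
    """Sample z_i ~ Bernoulli(p_i) per token, as in z ~ Tagger(x).
    This sampling is what forces reinforce-style estimation."""
    return [1 if rng.random() < p else 0 for p in keep_probs]

rng = random.Random(0)
# probabilities of exactly 1.0 and 0.0 make the sample deterministic
print(sample_rationale_mask([1.0, 0.0, 1.0], rng))  # [1, 0, 1]
```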
Soft rationalization, in contrast, allows a model to first output a continuous version of importance scores s = [s_i] ∈ R^l, and then binarize it to get z. Initially formalized by Jain et al. (2020) as a multi-phase method, the model first conducts a standard text classification using a supporter module, ŷ = Supporter(x), and outputs importance scores s, then binarizes s using a tagger module, i.e., z = Tagger(s), and finally uses only the unmasked tokens of x to make another prediction ŷ to evaluate the faithfulness of the selected rationales. These three modules are trained separately in three phases. Since the supporter and predictor are standard text classification modules, the only loss needed is for the prediction, L_y(ŷ, y). This method is more straightforward than the hard rationalization method, as it avoids non-differentiable computations and the instability induced by reinforce-style estimation. The modules of soft rationalization are also illustrated in Figure 2.
The popular attention mechanism (Bahdanau et al., 2014) provides built-in access to s. Although there have been debates on the properties achieved by attention-based explanations (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019; Serrano and Smith, 2019), rationales extracted by straightforward rules on attention weights were demonstrated to be comparable to human-generated rationales (Jain et al., 2020). Additionally, in our use case we only need the rationales themselves as key phrases and do not require them to faithfully predict y; therefore, the last predictor module can be omitted.

Domain Knowledge as Weak Supervision
Both hard and soft rationalization methods can be trained with or without supervision w.r.t. rationales z (DeYoung et al., 2020) 6 . When rationales are selected in an unsupervised manner, the model would intuitively favor rationales that are most informative to predict the corresponding label as a result of optimizing the loss function. This could result in some undesirable rationales in our case: for example, certain entities like "COVID-19" or "Trump" that are highly correlated with misinformation would be selected as rationales even though they do not suggest any misinformation types. Therefore, we propose to weakly supervise 7 the rationalizing process with domain knowledge to obtain specific, desired types of rationales.
Assuming a lexicon with vocabulary V_d as domain knowledge, we reprocess the input and generate weak labels for rationales z_d = [z_d,i] ∈ {0, 1}^l, where z_d,i = 1 (i.e., unmasked) if x_i ∈ V_d and z_d,i = 0 (i.e., masked) otherwise. Then, we include an additional loss item, L_d(z, z_d) or L_d(s, z_d), for the hard or soft rationalization method, respectively.
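Generating the weak labels z_d amounts to a lexicon lookup over the input tokens; a minimal sketch, with a three-word toy stand-in for the real vocabulary V_d:

```python
def weak_rationale_labels(tokens, lexicon):
    """z_d[i] = 1 (i.e., unmasked) if token i is in V_d, else 0 (masked)."""
    lex = {w.lower() for w in lexicon}
    return [1 if tok.lower() in lex else 0 for tok in tokens]

V_d = {"doctored", "hoax", "satire"}  # toy stand-in for the real lexicon
tokens = ["a", "doctored", "photograph"]
print(weak_rationale_labels(tokens, V_d))  # [0, 1, 0]
```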
Combining the loss items together, the objective for the end-to-end hard rationalization model is:

min_θ L_y(ŷ, y) + λ_z L_z(z) + λ_d L_d(z, z_d)

where θ contains the parameters to estimate and the λ_(·) are hyperparameters weighting the loss items.
Similarly, the objective function for the first phase of soft rationalization is:

min_θ L_y(ŷ, y) + λ_d L_d(s, z_d)

6 They are trained with supervision w.r.t. the label y. 7 Since there is inherently no ground-truth of misinformation types in fact-check articles.
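As a numerical sketch of the phase-one objective, with L_d realized as a token-level binary cross entropy between the importance scores s and the weak labels z_d (one plausible choice of loss, not necessarily the exact implementation):

```python
import math

def weak_supervision_loss(s, z_d, eps=1e-9):
    """L_d(s, z_d): token-level binary cross entropy between
    importance scores s and weak rationale labels z_d."""
    return -sum(z * math.log(si + eps) + (1 - z) * math.log(1 - si + eps)
                for si, z in zip(s, z_d)) / len(s)

def soft_objective(loss_y, s, z_d, lambda_d):
    """L_y(y_hat, y) + lambda_d * L_d(s, z_d)."""
    return loss_y + lambda_d * weak_supervision_loss(s, z_d)

# scores that agree with the weak labels incur (near) zero extra loss
print(soft_objective(loss_y=0.69, s=[0.9, 0.1], z_d=[1, 0], lambda_d=0.1))
```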

Experiments on Public Datasets
We conduct experiments on public datasets to evaluate the performance of hard and soft rationalization methods, particularly for our needs, and confirm that including domain knowledge as weak supervision helps with the rationalizing process.
Datasets selection. An ideal dataset for our models should meet the following requirements: (a) formulated as a text classification problem, (b) annotated with human rationales, and (c) can be associated with high quality lexicons to obtain domain knowledge. We select two datasets based on these criteria: the movie reviews dataset released by Pang et al. (2002) and later annotated with rationales by Zaidan et al. (2007), which contains 2K movie reviews labeled with positive or negative sentiments; and the personal attacks dataset released by Wulczyn et al. (2017) and later annotated with rationales by Carton et al. (2018), which contains more than 100K Wikipedia comments labeled as personal attacks or not.
Domain knowledge. For the sentiment analysis on movie reviews, we use the EmoLex lexicon released by Mohammad and Turney (2013), which contains vocabularies of positive and negative sentiments. For identifying personal attacks, we use a lexicon released by Wiegand et al. (2018), which contains a vocabulary of abusive words. With corresponding vocabularies, we generate weak rationale labels z d for each dataset.
Evaluation metrics. We choose binary precision Pr(z) to evaluate the quality of extracted rationales, because (a) a perfect recall can be trivially achieved by selecting all tokens as rationales, and (b) our case of identifying key phrases requires concise rationales. Additionally, we measure the average percentage of selected rationales over the input length, %(z). For predictions, we use macro F_1(y) as the evaluation metric, as well as the percentage of information used to make the prediction, %(x).
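The two rationale metrics are straightforward to compute from binary masks; a sketch over hypothetical predicted and gold masks:

```python
def rationale_precision(z_pred, z_true):
    """Pr(z): fraction of selected tokens that are true rationales."""
    selected = sum(z_pred)
    if selected == 0:
        return 0.0
    hits = sum(1 for p, t in zip(z_pred, z_true) if p == 1 and t == 1)
    return hits / selected

def pct_selected(z):
    """%(z): fraction of input tokens selected as rationales."""
    return sum(z) / len(z)

z_pred, z_true = [1, 1, 0, 0], [1, 0, 1, 0]
print(rationale_precision(z_pred, z_true))  # 0.5
print(pct_selected(z_pred))                 # 0.5
```

Note that selecting every token drives recall to 1.0 while leaving Pr(z) at the base rate, which is why precision is the informative metric here.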
Experimental setup and results. The train, dev, and test sets are pre-specified in the public datasets. We optimize hyperparameters for F_1(y) on the dev sets, and only evaluate rationale quality Pr(z) after a model is decided. We discuss additional implementation details (e.g., hyperparameters, loss functions, module cells) in Appendix § A.

Table 1: Evaluation results for hard and soft rationalization methods on the movie reviews (Zaidan et al., 2007) and personal attacks (Carton et al., 2018) datasets. Our experiments show that: (a) hard rationalization requires a sensitive hyperparameter λ_z to regularize rationales (h_2 to h_0); (b) soft rationalization achieves the best F_1(y) overall, but Pr(z) depends on the rationale extraction approach (s_2/s_3 to s_0); (c) domain knowledge as weak supervision improves Pr(z) for both hard (h_1 to h_0) and soft (s_1 to s_0) rationalization while maintaining similar %(z) and F_1(y); (d) soft rationalization achieves better Pr(z) in a fair comparison (s_1 to h_1).
The evaluation results for all our experiments on the test sets are reported in Table 1, indexed with h_0–h_3 and s_0–s_3. We report the evaluation results on the dev sets in Appendix § B.
Regularization for hard rationalization. h_0 and h_2 are our re-implementations of Lei et al. (2016), varying the rationale regularization hyperparameter λ_z. Our experiments show that λ_z is a crucial choice. When a small λ_z is chosen (i.e., rationales are under-regularized), the model has a tendency to utilize all the available information to optimize predictive accuracy. In h_2, we set λ_z = 0 and the model selects 99.9% of tokens as rationales while achieving the best F_1(y) overall, which is an undesirable outcome in our case. Therefore, we increase λ_z so that only small parts of the tokens are selected as rationales in h_0. However, echoing Jain et al. (2020), the output when varying λ_z is sensitive and unpredictable, and searching for this hyperparameter is both time-consuming and energy-inefficient. We also run an experiment h_3 with the additional adversarial component proposed in (Carton et al., 2018; Yu et al., 2019), and the evaluation metrics are not consistently improved compared to h_0.
Binarization for soft rationalization. s_0, s_2, and s_3 are our re-implementations of Jain et al. (2020). For soft rationalization, rationales are selected (i.e., binarized) after the supporter module is trained in phase one; therefore, s_0–s_3 utilize 100% of the tokens by default and achieve the best F_1(y) overall. We implement a straightforward approach to select rationales by setting a threshold t and making z_i = 1 (i.e., unmasked) if the importance score s_i > t and z_i = 0 (i.e., masked) otherwise. Intuitively, increasing t corresponds to fewer selected rationales, and therefore increases Pr(z). To confirm, in s_2, we increase t until %(z) is exactly half of s_0. Similarly, decreasing t corresponds to more selected rationales, and therefore decreases Pr(z). In s_3, we decrease t until %(z) is exactly double that of s_0.
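The thresholding step can be sketched as below; `threshold_for_budget` is a hypothetical helper (not named in the paper) showing one simple way to pick t so that a target fraction of tokens is selected:

```python
def binarize(s, t):
    """z_i = 1 if importance score s_i > t, else 0."""
    return [1 if si > t else 0 for si in s]

def threshold_for_budget(s, frac):
    """Select (roughly) a fraction `frac` of tokens by keeping
    the top-k importance scores, k = round(frac * len(s))."""
    k = max(1, round(frac * len(s)))
    kth = sorted(s, reverse=True)[k - 1]
    return [1 if si >= kth else 0 for si in s]

s = [0.05, 0.40, 0.10, 0.90]
print(binarize(s, 0.2))               # [0, 1, 0, 1]
print(threshold_for_budget(s, 0.25))  # [0, 0, 0, 1]
```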
Is domain knowledge helpful? h_1 and s_1 include domain knowledge as weak supervision. Our results show that domain knowledge improves Pr(z) for both hard (h_1 to h_0) and soft (s_1 to s_0) rationalization methods and on both datasets, while maintaining similar %(z) and F_1(y). The improvements are more substantial for soft rationalization.
Hard vs. soft rationalization. To fairly compare hard and soft rationalization methods, we choose the threshold t to keep %(z) the same for h_1 and s_1. Our experiments show that soft rationalization weakly supervised by domain knowledge achieves better Pr(z) on both datasets, and therefore we chose it for rationalizing fact-checks.

Rationalizing Fact-Checks
After determining that soft rationalization is the most appropriate method, we apply it to extract rationales from fact-checks. In this section, we introduce the dataset we collected from Snopes.com and conduct experiments with fact-checks to structurize misinformation stories.

Data Collection
Snopes.com is a renowned fact-checking website, certified by the International Fact-Checking Network as non-partisan and transparent (Poynter, 2018). We collect HTML webpages of fact-check articles from Snopes.com, spanning from its founding in 1994 to the beginning of 2021.
Preprocess and statistics. We first preprocess collected fact-checks by extracting the main article content and verdicts from HTML webpages using a customized parser, and tokenizing the content with NLTK (Bird, 2006). The preprocessing script is included in our released codebase.
After preprocessing, the median sequence length of fact-checks is 386 tokens, and 88.6% of fact-checks contain ≤1,024 tokens. Jiang et al. (2020a) found that the most informative content in fact-checks tended to be located at the head or the tail of the article content. Therefore, we set the maximum sequence length to 1,024 and truncate over-length fact-checks.
Next, we label each fact-check with a binary label depending on its verdict: (truthful) information if the verdict is at least mostly true and misinformation otherwise, which results in 2,513 information and 11,183 misinformation instances.
Additionally, we preemptively mask tokens that exactly match the words of its verdict (e.g., "rate it as false" to "rate it as [MASK]"), 10 otherwise predicting the verdict would be trivial and the model would copy overlapping tokens as rationales.
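A minimal sketch of this masking step, assuming tokenized input and a verdict string parsed from the structured HTML field:

```python
def mask_verdict_tokens(tokens, verdict, mask="[MASK]"):
    """Replace any token that exactly matches a word of the verdict,
    so the model cannot trivially copy the verdict as a rationale."""
    verdict_words = set(verdict.lower().split())
    return [mask if tok.lower() in verdict_words else tok for tok in tokens]

tokens = ["We", "rate", "it", "as", "false", "."]
print(mask_verdict_tokens(tokens, "False"))
# ['We', 'rate', 'it', 'as', '[MASK]', '.']
```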
Domain knowledge for misinformation types. The domain knowledge comes from two sources: (a) the misinformation types theorized by Wardle (2017), e.g., misleading or fabricated content; and (b) certain variants of verdicts from Snopes.com such as satire or scam (Snopes.com, 2021a). We combine these into a small vocabulary V d containing 12 words, listed in Appendix § A.

Experiments and Results
We randomly split the fact-checks into 80% train, 10% dev, and 10% test sets, and adjust hyperparameters to optimize F_1(y) on the dev set. For initialization, we train word embeddings using Gensim (Rehurek and Sojka, 2011) on the entire corpus. The final model achieves F_1(y) = 0.75/0.74 on the test set with/without domain knowledge.
Clustering rationales. To systematically understand extracted rationales, we cluster them based on semantic similarity. For each rationale, we average word embeddings to represent the embedding of the rationale, and then run a hierarchical clustering over these embeddings. The hierarchical clustering uses cosine similarity as the distance metric, commonly used for word embeddings (Mikolov et al., 2013), and the complete link method (Voorhees, 1986) to obtain a relatively balanced linkage tree.

10 Verdicts from Snopes.com are structured HTML fields that can be easily parsed.

Figure 3: Structure of misinformation types. The ten identified clusters (colored) offer empirical confirmation of theorized misinformation types, contain novel fine-grained clusters, and reorganize the structure of misinformation stories.
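The clustering step can be sketched with SciPy's hierarchical clustering (complete linkage over cosine distance), using tiny toy vectors in place of the real averaged rationale embeddings:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy rationale embeddings forming two tight semantic groups
emb = np.array([
    [1.00, 0.00],   # e.g., "doctored"
    [0.99, 0.05],   # e.g., "altered"
    [0.00, 1.00],   # e.g., "legend"
    [0.05, 0.99],   # e.g., "tale"
])

# complete-link clustering on cosine distance (1 - cosine similarity)
Z = linkage(emb, method="complete", metric="cosine")

# cut the dendrogram at a cosine-distance threshold to obtain clusters
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # two clusters, e.g., [1 1 2 2]
```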
The results from the clustering are shown in Figure 3. From the root of the dendrogram, we can traverse its branches to find clusters until we reach a sensible threshold of cosine distance, and categorize the remaining branches and leaf nodes (i.e., rationales) to multiple clusters. Figure 3 shows an example visualization that contains ten clusters of rationales that are semantically similar to the domain knowledge, and leaf nodes in each cluster are aggregated to plot a word cloud, with the frequency of a node encoded as the font size of the phrase.
Note that rationales extracted from soft rationalization are dependent on the chosen threshold t to binarize importance scores. The example in Figure 3 uses a threshold of t = 0.01. Varying the threshold would affect extracted rationales but mostly the ones with low prevalence, and these rare rationales also correspond to small font sizes in the word cloud. Therefore, the effect from varying t would be visually negligible in Figure 3.
Structure of misinformation stories. We make the following observations from the ten clusters of misinformation types identified in Figure 3.
First, the clusters empirically confirm existing domain knowledge in V_d. Certain theorized misinformation types, such as satires and parodies from Wardle (2017), are identified as individual clusters from fact-checks.
Second, the clusters complement V_d with additional phrases describing (semantically) similar misinformation types. For example, our results add "humor" and "gossip" to the same category as satires and parodies, and add "tales" and "lore" to the same category as legends. This helps us grasp the similarity between misinformation types, and also enriches the lexicon V_d, which proves useful for subsequent analysis in § 5.
Third, we discover novel, fine-grained clusters that are not highlighted in V_d. There are multiple possible explanations as to why these misinformation types form their own clusters. Conspiracy theories are often associated with intentional political campaigns (Samory and Mitra, 2018), which can affect their semantics when referenced in fact-checks. In contrast, digital alteration is a relatively recent misinformation tactic that has been enabled by technological developments such as FaceSwap (Korshunova et al., 2017) and DeepFake (Westerlund, 2019). Hoaxes and pranks often have a mischievous intent that distinguishes them from other clusters. Other new clusters include clickbait with inflammatory and sensational language, and entirely fictional content.

Fourth, the clusters reorganize the structure of these misinformation types based on their semantics. For example, fabricated and misleading content belong to two types of misinformation in (Wardle, 2017), while in our results they are clustered together. This suggests that the semantic distance between fabricated and misleading content is less than the chosen similarity threshold, at least when these misinformation types are referred to by fact-checkers when writing articles.
Finally, the remaining words in V d are also found in our rationales. However, due to low prevalence, they are not visible in Figure 3 and do not form their own clusters.

Evolution of Misinformation
In this section, we leverage the clusters of misinformation types identified by our method as a lexicon and apply it back to our original fact-check dataset. Specifically, we analyze the evolution of misinformation types over the last ten years and compare misinformation trends around major real-world events.
Evolution over the last ten years. We first explore the evolution of misinformation over time. We map each fact-check article to one or more corresponding misinformation types identified by our method, and then aggregate fact-checks by year, from before 2010 to the end of 2020, to estimate the relative ratio of each misinformation type.
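The aggregation described above can be sketched as follows, over a hypothetical mini-corpus of (year, types) pairs:

```python
from collections import Counter, defaultdict

def yearly_type_ratios(fact_checks):
    """fact_checks: iterable of (year, [misinformation types]) pairs.
    Returns {year: {type: relative ratio among that year's type mentions}}."""
    counts = defaultdict(Counter)
    for year, types in fact_checks:
        counts[year].update(types)
    return {year: {t: c / sum(cnt.values()) for t, c in cnt.items()}
            for year, cnt in counts.items()}

# hypothetical mini-corpus
fcs = [(2016, ["hoax"]), (2016, ["hoax", "satire"]), (2020, ["conspiracy"])]
print(yearly_type_ratios(fcs))
```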
As shown in Figure 4, the prevalence of certain misinformation types on Snopes.com has changed drastically over the last ten years.
Heavily politicized misinformation types, such as digitally altered or doctored images or photographs, fabricated and misleading content, and conspiracy theories, have nearly doubled in relative ratio over the last ten years. In contrast, the prevalence of (arguably) less politicized stories, such as legends and tales, hoaxes and pranks, and mistakes and errors, has decreased.
These trends may be a proxy for the underlying prevalence of different misinformation types within the US. Alternatively, these trends may reflect shifts in Snopes.com's priorities. The website, launched in 1994, was initially named Urban Legends Reference Pages. Since then it has grown to encompass a broad spectrum of subjects. Due to its limited resources, fact-checkers from Snopes.com only cover a subset of online misinformation, and their priority is to "fact-check whatever items the greatest number of readers are asking about or searching for at any given time" (Snopes.com, 2021b). 13 Given the rising impact of political misinformation in recent years (Zannettou et al., 2019, 2020), such misinformation could reach an increasing number of Snopes.com readers, and therefore the website may dedicate more resources to fact-checking related types of misinformation. Additionally, Snopes.com has established collaborations with social media platforms, e.g., Facebook (Green and Mikkelson), to specifically target viral misinformation circulating on these platforms, where the rising meme culture could also attract Snopes.com's attention and therefore explain a surge of digitally altered images (Ling et al., 2021; Wang et al., 2021).

13 Users can submit a topic to Snopes.com on its contact page (Snopes.com, 2021c), the results from which may affect Snopes.com's priorities.

2016 vs. 2020 US presidential election. We now compare misinformation types between the 2016 and 2020 elections. To filter for relevance, we constrain our analysis to fact-checks that (1) were published in the election years and (2) included the names of the presidential candidates and/or their running mates (e.g., "Joe Biden" and "Kamala Harris"). This results in 2,586 fact-checks for the 2016 election and 2,436 fact-checks for the 2020 election.
The prevalence of each misinformation type is shown in Figure 5. We observe that the relative ratios of many misinformation types are similar between the two elections, e.g., legends and tales and bogus scams, while the 2016 election has more hoaxes, satires, etc. The most prevalent type during both elections is fabricated and misleading content, followed by conspiracy theories.

H1N1 vs. COVID-19. Finally, we compare misinformation types between the H1N1 pandemic in 2009 and the COVID-19 pandemic. For H1N1-related fact-checks, we search for the keywords "flu", "influenza", and "H1N1" in fact-checks and constrain the publication date to the end of 2012. For COVID-19-related fact-checks, we search for the keywords "COVID-19" and "coronavirus", and only consider fact-checks published in 2019 or later. This results in 833 fact-checks for the H1N1 pandemic and 656 fact-checks for COVID-19.
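The keyword-and-date relevance filter described above can be sketched as follows; the record layout (`year`, `text` fields) is a hypothetical simplification of the real parsed fact-checks:

```python
def filter_event(fact_checks, keywords, start_year, end_year):
    """Select fact-checks published in [start_year, end_year]
    whose text mentions any of the keywords (case-insensitive)."""
    kws = [k.lower() for k in keywords]
    return [fc for fc in fact_checks
            if start_year <= fc["year"] <= end_year
            and any(k in fc["text"].lower() for k in kws)]

# hypothetical records
fcs = [
    {"year": 2009, "text": "A chain email about the H1N1 flu vaccine..."},
    {"year": 2020, "text": "A claim that coronavirus was engineered..."},
    {"year": 2016, "text": "A doctored photo of a candidate..."},
]
print(len(filter_event(fcs, ["flu", "influenza", "H1N1"], 2009, 2012)))  # 1
```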
The relative ratio of each misinformation type is also shown in Figure 5. We observe that the prevalence of some misinformation types is significantly different between the two pandemics.

Discussion
In this section, we discuss limitations of our work and future directions, and finally conclude.
Limitations and future directions. We adopted a computational approach to investigate our research question, and this method inherently shares common limitations with observational studies, e.g., being prone to bias and confounding (Benson and Hartz, 2000). Specifically, our corpus contains fact-checks from Snopes.com, one of the most comprehensive fact-checking agencies in the US.
Snopes.com covers a broader spectrum of topics than politics-focused fact-checkers (e.g., PolitiFact.com, FactCheck.org), and thus we argue that it covers a representative sample of misinformation within the US. However, Snopes.com may not be representative of the international misinformation ecosystem (Ahinkorah et al., 2020; Kaur et al., 2018; Fletcher et al., 2018). In the future, we hope that our method can help characterize misinformation comparatively on a global scale when more structured fact-checks become available.

Additionally, fact-checkers are time constrained, and thus the misinformation stories they cover tend to be high-profile. Therefore, low-prevalence, long-tail misinformation stories may not be observed in our study. Understanding low-volume misinformation types may require a different collection of corpora other than fact-checks, e.g., a cross-platform investigation of social media conversations (Wilson and Starbird, 2020; Abilov et al., 2021).

Lastly, the misinformation types we extract from our weakly supervised approach are not validated with ground-truth labels. This is largely due to the lack of empirical knowledge on misinformation types, and therefore we are unable to provide specific guidance to annotators. Although the clusters in Figure 3 provide a straightforward structure of misinformation stories, in future work, we plan to leverage these results to construct annotation guidelines and obtain human-identified misinformation types for further analysis.
Conclusion. In this paper, we identify ten prevalent misinformation types with rationalized models on fact-checks and analyze their evolution over the last ten years and between notable events. We hope that this paper offers an empirical lens to the systematic understanding of fine-grained misinformation types, and complements existing work investigating the misinformation problem.

Ethical Considerations
This paper uses Snopes.com fact-checks to train and validate our models, and also includes several quotes and snippets of fact-checks.
We consider our case a fair use under US copyright law (Title 17 of the U.S. Code), which permits limited use of copyrighted material without the need for permission from the copyright holder.
According to 17 U.S.C. § 107, we discuss how our research abides by the principles that are considered in a fair-use judgment:
• Purpose and character of the use: we use fact-checks for noncommercial research purposes only; additionally, using textual content for model training is considered transformative, cf. Authors Guild, Inc. v. Google, Inc. (2013, 2015, 2016).
• Amount and substantiality: we present only snippets of fact-checks for illustrative purposes in our paper (i.e., several quotes and snippets in text and figures), and only URLs to the original fact-checks in our public dataset.
• Effect upon the work's value: we do not identify any adverse impact our work may have on the potential market (e.g., ads, memberships) of the copyright holder.
The end goal of our research aligns with that of Snopes.com, i.e., to rebut misinformation and to restore credibility to the online information ecosystem. We hope the aggregated knowledge of fact-checks from our models can shed light on this effort and be a helpful addition to the literature.

A Implementation Details
In this section, we discuss additional implementation details that we omitted in the main paper.
Loss functions. For the predictive loss $L_y(\hat{y}, y)$, we use a common cross-entropy loss function. For the rationale regularization loss $L_z(z)$, we introduced it as a single term in the main paper for simplicity, but it actually contains two parts as implemented by Yu et al. (2019). The first part encourages conciseness:
$$L_{zk}(z) = \max\Big(0, \sum_i z_i - k\Big),$$
where $\sum_i z_i$ represents the number of selected tokens, and $k$ is a hyperparameter defining a loss-free upper bound for it. The second part encourages contiguity:
$$L_{zl}(z) = \max\Big(0, \sum_i |z_i - z_{i-1}| - l\Big),$$
where $|z_i - z_{i-1}|$ denotes a transition between $z_i = 0$ and $z_{i-1} = 1$ or vice versa; therefore $\sum_i |z_i - z_{i-1}|$ counts the number of rationale phrases, and $l$ is another hyperparameter defining a loss-free upper bound for it.
Combining these two parts, we can further specify $\lambda_z L_z(z)$ as $\lambda_{zk} L_{zk}(z) + \lambda_{zl} L_{zl}(z)$.
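The two regularization terms above can be sketched in plain Python. This is our own minimal illustration, not the authors' implementation (which operates on differentiable relaxations of the binary mask); the function name and padding convention are ours:

```python
def rationale_regularization(z, k, l):
    """Sparsity and contiguity penalties on a binary rationale mask z
    (1 = token selected); k and l are loss-free upper bounds."""
    # Conciseness term: L_zk = max(0, sum_i z_i - k)
    loss_zk = max(0, sum(z) - k)
    # Contiguity term: L_zl = max(0, sum_i |z_i - z_{i-1}| - l),
    # treating z_{-1} = 0 so a mask that starts with 1 counts a transition.
    padded = [0] + list(z)
    transitions = sum(abs(a - b) for a, b in zip(padded[1:], padded[:-1]))
    loss_zl = max(0, transitions - l)
    return loss_zk, loss_zl
```

For example, a mask selecting four tokens split across two phrases, with `k = 2` and `l = 2`, is penalized for both exceeding the token budget and fragmenting the rationale.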
For domain knowledge weak supervision, we define $L_d(z, z^d)$ as:
$$L_d(z, z^d) = -\sum_i z_i \, z_i^d,$$
which decreases the loss by 1 if both $z_i = 1$ and $z_i^d = 1$, i.e., when a token in the domain knowledge vocabulary $V_d$ is selected, and has no effect on the loss otherwise. Similarly, we define $L_d(s, z^d)$ as:
$$L_d(s, z^d) = -\sum_i s_i \, z_i^d,$$
which decreases the loss by $s_i$ if $z_i^d = 1$, and has no effect on the loss if $z_i^d = 0$. This encourages training to increase the importance score $s_i$ on domain knowledge tokens to reduce the loss.
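Both domain-knowledge terms reduce to the same negative inner product, applied either to hard selections $z$ or to soft importance scores $s$. A minimal sketch under our own naming (not the paper's code):

```python
def domain_knowledge_loss(weights, z_d):
    """Weak-supervision reward for covering domain-knowledge tokens.

    weights: hard selections z_i in {0, 1} or soft importance scores s_i.
    z_d:     binary indicators that each token is in the domain vocabulary V_d.
    Returns -sum_i weights_i * z_d_i, so the loss drops when selected tokens
    (or high importance scores) coincide with domain-knowledge tokens."""
    return -sum(w * d for w, d in zip(weights, z_d))
```

With hard selections, each correctly selected domain token lowers the loss by exactly 1; with soft scores, the reduction equals the score mass placed on domain tokens.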
With this implementation, there are five hyperparameters to search for the hard rationalization method: λ zk , k, λ zl , l and λ d , and only one hyperparameter to search for the soft rationalization method: λ d .
Module cells. Each module in the soft and hard rationalization methods can be implemented with different neural cells. Here, we consider two common choices: RNN cells, e.g., LSTM, and transformer cells (Vaswani et al., 2017), e.g., BERT (Devlin et al., 2019).
For hard rationalization, the rationale selection process is actively regularized by L z (z), therefore we simply choose the cell type that optimizes F 1 (y) on dev sets, i.e., transformers.
For soft rationalization, the rationale selection process is based on passively generated importance scores (i.e., attention), and therefore the inherent behavioral difference between RNN and transformer cells significantly impacts our choice. In our experiments, we observe that transformer cells often assign strong importance to a single token but near-zero weights to its neighboring tokens (possibly as a result of the multi-head attention mechanism), while RNN cells assign strong importance to a single token along with residual, fading weights to its neighbors.
Consider the following example, which shows the distribution of importance scores generated by transformer cells, with darker text representing higher importance scores and lighter text scoring near zero. Here, only the token conspiracy is selected as a rationale:
"...Furthermore, claims that COVID-19 was "manufactured," or that it "escaped from" this Chinese lab, are nothing more than baseless conspiracy theories..."
In contrast, the following example shows the distribution of importance scores generated by RNN cells for the same snippet: the token conspiracy has the strongest importance score, but its neighboring tokens are also assigned weights above the threshold, and therefore the phrase baseless conspiracy theories is selected as the rationale:
"...Furthermore, claims that COVID-19 was "manufactured," or that it "escaped from" this Chinese lab, are nothing more than baseless conspiracy theories..."
As we prefer to obtain phrases (i.e., one or more tokens) as rationales, we choose RNN cells. After optimizing F1(y) on the dev sets, we choose a bidirectional LSTM initialized with GloVe embeddings (Pennington et al., 2014) for the soft rationalization method.
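The thresholding step described above, turning per-token importance scores into contiguous rationale phrases, can be sketched as follows. The function name and threshold value are ours for illustration; the paper does not specify this exact procedure:

```python
def extract_rationale_phrases(tokens, scores, threshold):
    """Group consecutive tokens whose importance score meets the threshold
    into rationale phrases (contiguous spans of one or more tokens)."""
    phrases, current = [], []
    for tok, score in zip(tokens, scores):
        if score >= threshold:
            current.append(tok)          # extend the current phrase
        elif current:
            phrases.append(" ".join(current))  # close the phrase at a gap
            current = []
    if current:                          # flush a phrase ending at the last token
        phrases.append(" ".join(current))
    return phrases
```

Under RNN-style scores, neighboring tokens of a peak also clear the threshold, so a multi-token phrase such as "baseless conspiracy theories" is recovered; under transformer-style scores, isolated peaks yield single-token rationales.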
Hyperparameters. As discussed in the paper, we optimize hyperparameters for F 1 (y) on the dev sets.
Since the dev sets are relatively small in our experiments, a rigorous grid search for hyperparameters might overfit to a few instances in the dev set; therefore, we tune the hyperparameters manually, starting from the hyperparameters released by Yu et al. (2019) and Carton et al. (2018).
For fact-checks, the best-performing model for soft rationalization uses λ d = 1.0.

B Additional Results
In this section, we record additional results from our experiments that we omitted in the main paper.
Validation performance. The evaluation results for all our experiments on both test and dev sets are reported in Table 2. We also include the accuracy metric Ac(y) in the table (our public dataset has balanced positive and negative labels; therefore Ac(y) = F1(y)), and the evaluation results for fact-checks. Note that the evaluation for z is empty for fact-checks, since there are no ground-truth rationales. As shown in Table 2, the results on dev sets align with our findings on test sets discussed in the main paper.
Model size, computing machine and runtime. The number of parameters is 325K for hard rationalization models, and 967K for soft rationalization models. All experiments were conducted on a 12GB Nvidia Titan X GPU node, and finished training within an hour per experiment.

C Rationale Examples
In this section, we list additional examples of extracted rationales for ten identified misinformation types.
For urban legends and tales: "...the 1930 Colette short story La Chienne (The Bitch) has become an urban legend in that its plot is often now related as a string of events that..."
For altered or doctored images: "...magazine covers of "highest paid" people. These doctored images have featured celebrities such as John Legend, Chuck Norris, Bob Dylan, Susan Boyle, and..."
For hoaxes and pranks: "...This meme is a hoax. Nobody is (or was) licking toilets as a form of protest against Donald Trump. The images shown in the meme were taken from..."
For bogus scams: "...In October 2019, we came across a decidedly bizarre version of the scam. This time, Nigerian astronaut Abacha Tunde was reportedly stuck in space and..."
For mistakes and errors: "...noted that reports of missing children (which are typically resolved quickly) are often mistakenly confused by the public with relatively rare instances of..."
For fabricated content: "...The Neon Nettle report was "unusual" because it was completely fabricated: Bono said nothing during his Rolling Stone interview about "colluding with elites"..."
For baseless conspiracies: "...Furthermore, claims that COVID-19 was "manufactured," or that it "escaped from" this Chinese lab, are nothing more than baseless conspiracy theories..."
For satires and parodies: "...This item was not a factual recounting of real-life events. The article originated with a website that describes its output as being humorous or satirical in nature..."
For fictitious content: "...However, both of these shocking quotes, along with the rest of the article in which they are found, are completely fictitious. As the name of the web site implies..."
For sensational clickbait: "...And Breitbart regurgitated some of the pictures as viral clickbait under the headline "Armed Black Panthers Lobby for Democrat Gubernatorial Candidate Stacey Abrams"..."