Missing Counter-Evidence Renders NLP Fact-Checking Unrealistic for Misinformation

Misinformation emerges in times of uncertainty when credible information is limited. This is challenging for NLP-based fact-checking as it relies on counter-evidence, which may not yet be available. Despite increasing interest in automatic fact-checking, it is still unclear if automated approaches can realistically refute harmful real-world misinformation. Here, we contrast and compare NLP fact-checking with how professional fact-checkers combat misinformation in the absence of counter-evidence. In our analysis, we show that, by design, existing NLP task definitions for fact-checking cannot refute misinformation as professional fact-checkers do for the majority of claims. We then define two requirements that the evidence in datasets must fulfill for realistic fact-checking: It must be (1) sufficient to refute the claim and (2) not leaked from existing fact-checking articles. We survey existing fact-checking datasets and find that all of them fail to satisfy both criteria. Finally, we perform experiments to demonstrate that models trained on a large-scale fact-checking dataset rely on leaked evidence, which makes them unsuitable in real-world scenarios. Taken together, we show that current NLP fact-checking cannot realistically combat real-world misinformation because it depends on unrealistic assumptions about counter-evidence in the data.


Introduction
According to van der Linden (2022), misinformation is "false or misleading information masquerading as legitimate news, regardless of intent". Misinformation is dangerous as it can directly impact human behavior and have harmful real-world consequences, such as the Pizzagate shooting (Fisher et al., 2016), interference in the 2016 US election (Bovet and Makse, 2019), or the promotion of false COVID-19 cures (Aghababaeian et al., 2020). Surging misinformation during the COVID-19 pandemic, coined an "infodemic" by the WHO (Zarocostas, 2020), exemplifies the danger posed by misinformation. To combat misinformation, journalists from fact-checking organizations (e.g., PolitiFact or Snopes) undertake a laborious manual effort to verify claims selected based on their potential harm and prominence (Arnold, 2020). However, manual fact-checking cannot keep pace with the rate at which misinformation is posted and circulated. Automatic fact-checking has gained significant attention within the NLP community in recent years, with the goal of developing tools to assist fact-checkers in combating misinformation. Over the past few years, NLP researchers have created a wide range of fact-checking datasets with claims from fact-checking organizations' websites (Vlachos and Riedel, 2014; Wang, 2017; Augenstein et al., 2019; Hanselowski et al., 2019; Ostrowski et al., 2021; Gupta and Srikumar, 2021; Khan et al., 2022). The fundamental goal of fact-checking is, given a claim made by a claimant, to find a collection of evidence and provide a verdict about the claim's veracity based on that evidence. The underlying technique used by fact-checkers, and journalists in general, to assess the veracity of a claim is called verification (Silverman, 2016). In a comprehensive survey, Guo et al. (2022) proposed an NLP fact-checking framework (FCNLP) that aggregates existing (sub)tasks and approaches of automated fact-checking. FCNLP reflects current research trends on automatic fact-checking in NLP and divides the aforementioned process into evidence retrieval, verdict prediction, and justification production.
In this paper, we focus on harmful misinformation claims that satisfied professional fact-checkers' selection criteria and refer to them as real-world misinformation. Our goal is to answer the following research question: Can evidence-based NLP fact-checking approaches in FCNLP refute novel real-world misinformation? FCNLP assumes a system has access to counter-evidence (e.g., through information retrieval) to refute a claim. Consider the false claim "Telemundo is an English-language television network" from FEVER (Thorne et al., 2018): a system following FCNLP must find counter-evidence contradicting the claim (i.e., that Telemundo is a Spanish-language company) to refute it. This may require more complex reasoning over multiple documents. We contrast this example with the real-world false claim that "Half a million sharks could be killed to make the COVID-19 vaccine" (Figure 1). If it were true, credible sources would likely report this incident, providing supporting evidence. As it is not, there is no refuting evidence stating that COVID-19 vaccine production will not kill sharks before the claim has been fact-checked. Only after establishing that the claim relies on the false premise of COVID-19 vaccines using squalene (harvested from sharks) can it be refuted. After the claim's verification, fact-checkers publish reports explaining the verdict and thereby produce counter-evidence. Relying on counter-evidence leaked from such reports is unrealistic if a system is to be applied to new claims.
In this work, we identify gaps between current research on FCNLP and the verification process of professional fact-checkers. Via analysis from different perspectives, we argue that the assumption of the existence of counter-evidence in FCNLP is unrealistic and does not reflect real-world requirements. We hope our analysis sheds light on future research directions in automatic fact-checking. In summary, our major contributions are:
• We identify two criteria from the journalistic verification process, which allow overcoming the reliance on counter-evidence (Section 2).
• We show that FCNLP is incapable of satisfying these criteria, preventing the successful verification of most misinformation claims from the journalistic perspective (Section 3).
• We identify two evidence criteria (sufficient & unleaked) for realistic fact-checking. We find that all existing datasets in FCNLP containing real-world misinformation violate at least one criterion (Section 4) and are hence unrealistic.
• We semi-automatically analyze MULTIFC, a large-scale fact-checking dataset, to support our findings, and show that models trained on claims from PolitiFact and Snopes (via MULTIFC) rely on leaked evidence (Section 5).

2 How Humans Fact-Check
To motivate our distinct focus on misinformation, we investigate which claims professional fact-checkers verify. We crawl 20,274 fact-checked claims from PolitiFact ranging from 2007 to 2021.
Figure 2 shows the ratio of different verdicts per year. After 2016, fact-checkers increasingly selected false claims as important for fact-checking; in 2021, less than 10% of the selected claims were correct. Some claims can be refuted via counter-evidence (as required by FCNLP). For example, official statistics can contradict the false claim about the U.S. that "In the 1980s, the lowest income people had the biggest gains". If the evidence makes it impossible for the claim to be true (e.g., because of mutually exclusive statistics), we refer to the evidence as global counter-evidence. Global counter-evidence attacks the textual claim itself without relying on the reasoning and sources behind it. In contrast, to refute the claim that "COVID-19 vaccines may kill sharks" (Figure 1), fact-checkers did not rely on global counter-evidence specifically proving that sharks will not be killed to produce COVID-19 vaccines; nor is it plausible that such counter-evidence exists. Here, the counter-evidence is bound to the claim's underlying (false) reasoning. The claim is only refuted because it follows the false assumption, not because it was disproved. The absence of global counter-evidence is not an exceptional problem for this specific claim but is common among misinformation: misinformation surges when a high demand for information cannot be met with a sufficient supply of credible answers (Silverman, 2014; FullFact, 2020). Non-credible and possibly false and harmful information fills these deficits of credible information (Golebiewski and Boyd, 2019; Shane and Noel, 2020). The very existence of misinformation often builds on the absence of credible counter-evidence, which, in turn, is essential for FCNLP.

Table 1: Example misinformation claims for source guarantee.
Claim | Based Upon
(1) If you were forced to use a Sharpie to fill out your ballot, that is voter fraud. | false assumption
(2) The Biden administration will begin "spying" on bank and cash app accounts starting 2022. | tax legislation
(3) Barcelona terrorist is cousins with former President Barack Obama. | satire article
(4) The Democratic health care plan is a government takeover of our health programs. | health care plan
(5) People in Holland protest against COVID-19 measures. | protests event
Professional fact-checkers refute misinformation even if no global counter-evidence exists, e.g., by rebutting underlying assumptions (Figure 1). Table 1 shows a few false claims built on top of various resources: (1) relies on the false assumption that Sharpies invalidate election ballots, (2 & 4) misinterpret official documents or laws, (3) is based on non-credible sources, and (5) changes the topic of a specific event from "gas extraction" to "COVID-19 measures". Fact-checkers use the reasoning behind the claim to consider evidence that is, or refers to, the claimant's source: the original tax legislation (2), or alternative (correct) descriptions of protests against gas extraction (5). Here, the content of the evidence alone is often insufficient. The assertion that the claimant's source and the counter-evidence used are identical, or refer to the same event, is crucial to refute the claim: claim (2) is refuted because the tax legislation it relies upon does not support the "spying" claim. However, the document does not specifically refute the claim, and without knowing that the claimant relied on it, it becomes useless as counter-evidence. Similarly, the correct narrative of protests against gas extraction is only mutually exclusive with the false claim (5) of protests against COVID-19 measures when it is assured that both refer to the identical incident. For similar reasons, the co-reference assumption is critical to the task definition of SNLI (Bowman et al., 2015). After this assertion, mutual exclusiveness is not required to refute the claim: it is sufficient if the claim is not entailed (i.e., incorrectly derived or relying on unverifiable speculations) or based on invalid sources (such as satire). Based on these observations, we identify two criteria for refuting claims if no global counter-evidence exists. We validate their relevance in Section 3:
• Source Guarantee: The guarantee that identified evidence either constitutes or refers to the claimant's reason for the claim.
• Context Availability: We broadly consider context as the claim's original environment, which allows us to unambiguously comprehend the claim, and to trace the claim and its sources across multiple platforms if required. It is a logical precondition for the source guarantee.
Both criteria are challenging for computers but naturally satisfied by human fact-checkers. Buttry (2014) defines the question "How do you know that?" to be at the heart of verification. After selecting a claim, finding provenance and sourcing are the first steps in journalistic verification. Provenance provides crucial information about context and motivation (Urbani, 2020). Journalists must then identify solid sources to compare the claim with (Silverman, 2014; Borel, 2016). Ideally, the claimant provides sources, which must be included and assessed in the verification process. During verification, journalists rely, if possible, on relevant primary sources, such as uninterpreted and original legislation documents (for claim 2, Table 1). Fact-checking organizations see sourcing as one of the most important parts of their work (Arnold, 2020).
3 Can FCNLP Help Human Verification?
In this section, we first analyze human verification strategies based on 100 misinformation claims. We then contrast these strategies with FCNLP.

3.1 Human Verification Strategies
We manually analyze 100 misinformation claims (rated "pants on fire", "false", or "mostly false") from two well-known fact-checking organizations: PolitiFact and Snopes. We randomly choose 50 misinformation claims from each website, of which 25 claims come from MULTIFC (a large NLP fact-checking dataset with real-world claims before 2019) and 25 claims from 2020/2021. We extract the URL for each claim and analyze its verification strategy based on the entire fact-checking article. Claims that require the identification of scam webpages, imposter messages, or multi-modal reasoning, such as detecting misrepresented, miscaptioned, or manipulated images (Zlatkova et al., 2019), were marked as not applicable to FCNLP by nature. (If a claim can be expressed in text and verified without multi-modal reasoning, we consider the verbalized variant of the claim and do not discard it.) In the first round of analysis, we assess whether humans relied on the source guarantee to refute the claim. Each claim (and its verification) is unique and can be refuted using different strategies. In the second round of analysis, we identify the primary strategy used to refute the claim and verify whether it is based on the source guarantee. This led us to identify four primary human verification strategies:

1. Global counter-evidence (GCE): Counter-evidence via arbitrarily complex reasoning but without the source guarantee.
2. Local counter-evidence (LCE): Evidence requires the source guarantee to refute the claim or the reasoning behind it.
3. Non-credible source (NCS): Evidence requires the source guarantee to refute the claim based on non-credible sources (e.g., satire).
4. No evidence assertion (NEA): The claim is refuted as no (trusted) evidence supports it.
We discard 25 non-applicable claims and show the results for the remaining 75 claims in Table 2. Please refer to Appendix A for more analysis details and examples. In some cases, the selection of one strategy is ambiguous because multiple strategies apply. In a pilot study to analyze human verification strategies, two co-authors agreed on 9/10 applicable misinformation claims. In general, about two-thirds of the claims were refuted by relying on the source guarantee. In 20 cases, fact-checkers refuted the claim by finding global counter-evidence.
In one case (other), fact-checkers relied entirely on expert statements. In general, experts supported the fact-checkers in identifying and discussing evidence, or strengthened their argument via statements, but did not affect the underlying verification strategy.

3.2 NLP Fact Verification
Focusing on evidence-based approaches. Approaches in FCNLP estimate a claim's veracity based on surface cues within the claim (Rashkin et al., 2017; Patwa et al., 2021), assisted by metadata (Wang, 2017; Cui and Lee, 2020; Li et al., 2020; Dadgar and Ghatee, 2021), or using evidence documents. In the latter case, the system uses the stance of the evidence towards the claim to predict the verdict. Verdict labels are often non-binary and include a neutral stance (Thorne et al., 2018) or fine-grained veracity labels from fact-checking organizations (Augenstein et al., 2019). Evidence-based approaches either rely on unverified documents or user comments (Ferreira and Vlachos, 2016; Zubiaga et al., 2016; Pomerleau and Rao, 2017), or assume access to a presumed trusted knowledge base such as Wikipedia (Thorne et al., 2018), scientific publications (Wadden et al., 2020), or search engine results (Augenstein et al., 2019). In this paper, we focus on trusted evidence-based verification approaches, which can deal with the truth changing over time (Schuster et al., 2019). More importantly, they are the most representative of professional fact verification. Effectively debunking misinformation requires stating the corrected fact and explaining the myth's fallacy (Lewandowsky et al., 2020), both of which require trusted evidence.
Global counter-evidence assumption in FCNLP. In FCNLP, evidence retrieval-based approaches assume that the semantic content of a claim is sufficient to find relevant (counter-) evidence in a trusted knowledge base (Thorne et al., 2018; Jiang et al., 2020; Wadden et al., 2020; Aly et al., 2021). This becomes problematic for misinformation that requires the source guarantee to refute the claim. By nature, in this case, the claim and evidence content are distinct and non-entailing. Content alone cannot assert that two different narratives describe the same protests (e.g., claim 5 in Table 1), or that a non-entailing fact (squalene is harvested from sharks) serves as the basis for the false claim (e.g., Figure 1). The consequence is a circular reasoning problem: knowing that a claim is false is a precondition to establishing the source guarantee, which in turn is needed to refute the claim. To escape this cycle, one must (a) provide the source guarantee by means other than content (e.g., context), or (b) find evidence that refutes the claim without the source guarantee (global counter-evidence). By relying only on the content of the claim, FCNLP cannot provide the source guarantee and is limited to global counter-evidence, which accounts for only 20% of the misinformation claims analyzed in the previous section.
Current FCNLP fails to provide source guarantees. We note that providing the source guarantee goes beyond entity disambiguation, as required in FEVER (Thorne et al., 2018). The self-contained context within claims in FEVER is typically sufficient to disambiguate named entities if required. After disambiguation, the retrieved evidence serves as global counter-evidence.
Recent approaches further add context snippets from Wikipedia (Sathe et al., 2020) or dialogues (Gupta et al., 2022) to resolve ambiguities, but they cannot provide the source guarantee to break the circular reasoning problem. These snippets differ from the context used by professional fact-checkers, who often need to trace claims and their sources across different platforms. Recently, Thorne et al. (2021) annotated more realistic claims w.r.t. multiple evidence passages. They found supporting and refuting passages for the same claim, which prevents the prediction of an overall verdict. Some works collect evidence for the respective claims by identifying scenarios where the claimant's source is naturally provided, such as a strictly moderated forum (Saakyan et al., 2021), scientific publications (Wadden et al., 2020), or Wikipedia references (Sathe et al., 2020). However, such source evidence is only collected for true claims. Adhering to the global counter-evidence assumptions of previous work, false claims in these works are generated artificially and do not reflect real-world misinformation.

3.3 Human and NLP Comparison
Our analysis (Table 2) finds that fact-checkers refuted only 26% of false claims with global counter-evidence. In all other cases, fact-checkers relied on source guarantees (LCE, NCS) or asserted that no supporting evidence exists (NEA). The verification strategy is not evident given the claim alone but depends on the existing evidence. The claim that "President Barack Obama's policies have forced many parts of the country to experience rolling blackouts" is refuted via global counter-evidence (that rolling blackouts had natural causes). The claim that "90% of rural women and 55% of all women are illiterate in Morocco" seems verifiable via official statistics. Yet, no comparable statistics exist, and the claim is refuted because it relies on a decade-old USAID request report.
We further analyze the claims refuted via global counter-evidence, which FCNLP, in theory, can refute. Some claims only require shallow reasoning, as directly contradicting evidence naturally exists: a transcript of an interview in which Ron DeSantis was asked about the coronavirus can easily refute the claim "Ron DeSantis was never asked about coronavirus". Another case is when information about the claim's veracity already exists, e.g., because those affected by the myth already corrected it. Most claims require complex reasoning, like legal text understanding or comparing and deriving statistics. Some claims first require the definition of terms to make them verifiable. Collecting all required global counter-evidence often means aggregating and comparing different information, possibly under time constraints. Consider the false claim that "Illegal immigration wasn't a subject that was on anybody's mind until [Trump] brought it up at [his] announcement": to refute this claim, one must first determine when Trump announced his run for the presidency, then count and compare how often "illegal immigration" was mentioned before and after the announcement.

4 NLP Fact-Checking Datasets
Based on our observations in Section 3 and FCNLP's reliance on global counter-evidence, we hypothesize that the evidence in existing fact-checking datasets does not fully satisfy real-world demands. We hence investigate how FCNLP's assumptions affect fact-checking datasets and whether they constitute realistic test beds for real-world misinformation. For real-world scenarios, datasets must contain real-world misinformation claims and realistic counter-evidence. For the evidence, we define the following two requirements:
• Sufficient: Evidence must be sufficient to justify the verdict from a human perspective.
• Unleaked: Evidence must not contain information that only existed after the claim was verified.
The issue of leaked evidence was also mentioned very recently by Khan et al. (2022). Unlike us, they did not comprehensively analyze existing datasets, evaluate the impact on trained systems (Section 5), or consider the complementary criterion of sufficient (counter-) evidence. Relying on leaked evidence is related to the important yet different task of detecting already-verified claims (Shaar et al., 2020), but is unrealistic for novel claims.
We survey NLP fact-checking datasets with natural input claims that assume access to trusted evidence. Table 3 summarizes our survey results. Datasets 1-6 contain no real-world misinformation: false claims are derived from true real-world claims (1-3) or within a gamified setting (4), ensuring that counter-evidence exists. Other works (5 & 6) reformulate real-world user queries, which are linked to Wikipedia articles as (counter-) evidence.
We find that no dataset with real-world misinformation (7-16) satisfies both evidence criteria. We identify four categories: First, datasets that consider (parts of) a fact-checking article as evidence contain sufficient, yet leaked, evidence (7 & 8). Second, annotators estimate claim veracity based on evidence such as Wikipedia or scientific publications. The authors find that the evidence often only covers parts of these realistic, complex claims, which yields low annotator agreement (9) or a weakened task definition for stances (10).
Third is to rely on the same evidence as fact-checkers. In 45.5% of cases, annotators found no stance for the professionally selected evidence snippets, even though professional fact-checkers considered these snippets important enough to be included in the article. Due to conflicting and unexplained evidence snippets, we rate this evidence insufficient to predict the correct verdict.
The human verification process (Section 2) guides the creation of the fact-checking article and can serve as a possible explanation for these problems. Articles link to the claim's context and possibly to other similar claims (likely supporting). Often (e.g., during COVID-19 (Simon et al., 2020)), claims are not completely fabricated. Fact-checkers identify documents and their interdependence when investigating the claimant's reasoning for the claim (likely not refuting). Documents used to disprove the claimant's reasoning may have little or no relevance to the original claim (as in Figure 1). Each step is non-trivial and may rely on numerous documents (or expert statements). Relying on premise evidence without considering the verification process and how these documents relate is insufficient. Both other datasets (12 & 13) in this category provide no annotations and are limited to freely available evidence documents (as opposed to paywalled web pages or e-mails).
Fourth is using a search engine during dataset construction to expand the accessible knowledge. Even when excluding search results that point to the claim's fact-checking article, leaked evidence persists: different organizations may verify the same claims, or disseminate the fact-checkers' verification. Only Baly et al. (2018) provide stance annotations for Arabic claim-evidence pairs. For false claims, they found that only a few documents disagree, and more agree, with the claim. A possible explanation is that misinformation often emerges when trusted information or counter-evidence is scarce. Fact-checking articles fill this deficit; partially excluding them during dataset generation reduces the counter-evidence found. Lacking counter-evidence is not a problem of the dataset generation, but of the underlying nature of misinformation, and should be considered by the task definition. We rate the evidence in this category (14-16) as leaked and insufficient, and back this up in Section 5.

5 A Case Study of Leaked Evidence
We view MULTIFC (Augenstein et al., 2019), the largest dataset of its group, as an instantiation of FCNLP applied to the real world: it contains real-world claims and professionally assessed verdicts as labels from 26 fact-checking organizations (such as Snopes or PolitiFact). The authors use the Google search engine to expand evidence retrieval to the real world during dataset construction. We abstract from the fact that MULTIFC only provides incomplete evidence snippets and consider (if possible) the underlying article in its entirety.

5.1 Quantification of Leaked Evidence
We focus our analysis on 16,244 misinformation claims that we identify via misinformation labels (listed in Appendix B.1). To quantify how many claims in MULTIFC contain leaked evidence, we consider all evidence snippets stemming from a fact-checking article, or discussing the veracity of a claim, as leaked. Table 4 shows examples of leaked evidence that strongly indicates the claim's verdict. The first snippet comes directly from a fact-checking organization (Truth Or Fiction). Only identifying leaked evidence that comes directly from fact-checking organizations is insufficient: after the publication of the verification report, its content is disseminated via other publishers (such as Pinterest in the second example). We identify leaked evidence snippets using patterns for their source URLs or contained phrases (a complete list of all used patterns is given in Appendix B.2). This requires the evidence to be relevant; irrelevant articles are insufficient, albeit not leaking. To this end, we manually analyze 100 claims with 230 automatically found leaking evidence snippets. We confirm that 83.9% of the snippets are leaked (details in Appendix B.3), and 97/100 of the selected claims contain at least one leaked evidence snippet. Table 5 lists the number of claims with leaked evidence identified by the pattern-based approach: it detected leaked evidence for 69.7% of misinformation claims. In addition, we manually analyze the evidence of 100 misinformation claims for which this approach found no leaked evidence. Misinformation verification often requires multiple evidence documents, rendering a single sufficient evidence snippet unrealistic. We therefore follow Sarrouti et al. (2021) and test whether a snippet supports or refutes parts of the claim. Table 6 shows that approximately one-third of these claims contain further leaked evidence. 15 claims have unleaked refuting evidence; in 10 cases, this evidence is overshadowed by leaked evidence for the same claim. Most analyzed claims only have non-refuting evidence. Similar to Baly et al. (2018), we found supporting evidence for 40 misinformation claims; for 35 of these claims, the evidence was itself misinformation and thus supported the claim; for the remaining five claims, the claim became accurate, and the evidence became available at a date later than the claim's creation.
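A minimal sketch of this pattern-based flagging is shown below. It is illustrative only: the URL substrings and phrase patterns stand in for the full lists in Appendix B.2 (Tables 9 and 10), and the snippet field names (url, title, text) are our own assumption rather than the MULTIFC schema.

```python
import re

# Illustrative stand-ins for the full URL and phrase lists in Appendix B.2.
FACT_CHECKING_URL_PARTS = ["politifact.com", "snopes.com", "truthorfiction.com"]
LEAK_PHRASES = [re.compile(p) for p in [r"fact[\s-]?check", r"fake news", r"the story is a hoax"]]

def is_leaked(snippet):
    """Flag an evidence snippet as leaked if its source URL points to a
    fact-checking site, or if its title/text matches a leak-indicating phrase."""
    url = snippet.get("url", "").lower()
    if any(part in url for part in FACT_CHECKING_URL_PARTS):
        return True
    text = (snippet.get("title", "") + " " + snippet.get("text", "")).lower()
    return any(pattern.search(text) for pattern in LEAK_PHRASES)

def claim_has_leaked_evidence(snippets):
    # A claim counts as affected if at least one of its evidence snippets is leaked.
    return any(is_leaked(s) for s in snippets)

# Toy usage example:
example = {"url": "https://www.truthorfiction.com/some-claim/", "title": "", "text": ""}
print(claim_has_leaked_evidence([example]))  # True
```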

5.2 Impact on Trained Systems
Hansen et al. (2021) found that models on MULTIFC can predict the correct verdict based on the evidence snippets alone. To test whether leaked evidence can serve as an explanation, we fine-tune BERT (Devlin et al., 2019) (bert-base-uncased) to predict the veracity label of a claim given the evidence snippets, with and without the claim. As input to BERT, we separate the claim (when used) from the evidence snippets using a [SEP] token and predict the veracity label with a linear layer on top of the preceding [CLS] token (training details in Appendix C.1). When each evidence snippet is represented via its content only, this performs on par with the specialized model introduced by Hansen et al. (2021). We additionally find that the snippet's title carries much signal, and adding it to the input improves the overall performance on PolitiFact. Snippets are concatenated (separated by ";") in the order provided by MULTIFC and truncated after 512 tokens. We experiment on the train, dev, and test splits that Hansen et al. (2021) extracted from MULTIFC for claims from Snopes and PolitiFact. We test four types of input: only evidence (only title, only text, or both), or the complete sample of claim and evidence. For the evaluation (Table 7), we split the test data based on whether a claim contains automatically identified leaked evidence. On Snopes, the macro-F1 is higher on the unleaked than on the leaked subset. Upon closer inspection, we find that the label distribution on Snopes is heavily skewed towards "false", which worsens on the leaked subset. Models seem to rely on patterns of leaked evidence to predict the majority label "false" (see Appendix C.2). On the leaked subset, this comes at the cost of incorrect predictions for all other labels, yielding a lower macro-F1. On the larger PolitiFact subset, labels are not as strongly skewed towards a single majority label. Across all experiments, the performance gap signals the reliance on leaked evidence. We confirm the impact of leaked evidence for both datasets by evaluating the model on the same instances with leaked or unleaked evidence, to avoid the different label distributions distorting the results (Appendix C.3).
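As a rough sketch of this evaluation protocol (not the authors' released code), the snippet below splits the test claims by the leak flag from the pattern-based detector and computes macro-F1 per subset; the dictionary keys are assumed names.

```python
from sklearn.metrics import f1_score

def macro_f1_by_leakage(test_claims, predictions):
    """Report macro-F1 separately on claims with and without automatically
    identified leaked evidence (as in Table 7)."""
    buckets = {"leaked": [], "unleaked": []}
    for claim, pred in zip(test_claims, predictions):
        key = "leaked" if claim["has_leaked_evidence"] else "unleaked"
        buckets[key].append((claim["label"], pred))
    scores = {}
    for name, pairs in buckets.items():
        if not pairs:  # skip empty subsets to avoid undefined scores
            continue
        gold = [g for g, _ in pairs]
        predicted = [p for _, p in pairs]
        scores[name] = f1_score(gold, predicted, average="macro")
    return scores
```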

6 Related Work
Combating Misinformation After Its Verification. The limitations of previous studies on NLP fact-checking datasets described in Section 4 do not devalue the surveyed datasets; we view them as highly important and useful contributions. These limitations are tied to our specific research question of refuting novel real-world misinformation. We strongly build on these previous works and view them as crucial starting points to fact-check real-world misinformation. Existing fact-checking articles are highly valuable, and automatic methods should utilize them to detect and combat misinformation. Automatic methods specifically using these resources detect misinformation by matching claims with known misconceptions (Hossain et al., 2020; Weinzierl and Harabagiu, 2022) or already verified claims (Vo and Lee, 2020; Shaar et al., 2020; Martín et al., 2022; Hardalov et al., 2022b).
Surveys on Automatic Fact-Checking. Recent work surveyed (aspects of) automated fact-checking and related tasks, including explainability (Kotonya and Toni, 2020a), stance classification (Küçük and Can, 2020; Hardalov et al., 2022a), propaganda detection (Da San Martino et al., 2021), rumor detection on social media (Zubiaga et al., 2018; Islam et al., 2020), fake-news detection (Oshikawa et al., 2020; Zhou and Zafarani, 2020), and automated fact-checking (Thorne and Vlachos, 2018). We refer interested readers to these papers. Guo et al. (2022) surveyed the state of automatic fact-checking. Based on their work, we zoom in on real-world misinformation to investigate the gap between professional fact-checkers and FCNLP. Recently, Nakov et al. (2021) surveyed tasks to assist humans during verification; they argue for the need for automatic tools to support humans in this process. Our work differs in that we focus solely on the automatic verification of misinformation. Similarly, Graves (2018) interviewed expert fact-checkers and computer scientists and concluded that automatic fact-checking cannot replicate professional fact-checkers in the foreseeable future. Our results confirm the challenging nature of misinformation but also outline why current models have unrealistic expectations and how humans overcome these problems. We believe this to be important, as real-world misinformation is well within the scope of current NLP research.

Towards Human Verification. A possible path forward is to align automatic verification with journalistic verification: use the claimant's reasoning to find evidence and verify the claim. This relies on the complex task of finding the correct sources (Arnold, 2020). A fruitful but understudied direction may be automated provenance detection (Zhang et al., 2020, 2021). Building systems that can provide source guarantees paves the way for reasoning tasks, such as the detection of logical fallacies (Jin et al., 2022), implicit warrants (Habernal et al., 2018), or propaganda techniques (Da San Martino et al., 2019; Huang et al., 2022). Integrating sufficient context into datasets is non-trivial and may require tracing a claim and its sources across multiple platforms. Existing literature shows the heterogeneity of misinformation (Borel, 2016; Wardle et al., 2017; Cook, 2020) and can help to identify small, focused problems that can realistically be translated into NLP. Approaches from computer vision focus on misinformation-specific methods to detect manipulated or misrepresented images (Zlatkova et al., 2019; Abdelnabi et al., 2022; Musi and Rocci, 2022).

7 Conclusion
In this work, we contrasted NLP fact-checking approaches with how professional fact-checkers combat misinformation. We identified that the reliance on counter-evidence hinders current fact-checking systems from refuting real-world misinformation. Using MULTIFC, we find that most evidence is insufficient, or leaked and exploited by trained models. Moving forward, we suggest aligning NLP approaches with the human verification process, and task definitions with smaller and well-defined verification strategies.

Limitations
The scope of this study is restricted to misinformation claims, and their representation as textual statements, that professional fact-checking organizations selected as important to verify. This only represents a fraction of all existing misinformation (Vinhas and Bastos, 2022). Our findings cannot be generalized to other types of misinformation. Process definitions for claim selection and verification differ among fact-checkers (Arnold, 2020). The assessed claims for the analysis and experiments are biased towards the claim selection criteria, including the domain, language, and geographical biases of Snopes and PolitiFact. Even fact-checkers cannot fully eliminate subjectivity: Nieminen and Sankari (2021) find 11% of PolitiFact's verified claims uncheckable. We consider the fact-checkers' assessment as the gold standard and adhere to the introduced subjectivity. PolitiFact and Snopes verify claims from English-speaking countries with rich resources and trusted government documents. Fact-checking organizations may rely on different strategies, adapted to different scenarios such as different topics, the dissemination of misinformation, or the trust in and availability of official information. The quantification of leaked evidence is bound to the time frame, fact-checking organizations, and found evidence of MULTIFC. We did not investigate the influence of different factors such as the fact-checkers' language, domain, or popularity, nor did we evaluate different evidence collection strategies. The same restrictions apply to the experimental results. Further, following Hansen et al. (2021), we only consider labels on a veracity scale for the experiments (e.g., excluding "misleading").

Ethics Statement
In this work, we only consider publicly available data as provided by fact-checking organizations or MULTIFC, but do not publish it. We do not use any personal data. We note that creating more realistic datasets (including realistic context), as suggested by us, induces ethical challenges as it requires personalized data (e.g., from Twitter or Facebook). We consider this study's goal of reducing harmful misinformation by aligning automatic methods with best practices from professional fact-checkers as ethically sound. However, even if successfully developed, fact-checking systems are inevitably imperfect. Malicious actors may design claims that exploit a system's weaknesses to predict the opposite verdict, giving legitimacy to false claims or discrediting correct claims. Further, malicious actors may develop fact-checking systems under their control. When extended with triggers enabling backdoor attacks (Chen et al., 2021) to control the outcome, these systems can serve as powerful tools to decide what seems true or false.

B Leaked Evidence Analysis

B.1 Misinformation Labels
We consider all claims rated with strongly-leaning false verdicts, as well as other verdicts that fall into the misinformation category such as "misleading", as misinformation. The remaining claims are either true (e.g., "verified", "mostly true"), mixed (e.g., "half-true", "outdated"), or not clearly applicable to misinformation (e.g., "opinion!", "scam", "full flop"). We provide all considered labels within MULTIFC below.
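A small sketch of this grouping is given here; it only uses the example verdict strings quoted above, so the label sets are illustrative placeholders rather than the complete MULTIFC lists.

```python
# Illustrative label sets built from the examples above; the complete
# MULTIFC label lists considered in the paper are longer.
MISINFORMATION = {"false", "pants on fire", "mostly false", "misleading"}
TRUE_LABELS = {"verified", "mostly true"}
MIXED = {"half-true", "outdated"}
NOT_APPLICABLE = {"opinion!", "scam", "full flop"}

def categorize(verdict):
    """Map a MULTIFC verdict string to one of the four coarse categories."""
    v = verdict.strip().lower()
    if v in MISINFORMATION:
        return "misinformation"
    if v in TRUE_LABELS:
        return "true"
    if v in MIXED:
        return "mixed"
    return "not applicable"

print(categorize("Pants on Fire"))  # misinformation
```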

B.2 Automatic Identification of Leaked Evidence
Table 9 shows the URLs used to automatically determine leaked evidence. We consider an evidence snippet as leaked if any URL in Table 9 is a substring of the snippet's URL. We exclude URLs if they may also cover URLs of news articles. Further, we consider an evidence snippet as leaked if its lowercased title or text matches any of the regular expressions in Table 10. We identified two commonly made errors:
• Different Claim: The approach considered evidence as leaked if it is not relevant to the exact same claim, but connected to the same incident ("President Obama pushed through the stimulus based on the promise of keeping unemployment under 8 percent." and "The president promised that if he spent money on a stimulus program that unemployment would go to 5.7 percent or 6 percent. Those were his words."), or thematically related ("Cadbury chocolate eggs are infected with HIV-positive blood" and "HIV & AIDS infected oranges coming from Libya").
• Discussing Fake News or Fact-Checking: The approach selects snippets that discuss fact-checking or fake news from a different perspective, not as the result of a verification. This includes opinions or reports complaining about "fake news" being spread, or about the fact-checking process.

B.3 Manual Guidelines
To determine the stance of an evidence snippet towards a claim, or whether it is leaked or not, we proceed in the following order: We first read the original fact-checking article to fully comprehend the claim and how fact-checkers refuted it. If the title or text of the evidence snippet provides sufficient information, we decide based on the snippet alone.
If we cannot make a clear decision based on the snippet, we consider the original web page. This may be required, as evidence snippets often contain incomplete sentences.

B.3.1 Leaked Evidence Snippets
We consider evidence snippets as leaked if (a) they constitute information that relies on the verification of the same claim, or (b) they provide originally unknown information from the claim's future. When relying on content from the claim's verification, we do not require the information to contradict the claim from a human perspective. This often occurs when different pages (such as overview pages) reference the fact-checking article. Such a page may be a clear indication of the verdict in some cases (e.g., if titled "All False Claims by Person A").

In other cases, different interpretations are valid: the statement "We previously fact-checked similar claims that ..." may be seen as neutral or as a giveaway that similar claims were refuted. Further, humans cannot judge whether models may rely on latent patterns. An overview page titled "All claims from Person A" may be sufficient for the model if it learned that most claims by Person A are false. To remove this ambiguity, we consider any mention of, or information taken from, the claim's verification as leaked. We do not strictly consider all evidence that appeared after the verification as leaked: not all evidence published after the claim's verification is based on that verification. If it is not, we check whether it relies on new information that previously did not exist or whether the truth changed. Consider the claim "Khloe Kardashian did give birth over Easter.", refuted on April 5, 2018. Evidence about her actually giving birth on April 12, 2018, does not rely on a previous verification but is still considered leaked (new information became available). In other cases ("Coca-Cola's "Share A Coke" campaign includes a bottle for the KKK.", March 2, 2016), we consider evidence from March 30, 2016, as unleaked: it correctly reports about the same incident the claim refers to, without any mention of the false claim or its verification.

B.3.2 Stance of Evidence Snippets
For most claims, it is unrealistic to assume that a single evidence snippet can refute them entirely. We follow Sarrouti et al. (2021) and allow evidence to support or refute only parts of the claim. We separately mark supporting evidence from the claim's future, as the claim's veracity may have changed.
We consider correctly identified counter-evidence as refuting the claim, even when it requires the source guarantee.

C Experiments on MULTIFC

C.1 Training Details
For our experiments, we use bert-base-uncased as provided by Wolf et al. (2020). We represent each evidence snippet e as the title, the text body, or the concatenation of both (depending on the experiment). We concatenate all evidence snippets e_i, separated by semicolons (e_1; e_2; ...; e_n). We input the concatenation of the claim c and the concatenated evidence, separated by a [SEP] token, and truncated after 512 tokens: [CLS] c [SEP] e_1; e_2; ...; e_n [SEP]. We predict the label via a linear layer on the [CLS] token. We train all models for 5 epochs with a learning rate of 2e-5 and a batch size of 16. We select the model with the highest F1 score on the development dataset, evaluated after each epoch. We keep the default values for all other parameters. We always train and evaluate three models using the seeds (1, 2, 3). We did not fine-tune any hyperparameters. We provide code for reproduction. We run our experiments on a DGX A100.
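The following sketch illustrates this setup with the Hugging Face transformers library. It is not the released reproduction code: the number of labels is a placeholder (it depends on the organization's veracity scale), and data loading, the training loop over the 5 epochs, and seed handling are omitted.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Placeholder label count; the actual value depends on the veracity scale used.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)

def encode(claim, snippets, use_claim=True, max_length=512):
    """Build [CLS] c [SEP] e_1; e_2; ...; e_n [SEP] (or evidence only),
    truncated after 512 tokens, as described above."""
    evidence = "; ".join(snippets)
    if use_claim:
        return tokenizer(claim, evidence, truncation=True, max_length=max_length, return_tensors="pt")
    return tokenizer(evidence, truncation=True, max_length=max_length, return_tensors="pt")

# Hyperparameters as stated above: learning rate 2e-5 (batch size 16, 5 epochs).
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = encode("example claim", ["evidence title; evidence text", "another snippet"])
logits = model(**batch).logits  # veracity prediction from the classification head on [CLS]
```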

Figure 1: A false claim from PolitiFact. It is unlikely to find counter-evidence. Fact-checkers refute the claim by disproving why it was made.


Table 2: Strategies used to refute 75 of 100 misinformation claims with and without the source guarantee (Src.).

Table 3: Overview of NLP fact-checking datasets as realistic test beds to combat real-world misinformation. We indicate whether humans annotated the stance between claim and evidence (Ev. Ann.).

Table 4: Examples from MULTIFC of leaked evidence detected via the snippet URL (linking to a fact-checking article) or a phrase of the evidence snippet.
(via snippet URL) Claim: Google Earth Finds SOS From Woman Stranded on Deserted Island. Evidence snippet: "The Truth: The story is a hoax. ... GOOGLE EARTH FINDS WOMAN TRAPPED ON DESERTED ISLAND FOR 7 YEARS ... other end 'How did you find me' to which they replied 'Some kid from Minnesota found your SOS sign on Google Earth'" (from Truth Or Fiction).
(via phrase) Claim: Country music singer Merle Haggard left his entire estate to an LGBT group. Evidence snippet: "Discover ideas about Country Singers. Fake news reports that recently-deceased country music legend Merle Haggard left his entire estate to an LGBT group" (from Pinterest).

Table 5: Absolute and relative number of automatically identified leaked evidence of MULTIFC misinformation claims.

Table 6: Manual analysis of 100 claims without automatically identified leaked evidence.