Fighting Fire with Fire: The Dual Role of LLMs in Crafting and Detecting Elusive Disinformation

The recent ubiquity and disruptive impact of large language models (LLMs) have raised concerns about their potential for misuse (i.e., generating large-scale harmful and misleading content). To combat this emerging risk of LLMs, we propose a novel "Fighting Fire with Fire" (F3) strategy that harnesses modern LLMs' generative and emergent reasoning capabilities to counter both human-written and LLM-generated disinformation. First, we leverage GPT-3.5-turbo to synthesize authentic and deceptive LLM-generated content through paraphrase-based and perturbation-based prefix-style prompts, respectively. Second, we apply zero-shot in-context semantic reasoning techniques with cloze-style prompts to discern genuine from deceptive posts and news articles. In our extensive experiments, we observe GPT-3.5-turbo's zero-shot superiority on both in-distribution and out-of-distribution datasets, where it consistently achieves accuracy of 68-72%, unlike the decline observed in previous customized and fine-tuned disinformation detectors. Our codebase and dataset are available at https://github.com/mickeymst/F3.


Introduction
While recently published LLMs have demonstrated outstanding performance in diverse tasks such as human dialogue, natural language understanding (NLU), and natural language generation (NLG), they can also be maliciously used, even with protective guardrails, to generate highly realistic but hostile content, especially disinformation (Spitale et al., 2023; Sadasivan et al., 2023; De Angelis et al., 2023; Kojima et al., 2022). Moreover, LLMs can produce persuasive texts that are not easily distinguishable from human-written ones (Uchendu et al., 2021; Chakraborty et al., 2023; Zhou et al., 2023), making humans more susceptible to their intrinsic/extrinsic hallucination proclivities and thus introducing disinformation (Uchendu et al., 2023).
To mitigate this muddle of disinformation by LLMs, in this work we ask a pivotal question: if LLMs can generate disinformation (via malicious use or hallucination), can they also detect their own, as well as human-authored, disinformation? Emerging literature provides limited perspectives on the potential use of the latest commercial state-of-the-art (SOTA) LLMs such as GPT-4 (OpenAI, 2023) and LLaMA 2 (Touvron et al., 2023) to address disinformation. Underexplored topics include: (1) leveraging prompt engineering to bypass LLMs' protective guardrails; (2) utilizing the emergent zero-shot capabilities of modern LLMs (with 10B+ parameters) for disinformation generation and detection; (3) manipulating human-written real news to fabricate LLM-generated real and fake narratives that simulate real-world disinformation risks; and (4) assessing and addressing LLMs' inherent hallucinations in the disinformation domain.
To investigate these inquiries, we formulate two research questions (RQs) as follows. RQ1: Can LLMs be exploited to efficiently generate disinformation using prompt engineering?, where we (1) attempt to override GPT-3.5's alignment in fabricating real news, and (2) measure the frequency of and remove GPT-3.5's hallucinated misalignments. RQ2: How proficient are LLMs in detecting disinformation?, where we evaluate detection capability across (1) human- and AI-authored content, (2) self-generated and other-LLM-generated content, (3) social media posts and news articles, (4) in-distribution and out-of-distribution data, and (5) zero-shot LLMs and domain-specific detectors.
The pretrained LLMs of interest are GPT-3.5-Turbo, LLaMA-2-Chat (Rozière et al., 2023), LLaMA-2-GPT4 (Touvron et al., 2023), PaLM-2-text-bison (Anil et al., 2023), and Dolly-2 (Conover et al., 2023) (see Table 1 for more details). To answer these RQs, we propose the Fighting Fire with Fire (F3) framework. As shown in Fig. 1, we first use paraphrase and perturbation methods with prefix-style prompts to create synthetic disinformation from verified human-written real news (steps 1 and 2). We then employ hallucination mitigation and validation strategies to ensure our dataset remains grounded in factual sources. Specifically, the PURIFY (Prompt Unraveling and Removing Interface for Fabricated Hallucinations Yarns) method in step 3 incorporates metrics such as AlignScore, Natural Language Inference, Semantic Distance, and BERTScore to ensure data integrity and fidelity. Lastly, steps 4 and 5 implement cutting-edge in-context, zero-shot semantic reasoning such as Auto-Chain-of-Thought for detecting disinformation.
Our contributions include: (1) new prompting methods for synthetic disinformation generation; (2) a purification framework for hallucinated synthetic disinformation; (3) novel in-context, zero-shot prompting strategies for detecting human/AI disinformation; (4) a comprehensive benchmark of SOTA detection models on a human/AI disinformation dataset; and (5) a dataset for disinformation research. This dual-perspective study cautions against the risks of AI disinformation proliferation while also providing promising techniques that leverage LLM capabilities for enhanced detection.
Related Work

Prompt-based Learning
The latest large language models (LLMs) surpass previous models in many downstream tasks, including zero-shot NLP via prompt engineering (Holtzman et al., 2019; Jason et al., 2022). We survey two core prompting techniques that we use for our task. Prefix Prompts provide instructional context at the beginning of the prompt to guide LLMs' text generation (Kojima et al., 2022). Strategies like in-context learning and prompt tuning boost generative performance on many NLP tasks such as summarization and translation (Radford et al., 2019; Li and Liang, 2021; Dou et al., 2020; Brown et al., 2020). In addition, paraphrasing (Krishna et al., 2023; Kovatchev, 2022) and perturbation (Chen et al., 2023) approaches are widely used in NLP tasks. We use both paraphrasing and perturbation with prefix prompts to synthetically generate disinformation variations from human-written true news (Karpinska et al., 2022; Fomicheva and Specia, 2019), leveraging LLMs' generative power while maintaining its connection to the truth. Cloze Prompts contain missing words for LLMs to fill in using context (Hambardzumyan et al., 2021) and are often used to assess LLMs' contextual prediction, including question answering that predicts the correct missing word to logically complete a given context (Gao et al., 2020). Researchers have applied fixed-prompt tuning to cloze prompts (Lester et al., 2021; Schick and Schütze, 2021), and prior work has explored cloze prompt engineering (Hambardzumyan et al., 2021; Gao et al., 2020). We combine cloze prompts with SOTA reasoning techniques such as Chain-of-Thought (CoT) for zero-shot disinformation detection (Tang et al., 2023a), leveraging both approaches.
Other studies generate synthetic disinformation using LLMs (Zhou et al., 2023; Sun et al., 2023) but do not evaluate faithfulness or compare human- vs. LLM-generated disinformation detection. Despite their advanced capabilities, the risks of advanced LLMs in generating disinformation and their zero-shot (in-distribution and out-of-distribution) detection remain underexplored (Zhou et al., 2023; Liu et al., 2023a; Qin et al., 2023), a gap in understanding we attempt to fill.

Problem Definition
We use prompt engineering to examine how LLMs use statistical patterns learned during training to generate text sequences and produce zero-shot binary class responses. First, we define our prompt template (see Figure 2), which forms the basic structure of our LLMs' input text. Then, we define F3 prompt-based text generation and disinformation detection. See Appendix A for further details. Our problems are formally defined as follows. RQ1 Disinformation Generation: For F3 text generation, we use a generator G that takes a prefix prompt X, composed of content C, impersonator R, and instructor I, as input and generates text sequences T, such that G(X) = G(C + R + I) = T (Figure 10).

RQ2 Disinformation Detection:
For F3 text detection, we employ a classifier F that takes a cloze prompt Y, composed of content C, impersonator R, and instructor I, as input and outputs a label L, such that F(Y) = F(C + R + I) = L (Figure 19).
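To illustrate this template, the following minimal sketch assembles the three parameters into a prefix prompt (RQ1) and a cloze prompt (RQ2). The field names and rendering order are illustrative; the exact prompt wording follows Figures 10 and 19 and Tables 8-9.

```python
from dataclasses import dataclass

@dataclass
class F3Prompt:
    """Generic F3 prompt: content (C), impersonator (R), and instructor (I)."""
    content: str       # C: human-written news item, or the text to classify
    impersonator: str  # R: persona that sets context and helps override alignment refusals
    instructor: str    # I: task directive (paraphrase/perturb for RQ1, veracity verdict for RQ2)

    def render(self) -> str:
        # X = C + R + I for prefix prompts; Y = C + R + I for cloze prompts
        return f"{self.impersonator}\n\n{self.instructor}\n\nText: {self.content}"

# Prefix-style prompt for generation (RQ1): G(X) = T
gen_prompt = F3Prompt(
    content="<verified human-written real news>",
    impersonator="You are an AI news curator.",
    instructor="Rewrite the article with minor perturbations ...",  # illustrative directive
).render()

# Cloze-style prompt for detection (RQ2): F(Y) = L
det_prompt = F3Prompt(
    content="<news article or social media post>",
    impersonator="You are an AI assistant trained to detect fake news.",
    instructor="Analyze the given text and determine if it is real or fake news.",
).render()
```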

Datasets
This section describes the human datasets that we used to generate and evaluate LLM-generated disinformation. The data are stratified by veracity, content type, topic, and in-/out-of-distribution era relative to GPT-3.5's September 2021 training cutoff.

Human-Written Real and Fake News Data
We leverage existing benchmarks (Cui and Lee, 2020; Shu et al., 2020) for in-distribution evaluation and collect new data for out-of-distribution evaluation. Our dataset is summarized as follows.
Prompt-based generation exhibits near-random performance when samples are poorly selected (Liu et al., 2023b). To address this, we removed noisy or duplicated news and posts (e.g., "this website is using a security service to protect itself from online attacks") as well as texts exceeding 2K tokens, considering the maximum input sequence lengths of pre-trained LLMs (e.g., LLaMA 2 (Rozière et al., 2023)) and of all baseline detectors. Our final human dataset has 12,723 verified, high-quality samples (Table 2).

RQ1: Disinformation Generation
We first investigate the frequency and extent to which prompt engineering can exploit LLMs to efficiently produce disinformation without hallucinated misalignments (see steps 1 & 2 in Fig. 1).

RQ 1.1 Overriding Alignment Tuning
Alignment tuning prevents LLMs from generating harmful disinformation and minimizes toxicity (Zhao et al., 2023). This technique, pioneered by OpenAI, optimizes models to produce more beneficial behaviors through continued training on human preferences (Zhao et al., 2023). After extensive prompt engineering experiments, however, we uncovered the role of positive impersonation and thus employ impersonator roles to override such protections. Assigning personas (e.g., "You are an AI news curator" or "You are an AI news investigator") circumvents GPT-3.5's alignment, triggering unintended malicious generation. Without the impersonator prompt-role parameter, GPT-3.5 refuses by stating: "Sorry, I can't assist with that request. Disinformation and fake news can have real consequences, and it's essential to approach news and information responsibly and ethically." RQ1.1 Finding: Impersonator prompt engineering overrides GPT-3.5-turbo's protections, enabling malicious text generation despite alignment tuning.

RQ 1.2 Prompt Engineering
We developed prompts using both perturbation and paraphrasing, simulating real-world disinformation varieties from subtle to overt fake content. Perturbation modifies the original content (Karpinska et al., 2022), while paraphrasing preserves its meaning using real news (Table 2) (Witteveen and Andrews, 2019; Chen et al., 2020b). Inspired by machine translation, we devise three variations of each for varied detectability (Brown et al., 2020; Qi et al., 2020; Warstadt et al., 2020). This helps in creating controllable synthetic news content.
(1) Perturbation-based Fake News Generation. Perturbation-based prompting makes controlled alterations to the original content. We categorize prompts into minor, major, and critical levels based on modification severity (Karpinska et al., 2022; Chen et al., 2023). These levels range from subtle to overt while maintaining story structure with a balance of creativity and realism. The perturbations avoid easily traceable modifications when generating fake news variants. See Figure 3 for examples and further details in Appendix A (Fig. 20 and 21).
Our three variants are as follows: (1) The Minor prompt evokes subtle changes to the real news so that they are not instantly identifiable; thus, LLM-generated disinformation of the Minor type should be more difficult to detect. (2) The Major prompt instigates noticeable but non-radical changes to the real news; thus, LLM-generated disinformation of the Major type should be more identifiable than the Minor type. (3) The Critical prompt induces significant and conspicuous changes to the real news; the alterations by this prompt will likely be easily detectable.
(2) Paraphrase-based Real News Generation. Paraphrasing prompts are an effective technique for abstractive summarization and paraphrasing (Krishna et al., 2023; Evans and Roberts, 2009b,a). We adopted three techniques to re-engineer authentic news: (1) summarization of key factual details (Witteveen and Andrews, 2019), (2) rewording while preserving vital factual information (Chen et al., 2020b), and (3) thorough rephrasing guided by key facts (Chen et al., 2020a). Each prompt aims to preserve the essence, innovate the wording, blend seamlessly with the original, and maintain factual accuracy. Figure 9 in Appendix A shows examples exhibiting minor to critical paraphrased real news, further elaborated in Appendix A. Our variants are defined as follows: (1) Minor: light paraphrasing through concisely summarizing key details without introducing misleading information. (2) Major: moderate paraphrasing by extracting factual details to guide rewording with different vocabulary while retaining accuracy. (3) Critical: substantial paraphrasing through comprehensively rephrasing the content in a unique style guided by factual details.
Our prefix prompt, therefore, comprises a standard impersonator, dataset content, and an instructor element that embeds one variation of the aforementioned perturbation or paraphrase prompt variants (Fig. 9). We found that explicitly guiding an LLM to rephrase the content while retaining factual details produced higher-quality and more diverse renditions of the original news. Figures 20 and 21 show prompt definitions and examples. In the end, using GPT-3.5-turbo, we generated 43K+ real and maliciously fake synthetic samples to address RQ1.2.
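For concreteness, the generation step can be sketched as follows. The instruction wording is an illustrative stand-in (the exact prompts appear in Figures 20 and 21), the call uses the OpenAI chat-completion interface in its pre-1.0 SDK form, and the output-token budget shown here is an assumption rather than the reported 4096-token limit (Appendix D.1).

```python
import openai  # openai<1.0-style SDK, as available at the time of our experiments

IMPERSONATOR = "You are an AI news curator."  # positive impersonation role (Section RQ1.1)
PERTURB_INSTRUCTIONS = {
    "minor": "Using the source text, create a minor category of fake news ...",     # illustrative
    "major": "Using the source text, create a major category of fake news ...",     # illustrative
    "critical": "Using the source text, create a critical category of fake news ...",  # illustrative
}

def generate_variant(real_news: str, severity: str) -> str:
    """G(X) = T: prefix prompt X = C + R + I fed to GPT-3.5-turbo."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0.7,        # hyperparameter reported in Appendix D.1
        max_tokens=1024,        # illustrative output budget; must leave room for the prompt
        messages=[
            {"role": "system", "content": IMPERSONATOR},                          # R
            {"role": "user", "content": f"{PERTURB_INSTRUCTIONS[severity]}\n\n"   # I
                                        f"Source text: {real_news}"},             # C
        ],
    )
    return response["choices"][0]["message"]["content"]
```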

Ensuring Quality of LLM-Generated Data
Despite our efforts to generate both real and fake news using paraphrase-based and perturbation-based prompt engineering in Section 5.2, we need to double-check that the generated real (resp. fake) news is indeed factually correct (resp. factually incorrect). This is because an LLM may generate output text that is "unfaithful to the source or external knowledge," the so-called hallucination phenomenon (Ji et al., 2023). That is, the generated "real" news (from a paraphrase-based prompt) should be consistent with the input, and thus hallucination-free by definition. On the contrary, the generated "fake" news (from a perturbation-based prompt) should contain contradicting, illogical, or factually incorrect information, and thus must contain hallucination by definition. When some of the generated data do not show this alignment clearly, they are no longer good real or fake news for studying RQ2 and must be filtered out.
To ensure the quality of the generated real and fake news, we therefore introduce PURIFY (Prompt Unraveling and Removing Interface for Fabricated Hallucinations Yarns) (step 3 in Fig. 1).

PURIFY: Filtering Misaligned Hallucinations
PURIFY aims to detect two misalignment types: (1) LLM-generated real news that contains hallucinations (and thus cannot be real news), and (2) LLM-generated fake news that is hallucination-free (and thus cannot be fake news). PURIFY focuses on logical fidelity, factual integrity, semantic consistency, and contextual alignment between the original human text and its synthetic LLM-generated counterpart. Specifically, PURIFY combines Natural Language Inference (NLI) (Qin et al., 2023) to eliminate intrinsic hallucinations (i.e., generated text that unfaithfully contradicts the input prompt's instruction and source content), AlignScore (Zha et al., 2023) to address extrinsic hallucinations (i.e., generated text that is unfaithfully non-factual with respect to the source content or external knowledge), Semantic Distance to tackle incoherence (Mohammad and Hirst, 2012; Rahutomo et al., 2012), and BERTScore (Zhang et al., 2019) to target unrelated context generation, thereby validating the intended fidelity of our LLM experimental data.
First, to detect intrinsic hallucinations, we use NLI to gauge the logical consistency between the input prompt and the generated output. We take the majority vote of NLI results among GPT-3.5-turbo, PaLM-2 (Anil et al., 2023), and LLaMA-2 (Rozière et al., 2023). After NLI validation, our initial dataset of 43,272 samples was reduced to 39,655 samples by removing 3,617 logically inconsistent samples, e.g., samples labeled as real that contain intrinsic hallucinations (Appendix B.2).
Second, to detect extrinsic hallucinations, we use AlignScore, which gauges fine-grained degrees of factual alignment between the input prompt and the generated output. A high AlignScore verifies factual consistency with the real news, whereas a low score is expected for fake news. To account for nuances, we use hybrid statistical methods (i.e., standard deviation and interquartile range) to create AlignScore thresholds. We derived acceptance ranges of 0.0-0.36 for fake news and 0.61-1.0 for real news (Appendix B.1). After removing 3,281 high-scoring misaligned fake news samples and 8,707 low-scoring misaligned real news samples, our final F3 dataset totaled 27,667 samples.
Next, after removing logically and factually misaligned texts, we apply semantic and contextual consistency measures to validate that the LLM-generated data also align with the original topic and context. This ensures that the LLM's intended real or fake outputs are meaningful, not random or out-of-scope text (Table 3). The F3 dataset exhibits high contextual consistency, with BERTScore values ranging from 0.92-1.00 for fake news and 0.95-1.00 for real news, and also exhibits strong semantic consistency, with semantic distance scores spanning 0.001-0.014 for fake news and 0-0.01 for real news.
These measures validate that F3-generated texts faithfully retain the meaning and topics of the original source texts per the intended input prompts. While there is strong contextual and semantic consistency, the overlap in these metric scores illustrates the challenge for LLMs of distinguishing between real and fake news. Thereby, PURIFY ensures our data align logically and factually with the prompt in relation to the original text. It also retains high-quality, meaningful, and nuanced real or fake content that simulates subtle extrinsic/intrinsic hallucinations and elusive "silent echoes" disinformation in real-world contexts (Table 3).
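Putting the four checks together, a minimal sketch of the PURIFY filtering decision is shown below. The AlignScore and NLI rules mirror the thresholds reported above; treating the BERTScore and semantic-distance ranges from Table 3 as hard cutoffs, as well as the function and attribute names, are illustrative assumptions.

```python
def purify_keep(sample) -> bool:
    """Keep a generated sample only if all four PURIFY checks agree with its intended label.

    sample.intended: "real" (paraphrase prompt) or "fake" (perturbation prompt)
    sample.nli_votes: entailment decisions (bools) from GPT-3.5-turbo, PaLM-2, and LLaMA-2
    sample.align, sample.bertscore, sample.sem_dist: scores vs. the human source text
    """
    entails = sum(sample.nli_votes) >= 2              # majority vote (Appendix B.2)

    if sample.intended == "real":
        logically_ok = entails                        # real news must entail the source
        factually_ok = sample.align >= 0.61           # AlignScore acceptance range 0.61-1.0
    else:
        logically_ok = not entails                    # fake news must not entail the source
        factually_ok = sample.align <= 0.36           # AlignScore acceptance range 0.0-0.36

    # contextual / semantic consistency: generated text must stay on the source topic
    contextually_ok = sample.bertscore >= 0.92        # observed BERTScore range (Table 3)
    semantically_ok = sample.sem_dist <= 0.014        # observed semantic-distance range (Table 3)

    return logically_ok and factually_ok and contextually_ok and semantically_ok
```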

Finally, Table 4 depicts details of our new F3 LLM-generated dataset (after the PURIFY step) across pre- and post-GPT-3.5 periods, covering both social media posts and news articles. We use PaLM-2 to conduct a thematic analysis of the dataset (Fig. 18). Table 6 compares the F3 dataset with emerging related datasets, and Table 10 details misalignment examples.

Dataset Set-up and Models Tested
Given our dataset in Tables 2 and 4, we structure our experimental data as follows: (1) We divide the data into pre- vs. post-GPT-3.5 for in-distribution vs. out-of-distribution evaluation; pre-GPT-3.5 data allow us to train and validate models for in-distribution testing, while post-GPT-3.5 data provide unseen data for assessing out-of-distribution generalization. (2) We further stratify the data into human- vs. LLM-generated for comparing performance on these two test cases. (3) We also separate data into news articles and social media posts, as models may perform differently on these text types. (4) For pre-GPT-3.5 data, we split the data into 70% for training, 20% for validation, and 10% for testing via stratified sampling to ensure balanced real/fake news splits. (5) We do not split the post-GPT-3.5 data, using the full set for OOD testing.
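A minimal sketch of this split, assuming the data sit in a pandas DataFrame with a binary veracity label (the column name is illustrative):

```python
from sklearn.model_selection import train_test_split

def split_pre_gpt(df, label_col="veracity", seed=42):
    """70/20/10 stratified split of the pre-GPT-3.5 (in-distribution) data."""
    train, rest = train_test_split(
        df, test_size=0.30, stratify=df[label_col], random_state=seed)
    val, test = train_test_split(
        rest, test_size=1/3, stratify=rest[label_col], random_state=seed)  # 20% / 10% of the total
    return train, val, test

# Post-GPT-3.5 data are not split: the full set is held out for OOD evaluation.
```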

Detection: Cloze-Prompt Engineering
We evaluate LLMs' zero-shot disinformation detection using prompt engineering. LLMs can reason systematically with simple prompts like "Let's think step-by-step" for CoT (Zhang et al., 2022; Bommasani et al., 2021). Using cloze-style prompts, we apply semantic, intermediate, and step-by-step reasoning, and integrate SOTA prompting approaches with our confidence-based and context-defined reasoning strategies inspired by various logic types (Tang et al., 2023a).
To guide predictions, we embed such techniques into our cloze-style prompt's instructor parameter (Fig. 2). However, LLMs' alignment via reinforcement learning from human feedback (RLHF) limits explicit veracity assessment. We address this issue using our impersonator prompt. See Tables 8 and 9 for further details.
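A minimal sketch of this zero-shot detection call is given below. The impersonator and instructor texts are taken from the CoT-style prompt in Table 8; the answer-parsing rule and the use of the pre-1.0 OpenAI SDK are illustrative assumptions.

```python
import openai  # openai<1.0-style SDK

IMPERSONATOR = "You are an AI assistant trained to detect fake news."
COT_INSTRUCTOR = ("Deeply analyze the given text, think step-by-step, "
                  "and determine if it is real or fake news.")

def detect(text: str) -> str:
    """F(Y) = L: cloze prompt Y = C + R + I, mapped to a binary veracity label."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0.7,
        messages=[
            {"role": "system", "content": IMPERSONATOR},                         # R
            {"role": "user", "content": f"{COT_INSTRUCTOR}\n\nText: {text}"},    # I + C
        ],
    )
    answer = response["choices"][0]["message"]["content"].lower()
    return "fake" if ("fake" in answer or "false" in answer) else "real"  # illustrative parsing
```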

RQ2 Results and Analysis
Our analysis compares the average Macro-F1 scores across RQ2. Table 5 shows the average Macro-F1 scores on our pre- and post-GPT-3.5 (i.e., in- and out-of-distribution) datasets. Full detailed results are provided in Appendix E.
RQ2.2 Finding: GPT-3.5-Turbo is good at self-detection, and LLaMA-GPT is the best external detector.
RQ2.3 Finding: LLMs exhibit superior zero-shot performance on (long) news articles than (short) social media posts.

RQ2.4: In-Distribution vs. Out-of-Distribution
We categorize disinformation data as in-distribution or out-of-distribution relative to GPT-3.5-turbo's known training timeline to assess detectors' generalizability. Disinformation created from human data published "before" the release of GPT-3.5-turbo (i.e., Pre-GPT3.5 in Table 2) is considered in-distribution, as such human data may be part of GPT-3.5-turbo's training data. On the other hand, disinformation created from human data published "after" the release of GPT-3.5-turbo (i.e., Post-GPT3.5 in Table 2) is considered out-of-distribution, as it could not have been part of GPT-3.5-turbo's training data. Assessing LLMs' zero-shot ability on in-distribution vs. out-of-distribution detection, as shown in Fig. 7, we found that the performance of all LLMs except LLaMA-2 declined on out-of-distribution data. Minor LLM disinformation is associated with lower detection accuracy than Major and Critical LLM disinformation (Table 5).

RQ2.5: Zero-Shot LLMs vs. Domain-Specific Detectors
We compare zero-shot generative LLMs against customized and fine-tuned transformer detectors across in-distribution and out-of-distribution data. Overall, fine-tuned transformer models like BERT achieve the best performance, followed by generative LLMs like GPT-3.5-Turbo and then customized models. However, the average performance of transformers and customized models drops significantly on OOD data compared to generative LLMs.

Discussion
F3 prompting shows promise for zero-shot detection. Our F3 techniques outperformed standard reasoning, showing the potential of advanced prompt engineering to enhance LLMs' zero-shot detection abilities. Notably, MsReN_CoT showed the strongest results across human-LLM datasets. For articles, GPT-3.5-Turbo-175B with Analyze_Cld2 achieved top performance on human and LLM data. The integrated reasoning strategies underpinning F3 cloze prompts account for their standout performance, highlighting fruitful directions for developing broadly applicable disinformation detection prompts.

GPT-3.5-turbo excels at detecting human-written and self-written disinformation. Leveraging prompting, GPT-3.5-turbo exceeded other models at detecting human-written and self-generated disinformation. Despite stiff competition, it demonstrated superior self-detection across all synthetic article and post variants, asserting its zero-shot capabilities. Further assessing its performance on other LLMs' disinformation is a vital next step. While self-detection is unsurprising due to shared vocabulary distribution, this performance underscores detection potential if ChatGPT faces malicious exploitation.

LLMs are robust across distributions. GPT-3.5-turbo consistently detected human-written and LLM-generated disinformation, both in-distribution and out-of-distribution, showing its potential. These results highlight the significance of generative LLMs' applicability in real-world settings as emerging zero-shot reasoners in disinformation detection. Fine-tuned transformers show remarkably high in-distribution performance, indicating optimization for familiar data, while their less competitive out-of-distribution scores demonstrate a specialization-generalization trade-off. Customized models exhibit a good in-distribution and out-of-distribution balance, though they are slightly weaker on unfamiliar data. The performance gap suggests specialization for domain tasks but difficulty generalizing.

Smart, cunningly crafted (subtle) disinformation challenges even the best current detectors. All models struggled more with minor disinformation alterations than with major and critical changes. This is somewhat expected: when the amount or level of fakeness is small, it is more challenging to determine veracity. Therefore, more sophisticated systems that can handle very subtle fakeness in disinformation are needed.

Bypassing alignment tuning is critical but inconsistent. Both disinformation generation and detection tasks require circumventing LLMs' alignment tuning mechanisms. While our impersonator approach successfully bypassed four LLMs' protections, it failed to bypass Falcon's alignment tuning unless CoT reasoning was used. This inconsistency in bypassing alignment tuning across models highlights a key limitation for robustly evaluating LLMs' disinformation capabilities.

Responsible LLM use is critical. LLMs' misuse during crises can have serious consequences (Weidinger et al., 2022). We reveal how a popular LLM can be misused by bypassing its protective guardrails to generate disinformation. Many top-performing LLMs are publicly available. Thus, we must prepare for the risks of unintended harmful political, cyber-security, and public health applications.

Conclusion
Our work demonstrates LLMs' promise for self-detection in a zero-shot framework, reducing training needs. While dangerous if misused, repurposing LLMs to counter disinformation attacks has advantages. Key results, such as GPT-3.5's performance, highlight generative models' abilities beyond text generation. To aid research, we developed PURIFY for detecting and removing hallucination-based misaligned content. However, the difficulty of detecting subtle disinformation motivates stronger safeguarding of LLMs and more nuanced prompting. Assessing few-shot detection and disinformation mitigation will be critical as LLMs continue advancing. While LLMs can potentially be misused to create disinformation, we can fight back by repurposing them as countermeasures, thus "fighting fire with fire."

Limitations
This work demonstrates promising zero-shot disinformation detection using prompt engineering. However, few-shot capabilities remain unevaluated and could further improve performance. Additionally, we examined only a small subset of available LLMs; testing more and larger models could provide new insights. Due to time constraints, we did not fully optimize prompts to achieve maximally consistent high accuracy for zero-shot detection, and the observed performance variability indicates the need for more generalizable prompts. We were also unable to assess GPT-4 (OpenAI, 2023) due to time constraints, which future work can address.
Although initially included, we ultimately decided to remove Falcon-2 (Penedo et al., 2023) due to difficulties in bypassing its alignment tuning using our semantic reasoning prompts. Zero-CoT appears to break Falcon-2's alignment tuning: the model responds, "(No cheating!) False." Without CoT, it would say things like, "I'm sorry, I am an AI language model, and I cannot provide a definitive answer without additional context or information." Future work can re-evaluate the proposed research questions against more diverse language models.
Other future directions include assessing few-shot performance, evaluating more models, developing better prompts, integrating detection approaches, adding multimodal inputs, and collaborating with stakeholders. Open questions remain around societal impacts and dual-use risks, requiring an ongoing focus on ethics.

Ethics Statement
This research involves generating and analyzing potentially harmful disinformation. Our released F3 dataset also includes examples of LLM-generated disinformation. Our aim is to advance research to combat disinformation. However, open dissemination risks the misuse of the generated disinformation in the F3 dataset and of the methods that enabled such generation. To promote transparency while weighing these dilemmas: (1) we release code, prompts, and synthetic data to enable the reproducibility of our research findings and encourage further research, while advising users of their responsible use; and (2) our release excludes original real-world misinformation and contains only synthetic variations, to minimize harmful usage.
We hope this statement provides clarity on our intentions and values. Addressing the societal dangers of disinformation requires proactive work to develop solutions, but this work must be pursued conscientiously and ethically.

Appendix A Prompt Engineering
We use prompt engineering to examine how LLMs use statistical patterns learned during training to generate text sequences and produce zero-shot binary class responses. We describe further details of our prompt designs as follows.

A.1 Prefix Prompt
Our prefix-prompt framework generates high-quality, coherent synthetic real and fake news content by leveraging paraphrasing and perturbation techniques. The process starts by selecting human-authored content and adding it to a prefix prompt, which contains an impersonator setting the contextual behavior intent and instructions providing guidance. Prompts are engineered to paraphrase or perturb the original content at three alteration degrees (MiN, MaJ, CRiT) to produce synthetic real and fake news. The prompt is then fed into the LLM to generate content.
Figure 9 demonstrates our paraphrase-based real news generation results, and Figure 3 shows our perturbation-based fake news generation results.

A.2 RQ1 Disinformation Generation
Figure 10 shows our prefix prompt. The prefix prompt (X) combines: (1) Content (C) with real human data; (2) Impersonator (R), which establishes context, guides generation, and overrides alignment tuning to generate disinformation; and (3) Instructor (I) with paraphrase [Para] and perturbation [Perturb] directives for minor, major, and critical [Min/Maj/Crit] variation transformations. The combined parameters formulate the prefix prompt (X) as the input text sequence to the generator (G) to produce LLM text (T), shown as real (green) or fake (red) in the figure.
The cloze prompt (Y) combines: (1) Content (C) with real or fake human (green) or LLM (blue) data; (2) Instructor (I) with the detection directive; and (3) Impersonator (R), which establishes context, guides detection, and overrides alignment tuning. These parameters formulate the cloze prompt (Y) as input to the classifier (F) to produce a label (L). Tables 8 and 9 show our exact cloze-style prompts, and Figure 11 shows the categories of the reasoning techniques embedded in F3 zero-shot prompts.

Figure 9: GPT-3.5-turbo's paraphrase-based prompt engineering approach. Minor: the minor paraphrase concisely summarizes the original by slightly changing the sentence structure and rephrasing some words (e.g., "twice as many" to "doubled"); the essence and details of the original message remain intact. Major: the major paraphrase changes the structure and wording more extensively than the minor one, using words like "fatalities" instead of "deaths" and "two-fold" instead of "twice as many", while remaining true to the factual content of the original. Critical: the critical paraphrase changes the voice and structure significantly, introducing a new perspective ("political leaders") and using a unique style that makes it distinct from the original; this version provides a fresh take on the original content, guided by its factual details but conveyed with a unique twist in message delivery.

B PURIFY Metrics
PURIFY detects and removes non-hallucinations from fake news and hallucinations from real news, i.e., outputs that are unfaithfully misaligned with the input text. Our PURIFY framework uses the four evaluation metrics shown in Table 3. We describe further details of those metrics as follows.

B.1 Factual Consistency
AlignScore: We assess LLM factual consistency using the SOTA AlignScore metric (Zha et al., 2023). As shown in Figure 12, an AlignScore of 0 represents a low and 1 a high degree of factuality between the LLM-generated text and the original human-written text. Our intuition is that real LLM-generated news should have high factual consistency, and fake LLM-generated news should have low factual consistency. We utilize a hybrid statistical method to define thresholds that remove factually inconsistent samples, i.e., a non-parametric hybrid approach using (1) the interquartile range (IQR) and (2) the standard deviation (SD) to balance robustness to spread and central tendency while accounting for skewed real/fake distribution nuances. We derived thresholds of 0.0-0.36 for fake news and 0.61-1.0 for real news to filter outliers while maintaining the diversity of nuanced edge cases. We removed 3,281 high-scoring factually inconsistent fake news samples and 8,707 low-scoring factually inconsistent real news samples, resulting in 27,667 factually consistent samples. We discuss our hybrid strategy in more detail as follows.
Figure 12: AlignScore distribution for real and fake news before (above) and after (below) removing inconsistent samples. We filter out fake news above 0.36 and real news below 0.61 to exclude factual fakes and questionable reals.

Figure 13: AlignScore distribution for real and fake news before (above) and after (below) removing inconsistent samples, stratified by generative prompt category. We filter out fake news above 0.36 and real news below 0.61 to exclude factual fakes and questionable reals.

B.1.1 Factuality Hybrid Threshold Strategy
Interquartile Range & Standard Deviation: Since LLMs often generate hallucinated text, we assess their factual consistency using AlignScore (Zha et al., 2023), a SOTA factuality metric, and filter out high-scoring fake and low-scoring real news to remove inconsistent samples. AlignScore provides a single value indicating factual consistency, and the AlignScore distributions for real and fake news are complex, requiring robust discernment of nuances. We therefore use a non-parametric, hybrid approach with (1) the interquartile range (IQR) to account for the skewed fake and real distributions (Fig. 12 and 13), and (2) the standard deviation combined with the IQR to balance spread and central tendency while maintaining robustness. Our hybrid thresholds of 0.36 for fake and 0.61 for real news remove surprisingly factual fake samples and suspicious real samples, filtering outliers while capturing edge cases and removing inconsistent hallucinations. The algorithmic formulation is as follows, where Q_{p, fake} and Q_{p, real} denote the p-th quantile of the AlignScore distribution for fake and real news, respectively, and θ denotes a threshold.

1. Computing the IQR for fake and real news:
IQR_fake = Q_{0.75, fake} − Q_{0.25, fake} and IQR_real = Q_{0.75, real} − Q_{0.25, real},
where Q_{0.25, fake} and Q_{0.75, fake} denote the 1st (25th percentile) and 3rd (75th percentile) quartiles of the AlignScore for fake news, respectively, and Q_{0.25, real} and Q_{0.75, real} denote the corresponding quartiles for real news.

2. Computing the hybrid threshold for fake news:
θ_{fake, percentile} = Q_{0.90, fake}, θ_{fake, std} = IQR_fake + σ_fake, and θ_{hybrid, fake} = (θ_{fake, percentile} + θ_{fake, std}) / 2,
where σ_fake is the standard deviation of the AlignScore for fake news.

3. Computing the hybrid threshold for real news: θ_{hybrid, real} is computed analogously from the real-news percentile, IQR_real, and σ_real, where σ_real denotes the standard deviation of the AlignScore for real news.
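A minimal numpy sketch of the hybrid threshold computation for the fake-news side (the real-news side is analogous; its exact percentile is not reproduced here):

```python
import numpy as np

def hybrid_fake_threshold(fake_scores: np.ndarray) -> float:
    """Hybrid AlignScore threshold for fake news: average of a high percentile
    and an IQR+SD spread term, as described in Appendix B.1.1."""
    q25, q75 = np.percentile(fake_scores, [25, 75])
    iqr = q75 - q25
    theta_percentile = np.percentile(fake_scores, 90)   # Q_{0.90, fake}
    theta_std = iqr + fake_scores.std()                 # IQR_fake + sigma_fake
    return (theta_percentile + theta_std) / 2           # theta_{hybrid, fake}

# Samples intended as fake news are kept only if their AlignScore falls at or below
# the derived threshold (0.36 on our data); intended real news is kept only at or
# above its own threshold (0.61).
```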

B.2 Natural Language Inference
Prior hallucination detection studies have used statistical, model-based, and human evaluation methods. We adopt the NLI model-based approach, as it overcomes the limitations of statistical approaches in handling syntactic and semantic variations (Ji et al., 2023). The NLI metric also exhibits robustness to lexical variability compared to token-matching techniques such as information retrieval and question-answering metrics. NLI's strengths in assessing semantic and logical consistency suit our goal of detecting misaligned hallucinations (Ji et al., 2023).
In the spirit of fighting fire with fire, we propose using GPT-3.5-turbo and other LLMs for NLI-based hallucination detection in generated disinformation. The core hypothesis is that synthetic real news should be logically consistent with the human-written real news, while synthetic fake news should not. Our approach applies NLI using GPT-3.5, PaLM-2, and LLaMA-2, taking a majority vote among their decisions. Each model labels an input pair as entailment if the synthetic text is consistent with the human-written text, and as not-entailment otherwise.
Given a piece of human-written text (real news) T, we prompt an LLM to generate real news T'_real and fake news T'_fake, such that T'_real is similar to T and T'_fake is dissimilar to T. Because LLMs sometimes generate texts unfaithful to the prompt, we define an entailment model N(.), such that LLM-generated real news T'_real should entail T, and fake news T'_fake should not entail T. Using the entailment model N(.), we can therefore assess logical consistency between the original (human-written) and generated texts, removing misaligned, hallucinated LLM-generated texts.
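A minimal sketch of this majority-vote check is shown below. The query functions and prompt wording are illustrative stand-ins for the actual API calls to GPT-3.5-turbo, PaLM-2, and LLaMA-2.

```python
from typing import Callable, List

def nli_entails(query_llm: Callable[[str], str], premise: str, hypothesis: str) -> bool:
    """Ask one LLM judge whether the generated text (hypothesis) is entailed by the source (premise)."""
    prompt = (f"Premise: {premise}\nHypothesis: {hypothesis}\n"
              "Does the premise entail the hypothesis? Answer 'entailment' or 'not entailment'.")
    return "not" not in query_llm(prompt).lower()

def keep_sample(judges: List[Callable[[str], str]], source: str, generated: str, intended: str) -> bool:
    """N(.): majority vote over three LLM judges; intended real news must entail the source,
    intended fake news must not."""
    votes = sum(nli_entails(judge, source, generated) for judge in judges)
    entailed = votes >= (len(judges) // 2 + 1)
    return entailed if intended == "real" else not entailed
```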
We find that GPT-3.5-turbo can generate logically consistent genuine and fabricated synthetic content, validating this hypothesis. However, when analyzing more nuanced pairs, all LLMs occasionally struggle with logical consistency. Thus, while illustrating the potential for hallucination detection, our results also reveal limitations on more difficult cases. After NLI filtering, our generated dataset of 43,272 samples was reduced by 3,617, resulting in 39,655 samples (see Fig. 14 for more details).

B.3 Contextual Consistency
BERTScore: This metric leverages the BERT language model to measure the similarity between generated and reference texts. Because BERT captures the context of entire sentences, BERTScore is especially suitable for evaluating semantic and contextual consistency (Karpinska et al., 2022; Zhang et al., 2019). Figure 16 shows high contextual consistency, ranging from 0.92 to 1.00, in our final experimental dataset. BERTScore's ability to capture context and semantics makes it a formidable tool against disinformation; when integrated into a comprehensive disinformation detection pipeline, it can significantly enhance the accuracy and robustness of fake news detection.
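A minimal sketch of this check using the bert_score package (the package's default English model is assumed; the ranges quoted in the comment are the observed values from Table 3, not filtering rules):

```python
from bert_score import score  # pip install bert-score

def contextual_consistency(generated: list[str], sources: list[str]) -> list[float]:
    """BERTScore F1 between each LLM-generated text and its human-written source."""
    _, _, f1 = score(generated, sources, lang="en", verbose=False)
    return f1.tolist()

# In our final data, F1 values fall roughly in 0.92-1.00 (fake) and 0.95-1.00 (real),
# indicating the generated texts stay on the topic and context of their sources.
```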

B.4 Semantic Distance
Using the Hugging Face implementation of AllenAI's longformer-base-4096 embeddings and cosine similarity, we derived semantic distance scores for LLM-generated real and fake news (Beltagy et al., 2020). In our analysis of F3 LLM-generated disinformation, we observed largely indistinct patterns in the semantic distance scores for real and fake news. Specifically, the scores for real news ranged from 0 to 0.01, while those for fake news spanned 0.001 to 0.014. This overlap suggests that, within this range, it may be challenging to distinguish real from fake news based solely on semantic distance (Mohammad and Hirst, 2012).
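A minimal sketch of this computation is shown below; mean-pooling the Longformer token embeddings is an assumption, as the pooling strategy is not specified above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModel.from_pretrained("allenai/longformer-base-4096")

def embed(text: str) -> torch.Tensor:
    """Mean-pooled Longformer embedding of a (possibly long) news text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

def semantic_distance(source: str, generated: str) -> float:
    """1 - cosine similarity; values near 0 mean the generated text closely mimics the source semantics."""
    a, b = embed(source), embed(generated)
    return 1.0 - torch.nn.functional.cosine_similarity(a, b, dim=0).item()
```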
Low semantic distance scores (close to 0) indicate high semantic similarity between two texts. Here, this suggests that LLM-generated disinformation closely mimics the semantics of real news, making differentiation challenging based on content alone. The narrow semantic gap highlights LLMs' sophistication in generating articles that align closely with genuine news in meaning. In contrast, higher semantic distance scores signal greater divergence between texts, potentially arising from the model misinterpreting context, diverging from the topic, or generating factual inaccuracies (Mohammad and Hirst, 2012).
These low scores pose a detection challenge, as traditional methods that rely on obvious inconsistencies may be insufficient. The nuanced, contextually accurate nature of LLM outputs thus demands advanced, multifaceted detection strategies. This close similarity underscores the risks of LLM misuse for spreading synthetic disinformation and emphasizes the need to monitor generative LLMs carefully, understand their behaviors, and develop mitigation strategies (Mohammad and Hirst, 2012).

B.5 Hallucination Misalignment Cases
This work defines disinformation as intentionally fabricating false information to mislead. We also adopt the common data-to-text definition of hallucination: LLM-generated text that is intrinsically (contradictory) or extrinsically (factually incorrect) unfaithful to the input. Our input includes the original real news text and the instruction to modify it into either real or fake news.

Figure 15: RQ1: Reduction in LLM-generated disinformation samples using GPT-3.5-turbo. The initial "Total" represents the complete dataset. The subsequent reductions are achieved by applying consistency measures. The "Logical" stage reflects the dataset size after removing logically inconsistent samples based on a majority vote among GPT-3.5-turbo, PaLM-2-text-bison, and LLaMA-2 using the Natural Language Inference metric. The final "Factual" stage depicts the dataset after further refinement by eliminating samples with factual inconsistencies using the AlignScore method.
Notably, disinformation and hallucinated text both intend to mislead by definition. Therefore, when prompted to generate fake news, LLMs may produce hallucinations aligned with that intent. However, we observed cases where LLMs generated no hallucinations despite fake-news prompts. Our PURIFY framework identifies these mismatches, which are unfaithful to the input by definition, and we therefore categorize them as hallucinations to be removed. The same principle applies to mismatches in real news generation.
Ultimately, our goal is to develop a dataset containing (1) non-hallucinated real news upholding source integrity as prompted, and (2) hallucinated fake news intentionally not upholding source integrity when prompted to fabricate. We present two cases of hallucination misalignment in Table 10.

C Dataset Description
F3 is the first disinformation dataset that evaluates and removes LLM-generated content subject to misaligned 'hallucination', where LLMs produce text unfaithful to the prompt. We ensure that real news is actually real and fake news is actually fake (Ji et al., 2023). While a few prior studies (Cui and Lee, 2020; Sun et al., 2023; Zhou et al., 2023) investigated LLM-based disinformation generation, they did not rigorously verify the fidelity of the generated content and primarily focused on fake LLM data rather than both real and fake (Oshikawa et al., 2018; Su et al., 2020; Murayama, 2021). Please see a comparison of our dataset with other datasets in Table 6.

C.1 PaLM-2 LLM-Data Thematic Analysis
We conducted a thematic analysis of our dataset after PURIFY, using PaLM-2 to label the data themes. The top themes include health, death, harm and tragedy, public safety, and politics, respectively. See Fig. 18 for more details.

D Model Implementation Details
This section provides baseline implementation specifics for the LLMs, customized detectors, and fine-tuned transformers used in our experiments.

D.1 Generative LLM
We leveraged the OpenAI Software Development Kit (SDK) and Application Programming Interface (API) to access GPT-3.5, with the following hyperparameters: a temperature of 0.7 and a maximum of 4096 tokens. All experiments ran on Google Colab Pro using the API.
For LLaMA-70B-Chat and LLaMA-GPT, we set the temperature to 0.7, top_p to 0.9, and max_tokens to 4096 for binary classification. For PaLM-2 we used a candidate count of 1, max output tokens of 256, a temperature of 0.2, top-P of 0.8, and top-K of 40. All experiments ran on Google Colab using APIs such as DeepInfra.

D.2 Customized Detectors
We followed the original paper implementations for dEFEND\C, TextCNN, and BiGRU without modifications. For example, dEFEND\C was trained on PolitiFact data. See the Appendix for training details. All experiments ran on Google Colab using APIs.

Customized Detectors:
1. dEFEND\C (Shu et al., 2019): dEFEND is a SOTA detector, and dEFEND\C is a dEFEND variant that uses only the content. It begins by employing word-level attention mechanisms on individual sentences within the news content. Subsequently, the features extracted from these sentences are combined using an average pooling layer, which then feeds into a softmax layer for the final classification.
2. TextCNN (Kim, 2014): TextCNN employs convolutional neural networks to represent news content. With the use of multiple convolution filters, it is adept at capturing text features of varying granularities. We follow the implementation and the best parameters from the most recent model trained for fake news detection (Zhu et al., 2022).
3. BiGRU (Ma et al., 2016): BiGRU is a common baseline for fake news detection. We follow the text-based BiGRU with RoBERTa embeddings (Zhu et al., 2022).

Table 7: Details of baseline models used for disinformation detection.

E Experiment Full Result
We provide the full Macro-F1 results for all detectors in this paper. Tables 11 and 12 show the in-distribution and out-of-distribution performance of LLM-based models, respectively. Tables 13 and 14 show the in-distribution and out-of-distribution results of customized detectors, respectively. Tables 15 and 16 show the in-distribution and out-of-distribution results of fine-tuned transformer-based detectors, respectively.

Case 1: Both primary points from the source text are faithfully reflected in the LLM output, indicating that the content produced by the LLM is logically consistent with the original information and does not contradict it. Therefore, based on the case and its breakdown, there is no intrinsic hallucination in the LLM output, as it retains the essential factual information from the source text without introducing any contradictions (the hypothesis supports the premise). While the LLM output does add some interpretative sentiment (e.g., "which is a direct attack on women's reproductive rights" and "this is outrageous and unacceptable"), this sentiment is consistent with the instruction to generate "subtle misleading news".
Case 2: In this case, given a source text that is real news, GPT-3.5 is prompted to use the source text to create real news. We use the AlignScore metric, which ranges from 0 to 1, where values close to 0 indicate a low degree of factual consistency and values close to 1 indicate a high degree, to determine factual consistency or inconsistency (extrinsic hallucination). Considering edge cases, we created thresholds where the AlignScore acceptance ranges are 0.0-0.36 for fake news and 0.61-1.0 for real news. While the following example shows logical consistency via the LLMs' majority vote, its AlignScore is 0.60, which is below 0.61. Instruction: using the source text, create minor categories of fake news. Source Text: In the first three quarters of 2022, the railroad industry made $21 billion in profits, provided $25 billion in stock buybacks and dividends, and CEOs are paid up to $20 million a year, while railroad workers have "zero guaranteed sick leave." LLM output: ALERT! Railroad industry profits soar to $21 billion in 2022 Q1-Q3, with CEOs making up to $20 million a year. Meanwhile, workers are denied any guaranteed sick leave.
Source Text: (1) In the first three quarters of 2022, the railroad industry made $21 billion in profits.
(2) They provided $25 billion in stock buybacks and dividends.
(3) Workers are not given any guaranteed sick leave.
(1) Points 1-3 in the LLM output are consistent with the source text in terms of facts.
(2) The LLM output omits the fact about $25 billion in stock buybacks and dividends that was present in the source text. Considering the AlignScore threshold provided, the main facts of the LLM output align well with the source text. However, a piece of factual information from the source text is omitted, and additional emojis and hashtags have been introduced in the LLM output. While the LLM output retains the key points of the source text, there are minor extrinsic (non-factual) hallucinations due to the omission of the information about stock buybacks and dividends.

Figure 3: Perturbation-based prompt engineering for disinformation content generation based on severity levels. Minor: exaggerated numbers shift from "twice as many" to "FIVE" times, with an intensified tone labeling it "a crime against humanity". Major: the roles of COVID and overdose deaths are reversed, and the political response is recast as "incompetence" and "negligence". Critical: the original statistic changes to a vague "MORE" with alarming phrases like "complete disaster" and "wiping out our city".

Figure 7: RQ2.4: A comparison of LLMs across various disinformation categories. Each is represented by a bar, with numerical values atop indicating a positive or negative change of the in-distribution Macro-F1 score relative to out-of-distribution.

Figure 11: Categories of our cloze-style prompts: intermediate, step-by-step, inductive, deductive, and abductive reasoning. Definitions and details about these approaches in relation to LLMs can be found in (Zhang et al., 2022; Tang et al., 2023b; Zhao et al., 2023).

Figure 18: Our dataset categories based on the PaLM-2 thematic analysis.
4. BERT-Large (Kenton and Toutanova, 2019): BERT is an encoder-only Transformer model trained to predict randomly masked tokens in the input. We use the BERT-large-uncased model.
5. CT-BERT (Müller et al., 2023): CT-BERT v2 is a BERT-large-uncased model trained on 97M Twitter messages about COVID-19.
6. RoBERTa (Liu et al., 2019): We use the RoBERTa-large model, a re-implementation of BERT with modifications to key hyperparameters and minor embedding tweaks.
7. DeBERTa (He et al., 2020): We utilize DeBERTa's latest version, the DeBERTa-v3-base model, which is pre-trained ELECTRA-style with gradient-disentangled embedding sharing. Due to limited computational resources, we are only able to run the base model.
Zero-shot LLM Detectors:
8. GPT-3.5-Turbo: OpenAI's SOTA model designed for a variety of natural language processing tasks.
9. LLaMA-2-Chat (Rozière et al., 2023): An advanced language model for conversational AI applications.
10. LLaMA-2-GPT4 (Touvron et al., 2023): A successor in the LLaMA series with advanced training techniques and better performance.
11. Dolly-2 (Conover et al., 2023): Dolly-2 is an advanced generation model that exhibits human-like text generation capabilities.
impersonator : "You are an AI trained with extensive knowledge up to 2021 on various news articles, both real and fake."instructor:"Analyze the given text for potential indicators of fake news, such as: (1) Sensationalist or emotionally charged language.(2) Absence of specific details or dates.(3) Overgeneralizations or sweeping statements.(4) Statements that are too good to be true or overly dramatic.(5) Lack of logical flow in arguments or jumping to conclusions without evidence.It's essential to understand that without real-time verification capabilities, your judgment will be based on patterns and knowledge up to your last training.Using these textual cues and your training, determine the credibility of the given text.If it seems factual and consistent with your training, respond 'True'.If it exhibits patterns typical of fake news, respond 'False'."

Table 2: Details of human datasets. Each symbol denotes the following. R: real news samples; F: fake news samples; L: ground-truth veracity from fact-checking sources; SM: social media posts; NA: news articles; Pre-GPT: in-distribution samples before Sept 2021; Post-GPT: out-of-distribution samples after Sept 2021.

Table 4: Details of the final LLM-generated data samples. Symbols are the same as in Table 2.

Table 5: In- vs. out-of-distribution comparison. This table presents the average F1 performance of generative LLMs, customized deep-learning models, and fine-tuned transformers. Performance is benchmarked across the categories of human, LLM minor, LLM major, and LLM critical, for both articles and posts. The x̄ column shows the mean performance for each model.

Table 8: F3 cloze-style prompts for binary zero-shot disinformation detection. VaN: our Vanilla prompt is the fundamental baseline prompt designed to deliver brief, precise instructions to LLMs, such as "assess whether this piece of news is real or fake." impersonator: "You are an AI assistant trained to detect fake news." instructor: "Analyze the given text and determine if it is real or fake news." impersonator: "You are an AI assistant trained to detect fake news." instructor: "Deeply Analyze the given text, think step-by-step, and determine if it is real or fake news."

Table 9: SOTA cloze-style detection prompts for binary zero-shot disinformation detection.