From Chaos to Clarity: Claim Normalization to Empower Fact-Checking

With the rise of social media, users are exposed to many misleading claims. However, the pervasive noise inherent in these posts presents a challenge in identifying precise and prominent claims that require verification. Extracting the important claims from such posts is arduous and time-consuming, yet it is an underexplored problem. Here, we aim to bridge this gap. We introduce a novel task, Claim Normalization (aka ClaimNorm), which aims to decompose complex and noisy social media posts into more straightforward and understandable forms, termed normalized claims. We propose CACN, a pioneering approach that leverages chain-of-thought and claim check-worthiness estimation, mimicking human reasoning processes, to comprehend intricate claims. Moreover, we capitalize on the in-context learning capabilities of large language models to provide guidance and to improve claim normalization. To evaluate the effectiveness of our proposed model, we meticulously compile a comprehensive real-world dataset, CLAN, comprising more than 6k instances of social media posts alongside their respective normalized claims. Our experiments demonstrate that CACN outperforms several baselines across various evaluation measures. Finally, our rigorous error analysis validates CACN's capabilities and pitfalls.


Introduction
Social media have enabled a new way of communication, breaking down geographical barriers and bringing unprecedented opportunities for knowledge exchange. However, this has also presented a growing threat to society, e.g., during the 2016 US Presidential Election (Allcott and Gentzkow, 2017), the COVID-19 pandemic (Alam et al., 2021; Rocha et al., 2021; Nakov et al., 2022a), the Ukraine-Russia conflict (Khaldarova and Pantti, 2016), etc. False claims are an intrinsic aspect of fabricated news, rumors, propaganda, and misinformation. Journalists and fact-checkers work tirelessly to assess the factuality of such claims in spoken and/or written form, sifting through an avalanche of claims and pieces of evidence to determine the truth. To further address this pressing issue, several independent fact-checking organizations have emerged in recent years, such as Snopes, FullFact, and PolitiFact, which play a crucial role in verifying the accuracy of online content. However, the rate at which online information is being disseminated far outpaces the capacity of fact-checkers, making it difficult to verify every single claim. This, in turn, leaves numerous unverified claims circulating online, potentially reaching millions before they can be verified.
In light of the growing challenges faced by fact-checkers in verifying the factuality of social media claims, we propose the novel task of claim normalization. This task aims to extract and to simplify the central assertion made in a long, noisy social media post. This can improve the efficacy and curtail the workload of fact-checkers while maintaining high precision and conscientiousness. We provide a more detailed explanation of why the claim normalization task is essential and illustrate its significance in Appendix A.1.
In our problem formulation, given an input social media post, the system needs to produce a concise form of the post's central assertion that fact-checkers can easily verify. To better understand our motivation, we illustrate the task in Figure 1. The first social media post reads, 'Cyanocobalamin is a synthetic form of Vitamin B12... If you're on B12 supplements, throw them away.' This post contains some extraneous information that has no relevance for fact-checkers. As a result, they distill the information and summarize it as, 'Cyanocobalamin, the most common form of Vitamin B12, is toxic.' Fact-checkers tasked with verifying the accuracy of such noisy posts need to read through them and condense their content to obtain a concise claim that can be easily fact-checked. Unfortunately, this process can be exceedingly time-consuming. By automating the claim normalization process, fact-checkers can work more efficiently. Another aspect is that fact-checkers often choose what to fact-check based on the virality of a claim, for which they need to be able to recognize when the same claim appears in a slightly different form, and claim normalization is essential for this.
Our contributions are as follows:
• We introduce the novel task of claim normalization, which seeks to detect the core claim in a given piece of text.
• We present a meticulously curated high-quality dataset specifically tailored for claim normalization of noisy social media posts.
• We propose a robust framework for claim normalization, incorporating chain-of-thought, in-context learning, and claim check-worthiness estimation to comprehend intricate claims.
• We conduct a thorough error analysis, which can inform future research.

Related Work
Claim Analysis. Previous work has focused on distinct aspects of claims, including claim detection (Daxenberger et al., 2017; Gupta et al., 2021; Sundriyal et al., 2021; Gangi Reddy et al., 2022a,b), claim check-worthiness estimation (Hassan et al., 2017; Gencheva et al., 2017; Barrón-Cedeño et al., 2018; Jaradat et al., 2018; Vasileva et al., 2019; Barrón-Cedeño et al., 2020; Konstantinovskiy et al., 2021), claim span identification (Sundriyal et al., 2022), etc. By curating the AAWD corpus, Bender et al. (2011) pioneered the efforts in claim detection, the foremost step in the fact-checking pipeline. Following this, linguistically motivated features, including sentiment, syntax, context-free grammars, and parse trees, were frequently used (Daxenberger et al., 2017; Lippi and Torroni, 2015; Levy et al., 2017; Sundriyal et al., 2021). Recently, large language models (LLMs) have also been used for claim detection (Chakrabarty et al., 2019; Barrón-Cedeño et al., 2020; Gupta et al., 2021; Gangi Reddy et al., 2022a,b). Most previous work on claim detection and extraction primarily concentrated on adapting to text that comes from similar distributions or topics. Moreover, it often relied on well-structured formal writing. In contrast, our objective is to develop a system that specifically addresses the challenges posed by social media posts. Rather than extracting a text subspan from a post, we aim at abstractive extraction of the central claim in a simplified form, mimicking what professional fact-checkers do.
To the best of our knowledge, we are the first to address the task of claim extraction in this very practical formulation.
Text Summarization. The task of claim normalization is closely related to text summarization, where, given a lengthy document, the goal is to produce a much shorter summary. Previous work on text summarization has explored various approaches, including large pre-trained seq2seq models that generate high-quality summaries (Radford et al., 2019; Lewis et al., 2020; Raffel et al., 2020).
One issue has been the faithfulness of the summary with respect to the source. To address this, Kryscinski et al. (2020) introduced FactCC, a weakly-supervised BERT-based entailment model, which augments the dataset with artificially introduced faithfulness errors. Similarly, Utama et al. (2022) trained a model for detecting factual inconsistencies on data from controllable text generation that perturbs human-annotated summaries, introducing varying types of factual inconsistencies. Durmus et al. (2020) proposed a question-answering framework that compares answers from the summary to those from the original text.
All these approaches primarily focused on general-purpose summarization and did not provide means for models to generate summaries tailored to specific needs. To address this limitation, controlled summarization was introduced (Fan et al., 2018). One aspect of controlled summarization is length control, in which users can set their preferred summary length (Rush et al., 2015; Kikuchi et al., 2016). Recent research has also discovered that, despite their fluency and coherence, state-of-the-art abstractive summarization systems produce summaries with contradictory information.
While text summarization systems can assist in condensing social media posts into shorter summaries, their primary goal is not to ensure verifiability: they aim to capture the key points of the text rather than emphasizing the specific claims within it that need to be fact-checked. Our task of claim normalization, on the other hand, works at an entirely different level. It requires a thorough understanding of the claims made in the social media post and strives to ensure that the normalized claims are not only consistent with the original post, but are also self-contained and verifiable.
Despite the progress in text summarization, the task of claim normalization remains underexplored. In this work, we aim to tackle this challenging problem by developing a robust approach specifically tailored to the unique aspects of this task.

Dataset
Existing text summarization datasets have not specifically addressed the need for claim-oriented summaries. To address this gap, we propose a novel dataset, CLAN (Claim Normalization), consisting of fact-checked social media posts paired with concise claim-oriented summaries (known as normalized claims), created by fact-checkers as part of the verification process. As a result, our dataset is not subject to external annotation, thus averting potential biases and ensuring its high quality.

Data Collection
We gathered our fact-checked post and claim pairs from two sources: (i) the Google Fact-Check Explorer and (ii) the ClaimReview Schema.
Google Fact-Check Explorer. We acquired a list of fact-checked claims from multiple reputed fact-checking sources via Google Fact-Check Explorer's API (GFC). This data collection pipeline followed a three-step process. First, we extracted the title, which is usually a single-sentence short summary of the information being fact-checked, and the fact-checking site's URL. This step yielded a total of 22,405 unique fact-checks. We then proceeded to retrieve the social media post and the associated claim review if they were available on the fact-checking site. Because the collected posts had already undergone fact-checking and contained misleading claims, a significant number of them were no longer available for inclusion in our dataset. Moreover, many of the posts only contained images or videos, which were unsuitable for our task at hand. As a result, we were left with a considerably smaller number of relevant instances. We also noted that in certain instances, the title in the Google Fact-Check Explorer and the claim review were identical; consequently, we included only one of them in the final dataset.
The ClaimReview Schema. We targeted the ClaimReview Schema elements with an entry for reviewed items, as they were relevant to our requirements. Out of 44,478 entries, only 22,428 had this particular field; therefore, we filtered out the remaining entries. Next, we extracted all the links to social media posts and their corresponding claim reviews provided by the fact-checkers.
Example 1. Post: 'Research into the dangers of cooking with aluminum foil has found that some of the toxic metal can contaminate food. Increased levels of aluminum in the body have been linked to osteoporosis, and Alzheimer's disease.' Normalized claim: 'Cooking in Aluminum foil causes Alzheimer's Disease.'
Example 2. Post: 'Did you know when ur child turns 6. U can add them as authorized user to one of ur credit cards. Never give them card, & all payments u make from 6 to 18 goes to ur child credit too.. ur kid will have a unbelievable credit score from years of payment history.' Normalized claim: '6-year-old kids can be added as authorized users on all credit cards.'
As mentioned above, we only processed textual claims and excluded other modalities, such as audio or video. Further, we ensured that all the entries were in English.
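The filtering and deduplication steps described above can be sketched as follows. This is a minimal illustration only: the field names (`itemReviewed`, `title`, `claimReview`, `url`) are assumptions for the example, not the exact schema keys used in the actual pipeline.

```python
def filter_entries(entries):
    """Keep entries that have a reviewed item and a claim review; when the
    GFC title and the claim review are identical, keep only one copy.
    Field names here are illustrative, not the exact schema keys."""
    kept = []
    for e in entries:
        if not e.get("itemReviewed"):   # no reviewed-item field -> discard
            continue
        title = (e.get("title") or "").strip()
        review = (e.get("claimReview") or "").strip()
        if not review:                  # no claim review available -> discard
            continue
        # If title and claim review coincide, retain a single copy.
        normalized = title if title.lower() == review.lower() else review
        kept.append({"post_url": e.get("url"), "normalized_claim": normalized})
    return kept

sample = [
    {"itemReviewed": "post A", "title": "Claim X", "claimReview": "Claim X", "url": "u1"},
    {"itemReviewed": "post B", "title": "Short title", "claimReview": "Longer claim review", "url": "u2"},
    {"title": "No reviewed item", "claimReview": "Claim Y", "url": "u3"},
]
filtered = filter_entries(sample)
```

Only the second rule (title vs. claim review deduplication) is specific to the GFC source; the reviewed-item check mirrors the ClaimReview Schema filtering.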

Data Statistics and Analysis
By using both of these data collection methods and by exercising careful consideration, we curated a total of 6,388 instances. To ensure the creation of a diverse and high-quality test set, we chose posts that comprised not one, but two reference normalized claims (cf. Section 3.1). This enabled us to capture different aspects and perspectives of the normalized claims by including multiple references, thereby increasing the test set's robustness and reliability. Representative examples from our dataset are shown in Table 1. Figure 2 shows an analysis of the cosine similarities between the social media posts and the corresponding normalized claims. We can see that the cosine similarities are consistently low for most examples, demonstrating that claim normalization involves more than just summarizing the social media post. This highlights the need for a specialized effort to accurately identify, extract, and normalize the claims within social media posts.
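A post-claim similarity analysis of this kind can be sketched with a simple bag-of-words cosine; this stdlib-only version is a rough proxy, since the exact representation used for Figure 2 is not specified here.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity between two texts (illustrative proxy;
    the paper's exact vectorization may differ)."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

post = ("cyanocobalamin is a synthetic form of vitamin b12 "
        "if you are on b12 supplements throw them away")
claim = "cyanocobalamin the most common form of vitamin b12 is toxic"
sim = cosine_similarity(post, claim)
```

A low score here reflects the observation above: the normalized claim shares only a few content words with the source post.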

Proposed Approach
In this section, we explain our proposed approach, Check-worthiness Aware Claim Normalization (CACN), which aims to integrate task-specific information with large language models (LLMs). We focus our experiments on GPT-3 (text-davinci-003) (Brown et al., 2020). Our approach amalgamates two key ideas: (i) chain-of-thought prompting and (ii) reverse check-worthiness.
Chain-of-Thought Prompting. Chain-of-thought (CoT) prompting has emerged as a veritable tour de force within LLMs (Wei et al., 2022). Instead of undergoing the laborious process of fine-tuning individual model checkpoints for every new task, we use CoT to navigate the complexity of claim normalization via step-by-step reasoning. To accomplish this, we use claim check-worthiness, as described in the following subsection. This enables the model to iteratively enhance its comprehension and effectively generate precise normalized claims while eliminating the need for extensive fine-tuning.
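A step-by-step prompt of this kind might be assembled as below. The exact wording of the reasoning steps in Figure 3 is not reproduced here, so the template strings are illustrative assumptions based on the check-worthiness criteria the approach relies on.

```python
# Illustrative reasoning steps (assumed wording; the actual prompt is in Figure 3).
REASONING_STEPS = [
    "Identify the central claim made in the post.",
    "Is the claim self-contained and verifiable?",
    "How likely is the claim to be false?",
    "Is the claim of general public interest, and potentially harmful?",
    "Is the claim worth fact-checking overall?",
]

def build_cot_prompt(post, steps=REASONING_STEPS):
    """Assemble a chain-of-thought prompt that walks the model through
    the check-worthiness criteria before emitting the normalized claim."""
    lines = [f"Post: {post}", "Let's reason step by step:"]
    lines += [f"{i}. {s}" for i, s in enumerate(steps, 1)]
    lines.append("Normalized claim:")
    return "\n".join(lines)

prompt = build_cot_prompt("Cyanocobalamin is a synthetic form of Vitamin B12...")
```

The resulting string would then be sent to the LLM, whose completion after "Normalized claim:" is taken as the output.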
Our proposed prompt example is shown in Figure 3. Chain-of-thought approaches a complicated problem by breaking it into a sequence of simpler intermediate stages.
Reverse Check-Worthiness. The idea of reverse check-worthiness originates from the task of check-worthiness estimation, which in turn is an integral part of the manual fact-checking process (Nakov et al., 2021b). We leverage check-worthiness to steer the model's attention toward salient and pertinent information. By giving the model the ability to produce rationales in natural language that clearly explain the sequence of reasoning stages leading to the solution, we strengthen its capacity for cognitive reasoning. Based on prior research on claim check-worthiness (Barrón-Cedeño et al., 2018; Shaar et al., 2021; Nakov et al., 2022b), we direct our model to prioritize claims that meet specific criteria within the given social media post. These criteria include identifying claims within social media posts that (i) contain verifiable factual statements, (ii) have a higher likelihood of being false, (iii) are of general public interest, (iv) are likely to be harmful, and (v) are worth fact-checking. For instance, in Figure 3, the claim normalization process begins by identifying the central claim within the input social media post. Subsequently, we assess the claim's verifiability, i.e., whether it is self-contained and verifiable (e.g., as opposed to not containing a claim or expressing an opinion). We further evaluate the likelihood of the claim being false and its overall check-worthiness. This step-by-step process ensures a comprehensive analysis of the central claim's characteristics, allowing for effective claim normalization. By incorporating these aspects into our approach, we aim to improve the model's ability to identify and prioritize claims that require scrutiny and verification.
Evaluation Measures. To evaluate lexical overlap, we use ROUGE (1, 2, L) and BLEU-4 (Papineni et al., 2002). We further use METEOR (Banerjee and Lavie, 2005) and BERTScore (Zhang et al., 2019) to assess the similarity between the gold and the generated normalized claims.
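For reference, ROUGE-L reduces to a longest-common-subsequence computation over tokens. The sketch below is stdlib-only and purely illustrative; the reported scores use established libraries (e.g., py-rouge), which also apply their own tokenization.

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence over token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a):
        for j, tb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ta == tb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference, hypothesis):
    """ROUGE-L F1 between a gold and a generated normalized claim."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("cooking in aluminum foil causes alzheimer's disease",
                   "aluminum foil causes alzheimer's disease")
```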
Zero-Shot Learning. Zero-shot learning aims to apply the previously acquired capabilities of pre-trained language models (PLMs) to related tasks without task-specific training. We hereby assess its suitability for the claim normalization task.
Few-Shot Learning. We adopt few-shot learning with 10, 20, 50, and 100 training examples. This gradual exposure to additional labeled data aims to enhance the models' ability to generate accurate and contextually appropriate normalized claims.
Prompt Tuning. Prompt-tuning entails adding a specific prefix to the model's input, customized to the downstream task (Zhang et al., 2023). We investigate the impact of affixing different prompts to the given posts on the performance of T5-based and GPT-3 models. To exert control over the generated normalized claims, we use five control aspects: tokens, abstractness, number of sentences, claim-centricity, and entity-centricity. A comprehensive description of all these prompts is given in the Appendix (A.2).
In-Context Learning. LLMs have the remarkable ability to tackle diverse tasks with a minimal number of examples given in in-context learning prompts (Brown et al., 2020). We use GPT-3 (text-davinci-003) with three different prompts: (i) direct prompt (DIRECT), (ii) question-guided prompt (Q-GUIDED), and (iii) zero-shot chain-of-thought (ZS-CoT). Detailed prompt templates are given in Appendix A.3.
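The three prompt styles could be assembled roughly as below. The exact templates are in Appendix A.3 and are not reproduced here, so these strings are assumptions meant only to convey the structural differences between DIRECT, Q-GUIDED, and ZS-CoT.

```python
def make_prompt(post, style, examples=()):
    """Assemble an in-context prompt in one of three illustrative styles;
    the paper's actual templates (Appendix A.3) may differ in wording."""
    demo = "".join(f"Post: {p}\nNormalized claim: {c}\n\n" for p, c in examples)
    if style == "direct":       # demonstrations, then the target post
        return f"{demo}Post: {post}\nNormalized claim:"
    if style == "q_guided":     # add a guiding question before the answer slot
        return (f"{demo}Post: {post}\n"
                "What is the central claim made in this post?\nNormalized claim:")
    if style == "zs_cot":       # zero-shot CoT: no demonstrations, CoT trigger
        return f"Post: {post}\nLet's think step by step.\nNormalized claim:"
    raise ValueError(f"unknown style: {style}")

examples = [("Root canal causes cancer, rip their teeth out!",
             "Root canal treatment causes cancer.")]
direct = make_prompt("Vitamin B12 supplements are toxic...", "direct", examples)
```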

Experiments and Evaluation
Our experiments reveal that CACN outperforms all baselines across most evaluation measures. We further examine all systems, aiming to answer the research questions listed below.
Do meticulously crafted prompts enhance the performance of generative models? The findings exhibit a significant performance improvement when using prompt-tuning, specifically with in-context examples. Table 3 shows the effectiveness of various prompts across all evaluation measures. A notable enhancement of approximately 2-3 points absolute is observed for all semantic measures when transitioning from conventional prompts to our proposed approach while using the same in-context examples. This emphasizes the importance of our framework tailored for the specific task. Moreover, an upsurge in ROUGE-F1 scores (1, 2, and L) emphasizes the resemblance between the generated normalized claims and those created by humans. This, in turn, validates the incorporation of the "reverse check-worthiness" chain-of-thought process, which effectively integrates task-specific information into the generative system. We also attempt prompt-tuning in a zero-shot setup; the results are shown in the Appendix (A.2). To summarize, the deliberate design of prompts, along with in-context learning, substantially enhances the performance of generative models.
Is training models on a specific task less effective than in-context learning with a few examples? We observe substantial disparities in the performance of models trained on task-specific data compared to using in-context learning with a limited number of examples, as shown in Table 3.
We can see that the models exposed to in-context examples showcase superior performance, highlighting their efficacy in capturing task-specific patterns. While the trained models exhibit excellence in lexical metrics, their performance in semantic metrics is noticeably lower. Notably, BART-Large, trained on our dataset, outperforms other trained models by sizable margins. These results strongly underline that, within the realm of LLMs, incorporating prompt-tuning with in-context learning holds more promise, leading to enhanced generalization capabilities.

Do models demonstrate inherent proficiency in generating normalized claims with minimal or no prior training?
We examine the potential benefits of zero-shot and few-shot learning to investigate the models' inherent proficiency in generating normalized claims. The zero-shot and the few-shot results are shown in Table 4. Zero-shot learning, which relies solely on the pre-trained language model without any task-specific fine-tuning, performs quite well. On the other hand, few-shot learning does not result in significant improvements. Surprisingly, the models trained using few-shot learning perform slightly worse than zero-shot learning, where the models have no exposure to task-specific data. After training on ten examples, the performance of FLAN-T5-Large drops by 6 BERTScore points absolute, and it continues to decline as more examples are provided. This unexpected result suggests that few-shot learning may be unsuitable for this intricate and complex task. The limited number of examples provided during few-shot learning may have been insufficient for the models to generalize and to capture the underlying patterns of normalized claims effectively. Moreover, introducing task-specific data might have introduced conflicting information, as these models were never trained on this task, leading to a degradation in performance.

Qualitative Analysis
Error Analysis. To understand the performance of CACN, we qualitatively analyze the errors committed by our model. Table 5 shows some randomly selected instances from our test dataset, along with gold normalized claims and predictions from CACN. For comparison, we also show predictions from the two best-performing baselines, BART-Large and DIRECT. Naturally, the predictions in the fine-grained analysis are much more intricate than in the coarse-grained quantitative setup. During our manual qualitative analysis, we unveiled several interesting patterns and errors in the generated responses. For example, although BART-Large generated responses with a high BERTScore in example 1, we noticed that the factual alignment is incorrect, making this model untrustworthy for downstream tasks such as claim check-worthiness estimation and claim verification. In contrast, our proposed model produced a response that is both correct and precise. The response generated by DIRECT is also accurate, but it is excessively long, which contradicts the objective of normalized claims being concise and straightforward. This problem is also evident in example 3, where DIRECT produces a factually correct but overly long claim.
In example 2, we observe that BART-Large demonstrates the lowest number of hallucinations and adheres closely to the input social media post. In contrast, our model's BERTScore was the worst for this example. However, upon closer inspection, we noticed that the normalized claim that our model generated was indeed correct and most relevant for fact-checking. These findings highlight the complexity and the trade-offs involved in generating normalized claims. While certain models may excel in certain cases, there is often a compromise in other aspects, such as factual accuracy and conciseness.
Human Evaluation. We conducted an extensive human evaluation to assess the linguistic proficiency of the generated normalized claims. Building upon the measures proposed by van der Lee et al. (2021), we evaluated the generated claims based on four aspects: fluency, coherence, relevance, and factual consistency (see Appendix A.6 for more detail). We further introduced the parameter of self-contextualization to measure the extent to which the normalized claims included the necessary context for fact-checking within themselves. Each of these measures played a unique and vital role in evaluating the quality of the generated claims. To conduct the evaluation, we randomly selected 50 instances from our test set and assigned five human evaluators to rate every normalized claim on a scale of 1 to 5 for each of these five aspects. All evaluators were fluent English speakers with a Bachelor's or Master's degree.
To ensure reliability, each example was evaluated by all five evaluators independently, and then we averaged their scores.
The average scores are presented in Table 6. For comparison, we also included the results for the best-performing baseline systems, namely BART-Large and DIRECT. Our analysis reveals that the outputs generated by CACN exhibit qualitative superiority compared to the baseline systems across all dimensions.

Conclusion and Future Work
We introduced the novel task of claim normalization, which holds substantial value on multiple fronts. For human fact-checkers, claim normalization is a useful tool that can assist them in effectively removing superfluous text from subsequent processing. It also benefits downstream tasks such as identifying previously fact-checked claims, estimating claim check-worthiness, etc. We further compiled a dataset of social media posts comprising over 6k posts and their normalized claims. We benchmarked this dataset with a novel approach, CACN, and showed its superior performance compared to different state-of-the-art generative models across multiple assessment measures. We also documented our data collection process, providing valuable insights for future research in this domain. In future work, we plan to extend the dataset to new languages. We also plan to use more powerful LLMs.

Limitations
While our study has made major contributions to claim normalization, it is critical to recognize and to address its potential limitations. During our data collection process, we excluded claims about images and videos. Yet, we believe that including multimodal information may help improve claim normalization. Another key problem is that each fact-checking organization adheres to its own set of editorial norms, procedures, and subjective interpretations of claims. These variations in writing style and judgment make it challenging to establish a standardized claim normalization. Addressing this issue will necessitate developing consensus or guidelines among fact-checking organizations in order to ensure greater consistency and coherence in claim normalization. By acknowledging and addressing these limitations, we endeavor to improve the reliability and soundness of claim normalization systems in the future.

Ethics and Broader Impact
Data Bias. It is important to acknowledge the possibility of biases within our dataset. Our data collection process involves gathering normalized claims from multiple fact-checking sites, each with its own set of editorial norms, procedures, and subjective interpretations. These elements can introduce systemic biases that impact the overall assessment of normalized claims. We note, however, that these biases are beyond our control.
Environmental Footprint. Large language models (LLMs) require a substantial amount of energy for training, which can contribute to global warming (Strubell et al., 2019). Our proposed approach, on the other hand, leverages few-shot in-context learning rather than training models from scratch, leading to a lower carbon footprint. It is worth mentioning that using LLMs for inference still consumes a considerable amount of energy, and we are actively seeking to reduce it by using more energy-efficient techniques.
Broader Impact and Potential Use. Our model can be of interest to the general public and can save time for human fact-checkers. Its applications extend beyond fact-checking to other downstream tasks such as detecting previously fact-checked claims, claim matching, and even estimating the check-worthiness of new claims.

A Appendix
A.1 Task Motivation
Claim normalization holds significant promise for combating the spread of misinformation by streamlining fact-checking processes and enhancing the reliability of retrieved evidence. To substantiate our hypothesis regarding the effectiveness of claim normalization, we conducted a well-structured retrieval experiment using the Google API. The objective was to demonstrate the practical benefits of claim normalization in assisting fact-checkers. We randomly selected a sample of 35 instances from our dataset, encompassing social media posts and their normalized claims. Leveraging the capabilities of the Google API, we sought the top-5 most relevant articles for each post and its normalized claim. In a meticulous evaluation process, three annotators individually assessed the relevance (0 or 1) of each retrieved article to the input (post or normalized claim). We then used majority voting to determine the final relevance score for each retrieved article. As depicted in Table 7, the results of our experiment consistently demonstrated the advantage of normalized claims in evidence retrieval. In top-k precision evaluations for various values of k (1, 3, and 5), normalized claims consistently outperformed their corresponding source posts. This observation indicates that claim normalization is not merely a theoretical concept, but significantly enhances the efficiency of evidence retrieval, resulting in more concise and effective tools for aiding the fact-checking process.
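The majority-voted precision@k computation described above can be sketched as follows; the judgment values below are toy numbers for illustration, not the paper's actual annotations.

```python
def majority(labels):
    """Final relevance for one article = majority vote over 0/1 annotator labels."""
    return 1 if sum(labels) * 2 > len(labels) else 0

def precision_at_k(judgments, k):
    """judgments: for each query, a rank-ordered list of per-article
    annotator-label lists. Returns mean precision over queries at cutoff k."""
    per_query = [sum(majority(article) for article in q[:k]) / k for q in judgments]
    return sum(per_query) / len(per_query)

# Toy data: two queries, five retrieved articles each, three annotators per article.
judgments = [
    [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 1]],
    [[1, 1, 1], [0, 0, 1], [0, 1, 1], [1, 0, 1], [0, 0, 0]],
]
p1 = precision_at_k(judgments, 1)
p5 = precision_at_k(judgments, 5)
```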
Table 7: Top-k precision of evidence retrieval for source posts vs. normalized claims.

                  P@1   P@3   P@5
Post              0.82  0.64  0.58
Normalized Claim  0.88  0.73  0.69

A.2 Prompt-Tuning
Prior work (2020) demonstrated that prompt-tuning could enable controllable text generation in T5-based models. We also investigate the impact of affixing different prompts to the given input on the performance of T5-based models for normalized claim generation, along with GPT-3 (text-davinci-003) (Brown et al., 2020). We experimented with various prompts suffixed to the input text before inference, in a zero-shot setting. We discuss our different prompts P_i below.
Uncontrolled. We investigated the use of the traditional prompt 'summarize' for uncontrollable models. This prompt lacks specific control signals, making it uncontrolled, as it does not provide explicit guidance with specific attributes.
Token Limit. We found that normalized claims written by fact-checkers typically adhered to around ten words, as shown in Figure 4. Thus, in order to control the length of the normalized claims, we used the following prompt: "summarize within the length of 10 tokens".
Abstractness. Abstractness quantifies how much the generated text's words and phrases differ from those of the input: a fully abstractive summary expresses the central points of the input in very different words and word sequences compared to the original input. More precisely, the more n-grams overlap between a summary and its original document, the less abstractive the summary is. Thus, in order to control the abstractness of the generated normalized claims, we use the prompt "summarize with abstractness of a", where a represents a value within the range [0, 1], denoting the desired level of abstractness. Inspired by Dreyer et al. (2023), we compute the abstractness score a_i for each pair of a post p_i and a normalized claim s_i, and then we take an average across all examples:

a_i = 1 - X(p_i, s_i),    a = (1/N) * sum_{i=1}^{N} a_i,

where X is the harmonic mean of the unigram overlap precision, the bigram overlap precision, and the longest subsequence overlap precision. We found the average abstractness a to be around 0.8.
Single Sentence. The normalized claims written by fact-checkers often consist of a concise single-sentence summary of the post. We use the prompt "summarize in one sentence" in order to limit the normalized claims to a single-sentence summary, ensuring brevity and conciseness in delivering the pivotal assertion.
Claim-Centricity. The task at hand transcends conventional text summarization: it seeks not only to condense the input social media post, but also to discern and to encapsulate concisely the central claim within that post. Thus, we use the prompt "summarize the text identifying the central assertion" to steer the model toward the main assertion in the input text.
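The abstractness score can be sketched as below, under the assumption that abstractness is one minus the harmonic mean X of the three overlap precisions (one plausible reading of the definition above; the exact computation in Dreyer et al. (2023) may differ in details such as tokenization).

```python
def lcs_length(a, b):
    """Longest common subsequence length over two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a):
        for j, tb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ta == tb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_precision(summary_ngrams, doc_ngrams):
    """Fraction of the summary's n-grams that also occur in the source post."""
    if not summary_ngrams:
        return 0.0
    doc_set = set(doc_ngrams)
    return sum(g in doc_set for g in summary_ngrams) / len(summary_ngrams)

def abstractness(post, claim):
    """1 - X, where X is the harmonic mean of unigram precision, bigram
    precision, and LCS precision of the claim w.r.t. the post (assumed form)."""
    p, c = post.lower().split(), claim.lower().split()
    uni = overlap_precision(ngrams(c, 1), ngrams(p, 1))
    bi = overlap_precision(ngrams(c, 2), ngrams(p, 2))
    lcs = lcs_length(p, c) / len(c) if c else 0.0
    parts = [uni, bi, lcs]
    if any(x == 0.0 for x in parts):
        return 1.0  # no overlap in some view -> treat as fully abstractive
    x = 3 / sum(1 / v for v in parts)  # harmonic mean of the three precisions
    return 1 - x
```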
Entity-Centricity. Similarly to the claim-centric prompt, we investigated the technique of creating entity-centered summaries using the prompt "summarize the text focusing on the given keywords (kw_1, kw_2, ..., kw_n)". For this approach, we use Open Information Extraction (Angeli et al., 2015) in order to extract subject-verb-object triples from the input texts. Subsequently, we compile a keyword list encompassing all subjects and all objects within the text. The objective is to direct the model to produce summaries that align well with the subjects and the objects mentioned in the input text.
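Given the extracted triples, the keyword-controlled prompt can be assembled as below. In the actual setup the triples come from Open IE (which requires an external toolkit); here they are passed in directly, and the example triples are illustrative.

```python
def entity_centric_prompt(text, triples):
    """Build the entity-centric control prompt from subject-verb-object
    triples: collect all subjects and objects as keywords, preserving order
    and dropping duplicates."""
    keywords = []
    for subj, _verb, obj in triples:
        for kw in (subj, obj):
            if kw not in keywords:
                keywords.append(kw)
    return (f"summarize the text focusing on the given keywords "
            f"({', '.join(keywords)}): {text}")

triples = [("aluminum foil", "causes", "Alzheimer's disease"),
           ("aluminum", "contaminates", "food")]
prompt = entity_centric_prompt("Cooking with aluminum foil ...", triples)
```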
Results. We report the results for our zero-shot prompt-tuning experiments in Table 8. We can see that for the T5-based models, prompt-tuning did not yield any major improvements; rather, it decreased the performance as compared to the uncontrolled prompt. However, GPT-3 showed some improvements when using these prompts.

A.3 In-Context Learning Templates
In Figure 5, we show the templates for the three in-context learning methods applied to GPT-3 (text-davinci-003), as described in Section 5.
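The general shape of such a few-shot template can be sketched as follows. The field labels ("Post:", "Normalized claim:") and the joining format are our own illustrative choices, since the exact templates are only given in Figure 5.

```python
def make_icl_prompt(instruction, examples, query_post):
    """Assemble a few-shot prompt: a task instruction, k (post, claim)
    demonstrations, then the query post with an empty completion slot."""
    parts = [instruction]
    for post, claim in examples:
        parts.append(f"Post: {post}\nNormalized claim: {claim}")
    parts.append(f"Post: {query_post}\nNormalized claim:")
    return "\n\n".join(parts)
```

The model's completion after the final "Normalized claim:" slot is taken as the generated normalized claim.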

A.4 Few-Shot Additional Results
We report 50-shot and 100-shot experimental results in Table 9. Interestingly, we observe that introducing more examples did not help the model much.

A.5 Implementation Details
We performed basic data cleaning on our dataset CLAN using nltk, e.g., removing non-alphanumeric characters, links, and hashtags. For a standardized evaluation, we relied on widely recognized evaluation libraries such as py-rouge,10 nltk-bleu,11 nltk-meteor,12 and huggingface bertscore.13 We trained all models for 50 epochs, with early stopping based on the validation loss. We set the patience value to 5, and we optimized the models using the Adam optimizer with a weight decay of 0.01. For our proposed approach CACN, we used GPT-3 (text-davinci-003) as the base model. Finally, we set the maximum length of the generated response to 120 with a temperature of 0.6.
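The cleaning steps can be sketched as follows. The exact regular expressions and nltk utilities used in the paper are not specified, so these patterns are illustrative:

```python
import re

def clean_post(text):
    """Basic cleaning: drop links and hashtags, strip remaining
    non-alphanumeric characters, and normalize whitespace."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"#\w+", " ", text)                   # hashtags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)         # other symbols
    return re.sub(r"\s+", " ", text).strip()
```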

Figure 1: Illustration of our proposed Claim Normalization task, highlighting the normalized claims authored by fact-checkers for social media posts from distinct social media platforms.

Post: As if it couldn't get any worse. #Hope4Cancer says #RootCanal causes #CANCER Solution... Rip Cancer patients teeth out. Monsters #FalseHope4Cancer #ProtectCancerPatients
Normalized claim 1: Having a root canal can cause cancer.
Normalized claim 2: Root canal treatment causes cancer.
Table 1: Examples of social media posts and their corresponding normalized claims from CLAN. The first two examples come from the training set and each has one reference normalized claim, while the last one comes from the test set, and thus it has two reference normalized claims.

Figure 2: Histogram of the cosine similarity between the social media posts and the corresponding normalized claims from our CLAN dataset.

Figure 3: Illustration of our proposed approach.To generate a normalized claim, we use the CACN prompt template, which encompasses explicit task instruction and relevant in-context examples, as well as chain-of-thought reasoning.

Figure 4: Box-plot for the number of tokens in normalized claims in CLAN.

Table 1, and the final dataset statistics are shown in Table 2.

Table 2: Statistics about our CLAN dataset.

Table 3: Experimental results of CACN and the baseline systems on CLAN. We report ROUGE (1, 2, L), BLEU-4, METEOR, and BERTScore. For each metric, the best score is shown in bold and the second-best score is underlined. The last row gives the percentage increase in performance of CACN over the best baseline.

Post: They gave you a MASK to cover your face AND then gave you George Floyd's signature phrase 'I can't breath'... Yet 2.5 years later, most still struggle with the realization that EVERYTHING on TV is fake. Floyd and ConVid never existed
GOLD: George Floyd and COVID-19 'never existed'.
Post: Zelensky sold 17 million hectares of agricultural land to Monsanto, Dupont, and Cargill. Yes, you read it well, 17 million hectares to GMO/chemical companies. This is very bad for the entire world since Ukraine is the largest exporter of wheat and other grains. Zelensky sold 28% of the entire Ukrainian arable land. Australian National Review reports three major US cross-border consortiums have bought 17 million hectares of Ukrainian farmland. To compare: In all of Italy, there are 16.7 million hectares of agricultural land. It turns out that three American companies in Ukraine bought more valuable agricultural land than in all of Italy. The entire area of Ukraine - 600,000 sqm, 170,000 sqm built.
GOLD: Three major US companies have bought 17 million hectares of Ukrainian agricultural land, which is more than in all of Italy. These companies are Cargill, Dupont and Monsanto, with their main shareholders being American venture capitalists Blackrock, Vanguard and Blackstone.
Table 5: Examples of generated normalized claims along with the gold reference. BART refers to BART-large.

Table 6: Human evaluation of the generated normalized claims. SC denotes Self-Contextualized, while BART refers to BART-large.

Table 7: Comparative top-k precision evaluations of normalized claims vs. original posts in evidence retrieval.

Table 8: Zero-shot prompt-tuning results for T5 and GPT-3 on our dataset CLAN.