PivotFEC: Enhancing Few-shot Factual Error Correction with a Pivot Task Approach using Large Language Models



Introduction
ChatGPT is an artificial intelligence chatbot released by OpenAI on November 30, 2022, built upon the company's Generative Pre-trained Transformer (GPT) series of large language models (LLMs) (Ouyang et al., 2022; OpenAI, 2023). Since its launch, ChatGPT has garnered significant global attention due to its comprehensive and eloquent responses across various knowledge domains. Within just two months, by January 2023, it had amassed over 100 million users. However, one drawback of ChatGPT is its tendency to generate text that is nonsensical or unfaithful to the provided source input, referred to as hallucination (Maynez et al., 2020; Raunak et al., 2021). To address this issue, the research community has dedicated efforts to the development of factual error correction (FEC), which aims to rectify false claims with minimal modifications so that they are better supported by the given evidence. Consequently, research on this task plays a crucial role in mitigating the problem of hallucinations in LLMs.
The most straightforward way to develop FEC systems is by fine-tuning pre-trained models, such as BART (Lewis et al., 2020a) and T5 (Raffel et al., 2020), on parallel data consisting of false claims along with their corresponding corrections. Nevertheless, the availability of such paired data is restricted due to the tremendous labor and time required for human annotation.
To overcome the data scarcity, researchers (Shah et al., 2020; Thorne and Vlachos, 2021; Chen et al., 2023) make use of distant supervision from the fact verification dataset FEVER (Thorne et al., 2018). FEVER is a large resource consisting of claims paired with evidence from Wikipedia, where each instance is labeled as either SUPPORTED or REFUTED based on whether the claim is supported or refuted by the corresponding evidence. Existing distantly supervised models typically follow the mask-then-correct approach (Shah et al., 2020; Thorne and Vlachos, 2021). Concretely, a fact verification classifier (FVC), trained on FEVER, acts as the masker, designed to find problematic spans within false claims. The token-level explanations (Ribeiro et al., 2016; Chen et al., 2017) of the FVC are usually exploited as masks. The corrector is trained on the SUPPORTED data from FEVER, with the objective of restoring the original claim (during training) or generating the correct claim (during inference) based on the masked claim and evidence. Furthermore, Chen et al. (2023) propose using the mask-then-correct method to iteratively refine the false claims instead of relying on a single pass.

Figure 1: The comparison between factual error correction and factual error injection. (The figure contrasts correcting the false claim "The 2013 NBA draft was held in Pennsylvania." back to "The 2013 NBA draft was held in New York." given the evidence, against injecting errors into the correct claim in multiple ways, e.g., "The 2013 NBA draft was held in Los Angeles.", "The 2012 NBA draft was held in New York.", or "The 2013 NBA draft was not held in New York.")

However, accurately identifying factual errors using the FVC is nontrivial. This limitation often leads to over-erasure and incorrect masking issues, becoming a bottleneck that restricts the performance of FEC models.
To bypass these issues, we propose to solve the FEC task by introducing a pivot task, factual error injection (FEI), which aims to generate false claims by injecting factual errors into correct claims. Our main motivation is that the FEI task is relatively easier than the FEC task. As shown in Figure 1, an FEC model is expected to precisely identify the factual error in the false claim "The 2013 NBA draft was held in Pennsylvania." and correct it to "The 2013 NBA draft was held in New York." based on the given evidence. In contrast, the FEI task allows for multiple ways to introduce factual errors into a correct claim. For example, one can replace "New York" with "Los Angeles", substitute "2013" with "2012", or even insert a negative word such as "not" into the correct claim. This distinction demonstrates that the FEI task encompasses a considerably larger solution space than FEC. By exploring this expanded solution space, we can leverage the relatively easier nature of FEI to enhance the overall performance of FEC systems.
Our second motivation stems from the fact that LLMs, such as GPT-3.5, can serve as excellent data annotators in few-shot settings, rivaling or even surpassing the performance of crowdsourced annotators (He et al., 2023). Inspired by this, we use LLMs to solve the FEI task, specifically generating false claims for correct claims. By doing so, we obtain a sufficient amount of paired data, which is then used to train the FEC corrector.
Our contributions are summarized as follows: (1) We propose PivotFEC, a method that uses a Pivot task, factual error injection, to enhance FEC with LLMs in few-shot settings. (2) Compared with distantly supervised baselines, PivotFEC only requires 8 labeled samples from FECDATA, eliminating the need for labeled fact verification data (i.e., FEVER) to train an FVC. (3) Extensive experiments conducted on the FECDATA dataset demonstrate that PivotFEC outperforms distantly supervised baselines by a large margin, achieving a new state-of-the-art (SOTA) result on the test set with scores of 66.3 on SARI and 66.68 on ROUGE-2. (4) PivotFEC exhibits much better performance than its few-shot counterpart (66.3 vs. 58.43 on SARI), where LLMs are directly used to solve FEC, confirming the effectiveness and necessity of our proposed pivot task.

Problem Statement
Factual error correction aims to revise the factual errors in claim C with minimal edits and generate a revised claim C ′ based on the provided evidence E. C ′ should be grammatical, supported by the evidence, and correct the factual errors in C.
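One compact way to read this objective, in our own notation rather than the paper's, is as a constrained minimal-edit search:

```latex
% A sketch of the FEC objective (our notation, not taken from the paper):
% among candidate claims supported by the evidence E, pick the one closest
% to the input claim C under an edit-distance measure.
\[
  C' \;=\; \operatorname*{arg\,min}_{\tilde{C} \,:\, E \,\models\, \tilde{C}} \; d_{\mathrm{edit}}\bigl(C, \tilde{C}\bigr)
\]
```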

Preliminary
In this section, we introduce in-context learning and describe how to solve FEC using LLMs with in-context few-shot learning via prompting.

LLMs with In-context Learning
LLMs, especially ChatGPT, have demonstrated remarkable few-shot capability in various downstream tasks. Therefore, it is natural to employ LLMs for addressing the FEC task in a few-shot setting. Building upon the approach introduced by GPT-3 (Brown et al., 2020), we utilize LLMs with in-context few-shot learning through prompting to tackle FEC. Rather than fine-tuning LLMs for individual tasks, we can efficiently prompt the model by providing a small set of input-output exemplars that demonstrate the task.

Figure 2: Overview of PivotFEC: (1) solve the FEI task with LLMs; (2) collect synthetic paired data for FEC; (3) train the factual error corrector with the synthetic data. For simplicity, the process of evidence retrieval has been omitted.

Factual Error Correction with LLMs
To address FEC using LLMs, we begin by choosing a set of demonstration examples. Each example comprises three elements: gold evidence, a mutated claim, and an original claim. The objective is for LLMs to learn how to modify the mutated claim based on the provided samples. Figure 3 illustrates a simplified prompt where the LLM (in this case, ChatGPT) accurately corrects the factual error in the mutated claim by replacing "Los Angeles" with "New York." For the full prompt of the FEC task, please refer to Table 6 in Appendix C.
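To make this setup concrete, the sketch below shows one way to assemble such a prompt and query the model. It uses the legacy openai ChatCompletion API (contemporaneous with gpt-3.5-turbo-0301) and a single illustrative exemplar rather than the paper's full 8-shot prompt (Table 6), so the exemplar texts and exact formatting are assumptions.

```python
# A minimal sketch of few-shot FEC prompting; the exemplar is illustrative,
# not the paper's actual prompt (see Table 6, Appendix C).
import openai  # legacy (<1.0) API style

EXEMPLARS = [
    {
        "evidence": ("The 2013 NBA draft was held on June 27, 2013, "
                     "at Barclays Center in Brooklyn, New York."),
        "mutated": "The 2013 NBA draft was held in Los Angeles.",
        "original": "The 2013 NBA draft was held in New York.",
    },
]

def build_fec_prompt(evidence: str, mutated_claim: str) -> str:
    """Demonstrations first, then the query; the model completes 'Original claim:'."""
    demos = "\n\n".join(
        f"Evidence: {d['evidence']}\nMutated claim: {d['mutated']}\n"
        f"Original claim: {d['original']}"
        for d in EXEMPLARS
    )
    return (f"{demos}\n\nEvidence: {evidence}\n"
            f"Mutated claim: {mutated_claim}\nOriginal claim:")

def correct_claim(evidence: str, mutated_claim: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0301",
        messages=[{"role": "user",
                   "content": build_fec_prompt(evidence, mutated_claim)}],
        temperature=0,  # deterministic corrections
    )
    return resp["choices"][0]["message"]["content"].strip()
```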

Approach
In this section, we first provide an overview of PivotFEC in §4.1. As illustrated in Figure 2, our method comprises three main steps. We begin by introducing the FEI task and demonstrating the use of LLMs to address it in §4.2. Next, we present the process of gathering synthetic paired data for the FEC task in §4.3. Finally, we describe the training of the corrector on the synthetic data in §4.4.

Overview
The main limitation in developing FEC systems is the scarcity of paired data comprising correct claims and their corresponding false claims. To mitigate this limitation, previous studies (Shah et al., 2020; Thorne and Vlachos, 2021; Chen et al., 2023) follow the mask-then-correct method, under the assumption that there is sufficient human-annotated fact verification data (i.e., FEVER) with which to train the FVC. They train the corrector by masking certain portions of correct claims and then recovering them. Therefore, during testing, it becomes necessary to identify factual errors within false claims and mask these errors with the FVC before using the corrector to revise them. Consequently, previous approaches suffer from issues such as over-erasure and incorrect masking.
Our primary motivation is to generate false claims by injecting factual errors into correct claims. This allows us to obtain FEC data consisting of correct claims paired with their corresponding false claims, which can be directly used for training the FEC corrector. To achieve this goal, we introduce the pivot task, factual error injection (FEI), for FEC, and then employ LLMs to address the FEI task using a few-shot in-context learning approach. Compared to the previous mask-then-correct method, our approach eliminates the need to mask factual errors before correction, thus avoiding the over-erasure and incorrect masking issues. Moreover, our approach does not depend on labeled fact verification data. Instead, we only require correct claims and a few labeled FEC samples.

Factual Error Injection with LLMs
To address FEC, LLMs are expected to identify factual errors and correct them. As previously analyzed, FEI requires LLMs to introduce factual errors into correct claims and has a significantly larger solution space than FEC (see Figure 1). Therefore, we assume that FEI is comparatively easier for LLMs than FEC. This is why we introduce this pivot task for FEC.
Similar to FEC, we also employ LLMs to tackle FEI using the few-shot in-context learning approach. For fair comparisons, we utilize the same demonstration exemplars as used in few-shot FEC, with the only difference being the order of the original claim and mutated claim. The left portion of Figure 2 illustrates a simplified prompt for FEI, where the LLM (specifically ChatGPT) injects a factual error by substituting "New York" with "Los Angeles." For the complete prompt of the FEI task, please refer to Table 7 in Appendix D.
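Continuing the earlier sketch (and reusing its EXEMPLARS list), the FEI prompt simply swaps the roles of the two claims; the exact formatting is again an assumption, with the real prompt given in Table 7.

```python
# A minimal sketch of the FEI prompt: the same exemplars as FEC, with the
# original/mutated order swapped so the model learns to inject errors.
def build_fei_prompt(evidence: str, original_claim: str) -> str:
    demos = "\n\n".join(
        f"Evidence: {d['evidence']}\nOriginal claim: {d['original']}\n"
        f"Mutated claim: {d['mutated']}"
        for d in EXEMPLARS
    )
    return (f"{demos}\n\nEvidence: {evidence}\n"
            f"Original claim: {original_claim}\nMutated claim:")
```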

Creating Synthetic Data for FEC
We assume that correct claims are readily available. For each correct claim $C^t$, we use LLMs to inject factual errors into $C^t$ via in-context learning with the prompt in Table 7. The generated claim is referred to as $C^f$. By doing so, we collect the synthetic data $\mathcal{D} = \{(C^t_i, C^f_i)\}_{i=1}^{N}$, where $C^t_i$ and $C^f_i$ denote the $i$-th correct claim and the corresponding false (i.e., generated) claim, respectively.
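As a sketch of this step, building on the prompting helpers above, the collection loop is simply the following; the sampling temperature is our assumption (some diversity in the injected errors is desirable).

```python
# A minimal sketch of synthetic-data collection via FEI; build_fei_prompt and
# the openai import come from the earlier sketches.
def inject_error(evidence: str, correct_claim: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0301",
        messages=[{"role": "user",
                   "content": build_fei_prompt(evidence, correct_claim)}],
        temperature=0.7,  # assumed: mild diversity in the injected errors
    )
    return resp["choices"][0]["message"]["content"].strip()

def build_synthetic_data(correct_claims, evidences):
    """Collect (correct claim, generated false claim, evidence) triples."""
    return [
        {"correct": c_t, "false": inject_error(ev, c_t), "evidence": ev}
        for c_t, ev in zip(correct_claims, evidences)
    ]
```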

Training FEC Corrector
After obtaining the synthetic data $\mathcal{D}$, we acquire the FEC corrector by fine-tuning pre-trained language models, such as BART or T5, on this data. To be concrete, we concatenate the false claim $C^f$ and the gold or retrieved evidence, and directly input them into the encoder (refer to the right part of Figure 2 for the input format). For more detailed information on obtaining evidence for the false claim, please refer to Appendix B.1. During training, we optimize the corrector by maximizing the conditional probability of $C^t$:
\[
  \max_{\theta} \sum_{i=1}^{N} \sum_{n=1}^{|C^t_i|} \log P\bigl(C^t_{i,n} \mid C^t_{i,j<n},\, C^f_i,\, E_i;\, \theta\bigr),
\]
where $\theta$ represents the parameters of the corrector, $E_i$ denotes the corresponding evidence, and $C^t_{i,j<n}$ refers to the sub-sequence preceding $C^t_{i,n}$.
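A minimal sketch of one training step is shown below. The exact input concatenation format is our assumption (the paper only points to Figure 2 for it); T5 computes the token-level objective above internally when labels are supplied.

```python
# A minimal sketch of the corrector's training step with T5-base. The
# "false claim </s> evidence" input format is an assumption; the loss returned
# by T5 is the negative log-likelihood of the correct claim given the input.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

false_claim = "The 2013 NBA draft was held in Pennsylvania."
evidence = ("The 2013 NBA draft was held on June 27, 2013, "
            "at Barclays Center in Brooklyn, New York.")
correct_claim = "The 2013 NBA draft was held in New York."

enc = tokenizer(f"{false_claim} {tokenizer.eos_token} {evidence}",
                return_tensors="pt", truncation=True, max_length=512)
labels = tokenizer(correct_claim, return_tensors="pt",
                   truncation=True, max_length=256).input_ids

loss = model(**enc, labels=labels).loss  # -log P(C^t | C^f, E; theta)
loss.backward()
```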

Experimental Setups
Dataset. Following previous work, we evaluate our model on the evidence-based FEC dataset (FECDATA) (Thorne and Vlachos, 2021), created based on the large fact verification dataset FEVER (Thorne et al., 2018). The construction of FECDATA is based on the REFUTED data instances from FEVER, using the original claims and mutated claims as the paired data. Table 1 shows the basic statistics of this dataset. To gain further insights, Figure 6 in Appendix A displays the distribution of mutation types for the REFUTED claims.

Evaluation Metrics. For automatic evaluation, we resort to the SARI (Xu et al., 2016) and ROUGE-2 (Lin, 2004) metrics. The SARI metric explicitly assesses the goodness of words in the revised claim that are added, deleted, and kept by FEC models relative to the source (mutated claim), compared with the referenced ground truth (original claim). We report the n-gram F1 score for "keep" operations (Keep), the n-gram precision score for "delete" operations (Delete), the n-gram F1 score for "add" operations (Add), and the average of these three scores (Final). ROUGE-2 measures the surface-level similarities between revised claims and reference claims. The SARI Final score serves as the primary evaluation metric due to its strong positive correlation with manual evaluation, as indicated by Thorne and Vlachos (2021)'s statistical findings.
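As a sketch of how these metrics can be computed, the snippet below uses the Hugging Face evaluate implementations of SARI and ROUGE; whether the paper used these exact implementations is an assumption.

```python
# A minimal sketch of SARI and ROUGE-2 scoring with the `evaluate` library.
import evaluate

sari = evaluate.load("sari")
rouge = evaluate.load("rouge")

sources = ["The 2013 NBA draft was held in Pennsylvania."]    # mutated claims
predictions = ["The 2013 NBA draft was held in New York."]    # revised claims
references = [["The 2013 NBA draft was held in New York."]]   # original claims

print(sari.compute(sources=sources, predictions=predictions,
                   references=references))                    # {'sari': ...}
print(rouge.compute(predictions=predictions,
                    references=[r[0] for r in references])["rouge2"])
```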
Baselines. We consider three types of baselines. Fully Supervised Baselines estimate the ceiling performance of FEC models, under the assumption that a substantial amount of data is accessible. For this purpose, we fine-tune BART-base and T5-base on FECDATA, where the encoder takes the false claim and corresponding evidence as inputs, while the decoder generates the revised claim.
Distantly Supervised Baselines adopt the mask-then-correct pipeline, consisting of a masker and a corrector. The masker can take various forms, such as the token-level explanations (Ribeiro et al., 2016; Chen et al., 2017) of a fact verification classifier (FVC), random masking, or heuristic masking. The FVC is initialized with BERT-base (Devlin et al., 2019) or RoBERTa-large (Liu et al., 2019), and trained on FEVER. On the other hand, the corrector is trained exclusively on the SUPPORTED data instances from FEVER. (1) Dual encoder pointer network (DEPN) (Shah et al., 2020) utilizes an FVC to predict masked words and subsequently generates a revised claim using the dual encoder pointer generator with the copy mechanism (See et al., 2017). (2) T5 Masker-Corrector (T5MC) (Thorne and Vlachos, 2021) differs from DEPN in two main ways: (a) the corrector (i.e., generator) is based on T5-base; (b) it randomly masks words in the input claim during training, but during inference, heuristic masking is employed, where words not present in the evidence are masked. (3) T5MC-MLM differs from T5MC in that it uses the masked language model BERT as the masker during inference. (4) T5MC-V is a variant of T5MC, using the FVC as the masker to predict the masked tokens. (5) VENCE (Chen et al., 2023) iteratively executes steps of mask-then-correct over the claim to make it supported by the evidence.
Rule-based Baselines first generate synthetic paired data and then train factual error correctors on them. Rule-based methods have been employed to construct inconsistent summaries with the aim of improving faithfulness in abstractive summarization (Cao et al., 2020; Cao and Wang, 2021). We begin by utilizing spaCy, a free open-source NLP library, to recognize named entities in correct claims, and then implement two rule-based baselines for factual error correction (see the sketch below): (1) The first rule-based method creates false claims by swapping named entities in the correct claims with alternative entities of the same entity type randomly chosen from the training dataset. This method is referred to as SwapEntity. (2) The second rule-based baseline resorts to the mask-then-fill pipeline to create false claims. In this approach, named entities within the correct claims are substituted with [MASK] tokens. The masked claims are then processed through the BART-base model to generate new claims by filling in the [MASK] tokens. These newly generated claims are considered false claims. This method is termed MaskEntity.

Few-shot Baselines contain two models: 8-shot T5-base fine-tunes T5-base using 8 data examples; 8-shot ChatGPT revises false claims by prompting ChatGPT with 8 demonstration examples. For fair comparisons, the few-shot baselines and our PivotFEC use the same set of examples.

Implementation Details. Our implementation details are shown in Appendix B.
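To make the two rule-based constructions concrete, here is a minimal sketch of both, assuming spaCy's small English model and facebook/bart-base; the entity pooling and decoding choices are our assumptions rather than the paper's exact procedure.

```python
# Minimal sketches of SwapEntity and MaskEntity; selection and filtering
# details are assumptions, not the paper's exact procedure.
import random
import spacy
from transformers import BartForConditionalGeneration, BartTokenizer

nlp = spacy.load("en_core_web_sm")
bart_tok = BartTokenizer.from_pretrained("facebook/bart-base")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def swap_entity(claim: str, entity_pool: dict) -> str:
    """SwapEntity: replace one named entity with a same-type entity from the pool."""
    ents = list(nlp(claim).ents)
    if not ents:
        return claim
    ent = random.choice(ents)
    candidates = [e for e in entity_pool.get(ent.label_, []) if e != ent.text]
    return claim.replace(ent.text, random.choice(candidates), 1) if candidates else claim

def mask_entity(claim: str) -> str:
    """MaskEntity: mask one named entity and let BART fill it in."""
    ents = list(nlp(claim).ents)
    if not ents:
        return claim
    masked = claim.replace(random.choice(ents).text, bart_tok.mask_token, 1)
    ids = bart_tok(masked, return_tensors="pt").input_ids
    out = bart.generate(ids, num_beams=5, max_length=64)
    return bart_tok.decode(out[0], skip_special_tokens=True)

pool = {"GPE": ["Los Angeles", "Pennsylvania"], "DATE": ["2012"]}
print(swap_entity("The 2013 NBA draft was held in New York.", pool))
print(mask_entity("The 2013 NBA draft was held in New York."))
```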

Experimental Results
The main experimental results on the FECDATA test set in Table 2 reveal the following key findings.

LLMs exhibit a remarkable few-shot ability for FEC. Directly fine-tuning T5 on the 8 labeled data instances (i.e., 8-shot T5-base) does not bring any improvement over previous distantly supervised baselines, such as VENCE. However, the few-shot in-context learning baseline (i.e., 8-shot ChatGPT) achieves a noteworthy SARI Final score of 58.43, surpassing VENCE (RoBERTa) by approximately 3 points. These results highlight the impressive few-shot capability of ChatGPT.

Our proposed pivot task is highly effective for few-shot FEC. To demonstrate the effectiveness of our proposed PivotFEC, we compare it with the 8-shot ChatGPT model. To ensure fair comparisons, both few-shot models are based on ChatGPT. As shown in Table 2, our proposed 8-shot PivotFEC exceeds its few-shot counterpart (8-shot ChatGPT) by a significant margin across all metrics: the SARI Final score increases from 58.43 to 66.30, and the ROUGE-2 score from 49.43 to 66.68.
On the other hand, PivotFEC also notably outperforms the rule-based baselines. Additionally, PivotFEC achieves its peak performance with just 2,000 synthetic data instances for training the corrector, while the rule-based methods require 10,000 synthetic data instances to reach their peak performance. These results can be attributed to the enhanced quality of the false claims produced by ChatGPT. Rule-based methods, by contrast, often produce false claims of suboptimal quality in two key aspects: (1) grammatical errors might be present in the generated false claims; (2) false claims generated by rule-based methods might deviate from the original topics of the correct claims. With ChatGPT's remarkable in-context learning capabilities, injecting factual errors into correct claims hardly introduces grammatical errors or deviates from the original topics.
These compelling improvements establish a new SOTA result and provide strong evidence for the effectiveness of the pivot task in enhancing the performance of FEC.

PivotFEC lags behind supervised baselines. While PivotFEC outperforms distantly supervised and few-shot models, there still exists a significant performance gap compared to supervised methods. For example, the supervised T5-base achieves a score of 74.24 on SARI Final, whereas PivotFEC only scores 66.30, indicating that there is ample room for further improvement.

The retrieved evidence is inadequate compared with gold evidence. To further explore the ceiling performance of our method, we also conduct experiments using gold evidence. The results reveal that when using gold evidence, PivotFEC improves the SARI Final score by approximately 4 points. This demonstrates the inadequacy of the retrieved evidence, which aligns with previous findings (Thorne and Vlachos, 2021). As our work mainly focuses on improving FEC through the introduction of the pivot task, we defer the improvement of evidence retrieval to future work.

More Analysis and Discussion
Unless otherwise stated, the experiments in this section use ChatGPT with 8-shot in-context learning and retrieved evidence.
Effect of the Number of Synthetic Data Instances. To show the effect of the synthetic data generated with FEI on PivotFEC, we first generate synthetic data for FEC with 8-shot PivotFEC (ChatGPT), and then fine-tune T5-base on varying numbers of synthetic data instances. For comparison, we also evaluate the performance of T5-base trained on the gold data (FECDATA). As shown in Figure 4 (a), when the data size does not exceed 1k, the performance of 8-shot PivotFEC increases linearly with the amount of data, even matching the performance of T5-base trained on the gold data. Nevertheless, our model's performance plateaus once the data size reaches 2k. In contrast, T5-base trained on gold data continues to improve; even when using all FECDATA training data, its performance does not reach its peak. This observation suggests that the generated data contains noise compared to the gold data, limiting the upper-bound performance of PivotFEC.

Effect of the Number of In-Context Examples.
Table 2 demonstrates the superiority of PivotFEC over few-shot FEC with ChatGPT under the 8-shot setting. To further validate the effectiveness of PivotFEC, we compare its performance with few-shot FEC using ChatGPT with varying numbers of in-context examples. As depicted in Figure 4 (b), PivotFEC exhibits a notable improvement of approximately 7 to 10 points on the SARI Final score compared to the few-shot FEC model at different shots. When utilizing 8 demonstration examples, our model reaches a plateau.

Effect of Different LLMs. To further emphasize the advantages of our method, we compare it with 8-shot FEC across different LLMs, including three InstructGPT models (text-ada-001, text-babbage-001, and text-curie-001) and ChatGPT (gpt-3.5-turbo-0301). Figure 4 (c) illustrates that our method consistently outperforms the few-shot FEC baseline across different LLMs. Moreover, both our method and the baseline exhibit noticeable performance improvements as the number of model parameters increases. However, even with small models, our method still performs exceptionally well; for example, the smallest model falls only around 5 points behind the largest model. In contrast, the baseline's performance is heavily influenced by the choice of model, which is particularly evident for text-ada-001, which drops approximately 12 points compared to the larger model, text-curie-001.

Effect of Mutation Types. As shown in Figure 6 in Appendix A, the refuted claims mainly stem from four mutation types. Therefore, we construct four prompts, each composed of examples from the corresponding mutation type. For the prompts consisting of examples from the negate, substitute similar, substitute dissimilar, and specific mutations, please refer to Tables 7, 8, 9, and 10. Table 3 shows that PivotFEC with the negate prompt yields the best results, possibly because this mutation constitutes the largest portion of the test set. Additionally, we present the performance of PivotFEC on separate test cases. Figure 5 illustrates that: (1) PivotFEC performs well on a test case of the specific mutation type when using a prompt tailored to that type, and (2) the variations in PivotFEC performance when using different prompts mainly arise from the negate test cases.

Human Evaluation
We conduct a human evaluation to compare PivotFEC with the fully supervised T5-base, 8-shot T5-base, and 8-shot ChatGPT models. The fully supervised T5-base model utilizes gold evidence to rectify false claims, while the others use retrieved evidence. For each model, we randomly sample 50 cases and ask three annotators to assess the revised claims based on the following Boolean questions: (1) Is the revised claim grammatically correct? (2) Is the revised claim supported by the evidence? (3) Has the factual error in the false claim been corrected?
The final question, measuring the correction of factual errors, is the most important metric in our human evaluation. As shown in Table 4, our proposed model outperforms the few-shot baselines on the corrected metric; however, there is still a gap to the ceiling performance of the supervised baseline. Inter-annotator agreement measured by Fleiss' kappa (Fleiss, 1971) is 0.75, 0.86, and 0.81 for the grammatical, supported, and corrected scores, implying substantial agreement (> 0.6) (Landis and Koch, 1977).
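For reference, agreement on each Boolean question can be computed as in the following sketch, which assumes the statsmodels implementation of Fleiss' kappa; the ratings shown are made up, not the paper's data.

```python
# A minimal sketch of the agreement computation for one Boolean question;
# the ratings matrix below is illustrative only.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# shape (n_items, n_raters); 1 = "yes", 0 = "no" from each of three annotators
ratings = np.array([[1, 1, 1],
                    [1, 0, 1],
                    [0, 0, 0],
                    [1, 1, 0]])
table, _ = aggregate_raters(ratings)  # items x categories count table
print(fleiss_kappa(table))
```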

Samples and Analysis
Table 5 presents the revised claims generated by our approach and the baselines. From this table, we observe that the 8-shot T5-base method cannot identify errors in false claims. Similarly, 8-shot ChatGPT often struggles to precisely locate errors within false claims, and tends to simply copy content from the evidence into the modified claims rather than correct them. In the first example, although 8-shot ChatGPT corrects the factual error in the original sentence, it does not make minimal edits. In the second example, 8-shot ChatGPT fails to identify the error in the false claim, resulting in text that exhibits low correlation with the original claim. This also explains why this method achieves a relatively high supported value but a low corrected score in the human evaluation. Most notably, our method can accurately identify errors and make modifications based on the retrieved evidence, similar to the performance of the supervised T5 model.
Related Work

Grammatical Error Correction
Grammatical error correction (GEC) (Ng et al., 2014; Yuan and Briscoe, 2016; Bryant et al., 2017; Awasthi et al., 2019; Liu et al., 2021) refers to the process of identifying and rectifying grammatical errors in written text. It has practical applications in several domains, such as helping non-native speakers enhance their writing skills, aiding language learners in improving their grammatical accuracy, and assisting professional writers in producing error-free and polished content. GEC aims to improve the accuracy and fluency of language by fixing various grammatical errors, including missing prepositions, mismatched subject-verb agreement, misspellings, and word choice errors.
In comparison, factual error correction involves correcting the factual errors instead of the grammatical errors in the given content, such as incorrect dates, names, or historical events.

Retrieval-Augmented Generation
Retrieval-augmented generation (Lewis et al., 2020b) combines the power of information retrieval and language generation techniques to elevate the overall quality of the generated content. For example, He et al. (2022) use dense retrievers to retrieve relevant sentences from an external corpus for given keywords to improve lexically constrained text generation (He and Li, 2021; He, 2021). By incorporating external knowledge, retrieval-augmented generation effectively mitigates the risk of generating inaccurate or nonsensical content. Factual error correction is another facet of retrieval-augmented generation: it rectifies factual inaccuracies based on retrieved evidence, thereby fitting under the broader umbrella of retrieval-augmented generation.

Fact Verification
Fact verification, also known as fact-checking, aims to validate the accuracy of a given claim by examining the available evidence. This field has been extensively researched in recent years. Researchers assess claims by analyzing both unstructured sources, such as political news (PolitiFact) (Vlachos and Riedel, 2014; Wang, 2017), Wikipedia (FEVER) (Thorne et al., 2018; Liu et al., 2020), and scientific literature (Wadden et al., 2020), and structured sources, including Wikipedia tables (TabFact) (Chen et al., 2020) and knowledge bases (Iso et al., 2020). Fact verification seeks to determine whether a claim is supported or refuted by evidence, while factual error correction takes it a step further: it not only involves identifying factual errors but also requires modifying them to obtain correct claims.

Conclusion
In this paper, we present PivotFEC, which introduces a pivot task, factual error injection, to improve factual error correction. Specifically, we first intentionally introduce factual errors into correct claims using LLMs under the few-shot setting. By doing so, we obtain enough synthetic paired data for FEC, consisting of correct claims paired with their corresponding false claims, which is then used to train the FEC corrector. As a result, PivotFEC demonstrates a significant improvement over previous distantly supervised baselines, establishing a new SOTA performance on FECDATA. Furthermore, our approach significantly outperforms its few-shot counterpart, providing strong evidence for the effectiveness of the pivot task.

Limitations
There are two potential limitations to this study. First, due to limited computational resources, we only assess the effectiveness of our proposed method on GPT-series models. Future work should include additional experiments with other language models, such as PaLM and LLaMA. Second, the retrieved evidence may not always be relevant to the input claim, which means it may not provide useful information for correcting factual errors within the claim. As our primary focus is on enhancing factual error correction through the introduction of the pivot task, we leave the task of improving evidence retrieval for future research.

B.1 Evidence Retrieval
Considering that our research does not focus on improving the retrieval model, we adopt the same retrieval process as previous studies (Thorne and Vlachos, 2021; Chen et al., 2023). The retrieval module primarily consists of two steps: first, we employ a pre-trained seq2seq model, GENRE (Cao et al., 2021), to predict the relevant Wikipedia pages for the input claim; then, we utilize the dense passage retrieval model DPR (Karpukhin et al., 2020) to retrieve the most relevant passages from the pages predicted by GENRE.
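The sketch below illustrates the second step with the standard Hugging Face DPR encoders. The GENRE page-prediction step is abstracted away (the passages argument stands in for passages drawn from GENRE-predicted pages), since its exact interface depends on the released checkpoint.

```python
# A minimal sketch of DPR-based passage ranking; the candidate passages are
# assumed to come from the Wikipedia pages predicted by GENRE.
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_tok = DPRQuestionEncoderTokenizer.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")
c_tok = DPRContextEncoderTokenizer.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base")
c_enc = DPRContextEncoder.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base")

def retrieve_evidence(claim: str, passages: list, top_k: int = 2) -> list:
    """Score each candidate passage against the claim and keep the top-k."""
    with torch.no_grad():
        q = q_enc(**q_tok(claim, return_tensors="pt")).pooler_output   # (1, d)
        c = c_enc(**c_tok(passages, return_tensors="pt", padding=True,
                          truncation=True)).pooler_output              # (n, d)
    scores = (q @ c.T).squeeze(0)  # dot-product relevance scores
    best = scores.topk(min(top_k, len(passages))).indices.tolist()
    return [passages[i] for i in best]
```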

B.2 PivotFEC
Synthetic data for FEC. We generate synthetic FEC data by solving the FEI task on the REFUTED data instances from FECDATA. Since we assume that correct claims are readily available, we utilize only the correct claims from the REFUTED data instances and exclude their paired false claims.
In total, we construct 1,296 validation data instances and 2,000 training data instances. It is worth noting that increasing the amount of training data does not yield substantial improvements, as discussed in Section 5.3.

Training and Inference.
During training, we initialize the PivotFEC corrector with the T5-base model. Following previous work (Thorne and Vlachos, 2021), we input the top-2 retrieved evidence passages or the gold evidence, paired with the false claim, into the T5 encoder, as additional evidence does not yield significant improvements. The corrector is optimized using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 4e-5, a batch size of 64, and a linear learning rate schedule with 10% warm-up over 400 total steps. The learning rate is selected from {5e-6, 1e-5, 2e-5, 3e-5, 4e-5, 5e-5}. We set the maximum source length to 512 and the maximum target length to 256. During training, we evaluate the model every 50 steps on the synthetic validation set and choose the checkpoint with the lowest negative log-likelihood (NLL) loss on the validation set.
During inference, we employ beam search decoding with a beam width of 5 to generate revised claims for the test set.
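A minimal sketch of this optimization and decoding configuration follows, assuming standard PyTorch and transformers utilities; the data loading and training loop are elided, and the input string is illustrative.

```python
# A minimal sketch of the optimizer/schedule and beam-search decoding settings
# described above; the data pipeline and training loop are omitted.
import torch
from transformers import (T5ForConditionalGeneration, T5TokenizerFast,
                          get_linear_schedule_with_warmup)

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=40, num_training_steps=400)  # 10% warm-up

# ... inside the training loop: loss.backward(); optimizer.step();
# scheduler.step(); optimizer.zero_grad() ...

# Inference: beam search with a beam width of 5.
enc = tokenizer("The 2013 NBA draft was held in Pennsylvania. </s> "
                "The 2013 NBA draft was held at Barclays Center in New York.",
                return_tensors="pt", truncation=True, max_length=512)
out = model.generate(**enc, num_beams=5, max_length=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```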

B.3 Supervised Models
We fine-tune the pre-trained models BART-base and T5-base on the FECDATA training set for 4,000 steps, evaluating the model every 200 steps on the FECDATA validation set. Other parameters remain consistent with those of PivotFEC, as stated in Section B.2.
To implement all models, we utilize the HuggingFace Transformers library (Wolf et al., 2019). All experiments are conducted on 2 NVIDIA Tesla V100 GPUs with 32 GB of memory.

C Full Few-shot Prompts for FEC
Table 6 shows the few-shot exemplar prompt of the negate mutation type for the FEC task. Table 6 and Table 7 use the same demonstration exemplars, with the only difference being the order of the original claim and mutated claim.

D Full Few-shot Prompts for FEI
Tables 7, 8, 9, and 10 show the few-shot exemplar prompts of the negate, substitute similar, substitute dissimilar, and specific mutation types for the FEI task, respectively.

Figure 3: Factual error correction with in-context learning using ChatGPT. Text in red denotes the output of ChatGPT, while the remaining parts are the input.

Figure 4: Subfigure (a) shows the performance on the test set for T5-base trained with different numbers of gold or generated data instances. Subfigure (b) shows the performance on the test set for different numbers of few-shot in-context learning examples. Subfigure (c) shows the performance on the test set for different LLMs with 8-shot in-context learning.

Figure 5: Results on different test cases, with the test set divided according to mutation types. Gold denotes T5-base trained on FECDATA. Negate, Sub_sim, Sub_dissim, and Specific refer to PivotFEC with their respective mutation prompts.

Figure 6: The distribution of mutation types for the REFUTED claims of the test set.

Table 1: The basic statistics of FECDATA with the number of data instances for each split and label.

Table 2: Evaluation results (%) of our model and baselines with retrieved evidence or ground-truth evidence on the FECDATA test set. Results marked with †, ‡, and * are from VENCE (Chen et al., 2023), T5MC-V (Thorne and Vlachos, 2021), and our reproduction, respectively. Bb and Rl denote BERT-base and RoBERTa-large. Enumerate refers to using the FVC model to rank 20 generated claims and select the best one. Underline indicates the best model and bold indicates the second best. RG-2 refers to ROUGE-2.

Table 3: Evaluation results (%) of PivotFEC with prompts created using examples from different mutation types on the test set. RG-2 denotes ROUGE-2.

Table 5: Revised claims generated by our model and baselines based on the evidence for false claims extracted from the test set. The supervised T5-base revises false claims based on the gold evidence, while the others utilize retrieved evidence. For simplicity, we do not show the gold evidence. Text in blue, red, and orange represents factual errors, correct modifications, and copied text, respectively.