Zero-shot Faithful Factual Error Correction

Faithfully correcting factual errors is critical for maintaining the integrity of textual knowledge bases and for preventing hallucinations in sequence-to-sequence models. Drawing on humans' ability to identify and correct factual errors, we present a zero-shot framework that formulates questions about input claims, looks for correct answers in the given evidence, and assesses the faithfulness of each correction based on its consistency with the evidence. Our zero-shot framework outperforms fully-supervised approaches, as demonstrated by experiments on the FEVER and SciFact datasets, where our outputs are shown to be more faithful. More importantly, the decomposable nature of our framework inherently provides interpretability. Additionally, to identify the most suitable metrics for evaluating factual error corrections, we analyze the correlation of commonly used metrics with human judgments along three dimensions covering intelligibility and faithfulness.


Introduction
The task of correcting factual errors is in high demand and requires a significant amount of human effort. The English Wikipedia serves as a notable case in point: it is continually updated by over 120K editors, with an average of around six factual edits made per minute.2 Using machines to correct factual errors could allow articles to be updated with the most current information automatically. This process, due to its high speed, can help retain the integrity of the content and prevent the spread of false or misleading information.
In addition, hallucination has been shown to be a prime concern for neural models, where they are prone to generate content factually inconsistent with the input sources due to unfaithful training samples (Maynez et al., 2020) and the implicit "knowledge" learned during pre-training (Niven and Kao, 2019). Factual error correction can be applied in both pre-processing and post-processing, rectifying factual inconsistencies in training data and in generated texts, respectively. This can help build trust and confidence in the reliability of language models.

Figure 1: An example of a factual but unfaithful correction leading to misleading information. Evidence: "The novel COVID-19 is highly contagious and is transmitted mostly through respiratory droplets. But, whether its transmission can be forwarded by touching a surface (i.e., a fomite) is uncertain.... COVID-19 has a case fatality rate of below 2%." Final correction: "COVID-19 is not infectious." While it is technically true that the majority of people infected with COVID-19 will recover, no information in the evidence supports the final correction. Additionally, when this statement is taken out of context, it could mislead people into believing that COVID-19 is not dangerous and that no precautions are needed, which is false. A factual and faithful correction is "COVID-19 is highly contagious."

1 The code and data are publicly available: https://github.com/khuangaf/ZeroFEC
2 https://en.wikipedia.org/wiki/Wikipedia:Statistics
Prior work typically formulates factual error correction as a sequence-to-sequence task, either in a fully supervised or in a distantly supervised manner (Shah et al., 2020; Thorne and Vlachos, 2021). While these approaches have made great strides in generating fluent and grammatically valid corrections, they focus only on factuality: whether the outputs are aligned with facts. Little emphasis has been placed on faithfulness: the factual consistency of the outputs with the evidence. Faithfulness is critical in this task because it indicates whether a generated correction reflects the information we intend to update. If faithfulness is not ensured, corrections could spread misleading content, causing serious consequences. Figure 1 shows a concrete example. In the context of automatically updating textual knowledge bases, the topic of an unfaithful output would likely deviate far from that of the expected correction. Therefore, such an edit is not desirable, even if it is factual.
In this work, we present the first study on the faithfulness aspect of factual error correction. To address faithfulness, we propose a zero-shot factual error correction framework (ZEROFEC), inspired by how humans verify and correct factual errors. When humans find a piece of information suspicious, they tend to first identify potentially false information units, such as noun phrases, then ask questions about each information unit, and finally look for the correct answers in trustworthy evidence (Saeed et al., 2022;Chen et al., 2022). Following a similar procedure, ZEROFEC breaks the factual error correction task into five sub-tasks: (1) claim answer generation: extracting all information units, such as noun phrases and verb phrases, from the input claim; (2) question generation: generating question given each claim answer and the original claim such that each claim answer is the answer to each generated question; (3) question answering: answering each generated question using the evidence as context; (4) QA-to-claim: converting each pair of generated question and answer to a declarative statement; (5) correction scoring: evaluating corrections based on their faithfulness to the evidence, where faithfulness is approximated by the entailment score between the evidence and each candidate correction. The highest-scoring correction is selected as the final output. An overview of our framework is shown in Figure 2. Our method ensures the corrected information units are derived from the evidence, which helps improve the faithfulness of the generated corrections. In addition, our approach is naturally interpretable since the questions and answers generated directly reflect which information units are being compared with the evidence.
Our contributions can be summarized as follows:
• We propose ZEROFEC, a factual error correction framework that effectively addresses faithfulness by asking questions about the input claim, seeking answers in the evidence, and scoring the outputs by faithfulness.
• Our approach outperforms all prior methods, including fully-supervised approaches trained on 58K instances, in ensuring faithfulness on two factual error correction datasets, FEVER (Thorne et al., 2018) and SCIFACT (Wadden et al., 2020).
• We analyze the correlation of human judgments with automatic metrics to provide intuition for future research on evaluating the faithfulness, factuality, and intelligibility of factual error corrections.

Task
In Thorne and Vlachos (2021)'s setting, retrieved evidence is used, which means the model may be able to correct factual errors, even though there is no supporting information in the evidence. In this case, although the prediction is considered correct, the model is hallucinating, which is not a desired property. Additionally, due to the way data was collected, they require systems to alter the input claim even if the input claim is already faithful to the evidence. We argue that no edit is needed for claims that are faithful to the evidence.
To address these shortcomings, our setup aims to edit a claim using a given piece of grounded evidence that supports or refutes the original claim (see Figure 2). Using gold-standard evidence avoids the issue where a system outputs the correct answer by chance due to hallucinations. In our setting, a system must be faithful to the evidence to correct factual errors, allowing us to evaluate system performance more fairly. Furthermore, we require the model not to edit the original claim if it is already factually consistent with the provided evidence.
Concretely, the input to our task is a claim C and a piece of gold-standard evidence E that supports or refutes C. The goal of factual error correction is to produce a corrected claim Ĉ that fixes factual errors in C while being faithful to E. If C is already supported by E, models should output the original claim (i.e., Ĉ = C).

Proposed Methods
Our framework, ZEROFEC, faithfully corrects factual errors using question-answering and entailment.
Specifically, we represent the input claim C as question-answer pairs {(Q_1, A^C_1), ..., (Q_n, A^C_n)} such that each question Q_i reflects the corresponding information unit A^C_i, such as noun phrases and adjectives (§3.1 and §3.2). Based on each question Q_i, we look for an answer A^E_i in the given evidence E using a learned QA model (§3.3). Each candidate correction S_i is obtained by converting the corresponding pair of Q_i and A^E_i into a declarative statement (§3.4). This guarantees that the corrected information units we replace factual errors with are derived from the evidence, and thus ensures high faithfulness. The final output of ZEROFEC is the S_i with the highest faithfulness score computed by an entailment model (§3.5). An overview of our framework is shown in Figure 2.

Figure 2: An overview of our framework. Input claim: "Night of the living dead is a Spanish Comic Book." Claim answer generation: 1. a Spanish Comic Book; 2. Night of the living dead; ... Final correction: "Night of the Living Dead is a horror film." First, given an input claim, we generate the claim answers by enumerating all information units in the input claim. Second, conditioned on each extracted answer and the input claim, a question is generated. Third, each question is fed to a question answering model to produce an evidence answer, using the given evidence as context. Fourth, using a sequence-to-sequence model, each evidence answer and the corresponding question are transformed into a statement, which serves as a candidate correction. Finally, the final correction is produced by scoring the candidate corrections based on faithfulness.
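The five-step pipeline above can be sketched as a composition of model calls. The callables below (extract_units, generate_question, answer_question, qa_to_claim, entail_score, rouge1) are hypothetical stand-ins for the spaCy/Stanza extractors, MixQG, UnifiedQA-v2, the QA-to-claim model, DocNLI, and ROUGE-1; this is a schematic sketch, not the authors' implementation.

```python
def zerofec(claim, evidence, extract_units, generate_question,
            answer_question, qa_to_claim, entail_score, rouge1):
    # Keep the input claim as a candidate in case it is already faithful.
    candidates = [claim]
    for unit in extract_units(claim):                        # (1) claim answers
        question = generate_question(unit, claim)            # (2) question generation
        ev_answer = answer_question(question, evidence)      # (3) question answering
        candidates.append(qa_to_claim(question, ev_answer))  # (4) QA-to-claim
    # (5) correction scoring: entailment score (faithfulness) + ROUGE-1
    return max(candidates,
               key=lambda c: entail_score(evidence, c) + rouge1(claim, c))
```

Any candidate, including the unmodified input claim, can win the final scoring step, which is how the framework leaves already-faithful claims unedited.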
One major challenge that makes our task more difficult than prior studies on faithfulness (Fabbri et al., 2022a) is that we need to handle more diverse factual errors, such as negation errors and errors that can only be corrected abstractively. For instance, in the second example in Table 2, the QA model should output "Yes" as the answer, which cannot be produced by extractive QA systems. To address this issue, we adopt abstractive QG and QA models that can handle diverse question types and train our QA-to-claim model on multiple datasets to cover cases that extractive systems cannot handle. The following subsections detail each component of our framework.

Claim Answer Generation
The goal of claim answer generation is to identify information units in the input claim that may be unfaithful to E. We aim to maximize recall in this step, since missed candidates cannot be recovered in later steps. Therefore, we extract all noun chunks and named entities using spaCy 3 and extract nouns, verbs, adjectives, adverbs, noun phrases, and verb phrases using Stanza 4 . Additionally, we extract negation terms, such as "not" and "never", from the input claim. We call the extracted information units claim answers, denoted {A^C_1, ..., A^C_n}.
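A dependency-free sketch of this extraction step: the paper uses spaCy for noun chunks and entities and Stanza for POS-based units, while the toy version below assumes pre-tagged (token, POS) pairs so it runs without either library, and only covers single-token units plus negation terms.

```python
# Negation terms and content POS tags, a simplified subset of the units
# enumerated in the paper (noun chunks and multi-word phrases omitted).
NEGATION_TERMS = {"not", "never", "no", "n't"}
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}

def claim_answers(tagged_claim):
    """Return candidate information units from a POS-tagged claim."""
    units = []
    for token, pos in tagged_claim:
        if pos in CONTENT_POS or token.lower() in NEGATION_TERMS:
            units.append(token)
    return units

tagged = [("COVID-19", "PROPN"), ("is", "AUX"), ("not", "PART"),
          ("infectious", "ADJ"), (".", "PUNCT")]
print(claim_answers(tagged))  # ['COVID-19', 'not', 'infectious']
```

Because later steps can only correct units surfaced here, erring on the side of over-extraction matches the recall-maximizing design described above.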

Question Generation
Once claim answers are produced, we generate questions that will later be used to look for correct information units in the evidence. Questions are generated conditioned on the claim answers, using the input claim as context. We denote the question generator as G. Each claim answer A^C_i is concatenated with the input claim C to generate a question Q_i = G(A^C_i, C). We utilize MixQG (Murakhovs'ka et al., 2022) as our question generator G to cover the wide diversity of factual errors and extracted candidates. MixQG was trained on nine question generation datasets with various answer types, including boolean, multiple-choice, extractive, and abstractive answers.
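The concatenation fed to the question generator can be sketched as below. The " \n " separator is an assumption borrowed from the UnifiedQA-style flat-string input convention, not necessarily MixQG's exact format.

```python
def qg_input(claim_answer: str, claim: str, sep: str = " \\n ") -> str:
    """Concatenate a claim answer with the input claim for the question
    generator. The literal backslash-n separator is an assumed convention."""
    return claim_answer + sep + claim

print(qg_input("a Spanish Comic Book",
               "Night of the living dead is a Spanish Comic Book."))
```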

Question Answering
The question answering step identifies the correct information unit A^E_i corresponding to each question Q_i in the given evidence E. Our QA module answers questions from the question generation step with the given evidence as context. Let F denote our QA model. We feed the concatenation of a generated question and the evidence to the QA model to produce an evidence answer A^E_i = F(Q_i, E). UNIFIEDQA-V2 (Khashabi et al., 2022) is used as our question answering model. UNIFIEDQA-V2 is a T5-based (Raffel et al., 2020b) abstractive QA model trained on twenty QA datasets that can handle diverse question types.

QA-to-Claim
After questions and answers are generated, we transform each question-answer pair into a declarative statement, which serves as a candidate correction to be scored in the next step. Previous studies on converting QA pairs to claims focus on extractive answer types only (Pan et al., 2021). To accommodate diverse types of questions and answers, we train a sequence-to-sequence model that generates a claim given a question-answer pair on three datasets: QA2D (Demszky et al., 2018) for extractive answers, BoolQ (Clark et al., 2019) for boolean answers, and SciTail (Khot et al., 2018) for covering scientific-domain QA. Note that samples in BoolQ do not contain converted declarative statements; using Stanza's constituency parser, we apply heuristics to transform all QA pairs in BoolQ into their declarative forms. Our QA-to-claim model is a T5-base model fine-tuned on these three datasets. Concretely, let M denote our QA-to-claim model. M takes a generated question Q_i and an evidence answer A^E_i as inputs and outputs a statement S_i = M(Q_i, A^E_i).
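The BoolQ heuristic can be illustrated with a toy string-level version. The paper applies heuristics over Stanza's constituency parses; the sketch below is purely illustrative and only handles simple "Is/Was X Y?" questions.

```python
def boolq_to_claim(question: str, answer: str) -> str:
    """Turn a yes/no question plus its answer into a declarative claim.
    Toy version: assumes the question starts with a single auxiliary verb."""
    aux, rest = question.rstrip("?").split(" ", 1)
    subject, predicate = rest.split(" ", 1)
    if answer.lower() == "no":
        return f"{subject} {aux.lower()} not {predicate}."
    return f"{subject} {aux.lower()} {predicate}."

print(boolq_to_claim("Is COVID-19 highly contagious?", "Yes"))
# COVID-19 is highly contagious.
```

A parser-based implementation is needed in practice because boolean questions can have arbitrary auxiliary placement and embedded clauses that simple string splitting cannot handle.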

Correction Scoring
The final correction is produced by scoring the faithfulness of each candidate correction from the previous steps w.r.t. the evidence. We use an entailment score to approximate faithfulness. Here, DocNLI (Yin et al., 2021) is used to compute document-sentence entailment relations. DocNLI is more generalizable than other document-sentence entailment models, such as FactCC (Kryscinski et al., 2020), since it was trained on five datasets spanning various tasks and domains. Conventional NLI models trained on sentence-level NLI datasets, such as MNLI (Williams et al., 2018), are not applicable, since previous work has found that these models are ill-suited for measuring entailment beyond the sentence level (Falke et al., 2019). In addition, to prevent the final correction from deviating too much from the original claim, we also consider ROUGE-1 scores, motivated by Wan and Bansal (2022). The final metric used for scoring is the sum of the ROUGE-1 score 5 and the DocNLI entailment score. Formally,

C′ = argmax_{S ∈ {C, S_1, ..., S_n}} [ROUGE-1(C, S) + DocNLI(E, S)],

where C′ is the final correction produced by our framework. To handle cases where the input claim is already faithful to the evidence, we include the input claim C in the candidate list to be scored.
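The scoring function can be sketched as follows, with a simplified set-overlap ROUGE-1 F1 standing in for an actual ROUGE implementation and `entail_score` standing in for DocNLI's entailment probability.

```python
def rouge1_f(reference: str, candidate: str) -> float:
    """Simplified unigram ROUGE-1 F1 (set overlap, ignoring multiplicity)."""
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    overlap = len(ref & cand)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def correction_score(claim, evidence, candidate, entail_score):
    # Faithfulness (DocNLI stand-in) plus lexical similarity to the
    # original claim, summed as in the scoring formula above.
    return rouge1_f(claim, candidate) + entail_score(evidence, candidate)
```

The ROUGE-1 term acts as a tie-breaker toward minimal edits: among candidates the entailment model judges equally faithful, the one closest to the original claim wins.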

Domain Adaptation
During the early stage of our experiments, we found that our proposed framework did not perform well in correcting factual errors in biomedical claims. This results from the fact that our QA and entailment models were not fine-tuned on datasets in the biomedical domain. To address this issue, we adapt UNIFIEDQA-V2 and DOCNLI on two biomedical QA datasets, PUBMEDQA (Jin et al., 2019) and BIOASQ (Tsatsaronis et al., 2015), by further fine-tuning them for a few thousand steps. We later show that this simple domain adaptation technique successfully improves our overall factual error correction performance on a biomedical dataset without decreasing performance in the Wikipedia domain (see §5.1).
Datasets

FEVER (Thorne and Vlachos, 2021) is repurposed from the corresponding fact-checking dataset (Thorne et al., 2018), which consists of evidence collected from Wikipedia and human-written claims that are supported or refuted by the evidence. Similarly, SCIFACT is a fact-checking dataset in the biomedical domain (Wadden et al., 2020). We repurpose it for the factual error correction task as follows. First, we form faithful claims by taking all claims supported by evidence. Then, unfaithful claims are generated by applying Knowledge Base Informed Negations (Wright et al., 2022), a semantics-altering transformation technique guided by a knowledge base, to a subset of the faithful claims. Appendix A shows detailed statistics.

Evaluation Metrics
Our evaluation focuses on faithfulness. Therefore, we adopt recently developed metrics that have been shown to correlate well with human judgments of faithfulness. BARTScore (Yuan et al., 2021) computes the semantic overlap between the input claim and the evidence by calculating the log probability of generating the evidence conditioned on the claim. FactCC (Kryscinski et al., 2020) is an entailment-based metric that predicts the probability that a claim is faithful w.r.t. the evidence; we report the average of the CORRECT probability across all samples. In addition, we consider QAFACTEVAL (Fabbri et al., 2022a), a recently released QA-based metric that achieves the highest performance on the SUMMAC factual consistency evaluation benchmark. Furthermore, we also report SARI (Xu et al., 2016), a lexical metric that has been widely used for factual error correction (Thorne and Vlachos, 2021; Shah et al., 2020).

Baselines
We compare our framework with the following baseline systems. T5-FULL (Thorne and Vlachos, 2021) is a fully-supervised model based on T5-base (Raffel et al., 2020a) that generates the correction conditioned on the input claim and the given evidence. MASKCORRECT (Shah et al., 2020) and T5-DISTANT (Thorne and Vlachos, 2021) are both distantly-supervised methods composed of a masker and a sequence-to-sequence (seq2seq) corrector. The masker learns to mask out information units that are possibly false, based on a learned fact verifier or an explanation model (Ribeiro et al., 2016), and the seq2seq corrector learns to fill in the masks with factual information. The biggest difference between them is the choice of seq2seq corrector: T5-DISTANT uses T5-base, while MASKCORRECT utilizes a two-encoder pointer generator. For zero-shot baselines, we selected two post-hoc editing frameworks trained to remove hallucinations from summaries, REVISEREF (Adams et al., 2022) and COMPEDIT (Fabbri et al., 2022b). REVISEREF is trained on synthetic data in which hallucinating samples are created by entity swaps.
COMPEDIT learns to remove factual errors with sentence compression, where training data are generated with a separate perturber that inserts entities into faithful sentences.

Implementation Details
No training is needed for ZEROFEC. As for ZEROFEC-DA, we fine-tune UNIFIEDQA-V2 and DOCNLI on the BIOASQ and PUBMEDQA datasets for a maximum of 5,000 steps using AdamW (Loshchilov and Hutter, 2019) with a learning rate of 3e-6 and a weight decay of 1e-6. During inference, all generative components use beam search with a beam width of 4.

Main Results

Table 1 summarizes the main results on the FEVER and SCIFACT datasets. Both ZEROFEC and ZEROFEC-DA achieve significantly better performance than the distantly-supervised and zero-shot baselines. More impressively, they surpass the performance of the fully-supervised model on most metrics, even though the fully-supervised model is trained on 58K samples in the FEVER experiment. The improvements demonstrate the effectiveness of our approach in producing faithful factual error corrections by combining question answering and entailment predictions. In addition, even though our domain adaptation technique is simple, it successfully boosts performance on the SCIFACT dataset while retaining strong performance on the FEVER dataset. The first example in Table 2 illustrates such a failure case, in which ZEROFEC's scoring mishandles the candidate associated with the question ending in "anaphase?", indicating poor entailment assessment. With domain adaptation, ZEROFEC-DA resolves this issue by enabling DocNLI to approximate faithfulness more accurately. It is true that ZEROFEC-DA requires additional training, which differs from typical zero-shot methods. However, the key point remains that our framework does not require any task-specific training data. Hence, our approach still offers the benefits of zero-shot learning: it requires no training data beyond what is already available for question answering, a field with much richer resources than fact-checking.

Qualitative Analysis
To provide intuition for our framework's ability to produce faithful factual error corrections, we manually examined 50 correct and 50 incorrect outputs produced by ZEROFEC on the FEVER dataset. The interpretability of ZEROFEC allows for insightful examination of the outputs. Among the correct samples, our framework produces faithful corrections because all intermediate outputs are accurately produced, rather than "being correct by chance". For the incorrect outputs, we analyze the sources of mistakes, as shown in Figure 3. The vast majority of failed cases result from DocNLI's failure to score candidate corrections faithfully. Beyond the mediocre performance of DocNLI itself, one primary reason is that erroneous outputs from other components are not counted as mistakes so long as the correction scoring module deems the resulting candidate corrections unfaithful to the evidence. A possible way to improve DocNLI is to further fine-tune it on synthetic data generated by perturbing samples in FEVER and SCIFACT. Examples of correct and incorrect outputs are presented in Table 7 and Table 8.

Human Evaluation
To further validate the effectiveness of our proposed method, we recruited three graduate students who are not authors to conduct human evaluation on 100 and 40 claims from FEVER and SCIFACT, respectively. For each claim, human judges are presented with the ground-truth correction, the gold-standard evidence, and the output produced by a factual error correction system, and are tasked with assessing the quality of the correction along three dimensions. Intelligibility evaluates the fluency of the correction: an intelligible output is free of grammatical mistakes, and its meaning must be understandable by humans without further explanation. Factuality considers whether the output claim is aligned with facts; a system's output can be factual yet semantically different from the gold correction, as long as it is consistent with world knowledge. Faithfulness examines whether the output is factually consistent with the given evidence. Note that a faithful output must be factual, since we assume all evidence is free of factual errors. To evaluate annotation quality, we compute inter-annotator agreement: Krippendorff's Alpha (Krippendorff, 2011) is 68.85%, which indicates a moderate level of agreement. Details of our human evaluation can be found in Appendix B.

Table 2: Example outputs from different approaches. The outputs from our framework are directly interpretable, as the generated questions and answers reflect which information units in the input claim are erroneous and which information in the evidence supports the final correction. We show only the generated answers and questions directly related to the gold correction. In the first example, ZEROFEC-DA corrects a mistake made by ZEROFEC thanks to domain adaptation. In the second example, ZEROFEC successfully produces a faithful factual error correction, whereas the output of T5-DISTANT, the distantly-supervised baseline, is factual yet unfaithful to the evidence. Example 1 — Input claim: "Clathrin stabilizes the spindle fiber apparatus during anaphase." Evidence: "...but is shut down during mitosis, when clathrin concentrates at the spindle apparatus..." Gold correction: "Clathrin stabilizes the spindle fiber apparatus during mitosis." Example 2 — T5-DISTANT's output: "Fuller House (TV series) isn't airing on HBO."
The human evaluation results are presented in Table 3. We observe that: (1) ZEROFEC and ZEROFEC-DA achieve the best overall performance in Factuality and Faithfulness on both datasets, even when compared to the fully-supervised method, suggesting that our approach is the best at ensuring faithfulness for factual error correction.
(2) Our domain adaptation for the biomedical domain surprisingly improves faithfulness and factuality in the Wikipedia domain (i.e. FEVER). This suggests that fine-tuning the components of our framework on more datasets helps improve robustness in terms of faithfulness.
(3) Factual outputs produced by ZEROFEC and ZEROFEC-DA are always faithful to the evidence, preventing the potential spread of misleading information caused by factual but unfaithful corrections. The second example in Table 2 demonstrates an instance of a factual but unfaithful correction made by a baseline model. Here, the output of T5-DISTANT is unfaithful, since the evidence does not mention whether Fuller House airs on HBO. In fact, although Fuller House was not on HBO when it premiered, it was later accessible on HBO Max. Therefore, the correction produced by T5-DISTANT is misleading.

Correlation with Human Judgments
Recent efforts on faithfulness metrics have mostly focused on the summarization task; no prior work has studied the transferability of these metrics to the factual error correction task. We seek to bridge this gap by measuring the correlation between the automatic metrics used in Table 1 and the human evaluation results discussed in §5.3. Using Kendall's Tau (Kendall, 1938) as the correlation measure, the results are summarized in Table 4. We make the following observations. (1) SARI is the most consistent and reliable metric for evaluating Factuality and Faithfulness across the two datasets. Although the other three, more recently developed metrics demonstrate high correlations with human judgments of faithfulness on multiple summarization datasets, their transferability to the factual error correction task is limited by their design. For example, QA-based metrics like QAFACTEVAL are less reliable for evaluating faithfulness in this task because they cannot extract a sufficient number of answers from a single-sentence input claim; summaries in summarization datasets generally consist of multiple sentences, enabling the extraction of many more answers. To validate this, we analyzed the intermediate outputs of QAFACTEVAL: it extracts an average of only 1.95 answers on the FEVER dataset, significantly fewer than the more than 10 answers typically extracted for summaries. (2) Across the two datasets, the correlations between all automatic metrics and Intelligibility are low; the extremely high proportion of intelligible outputs may explain this. (3) The correlations for learning-based metrics, including QAFACTEVAL and FACTCC, drop significantly on SCIFACT. This is likely caused by the lack of fine-tuning or pre-training on biomedical data.
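For reference, the correlation measure used above can be computed with a minimal tau-a sketch (no tie correction; a production implementation such as scipy.stats.kendalltau also handles ties):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Tau-a: (concordant - discordant) / total pairs; assumes no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))  # 0.6666666666666666
```

Here xs would be a metric's scores and ys the corresponding human judgments over the same system outputs.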
Related Work

Factual Error Correction
A growing body of work has begun to explore factual error correction in recent years, following the rise of fact-checking (Thorne et al., 2018; Wadden et al., 2020; Gupta and Srikumar, 2021; Huang et al., 2022b) and fake news detection (Shu et al., 2020; Fung et al., 2021; Huang et al., 2022a). Shah et al. (2020) propose a distant-supervision method based on a masker-corrector architecture, which assumes access to a learned fact verifier. Thorne and Vlachos (2021) created the first factual error correction dataset by repurposing the FEVER (Thorne et al., 2018) dataset, which allows for fully-supervised training of factual error correctors. They also extended Shah et al. (2020)'s method with more advanced pre-trained sequence-to-sequence models. Most recently, Schick et al. (2022) proposed PEER, a collaborative language model that demonstrates superior text-editing capabilities thanks to its multiple text-infilling pre-training objectives, such as planning and realizing edits as well as explaining the intention behind each edit 6 .

Faithfulness
Previous studies addressing faithfulness are mostly in the summarization field and can be roughly divided into two categories: evaluation and enhancement. Within faithfulness evaluation, one line of work develops entailment-based metrics by training document-sentence entailment models on synthetic data (Kryscinski et al., 2020; Yin et al., 2021) or by applying conventional NLI models at the sentence level. Another line of work evaluates faithfulness by comparing information units extracted from summaries and input sources using QA (Deutsch et al., 2021). A recent study integrates QA into entailment by feeding QA outputs as features to an entailment model (Fabbri et al., 2022a). We combine QA and entailment by using entailment to score the correction candidates produced by QA. Within faithfulness enhancement, some work improves factual consistency by incorporating auxiliary losses into the training process (Nan et al., 2021; Cao and Wang, 2021; Tang et al., 2022; Huang et al., 2023), while other work devises factuality-aware pre-training and fine-tuning objectives to reduce hallucinations (Wan and Bansal, 2022). Most similar to our work are studies that utilize a separate rewriting model to fix hallucinations in summaries (e.g., Cao et al., 2020).

Conclusions and Future Work
We have presented ZEROFEC, a zero-shot framework that asks questions about an input claim and seeks answers from the given evidence to correct factual errors faithfully. The experimental results demonstrate the superiority of our approach over prior methods, including fully-supervised methods, as indicated by both automatic metrics and human evaluations. More importantly, the decomposability of ZEROFEC naturally offers interpretability, as the questions and answers generated directly reflect which information units in the input claim are incorrect and why. Furthermore, we reveal the most suitable metric for assessing faithfulness of factual error correction by analyzing the correlation between the reported automatic metrics and human judgments. For future work, we plan to extend our framework to faithfully correct misinformation in social media posts and news articles to inhibit the dissemination of false information.

Limitations
Although our approach has demonstrated advantages in producing faithful factual error corrections, we recognize that our approach is not capable of correcting all errors, particularly those that require domain-specific knowledge, as illustrated in Table 3. Therefore, it is important to exercise caution when applying this framework in user-facing settings. For instance, end users should be made aware that not all factual errors may be corrected.
In addition, our approach assumes evidence is given. Although this assumption is also true for applying our method to summarization tasks since the source document is treated as evidence, it does not hold for automatic textual knowledge base updates. When updating these knowledge bases, it is often required to retrieve relevant evidence from external sources. Hence, a reliable retrieval system is required when applying our method to this task.

Ethical Considerations
While no fine-tuning is needed for ZEROFEC, its inference time and memory usage are three to four times more than similar-sized baseline systems due to its multi-component architecture, implying higher environmental costs during test time. In addition, the underlying components of our method are based on language models pre-trained on data collected from the internet. These language models have been shown to exhibit potential issues, such as political or gender biases. While we did not observe such biases during our experiments, users of these models should be aware of these issues when applying them.

A Dataset Statistics
Details of the dataset statistics are shown in Table 5.

B Human Evaluation Details
In this section, we describe the details of our human evaluation. We recruited three engineering and science graduate students to ensure high-quality evaluation. For each HIT, annotators are provided with an input claim, the corresponding evidence and gold correction, and a predicted correction generated by a model. Based on the presented predictions, annotators are tasked with answering the three questions shown on the right segment of the interface, each of which corresponds to Intelligibility, Factuality, or Faithfulness. They need to determine whether the predicted correction meets the three criteria according to each prompt. Our human evaluation interface is displayed in Figure 4. Since the evaluation questions are self-explanatory, we only provide the human evaluators with terminology definitions and multiple examples of how evaluations should be conducted. Terminology is defined as follows:
• Input claim: A sentence fed into a factual error correction system.
• Predicted correction: The output from the factual error correction system.
• Gold correction: Ground-truth label that the system aims to produce.
• Evidence: A document that the factual error correction system uses to fix factual errors.
We maintain frequent communication with the human evaluators, including answering any questions they may have, to facilitate the evaluation process.

C Ablation Studies
In each ablation study, we replace the component under examination while keeping all other components the same as ZEROFEC. We report the performance on the FEVER dataset in SARI and QAFACTEVAL since these two metrics demonstrate the highest correlation with human judgments regarding faithfulness. Ablation results are presented in Table 6.

Effect of Question Generation
We compared MixQG with a T5-base model trained on SQuAD (Rajpurkar et al., 2016). The results indicate that the final performance is not significantly affected by the choice between these two models. Upon further investigation, we surprisingly discovered that despite SQuAD exclusively comprising extractive question answering examples, the T5-base model trained on it can generalize to other answer types. For example, given the answer "not" and the claim "Cleopatre is not a queen.", T5-base (SQuAD) generates "Is Cleopatre a queen?". Therefore, training MixQG on multiple QA datasets does not yield an advantage here.

Effect of Question Answering
We experimented with an abstractive QA model, UnifiedQA (Khashabi et al., 2020), and two extractive QA models trained on SQuAD. We found that UnifiedQA performs similarly to UnifiedQA-v2, whereas using either extractive QA model leads to significant performance drops. This is likely because SQuAD only includes extractive answer types. Although the encoder-decoder architecture of T5-base allows it to output words that are not present in the context, it fails to generate these types of answers. For instance, given the question "Was Cleopatre a queen?" and the context "Cleopatra VII Philopator was Queen of the Ptolemaic Kingdom of Egypt...", T5-base (SQuAD) outputs "Queen" instead of "Yes".
Effect of QA-to-claim
For QA-to-claim, we ablated different training data while keeping the same model architecture. Similar to our findings in the ablation study on QA, when T5-base is trained only on QA2D or SciTail, it cannot convert boolean-typed questions and answers into declarative sentences, resulting in a marked decline in performance.
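
The conversion that T5-base fails to learn from QA2D or SciTail alone can be illustrated with a toy rule-based sketch. This is purely illustrative of the input/output behavior of the QA-to-claim step; ZEROFEC uses a trained seq2seq model, and the heuristics below (auxiliary-verb detection, naive subject splitting) are our own simplifications:

```python
def qa_to_claim(question, answer):
    """Toy rule-based QA-to-claim conversion; ZEROFEC uses a trained
    T5 model, so these heuristics are illustrative simplifications."""
    words = question.rstrip("?").strip().split()
    aux = {"is", "was", "are", "were", "does", "did", "can", "has", "have"}
    if words[0].lower() in aux:  # boolean-typed question
        verb, rest = words[0].lower(), words[1:]
        # "Is X Y?" + "Yes" -> "X is Y."; "No" -> "X is not Y."
        negation = [] if answer.lower().startswith("yes") else ["not"]
        subject, predicate = rest[0], rest[1:]  # naive subject split
        return " ".join([subject, verb] + negation + predicate) + "."
    # wh-question: substitute the answer for the wh-word
    return f"{answer} {' '.join(words[1:])}."

print(qa_to_claim("Is Cleopatre a queen?", "Yes"))  # Cleopatre is a queen.
print(qa_to_claim("Is Cleopatre a queen?", "No"))   # Cleopatre is not a queen.
```

A model trained only on extractive QA pairs never sees the boolean branch above, which is consistent with the performance decline we observe.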

Effect of Correction Scoring
We studied other scoring methods, including replacing DocNLI with FactCC and removing ROUGE-1. Using FactCC leads to a substantial performance drop, suggesting that DocNLI is a better approximation of faithfulness than FactCC. Furthermore, incorporating ROUGE-1 into the scoring criteria allows us to select the faithful correction that is most relevant to the input claim; accordingly, we observe a large drop in SARI when ROUGE-1 is removed.
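
The combined scoring can be sketched as follows. The entailment probabilities here are hypothetical stand-ins for DocNLI outputs, and the additive combination of entailment and ROUGE-1 is an illustrative assumption rather than ZEROFEC's exact formula:

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """Unigram ROUGE-1 F1 between a candidate correction and the claim."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r)

def score_corrections(claim, candidates, entail_prob):
    """Pick the candidate maximizing faithfulness (entailment probability
    from a DocNLI-style model) plus ROUGE-1 relevance to the input claim."""
    return max(candidates, key=lambda c: entail_prob(c) + rouge1_f(c, claim))

# Hypothetical entailment probabilities; a real system would query DocNLI.
candidates = {
    "COVID-19 has a case fatality rate of below 2%.": 0.95,  # faithful, on-topic
    "COVID-19 is highly contagious.": 0.90,                  # faithful, off-topic
}
claim = "COVID-19 has a case fatality rate of 10%."
print(score_corrections(claim, list(candidates), candidates.get))
```

Without the ROUGE-1 term, the two candidates would be nearly indistinguishable (0.95 vs. 0.90); relevance to the claim is what breaks the tie in favor of the on-topic correction.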

D Additional Qualitative Analysis
As mentioned in §5.1, we analyzed 50 correct and 50 incorrect outputs produced by ZEROFEC. All 50 correct outputs are generated by asking the correct questions, answering them correctly using the evidence, and scoring candidates faithfully w.r.t. the evidence. Examples are shown in Table 7. Among the incorrect outputs, most errors are caused by DocNLI's inability to approximate faithfulness, even though it is the state-of-the-art document-sentence entailment model, as shown by the last instance in Table 8. In addition, annotation errors arise from how the FEVER dataset was constructed (i.e., for fact-checking purposes). As demonstrated by the first example in Table 8, our correction is faithful to the evidence and is also more relevant to the input claim than the ground truth. As for errors in the question answering module, most are under-specified answers. For example, in the second instance in Table 8, the generated answer "pop music duo" is faithful to the evidence but is under-specified compared to the expected answer "R&B singers".

E Software and Hardware Configurations
All experiments were conducted on an Ubuntu 18.04.6 Linux machine with a single NVIDIA V100 GPU. We use PyTorch 1.11.0 with CUDA 10.2 as the deep learning framework and Transformers 4.19.2 to load all pre-trained language models.