Fact-Checking Complex Claims with Program-Guided Reasoning

Fact-checking real-world claims often requires collecting multiple pieces of evidence and applying complex multi-step reasoning. In this paper, we present Program-Guided Fact-Checking (ProgramFC), a novel fact-checking model that decomposes complex claims into simpler sub-tasks that can be solved using a shared library of specialized functions. We first leverage the in-context learning ability of large language models to generate reasoning programs to guide the verification process. Afterward, we execute the program by delegating each sub-task to the corresponding sub-task handler. This process makes our model both explanatory and data-efficient, providing clear explanations of its reasoning process and requiring minimal training data. We evaluate ProgramFC on two challenging fact-checking datasets and show that it outperforms seven fact-checking baselines across different settings of evidence availability, with explicit output programs that benefit human debugging. Our code and data are publicly available at https://github.com/mbzuai-nlp/ProgramFC.


Introduction
The proliferation of disinformation, e.g., in social media, has made automated fact-checking a crucial application of natural language processing (NLP). Given a claim, the goal is to find evidence and then to make a verdict about the claim's veracity based on that evidence (Glockner et al., 2022; Guo et al., 2022).
Evaluating the veracity of real-world claims often involves collecting multiple pieces of evidence and applying complex reasoning (Jiang et al., 2020;Nguyen et al., 2020;Aly and Vlachos, 2022;Chen et al., 2022a). For instance, consider the claim "Both James Cameron and the director of the film Interstellar were born in Canada". It may be challenging to find direct evidence on the web that refutes or supports this claim.
Instead, a human fact-checker needs to decompose the claim, gather multiple pieces of evidence, and perform step-by-step reasoning (Nakov et al., 2021a), as illustrated in Figure 1. This makes verifying complex claims much more challenging than the typical setting explored in previous work, where information from a single article is sufficient to support or refute the claim (Saakyan et al., 2021; Schuster et al., 2021; Pan et al., 2021; Wadden et al., 2022a; Krishna et al., 2022).
Besides multi-step reasoning, we need to consider two further key aspects when developing a reliable fact-checking system: (i) Explainability: The model should not only predict the veracity of the claim, but should also provide a clear explanation of its reasoning process to help users understand and trust the results. (ii) Data efficiency: Human annotation is often time-consuming, costly, and potentially biased, making it difficult to collect sufficient high-quality labeled data for model training, particularly for complex claims. Therefore, it is desirable to build a model that performs well with minimal or no training data. Although a few models (Zhou et al., 2019; Zhong et al., 2020; Aly and Vlachos, 2022) have been proposed to facilitate multi-step reasoning in fact-checking, they either lack explainability in their reasoning process or require a large number of task-specific training examples.
In this paper, we present Program-Guided Fact-Checking (PROGRAMFC), a novel fact-checking framework that is both explanatory and data-efficient. Figure 1 illustrates our approach. To verify complex claims, PROGRAMFC decomposes them into simpler sub-tasks that can be solved using a shared library of specialized sub-task functions. Specifically, PROGRAMFC begins by generating a reasoning program for the input claim, which is a sequence of sub-tasks (e.g., S1-S4 in Figure 1) in the form of ACTION[ARGUMENT], where ACTION and ARGUMENT define the type and the content of the sub-task, respectively.
The generated reasoning program serves as a step-by-step guide for verifying the claim. We then execute the program by sequentially delegating each sub-task to the corresponding sub-task handler, as shown in the functions columns in Figure 1. These sub-tasks may include answering questions, verifying simple claims, or conducting logical reasoning.
PROGRAMFC combines explainability with data efficiency. It uses reasoning programs to provide clear explanations of its reasoning process. For data efficiency, it builds on the ability of Large Language Models (LLMs) to solve various tasks given only a few examples as prompts, i.e., in-context learning (Brown et al., 2020). We leverage this ability to generate reasoning programs for a given claim by showing the model just a few dozen (claim, program) pairs as demonstrations. PROGRAMFC is also flexible, as it allows for easy swapping of sub-task function implementations to work under different fact-checking settings without affecting the rest of the system. We can allow the functions to retrieve information from external sources (in an open-book setting), or we can ask them to generate answers based solely on the LLM's internal parametric knowledge (in a closed-book setting).
We evaluate PROGRAMFC on two challenging datasets designed for fact-checking complex claims: HOVER (Jiang et al., 2020) and FEVEROUS (Aly et al., 2021), and we show that it outperforms seven few-shot fact-checking baselines on both datasets (§ 4.1).
The strategy of program-guided reasoning becomes increasingly effective as the required reasoning depth increases ( § 4.1). In the open-domain setting, we find that reasoning programs can enhance the retrieval of relevant evidence from knowledge sources ( § 4.2). Moreover, PROGRAMFC is robust even when we use weak models as sub-task solvers ( § 4.2). We also evaluate the interpretability of the reasoning programs through human evaluation and error analysis ( § 4.3).

Related Work
Fact-Checking. Automated fact-checking has gained significant attention in the NLP research community in recent years as a means of combating misinformation and disinformation. Various datasets have been proposed that enable the development and the evaluation of systems for automatic fact-checking, the most popular ones being based on human-crafted claims from Wikipedia content (Sathe et al., 2020; Schuster et al., 2021) and naturally occurring claims in the political or in the scientific domain (Wang, 2017; Nakov et al., 2021b, 2022; Augenstein et al., 2019; Saakyan et al., 2021; Gupta and Srikumar, 2021; Wadden et al., 2020, 2022a). Notably, most of these datasets are constructed in a way that the evidence to support or to refute a claim can be found in a single document. For example, in FEVER, more than 87% of the claims only require information from a single Wikipedia article (Jiang et al., 2020).
To bridge this gap, datasets have been proposed to study fact-checking complex claims that require multi-step reasoning (Jiang et al., 2020; Aly et al., 2021). Graph-based models (Zhou et al., 2019; Liu et al., 2020; Zhong et al., 2020; Nguyen et al., 2020; Barnabò et al., 2022, 2023) are used to facilitate reasoning over multiple pieces of evidence. Although such models achieve sizable performance gains, they lack explainability and they rely on large amounts of training data. To address these problems, we propose an explainable, flexible, and data-efficient model that generates reasoning programs as explanations and uses in-context learning to enable few-shot learning.

Explanation Generation. Facing the complexities of real-world claims, simply assigning a final veracity label to a claim often fails to be persuasive (Guo et al., 2022). Previous research has proposed various approaches to provide post-hoc explanations for model predictions, such as using attention weights to highlight relevant parts of the evidence (Popat et al., 2017; Cui et al., 2019; Lu and Li, 2020), generating justifications with logic-based systems over knowledge graphs (Gad-Elrab et al., 2019; Ahmadi et al., 2019), and summarizing the retrieved relevant evidence (Atanasova et al., 2020; Kotonya and Toni, 2020; Jolly et al., 2022). In contrast, we propose to use reasoning programs to provide explanations that consist of sub-tasks described in program-like natural language. This offers several advantages: it allows for explanations that are not confined to the evidence, unlike attention weights; it is more flexible than logic-based explanations; and it is more concise than free-form summarization.
Chain-of-Thought Reasoning. Moreover, unlike previous work that generates post-hoc explanations, we also use reasoning programs as guidance for predicting the veracity of the claim. This is motivated by the recent success of chain-of-thought (CoT) prompting (Kojima et al., 2022), which generates step-by-step natural language reasoning steps to guide the model in answering complex questions. We adapt this idea to fact-checking complex claims. Unlike the original CoT, which uses a single LLM for both decomposition and question answering, we use the language model only to generate reasoning programs as the blueprint for problem-solving, and we delegate each sub-task to specialized functions. This approach reduces the burden on the language model and allows for more flexibility in incorporating components necessary for fact-checking, such as an evidence retriever. The strategy of program-guided reasoning is also in line with the recent trend of tool-augmented language models (Mialon et al., 2023; Schick et al., 2023), i.e., augmenting language models with access to external tools and resources.

PROGRAMFC
We first formulate the problem of fact-checking and then we introduce our proposed model for Program-Guided Fact-Checking (PROGRAMFC).

Problem Formulation
Given a claim C, a fact-checking model F aims to predict a label Y to evaluate the claim as TRUE or FALSE, based on a knowledge source K. The model is also required to output an explanation E to justify the predicted veracity label. We distinguish three settings of fact-checking, depending on the type of knowledge source K:

• Gold evidence: For each claim, K is the set of gold evidence documents that can support or refute the claim. This setting is also called claim verification (Pan et al., 2021; Wright et al., 2022).
• Open-book setting: K is a large textual corpus such as Wikipedia. The model first retrieves relevant evidence from the corpus and then predicts the veracity label based on the evidence (Jiang et al., 2021;Wadden et al., 2022b).
• Closed-book setting: The model does not have access to any external knowledge source (K = ∅). It needs to leverage the knowledge stored in its parameters (acquired during pre-training and fine-tuning) to verify the claim. This setting was explored in work that applies large language models to fact-checking (Lee et al., 2020, 2021).

Program-Guided Reasoning
Our goal is to fact-check a complex claim C that requires multi-step reasoning. We focus on the few-shot setting, where only a small set of in-domain examples is available to teach the model. To this end, PROGRAMFC follows a program generation-and-execution paradigm, as shown in Figure 1.
Program Generation. At this stage, given the input claim C, a planner generates a reasoning program P = [S_1, ..., S_n] for it, which consists of n sequentially ordered reasoning steps S_i. Each reasoning step S_i ∈ P is an instruction in controlled natural language that maps to a function in an auxiliary set of sub-task functions F available to the system. To be specific, we define S_i = (f_i, A_i, V_i), where f_i ∈ F specifies the sub-task function, A_i is the argument passed to the function f_i, and V_i is the variable that stores the result returned from the function call f_i(A_i). For a valid reasoning program, the return value of the last reasoning step must be a Boolean value indicating the veracity label of the claim C, i.e., V_n ∈ {TRUE, FALSE}.
Program Execution. In the execution stage, the reasoning program P is run by an interpreter to derive the veracity label of the claim C. The interpreter sequentially parses the reasoning steps in P. For each step S_i = (f_i, A_i, V_i), it calls the corresponding off-the-shelf sub-task function f_i and passes the argument A_i to it. The argument A_i is either a logical expression or a natural language sentence, e.g., a question or a simple claim. The result of the function call is then stored in the variable V_i. As it is common for a subsequent step to depend on the results from previous steps, we allow the argument A_i to refer to the variables V_1, ..., V_{i-1} from previous steps. For example, in Figure 1, the argument in S_3 is "{ANSWER_1} was born in Canada.", which refers to the return variable {ANSWER_1} from S_2. When executing S_3, the variable is replaced by its actual value, and the argument becomes "Christopher Nolan was born in Canada". After executing the last step, the return value is the predicted veracity of the claim C.
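The execution loop described above can be sketched as follows. This is a minimal illustration: the tuple representation of steps and the toy sub-task handlers are our own simplifications (the paper's actual handlers are FLAN-T5 modules), while PREDICT evaluates a logical expression over the results of earlier steps.

```python
def execute_program(steps, handlers):
    """Sequentially run reasoning steps of the form (variable, function, argument).

    Arguments that mention {var} placeholders are filled in with the results
    of earlier steps before the corresponding sub-task handler is called.
    """
    variables = {}
    for var, func, arg in steps:
        # Substitute results of previous steps, e.g. "{answer_1} was born in Canada."
        for name, value in variables.items():
            arg = arg.replace("{" + name + "}", str(value))
        variables[var] = handlers[func](arg, variables)
    # The return value of the last step is the predicted veracity label.
    return variables[steps[-1][0]]

# Toy handlers standing in for the FLAN-T5 sub-task modules.
handlers = {
    "Question": lambda q, env: "Christopher Nolan",
    "Verify": lambda c, env: "Cameron" in c or "Christopher Nolan" in c,
    "Predict": lambda expr, env: eval(expr, {}, env),  # logical expression over variables
}

# The reasoning program from Figure 1, in tuple form.
program = [
    ("fact_1", "Verify", "James Cameron was born in Canada."),
    ("answer_1", "Question", "Who is the director of the film Interstellar?"),
    ("fact_2", "Verify", "{answer_1} was born in Canada."),
    ("label", "Predict", "fact_1 and fact_2"),
]
```

With the toy handlers above, executing the program substitutes "Christopher Nolan" into the third step and returns the conjunction of the two verification results.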
Aggregating Reasoning Paths. Note that there might be multiple reasoning paths that can reach the final veracity label. Therefore, we generate a diverse set of N candidate reasoning programs P = {P 1 , · · · , P N } for the input claim. After executing all programs in P, we take the majority vote over all N predicted labels as the final label. This approach is similar to how humans rely on multiple methods of validation to increase their confidence in fact-checking. It also makes the model less susceptible to errors in individual reasoning programs.
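The aggregation step is a plain majority vote over the N program outcomes; a minimal sketch (the function name is our own):

```python
from collections import Counter

def aggregate_predictions(labels):
    """Majority vote over the veracity labels produced by N candidate
    reasoning programs; ties break toward the earliest-seen label."""
    return Counter(labels).most_common(1)[0][0]
```

For example, with N = 5 programs predicting ["TRUE", "TRUE", "FALSE", "TRUE", "FALSE"], the final label is "TRUE".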

Reasoning Program Generation
We base our program generator on Codex (Chen et al., 2021), a code-pretrained LLM that can parse natural language into symbolic representations such as SQL (Cheng et al., 2022) or Python programs (Gao et al., 2022; Chen et al., 2022b). However, the grammar of a reasoning program differs from that of a standard programming language. We take advantage of Codex's few-shot generalization ability and find that it can learn effectively from only a small number of in-context examples D = {d_1, ..., d_|D|}. Each example d_i consists of a claim and a program. The program has a Python-like grammar, where each reasoning step is written in the format V_i = f_i(A_i). At inference time, we prompt Codex with an instruction for the task, K in-context examples, and the input claim C. Codex then completes the prompt, thereby generating a program for C. The prompt template is shown in Figure 2. We use K = 20 to balance the diversity of reasoning types against the model's maximum input length. We use sampling-based decoding with a temperature of 0.7 to generate different reasoning programs across multiple runs.
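A minimal sketch of how such a few-shot prompt can be assembled. The instruction text and demonstration format follow Figure 2; the helper name `build_program_prompt` is our own, and the call to the Codex API is omitted.

```python
def build_program_prompt(instruction, demonstrations, claim):
    """Assemble the few-shot prompt for reasoning-program generation.

    `demonstrations` is a list of (claim, program_text) pairs; the model is
    expected to continue the final, empty `def program():` block.
    """
    parts = ["'''", instruction, "'''", ""]
    for demo_claim, program_text in demonstrations:
        parts.append(f"# The claim is that {demo_claim}")
        parts.append("def program():")
        parts.append(program_text)
        parts.append("")
    # The input claim, left for the model to complete.
    parts.append(f"# The claim is that {claim}")
    parts.append("def program():")
    return "\n".join(parts)
```

The resulting string is sent to Codex as a completion prompt; sampling it several times at temperature 0.7 yields the N candidate programs used for aggregation.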

Sub-Task Functions
We implement three sub-task functions for the model to call during the program execution.
• QUESTION: This sub-task function is a question-answering module that takes a question Q as the input argument and returns the answer A to the question. We use FLAN-T5 (Chung et al., 2022), an improved T5 model (Raffel et al., 2020) pretrained on more than 1.8K tasks with instruction tuning, which has achieved state-of-the-art zero-/few-shot performance on many QA benchmarks. As shown in Figure 3, we prompt the model differently depending on the settings defined in Section 3.1. For the closed-book setting, the input prompt is "Q: QUESTION? The answer is:". For the other two settings, the input prompt is "EVIDENCE Q: QUESTION? The answer is:".

• VERIFY: This is a fact verification module that takes a claim C as the input argument and returns a label of either TRUE or FALSE. We also use FLAN-T5 for this module, prompting the model in the following question-answering format: "EVIDENCE Q: Is it true that CLAIM? True or False? The answer is:".

• PREDICT: This module takes as input a logical expression that performs AND, OR, and NOT operations over the variables from the previous steps. Its output is returned as the predicted veracity label.
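The QUESTION and VERIFY prompt formats above can be sketched as simple string templates. The helper names are our own, and the actual FLAN-T5 inference call is omitted.

```python
def question_prompt(question, evidence=None):
    """QUESTION sub-task prompt: closed-book when no evidence is given,
    gold-evidence / open-book when evidence is prepended."""
    base = f"Q: {question}? The answer is:"
    return base if evidence is None else f"{evidence}\n{base}"

def verify_prompt(claim, evidence=""):
    """VERIFY sub-task prompt in the question-answering format."""
    return f"{evidence}\nQ: Is it true that {claim}? True or False? The answer is:"
```

Each prompt is fed to the same FLAN-T5 model; only the template changes across the three evidence settings, which is what makes the sub-task implementations easy to swap.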
'''
Generate a python-like program that describes the reasoning steps required to verify the claim step-by-step. You can call three functions in the program: 1. Question() to answer a question; 2. Verify() to verify a simple claim; 3. Predict() to predict the veracity label.
'''

# The claim is that Both James Cameron and the director of the film Interstellar were born in Canada.
def program():
    fact_1 = Verify("James Cameron was born in Canada.")
    Answer_1 = Question("Who is the director of the film Interstellar?")
    fact_2 = Verify("{Answer_1} was born in Canada.")
    label = Predict(fact_1 and fact_2)

(· · · more in-context examples here · · ·)

# The claim is that <input_claim>
def program():

Figure 2: The prompt template for reasoning program generation.

Figure 3: Implementation of the question-answering sub-task function for three different settings: gold evidence, open-book, and closed-book.

Experiments
Datasets. Most fact-checking datasets consist primarily of simple claims that can be substantiated through a single piece of evidence. However, here we focus on complex claims that need multi-step reasoning. Given this context, we opt to evaluate our model on the only two datasets that, to the best of our knowledge, fulfill these criteria: HOVER (Jiang et al., 2020) and FEVEROUS (Aly et al., 2021). We use the validation sets for evaluation since the test sets are not publicly released. HOVER contains claims that require integration and reasoning over multiple Wikipedia articles. We divide its validation set into three subsets based on the number of "hops" required to verify the claim: 1,126 two-hop claims, 1,835 three-hop claims, and 1,039 four-hop claims. FEVEROUS focuses on fact-checking complex claims over unstructured and structured data, where each claim is annotated with evidence in the form of sentences and/or cells from tables in Wikipedia. Since we focus on textual fact-checking, we only selected claims that require exclusively sentence evidence, constituting 2,962 claims. We call this subset FEVEROUS-S.
For evaluation in the open-book setting, we use the corresponding Wikipedia corpus constructed for these two datasets as the knowledge sources. HOVER uses the October 2017 Wikipedia dump processed by Yang et al. (2018), consisting of the introductory sections of 5.2 million Wikipedia pages. FEVEROUS uses the December 2020 dump, including 5.4 million full Wikipedia articles.
Baselines. We compare PROGRAMFC to seven baselines, categorized into three groups. (i) Pre-trained models: BERT-FC (Soleimani et al., 2020) and LisT5 (Jiang et al., 2021) are two models that leverage BERT and T5 for fact verification, respectively. (ii) FC/NLI fine-tuned models: we choose three pretrained models that are fine-tuned on other fact-checking datasets or natural language inference (NLI) datasets, including RoBERTa-NLI (Nie et al., 2020) and DeBERTaV3-NLI. (iii) In-context learning models: FLAN-T5 (Chung et al., 2022) and Codex (Chen et al., 2021), prompted directly to verify the claim. For few-shot learning, all models have access to 20 in-domain training examples. We use these examples either for fine-tuning the pre-trained models (BERT-FC and LisT5), for continued fine-tuning of the FC/NLI fine-tuned models, or as in-context examples for FLAN-T5 and Codex. For PROGRAMFC, we use them as in-context examples for reasoning program generation.
We evaluate both the gold evidence setting and the open-book setting. The baseline models are the same for both settings. However, during testing in the open-book setting, the models are given the retrieved evidence rather than the ground-truth evidence. We use BM25 (Robertson and Zaragoza, 2009) implemented with the Pyserini toolkit (Lin et al., 2021) as the retriever for both PROGRAMFC and the baselines. We use as evidence the top-10 paragraphs retrieved from the knowledge corpus.
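For illustration, the one-step retrieval can be sketched with a toy in-memory BM25 ranker. The paper uses Pyserini's Lucene-based implementation over the full Wikipedia corpora; the k1 and b defaults below follow Anserini's defaults, which is an assumption, and the whitespace tokenization is a simplification.

```python
import math
from collections import Counter

def bm25_rank(query, docs, k=10, k1=0.9, b=0.4):
    """Toy BM25 ranker returning the indices of the top-k documents.

    Illustrates the one-step retrieval; not the Lucene implementation
    the paper actually uses.
    """
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(d) for d in tokenized) / N
    # Document frequencies for the idf term.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    def score(q_tokens, d_tokens):
        tf = Counter(d_tokens)
        s = 0.0
        for t in q_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d_tokens) / avgdl))
        return s

    q = query.lower().split()
    ranked = sorted(range(N), key=lambda i: score(q, tokenized[i]), reverse=True)
    return ranked[:k]
```

The claim (or, in PROGRAMFC's iterative variant, each sub-task argument) serves as the query, and the top-10 ranked paragraphs become the evidence passed to the verifier.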

Main Results
We report the overall results for PROGRAMFC and for the baselines for few-shot fact-checking in Table 1. PROGRAMFC achieves the best performance on 7 out of 8 evaluations, demonstrating its effectiveness. We have three more specific observations.
ProgramFC is more effective on deeper claims. On the HOVER dataset, ProgramFC (N=5) outperforms the baselines on average by 10.38%, 11.37%, and 14.77% on two-hop, three-hop, and four-hop claims, respectively. This suggests that ProgramFC becomes increasingly effective as the required reasoning depth increases. Among the baselines, DeBERTaV3-NLI performs comparably to ProgramFC on two-hop claims, indicating that large-scale pre-training on simpler claims can help the model generalize to more complex claims.
However, this generalization becomes more challenging as the complexity of the claims increases. On HOVER, the F1 score of DeBERTaV3-NLI drops from 77.22 for 2-hop claims to 60.49 for 4-hop claims, which is a decrease of 21.7%. In contrast, the performance drop for ProgramFC, which uses the strategy of program-guided reasoning, is much smaller: just 11.7%.
Decomposition is more effective than one-step prediction. The ProgramFC model, which uses the same FLAN-T5 model as the sub-task functions, outperforms the baseline of directly verifying claims with FLAN-T5 on all four datasets. On average, there is a 6.0% improvement in the gold evidence setting and a 4.5% improvement in the open-book setting. This suggests that decomposing a complex claim into simpler steps with a program can facilitate more accurate reasoning. This is especially evident when the required reasoning is complex: there is a 14.9% improvement in the gold evidence setting and a 6.7% improvement in the open-book setting for 4-hop claims.
Aggregating reasoning programs is helpful. We find that aggregating the predictions of N = 5 reasoning programs improves the performance over using a single program by an average of 1.5%. This aligns with previous findings for question answering: if multiple different ways of thinking lead to the same answer, we can have greater confidence that the final answer is correct. This intuition also applies to fact-checking, as each program represents a unique reasoning chain to verify the claim.

How Does the Reasoning Program Help?
To further understand how reasoning programs facilitate fact-checking, we compare the performance of PROGRAMFC with FLAN-T5 across different language model sizes: small, base, large, XL, and XXL. The results, shown in Figure 4, indicate that program-guided reasoning is particularly effective when the model size is small. As smaller models have less capacity for complex reasoning, the performance of the end-to-end FLAN-T5 model decreases significantly with decreasing model size. This trend is less pronounced for PROGRAMFC: the high-level plan offered by the reasoning program substantially alleviates the demands on the subsequent sub-task solvers. Our results show that the program-guided model using FLAN-T5-small (80M parameters) as the sub-task solver can achieve performance comparable to the 137× larger end-to-end FLAN-T5-XXL (11B) model on 4-hop claims.
In the open-domain setting, we find that reasoning programs can enhance the retrieval of relevant evidence from the knowledge source. Figure 5 compares the retrieval performance of the one-step BM25 retriever used in the baselines to the iterative step-by-step BM25 retriever in PROGRAMFC.
We measure the recall of the gold paragraphs among the top-10 retrieved paragraphs (recall@10). For PROGRAMFC, we combine the retrieved paragraphs of all steps and consider the top-10 results. We can see in Figure 5 that PROGRAMFC outperforms one-step retrieval on all datasets, with the largest improvement of 37.1% on HOVER 4-hop. This is because some information may not be present in the original claim, but is only revealed during the reasoning process (e.g., "Christopher Nolan" in Figure 1). Thus, iterative retrieval guided by the reasoning program yields better results.
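The recall@10 metric with step-wise combination can be computed as follows; how duplicates across steps are merged before truncation is our own simplification of the procedure described above.

```python
def recall_at_k(gold, retrieved_per_step, k=10):
    """Recall@k of the gold paragraphs over the combined retrievals.

    `retrieved_per_step` is a list of ranked result lists, one per program
    step; for the one-step baseline it contains a single list. Results are
    deduplicated in order of first appearance, then truncated to k.
    """
    combined = []
    for step_results in retrieved_per_step:
        for doc in step_results:
            if doc not in combined:
                combined.append(doc)
    top_k = combined[:k]
    return sum(1 for g in gold if g in top_k) / len(gold)
```

For instance, a two-step retrieval that finds one gold paragraph per step achieves recall@10 of 1.0, even when neither single step retrieves both.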

Interpretability of Reasoning Programs
An advantage of PROGRAMFC is that it improves the interpretability of fact-checking compared to end-to-end models, as the explicit program can aid human understanding and debugging. Examples of generated reasoning programs can be found in Figure 7 of Appendix B. To assess the quality of the generated reasoning programs, we sampled 300 claims for which PROGRAMFC incorrectly predicted the final veracity label from the HOVER 2-hop, 3-hop, and 4-hop datasets, with 100 examples per dataset. We asked human annotators to analyze the error types, and we classified the results into three categories: (i) Syntactic errors, where the program does not conform to the defined grammar and cannot be parsed; (ii) Semantic errors, which include incorrect or missing arguments/variables (Token), incorrect program structure (Structure), and incorrect sub-task calls (Subtask); and (iii) Incorrect execution, where the program is correct, but the incorrect prediction results from its execution.
We show the error analysis in Table 2. First, no syntax errors were found in our samples, indicating that Codex effectively generates executable programs through few-shot in-context learning.

Claim: Emery, located in the same state as Edison Local School District, is a ghost town. It is near the city that lies close to the Ohio Turnpike, a 241.26 mi highway.

Figure 6: An error case from the HOVER 4-hop dataset where the generated reasoning program has an incorrect program structure. The incorrect segment(s) are marked in red, and the correct revisions are marked in green.

Second, for 2-hop claims, we find that 71% of the programs are correct. The majority of the errors result from incorrect program execution, where the question-answering or the fact-checking modules failed to return the correct answer.
Third, as the complexity of the claims increases, the proportion of semantic errors in the programs also increases, with structural errors becoming particularly prevalent. This highlights the difficulty of generating the appropriate step-by-step reasoning strategies for claims that require long-chain reasoning. An example structural error is shown in Figure 6, where the model fails to parse the second sentence of the claim into correct program instructions. Additional error examples can be found in Appendix C.

Closed-Book Fact-Checking
Finally, we evaluate the closed-book setting, where the model does not have access to any knowledge source and needs to rely only on its parametric knowledge. The baseline models from groups I and II in Table 1 are trained with (evidence, claim) pairs and thus are not applicable in this setting. We compare our method to the baselines that use large language models for in-context learning, including Codex (code-davinci-002) and FLAN-T5 from Table 1. We also include the 175B-parameter InstructGPT (text-davinci-002) (Ouyang et al., 2022) with four different prompts: (i) direct prompting with the claim; (ii) CoT, i.e., chain-of-thought prompting with demonstrations; (iii) ZS-CoT (Kojima et al., 2022), i.e., zero-shot chain-of-thought with the prompt "let's think step by step"; and (iv) Self-Ask (Press et al., 2022), a variant of CoT that guides the model's reasoning by asking a series of questions. The detailed prompting templates are given in Appendix E.
Our results, presented in Table 3, show that most models achieve a Macro-F1 score only slightly above random guessing on the HOVER dataset, indicating the difficulty of relying solely on the parametric knowledge of LLMs for fact-checking complex claims. Similar to the observations in Section 4.1, we see a trend of improved performance as the number of required reasoning hops increases. Chain-of-thought prompting scores on average 2.7 points higher than direct prompting, highlighting the importance of step-by-step reasoning for complex fact-checking. It outperforms PROGRAMFC on HOVER 2-hop and FEVEROUS, but performs worse on HOVER 3-hop and 4-hop. This may be because CoT generates free-form explanations, which can lead to unpredictable errors in long reasoning chains. In contrast, our program generation-and-execution strategy proves more stable for longer reasoning chains.

Conclusion and Future Work
We proposed PROGRAMFC, a few-shot neuro-symbolic model for fact-checking that learns to map input claims to a reasoning program consisting of a sequence of sub-task function calls for answering a question, for fact-checking a simple claim, and for computing a logical expression. Fact-checking is then performed by executing that program. PROGRAMFC combines the advantages of symbolic programs, such as explainability, with the flexibility of end-to-end neural models. Using Codex as the program generator, PROGRAMFC demonstrates promising performance on HOVER and FEVEROUS with only a small number of in-context demonstrations and no additional training. We also investigated the impact of model size and the benefits of programs for retrieval, and we analyzed the errors. The results indicate that PROGRAMFC effectively balances model capability, learning efficiency, and interpretability.
In future work, we want to adapt PROGRAMFC to more real-world fact-checking scenarios, such as fake news detection and multi-modal fact-checking, with more advanced reasoning program design and sub-task functionalities.

Limitations
There are two main limitations of PROGRAMFC. First, despite being complex in their surface form, the claims in HOVER and FEVEROUS mostly require only explicit multi-step reasoning, i.e., the decomposition can be derived from the claim's syntactic structure or from how the claim is framed. This lowers the difficulty of generating reasoning programs. However, for many real-world complex claims, the reasoning is often implicit. For example, for the claim "Aristotle couldn't have used a laptop", the reasoning program is:

answer_1 = Question("When did Aristotle live?")
answer_2 = Question("When was the laptop invented?")
fact_1 = Verify("answer_1 is before answer_2.")
label = Predict(fact_1)

Generating reasoning programs for such implicit complex claims requires a deeper understanding of the claim and access to world and commonsense knowledge. We conducted preliminary experiments on these types of claims, but we found that our Codex-based generator struggled to produce a correct reasoning program. This highlights the gap in applying PROGRAMFC to fact-check real-world claims. Addressing these challenges is an important direction for future work.
Second, PROGRAMFC incurs a higher computational cost than baseline end-to-end fact-checking models. It requires calling large language models for program generation and further calling multiple sub-task models. This results in actual computation time ∼4-5× higher than for an end-to-end FLAN-T5 model. Developing more efficient methods for program generation and execution is an important direction for future work.

Ethics Statement
Biases. We note that there might be some biases in the data used to train the LLMs, as well as in factuality judgments. Both are beyond our control.
Intended Use and Misuse Potential. Our models can be of interest to the general public and could also save a lot of time to human fact-checkers. However, they could also be misused by malicious actors. We ask researchers to exercise caution.
Environmental Impact. The use of large language models requires a significant amount of energy for training, which contributes to global warming. Our work performs few-shot in-context learning instead of training models from scratch, so our energy footprint is smaller. Nevertheless, the large language model (Codex) whose API we use for inference consumes significant energy.

A Implementation Details about the Baselines
In this section, we give the implementation details for the seven baselines used in our work. Typical ways to perform few-shot fact-checking with large language models are fine-tuning and in-context learning. Thus, we categorize the baselines into three groups.

A.1 Pre-trained Models
Pre-trained models use pretrained Transformers (Vaswani et al., 2017) such as BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020) for fact-checking. For few-shot learning, we fine-tune them using 20 randomly sampled training examples from HOVER or FEVEROUS. We ran the training 10 times with different random seeds and report the average performance on the validation set. We chose two models:

• BERT-FC (Soleimani et al., 2020) uses BERT for claim verification.

• LisT5 (Jiang et al., 2021) uses T5 (Raffel et al., 2020) as its backbone. We adopt the "listwise concatenation" proposed in the paper for label prediction, which concatenates all candidate evidence sentences into a single input, and we train the t5-large model to directly classify the claim as Supported or Refuted. We use the original implementation of this model.

A.2 FC/NLI Fine-Tuned Models
These models are pre-trained Transformers that have been further fine-tuned on single-hop fact-checking datasets (e.g., FEVER) or on natural language inference (NLI) datasets. This additional training makes them strong at fact-checking simple claims, and thus they can generalize better to complex claims that require multi-hop reasoning during further few-shot fine-tuning.
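The transfer from NLI to fact-checking amounts to relabeling: the evidence is treated as the premise, the claim as the hypothesis, and entailment maps to Supported while contradiction maps to Refuted. A minimal sketch, assuming a three-way NLI classifier (the `stub` model below is a toy stand-in, not any of the actual baselines):

```python
# Map three-way NLI predictions onto fact-checking verdicts.
NLI_TO_VERDICT = {
    "entailment": "Supported",
    "contradiction": "Refuted",
    "neutral": "Not Enough Info",
}

def verdict_from_nli(nli_predict, evidence, claim):
    """nli_predict(premise, hypothesis) -> one of the three NLI labels."""
    return NLI_TO_VERDICT[nli_predict(evidence, claim)]

# Toy stand-in for a fine-tuned NLI model.
stub = lambda premise, hypothesis: (
    "entailment" if hypothesis in premise else "neutral"
)
```

Datasets with two-way labels simply drop or remap the neutral class.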

A.3 In-Context Learning Models
Large language models have recently shown strong few-shot learning ability on various NLP tasks: by prompting the model with a few in-context examples, it can quickly learn a task from demonstrations. For a fair comparison to our model, we choose the following two in-context learning baselines.
• Codex (Chen et al., 2021) is used in our model to generate reasoning programs. One straightforward baseline directly uses it for fact-checking.
To this end, we prompt Codex (code-davinci-002) as follows: "<Evidence> Based on the above information, is it true that <Claim>? True or False? The answer is:". We prefix the prompt with the same 20 in-context examples that we use for our model as demonstrations.
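Concretely, the baseline prompt can be assembled as below. This is a sketch: the blank-line separator between demonstrations is our assumption, and the demonstration strings are placeholders.

```python
def build_codex_prompt(demonstrations, evidence, claim):
    """Prefix the in-context demonstrations, then append the query in the
    '<Evidence> ... is it true that <Claim>?' template from the text."""
    query = (f"{evidence} Based on the above information, "
             f"is it true that {claim}? True or False? The answer is:")
    return "\n\n".join(demonstrations + [query])

# Placeholder demonstration in the same format, with its gold answer.
demo = ["E1 Based on the above information, is it true that C1? "
        "True or False? The answer is: True"]
prompt = build_codex_prompt(demo, "<Evidence>", "<Claim>")
```

The model then completes the string after "The answer is:", and the completion is parsed as the True/False verdict.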
• FLAN-T5 (Chung et al., 2022) is an improved version of T5, fine-tuned on 1.8K tasks phrased as instructions, with and without exemplars, i.e., zero-shot and few-shot. The model has shown strong performance on various in-context few-shot learning NLP tasks, such as reasoning and question answering. We prompt the model in the same format as in Section 3.4: "<Evidence> Q: Is it true that <Claim>? True or False? The answer is:", prefixed with the same 20 in-context examples. We also use the same model size (FLAN-T5-XL, 3B) as our model for a fair comparison.

B Examples of Generated Reasoning Programs
Figure 7 shows six examples of reasoning programs generated by PROGRAMFC that cover diverse reasoning chains. Figure 8 shows five examples of erroneous cases where the generated reasoning programs are incorrect. We explain each of the error cases below.

C Error Analysis for Reasoning Programs
Example 1 It generates a wrong logical reasoning operator for the final step. The correct logic should be "not (fact_1 and fact_2)" instead of "fact_1 and fact_2".
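The difference between the generated final step and the intended one can be checked directly with stub truth values (a sketch with hard-coded facts, not the actual executor):

```python
def predict(label: bool) -> str:
    """Map the aggregated boolean onto a verdict string."""
    return "Supported" if label else "Refuted"

# Suppose the retrieved evidence refutes the second sub-claim.
fact_1, fact_2 = True, False

wrong = predict(fact_1 and fact_2)        # generated program
right = predict(not (fact_1 and fact_2))  # intended logic
```

The two programs disagree whenever at least one sub-claim is refuted, so the missing negation flips the final verdict.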
Example 2 It fails to perform co-reference resolution for the arguments in the third and fourth reasoning steps. "This album" should be replaced with "the bluegrass album" to make the sub-task context-independent, and "This musical" should be replaced with the variable "answer_1" from the first step.
Example 3 It fails to create a meaningful problem decomposition for the claim. It generates a trivial program that simply repeats the original claim.
Example 4 It fails to generate a fine-grained reasoning structure for the input claim. It also generates a trivial program that simply separates the claim into sentences.
Example 5 It generates a redundant reasoning step "Question("When was the musician born?")", which does not add any new information to the reasoning chain.

D Program Generation Prompts
Our manually written prompts for the HOVER and the FEVEROUS-S datasets are given in Listings 1 and 2, respectively.
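The programs in these prompts compose three sub-task calls: Question, Verify, and Predict. As a toy illustration, the claim from the introduction can be executed against hard-coded stand-ins (the real system delegates these calls to retrieval-backed QA and fact-verification models, not to the lookup tables below):

```python
# Hard-coded stand-ins for the sub-task handlers.
KB = {"Who directed the film Interstellar?": "Christopher Nolan"}
TRUE_FACTS = {"James Cameron was born in Canada."}

def Question(q):
    """QA sub-task: answer a natural-language question."""
    return KB.get(q, "unknown")

def Verify(statement):
    """Verification sub-task: check a simple claim against evidence."""
    return statement in TRUE_FACTS

def Predict(label):
    """Aggregation sub-task: map the final boolean to a verdict."""
    return "Supported" if label else "Refuted"

# Program for: "Both James Cameron and the director of the film
# Interstellar were born in Canada."
def program():
    fact_1 = Verify("James Cameron was born in Canada.")
    answer_1 = Question("Who directed the film Interstellar?")
    fact_2 = Verify(f"{answer_1} was born in Canada.")
    return Predict(fact_1 and fact_2)
```

Note how the variable `answer_1` threads the QA result into the second verification step, which is exactly the co-reference handling that fails in Example 2 above.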

E Prompts for Closed-Book Fact-Checking
Below we show the templates for the four prompting methods used with InstructGPT in the closed-book fact-checking setting of Section 4.4.

Direct Prompting
# Answer the following true/false questions:
Is it true that The woman the story behind Girl Crazy is credited to is older than Ted Kotcheff?
The answer is: False
Is it true that <input_claim>?
The answer is:

ZS-CoT Prompting
# Answer the following true/false question:
Is it true that <input_claim>? True or False?
Let us think step-by-step. The answer is:

CoT Prompting
# Answer the following true/false questions:
Is it true that The woman the story behind Girl Crazy is credited to is older than Ted Kotcheff?
Let's think step by step. Girl Crazy's story is credited to Hampton Del Ruth. Hampton Del Ruth was born on September 7, 1879. Ted Kotcheff was born on April 7, 1931.
Therefore, the answer is: False.

Self-Ask Prompting
# Answer the following true/false questions:
Is it true that The woman the story behind Girl Crazy is credited to is older than Ted Kotcheff?
Q: The story behind Girl Crazy is credited to whom?
A: Hampton Del Ruth
Q: Is Hampton Del Ruth older than Ted Kotcheff?
A: No
So the final answer is: False.
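For reference, the four closed-book templates can be collected into a single rendering helper. This is a sketch: exact whitespace and the demonstration-prefixing convention are our assumptions, abbreviated from the full templates above.

```python
# Query templates for the four closed-book prompting methods.
CLOSED_BOOK_TEMPLATES = {
    "direct": "Is it true that {claim}? The answer is:",
    "zs_cot": ("Is it true that {claim}? True or False? "
               "Let us think step-by-step. The answer is:"),
    "cot": "Is it true that {claim}? Let's think step by step.",
    "self_ask": "Is it true that {claim}?",
}

def render(method, claim, demonstrations=""):
    """Fill in the claim; optionally prefix worked demonstrations
    (used by Direct, CoT, and Self-Ask, but not by ZS-CoT)."""
    body = CLOSED_BOOK_TEMPLATES[method].format(claim=claim)
    return (demonstrations + "\n" + body) if demonstrations else body
```

The key contrast is that ZS-CoT relies only on the "Let us think step-by-step" trigger, while CoT and Self-Ask rely on worked demonstrations to elicit intermediate reasoning.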

The six example claims shown in Figure 7:
• The country that Fujairah College is located in had a 2013 population of 9.2 million until it was hit by the plague in 1483 when the population was halved.
• The first female board member for the Liberal Party, she was born in Vestfold county in Norway.
• The solicitor who won the show Back to Reality ahead of Maureen Rees and Craig Phillips is English. The solicitor that was a chair of Global Witness is also English.
• The critically acclaimed film, that Buddy Baker scored in 1975, is a Walt Disney film. It was produced first before the film that featured Bruce M. Fischer as Mr. Coogar.
• Tritonia and Phyteuma are both names for a plant genus.
• Anthony Burgess addressed the novelist and essayist, the author of Grimus, in a lengthy love letter. The author is of the same nationality as Raj Koothrappali.

Claims from the error cases in Figure 8:

Example 2: The record producer that produced the bluegrass album was born on 22 June, 1944. This album inspired a Tony award winning musical. This musical had a character that was originated by Carmen Cusack.

Example 4: The film Deanna Oliver produced in 1999 grossed $36.8 million domestically. The musical film based on coach Herman Boone, did not.

Example 5: The musician, who founded Morningwood with Max Green, is older than Max Green.

# The claim is that Before the first Europeans arrived or copra companies leased it, Maupihaa was home to Inca's in ancient times.
def program():
    fact_1 = Verify("Maupihaa was home to Inca's in ancient times.")
    fact_2 = Verify("Maupihaa was home to Inca's before the first Europeans arrived or copra companies leased it.")
    label = Predict(fact_1 and fact_2)

# The claim is that Shulin, a 33.1288 km2 (12.7911 sq mi) land located in New Taipei
Listing 2: The prompt used for Program Generation for FEVEROUS-S.