Faithful Chain-of-Thought Reasoning

While Chain-of-Thought (CoT) prompting boosts Language Models' (LM) performance on a gamut of complex reasoning tasks, the generated reasoning chain does not necessarily reflect how the model arrives at the answer (a.k.a. faithfulness). We propose Faithful CoT, a reasoning framework involving two stages: Translation (Natural Language query $\rightarrow$ symbolic reasoning chain) and Problem Solving (reasoning chain $\rightarrow$ answer), using an LM and a deterministic solver, respectively. This guarantees that the reasoning chain provides a faithful explanation of the final answer. Aside from interpretability, Faithful CoT also improves empirical performance: it outperforms standard CoT on 9 of 10 benchmarks from 4 diverse domains, with a relative accuracy gain of 6.3% on Math Word Problems (MWP), 3.4% on Planning, 5.5% on Multi-hop Question Answering (QA), and 21.4% on Relational Inference. Furthermore, with GPT-4 and Codex, it sets the new state-of-the-art few-shot performance on 7 datasets (with 95.0+ accuracy on 6 of them), showing a strong synergy between faithfulness and accuracy.


Introduction
Complex reasoning tasks, such as commonsense reasoning and math reasoning, have long been the Achilles heel of LMs (Bengio, 2019), until a recent line of work on Chain-of-Thought (CoT) reasoning (Wei et al., 2022; Wang et al., 2022, i.a.) brought striking performance gains. CoT prompts an LM to generate a reasoning chain along with the answer, given only a few in-context exemplars.
Besides performance improvement, CoT is also claimed to "provide an interpretable window into the behavior of the model" (Wei et al., 2022). However, it is unclear how accurately these reasoning chains reflect the underlying reasoning process behind the model's prediction, namely, how faithful they are as explanations (Jacovi and Goldberg, 2020, i.a.). In standard CoT, faithfulness is not guaranteed and is even systematically violated (Turpin et al., 2023), as the final answer does not necessarily follow from the generated reasoning chain. In other words, CoT can "lie" about the model's true reasoning process. Figure 1 exemplifies such an unfaithful CoT generation from Wei et al. (2022) on GSM8K: the answer "0" is not even mentioned in the reasoning chain. This, along with more examples in Appendix B.1, illustrates that standard CoT does not provide interpretability of how the model predicts the answer. The lack of faithfulness in CoT can be dangerous in high-stakes applications because it may mislead people into believing that the model is self-interpretable, while there is no actual causal relationship between the reasoning chain and the answer. Even worse, when an unfaithful explanation looks plausible (i.e., convincing to humans) (Jacovi and Goldberg, 2020), it becomes easier for people (e.g., legal practitioners) to over-trust the model (e.g., a recidivism predictor) even if it has implicit biases (e.g., against racial minorities) (Pruthi et al., 2020; Slack et al., 2020).

Figure 1: An example of unfaithful output from CoT prompting (Wei et al., 2022) on GSM8K. The answer (green) does not follow from the reasoning chain (blue).
Input: Q: John plans to sell all his toys and use the money to buy video games. He has 13 lego sets and he sells them for $15 each. He ends up buying 8 video games for $20 each and has $5 left. How many lego sets does he still have?
CoT output: A: He started with 13 lego sets. He sold them for $15 each. So he must have gotten 13 * $15 = $195 from selling them. He used this money to buy 8 video games for $20 each. This means he spent 8 * $20 = $160 on the video games. He had $5 left, so in total he must have had $195 + $5 = $200. The answer is 0.
To address this concern, we propose Faithful CoT, a reasoning framework where the answer is the result of deterministically executing the reasoning chain. Specifically, we break down a complex reasoning task into two stages: Translation and Problem Solving (Figure 2). During Translation, an LM translates a query into a reasoning chain, which interleaves NL and Symbolic Language (SL). The NL component decomposes the original query into multiple simpler, interdependent subproblems. Then, each subproblem is tackled in a task-dependent SL, such as Python, Datalog, or Planning Domain Definition Language (PDDL). In the Problem Solving stage, the reasoning chain is executed by a deterministic solver, e.g., a Python/Datalog interpreter or a PDDL planner, to derive the answer.
Our reasoning chain (the outcome of Translation) is guaranteed to provide a faithful explanation of how the final answer is produced (the outcome of Problem Solving), therefore making our method more interpretable than standard CoT methods. While interpretability is not the same as correctness (i.e., our method can reveal the reasoning process behind both correct and wrong answers), we find that it does empirically improve correctness: when evaluated on 10 reasoning datasets from 4 diverse domains (MWP, Planning, Multi-hop QA, and Relational Inference), Faithful CoT brings consistent performance gains over three existing baselines, across different LMs and decoding strategies. With Codex, our approach outperforms vanilla CoT on 9 of the 10 datasets, with a relative accuracy gain of 6.3% on MWP, 3.4% on Planning, 5.5% on Multi-hop QA, and 21.4% on Relational Inference. With GPT-4, our method sets the new SOTA few-shot performance on 7 datasets, with 95.0+ accuracy on 6 of them. This suggests that interpretability does not have to come at the cost of performance; instead, there exists a strong synergy between the two.
Our key contributions are as follows: (a) We propose Faithful CoT, a framework that decomposes reasoning into Translation and Problem Solving. The reasoning chain interleaves user-understandable natural language comments and executable symbolic language programs, thus providing faithful interpretability of how the model arrives at the answer. (b) Our approach is generalizable to multiple domains beyond arithmetic reasoning and simple symbolic reasoning, thanks to its flexible integration with any choice of SL and external solver. We set the new SOTA performance on 7 out of the 10 reasoning datasets, showing a strong synergy between faithfulness and accuracy. (c) We provide an extensive analysis of the strengths and weaknesses of our method, showing its generalizability across LMs, robustness to the choice of exemplars and prompt phrasing, the pivotal role of the solver, the plausibility of generated reasoning chains, as well as frequent error patterns where it still struggles.

Related Work
Faithfulness.
In interpretability, faithfulness (also called fidelity or reliability) means that an explanation should "accurately represent the reasoning process behind the model's prediction", which is a fundamental requirement of an explanation (Harrington et al., 1985; Ribeiro et al., 2016; Gilpin et al., 2018; Jacovi and Goldberg, 2020). It should be contrasted with plausibility (a.k.a. persuasiveness or understandability), which refers to "how convincing an explanation is to humans" (Herman, 2019; Jacovi and Goldberg, 2020). In the context of CoT prompting, a faithful reasoning chain needs to accurately reflect how the model arrives at the final answer, whereas a plausible reasoning chain is one that looks reasonable and coherent to humans. Standard CoT (Wei et al., 2022) generates the reasoning chain in pure NL, which may often look plausible; nevertheless, the final answer does not need to causally follow from the reasoning chain, so faithfulness is not guaranteed.

Chain-of-Thought-style prompting.
In CoT-style prompting, given a complex question Q, an LM is prompted to generate a reasoning chain C along with the final answer A. Specifically, the prompt consists of a few examples of (Q, C, A) triples, called in-context exemplars. This allows pre-trained LMs (e.g., GPT-3 (Brown et al., 2020)) to solve unseen questions with much higher accuracy than standard prompting, where the exemplars do not contain the reasoning chain C.
We create a taxonomy of existing CoT-style prompting methods into three types: all-at-once, ensemble-based, and modularized. All-at-once prompting means that the LM produces C and A as one continuous string, without any dependencies or constraints in between. Scratchpad (Nye et al., 2021), standard CoT (Wei et al., 2022), and "Let's think step by step" (Kojima et al., 2022) are all examples of this kind. Ensemble-based prompting is designed to overcome the local optimality issue of the one-shot generation in previous methods by sampling multiple (C, A) pairs and choosing the best answer via strategies like majority voting. Examples include Self-Consistent CoT (Wang et al., 2022), Minerva (Lewkowycz et al., 2022), and DIVERSE (Li et al., 2022), which differ mainly in the voting granularity and the underlying LM. Modularized methods break down Q into subproblems and then conquer them individually (Jung et al., 2022; Qian et al., 2022, i.a.). In particular, Least-to-Most prompting (Zhou et al., 2022) has a question decomposition process similar to ours, but there is still no faithfulness guarantee, since the reasoning chain is entirely in NL.
Concurrent with our work, Chen et al. (2022) and Gao et al. (2022) both generate Python programs (i.e., SL-only reasoning chains) to derive the answer. We want to highlight the following qualitative differences: (a) In terms of motivation, our approach is interpretability-driven, whereas theirs are performance-driven. (b) Our reasoning chain involves a structured decomposition of the problem in NL, allowing users without a programming background to better understand and potentially interact with the system. (c) They only use Python as the SL and only tackle math and simple symbolic reasoning tasks, whereas we demonstrate the generalizability of our approach to multiple symbolic languages and various other domains. In particular, we innovatively recast a diverse set of realistic tasks (Planning, Multi-hop QA, and Relational Inference) into a symbolic representation, which allows us to tackle them with a single framework. (d) We perform a more comprehensive analysis compared to previous work, especially a human evaluation of the reasoning chain correctness.

Method
Our method, Faithful CoT, is a 2-stage pipeline, as seen in Figure 2. Like previous CoT-style work, our prompt consists of (Q, C, A) triples. Notable differences lie in our unique interleaving of NL (natural language) and SL (symbolic language) in C, as well as the way we derive the final answer A.
In the Translation stage, given a complex query Q in NL, we prompt an LM to translate it into a reasoning chain C, which interleaves NL comments and SL programs. The NL component decomposes the original query into multiple simpler, interdependent subproblems. Then, each subproblem is tackled in a task-dependent SL, such as Python, Datalog, or PDDL. In the Problem Solving stage, we call a deterministic external solver, e.g., a Python interpreter, a Datalog executor, or a PDDL planner, depending on the task, to obtain the answer A from the reasoning chain C. As shown in Figure 3, we define C_NL to be the NL component (black) and C_SL to be the SL component (blue) in C. Though we separate the two components notationally, they are interleaved in the generation. Using this approach, C is guaranteed to be a faithful model explanation, since our final A is the result of deterministically executing C_SL. Moreover, C_NL allows the user to better understand the reasoning process. (While no constraints are enforced between C_NL and C_SL in our main experiments, we analyze the effect of adding such constraints in Appendix C.4.) We apply this method to 4 types of complex reasoning tasks: MWP, Multi-hop QA, Planning, and Relational Inference. Next, we will illustrate how our method works for each of them, with examples from Figure 3.
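As an illustration, the end-to-end pipeline can be sketched in Python as follows. The `lm_complete` wrapper and the `solver` callable are schematic placeholders for an LM API call and a task-specific solver, not our released implementation:

from typing import Callable

def faithful_cot(query: str,
                 few_shot_prompt: str,
                 lm_complete: Callable[[str], str],
                 solver: Callable[[str], str]) -> str:
    # Stage 1 (Translation): the LM maps the NL query to a reasoning chain C
    # that interleaves NL comments with SL code (Python, Datalog, or PDDL).
    reasoning_chain = lm_complete(f"{few_shot_prompt}\n\nQ: {query}\nA:")
    # Stage 2 (Problem Solving): a deterministic solver (Python interpreter,
    # Datalog executor, or PDDL planner) executes C, so C is a faithful
    # explanation of the final answer by construction.
    return solver(reasoning_chain)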

Math Word Problems (MWP)
Given a grade-school math question Q written in NL ("If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?", shown in green in Figure 3), we want to obtain A as a real-valued number (5).

Figure 3 excerpt (MWP): the reasoning chain interleaves NL comments and Python code.
# 1. How many cars are there in the beginning? (independent, support: ["there are 3 cars in the parking lot"])
n_cars_begin = 3
# 2. How many cars arrive? (independent, support: ["2 more cars arrive"])
n_cars_arrive = 2
# 3. Final answer: How many cars are in the parking lot? (depends on 1, 2)
n_cars_total = n_cars_begin + n_cars_arrive

Figure 3 excerpt (Multi-hop QA): the reasoning chain interleaves NL comments and Datalog code.
// 1. What is the density of an apple? // The density of an apple is about 0.75 g/cm^3.
// 2. What is the density of water? // Water has a density of 1 g/cm^3.
// Then, we represent these answers in Datalog:
// 1. The density of an apple is about 0.75 g/cm^3.
.decl Has_density(Object:symbol, Density:float)
Has_density("apple", 0.75).
// 2. Water has a density of 1 g/cm^3.
Has_density("water", 1).
// Now, we derive the final answer: Would an apple sink in water?
// The answer is Yes only if an apple is more dense than water.
.decl Answer()
Answer() :- Has_density("apple", density1), Has_density("water", density2), density1 > density2.

In the Translation stage, we prompt the LM to take in Q and generate a reasoning chain C, which interleaves C_NL and C_SL. Specifically, the C_NL component consists of three types of information: (a) Subquestions: Q is broken down into multiple smaller-scale subquestions, e.g., "1. How many cars are there in the beginning?", "2. How many cars arrive?", and "3. How many cars are in the parking lot?".

(b) Dependency Graph: Each subquestion can either be answered directly via context (subquestions 1 and 2 are "independent") or rely on answers to previous subquestions (subquestion 3 "depends on 1 and 2").
(c) Rationales: Each subquestion is accompanied by rationale(s) to support the answer (the "support" field). The rationales can be either a subset of the original context ("2 more cars arrive") or any external knowledge ("there are 7 days in a week") relevant to the subquestion.
Each subquestion, along with its corresponding dependencies and rationales, informs the subsequent generation of C_SL. In our example in Figure 3, C_SL consists of Python code generated to answer each subquestion in C_NL. During the Problem Solving stage, we execute C_SL using our solver, a Python interpreter, to derive A (5 cars in the end).
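As an illustration, the Problem Solving step for MWP can be sketched as below, under the simplifying assumption (consistent with the Figure 3 exemplar) that the last assignment in C_SL holds the final answer; this is a schematic sketch, not our exact implementation:

import ast

def solve_mwp(reasoning_chain: str):
    # Find the target of the last top-level assignment (e.g., n_cars_total),
    # assumed here to hold the final answer.
    tree = ast.parse(reasoning_chain)
    last_assign = [n for n in tree.body if isinstance(n, ast.Assign)][-1]
    answer_var = last_assign.targets[0].id
    # Deterministically execute C_SL; the NL comments are ignored by Python.
    namespace = {}
    exec(reasoning_chain, namespace)
    return namespace[answer_var]

chain = '''
# 1. How many cars are there in the beginning? (independent)
n_cars_begin = 3
# 2. How many cars arrive? (independent)
n_cars_arrive = 2
# 3. Final answer: How many cars are in the parking lot? (depends on 1, 2)
n_cars_total = n_cars_begin + n_cars_arrive
'''
print(solve_mwp(chain))  # -> 5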

Multi-hop QA
Given a complex question Q that involves multiple steps of reasoning (e.g., "Would a pear sink in water?", shown in red in Figure 3), we want to obtain the answer A as a Boolean value or string value variable. Similar to our MWP task formulation, C interleaves C_NL (NL comments) and C_SL (symbolic program). Depending on the nature of the task, the format of the reasoning chain C is slightly different: for some datasets, the LM first generates all subquestions and their answers in NL, and then represents these answers in SL to derive A (see Figure 3); for others, the LM interleaves the NL subquestions and the SL program, similar to the case of MWP (see Table 14 and Table 15 for examples). In terms of SL, we use both Python and Datalog, also depending on the dataset. As Multi-hop QA problems involve multi-step reasoning, C_SL often uses Boolean algebra and string comparisons (in Python) along with relation definitions and logic programming (in Datalog). We use the corresponding interpreter as our deterministic solver to execute C_SL and obtain A.
In the example from Figure 3, the LM first generates the subquestions, "1. What is the density of a pear?" and "2. What is the density of water?", which are individually answered in NL. The answers ("Water has a density of 1 g/cm^3") are converted to Datalog statements (Has_density("water", 1)), which are then combined to formalize the truth condition of the final answer. Finally, we execute the Datalog program to determine that a pear would not sink in water.
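To make the semantics of the Datalog rule concrete, its truth condition can be paraphrased in plain Python as below (for illustration only, with an assumed density value; the actual Problem Solving stage runs a Datalog executor on C_SL):

# Fact base extracted from the NL answers (the pear density is illustrative).
has_density = {"pear": 0.6, "water": 1.0}

# Answer() :- Has_density("pear", d1), Has_density("water", d2), d1 > d2.
# i.e., the answer is Yes only if a pear is more dense than water.
answer = has_density["pear"] > has_density["water"]
print("Yes" if answer else "No")  # -> "No": a pear would not sink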

Planning
In a user-robot interaction scenario, given a household task query Q from a user, we want to come up with a plan of actions A that the robot should take in order to accomplish the task. For example, in Figure 3, given the user query "I spilled my coke on the table, could you throw it away and bring something to clean with?", a possible plan is "find(coke), pick(coke), find(trash), put(coke) ...". In the Translation stage, an LM translates Q into C, consisting of C_NL (which breaks down Q into subtasks) and C_SL (which represents the subtasks as a symbolic goal in PDDL, a language for defining and solving classical planning problems). Figure 3 shows this translation, with C_SL in blue and C_NL in black. Finally, we call a PDDL planner as the deterministic solver to obtain A, a plan that accomplishes the goal C_SL under the predefined scenario.
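Schematically, the Problem Solving stage can wrap an off-the-shelf PDDL planner as below. The goal, object list, and file names are illustrative, and we assume the open-source Fast Downward planner as one concrete choice of solver; our framework is not tied to this particular planner:

import subprocess

# Illustrative C_SL: a symbolic goal extracted from the reasoning chain.
goal_csl = "(:goal (and (at coke trash) (at sponge user)))"

problem = f"""(define (problem clean-spill)
  (:domain kitchen)
  (:objects coke sponge - object table counter trash user - location)
  (:init (at coke table) (at sponge counter))
  {goal_csl})"""

with open("problem.pddl", "w") as f:
    f.write(problem)

# The planner searches for a plan A that satisfies the goal;
# "domain.pddl" would encode the predefined kitchen scenario.
subprocess.run(["fast-downward.py", "domain.pddl", "problem.pddl",
                "--search", "astar(lmcut())"], check=True)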

Relational Inference
Given a relational inference problem Q written in NL, we want to obtain A as a string-valued variable. For example, the CLUTRR (Sinha et al., 2019) dataset involves inferring the family relationship (e.g., "grandson") between two people from a short story (e.g., "[Gabrielle] drove her daughter [Dorothy] to the hospital. [Dorothy]'s son [Vincent] showed up shortly after. How is [Vincent] related to [Gabrielle]?", shown in yellow in Figure 3). During the Translation stage, we prompt the LM to generate C, consisting of C_NL and C_SL. Similar to previous tasks, C_NL breaks down Q into subquestions ("How is [Vincent] related to [Dorothy]?" and "How is [Dorothy] related to [Gabrielle]?"), and provides input extracts as rationales to support the answer ("[Dorothy]'s son [Vincent] showed up shortly after", etc.). Each subquestion in C_NL is answered in C_SL via a relational expression representing the relation between the mentioned entities; for example, relation(Vincent, Dorothy) = son denotes that Vincent is Dorothy's son. In the Problem Solving stage, our solver is a simple relational inference engine that relies on a set of transitivity rules over possible family relationships provided by Zhang et al. (2022), e.g., son@daughter = grandson (the son of one's daughter is one's grandson). Our solver recursively applies these rules to C_SL to derive A, determining that Vincent is Gabrielle's grandson.
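As an illustration, the core of such a solver can be sketched in a few lines; the two transitivity rules below are examples in the style of Zhang et al. (2022), and the data structures are schematic:

# Transitivity rules: (relation to B, B's relation to C) -> relation to C.
RULES = {
    ("son", "daughter"): "grandson",   # the son of one's daughter is one's grandson
    ("daughter", "daughter"): "granddaughter",
}

def infer(chain):
    """chain: [(A, rel, B), (B, rel, C), ...]; returns A's relation to the last entity."""
    relation = chain[0][1]
    for _, next_rel, _ in chain[1:]:
        relation = RULES[(relation, next_rel)]  # recursively compose relations
    return relation

# relation(Vincent, Dorothy) = son; relation(Dorothy, Gabrielle) = daughter
print(infer([("Vincent", "son", "Dorothy"), ("Dorothy", "daughter", "Gabrielle")]))
# -> "grandson": Vincent is Gabrielle's grandson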

Datasets
Here, we summarize the evaluation datasets used for each domain. We select the same number (6 to 10, depending on the task) of exemplars as in Wei et al. (2022) to form our few-shot prompt, which can be found in our repository. Unless otherwise stated, we use the official splits: training set for exemplar selection, validation set for prompt tuning, and test set for evaluation.

Math Word Problems (MWP). We follow Wei et al. (2022) and consider the same five MWP benchmarks: GSM8K (Cobbe et al., 2021), SVAMP (Patel et al., 2021), MultiArith (Roy and Roth, 2015), ASDiv (Miao et al., 2020), and AQuA (Ling et al., 2017). For all datasets, the input question is phrased in NL. The answer is a string-valued mathematical expression for AQuA, and one or more integer(s) for all other datasets. We use the same 8-shot prompt for all datasets except AQuA.

Multi-hop QA. We consider three datasets: StrategyQA (Geva et al., 2021), a dataset of open-domain questions that require an implicit multi-step strategy to answer, e.g., "Did Aristotle use a laptop?" involves answering "1. When did Aristotle live?", "2. When was the laptop invented?", and "3. Is #2 before #1?"; Date Understanding from BIG-bench (BIG-Bench collaboration, 2021), which asks the model to infer a date from a context by performing computation on relative periods of time; and finally, Sports Understanding from BIG-bench, which asks the model to decide whether an artificially constructed statement related to sports is plausible or implausible. Since the latter two datasets do not have a training set, we follow Wei et al. (2022) and select 10 examples from the test set to form the prompt, using the rest for evaluation.

Figure 4: A sample output for a math question from three baselines and Faithful CoT (our method). The ground-truth answer is 280, and only our method correctly computes the answer. (Standard Prompting: "The answer is 4 hours and 20 minutes."; Least-to-Most Prompting (Zhou et al., 2023): "To answer the question, we need to know: Q1. How many minutes does Rob take? Rob takes 2 hours. Q2. How many minutes does Royce take? Royce takes 40 minutes more than double Rob. ... The answer is 84."; Faithful CoT, executed by a Python interpreter: 280.)

Planning. We use the SayCan dataset (Ahn et al., 2022), which assumes a scenario of a robot operating in a kitchen, helping the user with household tasks, e.g., "bring a coke to the table". There are a number of locations and objects that the robot can interact with. The robot can only perform a fixed set of actions, including find, pick, and put. The task is to map a user query in NL to a plan of predefined actions. Following Wei et al. (2022), we manually write 7 exemplars, since no training set is provided.

Relational inference. We use the CLUTRR (Sinha et al., 2019) benchmark described in Section 3.4. The dataset has multiple splits based on the number of intermediate steps K required to reach the answer. We construct the prompt using 8 exemplars with K ∈ {2, 3}, and test the models on the remaining examples with K up to 10.

Evaluation Metrics
We evaluate the model performance with final answer accuracy as the main metric. Following previous work (Wei et al., 2022; Wang et al., 2022; Chen et al., 2022), for all MWP datasets (except AQuA) where the answer contains integer(s), a correct answer is defined as the exact match between the prediction and the ground truth, both converted to the nearest integer; for StrategyQA and Sports Understanding, where the answer is a Boolean value, it is defined as the exact match between the prediction and the ground truth, both evaluated as a Boolean variable; for SayCan, the generated plan is considered correct if it is among the ground truth plans; for all other datasets, we rely on the exact match between the prediction string and the ground truth string. Additionally, we evaluate the human-rated plausibility of the reasoning chain in Appendix D.
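For concreteness, the matching criterion can be sketched as follows (a simplified illustration of the logic above, not our exact evaluation script):

def is_correct(pred, gold, answer_type: str) -> bool:
    if answer_type == "integer":   # e.g., GSM8K: compare as nearest integers
        return round(float(pred)) == round(float(gold))
    if answer_type == "boolean":   # e.g., StrategyQA: compare as Booleans
        return str(pred).strip().lower() == str(gold).strip().lower()
    # otherwise: exact match between prediction and ground-truth strings
    return str(pred).strip() == str(gold).strip()

assert is_correct("5.0", 5, "integer")
assert is_correct("True", "true", "boolean")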

Baselines
We compare our method to three other few-shot prompting baselines, shown in Figure 4. All prompting methods are compared under two decoding strategies: greedy decoding, where the LM samples the most probable next token from the vocabulary (i.e., temperature = 0.0); and self-consistency decoding (Wang et al., 2022), where the LM generates multiple reasoning chains and chooses the final answer based on majority voting over the evaluated answers (we use a temperature of 0.4 and 40 generations for all datasets). We reproduce the baseline results ourselves in cases where they are not reported on certain tasks or where we clean the test set.
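Schematically, self-consistency decoding on top of our framework can be sketched as follows, where `lm_sample` and `execute` are placeholders for the sampling LM call and the deterministic solver:

from collections import Counter

def self_consistent_answer(prompt, lm_sample, execute, n=40, temperature=0.4):
    answers = []
    for _ in range(n):
        chain = lm_sample(prompt, temperature=temperature)
        try:
            answers.append(execute(chain))   # answer derived by the solver
        except Exception:
            continue                         # skip chains that fail to execute
    return Counter(answers).most_common(1)[0][0]  # majority vote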

LMs
We use OpenAI Codex (Chen et al., 2021) (code-davinci-002) in Section 5 and experiment with four other code-generation LMs in Appendix C.3.

Results
Our results on all datasets are shown in Table 1.
On the other hand, we do not see clear empirical gains on two Multi-hop QA datasets, Sports Understanding and StrategyQA. On Sports Understanding, Faithful CoT and LtM both have near-perfect accuracy (99+), suggesting that the dataset is almost saturated. On StrategyQA, however, the performance of our method still lags clearly behind the baselines. To understand why, we specifically compare the examples where CoT makes a correct prediction but our method fails. As shown in Figure 11 in Appendix F, we find that the likely primary cause is the sparsity of Datalog in the pretraining data for Codex, as the largest share of errors (29%) are syntax-related. Moreover, including Datalog in the prompt also interferes with NL generation, making it harder for Codex to produce relevant subquestions (17%), retrieve knowledge correctly (10%), and come up with valid reasoning from the knowledge to the answer (10%). Another potential cause is the nature of the task, as the difficulty of many StrategyQA questions lies not in reasoning but in knowledge retrieval, which makes the advantages of our deterministic solver less obvious. Still, with further pretraining on Datalog, we believe that there is room for improvement.
To see how generalizable our method is, we also experiment with four alternative LMs and observe consistent gains brought by Faithful CoT over the baselines, as shown in Appendix C.3. In particular, with GPT-4, we set new few-shot SOTA results on 7 datasets, achieving 95.0+ accuracy on four out of five MWP datasets and two out of three Multi-hop QA datasets. Overall, these results suggest that faithfulness does empirically improve performance.

Analysis
In this section, we perform an extensive analysis of the strengths and weaknesses of our method, to better understand the role of different components, the robustness to design choices, the plausibility of generated reasoning chains, as well as frequent error patterns where it still struggles. Here, we only show the first two aspects; see the rest in Appendix C. Unless otherwise stated, we choose one dataset from each domain (GSM8K, Date Understanding, SayCan, and CLUTRR) and use code-davinci-002 outputs with greedy decoding.

Ablation Study
Given the strong performance of Faithful CoT, we now address a natural question: how much does each part of the prompt contribute to the accuracy? We perform an ablation study where we remove different parts of the prompt and see how the performance changes. In addition to the original prompt ("Full"), we test four variations, illustrated with the example from Figure 4:
No rationale. We remove the rationales, i.e., everything in the brackets in the NL comments, e.g., "independent, support: ['There are 15 trees']".
No NL but nudge. We remove all NL comments except the "nudge" line, e.g., "# To answer this question, we write a Python program to answer the following subquestions".
No NL. We remove all NL comments.
No solver. Instead of calling the external solver, we add "Answer: {answer}" to the end of every exemplar and let the LM predict the answer itself.
Figure 5 shows the results of all prompt variations. On GSM8K, Date Understanding, and SayCan, NL comments contribute little to the performance, and sometimes even slightly hurt it. On CLUTRR, however, their role is crucial, since the exclusion of each component (rationale, nudge, subquestions) results in a clear accuracy drop. In particular, comparing No NL but nudge and No NL, the nudge line alone brings a striking improvement of 31.3 points.
The external solver relieves the LM of the burden of problem solving. Without it, accuracy suffers a huge decline on GSM8K, Date Understanding, and CLUTRR (-50.8, -22.9, and -19.4 respectively), while on SayCan it actually improves by 2.9. One potential influencing factor is that SayCan might be too homogeneous, as it contains a set of only 3 predefined actions. This can make the task relatively easy, which allows all model variants to achieve around 90% accuracy and renders the solver unnecessary. Another potential reason is the level of correspondence between the final answer and the reasoning chain for different datasets: as shown in Figure 3, the answer in SayCan is a sequence of actions (e.g., find(redbull)), each directly corresponding to one step in the reasoning chain (e.g., at redbull trash). However, the answer in the other three datasets is only a single number or string, which can only be derived after executing all the steps in the reasoning chain. The latter type of task therefore further necessitates the presence of an external solver.

Robustness to Exemplars
We now answer the next question: how much does the choice of exemplars matter? To answer this, we annotate 20 examples in total, randomly sample k (7-10, depending on the dataset) to construct the prompt, and repeat the process five times. Table 2 shows the performance of all six runs, including the original (from Table 1). The mean accuracy is close to the original (-1.5 to +1.2) and still above the baselines by a large margin (7 to 17) on all datasets except the arguably easiest, SayCan, considering the standard deviation (1.3 to 2.9). This strongly suggests that the benefits of Faithful CoT are minimally influenced by the choice of exemplars.

Human Evaluation of Plausibility
Our main experiments use final answer accuracy as the performance measure, but this does not necessarily correspond to the validity of the reasoning chain. Technically, a model can sometimes accidentally arrive at the correct answer with an invalid reasoning chain. We then ask: when the answer is correct, how often is the reasoning chain truly correct? In other words, we want to evaluate the plausibility of the reasoning chains.
We conduct a human evaluation study on Prolific: given a generated reasoning chain that results in a correct answer, a crowd worker selects whether it is A) completely correct or, if incorrect, specifies why with B) incorrect NL and/or C) incorrect SL. Alternatively, they can select D) flawed question or E) I am confused. (See Appendix D for more details on the human study.) The results of our study are shown in Figure 6 (see Table 8 in the Appendix for numerical results). For most domains, we see that annotators often find the reasoning chain fully correct: Sports, SayCan, and MWP have a 90%+ correctness rate. To gain more insight into when the reasoning chain can be "incorrect", we perform an in-depth analysis of user annotations for the three worst-performing datasets: StrategyQA (66.7% correctness), Date (87.9% correctness), and CLUTRR (88.0% correctness). From our inspection, we find that annotators mistakenly mark a correct reasoning chain as incorrect at different rates depending on the task (8.3% of the time for StrategyQA, 41.7% for Date Understanding, and 100% for CLUTRR). We find annotators are inaccurate for Date because they incorrectly believe the generated code misuses a Python library (relativedelta), or they complain that there is a better way to answer the question. For CLUTRR, annotators mark chains as incorrect due to known ambiguity in the dataset. For example, the grandmother of one's child may not necessarily be one's parent, but could also be a parent-in-law.
As for the remaining truly incorrect reasoning chains, we find the LM can sometimes add unnecessary steps or arrive at the correct answer by chance. The latter is especially an issue in StrategyQA: given that all questions have a True/False answer, it is common for an incorrect reasoning chain to result in a correct answer. For example, the LM correctly answers the question "Was Karachi a part of Alexander the Great's success?" as "True". However, the reasoning chain contains the flawed subquestion "Which countries are in Pakistan? Pakistan includes Pakistan, Afghanistan, and India." Though the final answer is correct and faithful to the LM generation, the reasoning chain contains wrong knowledge.
Overall, Faithful CoT does generate valid reasoning chains the vast majority of the time when the answer is correct. However, we still see exceptions where the model arrives at the right answer via an incorrect reasoning chain. Though this happens infrequently, it raises concerns about when people should trust LMs. To our knowledge, we are the first to conduct a systematic human study on the plausibility of CoT-style reasoning chains, and we hope to see future work further investigate and improve on the flaws that our study brings to light.

Conclusion
We propose Faithful CoT, a framework that decomposes complex reasoning into Translation and Problem Solving. It guarantees that the reasoning chain is a faithful explanation of how the model arrives at the answer. We demonstrate the efficacy of our approach on 4 types of complex reasoning problems: Math Word Problems, Multi-hop QA, Planning, and Relational Inference. Our method sets new SOTA performance on 7 of the 10 datasets, while additionally providing a faithful explanation for the final answer. These results give empirical evidence that improving model interpretability, by guaranteeing the faithfulness of an explanation, does not come at the expense of overall performance; in fact, we see a strong synergy between the two. Through a comprehensive analysis of the strengths and weaknesses of our method, we show its robustness to the choice of exemplars, the pivotal role of the solver, as well as frequent error patterns where it still struggles.

Limitations
One crucial limitation of our study is that on March 23rd, 2023, OpenAI discontinued the use of code-davinci-002. This has rendered part of our results unreplicable for any teams or researchers who have not been granted continued access to the model. This discontinuation was unexpected during our study, and it raises important questions about using closed-source models for academic research.
Meanwhile, one methodological limitation of our approach lies in the scope of faithfulness. Currently, we guarantee that the Problem Solving stage is faithful. However, the Translation stage is still opaque, meaning it is not self-interpretable how the LM generates the reasoning chain from the question. It remains an under-explored question whether it is possible to improve the interpretability of the LM generation process in general; a few recent studies have made promising early progress (Yin and Neubig, 2022; Sarti et al., 2023) that might be used to improve the faithfulness of the Translation stage.
Finally, the role of NL comments in the reasoning chain needs further exploration. From our ablation study, in terms of performance, whether to include the NL comments in the reasoning chain does not make a big difference on many of the datasets, especially those where the task is not knowledge-intensive. Nevertheless, in terms of interpretability, NL comments can make the reasoning chain more structured and understandable to the end user. Further, NL comments can be an interface that allows users without a programming background to interact with and debug the model, which we leave for future work.

Ethics Statement
With the recent success of generative large LMs, they are now being used to solve complex reasoning problems. When using the output of an LM for reasoning, there is a danger that if the reasoning appears realistic, then the final answer or conclusion will also be considered reliable. As we highlighted in Figures 1 and 7, this is often not true, since an LM may produce a reasoning chain that looks plausible while the final answer is still wrong. This work is a step in the direction of making the use of LMs more trustworthy, by using the LM only to express its reasoning as a symbolic program and executing the program independently. In this work, we have ensured the faithfulness of the reasoning chain w.r.t. how the final answer is produced in a variety of domains, but admittedly the Translation stage is still opaque. Therefore, our pipeline is still not entirely interpretable. Furthermore, as we have stressed in Section 1, faithfulness does not guarantee correctness, so our method can still sometimes produce erroneous answers, which may pose a risk for users who rely on it for decision making.

A Implementation Details
In all our experiments, we use OpenAI GPT-3 (text-davinci-001 and text-davinci-002) and Codex (code-davinci-001 and code-davinci-002) models through the Python API available at beta.openai.com, from September 2022 to January 2023. The inference cost per example is $0 for all Codex models, since they are in a limited beta period, and $0.01 to $0.03 for GPT-3 models, depending on the dataset. It takes 2-15 seconds to run inference on one example with Codex models under a rate limit of 150,000 tokens/minute, and 1-8 seconds with GPT-3 models under 250,000 tokens/minute, also depending on the dataset. For example, on the GSM8K test set of 1,319 examples, it takes 3.5 hours to finish inference with Codex and 2.3 hours with GPT-3.
Any unspecified hyper-parameters are set to the default value on https://beta.openai.com/docs/api-reference/completions/create.
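For reference, an inference call with the legacy (pre-v1.0) OpenAI Python SDK used at the time looked roughly as follows; the stop sequence and max_tokens below are illustrative rather than our exact settings:

import openai

few_shot_prompt = "..."  # the in-context exemplars (see our repository)
question = "..."         # the test query

response = openai.Completion.create(
    engine="code-davinci-002",
    prompt=f"{few_shot_prompt}\n\nQ: {question}\nA:",
    temperature=0.0,     # greedy decoding
    max_tokens=512,      # illustrative value
    stop=["\n\nQ:"],     # illustrative: stop at the next exemplar boundary
)
reasoning_chain = response["choices"][0]["text"]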

B Extended Results
In this section, we present more results that do not fit into the main text.

B.1 Examples of Unfaithful CoT Output
Figure 7 shows examples of unfaithful output from the CoT method (Wei et al., 2022) on three datasets: GSM8K (MWP), StrategyQA (Multi-hop QA), and SayCan (Planning). Note that here we only show model outputs that are apparently unfaithful. In other cases, there is no known inconsistency between the reasoning chain and the answer, but this still does not imply that the reasoning chain is a faithful explanation, since the ground truth (how the model predicts the answer) is unknown.
The GSM8K example is explained in Section 1. In the StrategyQA example, though the reasoning chain correctly identifies that a hummingbird weighs much more than a pea, the answer is still "yes"; in the SayCan example, the reasoning chain only mentions "visit the table and the counter", but the plan contains unnecessary "pick" and "put" operations.

Figure 7: Examples of unfaithful output from CoT prompting (Wei et al., 2022) on three datasets. The answer (green) does not follow from the reasoning chain (blue).
Math Word Problem (GSM8K). Input: John plans to sell all his toys and use the money to buy video games. He has 13 lego sets and he sells them for $15 each. He ends up buying 8 video games for $20 each and has $5 left. How many lego sets does he still have? CoT output: He started with 13 lego sets. He sold them for $15 each. So he must have gotten 13 * $15 = $195 from selling them. He used this money to buy 8 video games for $20 each. This means he spent 8 * $20 = $160 on the video games. He had $5 left, so in total he must have had $195 + $5 = $200. The answer is 0.
Multi-hop QA (StrategyQA). Input: Could a bee hummingbird balance a scale with a single pea on it? CoT output: The bee hummingbird is the smallest bird in the world. The average weight of a bee hummingbird is about 2 grams. A single pea weighs about 0.1 grams. Thus, a bee hummingbird could balance a scale with a single pea on it. So the answer is yes.
Planning (SayCan). Input: Visit the table and the counter. CoT output: The user has asked me to visit the table and the counter, I will do so in that order. 1. find(table), 2. pick(table), 3. find(counter), 4. put(counter), 5. done()

Wei et al. (2022) claim that CoT "provides an interpretable window into the behavior of the model, suggesting how it might have arrived at a particular answer and providing opportunities to debug where the reasoning path went wrong". As we have pointed out in Section 1, since CoT does not guarantee faithfulness, how the model arrives at the answer could differ drastically from what is shown in the reasoning chain. Furthermore, it is still hard for the user to debug the model: even if they manually correct the reasoning chain and let the model regenerate the answer, it might still be wrong, since there is no causality between the reasoning chain and the answer.

B.2 Comparison with Few-shot SOTA
We compare the results of Faithful CoT with the published few-shot SOTA in Table 3. With Codex and GPT-4, Faithful CoT sets new SOTA performance on 7 out of the 10 datasets across four domains, achieving 95.0+ accuracy on 6 of them.

B.3 Empirical Comparison with Concurrent Work
Two pieces of concurrent work, Program of Thoughts (PoT) (Chen et al., 2022) and Program-Aided Language Models (PAL) (Gao et al., 2022), were announced on arXiv within three months of our work. Essentially, they both generate Python programs, or SL-only reasoning chains, to derive the answer. Our approach differs from them mainly in the additional component of structured NL comments, which decomposes the original problem into simpler, interdependent subproblems. Aside from the qualitative differences highlighted in Section 2, we perform an empirical performance comparison with them on the same set of 10 datasets used in our main evaluation. Since both papers have only tackled math reasoning and symbolic reasoning tasks, we reimplement their methods by using the "No NL" prompt from our ablation study in Section 6.1. The comparison is done with code-davinci-002 as the underlying LM and greedy decoding.
As shown in Figure 8, on 6 of the 10 datasets (including most MWP datasets, SayCan, and Date Understanding), PAL/PoT and Faithful CoT have very close accuracy (<2.0 difference). On AQuA, PAL/PoT is visibly better. On the remaining three datasets (StrategyQA, Sports Understanding, and CLUTRR), Faithful CoT reasonably outperforms PAL/PoT. This may suggest that our method has an advantage when the task requires extensive external knowledge (e.g., StrategyQA and Sports Understanding) or when the SL is not frequent in the LM's pretraining data (e.g., Datalog, or our self-defined relational expressions).
Finally, note that the key contribution of our method lies in interpretability. Though the addition of structured NL comments sometimes does not make a difference in performance, it does make the reasoning chain more understandable to the user. Furthermore, it may even enable users without a programming background to debug the model by interacting only with the NL subproblems (e.g., adding/removing/editing a subproblem), which is worth further exploration in the future.

C Extended Analysis

C.1 Ablation Study
Table 4 shows the full results of the ablation study from Section 6.1.

C.2 Robustness to Prompt Phrasing
We study the sensitivity of our method to subtle differences in prompt design. We experiment with three prompt variations: 1. randomly permuting the order of independent subquestions/reasoning steps; 2. changing the variable names; 3. changing the nudge line (e.g., from "# To answer this question, write a Python program to answer the following subquestions" to "# To solve this question, we answer each of the following subquestions with a Python program").
We rerun the evaluation with all three variations on the 4 datasets used in Section 6 (when applicable), under greedy decoding. Table 5 shows the results. Overall, the performance is quite stable, always above each baseline on all four datasets.

C.3 Model Sensitivity
In this section, we want to answer the question: how much does the choice of LM matter? All results in Section 5 are obtained using code-davinci-002. Here, we examine the effect of using four alternative code-generation models as the Translator: code-davinci-001, text-davinci-002, text-davinci-003, and gpt-4. We compare our method with the three baselines using each of the above LMs on five MWP datasets, under the greedy decoding strategy.
As shown in Table 6, regardless of the underlying LM, Faithful CoT consistently outperforms all baselines on the vast majority of the datasets, and performs very closely to the best method (<2.0 difference) on the remaining ones. On average, it has a relative accuracy gain of 16.1%, 11.0%, 9.4%, and 4.6% over the best-performing baseline, for each LM respectively. This indicates that even though the absolute performance varies depending on the LM, Faithful CoT brings a relatively consistent accuracy gain.
Notably, with GPT-4 as the underlying LM, Faithful CoT reaches 95.0+ accuracy on 4 of the 5 MWP datasets, far outperforming the previous few-shot SOTA on three of them (GSM8K, SVAMP, and ASDiv).

C.4 Enforcing Constraints
Since our generated reasoning chain contains structured components (e.g., dependency graphs), another natural question to ask is: will it be helpful to enforce certain constraints on the generation? Using the MWP datasets as a case study, we examine the effect of three such constraints:
Graph validity. The dependency graph must be a Directed Acyclic Graph (DAG); e.g., it is not allowed for a subquestion to depend on itself.
No over-dependency. The code cannot depend on any variable that its corresponding subquestion has not mentioned; e.g., in Figure 4, since Q5 says "depends on 4", the corresponding code should not use the variable eggs_in_dozen, since it is not the output of Q4.
No under-dependency. The code must depend on all variables that its corresponding subquestion has mentioned; e.g., in the same example, since Q5 says "depends on 4", the corresponding code must use the variable eggs_in_dozen. These three checks are illustrated in the sketch below.
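As an illustration, the three checks can be sketched over a schematic parse of the reasoning chain; the data structure below (declared dependencies plus the variables each subquestion's code defines and uses) is illustrative, not our exact implementation:

def check_constraints(subqs):
    # subqs: {i: {"deps": [...], "defines": set(...), "uses": set(...)}}
    # 1. Graph validity: the dependency graph must be a DAG.
    visiting, done = set(), set()
    def has_cycle(i):
        if i in done:
            return False
        if i in visiting:
            return True
        visiting.add(i)
        cyclic = any(has_cycle(j) for j in subqs[i]["deps"])
        visiting.discard(i)
        done.add(i)
        return cyclic
    if any(has_cycle(i) for i in subqs):
        return False
    for sq in subqs.values():
        dep_vars = set().union(*[subqs[j]["defines"] for j in sq["deps"]]) \
            if sq["deps"] else set()
        # 2. No over-dependency: only read variables from declared dependencies.
        if not sq["uses"] <= dep_vars:
            return False
        # 3. No under-dependency: read something from every declared dependency.
        if any(not (subqs[j]["defines"] & sq["uses"]) for j in sq["deps"]):
            return False
    return True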
We investigate the effect of adding constraints on the generations under self-consistent decoding. Starting with our original results (without any constraint), we add a different set of constraints each time and report the accuracy change in Table 7. Individually, the graph validity constraint results in little to no change in performance, but the other two constraints lead to a more unstable change, mostly a decrease, across datasets. Adding two or more constraints further lowers the performance in almost all cases except on MultiArith (the easiest dataset), revealing the tradeoff between accuracy and satisfying the constraints. It also indicates that a proportion of generations (1.0% to 8.9%) in our existing results do not satisfy all constraints. However, it may still be worth enforcing some of these constraints (e.g., graph validity) at the cost of performance, in order for users to better control and interact with the model.

Table 7: Accuracy change after enforcing different constraints on the generation. The "None" row shows the original performance without any constraint (from Table 1). Each row below adds a different set of constraints: G stands for "graph validity", O for "no over-dependency", and U for "no under-dependency". Results are on all MWP datasets under self-consistent decoding.

D Human Evaluation Details
We hire crowd workers on Prolific to evaluate the correctness of model-generated reasoning chains that result in a correct answer. We sample 100 reasoning chains for each domain, generated by code-davinci-002 with the greedy decoding strategy, where each set of 100 contains an equal number of samples from all datasets within the domain. We further require annotators to have experience coding in the programming language of the dataset they annotate (Python for MWP/CLUTRR/Sports/Date, and Scala for StrategyQA, as Datalog was not an option on Prolific).
We have a single survey for each domain, with the exception of Multi-hop QA (in this case, we have separate surveys for StrategyQA, Date, and Sports, given the different nature of each dataset).
Additionally, there was no way to ensure that annotators knew PDDL on Prolific; in order to ensure high-quality annotations for SayCan, the authors annotated the SayCan reasoning chains themselves. We discard annotations from workers who select the same response repeatedly or complete the survey in under 3 minutes. After the surveys are complete, we compute annotator agreement and then take the majority label for each reasoning chain as the final label in our analysis. Our annotator population consists of 100 annotators with an average age of 25 years and an average income of $40k per year. 87.9% of the annotators are male and 56% have a 4-year college degree. Annotators are compensated at $16/hr, and average 2 minutes per question. Our full study cost $280. Sample instructions for the CLUTRR survey can be found in the Supplementary Materials.

Table 8: Each row represents the percent (0-100) of different answer choices selected by human evaluators in each domain/dataset, as well as the inter-annotator agreement.

E Dataset Details

E.1 Statistics
We show the dataset details in Table 9, including the statistics, the number of few-shot exemplars used in the prompt, and example inputs and outputs.
In particular, we notice that in one of our baselines, Wei et al. (2022), the reported number of exemplars used in the prompt is inconsistent between the main text (10) and the appendix (6). To ensure fair comparison, we rerun the baseline with 10 exemplars for our results in Table 1, which is what we use for our method.

E.2 URLs and Licenses
We use the same distribution of datasets following Wei et al. (2022): Math Word Problems

For CLUTRR, we obtain the publicly distributed version available at https://drive.google.com/file/d/1SEq_e1IVCDDzsBIBhoUQ5pOVH5kxRoZF/view, specifically the data_089907f8 split.
We use all the above datasets for research purposes only, consistent with their intended use.

E.3 Data Cleaning
We perform manual cleaning on ASDiv, Date Understanding, Sports Understanding, and SayCan, as we discover a number of annotation issues. In our experiments, we rerun all baselines on our cleaned versions of the test sets. They are provided in our repository to assist future research.
Specifically, we clean each of the datasets as follows. ASDiv: We start with the test set used by Wei et al. (2022), which removes all questions with float-valued and string-valued answers. However, in their released version, we notice an error in the answer extraction step for questions with more than one value in the answer (e.g., "what is the width and length of X?", where the answer consists of two values). In their implementation, only the first value is extracted as the ground truth answer, which is then compared against model outputs. This might artificially inflate the final accuracy. To fix this, we extract all values in the answer as a set and compare model outputs against it.
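As an illustration, the fixed comparison can be sketched as follows (the value formats shown are illustrative):

def asdiv_correct(pred_values, gold_values) -> bool:
    # Compare the full set of values, not just the first gold value.
    gold = {round(float(v)) for v in gold_values}
    pred = {round(float(v)) for v in pred_values}
    return pred == gold

# e.g., "what is the width and length of X?" with gold values [5, 10]
assert asdiv_correct([5.0, 10.0], [5, 10])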
Date Understanding: We find a number of wrong answers in the test set. For example, for the question "Jane and John married on Jan 2, 1958. It is their 5-year anniversary today. What is the date today in MM/DD/YYYY?", the provided answer is "01/02/1961", whereas the correct answer should be "01/02/1963". We manually correct these answers, and the resulting test set has the same number of examples as the original one.
Sports Understanding: We notice a few ambiguities in the Sports Understanding dataset. For instance, running out of bounds is illegal in many sports. The phrase "Domantas Sabonis ran out of bounds" is labeled as implausible; however, Domantas Sabonis is a basketball player, and basketball players can indeed run out of bounds on the court. We remove 8 questions with such action-based ambiguities. Additionally, since the release of this dataset, a few new athletes have risen to fame with names identical to those mentioned in the dataset. For example, the question "Chris Paul struck out the side" is labeled implausible, as the referenced "Chris Paul" is a famous basketball player. However, "Chris Paul" is also the name of a new MLB baseball player, in which case this statement is plausible. We remove 5 questions with such name-based ambiguities.
SayCan: We discover a few issues in the test set: (1) the environment setup (e.g., the list of objects, the list of locations, and the initial location of each object) is not the same for all examples; (2) the annotation of the ground truth answer is often incomplete (i.e., for a given task like "visit all locations", there exist many possible plans in terms of the order of locations visited, but not all of them are included in the annotation); (3) there are ambiguous descriptions in certain queries; for example, in "Could you get me something refreshing?", it is unclear what drinks are considered "refreshing". For these questions, we complete the annotation whenever possible, and filter out the rest. The resulting test set contains 103 examples out of the original 120.

E.4 Dataset Splits
As stated in Section 4.1, we use the official splits whenever possible: training set for exemplar selection, validation set for prompt tuning, and test set for evaluation. Where the official splits are incomplete or unavailable, we adopt the following strategies for each dataset. GSM8K: it only has training and test sets. We form the validation set by randomly sampling 1,000 examples from the training set.
Other MWP datasets: for AQuA, we use the official training/validation/test split. For the other datasets, only the test sets are used, since we use the same prompt as for GSM8K.
Date Understanding and Sports Understanding: they only have test sets. We follow Wei et al. (2022) and select 10 examples from the test set to form the prompt, using the rest for evaluation. CLUTRR: this dataset is split into multiple folds. There is a training fold with K ∈ {2, 3} (where K is the number of intermediate steps required to reach the answer), and one test fold for each K from 2 to 10. We construct the few-shot prompt using exemplars from the training fold, and test our method on the concatenation of all test folds.

F Error Analysis
To further investigate where our method still fails, we inspect 100 errors from model predictions on each of the four datasets and manually annotate the error categories.

F.1 GSM8K

As shown in Figure 10, we categorize the errors on GSM8K into 6 types, inversely sorted by frequency:
Wrong Subquestion (49%): The LM produces a wrong NL subquestion, which eventually leads to an incorrect answer. While this is the majority error type in our sample, it is worth noting that in a typical human-in-the-loop collaboration, these errors are easily fixable. Even if the user is unfamiliar with programming, they can inspect the NL subquestions and potentially correct the model error by simply deleting or editing a wrong subquestion.
Wrong Code (24%): The NL subquestion is correct, but the code fails to answer the subquestion correctly. For example, the code uses a variable that has not been previously defined.
Semantic Understanding Error (12%): The LM incorrectly interprets certain semantic subtleties in the query. This is the most complex and most interesting error category. For example, consider the following problem: "If Martin eats Cheerios every day for breakfast, he'll lose 1.25 pounds/week. If he eats donuts every day for breakfast, he'll gain 1.75 pounds/week. What will be the difference in his weight at the end of 5 weeks between the two breakfast options?" The generated code, shown in Table 10, does not assign opposite polarities (signs) to "pounds lost" vs. "pounds gained". For other examples in this category, we notice errors like missing that a pair of something has 2 items in it, failing to subtract 2 for "two years ago" when it occurs as a subjunctive, and so on. Fixing these errors, in general, will require more than providing additional examples in the prompt.
Generation Cutoff (7%): The generation stops midway, mainly due to the LM producing the same steps over and over again. These errors could be easily detected in postprocessing and possibly fixed by re-prompting the LM.
Wrong Gold Label (5%): We find 5 (out of our 100) examples that are genuine annotation errors in the gold labels.
Missing Subquestion (3%): The LM misses a relevant subquestion needed for the rest of the reasoning chain to work. These errors are also potentially fixable via human-in-the-loop interaction, where the user can insert a subquestion into the reasoning chain.

Table 10: Generated code for the question in Appendix F.1, as an example of a "semantic understanding error".
# 1. How many pounds will Martin lose per week if he eats Cheerios every day for breakfast? (independent, support: ["he'll lose 1.25 pounds/week"])
pounds_lost_cheerios = 1.25
# 2. How many pounds will Martin gain per week if he eats donuts every day for breakfast? (independent, support: ["he'll gain 1.75 pounds/week"])
pounds_gained_donuts = 1.75
# 3. How many weeks are there in 5 weeks? (independent, support: ["External knowledge: there are 7 days in a week"])
weeks_in_5_weeks = 5
# 4. How many pounds will Martin lose in 5 weeks if he eats Cheerios every day for breakfast? (depends on 1 and 3, support: [])
pounds_lost_cheerios_5_weeks = pounds_lost_cheerios * weeks_in_5_weeks
# 5. How many pounds will Martin gain in 5 weeks if he eats donuts every day for breakfast? (depends on 2 and 3, support: [])
pounds_gained_donuts_5_weeks = pounds_gained_donuts * weeks_in_5_weeks
# 6. What will be the difference in his weight at the end of 5 weeks between the two breakfast options? (depends on 4 and 5, support: [])
difference_5_weeks = pounds_gained_donuts_5_weeks - pounds_lost_cheerios_5_weeks
# 7. Final Answer: What will be the difference in his weight at the end of 5 weeks between the two breakfast options? (depends on 6, support: [])
answer = difference_5_weeks

F.2 StrategyQA
As mentioned in Section 5, Figure 11 shows the error type distribution on a sample of 70 instances from StrategyQA, where we specifically compare the cases where the prediction of CoT is correct whereas ours is wrong. There are 8 error types, from most to least frequent:
Syntax (29%): These are syntax-related errors that result in an invalid Datalog program. For example, the LM redefines a relation that is previously declared, uses an invalid string comparison operation, generates an ungrounded variable in the answer definition, and so on. With further pretraining on Datalog, we believe Codex would be able to overcome these errors to a reasonable extent.
Infinite Loop (24%): The LM generates an infinite loop of subquestions, which are oftentimes rather philosophical. For example, for the question "Are potatoes native to the European continent?", the LM generates a list of 40 subquestions (potentially more, if not truncated by the token limit). We notice that this type of error also happens when the prompt does not include Datalog code but just the NL subquestions, though much less often.
Wrong Subquestions (17%): The LM generates a subquestion irrelevant to the original question. For instance, given the question "Would toast for a vegan have margarine instead of butter?", one generated subquestion is "What is the difference between a vegan and a vegetarian?", which does not help with reaching the final answer.
Wrong Reasoning (10%): The subquestions are correct, but the reasoning from the retrieved facts to the final answer is problematic. For example, to answer the question "Do solo pianists require a conductor?", the model correctly generates the relevant subquestions and their respective answers:
1. What is the definition of a solo pianist? A solo pianist is a pianist who performs alone.
2. What is the definition of a conductor? A conductor is a person who leads an orchestra.
3. What is the definition of an orchestra? An orchestra is a large instrumental ensemble that contains string, brass, woodwind, and percussion instruments.
But in deriving the final answer, the reasoning is wrong: "Now, we derive the final answer: Do solo pianists require a conductor? The answer is Yes only if a solo pianist is not a conductor."
Wrong Knowledge (10%): The LM fails to retrieve the correct knowledge to answer the subquestions. For example, given the original question "Is the largest city in New Mexico also known as Yootó?", the model correctly generates the subquestions "What is the largest city in New Mexico?" (answer: Albuquerque) and "Is Albuquerque also known as Yootó?". However, when answering the second subquestion, it retrieves a wrong piece of knowledge ("Albuquerque is also known as Yootó.", whereas in reality it is "Santa Fe" that is known as Yootó).
Answer Definition (6%): In our prompt, we always derive the answer in the format "The answer is Yes only if ...", followed by a Datalog rule containing the conditions that must be satisfied for the answer to be true. However, the LM sometimes generates this as "The answer is No only if ...", which yields the reversed answer; a postprocessing fix is sketched at the end of this subsection.
Knowledge Representation (3%): The retrieved knowledge in NL is correct, but its representation in Datalog is wrong. For example, for the piece of knowledge "The Lucy Show is not the same TV series as JAG (TV series)", the model represents it as:
.decl Same_TV_series(TV_series1:symbol, TV_series2:symbol)
Same_TV_series("The Lucy Show", "JAG (TV series)").
which actually means the reverse (that they are the same).
Unknown (1%): There is a very small proportion of errors (1 out of 70) where we are unsure of the cause. Specifically, we expect the solver to output True, but it outputs False instead.
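Two of the failure modes above invite cheap postprocessing mitigations. The following is a minimal sketch (ours, not part of the framework; the thresholds and pattern string are illustrative): the first function flags degenerate chains such as the infinite subquestion loops here (and the generation cutoffs seen on GSM8K), and the second flips the solver output when the reversed answer-definition template is detected.

from collections import Counter

def flag_degenerate_chain(subquestions, max_steps=15, max_dup=2):
    """Flag chains that run far longer than any in-context exemplar
    (infinite loops) or that repeat the same step (generation cutoff).
    Thresholds are illustrative, not tuned."""
    if len(subquestions) > max_steps:
        return True
    counts = Counter(q.strip().lower() for q in subquestions)
    return any(c > max_dup for c in counts.values())

def normalize_answer(derivation_text, solver_output):
    """If the LM wrote "The answer is No only if ..." instead of the
    expected "The answer is Yes only if ...", the solver's boolean is
    reversed; flip it back. Assumes exactly one template occurs."""
    if "The answer is No only if" in derivation_text:
        return not solver_output
    return solver_output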

F.3 Date Understanding
Unlike GSM8K, we only have 69 errors out of the 359 test examples, so we annotate them all, as shown in Figure 12. The error categories for Date Understanding are similar to GSM8K, except that we do not see any generation errors in the samples; instead, we see questions with ambiguous phrasing that allow both the gold and predicted answers to be correct depending on the interpretation.

F.4 SayCan
Since SayCan only has 120 test examples and Faithful CoT produces 7 errors, we annotate all 7 of them, as shown in Figure 13. These 7 examples can be categorized into two types:
Additional Subgoals (64%): Cases where the model generates unnecessary subgoals in the decomposition of the original task, leading the planner astray. This is illustrated by the request "Clear the jalapeno chips off the counter", for which the model produces the goal:
(:goal
  (and
    (not (at jalapeno-chips counter))
    (not (at jalapeno-chips table))
    (not (at jalapeno-chips trash))
    (not (at jalapeno-chips bowl))
    (not (at jalapeno-chips user))
  )
)
Wrong Object (36%): Here the model generates the wrong object/object types in the goal. For example, a request such as "I opened a pepsi earlier. How would you bring me an open can?" fails because the model generates actions with water instead of Pepsi.

F.5 CLUTRR
For CLUTRR, we group all error cases by K, the number of steps in their gold reasoning chain, as a proxy for problem complexity, and perform importance sampling on these groups to select 100 examples (a sampling sketch follows at the end of this subsection). Our annotation of these examples reveals 5 error categories, as shown in Figure 14:
Inversed Relation (41%): This stands out as the majority of the errors. These errors are caused by the reversal of directional relationships between the actors in the problem, i.e., predicting "mother" or "nephew" when the answer is "daughter" or "uncle", respectively.
Wrong Relation (30%): Here the model extracts the relation incorrectly (not even the inverse). For example, for the subquestion "How is [Donald] related to [Jason]?" with the correctly identified support "[Jason] is father of their father", the model produces relation(Donald, Jason) = son, when the correct relation should be "grandson".
Wrong Gold Label (13%): These are annotation errors in the CLUTRR dataset. In one example, for the sentence "[Gloria] asked her mother [Laura] if she could go outside and play with her friends.", the annotation says Laura is Gloria's grandmother.
Wrong Path (12%): The model does not generate a correct reasoning path from target entity A to target entity B in the question.
Nonexistent Relation (4%): The model hallucinates a non-existent relation (e.g., "adopted" for daughter).
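The group-by-K selection above can be approximated by drawing from each group in proportion to its size. A minimal sketch (ours; the err["k"] field is a hypothetical name for the stored chain length):

import random
from collections import defaultdict

def sample_errors_by_k(errors, n=100, seed=0):
    """Group error cases by K (gold reasoning-chain length) and sample
    from each group in proportion to its size, as a rough sketch of
    the selection described above."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for err in errors:
        groups[err["k"]].append(err)
    sample = []
    for k, cases in sorted(groups.items()):
        quota = round(n * len(cases) / len(errors))
        sample.extend(rng.sample(cases, min(quota, len(cases))))
    return sample[:n]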

G Prompts
Due to space constraints, we show one exemplar from the prompt for each dataset here. The full prompts can be found in our repository.
Among all the MWP datasets, our prompt for AQuA is different from the rest, because the answers are in a multiple-choice format instead of integers. To produce a multiple-choice answer, we take a two-step approach by first producing a numerical answer in the same way as for the other math datasets. Then, we perform an additional step of converting the numerical answer into an answer choice by again prompting the language model to generate which answer choice is closest to the previously produced numerical answer. An exemplar of this 2-step prompt is shown in Table 11.

EXEMPLAR FOR GSM8K, SVAMP, MULTIARITH, AND ASDIV
# Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
# To answer this question, write a Python program to answer the following subquestions:
# 1. How many trees are there in the beginning? (independent, support: ["There are 15 trees"])
trees_begin = 15
# 2. How many trees are there in the end? (independent, support: ["there will be 21 trees"])
trees_end = 21
# 3. How many trees did the grove workers plant today? (depends on 1 and 2, support: [])
trees_today = trees_end - trees_begin
# 4. Final Answer: How many trees did the grove workers plant today? (depends on 3, support: [])
answer = trees_today

Table 12: An exemplar from our prompt for GSM8K, SVAMP, MultiArith, and ASDiv.
EXEMPLAR FOR STRATEGYQA
// Q: Would a pear sink in water?
EXEMPLAR FOR SPORTS UNDERSTANDING
# Q: Is the following statement plausible? Sam Darnold passed the puck
# To answer this question, write a Python program to answer the following subquestions:
# 1. Sam Darnold is a player in which sport? (independent, support: ["Sam Darnold is an NFL Quarterback", "NFL is the National Football League"])
player_sport = "football"
# 2. The phrase "passed the puck" implies playing which sport? (independent, support: ["Players pass the puck in hockey"])
playing_sport = "hockey"
# 3. Is the following statement plausible? Sam Darnold passed the puck (depends on 1 and 2, support: ["Sam Darnold is an NFL Quarterback", "NFL is the National Football League", "Players pass the puck in hockey"])
plausibility = (player_sport == playing_sport)
# 4. Is the following statement plausible? Sam Darnold passed the puck (depends on 3, support: [])
answer = int(plausibility)

EXEMPLAR FOR SAYCAN
User query: Bring me something not sweet to eat.
; it has to satisfy the following conditions
(and
    ; the snack must not be sweet
    (not (is-sweet ?s))
    ; bring it to the user
    (at ?s user)
) ) )

Table 16: An exemplar from our prompt for SayCan.

Figure 2: An overview of our Faithful CoT framework, consisting of Translation, where an LM translates a query (in NL/Natural Language) into a reasoning chain (which interleaves NL and SL/Symbolic Language), and Problem Solving, where an external solver executes the reasoning chain to derive the answer.

Figure 3: Examples from each task (Math Word Problems, Multi-hop QA, Relational Inference, and Planning) showing our 2-stage Translation and Problem Solving pipeline.
Query: Royce takes 40 minutes more than double Rob to shingle a house. If Rob takes 2 hours, how many minutes does Royce take?
Faithful CoT Reasoning (ours):
# To answer this question, write a Python program to answer the following subquestions:
# 1. How many minutes does Rob take to shingle a house? (independent, support: ["If Rob takes 2 hours"])
minutes_rob = 2 * 60
# 2. How many minutes does Royce take to shingle a house? (depends on 1, support: ["Royce takes 40 minutes more than double Rob"])
minutes_royce = minutes_rob * 2 + 40
# 3. Final Answer: How many minutes does Royce take? (depends on 2, support: [])
answer = minutes_royce
Model Output: Royce takes 40 minutes more than double Rob. That means Royce takes 2 * 2 + 40 = 80 + 40 = 120 minutes. The answer is 120.
Figure 4: Baseline prompting methods: standard prompting, popularized by Brown et al. (2020), with demonstrations of only the question and the answer; CoT (Wei et al., 2022), which additionally includes an NL reasoning chain; and Least-to-Most (LtM) (Zhou et al., 2022), which decomposes the question in NL but does not involve SL.

Figure 5: Ablation study results: accuracy when we remove different parts of the prompt. See Section 6.1 for details.
Figure 6: Human evaluation results of reasoning chain plausibility. Each column represents the percent of different answer choices selected by human evaluators in each domain/dataset.

Figure 9: Example of our annotation interface for the CLUTRR survey.

We select the same number of examples from the test set to form the few-shot prompt and use the remaining examples as a new test set.
SayCan: Following Wei et al. (2022), we manually write 7 few-shot exemplars, since no training set is provided. We evaluate the models on our cleaned version of the test set, described in the previous subsection.

Figure 10: Error analysis for GSM8K. For a detailed description of the error categories, see Appendix F.1.

Figure 11: Error analysis for StrategyQA. For a detailed description of the error categories, see Appendix F.2.

Figure 12: Error analysis for Date Understanding. For a detailed description of the error categories, see Appendix F.3.

Figure 13: Error analysis for SayCan. For a detailed description of the error categories, see Appendix F.4.

Figure 14: Error analysis for CLUTRR. For a detailed description of the error categories, see Appendix F.5.
EXEMPLAR FOR AQUA
Step 1: Numerical Answer Generation
# Q: In a flight of 600 km, an aircraft was slowed down due to bad weather. Its average speed for the trip was reduced by 200 km/hr and the time of flight increased by 30 minutes. The duration of the flight is:
# Answer option: ['A)1 hour', 'B)2 hours', 'C)3 hours', 'D)4 hours', 'E)5 hours']
# Write Python code to solve the following questions. Store your result as a variable named 'answer'.
# 1. What was the duration of the flight? (independent, support: ["The duration of the flight is"])
duration = Symbol('duration', positive=True)
# 2. What is the delay of the flight? (independent, support: ["the time of flight increased by 30 minutes"])
delay = 30 / 60
# 3. What was the total flight distance? (independent, support: ["In a flight of 600 km"])
total_distance = 600
# 4. What was the original speed? (depends on 1 and 3, support: ["External knowledge: speed is distance over time"])
original_speed = total_distance / duration
# 5. What was the reduced speed? (depends on 1, 2, and 3, support: [])
reduced_speed = total_distance / (duration + delay)
# 6. What was the duration of the flight if the original speed was 200 km/hr faster than the reduced speed? (depends on 4, 5, and 1, support: [])
solution = solve_it(original_speed - reduced_speed - 200, duration)
answer = solution[duration]

Step 2: Multiple Choice Conversion
# Question: In a flight of 600 km, an aircraft was slowed down due to bad weather. Its average speed for the trip was reduced by 200 km/hr and the time of flight increased by 30 minutes. The duration of the flight is:
# Answer option: ['A)1 hour', 'B)2 hours', 'C)3 hours', 'D)4 hours', 'E)5 hours']
# Prediction: 1.00000000000000
# Closest Option: A

EXEMPLAR FOR CLUTRR
# Context: [Jason] always had some great adventure planned for his granddaughter [Guillermina] when she came to visit. So, naturally, when [Myrna] told her daughter [Guillermina] that they would be going to visit [Jason] she could hardly contain herself.
# Question: How is [Jason] related to [Myrna]?
# To answer this question, we write a program to answer the following subquestions:
# 1. How is [Jason] related to [Guillermina]? (independent, support: "[Jason] always had some great adventure planned for his granddaughter [Guillermina] when she came to visit.")
relation(Jason, Guillermina) = grandfather
# 2. How is [Guillermina] related to [Myrna]? (independent, support: "So, naturally, when [Myrna] told her daughter [Guillermina] that they would be going to visit [Jason] she could hardly contain herself.")
relation(Guillermina, Myrna) = daughter
# 3. Final answer: How is [Jason] related to [Myrna]? (depends on 1, 2)
relation(Jason, Myrna) = relation(Jason, Guillermina) @ relation(Guillermina, Myrna)
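The @ in the final step above denotes relation composition, which is resolved by the deterministic solver rather than by the LM. A toy sketch of such a composition table (the entries are ours and cover only this example; a full solver enumerates all valid kinship pairs):

# (relation of A to B, relation of B to C) -> relation of A to C.
COMPOSE = {
    ("grandfather", "daughter"): "father",
}

def compose(rel_ab, rel_bc):
    # Directionality matters here: swapping the arguments is exactly
    # the "Inversed Relation" error described in Appendix F.5.
    return COMPOSE[(rel_ab, rel_bc)]

# relation(Jason, Myrna) = grandfather @ daughter
assert compose("grandfather", "daughter") == "father"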

Table 1: Accuracy of different prompting methods on 10 reasoning datasets from 4 domains. We compare our method, Faithful CoT, with standard prompting (Brown et al., 2020), CoT (Wei et al., 2022), and Least-to-Most prompting (Zhou et al., 2022), with code-davinci-002 as the LM. The best results within each decoding strategy are bolded.

Faithful CoT performs best on the majority of datasets and domains under both decoding strategies. With greedy decoding, it outperforms all baselines on 8 of the 10 datasets, by a relative improvement of up to 14.2% on MWP, 3.4% on Planning, 6.5% on Date Understanding from Multi-hop QA, and a surprising 21.4% on Relational Inference. Generally, we see larger gains on harder datasets. Take MWP as an example: on simpler datasets where CoT already performs decently (e.g., MultiArith and ASDiv, where most questions require only 1-2 steps to solve), the gains are smaller (0.3% to 2.4%); however, we see the largest gain (14%) on the most difficult dataset, GSM8K, which requires up to 8 steps to solve. With self-consistency decoding, Faithful CoT still performs best on 7 out of the 10 datasets. Compared to greedy decoding, the relative gain increases on 4 datasets (AQuA: 12.1% → 18.1%, SayCan: 3.4% → 5.5%, Date Understanding: 6.5% → 10.8%, and CLUTRR: 21.4% → 41.3%).
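For reference, self-consistency decoding in our setting amounts to sampling several reasoning chains, executing each with the deterministic solver, and majority-voting over the resulting answers. A minimal sketch, where sample_chain and execute are placeholder callables standing in for the LM and the solver:

from collections import Counter

def self_consistency(query, sample_chain, execute, k=40):
    """Sample k reasoning chains, execute each with the solver, and
    return the most common answer. Chains that fail to execute could
    simply be skipped; this sketch assumes all succeed."""
    answers = [execute(sample_chain(query)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]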

Table 2: Robustness to the choice of exemplars.

Table 3: Comparison between the existing few-shot SOTA results and the optimal Faithful CoT results, with the best-performing LM (code-davinci-002 for SayCan and CLUTRR, and gpt-4 for the rest of the datasets). See Appendix B.2 for sources of the SOTA results.

Table 4: Ablation study results that accompany Figure 5. We report accuracy when we remove different parts of the prompt.

Table 6: Accuracy of different prompting methods with each underlying LM on 5 MWP reasoning datasets.

Table 8: Numerical results for human evaluation of reasoning chain correctness, accompanying the corresponding figure.

Table 9: Datasets used for evaluation. "# Shot" stands for the number of few-shot examples in the prompt (following Wei et al. (2022)) and "# Test" stands for the number of test examples. Two example entries:
Q: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? A: 72
SVAMP (8 shots, 1,000 test examples) Q: Each pack of dvds costs 76 dollars. If there is a discount of 25 dollars on each pack. How much do you have to pay to buy each pack?

Table 11: An exemplar from our prompt for AQuA.
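In our pipeline, the second step of this prompt (converting the numerical prediction to the closest option) is performed by the LM itself. For options that embed a plain number, a deterministic fallback is also conceivable; a sketch (ours, not what we use; the regex and the option format are assumptions):

import re

def closest_option(prediction, options):
    """Return the letter of the option whose first number is closest to
    `prediction`, e.g. closest_option(1.0, ["A)1 hour", "B)2 hours"]) -> "A".
    Assumes each option contains at least one parseable number."""
    def first_number(opt):
        return float(re.search(r"-?\d+(?:\.\d+)?", opt).group())
    best = min(options, key=lambda o: abs(first_number(o) - prediction))
    return best[0]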

Table 15: An exemplar from our prompt for Sports Understanding.

Table 17: An exemplar from our prompt for CLUTRR.