Tab-CoT: Zero-shot Tabular Chain of Thought

The chain-of-thought (CoT) prompting method has been successful in various natural language processing (NLP) tasks thanks to its ability to unveil the underlying complex reasoning processes. Such reasoning processes typically exhibit implicitly structured steps. Recent efforts have also started investigating methods to encourage more explicitly structured reasoning procedures to be captured. In this work, we propose Tab-CoT, a novel tabular-format CoT prompting method, which allows the complex reasoning process to be explicitly modelled in a highly structured manner. Despite its simplicity, we show that our approach is capable of performing reasoning across multiple dimensions (i.e., both rows and columns). We demonstrate our approach's strong zero-shot and few-shot capabilities through extensive experiments on a range of reasoning tasks.


Introduction
The chain-of-thought (CoT) prompting method (Wei et al., 2022) encourages large language models (LLMs) to engage in a thought process before providing the answer to the given question. Such an approach shows impressive performance improvements in reasoning tasks. Notably, in the zero-shot setting, it was shown that a simple prompt such as "let's think step by step" could facilitate the step-by-step thinking process before answering (Kojima et al., 2022).
Recent work such as Program-Aided Language Models (PAL; Gao et al., 2022) involves code in the prompt design, allowing structured information in the form of formal language to participate in the reasoning process. While effective, such methods require specific prompt engineering for different domains or defining multiple variables, which can be difficult to maintain or keep track of.
Inspired by the fact that state-of-the-art large language models, such as GPT-3 (Brown et al., 2020) and Codex (Chen et al., 2021), have the capability of reasoning over tabular structured data (He et al., 2023), we propose a novel framework called Tabular Chain of Thought (Tab-CoT) that models the structured reasoning process using a table-filling procedure.
We show that the model can perform step-by-step reasoning by creating a table, without further fine-tuning, when given a table header with column names in the form of "|step|question|response|" as a prompt. While conventional natural language texts are generated in a 1-dimensional sequential order, a table has a 2-dimensional structure, allowing inference along both columns and rows to be performed simultaneously. Unlike previous works that focused on extracting information from existing tabular structured data (Gong et al., 2020; He et al., 2023), our approach generates the table while performing the reasoning process (and extracts the answer from the generated table at the end).
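As a concrete illustration, the zero-shot Tab-CoT prompt is simply the question followed by the table header on the next line. The helper name below is ours, not from the paper:

```python
# Hypothetical helper (names are ours, not from the paper): the zero-shot
# Tab-CoT prompt places the table header on the line below the question.

def build_tab_cot_prompt(question: str,
                         header: str = "|step|question|response|") -> str:
    # The LLM is then expected to continue the table row by row.
    return f"{question}\n{header}"

prompt = build_tab_cot_prompt(
    "A bakery had 200 loaves. They sold 132 and 6 were returned. "
    "How many loaves remain?")
```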
Figure 1 shows the results with standard prompting, conventional zero-shot CoT, and our zero-shot Tab-CoT. Our method generates a table as the output, which is more organized and concise than the output from the conventional CoT method. In this example, while zero-shot CoT generates 140 words, our method only generates 28. Moreover, we found our method can reason horizontally and vertically at the same time. This demonstrates that our Tab-CoT method benefits from the 2-dimensional structure of the table, where information can flow in two dimensions.
We summarize our main contributions in this work as follows:
• We propose a new approach called Tabular Chain-of-Thought (Tab-CoT) that utilizes a tabular structured reasoning scheme in combination with state-of-the-art large language models to generate answers. To the best of our knowledge, this is the first method that uses tables in a "chain of thought" process.
• The 2-dimensional tabular structure of Tab-CoT better unlocks the step-by-step reasoning capabilities of LLMs, transforming the linear "chain of thought" process into a more structured one.
• Extensive experiments reveal that Tab-CoT outperforms traditional CoT techniques in both zero-shot and few-shot settings. This indicates that Tab-CoT has strong potential as a superior alternative to current chain-of-thought prompting methods.

Related Work
Chain-of-thought prompting (Wei et al., 2022), a variation of few-shot prompting that adds step-by-step reasoning to the few-shot examples instead of just providing answers, has achieved significant improvements across multiple datasets. The LLMs can generate solutions following the solution format of the prompts. Compared to traditional prompting, chain-of-thought prompting decomposes the task into smaller steps, which makes difficult tasks easier to solve.
The chain-of-thought prompting method is not necessarily purely natural language based. Program-Aided Language Models (PAL; Gao et al., 2022) provide few-shot samples that contain executable Python code. Such an approach enables the LLMs to interact with a Python shell, allowing the model to focus on learning how to do mathematical reasoning rather than numerical calculations.
These chain-of-thought methods provide the solution structure and pattern via few-shot samples, but can such structure be provided without few-shot samples, in the zero-shot setting? Zero-shot CoT (Kojima et al., 2022) is a zero-shot chain-of-thought prompting method: the prompt phrase "Let's think step by step", added after the question, triggers the explicit reasoning process. However, compared to few-shot CoT (Wei et al., 2022), zero-shot CoT allows more flexibility in the structure of the reasoning process.
Recently, Zhou et al. (2022) proposed Least-to-Most prompting, a strategy that reduces a complex problem into a list of sub-questions and solves them sequentially, where each sub-question is solved with the answers to previously solved sub-questions. Compared to zero-shot CoT, this method places more restrictions on the structure of reasoning through decomposition and sequential answering. Moreover, importing external tools (such as a calculator or a Python shell) can further aid math computation within the arithmetic domain (Gao et al., 2022).
These works reveal the importance of promoting structures in the chain-of-thought process. However, the nature of zero-shot prompting makes the injection of structures into the generation process challenging. This motivates us to devise a better mechanism to prompt language models under the zero-shot setting: a new prompting scheme that allows highly structured outputs in the form of tables to be generated.
[Worked example from the figure: "Jackson is planting tulips. He can fit 6 red tulips in a row and 8 blue tulips in a row. If Jackson buys 36 red tulips and 24 blue tulips, how many rows of flowers will he plant?" After the generated table, the answer extraction prompt yields "Therefore, the answer is 9."]
Tab-CoT

We begin with simple experiments that unveil how tables are represented in such LLMs and illustrate how a table can potentially be used for generating a reasoning process. To validate this, we designed several probing experiments to understand how reasoning over tabular-structured data is performed in such LLMs, as shown in Figure 3. Our first experiment (A) shows that such LLMs are able to perform potential vertical reasoning. However, if we replace '|' with ',' (B), the LLM fails to capture the patterns in the data. This tells us that correct formatting is crucial when reasoning with tables in such LLMs.
Next, we intentionally insert a mistake into the partial table and ask the model to continue the generation process (circled in C). Surprisingly, the LLM is able to generate the correct entries even though the mistake occurred in the same row. This further confirms the LLM's strong potential in performing vertical reasoning with tabular-structured data.
Moreover, to show that both vertical and horizontal reasoning exist, we increase the difficulty by directly appending the first two elements from step 9 after step 6 (D). If only vertical reasoning existed, the value under "v4" would have been "11". Instead, the value generated is "13", confirming that the LLMs have the potential to perform a combination of horizontal and vertical reasoning simultaneously.

Table Generation Prompt
To make use of the 2-dimensional structure of the table, we replace the natural language prompt with a table-generation prompt (e.g., "|step|question|response|"), which serves as the header of the table. This regulates the context of the table, forcing the LLMs to conduct step-by-step reasoning by completing the table. Meanwhile, the choice of columns can be very specific. If each row of the table is regarded as a step, the row-by-row table generation process becomes a step-by-step reasoning process. Within each step (row), we have multiple columns, each of which contributes a certain detail towards the current reasoning step.
For any text question x, we have a table generation prompt (all column names) c. Concretely, we add the table generation prompt in the next row of the text question and let the LLM complete the table:

T = LLM([x; c])    (1)

where t_{1,1}, ..., t_{m,n} are the entries within the generated table T, which contains m rows and n columns.
Answer Extraction Prompt

After the table content, denoted as T, is generated from the previous step, we perform answer extraction. The answer extraction step helps us extract the answer from the table, as the final result may not always be in the last cell of the generated table. Following zero-shot CoT (Kojima et al., 2022), we add another answer extraction prompt a ("the answer is") after the generated table, to extract the final answer from the table:

â = LLM([x; c; T; a])    (2)

Structure-Promoting Scheme

Different table generation prompts (headers) may result in different tables generated (with different content). We propose a "structure-promoting scheme", which maximally unlocks the reasoning abilities of LLMs.
We define each row as a reasoning step. A table containing multiple rows will depict the step-by-step reasoning procedure leading to the final answer. Thus, our first column is "step", containing a number that indicates which reasoning step the current row represents.
Least-to-Most prompting (Zhou et al., 2022) contains two stages: problem reduction and sequential solving. In problem reduction, a question is decomposed into multiple subquestions. Similarly, we add "subquestion" as our second column. At the beginning of each step, the LLM generates a subquestion under this column, which states the objective of the current reasoning step.
The conventional zero-shot CoT (Kojima et al., 2022) shows that allowing the model to generate some reasoning process before answering can achieve better results. Inspired by this observation, we add a third column, "process", to our table. Given a subquestion in the previous column, we expect the model to generate the reasoning process in the current column before answering.
The last column is named "result". As the preceding reasoning under the "process" column may not necessarily provide an answer, we use the "result" column to explicitly request an (intermediate) answer at the end of each reasoning step.
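The two prompting stages can be sketched end to end as follows. This is a minimal sketch, assuming a generic `call_llm` completion function (stubbed here with canned outputs); the table row content is illustrative, not the model's actual output:

```python
# A minimal sketch of the two-stage Tab-CoT procedure described above.
# `call_llm` is a stand-in for a real completion API; it is stubbed with
# canned outputs so the control flow can be followed end to end.

HEADER = "|step|subquestion|process|result|"
EXTRACT = "Therefore, the answer is"

def call_llm(prompt: str) -> str:
    # Stub: a real implementation would send `prompt` to an LLM and
    # return the completion. The canned outputs below are illustrative.
    if prompt.endswith(HEADER):
        return "\n|1|How many loaves remain?|200 - 132 - 6 = 62|62|"
    return " 62."

def tab_cot(question: str) -> str:
    # Stage 1: table generation. The header prompt follows the question.
    table = call_llm(f"{question}\n{HEADER}")
    # Stage 2: answer extraction. Append the generated table and the
    # extraction prompt, then read off the final answer.
    completion = call_llm(f"{question}\n{HEADER}{table}\n{EXTRACT}")
    return completion.strip().rstrip(".")
```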
With the above considerations, our primary scheme for the table header is designed as "|step|subquestion|process|result|", which serves as our main scheme.

Experiments

We use two LLMs from the GPT-3 family (Brown et al., 2020) in our experiments, namely "code-davinci-002" and "text-davinci-002", whose APIs are made available by OpenAI. For brevity, we use "code" to refer to the model "code-davinci-002" and "text" to refer to "text-davinci-002" in our experiments.
We also conducted additional experiments on datasets involving other types of reasoning tasks. Specifically, we evaluate our method on two symbolic reasoning tasks, Last Letter and Coin Flip: the former asks for the concatenation of the last letters of 4 words, and the latter asks for the state of a coin after it has been flipped a few times. We investigate how the specificity of column names affects the performance and report the results in our ablation study. We also evaluate our method on two commonsense reasoning tasks: CommonsenseQA (Talmor et al., 2019) and StrategyQA (Geva et al., 2021).
Following zero-shot CoT (Kojima et al., 2022), we take the first generated number as the numerical answer, the first capitalized letter as the answer for multiple-choice questions, and the first "yes" or "no" as the answer for yes-or-no questions.
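These extraction heuristics can be written down compactly. The sketch below is our own approximation, not the authors' released parser:

```python
import re

# A reimplementation sketch of the answer-extraction heuristics above
# (our own approximation, not the authors' released parser).

def extract_answer(text: str, answer_type: str):
    if answer_type == "number":    # first generated number
        m = re.search(r"-?\d+(?:\.\d+)?", text)
    elif answer_type == "choice":  # first capitalized letter (assumed A-E)
        m = re.search(r"[A-E]", text)
    elif answer_type == "yesno":   # first "yes" or "no"
        m = re.search(r"\b(yes|no)\b", text, re.IGNORECASE)
    else:
        raise ValueError(f"unknown answer type: {answer_type}")
    return m.group(0) if m else None
```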

Main Results
Our main experiments are conducted on arithmetic reasoning tasks under the zero-shot setting. We tested the performance of both text-based and code-based LLMs with all methods. The results are shown in Table 2. Under the scheme "|step|subquestion|process|result|", our zero-shot Tab-CoT approach significantly outperforms standard prompting on all tasks. Furthermore, our best-performing Tab-CoT model (using the code-based LLM) outperforms the best conventional CoT model on 5 out of 6 tasks (with an average improvement of 2.2%).
When the standard prompting method is considered, using the text-based LLM leads to significantly better results than the code-based counterpart (15.7% on average). Similarly, when zero-shot CoT is considered, the text-based LLM also outperforms the code-based one by 10.9% on average. However, for our Tab-CoT approach, "code" outperforms "text" by 4.0%, leading to the best overall performance among all configurations.
From such results, we can see that the conventional CoT method responds differently from our Tab-CoT method depending on the type of underlying LLM. The conventional CoT method (and the standard prompting method) strongly favors a text-based LLM under the zero-shot setting. In contrast, our approach works well with both types of LLMs, and the code-based version gives it an additional boost in performance. Compared with "text", the "code" model is further fine-tuned on code (Chen et al., 2021). We conjecture that table generation resembles the code generation process: both involve structured procedures that are highly organized and follow a step-by-step process. Comparing Tab-CoT with conventional CoT, we conclude that our proposed table-generation prompt is able to better unlock the strong reasoning abilities within the code-based LLM.
Based on the above main experiments, we choose to use "code" as the default LLM for all subsequent experiments unless otherwise specified.

Importance of Scheme Design
To understand the significance of our proposed table scheme design, we evaluate the performance of "|step|subquestion|process|result|" along with four variations, each obtained by removing one of the four columns as an ablation.
The results in Table 4 show that each column of "|step|subquestion|process|result|" is crucial. We notice that removing the column "step" from our scheme results in the most significant performance drop. This implies that although "step" only contains a number indicating which step a row represents, it organizes the table in sequential order over the rows. The column "subquestion" is also important: removing it from the scheme shows an average performance drop of 5.4%. The "subquestion" column forms step-by-step instructions vertically, indicating the subquestion under consideration at each step. The "step" and "subquestion" columns thus play important roles in maintaining the structure of the table, building vertical connections across rows.
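The ablated variants can be generated mechanically. The helper below is a hypothetical illustration of this setup, not code from the paper:

```python
# Hypothetical helper reproducing the ablation setup described above:
# drop one column at a time from the full scheme to obtain four variants.

def ablate_scheme(scheme: str = "|step|subquestion|process|result|"):
    cols = scheme.strip("|").split("|")
    return ["|" + "|".join(c for c in cols if c != drop) + "|"
            for drop in cols]
```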

Effectiveness of Self-Consistency
The self-consistency (Wang et al., 2022) decoding strategy was shown to obtain better results by generating and exploring multiple, diverse reasoning paths. We adopt a similar approach here. In the original self-consistency paper, up to 40 reasoning paths were considered; we show the feasibility of using only 3 paths in our work. This is conveniently achieved by using 3 different prompts: we select two further table schemes besides the standard scheme. One is a highly similar prompt, which we expect to perform similarly well, and the other is less similar, which we expect to yield worse performance (based on Sec 5.2). They are shown in Table 3. We then perform majority voting over the outputs from these 3 prompts. Interestingly, although a prompt with worse performance is used in the voting process, the overall performance improves. This shows the benefit of integrating different table schemes for such tasks, which helps improve the overall robustness of the approach.
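This lightweight voting procedure can be sketched as follows. The two alternate schemes listed here are illustrative stand-ins (the actual three schemes are in Table 3), and `answer_with_scheme` is a hypothetical wrapper around the full generation-plus-extraction pipeline:

```python
from collections import Counter

# A sketch of the 3-prompt voting procedure: run Tab-CoT once under each
# table scheme and take the majority answer. `answer_with_scheme` stands
# in for the generation-plus-extraction pipeline and is supplied by the
# caller; the alternate schemes below are illustrative.

SCHEMES = [
    "|step|subquestion|process|result|",    # standard scheme
    "|step|subquestion|procedure|result|",  # highly similar variant
    "|step|question|response|",             # less similar variant
]

def self_consistent_answer(question, answer_with_scheme):
    answers = [answer_with_scheme(question, s) for s in SCHEMES]
    return Counter(answers).most_common(1)[0][0]
```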

Few-shot Tab-CoT
Tab-CoT shows impressive reasoning ability under the zero-shot setting: it can generate a structured output in the form of a table that enables the chain-of-thought reasoning process without few-shot samples. Tables are capable chain-of-thought carriers, but can they also serve as good chain-of-thought teachers? To answer this question, we evaluated Tab-CoT under the few-shot setting. For a fair comparison, we use the same few-shot sample questions described in Wei et al. (2022) (listed in Appendix D), and we use "|step|subquestion|process|result|" as the table scheme when representing few-shot samples. The results are reported in Table 5: our method outperformed few-shot CoT by 1% on average. While the performance difference between Tab-CoT and CoT on the other datasets is below 2%, the difference on SVAMP is 6.5%. The large improvement on SVAMP is likely related to the selection of few-shot samples, because Wei et al. (2022) select 8 sample questions from SVAMP for all arithmetic reasoning tasks except AQUA.
Table 6: Case studies (on MultiArith) of the tables generated from "code-davinci-002"/"text-davinci-002". The results returned after applying the answer extraction prompts are in bold. Additional case studies are in Appendix C.

Question 1 (from Table 6): "Gretchen has some coins. There are 30 more gold coins than silver coins. If she had 70 gold coins, how many coins did Gretchen have in total?" (GT: 110) Conventional CoT: "Let's think step by step. If Gretchen had 70 gold coins, then she would have 30 silver coins (70 - 30 = 40). So Gretchen would have a total of 100 coins (70 + 30 = 100)."

Case Studies
The main experimental results show that "code" underperforms "text" with conventional CoT but yields better results with our Tab-CoT. To understand this better, we conduct case studies comparing their generated tables in Table 6.
While "code" only generated short text snippets or formulas under "process", the words generated by "text" under the same column tend to form complete sentences whenever possible. As we mentioned earlier, "code" is an LLM that is further fine-tuned on code (Chen et al., 2021). This explains why it appears more amenable to the tabular-structured format of the output. In question 1, the model with "text" overwrites the generated "subquestion" by asking another question; thus, the "result" fails to answer the "subquestion" in the same row. In question 2, "text" generated 5 steps while "code" only took 3, and the "subquestion" generated by "text" is ambiguous (e.g., "what is the known information?"). In question 3, "text" presents a wrong reasoning order. Overall, "code" shows better reasoning ability by demonstrating a more concise and straightforward reasoning process.

Additional Experiments
We further evaluate our methods on symbolic reasoning and commonsense reasoning tasks. We also conducted new experiments based on the GPT-3.5 model to understand our approach's effectiveness on such newer models. With these additional experiments, we hope to draw further insights into our approach.

Symbolic Reasoning
We evaluate Tab-CoT on two symbolic reasoning datasets: Coin Flip (CF) and Last Letter (LL). Unlike the arithmetic reasoning tasks, these tasks focus on specific problems. This also opens up the opportunity to examine whether the specificity of the table scheme has an impact on the reasoning process in such tasks.
To this end, we split table schemes into three categories: (1) general: a table scheme that can be applied to most text questions; (2) domain-specific: a table scheme that can be adapted to a specific domain; (3) task-specific: a scheme that can only be adopted by a single task.
Our experiments in Table 7 illustrate that the specificity of the table schemes highly affects the performance of symbolic reasoning tasks. One may expect the performance to increase as the table scheme becomes more task-specific. Our task-specific scheme outperformed zero-shot CoT in both tasks. However, increased specificity does not always lead to higher accuracy. In the Coin Flip task, we noticed that another task-specific scheme, "|step|initial coin state|flip or not|next coin state|", only achieves an accuracy of 68.0%. To understand this, we investigate their reasoning flows in Figure 4. Although the left scheme is more task-specific, it largely disables vertical reasoning in the table. While the right scheme is general, it effectively enables reasoning along both vertical and horizontal directions, leading to significantly better results. (We further evaluate the general scheme under the one-shot setting; the results are in Appendix A.)

Table 10: A comparison between the different sizes of "code"; "Average" is the average score across six datasets.
Commonsense Reasoning

As another set of additional experiments, we further evaluate our method on commonsense reasoning, including CommonsenseQA (Talmor et al., 2019) and StrategyQA (Geva et al., 2021). The results are in Table 8. Tab-CoT obtained the highest average accuracy. However, our method did not show significantly improved performance compared with standard prompting in the few-shot setting. These results imply that commonsense reasoning tasks do not have a fixed answering pattern; therefore, providing chain-of-thought samples is not enough to make up for the lack of commonsense knowledge. For a fair comparison, we use the same few-shot questions listed in Wei et al. (2022).
Results on GPT-3.5

We test our method on the recent model "GPT-3.5-turbo-0301" in Table 9. We found that our method is applicable to GPT-3.5 and achieves better performance compared to conventional zero-shot CoT. Another interesting observation is that, when prompting the GPT-3.5 model with "Let's think step by step", a large number of the generated texts already contain a table in their CoT process. Based on our observations, however, those tables can be different from the ones generated with our method: they appear to be used mostly to organize information related to the question rather than to present reasoning steps.

Ablation Studies
Model Sizes

Kojima et al. (2022) evaluated the family of GPT-3 models at four different sizes: 2.7B, 6.7B, 13B, and 175B parameters. The results show that only the largest model ("text-davinci-002") exhibits chain-of-thought reasoning ability.
We compare the performance of the smaller model "code-cushman-001" (13B) with "code-davinci-002" (175B). Similar to zero-shot CoT, smaller models do not show the ability to conduct chain-of-thought reasoning. The performance of "code-cushman-001" cannot reach 10%, except on AQUA (a multiple-choice dataset with 5 choices for each question). The experimental results are reported in Table 10.

Structure-Promoting Scheme

As shown in Table 4, we compare the performance when any column is removed from "|step|subquestion|process|result|". The detailed experimental results are reported in Table 11. The results suggest that each column of our proposed scheme is important, because removing any column leads to a drop in performance.

Discussion
Our experimental results confirmed the effectiveness of our proposed tabular chain-of-thought method under both zero-shot and few-shot settings.We summarize several advantages of our method compared to conventional chain-of-thought methods and list them below.
Tab-CoT generates a table illustrating the reasoning process, which is more organized. As can be seen from Table 6, this nature of the generated text makes the reasoning process much easier to follow.
Additionally, from Figure 4, we conclude that Tab-CoT encourages a more structured reasoning process to be explicitly modelled. As a 2-dimensional data structure, tables enable both horizontal reasoning along rows and vertical reasoning along columns.
Practically, table schemes are also easy to craft. Designing a specific table generation prompt typically involves deciding on concise header names without concern for grammar. It is thus less cumbersome than choosing a natural language prompt from a diverse set of candidates.
Overall, we argue that under current state-of-the-art LLMs, table schemes are natural prompts that are well suited for zero-shot learning.

Conclusion
In this paper, we propose Tab-CoT, a novel prompting framework that performs effective zero-shot reasoning by generating a table.
Tab-CoT shows competitive results on arithmetic reasoning tasks under both zero-shot and few-shot settings. We further conducted comprehensive experiments across different reasoning tasks under different settings, which revealed specific benefits of our method and identified the optimal way to use it. We hope that, through our work, we can spark new ideas and provide inspiration to our community.
In the future, we would like to explore methods to automate the scheme selection process, using generated schemes to meet task-specific requirements. Future work also includes integrating external calculators (Gao et al., 2022) or task-specific supervision (Zhou et al., 2022) into the learning process, under both zero-shot and few-shot settings.
Our Tab-CoT also provides a straightforward decomposition of the intermediate thought process.This highly structured chain of thought produced by our approach may help people to observe and interpret how large language models decompose complex problems.We believe our proposed method can help reveal the underlying mechanisms associated with the emergence of certain complex behaviours associated with large language models.

Limitations
We identify a few limitations of this work. First, our approach relies on language models pretrained on data that includes tables, which may not be the case for all language models (especially small ones). Second, our approach's limited improvement on commonsense reasoning tasks suggests that its effectiveness may depend on the specific task and the level of structured reasoning required.

A One-shot Reasoning on Symbolic Reasoning
We evaluate our method on Coin Flip (Table 21) and Last Letter (Table 22) under the one-shot setting. As shown in Table 13, by adding one few-shot sample, LLMs gain a significant performance boost on both tasks with the general scheme "|step|subquestion|process|result|".

B Additional Few-shot Results
We evaluate our method on commonsense reasoning tasks under the few-shot setting. Our model performs slightly better in terms of average accuracy.
The results are reported in Table 12.

C Additional Case Studies
We show some errors our method made in arithmetic reasoning tasks through further case studies.
The results are reported in Tables 14 and 16.

D Few-Shot Samples
We list our few-shot samples for all arithmetic reasoning tasks (Table 17 and Table 18), CommonsenseQA (Table 19), and StrategyQA (Table 20). We use the same few-shot sample questions from Wei et al. (2022).

Figure 1 :
Figure 1: A comparison between Tab-CoT, standard prompting, and zero-shot CoT on the same question. Chain-of-thought prompts are highlighted in orange.
[...]|What is the evidence?|Hamsters are prey animals. Prey are food for predators.|yes|
Therefore, the answer (yes or no) is yes.

Could Brooke Shields succeed at University of Pennsylvania?
|step|subquestion|process|result|
|-|-|-|-|
|1|What is the evidence?|Brooke Shields went to Princeton University. Princeton University is about as academically rigorous as the University of Pennsylvania. Thus, Brooke Shields could also succeed at the University of Pennsylvania.|yes|
Therefore, the answer (yes or no) is yes.

Yes or no: Hydrogen's atomic number squared exceeds number of Spice Girls?
|step|subquestion|process|result|
|-|-|-|-|
|1|What is the evidence?|Hydrogen has an atomic number of 1. 1 squared is 1. There are 5 Spice Girls. Thus, Hydrogen's atomic number squared is less than 5.|no|
Therefore, the answer (yes or no) is no.

Yes or no: Is it common to see frost during some college commencements?
|step|subquestion|process|result|
|-|-|-|-|
|1|What is the evidence?|College commencement ceremonies can happen in December, May, and June. December is in the winter, so there can be frost. Thus, there could be frost at some commencements.|yes|
Therefore, the answer (yes or no) is yes.

Yes or no: Could a llama birth twice during War in Vietnam (1945-46)?
|step|subquestion|process|result|
|-|-|-|-|
|1|What is the evidence?|The War in Vietnam was 6 months. The gestation period for a llama is 11 months, which is more than 6 months. Thus, a llama could not give birth twice during the War in Vietnam.|no|
Therefore, the answer (yes or no) is no.

Yes or no: Would a pear sink in water?
|step|subquestion|process|result|
|-|-|-|-|
|1|The density of a pear is about 0.6 g/cm3, which is less than water. Objects less dense than water float. Thus, a pear would float.|no|
Therefore, the answer (yes or no) is no.

Table 20: Few-shot samples for StrategyQA

One-shot Sample Used on Coin Flip: A coin is heads up. Dorian flips the coin. Mayra flips the coin. Freddie does not flip the coin. Magaly flips the coin. Is the coin still heads up? Note that "flip" here means "reverse".
|step|subquestion|process|result|
|-|-|-|-|
|1|Is the coin heads up?|Dorian flips the coin.|The coin is tails up.|
|2|Is the coin heads up?|Mayra flips the coin.|The coin is heads up.|
|3|Is the coin heads up?|Freddie does not flip the coin.|The coin is heads up.|
|4|Is the coin heads up?|Magaly flips the coin.|The coin is tails up.|
Therefore, the answer (Yes or No) is "No".

Table Generation Prompt: |step|subquestion|procedure|result|

Figure 2: The two steps of our method: (1) table generation and (2) answer extraction. Added prompts are highlighted in orange. Texts generated by the LLM are highlighted in green.

Table 1: Tasks and data.

Table 2 :
Zero-shot results on the arithmetic datasets. All methods use the same answer extraction prompt on these datasets for a fair comparison, and all are evaluated under the zero-shot setting.

Table 3 :
Zero-shot performance comparison between the three schemes (and with self-consistency).

Table 4 :
Performance if a column is removed from the scheme (detailed results are in Appendix A).

Table 5 :
Few-shot results on the arithmetic datasets.
The "text" model overwrites the question in the process, thus answering the wrong question. The "text" model contains redundant steps in the table and wrong calculations. Question: A pet store had 102 puppies. In one day they sold 21 of them and put the rest into cages with 9 in each cage. How many cages did they use? (GT: 9) CoT: We know that the pet store had 102 puppies and that 21 of them were sold. That means we have 81 puppies left. Now, if 9 puppies fit in each cage, we can divide 81 by 9 to find out how many cages are needed. 81 / 9 = 9. So, the pet store used 9 cages. The "text" model has a wrong reasoning order, and should add up before dividing.

Figure 4: The schemes that disable (left) and enable (right) potential 2-dimensional reasoning.

Table 7 :
Effect of different specificities of schemes. We use zero-shot CoT with the "text" model as the baseline (as zero-shot CoT works better with the "text" model).

Table 8 :
Results on commonsense reasoning.

Table 11 :
Performance if a column is removed from the scheme.

Table 21 :
One-shot sample used on Coin Flip

One-shot Sample Used on Last Letter: Take the last letters of each of the words in "Lucky Mireya Jj Kc" and concatenate them.

Table 22 :
One-shot sample used on Last Letter