Chain-of-Thought Reasoning in Tabular Language Models

Introduction
The tabular mathematical reasoning task aims to answer math questions over heterogeneous tabular and textual data, providing users with insights from tables that contain valuable figures (Lu et al., 2023b; Zhu et al., 2021; Chen et al., 2021b). The task demands multi-step mathematical reasoning that combines information look-up and numerical calculation. For example, given the table and the question in Figure 1, we first need to count how many numbers are in the table, then add all the numbers together to get the sum of baskets, and finally compute the mean of the sum. Considering this inherent demand for multi-step operations, existing studies tend to extend chain-of-thought (CoT) reasoning (Wei et al., 2022; Wang et al., 2023a; Kojima et al., 2022; Zhang et al., 2022) to powerful Large Language Models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; Thoppilan et al., 2022; Chen et al., 2021a) to promote multi-hop mathematical reasoning. As depicted in Figure 2 (b), this paradigm prompts LLMs with several in-context examples containing CoT demonstrations to elicit intermediate reasoning steps before inferring the final answer.

Though the combination of LLMs and CoT prompting achieves strong performance, such LLM-based methods may not be feasible in some real-world scenarios. For instance, it is financially expensive to satisfy the high computational, storage, and bandwidth requirements of LLMs, which makes it challenging for individual users or small organizations to use LLMs in their applications (Strubell et al., 2019; Bender et al., 2021). For data-security reasons, enterprises may also prefer privatized deployments where private data is not allowed to be processed by third-party LLM APIs. Moreover, although many pre-trained tabular language models have been developed (Liu et al., 2022; Herzig et al., 2020; Wang et al., 2021; Dong et al., 2022), their CoT reasoning ability has not been thoroughly investigated and may be inadequate for the tabular mathematical reasoning task. As a result, an alternative approach with lower costs and competitive CoT reasoning ability is needed.
To achieve this goal, we revisit small-scale tabular language models (TaLMs) and explore chain-of-thought reasoning in TaLMs for the first time. Specifically, we propose a novel framework named TaCo, which coordinates two TaLMs responsible for CoT generation and answer inference, respectively. Given the input table and question, the first TaLM is fine-tuned to generate intermediate reasoning steps. Based on the original input and the generated reasoning steps, the second TaLM is fine-tuned to infer the final answer. To alleviate the weakness of TaLMs in solving mathematical expressions, TaCo is further combined with an external calculator that performs math calculations and fixes incorrect results in the output reasoning steps.
To verify the effectiveness of the proposed method, we conduct comprehensive experiments on the TABMWP (Lu et al., 2023b) dataset, the latest math word problem benchmark over tabular data, which provides detailed chain-of-thoughts for solving each problem step by step. Experimental results reveal that TaCo opens a new and promising paradigm for tabular mathematical reasoning, illustrated in Figure 2 (c). Compared with traditionally fine-tuned TaLMs, TaCo improves the accuracy of the recent TAPEX model by 29.76%. Compared with LLM-based approaches, TaCo outperforms the state-of-the-art ChatGPT by 9.55% (82.60% → 92.15%) with far fewer parameters (0.8B). Moreover, we conduct ablation studies to analyze the contributions of different parts of the framework, and we perform a detailed error analysis to provide insights for future improvements.
To summarize, our contributions are as follows:
• To the best of our knowledge, we explore chain-of-thought reasoning in TaLMs for the first time, and advocate a new and promising paradigm for tabular mathematical reasoning, especially in scenarios where LLM-based methods are not feasible.
• We propose a novel framework, TaCo, which coordinates two TaLMs responsible for CoT generation and answer inference, respectively, and integrates an external calculator to ensure accurate numerical calculation.
• Our method boosts the performance of small-scale TaLMs and surpasses the state-of-the-art ChatGPT by 9.55% on the TABMWP benchmark with far fewer parameters (0.8B).

Pilot Experiment
Before diving into the specific method, we present a pilot experiment on the TABMWP dataset to answer two questions: (i) Do existing pre-trained generative TaLMs possess chain-of-thought reasoning ability? (ii) Can generative TaLMs benefit from chain-of-thoughts when predicting the final answer? We select the state-of-the-art TAPEX model (Liu et al., 2022) for the experiments; it is based on the encoder-decoder language model BART (Lewis et al., 2020) and is additionally pre-trained on tabular data. We consider two model sizes: TAPEX-base (140M) and TAPEX-large (400M). Experiments are conducted in three settings, i.e., vanilla, zero-shot CoT and gold CoT. In the "vanilla" setting, the pre-trained TAPEX model f(·) autoregressively generates the answer a based on the table t and the question q, i.e., a = f(t, q). In the "zero-shot CoT" setting, we follow Kojima et al. (2022) to evaluate the CoT reasoning of TAPEX. Specifically, a trigger sentence p1 is appended to the question to ask TAPEX to output intermediate reasoning steps s, i.e., s = f(t, q, p1). Then, given the original input and the generated CoT, another trigger sentence p2 is appended to make TAPEX output the final answer a, i.e., a = f(t, q, p1, s, p2). For p1, we try various templates such as "Let's think step by step" and report the best results. For p2, we intuitively select "As a result, the answer is" as the trigger sentence. In the "gold CoT" setting, we replace the generated reasoning steps s with the annotated gold chain-of-thought when inferring the final answer.

From the results in Table 1, we can see that TAPEX in the "zero-shot CoT" setting performs even worse than the vanilla one, which shows that the small-scale TAPEX is not a decent zero-shot reasoner like LLMs and does not possess CoT reasoning ability. This is consistent with findings from previous CoT studies (Wei et al., 2022; Ho et al., 2023). After inspecting the model outputs, we find that the pre-trained TAPEX model cannot follow the instruction to generate reasoning steps; in most cases, it directly generates the answer or illogical text. However, given the annotated "gold CoT", the model achieves a remarkable performance gain. For instance, the accuracy of TAPEX-large on the test set increases from 18.59% to 48.01%. This demonstrates that CoT reasoning steps help TAPEX infer the correct answer, and it encourages us to further elicit the CoT reasoning ability of TaLMs by fine-tuning.
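For concreteness, the two-pass "zero-shot CoT" procedure can be sketched as follows. This is a minimal illustration assuming the Hugging Face TAPEX checkpoints and the TapexTokenizer interface; the trigger sentences follow the paper, while the table content, decoding lengths and checkpoint name are placeholders.

```python
import pandas as pd
from transformers import TapexTokenizer, BartForConditionalGeneration

tokenizer = TapexTokenizer.from_pretrained("microsoft/tapex-base")
model = BartForConditionalGeneration.from_pretrained("microsoft/tapex-base")

# Toy table and question (placeholders, not the actual TABMWP example).
table = pd.DataFrame({"Name": ["Amy", "Bob", "Cara"], "Baskets": ["49", "48", "51"]})
question = "What is the mean of the numbers of baskets?"

# Pass 1: s = f(t, q, p1) -- ask the model for intermediate reasoning steps.
p1 = "Let's think step by step."
enc = tokenizer(table=table, query=f"{question} {p1}", return_tensors="pt")
steps = tokenizer.batch_decode(
    model.generate(**enc, max_length=128), skip_special_tokens=True
)[0]

# Pass 2: a = f(t, q, p1, s, p2) -- append the generated steps and an answer trigger.
p2 = "As a result, the answer is"
enc = tokenizer(table=table, query=f"{question} {p1} {steps} {p2}", return_tensors="pt")
answer = tokenizer.batch_decode(
    model.generate(**enc, max_length=32), skip_special_tokens=True
)[0]
print(answer)
```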

Method
Based on the observations in Section 2, we propose the TaCo framework for tabular mathematical reasoning. It includes two training stages, (i) CoT generation and (ii) answer inference, where two generative TaLMs with the same architecture are fine-tuned independently with different inputs and outputs. In this section, we introduce the framework with the TAPEX model as the backbone, but it should be noted that TaCo is compatible with arbitrary generative TaLMs to boost their performance. An overview of the TaCo framework is illustrated in Figure 3.

CoT Generation
In the CoT generation stage, a TAPEX model is fine-tuned to generate a solution consisting of multiple reasoning steps that solve the problem. Given an input table T with M rows {R_i}_{i=1}^M and N column headers {c_j}_{j=1}^N, TAPEX linearizes the table into a flattened text sequence T* that concatenates the column headers and the rows in order. The resulting sequence T* is concatenated with the textual context, which includes a question Q and a trigger sentence P. Based on the concatenated input, the probability of generating the target solution S = (s_1, ..., s_L) is computed as

p(S | T*, Q, P) = ∏_{i=1}^{L} p(s_i | T*, Q, P, s_{<i}),

where L is the length of the target solution. We select "Let's think step by step" as the trigger sentence P since it gives the best performance in the pilot experiments.
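As an illustration, the flattening step can be sketched as below. The exact separator tokens are an assumption modeled on TAPEX-style linearization, and the concatenation order of question, trigger and table is likewise illustrative; the paper does not spell out the format here.

```python
def flatten_table(headers, rows):
    """Flatten a table with column headers and rows into a single text
    sequence T* (TAPEX-style 'col : ... row i : ...' format is assumed)."""
    parts = ["col : " + " | ".join(headers)]
    for i, row in enumerate(rows, start=1):
        parts.append(f"row {i} : " + " | ".join(str(v) for v in row))
    return " ".join(parts)


# The flattened table T* is concatenated with the question Q and the trigger P.
headers = ["Name", "Baskets"]
rows = [["Amy", 49], ["Bob", 48], ["Cara", 51]]
question = "What is the mean of the numbers of baskets?"
trigger = "Let's think step by step."
model_input = f"{question} {trigger} {flatten_table(headers, rows)}"
print(model_input)
```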
After generating a potential solution S̃, we find that S̃ often contains numerical calculation errors. This is common for language models, because TaLMs and even LLMs are not well suited to actually solving mathematical expressions (Chen et al., 2022). Take the generated solution in Figure 3 as an example: though the model generates plausible reasoning steps, the calculation results among these steps are all wrong (in red), e.g., "49 + 48 + 51 + 54 + 37 + 49 = 312". Such calculation errors accumulate up to the last reasoning step and seriously mislead the answer inference model into predicting a false answer.
To mitigate the influence of calculation mistakes, we introduce an arithmetic calculator g(·) that solves mathematical expressions over "+, -, ×, ÷" in the generated solution S̃ and outputs a corrected solution Ŝ = g(S̃). Concretely, we extract equation strings in S̃ using regular expressions and calculate their results with the Python eval function. Since multiple equations may exist in one solution and one equation may refer to the results of previous equations, the calculation result of each equation is propagated to the following equations by string replacement. As shown in Figure 3, the originally wrong results in S̃ are successfully fixed and replaced with correct results (in green), e.g., "49 + 48 + 51 + 54 + 37 + 49 = 288".
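A minimal sketch of such a calculator is shown below, assuming equations written as "a + b + ... = c" with decimal numbers; the regular expression, rounding rule and propagation-by-replacement heuristic are illustrative simplifications rather than the authors' exact implementation.

```python
import re

# Matches an arithmetic left-hand side (numbers joined by + - * / x × ÷) and its result.
EQUATION = re.compile(
    r"((?:\d+(?:\.\d+)?\s*[+\-*/x×÷]\s*)+\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)"
)

def fix_calculations(solution: str) -> str:
    """Re-evaluate arithmetic equations in a generated solution, overwrite wrong
    right-hand sides, and propagate corrected values to later steps."""
    corrected = solution
    changed = True
    while changed:  # repeat until no equation needs fixing (handles propagation)
        changed = False
        for match in EQUATION.finditer(corrected):
            lhs, old_rhs = match.group(1), match.group(2)
            expr = lhs.replace("x", "*").replace("×", "*").replace("÷", "/")
            try:
                value = eval(expr)  # safe here: expr contains only digits and operators
            except (SyntaxError, ZeroDivisionError):
                continue
            new_rhs = str(int(value)) if float(value).is_integer() else f"{value:.2f}"
            if new_rhs != old_rhs:
                # Fix this equation, then propagate the corrected value downstream
                # by plain string replacement, and re-scan the updated text.
                corrected = (
                    corrected[:match.start()]
                    + f"{lhs.strip()} = {new_rhs}"
                    + corrected[match.end():].replace(old_rhs, new_rhs)
                )
                changed = True
                break
    return corrected

print(fix_calculations("49 + 48 + 51 + 54 + 37 + 49 = 312. The mean is 312 / 6 = 52."))
# -> "49 + 48 + 51 + 54 + 37 + 49 = 288. The mean is 288 / 6 = 48."
```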

Answer Inference
In the answer inference stage, another TAPEX model is fine-tuned to generate the final answer based on the original input and the annotated solution S. Similar to the CoT generation stage, the probability of generating the target answer A = (a_1, ..., a_N) is computed as

p(A | T*, Q, P, S) = ∏_{j=1}^{N} p(a_j | T*, Q, P, S, a_{<j}),

where N is the length of the target answer. During the inference phase, the annotated solution is replaced with the corrected solution Ŝ to produce the predicted answer Ā. Both the CoT generation model and the answer inference model are trained with a standard language modeling objective.
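Putting the two stages together, inference could look like the following sketch. It is a toy wrapper that reuses the TapexTokenizer-style interface from Section 2's sketch and the fix_calculations helper sketched above; the trigger sentence, decoding lengths and variable names are assumptions.

```python
def taco_predict(table, question, cot_model, ans_model, tokenizer):
    """Two-stage inference: (1) generate a solution, (2) correct its arithmetic
    with the external calculator, (3) infer the final answer from the corrected
    solution."""
    trigger = "Let's think step by step."
    # Stage 1: the CoT generation model produces a candidate solution S~.
    enc = tokenizer(table=table, query=f"{question} {trigger}", return_tensors="pt")
    solution = tokenizer.batch_decode(
        cot_model.generate(**enc, max_length=256), skip_special_tokens=True
    )[0]
    # External calculator fixes arithmetic in the solution (S^ = g(S~)).
    solution = fix_calculations(solution)
    # Stage 2: the answer inference model reads the corrected solution.
    enc = tokenizer(
        table=table, query=f"{question} {trigger} {solution}", return_tensors="pt"
    )
    return tokenizer.batch_decode(
        ans_model.generate(**enc, max_length=64), skip_special_tokens=True
    )[0]
```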

Dataset and Evaluation Metric
Experiments are conducted on the TABMWP (Lu et al., 2023b) dataset, a recent large-scale benchmark constructed from grade-level math curricula that contains 38,481 math word problems with tabular context. Besides the gold answers, TABMWP provides detailed step-by-step solutions for the problems, which can be used as chain-of-thoughts for fine-tuning TaLMs. There are two question types in TABMWP: 28,719 free-text questions with integer (INT) and decimal (DEC) answers, and 9,712 multi-choice questions with extractive text (EXTR), Boolean text (BOOL) and other text (OTH) answers. DEC questions are more essential to the overall accuracy.

Statistics of each split are shown in the
Given the predicted answer and the ground truth, we employ exact match accuracy as the metric and use the official evaluation script to evaluate model performance.
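As a rough illustration only, exact-match scoring boils down to a normalized string comparison; the official TABMWP script performs additional normalization (e.g., of numbers and answer options) that this toy version omits.

```python
def exact_match(pred: str, gold: str) -> bool:
    """Toy exact-match check: case- and whitespace-insensitive string equality."""
    return pred.strip().lower() == gold.strip().lower()

preds = ["48", "yes", "$5.00"]
golds = ["48", "Yes", "$5.50"]
accuracy = sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)
print(f"exact match accuracy: {accuracy:.2%}")  # -> 66.67%
```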

Implementation Details
Implementations. Our framework is implemented with PyTorch (Paszke et al., 2019). We mainly employ TAPEX (Liu et al., 2022) as the backbone TaLM in the proposed framework, and we also replace TAPEX with UnifiedQA (Khashabi et al., 2020) for comparison.

The "Heuristic guess" is a baseline from the TABMWP paper: for multi-choice questions, it randomly selects one of the given options with even probability; for free-text questions, it randomly chooses one number from the question or the table as the prediction.

Main Results
Table 3 presents the main experimental results on the TABMWP dataset. For the TAPEX, UnifiedQA and ChatGPT baselines, we report results based on our implementation. For the other baselines, we report published results from the original papers (Lu et al., 2023b; Chen et al., 2022).
From the results in Table 3, we can find that: (1) With two TAPEX-large models as backbones, the TaCo framework establishes a new state-of-the-art accuracy of 92.15% on the TABMWP test set, outperforming the previous best model, ChatGPT with CoT prompting, by 9.55%, which demonstrates the effectiveness of the proposed method. Notably, compared with LLMs such as GPT-3 and Codex, the TaCo framework has far fewer parameters (0.8B), which lowers deployment costs. (2) Compared with LLM-based approaches using standard few-shot prompting, fine-tuned TAPEX and UnifiedQA achieve competitive results; for instance, fine-tuned TAPEX-large even performs better than GPT-3 and Codex. However, when combined with CoT prompting, LLM-based methods are significantly better than fine-tuned small-scale language models, which shows that CoT prompting plays an important role in the tabular mathematical reasoning task. By contrast, the TaCo framework extends CoT reasoning to TaLMs for the first time, and improves the performance of the TAPEX-base and TAPEX-large models by 29.19% and 29.76%, respectively.
(3) Among the different baselines, performance on free-text questions is noticeably worse than on multi-choice questions, with an average difference of 21%. The reason is that, compared with multi-choice questions, free-text questions usually require more complicated numerical calculations and do not directly provide answer options in the input; detailed evidence is presented in Appendix B. Nevertheless, from pre-trained LMs to LLM+CoT and then to the proposed TaCo framework, the performance gap between the two question types gradually decreases. For instance, the accuracy gap of the TaCo (TAPEX-large) framework (1.78%) is much lower than that of fine-tuned TAPEX-large (26.58%), which shows that our method generalizes better across the two question types. (4) Considering questions of various answer types, the TaCo framework beats the other baselines on questions with integer (INT) and decimal (DEC) answers, which may result from the use of the external calculator. ChatGPT with CoT prompting outperforms the other methods, including the human baseline, on questions with Boolean text answers, which may be attributed to its strong general semantic understanding ability, e.g., judging yes/no questions based on previously generated reasoning steps. (5) Not surprisingly, all models perform worse on questions from grades 7-8 than on those from grades 1-6 due to the increasing difficulty. Among them, the proposed framework achieves the best accuracy on the harder questions from grades 7-8.

Ablation Study
We conduct ablation experiments to systematically investigate the effect of the external calculator, the progressive two-stage paradigm and the TaLM backbone.

Effect of External Calculator. As shown in Table 4, there is a drastic performance drop for the TaCo framework (e.g., 92.15% → 74.58%) when removing the external calculator. With further observations, we find that the performance decline mainly comes from free-text questions, which demand more numerical calculations. For instance, the accuracy of TaCo (TAPEX-large) on free-text questions plummets from 91.69% to 67.77%. This demonstrates the great significance of using the external calculator to reduce calculation errors in the generated solutions.
Otherwise, the answer inference model is likely to be misled by the incorrect solution and arrive at a wrong answer.
Effect of Two-stage Paradigm. When we change the two-stage paradigm to one-stage ones, the model performance drops by about 9.5%, which reveals the contribution of the two-stage paradigm. We believe it is challenging for a single small-scale TaLM to generate correct reasoning steps and the final answer simultaneously; as a result, we delegate CoT generation and answer inference to two TaLMs, respectively. More importantly, one-stage paradigms cannot fully utilize the corrected CoT to change the original (wrong) answer. By contrast, the two-stage paradigm offers a second chance to re-contemplate the improved reasoning steps before making the final judgement. A similar two-stage paradigm has also been explored in recent works (Press et al., 2023; Zhao et al., 2023), where one LLM generates the CoT to be improved, and the same LLM then infers the final answer based on the improved CoT.
Comparing the two one-stage paradigms, we notice that QT → SA performs better than QT → AS. This suggests that it is more suitable for TaLMs to infer the final answer from produced reasoning steps than to give explanations based on a predicted final answer. If we remove both the two-stage paradigm and the external calculator, the model performance suffers an even steeper decline, but it is still better than that of traditionally fine-tuned models in the QT → A paradigm, which validates the value of intermediate reasoning steps for TaLMs.
Effect of TaLM Backbone. To investigate the performance of TaCo with different backbones, we replace TAPEX with UnifiedQA as the backbone model; the results are presented in Table 5. When the backbone changes from TAPEX to UnifiedQA, the TaCo framework suffers a sharp performance drop on both free-text and multi-choice questions. For instance, even with more parameters (1.54B), the accuracy of TaCo with UnifiedQA-large on the test set (76.96%) is much lower than that with TAPEX-large (92.15%), which indicates the advantage of pre-trained tabular language models. Unlike UnifiedQA, which is pre-trained solely on unstructured text, TAPEX is additionally pre-trained on tabular data and thus has a better understanding of table structures. As more powerful generative TaLMs emerge, they can be integrated into the TaCo framework to improve performance on the tabular mathematical reasoning task.

Error Analysis and Case Study
As illustrated in Figure 6, for a problem that involves two multiplication operations and one addition operation, the TaCo framework successfully generates the correct intermediate reasoning chain and predicts the right answer.
There are 473 free-text questions (78%) and 130 multi-choice questions (22%) for which TaCo (TAPEX-large) gives wrong predictions. We randomly selected 100 questions of each type for error analysis. Figure 4 depicts the error distributions by question type. More error instances are presented and discussed in Appendix C.
For free-text questions, error cases fall into the following four categories. (1) Counting operation (49%): the question requires the model to count numbers as the final answer, which is challenging for generative language models. (2) Fraction calculation (36%): the model fails to conduct fraction-related calculations such as reducing a fraction, which may be alleviated with a more advanced calculator. (3) Wrong formula (11%): the CoT generation model outputs wrong formulas in the reasoning steps. (4) Function-related problem (4%): the model fails to solve problems related to functions, e.g., computing the slope of a function based on the table data.
For multi-choice questions, error cases can be divided into the following five types. (1) Number comparison (44%): the model cannot determine which number is larger or smaller. (2) Time calculation (21%): the model needs to perform time calculations, such as computing the elapsed time between 9:15 A.M. and 11:20 A.M. (3) Max/Min operation (19%): the question demands finding the biggest or smallest number in a group. (4) False CoT (9%): the CoT generation model gives wrong or hallucinated reasoning steps, e.g., using numbers that do not exist in the table or the question when generating formulas. (5) Commonsense (7%): commonsense knowledge is needed to answer the question, which is a weakness of small-scale language models.

Related Work
CoT prompting for LLMs. By providing a few in-context examples (or demonstrations) that contain chain-of-thoughts, CoT prompting can encourage LLMs to output intermediate reasoning steps before predicting the final answer (Wei et al., 2022). Existing CoT studies mainly focus on two directions. (1) Improving the quality of CoT demonstrations, for instance by selecting better in-context examples for CoT prompting according to question diversity (Zhang et al., 2022), solution complexity (Fu et al., 2023), or example similarity (Rubin et al., 2022). (2) Exploring new representations of CoT reasoning steps. Besides the typical natural language format, researchers have also proposed chain-of-thoughts in other formats, for instance program-of-thoughts (Chen et al., 2022), tree-of-thoughts (Yao et al., 2023a), and graph-of-thoughts (Yao et al., 2023b). Among them, CoT in programming languages has emerged as a powerful approach for LLMs to invoke external tools (Qin et al., 2023). Recently, Lu et al. (2023a) proposed the Chameleon framework that augments LLMs with various tools like search engines and Python executors; we treat it as concurrent work and list its results in Appendix D.
Pre-trained TaLMs. Inspired by the success of pre-training on natural language text, various TaLMs have been proposed for pre-training on semi-structured tabular data (Dong et al., 2022). Existing TaLMs mainly inherit the architectures of traditional language models and can be classified into three types: (1) encoder-based TaLMs like TAPAS (Herzig et al., 2020), MATE (Eisenschlos et al., 2021) and TUTA (Wang et al., 2021); (2) encoder-decoder TaLMs such as TAPEX (Liu et al., 2022) and STTP (Xing and Wan, 2021); and (3) decoder-based TaLMs like TableGPT (Gong et al., 2020). In previous studies, TaLMs are usually fine-tuned to directly generate final answers or simple formulas. By contrast, we are the first to explore the combination of CoT reasoning and pre-trained TaLMs.

Conclusion
We extend CoT reasoning to small-scale TaLMs for the first time and provide an effective approach for the tabular mathematical reasoning task, especially in scenarios where LLMs are not accessible. Specifically, we propose a novel framework named TaCo, which coordinates two TaLMs responsible for CoT generation and answer inference, respectively. By introducing an external calculator, we further augment TaCo with accurate math computing ability. With two TAPEX-large models as backbones, TaCo outperforms the state-of-the-art ChatGPT on the TABMWP dataset by 9.55% (82.60% → 92.15%) with far fewer parameters (0.8B).

Limitations
Though the proposed method achieves strong performance with fewer parameters, fine-tuning the CoT generation model and the answer inference model depends on annotated chain-of-thoughts and gold answers. As a result, the chain-of-thought reasoning ability of TaCo could be limited to the tabular mathematical reasoning task. In future research, one could utilize open-source LLMs to generate more diverse chain-of-thoughts for more table-related tasks (Wang et al., 2023b; Ho et al., 2023), which may further extend the generalization ability of TaLMs and reduce the cost of manual annotation.
In terms of external tools, compared with frameworks that enable LLMs to access various tools (Shen et al., 2023; Lu et al., 2023a), TaCo only utilizes a calculator for common arithmetic calculations, i.e., "+, -, ×, ÷". More advanced external tools may be integrated to enhance the capability of the framework. We believe that tool learning with small-scale language models is a valuable future direction, especially for scenarios where LLMs are not available.

Ethics Statement
This paper proposes a two-stage framework for the tabular mathematical reasoning task, and models are trained and evaluated on the public TABMWP dataset. Thus, the authors foresee no ethical concerns with the research in this paper.

A More Implementation Details
In our experiments, we employ TAPEX and UnifiedQA as backbones of the TaCo framework. When linearizing the table into a flattened sequence, if there are no column headers in the original table, pseudo column headers are inserted, e.g., 'Column header 1'. The hyper-parameter configurations of the TAPEX and UnifiedQA backbones and their model sizes are shown in Table 6 and Table 7, respectively. All experiments are performed on a 32G NVIDIA V100 GPU.
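The pseudo-header step mentioned above can be sketched as follows; the placeholder string mirrors the example in the text, and everything else is illustrative.

```python
def ensure_headers(headers, num_columns):
    """Return the original column headers, or insert pseudo headers
    ('Column header 1', 'Column header 2', ...) when the table has none."""
    if headers:
        return list(headers)
    return [f"Column header {j}" for j in range(1, num_columns + 1)]

print(ensure_headers([], 3))
# -> ['Column header 1', 'Column header 2', 'Column header 3']
```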
For LLM-based baselines, we list the numbers of few-shot examples and the selection strategies in Table 8. For the ChatGPT baseline, we randomly select 4 examples from the train set for each question type. For a fair comparison, we use the same prompt format as PromptPG (Lu et al., 2023b) to construct in-context examples, as demonstrated in Figure 5.

B The complexity of CoT generation
Table 3 reveals a significant performance difference between free-text questions and multi-choice questions. To shed more light on the TABMWP dataset, we quantitatively analyze the complexity of CoT generation for the two question types. Specifically, we compute the number of required numerical calculations in the gold CoT (including +, -, ×, ÷, counting, min, max), the number of reasoning steps (we treat each line in the gold CoT as one reasoning step for simplicity) and the length of the gold CoT. The statistics in Table 9 demonstrate that, in the TABMWP dataset, CoT generation for free-text questions is more complex than for multi-choice questions. Based on our observations, at least 18% of multi-choice questions (mainly of the EXTR and OTH answer types) need no numerical calculations, whereas almost all free-text questions do.
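The complexity statistics described above can be approximated with a simple heuristic like the one below; the operator-matching regex and keyword list are assumptions, not the exact counting procedure used for Table 9.

```python
import re

def cot_complexity(gold_cot: str) -> dict:
    """Rough complexity measures of a gold chain-of-thought: number of numerical
    calculations, number of reasoning steps (non-empty lines), and length in tokens."""
    # Arithmetic operators between digits, plus counting/min/max keywords.
    num_ops = len(re.findall(r"\d\s*[+\-*×x/÷]\s*\d", gold_cot))
    num_keywords = len(re.findall(r"\b(count|minimum|maximum|min|max)\b", gold_cot, re.I))
    return {
        "num_calculations": num_ops + num_keywords,
        "num_steps": sum(1 for line in gold_cot.splitlines() if line.strip()),
        "length": len(gold_cot.split()),
    }

example = "Add the numbers: 49 + 48 + 51 = 148.\nDivide by the count: 148 / 3 = 49.33."
print(cot_complexity(example))  # -> {'num_calculations': 4, 'num_steps': 2, ...}
```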

C Error Instances and More Analysis
In this section, we present detailed error instances to analyze the weaknesses of the TaCo framework, shown in Figure 7 to Figure 10. We find that most errors are caused by the limited capability of the external tool and by the representation of the chain-of-thoughts. Take the error instance in Figure 7 as an example. To correctly answer the question in Figure 7, the model should find the numbers in the table that are greater than 53, and then count how many numbers are found. However, as the CoT generation model is fine-tuned to generate chain-of-thoughts in plain natural language, it is difficult for the model to describe the above process in a short and straightforward expression, which makes it hard to invoke external tools. If we could represent chain-of-thoughts in programming languages like Python, the solution to this question would be much clearer. For instance, one could write a line of Python code, "Ans = Count (61,61,65,65,66,70,66,78)", and implement a Python function "Count()" as an external tool to get the accurate result. The same methodology could be applied to error instances that demand other abilities such as fraction calculation, min/max operations and time calculations. Besides, lacking commonsense knowledge also makes it harder for models to comprehend tables and questions, e.g., reading the bus schedule in Figure 10.
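As a toy illustration of this program-style CoT idea, the hypothetical Count() tool and a regex dispatcher for the example line could look like the following; both the function and the parsing rule are made up for illustration and are not part of TaCo.

```python
import re

def count_tool(*values):
    """Hypothetical external tool: return how many values were passed."""
    return len(values)

line = "Ans = Count(61, 61, 65, 65, 66, 70, 66, 78)"
match = re.match(r"Ans\s*=\s*Count\s*\((.*)\)", line)
if match:
    values = [v.strip() for v in match.group(1).split(",")]
    print(count_tool(*values))  # -> 8
```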

D Results of Chameleon framework
Recently, Lu et al. (2023a) proposed a compositional reasoning framework named Chameleon, which treats an LLM as a natural-language planner that orchestrates a variety of tools, including vision models, web search engines, Python functions and so on.
As shown in Table 10, based on the powerful GPT-4 and multiple external tools, Chameleon achieves the best accuracy of 98.78% on the TABMWP test set. However, the proposed TaCo framework still achieves a competitive result of 92.15% with far fewer parameters.
We also apply the same calculator to the output of ChatGPT and use regular expressions to extract the final answer from the output, which yields a slight performance increase from 82.60% to 83.07%. After inspecting the error cases of ChatGPT, we find that most errors result from wrong reasoning steps rather than calculation mistakes. Compared with small-scale TaLMs, the numerical calculation ability of ChatGPT is much better, which may be attributed to the potential use of more advanced external tools behind the ChatGPT system.

Figure 1: An example from the TABMWP dataset. To solve the problem, the model needs to perform multi-step mathematical reasoning based on the table and the question.

Figure 3: Overview of the TaCo framework, with the table and the question in Figure 1 as a running example.

Figure 4: Error distributions of different question types.

Figure 6: A correct instance (ID: 752) where TaCo generates the right solution and answer.

Figure 7: An error instance of counting operation (ID: 449), where TaCo cannot correctly count how many numbers satisfy the requirement.

Figure 8: An error instance of fraction calculation (ID: 1711), where TaCo makes mistakes when reducing a fraction.

Figure 9: An error instance of number comparison (ID: 1434), where TaCo cannot correctly judge which is the larger number between 72.00 and 74.00.

Table 3: Accuracy (%) on the development set and test set of TABMWP. We also report detailed accuracy on different question types in the test set. FREE: free-text questions; MC: multi-choice questions; INT: integer answers; DEC: decimal answers; EXTR: extractive text answers; BOOL: Boolean text answers; OTH: other text answers. The best results are marked in bold. ± stands for the standard deviation over 3 repeated experiments. Unless otherwise specified, LLM baselines use the few-shot setting. "-SC" denotes the self-consistency decoding strategy (Wang et al., 2023a).

Table 4: Ablation study of the external calculator and the proposed two-stage paradigm. "base" and "large" stand for the model sizes of the TAPEX backbone in the TaCo framework. QT → S → A represents the proposed two-stage paradigm, which first generates the solution S and then arrives at the final answer A based on the input question Q, table T and generated solution S. QT → SA and QT → AS represent one-stage paradigms, which generate the solution and the answer in different orders. QT → A stands for the vanilla fine-tuning paradigm that directly predicts the answer.

Table 5: Experimental results of the TaCo framework with TAPEX and UnifiedQA as backbones, respectively.

Table 8: Number of in-context examples and selection strategies of LLM baselines.

Table 9: Quantitative analysis of the complexity of CoT generation for the two question types.

Table 10: Accuracy of Chameleon on the TABMWP test set.