Plan, Verify and Switch: Integrated Reasoning with Diverse X-of-Thoughts

As large language models (LLMs) have shown effectiveness with different prompting methods, such as Chain of Thought and Program of Thought, we find that these methods complement each other well on math reasoning tasks. In this work, we propose XoT, an integrated problem solving framework that prompts LLMs with diverse reasoning thoughts. For each question, XoT begins by selecting the most suitable method and then executes the methods iteratively. Within each iteration, XoT actively checks the validity of the generated answer and incorporates the feedback from external executors, allowing it to dynamically switch among different prompting methods. Through extensive experiments on 10 popular math reasoning datasets, we demonstrate the effectiveness of our proposed approach and thoroughly analyze the strengths of each module. Moreover, empirical results suggest that our framework is orthogonal to recent work that makes improvements on single reasoning methods and can further generalise to the logical reasoning domain. By allowing method switching, XoT provides a fresh perspective on the collaborative integration of diverse reasoning thoughts in a unified framework. The code is available at https://github.com/tengxiaoliu/XoT.


Introduction
The AI community has long sought to achieve automated reasoning (Hewitt, 1969), which is an important component of Artificial General Intelligence (Steunebrink et al., 2016). Mathematical reasoning, as a cognitive skill essential for humans yet challenging for language models, attracts increasing interest and commitment from researchers (Feigenbaum and Feldman, 1963; Wang et al., 2017; Lu et al., 2022).
With the abilities endowed by in-context learning (ICL), Large Language Models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; Touvron et al., 2023a; OpenAI, 2023) are able to solve mathematical problems through textual rationales with Chain-of-Thought prompting (CoT) (Wei et al., 2022) or through Python functions with Program-Aided Language Model (PAL) (Gao et al., 2022) and Program-of-Thought prompting (PoT) (Chen et al., 2022). These prompting methods exhibit unique strengths and limitations. CoT generates a step-by-step reasoning flow in natural language and performs calculations on the fly. This approach enables a more flexible solution format, but may result in a loss of precision since language models often struggle with arithmetic calculations (Lewkowycz et al., 2022; Wei et al., 2022). On the other hand, PoT or PAL resolves problems through Python statements, relying on Python interpreters to ensure calculation accuracy. Another noteworthy and intriguing prompting method is to formulate math problems as linear equation systems (He-Yueya et al., 2023). Similarly, inspired by linear algebra, we propose Equation-of-Thought (EoT), which performs math reasoning in a more direct way. The diversity inherent in each method does not render them as competing or mutually exclusive alternatives. On the contrary, in practical problem solving scenarios, possessing multiple methods can always yield a range of complementary advantages.

Figure 1: CoT only reasons in a single pass, while self-refine involves refinement using the same method. XoT integrates a verification module that makes a difference in method planning, enabling the attempts of diverse reasoning thoughts within an iterative framework.

arXiv:2310.14628v2 [cs.CL] 27 Dec 2023
The distinct problem-solving approaches can contribute to synergistic benefits that surpass the outcomes of any single approach. We find that this intuition also applies to the realm of math reasoning. With the availability of CoT, PoT and EoT, we hold the hypothesis that a model has the potential to solve a problem if it reaches the correct answer using any one of the prompting methods. As illustrated in Figure 2, our analysis shows that the model exhibits the potential to solve 92.72% of the problems, surpassing the best performing single method by over 10%.
Motivated by this observation, we propose XoT, an integrated math problem solving framework, which improves the LLM's reasoning ability by switching among diverse reasoning thoughts. Since there is no guarantee that LLMs can always solve the problem in a single attempt, we follow human intuition and allow the model to rethink and switch to a different method when encountering difficulties or obstacles. We apply two complementary verification methods to help the model decide whether it is time to switch to another method: passive and active verification. Passive verification relies on the external executors to provide determinable results based on the generated programs (Chen et al., 2023; Le et al., 2022). It offers shallow inspections, such as detecting program syntax issues or runtime errors. For active verification, we ask the model to verify the solution by checking whether the answer adheres to the conditions outlined in the original question.
As shown in Figure 1, XoT consists of three modules that work in an iterative framework: planning, reasoning and verification. Given a problem as input, the planning module first proposes the most appropriate method. The reasoning module then generates one solution using the planned prompting method. With the outputs and the results from external executors, the model is asked to assess the answers in the context of the questions. If the answer fails the verification, we go back to the planning module for another round of iteration and attempt alternative methods. The iterative process concludes when the verification confirms the correctness of the answer or after exhausting all available methods. To demonstrate the effectiveness of XoT, we conduct extensive experiments on 10 popular mathematical reasoning datasets and achieve consistent improvement. Empirical results suggest that XoT can accommodate recent work that focuses on improving single reasoning methods. Additional experiments also indicate that XoT can generalise to other domains such as logical reasoning tasks.
We summarize the main contributions as follows. First, we propose an integrated problem solving framework XoT, utilising the complementarity of different reasoning thoughts. Second, we introduce EoT, which solves math problems with a system of linear equations, serving as a complementary method to existing approaches. Third, we incorporate passive and active verification to help the framework switch among diverse reasoning thoughts, empowering it to make informed decisions regarding the subsequent steps to be taken. More generally, XoT sheds light on a new direction of interacting with diverse reasoning methods and tools. As shown in Figure 1, instead of sticking to one determined method, LLMs can benefit from the verification and the flexible switching among available reasoning thoughts.

Related Work

Math Reasoning with LLMs
As the field of large language models continues to prosper, many prompting techniques have emerged to unlock the reasoning abilities of LLMs (Qiao et al., 2022). Early successes include reasoning with step-by-step chain of thought (Wei et al., 2022), decomposing questions into sub-questions in a least-to-most fashion (Zhou et al., 2022), zero-shot prompting LLMs with simply one sentence (Kojima et al., 2022), and writing programs to solve procedural tasks (Gao et al., 2022; Chen et al., 2022). Rather than generating solutions in a single forward pass, one line of work employs multiple reasoning results and ensembles them by majority vote (Wang et al., 2022) or a stepwise verifier (Li et al., 2022). Additionally, Tree-of-Thoughts (Yao et al., 2023) deliberately explores multiple reasoning paths and searches over tree-structured reasoning states. Imani et al. (2023) propose to vote over multiple solutions generated with algebraic and program prompts. One concurrent work (Zhao et al., 2023) considers the difference between CoT and PoT and asks the LLM to choose the better reasoning rationale. In contrast to their work, XoT involves more reliable verification modules and switches methods when necessary.

Iterative Refinement
One stream of work is dedicated to iteratively enhancing LLMs by continuously reevaluating and refining outputs until the desired quality is achieved. Madaan et al. (2023) prompt the model to write feedback based on previously generated drafts and leverage the feedback to generate high-quality outputs. Similarly, Chen et al. (2023) iteratively debug the code by utilizing external program execution results and code explanations generated by the model itself. In order to avoid repetitive mistakes, Shinn et al. (2023) build a memory of previous errors, while Wang and Li (2023) collect all mistakes during the training phase to provide a global insight. When considering sources of hints to guide rethinking, Paul et al. (2023) focus on intermediate reasoning steps, while Zheng et al. (2023) directly utilize the previously generated answers. Qi et al. (2023) propose to emulate the divide-and-conquer fashion of human thinking and involve self-questioning and recursive thinking processes in the problem solving framework. Although these approaches contribute to improving the reasoning quality of LLMs, they are limited to retrying the same approach without considering other possible thoughts. In contrast, our proposed method aims to explore alternative solutions, and it is orthogonal to iterative refinement, as we have the flexibility to switch solutions when refining no longer leads to further improvement.

Prompting methods
For math reasoning tasks, we use three reasoning thoughts in this work, namely Chain-of-Thought (CoT), Program-of-Thought (PoT) and Equation-of-Thought (EoT). Alongside the well-known strengths of the CoT and PoT methods, our proposed EoT excels particularly in reasoning with unknown variables. For each problem, EoT attempts to model the question as linear equations and involves unknown values in the description.
A detailed formulation of EoT prompting can be found in Table 12 of Appendix C. As illustrated in Figure 3, while CoT correctly sets up the equations, it fails to perform the calculations accurately. PoT falls short in dealing with unknown variables, as Python requires that every variable is defined with a value. Assigning a value to an unknown variable (david_insects) misleads PoT into generating an erroneous step (the highlighted line). In comparison, EoT manages to express the question context in straightforward equations and solves them with a deterministic equation solver.
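To make the EoT idea concrete, the following is a minimal sketch of the "deterministic equation solver" step, assuming the model has already translated a word problem into a square linear system. The word problem, the function name `solve_linear_system`, and the use of a hand-written Gauss-Jordan routine are illustrative assumptions; the paper itself states that it uses the sympy library for this step.

```python
from fractions import Fraction


def solve_linear_system(coeffs, consts):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions.

    Illustrative stand-in for an external equation solver; not the paper's code.
    """
    n = len(coeffs)
    # Build the augmented matrix [A | b] with exact rational entries.
    aug = [[Fraction(v) for v in row] + [Fraction(b)]
           for row, b in zip(coeffs, consts)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("system has no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Normalise the pivot row, then clear this column in all other rows.
        pivot_val = aug[col][col]
        aug[col] = [v / pivot_val for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


# Hypothetical word problem (not from the paper):
# "Alice and Bob have 30 marbles in total. Alice has 6 more than Bob."
# Unknowns (alice, bob) are never assigned guessed values:
#   alice + bob = 30
#   alice - bob = 6
alice, bob = solve_linear_system([[1, 1], [1, -1]], [30, 6])
print(alice, bob)
```

The point of the sketch is that the unknowns stay symbolic until the solver runs, which is where PoT struggles, since Python needs every variable bound to a value before it can be used.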

Complementarity
Given a question $q$, we denote the correctness of the answer produced by each method as $R_X(q)$, where $X \in \{\text{CoT}, \text{PoT}, \text{EoT}\}$ denotes the reasoning method. $R_X(q) \in \{0, 1\}$ represents whether the generated answer is correct according to the gold label. We define the accuracy under the oracle setting as:

$$\mathrm{Acc}_{\mathrm{oracle}} = \frac{1}{N} \sum_{q} \mathbb{1}\Big[\sum_{X} R_X(q) \geq 1\Big] \tag{1}$$

where $N$ is the number of questions and $\mathbb{1}[\cdot]$ is the indicator function. The oracle setting represents that the model has the potential to solve a given problem if any of the methods accurately generates the answer. It also implies that in cases where the generated answer does not match the gold answer, XoT will make further attempts using alternative methods to answer the question. Under the oracle setting, the model can potentially achieve more than 10% gains on various datasets. In Figure 2, the bar at the bottom represents the highest performance achieved by employing a single method, followed by the optimal performance achieved through the use of two methods. The overall stacked bar shows the utilization of all three methods, which indicates the upper bound that can be reached through the combined collaboration of various methods.
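As a small illustration of the oracle setting, the sketch below counts a question as solved when at least one method answers it correctly. The function name `oracle_accuracy` and the toy correctness flags are assumptions made for this example; they are not results from the paper.

```python
def oracle_accuracy(correct_by_method):
    """Fraction of questions answered correctly by at least one method.

    correct_by_method maps a method name to a list of 0/1 flags,
    one flag per question, aligned across methods.
    """
    per_question = zip(*correct_by_method.values())
    solved = [1 if any(flags) else 0 for flags in per_question]
    return sum(solved) / len(solved)


# Toy correctness flags for five hypothetical questions.
flags = {
    "CoT": [1, 0, 0, 1, 0],
    "PoT": [1, 1, 0, 0, 0],
    "EoT": [0, 1, 0, 0, 1],
}
best_single = max(sum(v) / len(v) for v in flags.values())
print(best_single, oracle_accuracy(flags))
```

In this toy case every single method solves two of five questions, while the oracle combination solves four of five, which is the kind of gap that Figure 2 reports on the real datasets.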

XoT
Our goal is to develop a generalized problem solving framework that can automatically select the appropriate method for different problems and has the capability to switch among reasoning thoughts using both active and passive verification. We first describe the overall framework and then introduce each module in detail.

Overall Framework
The overall pipeline is described in Algorithm 1. The inputs of our framework include a question $q$ and a predefined set of methods $M$. With the user input, XoT employs its three built-in modules to output the final solution, namely the planning module $P$, the reasoning module $R$ and the verification module $V$.
These three modules collaborate in an iterative manner. Suppose at iteration $t$, the planning module $P$ first chooses the most appropriate method available: $m_t = P(M)$. The chosen method is subsequently excluded from the set of methods. The reasoning module is then tasked to generate one solution $y$ using the proposed method $m_t$. Following this, the verification module evaluates the solution by rethinking the answer within the given conditions. If the answer successfully passes the verification, we proceed to return the current solution. Otherwise, XoT will move forward to the next iteration. Every module is implemented with an LLM through inference under the few-shot setting. We elaborate on each module below.

Algorithm 1: XoT Reasoning Algorithm. Require: input question $q$, method set $M$, planning module $P$, reasoning module $R$, verification module $V$. Return: the solution $y$.
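The iterative procedure described above can be sketched as a short loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `xot`, the callable signatures, and the toy `plan`, `reason` and `verify` stand-ins are all hypothetical, and in the paper each module is an LLM call with few-shot prompts.

```python
def xot(question, methods, plan, reason, verify):
    """Plan a method, reason with it, verify, and switch on failure."""
    remaining = list(methods)  # copy so the caller's list is not mutated
    solution = None
    while remaining:
        method = plan(question, remaining)       # pick the most suitable method
        remaining.remove(method)                 # exclude it from later rounds
        solution = reason(question, method)      # one independent attempt
        if verify(question, method, solution):   # stop once verification passes
            break
    # If every method fails verification, the last attempt is returned.
    return solution


# Toy stand-ins for the LLM-backed modules (hypothetical behaviour).
def plan(question, remaining):
    return remaining[0]


canned_answers = {"PoT": 41, "EoT": 42, "CoT": 40}


def reason(question, method):
    return canned_answers[method]


def verify(question, method, solution):
    return solution == 42


answer = xot("toy question", ["PoT", "EoT", "CoT"], plan, reason, verify)
print(answer)
```

Here the PoT attempt fails verification, the loop switches to EoT, and CoT is never executed, which mirrors how XoT can stop early instead of running every method.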

Planning and Reasoning
The planning module is responsible for selecting the appropriate method at the beginning of each round of iteration. Recent work shows the necessity of equipping reasoning frameworks with the ability to plan ahead (Lu et al., 2023). As elaborated in Section 3, it is evident that each method possesses distinct strengths. Our intuition is to consistently initiate the process with the optimal method to enhance reasoning efficiency.
The reasoning module performs few-shot reasoning with the planned prompting method. Each round of reasoning operates independently, meaning that subsequent iterations do not rely on the failed reasoning attempts of previous iterations.

Verification module
The verification module assesses the effectiveness of the reasoning solution through two approaches: passive verification and active verification.
In the case of active verification, the module rethinks the answer within the context of the given question. It first acquires all intermediate values associated with each variable mentioned in the solution. These values are computed by external executors. We intentionally exclude the reasoning process (expressions) leading to the results to prevent the verification module from emulating the solution's thinking process. With the intermediate results and final answer in hand, the module is expected to recheck whether the answer satisfies the conditions specified in the question. The desired format for this evaluation is an assertion statement, as shown in Figure 4. This assertion is subsequently combined with the original solution for external tools to execute. If no issues arise during this execution phase, it means the solution successfully passes the verification. A detailed illustration of the prompts we use can be found in Appendix C. The verification module is specially designed for PoT and EoT as the intermediate values can be easily obtained. We leave the exploration of a more effective verification for CoT as future work.
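The two verification modes can be illustrated with a short sketch for PoT-style solutions. It assumes that a solution is a string of Python source and that the assertion has already been written by the model; the helper names `passive_check` and `active_check` and the pen-shop example are hypothetical. Note that `exec` runs arbitrary code, so a real system would need a sandboxed executor.

```python
def passive_check(solution_code):
    """Passive verification: does the program run without raising?"""
    try:
        exec(solution_code, {})  # fresh namespace; sandbox this in practice
        return True
    except Exception:
        return False


def active_check(solution_code, assertion_code):
    """Active verification: run the solution followed by the assertion."""
    try:
        exec(solution_code + "\n" + assertion_code, {})
        return True
    except Exception:  # AssertionError is included here
        return False


# Hypothetical question: "Pens cost 3 dollars each. Tom pays 18 dollars.
# How many pens does he buy?"
wrong_solution = "price = 3\npaid = 18\npens = paid * price"
right_solution = "price = 3\npaid = 18\npens = paid / price"
# Assertion re-checking the question's condition with the final answer.
assertion = "assert pens * price == paid"

print(passive_check(wrong_solution))            # runs fine, so it passes
print(active_check(wrong_solution, assertion))  # condition violated
print(active_check(right_solution, assertion))  # condition satisfied
```

The wrong program is syntactically valid and executes cleanly, so passive verification alone would accept it; only the assertion against the question's conditions exposes the error and triggers a method switch.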

Experimental Setting
Datasets Our experiments are conducted on a comprehensive set of 10 math reasoning datasets, encompassing various challenging math reasoning scenarios. Some widely used datasets include GSM8K, SVAMP, AQuA, MATH and MAWPS (AddSub, SingleOP, SingleEQ, MultiArith) (Koncel-Kedziorski et al., 2016). Besides, we also incorporate several recently introduced datasets, namely Algebra and GSM-hard. Algebra comprises a collection of solely algebraic word problems that can be resolved through the use of equations. To increase the complexity of calculations, GSM-hard replaces small numerical values with larger ones. The details of the statistics of the datasets can be found in Table 1.
Model We query the OpenAI API (https://openai.com) for experiments. Specifically, we use gpt-3.5-turbo as the inference engine. Unless otherwise specified, we manually construct the prompts with 8 examples sampled from the training set. For CoT and PoT, we directly use the examples released by published papers (Fu et al., 2022; Gao et al., 2022; Chen et al., 2022). For model generation strategy, we employ greedy decoding in all runs. Due to the non-deterministic APIs, we report the average performance and the standard deviation across 3 runs. We also evaluate XoT with various base models in Appendix A.2.

Main Results
The main results are shown in Table 2. We consider three prompting methods as baselines, namely CoT, PoT and EoT. On average, XoT achieves a significant improvement of 5.49% across the datasets. For the MATH dataset, we show the breakdown results of different question subtopics in Table 3. We also represent the performance enhancement over the strongest baseline as ∆. As questions in MATH are too complex for equation systems to solve, we only consider CoT and PoT with passive verification. Specifically, on the AQuA dataset, which consists of multiple-choice questions, we observe that PoT or EoT often fails to generate a valid answer due to the diverse answer formats. Across the three runs, 24.4% of the PoT answers and 30.3% of the EoT answers cannot be executed. Therefore, applying passive verification is adequate to ensure the exploration of other method options. When post-processing the generated results, we further enforce a restriction that the model cannot make a random guess if it fails to extract an answer from the generated output. Such instances are passed to the next iteration to guarantee a fair evaluation of the performance.
Notably, we observe that the enhancements are more pronounced for the challenging datasets compared to the easier ones. Difficult datasets usually contain longer questions and more than 3 reasoning steps, while easier datasets such as SingleEQ require only one equation to solve the problem. We find that the improvement directly correlates with the complementary nature of the three methods employed across different datasets. On easier datasets, each method performs well individually, resulting in only minor complementarity. Figure 5 reveals that XoT demonstrates superior performance on datasets that exhibit stronger enhancement under the oracle setting. The bars in the figure represent the improvement under XoT, while the line indicates the upper bound of the improvement under the oracle setting. The comparison indicates that MultiArith and SingleEQ allow minimal room for improvement; therefore, the overall XoT performance is negatively impacted by the accumulated errors introduced by the verification module.
Additionally, we conduct experiments on a logical reasoning task to evaluate the generalisability of XoT. Details can be found in Appendix A.1.

Analysis
In this section, we first analyze the effectiveness and necessity of each module within XoT. Then we provide a comparison with majority voting and describe how the model's self refinement can be integrated in our framework.

Ablation Study
Planning The planning module decides which method to attempt at the beginning of each iteration. We are curious about how well it performs in selecting the most suitable method among the available options. The planning module is expected to select from PoT and EoT at the beginning because these two methods can be verified with both active and passive verification. To demonstrate the necessity of the planning module, we conduct an experiment in which XoT is asked to execute each method in a predefined order. Whether to switch the method is still determined by the verification module. We break down the performance of each dataset with respect to different combinations of methods in Table 4.
Our findings align with two design principles of the planning module. First, it demonstrates robustness across different datasets. While specific combinations excel at different datasets, XoT equipped with the planning module outperforms all other predetermined combinations on average. For instance, on GSM-hard, the combination of PoT and EoT achieves the best performance, which highlights the importance of leveraging external tools to handle calculations involving large numbers. Additionally, on SingleEQ and MultiArith, where XoT fails to offer improvement, the combination of two methods proves to be efficient, surpassing the single method baselines. With the inclusion of the planning module, XoT can dynamically adjust the execution order based on different questions, which ensures a more consistent and robust performance. Second, the planning module enhances efficiency, helping XoT reach the final answer in fewer iterations by always starting from the most promising method. To illustrate, on GSM8K, XoT needs 1.46 iterations on average in comparison with 1.58 iterations with the fixed EPC order (EoT->PoT->CoT, the best performing fixed order). Specifically, 68.8% of the questions are resolved in the first iteration with XoT, as opposed to 57.2% when employing the fixed EPC order.
Reasoning How important is it to try different methods instead of exclusively relying on a single method? To investigate this, we restrict the available method options to PoT only, denoted as PoT^3. In other words, if the generated solution fails to pass the verification, the model reconsiders its reasoning using the same prompting method instead of changing to another. The results are shown in Figure 6. PoT^3 uses the same few-shot examples in all three iterations, while PoT^3-d uses different few-shot examples in each iteration. Repeatedly exploiting a single method yields limited complementarity compared to XoT with three methods, which suggests the necessity of employing various reasoning methods in our framework.

Table 4: Results across different datasets without the planning module. We manually define the execution sequence, denoted as the combination of the first letter in each method. For example, 'PEC' indicates PoT-EoT-CoT.
Verification The verification module facilitates seamless switching between iterations. We here explore how helpful the active and passive verifications are. Figure 7 compares passive and active verification for PoT and EoT. Relying only on passive verification is a simplistic approach that yields an alarmingly high false positive rate of 89.5% and 41.0%, as shown in Table 5. This drawback is particularly critical as the essence of XoT lies in the ability to adaptively switch methods, and a high false positive rate restricts the model's ability to explore alternative method options. By additionally incorporating active verification, despite a slight compromise in accuracy, the false positive rate is substantially reduced by 56.8% and 24.3%. We also note that this approach inevitably leads to an increase in the false negative rate. However, this is a minor drawback as the subsequent method options still have a chance to reach the correct answer. Consequently, employing active verification offers 2.3% gains to the overall XoT performance. Additionally, we explore the necessity of the iterative nature of XoT by removing the entire verification module. In this scenario, we only reason once with the most suitable method suggested by the planning module. The results are presented in Table 6. As our planning module mainly chooses the method from PoT or EoT, we here restrict the available methods to PoT and EoT only in the XoT framework, which is denoted as 'XoT (only PE)'. Without the verification module, the framework, denoted by 'XoT (w/o verification)', is no longer capable of rechecking the answer and thus cannot perform iterative attempts to switch methods. This leads to a performance degradation of 4.9% and 2.9% on GSM8K and SVAMP respectively.
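For readers who want to reproduce this kind of analysis, the sketch below computes false positive and false negative rates from per-question verifier decisions. It assumes the standard definitions (a false positive is an incorrect answer that the verifier accepts, a false negative is a correct answer that it rejects), which may differ in detail from how the paper tabulates Table 5. The function name and the toy data are hypothetical.

```python
def verification_rates(passed, correct):
    """Return (false_positive_rate, false_negative_rate) for a verifier.

    passed[i]  -- True if the verifier accepted answer i
    correct[i] -- True if answer i matches the gold label
    """
    pairs = list(zip(passed, correct))
    fp = sum(p and not c for p, c in pairs)        # wrong answer accepted
    tn = sum((not p) and (not c) for p, c in pairs)  # wrong answer rejected
    fn = sum((not p) and c for p, c in pairs)      # right answer rejected
    tp = sum(p and c for p, c in pairs)            # right answer accepted
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    return fpr, fnr


# Toy verifier decisions for six hypothetical answers.
passed = [True, True, False, True, False, True]
correct = [True, False, False, True, True, False]
fpr, fnr = verification_rates(passed, correct)
print(round(fpr, 3), round(fnr, 3))
```

A high false positive rate matters most for XoT because an accepted wrong answer ends the loop, whereas a rejected correct answer only costs one extra iteration with another method.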

Comparison with Majority Voting
We additionally conduct experiments involving the majority vote of three distinct methods. The vote is based on three answers generated by three methods (one answer per method). As shown in Table 7, taking the majority vote of the three methods achieves 82.59 on average, while XoT achieves better performance at 84.63. Additionally, we observe that the majority vote fails on datasets containing questions that align exceptionally well with a specific method. Specifically, the majority vote achieves 79.73 on Algebra, while XoT achieves 89.94. The majority vote needs to execute all three methods to reach an answer, while XoT stops when the answer passes the verification. We calculate the total token count as #total_token = #input_token + #output_token * 2, according to OpenAI's pricing policy (https://openai.com/pricing). As shown in the table, XoT is able to achieve higher performance with a lower budget, exhibiting a reduction of 16.7% in expenses. The token count includes all the in-context examples used and is averaged over the total number of questions in the 9 datasets.
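The majority-vote baseline and the token budget formula above can be sketched in a few lines. The function names and the example numbers are hypothetical, and the tie-breaking behaviour (falling back to the ordering of `Counter.most_common`) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the per-method answers.

    Tie-breaking follows Counter.most_common, an assumption of this sketch.
    """
    return Counter(answers).most_common(1)[0][0]


def total_tokens(input_tokens, output_tokens):
    """Weighted token count: output tokens count twice, as in the text."""
    return input_tokens + 2 * output_tokens


# One answer per method (CoT, PoT, EoT) for a hypothetical question.
print(majority_vote([18, 18, 20]))
# Hypothetical usage for one question: 1000 prompt tokens, 150 generated.
print(total_tokens(1000, 150))
```

Majority voting always pays for all three methods, whereas XoT's cost depends on how many iterations run before verification passes, which is where the reported budget reduction comes from.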

Self-refinement
The design principle underlying XoT is its adaptable capability to switch methods, allowing for smooth integration with research aimed at improving individual methods. The line of iterative refinement methods enhances the model performance by asking the model to rethink its previous response, serving as a good alternative for the reasoning module in XoT. Specifically, before moving on to another method at each iteration, we allow the model to first make self refinement on its current approach, making the best use of the current method.
Inspired by previous work (Madaan et al., 2023), after reasoning with one method for the first time, we require the model to analyze its response line-by-line and summarize several pieces of advice to mitigate the potential errors. Then, the model answers the question for a second time using the same method, with the summarized advice as a hint. After that, we verify the results produced by the second round and determine whether to switch to another method.
To achieve the iterative refinement in CoT, we follow Zheng et al. (2023) to progressively hint the model with the answers generated before. For PoT and EoT, we follow the released self-refinement prompts from Madaan et al. (2023). The results are shown in Table 8. We only allow the model to think twice using each prompting method. Though adding only one round of refinement yields marginal improvement within each single method, their collaboration contributes to a more significant improvement under the XoT framework.

Conclusion
We propose XoT, an integrated problem solving framework that utilizes diverse reasoning thoughts to prompt LLMs. XoT integrates planning, reasoning and verification into a unified framework, enabling the model to explore multiple methods based on the active and passive verification of the solutions. We conduct extensive experiments on 10 math reasoning datasets to thoroughly evaluate the advantages of each module and showcase the efficacy of our proposed approach. Further results also show that the design ethos of XoT can generalize to the logical reasoning domain. We consider its generalisation to more diverse tasks as a compelling avenue for future exploration.

A.1 Generalisation to logical domain
We analyze the generalisability of the XoT framework to the logical reasoning domain. One recent work (Pan et al., 2023) proposed LogicLM to solve logical reasoning questions using First Order Logic expressions and executed them in external symbolic reasoners. Following LogicLM, we design similar formal language expressions to represent First Order Logic and conduct experiments on FOLIO (Han et al., 2022), an expert-written, logically complex and diverse dataset for natural language reasoning. Our findings in Table 9 suggest that different methods in the logical domain also show strong complementarity, achieving 77.45% under the oracle setting. After involving the verification module, XoT performs at 62.75% on the validation set of FOLIO. These results underscore the applicability of XoT as a general problem solving framework.

A.2 Experiments on other models
We further assess the performance of XoT across various base models, such as the Llama-2 series (Touvron et al., 2023b). The results are shown in Table 10, and we illustrate the performance scaling curve in Figure 8. With less capable models, different prompting methods still demonstrate strong complementarity under the oracle setting. Our observations suggest that smaller models tend to yield suboptimal results, likely due to the unbalanced performance across different reasoning approaches and the models' limited capability for active verification. This limitation inhibits the model's ability to switch between methods in a timely manner. However, as the model's size increases, XoT consistently shows its strength across the datasets.

A.3 Proportion of XoT
Figure 9 illustrates the proportion of different methods that XoT selects as the final answers. On GSM8K, 56.7% of the questions end up being solved with PoT, while 28.3% are tackled by EoT. The remaining 15% are left for CoT to solve.

B XoT with self refinement
We here offer the details of how we combine iterative self-refinement with the XoT framework. As shown in Figure 10, the self refinement process can be integrated in the reasoning module, where the dashed line indicates rethinking using the same method. When the desired number of self refinement iterations is reached, the generated solutions will proceed to the verification module. Then the verification will determine whether to use the current solution or change to another method.

C Examples
In this section, we show the input and output examples of each module in XoT. Full prompts are available in the public GitHub repository: https://github.com/tengxiaoliu/XoT. For EoT, we use the sympy library to solve the linear equations.

Figure 2: Complementarity of X-of-Thought methods on different datasets.The stacked bars indicate the best performance achieved by using one, two and three methods separately.Employing multiple methods under oracle setting can offer significant performance gains.

Figure 3: In particular cases where CoT and PoT fall short, EoT successfully solves the problem, which serves as a good complement.

Figure 4: Overview of XoT. Following the suggestion of the planning module, XoT first reasons with PoT. However, the generated answer fails in the verification module. In the second iteration, the selected method is EoT. The reasoning module successfully generates the solution that passes the verification.

Figure 5: The correlation between oracle performance and final improvement. A higher oracle gain allows more room for XoT to improve.

Figure 6: Repeatedly exploiting the same method (PoT^3) results in limited complementarity compared to XoT with three methods. PoT^3-d denotes that we use different few-shot examples in three iterations.

Figure 7: Comparison of passive and active verifications.The blue and green matrices represent verifications for PoT and EoT respectively.

Figure 8: Performance scaling curve on different base models. The performance is averaged across the four datasets shown in Table 10.

Figure 9: The proportion of different methods that XoT finally chooses as the answer on GSM8K.

Figure 10: Self refinement can be integrated in the XoT framework. The dashed block indicates the reasoning module with the inclusion of self refinement. Within each self refinement process, the model repeatedly exploits the same method.

Table 1: Statistics of the datasets we used. # Steps denotes the average number of reasoning steps in the gold answers. ⋆ indicates a rough estimate due to the inconsistent rationale formats.

Table 2: Main experiment results across various math reasoning datasets (GSM8K, SVAMP, AQuA⋆, Algebra, GSM-hard, AddSub, SingleOP, SingleEQ, MultiArith, and their average). Under the oracle setting, XoT switches the method if the generated answer does not match the gold answer. ⋆ denotes that we only use passive verification. ∆ represents the improvement over the best performing baseline.

Table 3: Experiment results on the MATH dataset. We only employ two methods and passive verification on MATH.
PoT^3-d uses different examples randomly sampled from the training set. It is observed that under the oracle setting, repetitive exploitation of a single method has limited complementarity of 84.08%, which is 8.64% less than XoT. As a result, the final performance reflects such a gap, with PoT^3 at 78.39% and XoT at 82.71%.

Table 5: Ablation results of different verification methods on GSM8K. Employing active verification significantly reduces the false positive rate and results in a notable improvement in the overall XoT performance.

Table 6: Ablation results of excluding the entire verification module on GSM8K and SVAMP. XoT (only PE) is equipped with the verification module. The lack of this module compromises its ability for iterative method switching, resulting in diminished performance.

Table 7: Comparison between XoT and Majority Voting. XoT outperforms the majority vote approach in a more efficient manner, yielding an average gain of 2.04 with a reduction of 16.7% in token count. #Tokens denotes the average number of tokens consumed for one case (including prompts, question and response).

Table 9: XoT performance on the logical reasoning task FOLIO (validation set). Normal text reasoning and formal language FOL complement each other under the oracle setting and the XoT framework.