Solving Math Word Problems via Cooperative Reasoning induced Language Models

Large-scale pre-trained language models (PLMs) bring new opportunities to challenging problems, especially those that need high-level intelligence, such as math word problems (MWPs). However, directly applying existing PLMs to MWPs can fail because the generation process lacks sufficient supervision and thus lacks the fast adaptivity of humans. We notice that human reasoning follows a dual reasoning framework that consists of an immediate reaction system (System 1) and a deliberate reasoning system (System 2), where the entire reasoning is determined by their interaction. This inspires us to develop a cooperative reasoning-induced PLM for solving MWPs, called Cooperative Reasoning (CoRe), resulting in a human-like reasoning architecture with System 1 as the generator and System 2 as the verifier. In our approach, the generator is responsible for generating reasoning paths, and the verifiers are used to supervise the evaluation in order to obtain reliable feedback for the generator. We evaluate our CoRe framework on several mathematical reasoning datasets and achieve a solid improvement over state-of-the-art methods, with up to a 9.6% increase over the best baselines.


Introduction
Addressing math problems is a hallmark of human intelligence, which allows reasoning and adapting from limited data. We want neural models to be able to do the same. However, quick and flexible reasoning is challenging for current neural models, as they must draw on a certain level of prior experience when given a limited amount of new data while avoiding overfitting. The rapid growth of large-scale Pre-trained Language Models (PLMs) offers unprecedented potential for this issue, often relying on well-designed trigger prompts (Wei et al., 2022c; Li et al., 2022; Brown et al., 2020). Although appealing in terms of efficiency, the success of this approach relies on memorizing patterns with a sufficiently large number of parameters (≥ 100 billion) (Wei et al., 2022b), differentiating it from the fast adaptivity in the human reasoning process.
Active disciplines like neuroscience and cognitive science attempt to uncover the mechanism of human reasoning, and agree that our learning process is governed by an interaction mechanism, often referred to as System 1 and System 2 (Evans, 2003; Kahneman, 2011). In particular, System 1 offers fast responses like human instinct, and System 2 performs deliberate reasoning. Interactions between them are important for adapting to a continuously changing environment. PLMs behave more like System 1, according to the above theory, and thus lack the generalization ability in reasoning (Nye et al., 2021).
arXiv:2210.16257v4 [cs.CL] 28 May 2023

In this work, we explore a new line of zero-shot math problem reasoning called Cooperative Reasoning (CoRe), which uses a human-reasoning-like framework with feedback in the solution generation loop, as opposed to pure PLM-based methods. Intuitively, System 1 and System 2 are embodied as generators and verifiers, respectively, and they are defined as follows: generators for generating reasoning paths, and verifiers for supervising the paths' evaluation. Specifically, we train an LM beyond the question-answer paradigm by integrating in-the-loop reasoning, i.e., we let the LM output both the answer and the corresponding reasoning process for a given question. Meanwhile, we introduce two types of verifiers, token-level and sentence-level, allowing us to provide feedback throughout the whole solution generation lifecycle. Notice that the solution path is generated by selecting candidate tokens with some probability, so generation is tree-like and closely matches the tree search process of Monte Carlo Tree Search (MCTS) (Kocsis and Szepesvári, 2006). With this in mind, the verifiers can score tokens along the solution generation process from start to end when using MCTS. Therefore, we can use the score to evaluate the quality of the generation process during inference before finalizing the solution, making timely feedback available for supervising the generation process. With this, the evaluation goes beyond the quality of the final result to the granularity of each reasoning step, extending the supervision from the solution level to the path level. We combine the solution score and the perplexity of its corresponding reasoning path to steer the overall training towards high-quality augmented solutions while aligning with a reliable reasoning process, aiming to improve generalization ability.
We experimentally evaluate CoRe on multiple mathematical reasoning datasets in both zero-shot and fine-tuning settings. CoRe consistently achieves better performance than competing baselines. Notably, CoRe achieves up to a 9.6% improvement on MultiArith over SoTA baselines, which are dozens of times larger than our model. In summary, our contributions are as follows.
• We propose a novel reasoning method for mathematical problem solving, called Cooperative Reasoning (CoRe), that introduces feedback in the loop during solution generation, as opposed to the sequential learning process in previous methods, resulting in the first method for this task that builds on the learning mechanism in the human brain.
• We develop a self-thinking strategy for further boosting reasoning ability with generated data from the cooperation between System 1 and System 2.
• We demonstrate the superiority of CoRe compared to other zero-shot and fine-tuning methods, with a 9.6% improvement on MultiArith over SoTA baselines.
2 Related Work

Dual Process System
Dual-process theory (Evans, 2003;Kahneman, 2011) argues there are two cognitive systems underpinning human reasoning: System 1 and System 2. The purpose of clarifying these systems is that they have the potential to help us construct artificial intelligence systems that benefit from human flexibility and methodical generalization.
Guiding models with a dual-process system is not new. Nye et al. (2021) simulated Systems 1 and 2 to improve the consistency and coherence of neural networks. Similar to several studies (Cobbe et al., 2021; Li et al., 2022; Scialom et al., 2021), in addition to System 1 for generation, we develop a distinct model as System 2, called the Verifier. The Verifier checks the feasibility and correctness of the generator's content, and the two collaboratively solve the reasoning task.

Multi-step Reasoning
Many works exploit the multi-step reasoning ability of language models. Cobbe et al. (2021) showed that training a verifier to score the solutions generated by a fine-tuned GPT-3 could improve performance compared to solely fine-tuning GPT-3. Nye et al. (2022) discovered that asking the language model to write out the intermediate process could achieve better results on various NLP tasks. Likewise, Chain-of-Thought (CoT) prompts (Wei et al., 2022c) prepended exemplars with intermediate reasoning steps as prompts and achieved SoTA on several reasoning benchmarks by using large-scale PLMs. Wang et al. (2022) further boosted CoT's performance by sampling a set of possible solutions and then obtaining the final answer by majority voting. DIVERSE (Li et al., 2022) showed that diverse CoT prompts and an extra verifier were both helpful for PLMs to solve reasoning problems. Kojima et al. (2022) found that by simply adding "Let's think step by step" after the question, PLMs could successfully solve the problems step by step, a method called Zero-shot-CoT.
The above methods rely on extremely large language models, resulting in high computational cost and long running time. Moreover, several works (Wei et al., 2022c; Kojima et al., 2022) point out that neither CoT nor Zero-shot-CoT is helpful to smaller models. In contrast, our method does not necessarily require extremely large PLMs and can work with models of different scales, thus reducing computational cost and inference time. Our approach has competitive zero-shot performance thanks to the efficient and collaborative application of a dual-process system.

Cooperative Reasoning
In this section, we present the proposed cooperative reasoning framework, CoRe, which makes System 1 and System 2 cooperate with each other and includes 3 sequential steps: cooperative training, cooperative inference, and self-thinking.

Preparation
As discussed in Sec. 1, we expect a PLM ($G$) to quickly generate multiple reasoning paths like System 1. Then, considering that System 2 is responsible for deliberate evaluations of the reasoning paths, we employ two modules: a step verifier ($V_{\text{step}}$) for reasoning steps, and a path verifier ($V_{\text{path}}$) for reasoning paths.

Cooperative Training
Before applying System 1&2 to inference, a critical issue is for them to learn how to generate reasoning paths and evaluate reasoning steps/paths. Inspired by a widely-used training strategy for reasoners (Cobbe et al., 2021), we present a cooperative training method as shown in Fig. 2 Step 1. Moreover, we discuss hyper-parameter configurations and extra training details in Appendix B.1 and Appendix B.2.
Step 1.1: We first fine-tune $G$ on a dataset $D = \{(q_i, p_i, gt_i)\}_{i=1}^{N}$ consisting of $N$ samples. Each sample $x$ is composed of a question $q$, a reasoning path $p$, and a ground truth answer $gt$. We fine-tune $G$ with the standard language modeling objective $\mathcal{L}_{LM}$ as in Eq. (1).
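For reference, the standard language modeling objective is conventionally written as the negative log-likelihood of each token given its prefix. The block below is a sketch of that conventional form, not a reproduction of Eq. (1): the symbols $x_t$ (the $t$-th token of a training sample) and $T$ (the sample length) are notation assumed here, and whether the loss also covers the question tokens is not specified in the text.

```latex
\mathcal{L}_{LM} = -\sum_{t=1}^{T} \log P_{G}\left(x_t \mid x_{<t}\right)
```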
Step 1.2: Once $G$ has learned how to generate solutions, we employ it on questions $q$ from $D$. As a result, we obtain a new dataset $D^{+} = \{(q_i, rp_{i,j}, a_{i,j})\}_{i=1,\dots,N}^{j=1,\dots,M}$ with $M$ generated reasoning paths ($rp$) and answers ($a$) for each $q$.
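Step 1.2 can be pictured with the following minimal sketch. The `sample_path` callable is a hypothetical stand-in for sampling from the fine-tuned generator, and the tuple layout is an assumption made for illustration rather than the authors' data format.

```python
import random
from typing import Callable, List, Tuple

# (question, reasoning path, answer) triple stored in the augmented dataset D+.
Sample = Tuple[str, str, str]


def build_augmented_dataset(
    questions: List[str],
    sample_path: Callable[[str], Tuple[str, str]],  # hypothetical generator call
    m: int,
) -> List[Sample]:
    """Collect m sampled (reasoning path, answer) pairs for every question."""
    d_plus: List[Sample] = []
    for q in questions:
        for _ in range(m):
            rp, a = sample_path(q)
            d_plus.append((q, rp, a))
    return d_plus


def toy_generator(question: str) -> Tuple[str, str]:
    """Toy stand-in: returns a dummy reasoning path and a random answer."""
    answer = str(random.choice([3, 4, 5]))
    return f"reasoning for: {question}", answer


d_plus = build_augmented_dataset(["q1", "q2"], toy_generator, m=3)
```

With N questions and M samples per question, the resulting list holds N × M triples, which is the size implied by the index ranges of $D^{+}$.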
Step 1.3: Unlike popular methods, we train two verifiers to model the human reasoning procedure with deliberate analysis of each step and of the whole path. To evaluate the reasoning steps in a path, we need a token-level scorer, which we name the step verifier $V_{\text{step}}$. Therefore, we fine-tune a PLM with two tasks jointly: 1) the language modeling task mentioned before; 2) the verification task, which predicts a score for each token in the solution. The verification loss $\mathcal{L}_{VS}$ is calculated as the Mean Squared Error (MSE) of the predicted score with respect to the label (Eq. (2)), where $(rp, a)$ is taken from $D^{+}$ and $gt$ is the ground truth for the same $q$ in $D$.
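A minimal sketch of the token-level verification loss follows. It assumes, in the spirit of Cobbe et al. (2021), that every token of a generated solution inherits a label of 1 when the generated answer matches the ground truth and 0 otherwise; this labeling scheme and the function names are assumptions for illustration, not taken from the equation itself.

```python
from typing import List


def token_labels(answer: str, ground_truth: str, num_tokens: int) -> List[float]:
    """Assumed labeling: each token gets 1.0 if the final answer is correct, else 0.0."""
    label = 1.0 if answer.strip() == ground_truth.strip() else 0.0
    return [label] * num_tokens


def step_verifier_mse(pred_scores: List[float], labels: List[float]) -> float:
    """Mean squared error between per-token predicted scores and labels."""
    if len(pred_scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    return sum((s - y) ** 2 for s, y in zip(pred_scores, labels)) / len(labels)


pred = [0.9, 0.8, 0.7, 0.6]  # toy per-token scores from a step verifier
loss_correct = step_verifier_mse(pred, token_labels("4", "4", len(pred)))
loss_wrong = step_verifier_mse(pred, token_labels("5", "4", len(pred)))
```

The same high scores yield a small loss when the answer is correct and a large loss when it is wrong, which is the signal that trains the verifier to separate good paths from bad ones.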
On the other hand, we need a path-level scorer for reasoning paths. Unlike the step verifier, we simply extract an overall representation of the reasoning path for prediction. Specifically, we employ a BERT-like model and take the [CLS] token to calculate the MSE loss $\mathcal{L}_{VP}$, similar to $\mathcal{L}_{VS}$.
In summary, the overall training objective for verifiers is given by:
$$\mathcal{L}_{V} = \mathcal{L}_{VS} + \mathcal{L}_{LM} + \mathcal{L}_{VP}. \tag{3}$$

Cooperative Inference
After obtaining a generator and two verifiers, we propose cooperative inference to generate solutions for unseen questions. Instead of treating verifiers as voters, we argue that verifiers should offer appropriate guidance and feedback during the reasoning process. Therefore, we integrate a cooperative search algorithm. In particular, we adopt the popular Monte Carlo Tree Search (MCTS) (Kocsis and Szepesvári, 2006) to enable controlled reasoning.
The cooperative inference starts from the root node, which preserves question tokens.We detail the cooperative inference process as follows.
[Figure 2, panel labels: Step 1: Cooperative Training of System 1&2 (Step 1.1: fine-tuning on {Q, P, GT}; Step 1.2: generating reasoning paths for {Q}, yielding generated {Q, RP, A}); Step 3: Self-Thinking.]

Selection. If the current node has children, with 50% probability, we select a node from its children with the modified PUCT formula (Czech et al., 2021) given in Eq. (4), where the state $s$ represents the sequence consisting of all tokens in the current search path, $N(s, n)$ is the number of times that node $n$ has been selected in state $s$, and the reward $R(n)$ records all the scores received from the backup. We then perform selection again with the selected node as the current node. Otherwise, we perform expansion once and choose the returned new node as the current node.
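The selection step can be sketched as follows. Since Eq. (4) is not reproduced here, the score below uses a generic PUCT form (mean reward plus a prior-weighted exploration bonus); the constant `c_puct`, the `prior` field, and the `Node` class are assumptions for illustration rather than the paper's exact formula.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    tokens: str                       # tokens stored in this node
    prior: float = 1.0                # assumed generator probability of the node
    visits: int = 0                   # N(s, n)
    reward: float = 0.0               # R(n), accumulated by backup
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def puct_score(child: Node, total_visits: int, c_puct: float) -> float:
    """Generic PUCT: mean reward plus an exploration bonus."""
    mean_reward = child.reward / child.visits if child.visits else 0.0
    bonus = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visits)
    return mean_reward + bonus


def select(node: Node, c_puct: float = 1.0,
           rng: Optional[random.Random] = None) -> Node:
    """Descend while the 50% coin flip allows it; return the node to expand."""
    if rng is None:
        rng = random.Random()
    while node.children and rng.random() < 0.5:
        total = sum(c.visits for c in node.children)
        node = max(node.children, key=lambda c: puct_score(c, total, c_puct))
    return node
```

The coin flip is what lets the search widen the tree at an inner node instead of always descending to a leaf, so both deep and shallow positions keep receiving new children.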
Expansion. During expansion, the generator is required to generate a sequence of tokens based on the current state. A new node is created to store the generated tokens and is added to the current node's children. Then, $V_{\text{step}}$ evaluates the current reasoning path and predicts a score $score_{\text{step}}$. Finally, the new node is returned.

Roll-Out. After selection and expansion, we start from the current node and let the generator complete the reasoning path until it meets the [EOS] token or reaches the max token length limit. Next, $V_{\text{path}}$ evaluates the whole reasoning path and produces a score $score_{\text{path}}$. Recall that $V_{\text{step}}$ also provides a score $score_{\text{step}}$ during the expansion. Therefore, to leverage both scores, we introduce a hyper-parameter $\alpha$ to adjust their contributions to the node's reward (Eq. (5)), where $s$ is the final score that each node receives by the backup.
Backup. We update the rewards back from the current node to the root node. The scores produced by verifiers are added to $R(n)$ and the visit count $N(s, n)$ is increased by 1.
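Roll-out scoring and backup can be sketched together as below. The linear combination in `combine_scores` (path score plus an $\alpha$-weighted step score) is an assumed form of the node reward, since the equation itself is not reproduced here; the `Node` class is likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    visits: int = 0                   # N(s, n)
    reward: float = 0.0               # R(n)
    parent: Optional["Node"] = None


def combine_scores(score_path: float, score_step: float, alpha: float) -> float:
    """Assumed reward form: path score plus alpha-weighted step score."""
    return score_path + alpha * score_step


def backup(node: Node, score: float) -> None:
    """Add the score to R(n) and increment N(s, n) from the node up to the root."""
    current: Optional[Node] = node
    while current is not None:
        current.reward += score
        current.visits += 1
        current = current.parent


root = Node()
child = Node(parent=root)
leaf = Node(parent=child)
s = combine_scores(score_path=0.8, score_step=0.5, alpha=0.2)
backup(leaf, s)
```

Every node on the path from the leaf to the root receives the same score, so a good roll-out raises the standing of all its ancestors in later selection rounds.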

Self-Thinking
It is challenging to fine-tune models on data synthesized by themselves, since this requires them to be very confident in the content they generate. A proper self-training method can enhance the robustness of the whole system and allow deep data mining. Therefore, we introduce self-thinking as described in Fig. 2 Step 3 and Algorithm 1. Considering the noise contained in generated data, we build a filter by using scores from the verifiers and perplexity (PPL) from the generator. In detail, we select high-quality reasoning paths by setting a score threshold. Moreover, we only keep the reasoning paths with no higher PPL than the ground truth solutions. After filtering, we merge $D_{\text{new}}$ with $D$ and send it to Step 1. Once several iterations are completed, we obtain a powerful System 1&2. More details can be found in Appendix B.3.
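The filter described above can be sketched as follows. The `Candidate` container, the per-question lookup `gt_ppl`, and the use of `>=` for the score threshold are assumptions for illustration; perplexity is computed in the usual way as the exponential of the negative mean token log-probability.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    question: str
    path: str
    score: float                  # verifier score of the generated path
    token_logprobs: List[float]   # generator log-probabilities of the path tokens


def perplexity(token_logprobs: List[float]) -> float:
    """PPL = exp(-mean log-probability) over the tokens of a sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def filter_candidates(cands: List[Candidate], gt_ppl: Dict[str, float],
                      score_threshold: float) -> List[Candidate]:
    """Keep paths that pass the score threshold and are no more perplexing
    than the ground-truth solution of the same question."""
    kept = []
    for c in cands:
        if (c.score >= score_threshold
                and perplexity(c.token_logprobs) <= gt_ppl[c.question]):
            kept.append(c)
    return kept


cands = [
    Candidate("q1", "A", 0.90, [-0.1, -0.2]),  # high score, low PPL: kept
    Candidate("q1", "B", 0.30, [-0.1, -0.2]),  # score below threshold
    Candidate("q1", "C", 0.95, [-1.0, -1.2]),  # PPL above the ground truth's
]
kept = filter_candidates(cands, gt_ppl={"q1": 1.5}, score_threshold=0.5)
```

The two conditions are complementary: the score reflects the verifiers' judgment, while the PPL bound keeps only paths the generator itself finds at least as natural as the reference solution.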

Zero-shot Inference
We simply perform cooperative inference as in Fig. 2 Step 2 with the trained System 1&2 on unseen datasets. After obtaining several reasoning paths with scores, we arrive at the final answer by weighted voting based on the scores, following Li et al. (2022).
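A minimal sketch of score-weighted voting is shown below. Summing the scores of all paths that share an answer is an assumed aggregation rule; the exact weighting used by Li et al. (2022) may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_vote(candidates: List[Tuple[str, float]]) -> str:
    """Return the answer whose reasoning paths have the largest total score."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)


# (answer, verifier score) for each searched reasoning path; toy values.
paths = [("4", 0.9), ("15", 0.3), ("15", 0.2), ("4", 0.1), ("15", 0.25)]
final_answer = weighted_vote(paths)
```

In this toy example the answer "15" appears more often, yet "4" wins because its paths carry more total score, which is how weighted voting differs from plain majority voting.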

Baselines
For comparison under the zero-shot setting, the results of Instruct GPT-3 (175B) and PaLM (540B) with their various methods are taken from Kojima et al. (2022). The zero-shot * and zero-shot-CoT * entries indicate that a non-standard prompt is used (see details in Appendix B.4). We also provide our generator as a baseline when comparing to previous fine-tuning methods. Regarding sampling multiple solutions, we search 40 paths with the same setting as Self-Consistency (Wang et al., 2022).

Implementation Details
Since cooperative training requires a high-quality dataset with reasoning paths, we treat GSM8K (Cobbe et al., 2021) as the seed dataset $D$ in Sec. 3.2. Unless otherwise stated, we employ GPT-J (Wang and Komatsuzaki, 2021) as the generator and the step verifier, and DeBERTa-large (He et al., 2021) as the path verifier. Since the default setting consists of two GPT-J (6B) models and a DeBERTa-large (0.4B) model, we denote our backbone as "GPT-J 12B", which implies around 12.4 billion parameters in total. During generation, we use a calculator as an assistant, following Cobbe et al. (2021). We run all experiments 3 times and report the best result; detailed hyper-parameter settings can be found in Appendix B.1. Our zero-shot setting is similar to the transfer setting in T0 (Sanh et al., 2022) and FLAN (Wei et al., 2022a). All training and testing procedures are done on a DGX station with 8 A100 GPUs.

Zero-shot Results
Table 1 presents the main results on two mathematical reasoning datasets, demonstrating the zero-shot generalization ability. CoRe achieves superior performance on both datasets, demonstrating its capability for mathematical reasoning on unseen datasets. Note that the baselines are several dozen times larger than ours and still underperform our model. The improvement might be explained by two potential reasons. One is that applying the CoRe framework to PLMs can activate their reasoning ability, even though their scales are small (≤ 100B). The other is that self-thinking can provide valuable self-produced data to teach System 1&2. Therefore, the results demonstrate the effectiveness of cooperative working with System 1&2 and self-thinking.

GSM8K Results
Beyond improvements on zero-shot results, we observe that the fine-tuning setting can benefit a lot from our CoRe framework, as shown in Table 3. Compared to the previous fine-tuned SoTA (Cobbe et al., 2021) (GPT-3 175B), CoRe outperforms it with far fewer parameters and much less computation and inference time. Note that it samples 100 solutions for each question while we only search 40 paths. For a comprehensive comparison, we include few-shot results with large-scale PLMs due to the limited number of "fine-tune" competitors. With regard to few-shot methods applied to large-scale PLMs (≥ 100B parameters), CoRe only underperforms PaLM-540B strengthened by chain-of-thought prompts and self-consistency, further demonstrating the effectiveness of our method.

Is guidance important during path searching reasoning?
We argued that it is important to introduce guidance in the loop during reasoning path searching. To validate this argument, we adjust the weight of the reward provided by the verifiers during reasoning. The experiments are conducted using models without self-thinking. Table 4 summarizes the performance on zero-shot datasets with different guidance settings. For "w/o verifiers", the solutions are predicted by the generator only, with "Self-Consistency" applied. As demonstrated in Table 4, guidance from $V_{\text{path}}$ can provide performance gains on SingleOp, with a 20.6% absolute improvement. We further incorporate the guidance from the step-level verifier $V_{\text{step}}$. As described in Eq. (5), by increasing the weight of the reward ($\alpha$) from $V_{\text{step}}$, CoRe achieves a higher accuracy on both SingleOp and MultiArith. Thanks to the feedback and guidance during the reasoning stage, the generator tends to explore paths with a higher reward score more often. As a result, CoRe increases the accuracy on SingleOp from 59.6% to 82.9% and on MultiArith from 92.3% to 96.8%.

How much does self-thinking boost the reasoning ability of a language model?
To examine the effect of self-thinking, we explore it along two axes: 1) the number of iterations and 2) the type of search strategy. Since we apply the self-thinking procedure on the GSM8K dataset, we investigate the performance of models under different settings on GSM8K, as shown in Table 5. First, increasing the number of iterations always improves the performance for both greedy decoding and self-consistency. CoRe reaches saturation in one round, which might be attributed to the fact that System 1&2 learns better and faster on self-generated data by working collaboratively. Second, regardless of the search strategy, self-thinking consistently boosts the model's performance, which verifies that self-thinking boosts the language model's reasoning ability.

Search strategy                 Self-thinking iterations (increasing, left to right)
Generator only (Greedy)         29.9    34.7    34.9
Generator + Self-Consistency    42.0    43.1    45.9
CoRe                            60.0    63.2    61.6

Does self-thinking generalize to other datasets?
We have performed self-thinking on GSM8K and shown that it improves the model's reasoning ability in Sec. 4.3.2. Furthermore, we explore whether the improvement on GSM8K comes at the cost of performance degradation on other datasets, i.e., whether the model overfits the dataset. As presented in

How does performance vary as the number of search iterations changes for different search strategies?
As shown in Fig. 3, accuracy on the 4 datasets consistently increases with the number of search iterations for both search strategies. However, the scaling curves of self-consistency and CoRe are quite different. The performance gain quickly saturates with self-consistency: sampling 40 paths cannot further improve the accuracy, while the scaling curve of CoRe is much steeper. Due to the heuristic algorithm, which requires the model to continue exploring previously generated paths, CoRe starts from a relatively lower level in the beginning, whereas the accuracy quickly improves as the number of search iterations increases. The result demonstrates the effectiveness of CoRe in searching reasoning paths, with a fast growth curve and a slow saturation rate.

Improvements from CoRe
A typical exemplar from GSM8K is presented in Table 7. Greedy decode fails to find a reasonable path due to the limited exploration in the output space.
In contrast, self-consistency samples multiple reasoning paths randomly, resulting in a richer candidate set. Although it occasionally finds some correct solutions, without any guidance it fails to explore the high-quality paths more frequently, thus ending up with a wrong answer obtained by majority voting, as shown in the fourth row.

Table 7: GSM8K samples generated through different searching methods. SC refers to Self-Consistency. The erroneous parts of each path are highlighted in red and the corresponding correct parts within other paths are blue.

Question: Sophia and Rose went together to the market to buy onions and potatoes. Rose bought 4 times the number of onions and potatoes Sophia bought. If Rose bought 12 onions and 4 potatoes, how many onions and potatoes in total did Sophia buy at the market?

Since Rose bought 12 onions, this means there were 12 / 4 = 3 onions that Sophia bought. The number of potatoes that Sophia bought is 4, meaning there were 4 / 4 = 1 potato that Sophia bought. The total number of onions and potatoes that Sophia bought is 3 + 1 = 4.

CoRe Path 2: The number of onions Rose bought is 12 onions / 4 = 3. Sophia bought 4 / 12 = 1 / 3 of the number of onions Rose bought. The number of potatoes Rose bought is 4 potatoes / 4 = 1.33 of the number of potatoes Rose bought is 4 potatoes / 12 onions = 0.33. The total number of onions and potatoes Rose bought at the market is 3 onions + 1 potatoes = 4.

CoRe Path 3: Let x be the number of onions Sophia bought. Rose bought 4 * x onions. The total number of onions and potatoes Rose bought is 4 * x + x = 12. 5 * x = 12. x = 4. Sophia bought 4 onions.

Score: 0.023

Question: Gunter is trying to count the jelly beans in a jar. He asks his friends how many they think are in the jar. One says 80. Another says 20 more than half the first one. A third says 25% more than the first one. What is their average guess?
As a comparison, results generated by CoRe are listed with their scores. Similar to random sampling, the reasoning paths might be partially illogical, even though the final answers happen to be correct. Despite this challenge, CoRe is capable of distinguishing those poor-quality paths from the superior ones thanks to the verifiers. Adhering to the philosophy of cooperative reasoning we have emphasized, the verifiers manage to guide the generator throughout the reasoning procedure with the help of MCTS. Therefore, CoRe enjoys not only the advantage of having a diverse candidate set, but also the merit of being wiser and more efficient during reasoning path searching.

Improvements from Self-Thinking
Table 8 shows an example in which the vanilla model failed to solve the given question, whereas after self-thinking, the model rectified the faulty parts and successfully solved it. This shows that self-thinking boosts language models' inner reasoning ability regardless of the search strategy, which is also demonstrated in Sec. 4.3.2.

Discussion
Although we only fine-tune the language model on GSM8K due to the scarcity of QA datasets annotated with intermediate rationales, zero-shot results on several arithmetic datasets show that basic reasoning capability is transferable across datasets within the same domain. This observation implies that, when it comes to a new domain, we only need to collect a limited number of question-answer pairs with reasoning paths; the model's reasoning ability can then generalize to other unseen datasets and can be further strengthened by our approach CoRe, according to the experimental results.

Conclusions
In this work, we mimic the dual system of human cognition to develop an effective reasoning framework for solving math word problems. The proposed approach consists of two ingredients: the generator as System 1 and the verifiers as System 2, and the overall reasoning is conducted based on their mutual reinforcement. In terms of robustness and generalization, CoRe activates superior reasoning ability in LMs, and thus outperforms PLMs that are dozens of times larger.

Limitations
The outcome on multiple datasets verifies the powerful reasoning ability of our method, which works even on models with only several billion parameters. However, our self-thinking procedure utilizes only one dataset, GSM8K, and the available training set size is only 7.5K. The main reason is the scarcity of high-quality datasets with rich reasoning paths, and collecting such data incurs huge computation costs and expensive human resources. Another limitation is that we have not conducted experiments on bigger language models, such as GPT-3 and PaLM, due to the expensive usage costs and the lack of open-source code. In the future, we will focus on collecting more high-quality labeled data and exploring our method on more powerful language models.

Ethics Statement
In this work, CoRe shows impressive reasoning capability; however, it also comes with social risks. Here, we summarize three possible ethical impacts: i) PLMs with bias, ii) generated data with social stereotypes, and iii) problematic data environments. Since we utilize PLMs as backbones, we note that several works present various potential risks in PLMs (Lucy and Bamman, 2021; Amin and Kabir, 2022). Fortunately, our method supports the replacement of different PLMs. Therefore, we encourage deploying risk-free PLMs to reduce the potential ethical risks. Furthermore, once harmful PLMs are deployed, the self-thinking process might generate undesired data, and those data are fed into language models, which deepens the bias and causes unintended social impacts. To reduce the aforementioned cases, we suggest recording generated sentences. In real-world applications, a good choice is to monitor generated content and then hand it over for human review. In addition to the two risks posed by PLMs, the data in downstream tasks is of great concern. In particular, private data might cause unpredictable influence because of its non-open-source nature. Therefore, we believe that a data cleaning workflow, such as PrivateClean (Krishnan et al., 2016), is necessary to mitigate potential risks. Finally, we encourage open debate about its utilization to increase transparency and reduce the potential for misuse.

A Dataset Details
The mathematical reasoning datasets with details are as follows (Detailed description of the statistics in Table 9).We follow the licenses for their papers.
The dataset in fine-tuning: GSM8K (Cobbe et al., 2021) is a high-quality dataset with reasoning paths. It consists of 8.8K grade school math problems created by human writers, which are divided into a train set (7.5K) and a test set (1.3K). The reasoning paths include 2 to 8 steps involving basic arithmetic operations. Furthermore, we conduct cooperative training and self-thinking on its training set.
The datasets in zero-shot inference: ASDiv-A (Miao et al., 2020) includes diverse math word problems, each of which requires a number as the answer. SingleOp (Roy et al., 2015) consists of elementary math problems with a single operation. SingleEq (Koncel-Kedziorski et al., 2015) is constructed with both single-step and multi-step math problems from mixed sources. MultiArith (Roy and Roth, 2015) includes elementary math problems with multiple steps.

B.1 Hyper-parameters Setting
For the generator and the step verifier, we train them for two epochs. The batch size is set to 16. For the generator, the learning rate (LR) is set to 1e-5 in the first epoch and 1e-6 in the second epoch. For the step verifier, we apply warmup followed by a linearly decaying scheduler; the LR is set to 1e-6 and the warmup ratio is 0.1.
For the path verifier, we train it for three epochs with the batch size set to 128 and the LR set to 1e-5. The same LR scheduler as for the step verifier is applied to the path verifier. We set the gradient clip norm to 1.0 and the sampling temperature to 0.7. The random seed is set to 19990303 throughout the training process.
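The warmup-then-linear-decay schedule used for the verifiers can be sketched as below. The function name, the decay target of zero, and the step counts in the usage lines are assumptions for illustration; only the peak LR (1e-6 for the step verifier) and the warmup ratio (0.1) come from the text.

```python
def warmup_linear_decay(step: int, total_steps: int, base_lr: float,
                        warmup_ratio: float) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 (assumed endpoint)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)


# Hypothetical run of 1000 optimizer steps with the step-verifier settings.
lr_mid_warmup = warmup_linear_decay(50, 1000, base_lr=1e-6, warmup_ratio=0.1)
lr_peak = warmup_linear_decay(100, 1000, base_lr=1e-6, warmup_ratio=0.1)
lr_mid_decay = warmup_linear_decay(550, 1000, base_lr=1e-6, warmup_ratio=0.1)
```

The LR rises to its peak after the first 10% of steps and then falls linearly, reaching half of the peak halfway through the decay phase.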
For MCTS, we set the max number of search iterations to 40 during inference. In expansion, we search 20 tokens each time. In order to avoid expanding too many homogeneous children for the same node, we simply penalize the probability of the first token if it has appeared in other child nodes. We set the max token number to 300 in roll-out and limit the total token number of a reasoning path to 400.
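The first-token penalty can be illustrated with the sketch below. The multiplicative `penalty` factor of 0.5 and the renormalization step are assumptions; the text only states that the probability of an already-used first token is penalized.

```python
from typing import Dict, Set


def penalize_seen_first_tokens(probs: Dict[str, float], seen: Set[str],
                               penalty: float = 0.5) -> Dict[str, float]:
    """Scale down tokens already used as a sibling's first token, then renormalize."""
    scaled = {tok: (p * penalty if tok in seen else p) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: p / total for tok, p in scaled.items()}


# Toy next-token distribution at a node whose existing child starts with "Since".
probs = {"Since": 0.6, "The": 0.3, "Let": 0.1}
adjusted = penalize_seen_first_tokens(probs, seen={"Since"})
```

After the adjustment, unseen openings gain probability mass, so a newly expanded child is more likely to start differently from its siblings.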

B.2 Details of Training Verifiers
Before the two verifiers are fine-tuned, we utilize the generator to sample 100 solutions for each question, following Cobbe et al. (2021). Then we train the two verifiers on the generated data as described in Sec. 3.2 Step 1.3.

B.3 Details of Self-Thinking
In each iteration of self-thinking, we initialize the model with the weights obtained from the previous round so as to save computational costs. Since we use cooperative inference rather than random sampling to generate data for further training, the solutions are expected to be of higher quality. Thus, the number of generated solutions $M$ mentioned in Sec. 3.2 is set to 50 to save computational cost and time. Due to the flexibility of MCTS, we have also tried limiting the search time rather than the number of iterations, which makes the total search time controllable and predictable. Moreover, this allows the model to adaptively adjust the final number of solutions searched for each question, according to the different levels of difficulty of the questions.
In our experiments, we observe that setting the time limit to 320 seconds provides better results than setting the iteration limit to 50, while maintaining approximately the same time consumption.Therefore, we use time control to generate data during self-thinking.
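Time-controlled search can be sketched as a loop that keeps running search iterations until a deadline passes. The `run_iteration` callable and the injectable `clock` are hypothetical names introduced here; in practice an iteration that starts just before the deadline will overrun it slightly.

```python
import time
from typing import Callable, List


def search_with_time_limit(run_iteration: Callable[[], str], time_limit_s: float,
                           clock: Callable[[], float] = time.monotonic) -> List[str]:
    """Run search iterations until the time budget is spent; return the solutions."""
    deadline = clock() + time_limit_s
    solutions: List[str] = []
    while clock() < deadline:
        solutions.append(run_iteration())
    return solutions


class FakeClock:
    """Deterministic clock for the demo: advances by one second per call."""

    def __init__(self) -> None:
        self.t = -1.0

    def __call__(self) -> float:
        self.t += 1.0
        return self.t


solutions = search_with_time_limit(lambda: "path", time_limit_s=3.0, clock=FakeClock())
```

Because the loop is bounded by time rather than by count, questions whose iterations finish quickly naturally end up with more searched solutions than harder ones.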

B.4 Baseline Settings
As shown in Table 1, Instruct GPT-3 is based on the text-davinci-002 version. Moreover, since Kojima et al. (2022) provide different prompt settings, we list them in Table 10. For few-shot scenarios with chain-of-thought prompts, we follow the original paper (Wei et al., 2022c).

[...] reasoning paths reaches 30 and maintains a faster increasing rate after that. As a result, CoRe has a superior performance over Cobbe et al. (2021) on all the datasets and achieves 9.1% and 8.3% improvements over it on ASDiv-A and SingleOp, respectively.

D Future Work
In this work, we focus on measuring how well our method boosts the language model's arithmetic reasoning ability. Nevertheless, we believe that our framework can also be applied seamlessly to other reasoning tasks, e.g., commonsense reasoning and symbolic reasoning. We choose arithmetic reasoning because it is a fundamental type of reasoning task. Additionally, we believe solving arithmetic reasoning is the first step toward a general cognitive reasoning system. In the future, we will explore other reasoning tasks and put more effort into low-resource scenarios.

Figure 1: Comparing our CoRe with popular methods in mathematical logic reasoning tasks.

Algorithm 1: Self-Thinking
Input: Generator $G$; step verifier $V_{\text{step}}$; path verifier $V_{\text{path}}$; dataset $D$.
1: Combine generator and verifiers with a cooperative search algorithm.
...
until performance is saturated.

Figure 3: Zero-shot results with different search strategies in cooperative inference.

We compare CoRe with previous fine-tuned SoTA baselines on four datasets, and results are presented in Table 2. To show the importance of cooperative reasoning, we apply our generator as a baseline. The results demonstrate that, without any guidance, the generator underperforms previous methods on most datasets. Despite the gain from self-consistency, it still lags behind other fine-tuned SoTAs. After applying our method CoRe, however, it surpasses previous fine-tuned SoTAs on all datasets in a zero-shot setting. The results clearly demonstrate the capability of CoRe to greatly boost PLMs' reasoning ability.

Table 2: Zero-shot results vs. previous fine-tuned SoTA results on math reasoning tasks. The previous SoTA baselines are obtained from: a: (Lan et al., 2022), b: LogicForm (Liang et al., 2016), c: UNITDEP (Roy and Roth, 2017), d: Relevance and LCA operation classifier (Roy and Roth, 2015). The best scores are in bold.

Table 5: Results on GSM8K with models that have undergone different numbers of self-thinking iterations. Outcomes of various search strategies are provided.

Table 6: Zero-shot results with different numbers of self-thinking iterations for the generator and the verifiers, respectively.

Table 8: An example from GSM8K: the model with self-thinking reasoned correctly, while the model without self-thinking generated a wrong reasoning path and therefore failed.

We replicate the work of Cobbe et al. (2021) with GPT-J and report the results in Table 11 for comprehensive comparison. CoRe fully surpasses Cobbe et al. (2021) when the number of

Table 10: Prompt setting for few-shot baselines.

Table 11: Comparison between Cobbe et al. (2021) and CoRe with GPT-J as the backbone model. The best scores are in bold.