Large Language Models are Better Reasoners with Self-Verification

Recently, with chain of thought (CoT) prompting, large language models (LLMs), e.g., GPT-3, have shown strong reasoning ability in several natural language processing tasks such as arithmetic, commonsense, and logical reasoning. However, LLMs with CoT require multi-step prompting and multi-token prediction, which is highly sensitive to individual mistakes and vulnerable to error accumulation. These issues make it necessary for LLMs to be able to verify their answers. In fact, after inferring conclusions in some decision-making tasks, people often check them by re-verifying the steps to avoid mistakes. In this paper, we propose and prove that LLMs also have a similar self-verification ability. We take the conclusion obtained by CoT as one of the conditions for solving the original problem. By performing a backward verification of the answers that the LLM deduces for itself, we can obtain interpretable answer-validation scores and select the candidate answer with the highest score. Experimental results demonstrate that the proposed method improves reasoning performance on various arithmetic, commonsense, and logical reasoning datasets. Our code is publicly available at: https://github.com/WENGSYX/Self-Verification.


Introduction
The ability to reason during thinking and decision-making is an essential aspect of human intelligence. Recently, chain of thought (CoT) prompting (Wei et al., 2022) has proven to be an effective way to solve arithmetic, commonsense, and logical reasoning tasks with large language models (LLMs), helping LLMs simulate the human thinking process when solving complex natural language processing (NLP) tasks. CoT guides LLMs to generate a series of intermediate reasoning steps to address complex problems rather than just predicting a final answer. This approach has shown advanced performance on several challenging NLP tasks, even when using only a few or no training samples (Madaan et al., 2022; Saparov and He, 2022; Fu et al., 2022).

* These authors contributed equally to this work.
Although CoT enables large models to solve complex reasoning tasks, it is highly sensitive to individual mistakes and vulnerable to error accumulation (Shen et al., 2021). CoT requires multi-step prompting and multi-token prediction, which may lack robustness when addressing complex reasoning tasks with the autoregressive mechanism (Vaswani et al., 2017; Radford et al., 2019; Brown et al., 2020; Zhang et al., 2022a). If a tiny mistake occurs, it can change the meaning of the whole statement (Xiao et al., 2022), leading to incorrect answers (Cobbe et al., 2021). This is especially problematic when using CoT for multi-step precise reasoning (such as mathematical calculation). Due to the lack of an error-correction mechanism, it is difficult for LLMs to recover from errors introduced over multiple reasoning steps.
Previous methods address this issue by training an additional verifier to evaluate the correctness of the model's output (Shen et al., 2021; Li et al., 2022). However, these works have some drawbacks. On the one hand, training a verifier requires substantial human annotation and additional fine-tuned models, which limits its widespread use in other tasks and domains. On the other hand, a verifier fine-tuned from a language model is not easily explainable, making it difficult to assess the model's reliability based on its output scores. Therefore, the challenge in obtaining a better reasoner based on LLMs is to build a verifier that avoids manual annotation and additional training, so that it can be readily extended and migrated to other fields and tasks.
Figure 1: An example question (Jackie has 10 apples; Adam has 8 apples; how many more apples does Jackie have than Adam?). To mimic the self-verification ability of humans, we predict the accuracy of f_C by predicting whether the original condition f_1 or f_2 holds based on this conclusion.

To address this challenge and overcome the limitations of training verifiers, we propose utilizing LLMs as reasoners with self-verification for selecting better prediction results. In numerous decision-making tasks, humans often perform self-verification of inferred conclusions to mitigate mistakes (Poole and Mackworth, 2010). In this paper, we propose and demonstrate that LLMs possess a similar self-verification ability. Better reasoning with CoT is carried out in the following two steps: Forward Reasoning and Backward Verification. Specifically, in Forward Reasoning, LLM reasoners generate candidate answers using CoT, and the question and candidate answers form different conclusions to be verified. In Backward Verification, we mask an original condition and predict its value using another CoT. We rank candidate conclusions by a verification score, which is calculated by assessing the consistency between the predicted and original condition values. For example, as shown in Figure 1, by taking f_2 and f_Y as conditions to predict the value of the condition attribute in f_1, the correctness of f_Y can be evaluated by comparing the consistency of the predicted f̂_1 with the original f_1.
Our method employs LLMs for self-verification with only a few prompts, eliminating the need for fine-tuning or gradient updates. This approach enables automatic verification of multiple candidate answers and their corresponding conclusions, mitigating deviations from the correct chain of thought in the original CoT. Our verification score arises from evaluating each step during the backward verification phase, rather than from the direct output of a neural network model (Cobbe et al., 2021; Li et al., 2022), enhancing the explainability of prediction outcomes and solution processes. We conducted experiments on various open-source datasets for mathematical, commonsense, and logical reasoning tasks, achieving state-of-the-art results (e.g., 60.8 → 65.1 on GSM8K (Cobbe et al., 2021), 91.01 → 93.40 on SingleEq (Koncel-Kedziorski et al., 2015)). In addition, we combine our method with approaches that improve forward reasoning, such as self-consistency (Wang et al., 2023b) and Least-to-Most prompting (Zhou et al., 2023). The experimental results show that our method also improves upon these forward reasoning approaches.
Our contributions are summarized as follows: 1. We propose and prove that large language models (LLMs) can self-verify their prediction results. The proposed method provides interpretable verification scores without the need to train additional verifiers.
2. We conducted extensive experiments with multiple LLMs, and the experimental results on multiple mathematical, commonsense, and logical reasoning datasets show significant improvements over the baselines.
3. We introduce True-False Item Verification for general tasks in the backward verification stage and propose Condition Mask Verification based on the characteristics of arithmetic tasks. Our method can be applied to a wide range of reasoning datasets, potentially paving the way for self-verification to become a new paradigm following pre-training and prompt learning, thus motivating further exploration of the capabilities of LLMs.
Related Work

In-context Learning. Large language models such as GPT-3 exhibit impressive few-shot learning ability (Lu et al., 2022; Qiao et al., 2022): it requires only filling a few exemplars into the context as prompts, without the need for fine-tuning on a dataset of training examples. However, this approach struggles with tasks requiring complex reasoning (Rae et al., 2021), which has driven researchers to explore other prompting strategies. CoT (Wei et al., 2022) is a chained reasoning approach that inserts a multi-step reasoning path before generating the final answer. Wang et al. (2023b) proposed a self-consistency decoding strategy that votes over reasoning paths, and Kojima et al. demonstrated that LLMs can act as zero-shot reasoners through the prompt "Let's think step by step". These methods focus on constructing the CoT but ignore the high sensitivity of LLMs to individual mistakes when generating these chains, so some conclusions reached by CoT may be unreliable. In this paper, we show that LLMs can self-verify their conclusions.
Answer Verification. Evaluating and reordering candidate answers with a trained language-understanding model is a common approach (Kushman et al., 2014). Cobbe et al. (2021) fine-tune GPT-3 as a verifier, which calculates token-level and solution-level verification scores for a predicted result. However, these methods all need additional annotations. In our work, we require no training examples and can provide an explainable verification score.

The Proposed Method
The proposed method is used to verify prediction results. As shown in Figure 2, the process mainly consists of two steps. The first step, forward reasoning, is similar to normal CoT, except that multiple candidate answers are generated through sampling decoding. In the second step, we calculate a verification score for each candidate answer using the self-verification method, and the answer with the highest score is selected as the final answer.

Forward Reasoning
In forward reasoning, the LLM reasoners generate candidate answers with the chain of thought prompting.We augment the input with several CoT prompts similar to the original query and then send it to the LLM.The LLM then performs sampling decoding to generate multiple candidates for verification.
As shown in Figure 2, for a reasoning task, the large language model LLM is given a question X accompanied by a chain of thought prompt set C. In the few-shot setting, the whole prompt also contains other question-CoT-answer tuples. The input X can be further subdivided into X = {f_1, f_2, . . ., f_R, q}, where each f_i is a condition (fact) and q is a question, both represented as natural language clauses or sub-sentences.
Specifically, in order to generate step-by-step solutions with CoT, we follow Wei et al. (2022) and design a CoT prompt set C for each reasoning dataset (e.g., the GSM8K dataset), which contains n samples; each sample has a question Ẋ, a chain of thought ṫ, and an answer ẏ. These samples are used as input at test time. The examples in C are concatenated as a prompt: C = (Ẋ_0, ṫ_0, ẏ_0); (Ẋ_1, ṫ_1, ẏ_1); . . .; (Ẋ_n, ṫ_n, ẏ_n). The LLM is therefore required to follow the prompt C to generate a chain of thought t_CoT before generating the final answer y:

P(y | C, X) = P(t_CoT | C, X) × P(y | C, X, t_CoT)

To ensure diversity among answers, we adopt sampling decoding (Radford et al., 2019) to generate y for K times. Specifically, sampling decoding is a random decoding method that selects the next word by sampling from the probability distribution over possible words at each step. Multiple candidate answers can be obtained by repeatedly applying sampling decoding. For example, we generate "18" and "2" as candidate answers in the example of Figure 2.
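As a concrete illustration, forward reasoning can be sketched as a loop of K sampled completions followed by answer extraction. This is a minimal sketch under our own naming, not the released code: `sample_completion` is a hypothetical stand-in for a temperature-sampled LLM call, and the extraction pattern assumes completions end with "The answer is ..." as in the CoT prompt format described above.

```python
import re

def forward_reasoning(question, cot_prompt, sample_completion, k=5):
    """Sample K chain-of-thought completions and extract candidate answers.

    `sample_completion` is a hypothetical LLM sampling call (temperature > 0)
    returning the generated chain of thought plus the final answer as text.
    """
    candidates = []
    for _ in range(k):
        completion = sample_completion(f"{cot_prompt}\n\nQ: {question}\nA:")
        # CoT exemplars end with "The answer is ...", so the sampled
        # completion is expected to follow the same format.
        match = re.search(r"The answer is\s*(-?[\d.,]+)", completion)
        if match:
            candidates.append(match.group(1).rstrip(".").replace(",", ""))
    return candidates
```

Repeated sampling naturally yields duplicates; keeping them lets a later stage weight frequent answers, or the list can be deduplicated before verification.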

Backward Verification
Step 1 may generate multiple different answers; this step verifies them and selects the best one. Backward verification involves several sub-steps. First, the original question with each candidate answer is rewritten as a conclusion and then added as a new condition (shown in incarnadine in Figure 2). Then, we consider two methods to construct new questions. In general QA tasks, True-False Item Verification presents all the conditions and asks the LLM whether they are mutually consistent; it has broad applicability. In arithmetic reasoning tasks, since masking a definite condition can indicate the reasoning direction for the language model, we propose the Condition Mask Verification method to design questions for the verification stage. Finally, we run multiple verification passes to compute a verification score by comparing the consistency between the predicted condition value and the original masked condition value, and select the candidate answer with the highest score as the final answer.

Rewritten Candidate Conclusion
In the backward verification step, we first rewrite the original question together with the candidate answer as a conclusion and then add it as a new condition. Specifically, we use the instruction prompt "Please change the questions and answers into complete declarative sentences [q] The answer is [y]" to let the LLM turn q and y into a new declarative sentence f_Y. As shown in Figure 2, we can rewrite the question and conclusion as "Jackie has 18 apples more than Adam".
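The rewriting step is just prompt assembly. A minimal sketch (the helper name is ours; the template string is the instruction prompt quoted above, with the [q] and [y] placeholders filled in):

```python
def build_rewrite_prompt(question, answer):
    """Instantiate the instruction prompt that asks the LLM to merge a
    question q and a candidate answer y into one declarative sentence f_Y."""
    return ("Please change the questions and answers into complete "
            f"declarative sentences {question} The answer is {answer}")
```

The LLM's completion for this prompt (e.g., "Jackie has 18 apples more than Adam") then serves as the new condition f_Y in the verification question.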

Condition Masking
For question generation, the diversity of problems makes it difficult to balance coherence and factual consistency between questions and answers in practice (Sun et al., 2018; Ji et al., 2022). To tackle this issue, we construct explicit questions that the language model can answer precisely.
True-False Item Verification (TFV).This approach can be applied to a wide range of reasoning QA tasks.We directly add "Do it is correct (True or False)?" after all the conditions, requiring the LLM to self-evaluate the correctness of these conditions.
Condition Mask Verification (CMV). For arithmetic tasks, we further use regular expressions to filter out specific conditions, such as numbers, and then mask them in turn. If we do not mask all conditions but instead randomly select one, an unnecessary condition may be masked, which significantly impacts the verification answer. For example, in "Dana worked 9 hours on Friday, 10 hours on Saturday, and 3 hours on Sunday. She earns $13 per hour. How much money did Dana earn in weekend?", condition 1 (9 hours) does not affect the conclusion, so it is difficult to predict it correctly. We replace each occurrence of f_i in the original X with "X" in turn and ask the LLM to re-predict it. Then we rewrite the question: for example, we might find a value in f_1 and replace it with "X", and then append "What is the answer of 'X'?" to the end of the new question, effectively turning it into an equation. This technique helps guide the language model toward the correct answer.
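A minimal sketch of the masking step, under the assumption (as in the arithmetic datasets) that every number in the question is a condition; the function name and regular expression are ours, not the paper's:

```python
import re

def mask_conditions(question):
    """Produce one masked variant per numeric condition.

    Each number is replaced by 'X' in turn and the verification question is
    appended, so the LLM must re-derive the masked value from the remaining
    conditions plus the candidate conclusion.
    Returns (masked_question, original_value) pairs.
    """
    variants = []
    for m in re.finditer(r"\d+(?:\.\d+)?", question):
        masked = question[:m.start()] + "'X'" + question[m.end():]
        variants.append((masked + " What is the answer of 'X'?", m.group()))
    return variants
```

Masking every condition in turn, rather than a single random one, avoids the failure mode described above where an unnecessary condition is masked and becomes unpredictable.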

Verification Score Calculation
The backward-verification chain of thought is similar to solving an equation. We design a chain of thought prompt, as in forward reasoning, to guide the LLM in generating the solving process, and feed the newly constructed sentences into the LLM. For TFV, we directly count the number of answers that are True as the score; for CMV, we match the final result against the masked condition.
Due to the limited performance of the LLM itself, if each condition is verified only once in the backward verification step, candidates easily receive the same score, resulting in a lack of differentiation. To address this, we repeat the sampling decoding process P times, so that the verification score more accurately reflects the model's confidence in a given conclusion.
The verification score for a candidate answer y_k is calculated as:

Score(y_k) = Σ_{p=1}^{P} 1(f̂_p = f)

where f is the original value of the masked condition, f̂_p is the value predicted in the p-th verification pass, and 1(·) is an indicator function.
Finally, we select the one with the highest verification score from the K candidate answers generated as a result.
For example, for CMV in step 3) Verification of Figure 2, we match the results generated by the LLM's self-verification against the masked conditions. There is one "10" among the conclusions for A_1, so its verification score is 1. There are four correct results for A_2, so its verification score is 4, and we finally choose A_2, which has the highest verification score, as the final conclusion.
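In code, both scoring rules reduce to counting matches over the P sampled verifications. A sketch, assuming each verification output has already been reduced to its final answer string (the function names are ours):

```python
def cmv_score(predictions, original_value):
    """CMV: count how many of the P re-predicted condition values equal the
    original masked value (the indicator-sum verification score)."""
    return sum(1 for p in predictions if p == original_value)

def tfv_score(predictions):
    """TFV: count how many of the P self-checks answered True."""
    return sum(1 for p in predictions if p == "True")

def select_answer(candidates, scores):
    """Return the candidate answer with the highest verification score."""
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```

This mirrors the Figure 2 example: with scores 1 for A_1 and 4 for A_2, the selection step returns A_2.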

Task and Dataset
We evaluated eight datasets across three reasoning tasks: arithmetic reasoning, commonsense reasoning, and logical reasoning. These datasets are highly heterogeneous in their input formats (see Appendix A.2 for a detailed description of each dataset; examples from the different datasets are given in Table 7 of Appendix A.5).
• Commonsense Reasoning. CommonsenseQA (CSQA) (Talmor et al., 2018) is the most typical dataset for this task; it requires commonsense knowledge about the world to accurately answer questions with complex semantics.
• Logical Reasoning.Date Understanding (DU) (Srivastava et al., 2022) involves inferring a date from a given context.

Prompts
We conducted all experiments in the few-shot setting without any fine-tuning of the original LLMs. To ensure a fair comparison, we used the same prompts as Wei et al. (2022) for forward reasoning. We made several changes to the prompts for backward verification (the details are shown in Appendix A.6).

Implementation
In each experiment, we perform CoT prompting on the LLMs, which then generate conclusions (answers) by sampling decoding without top-k truncation. In forward reasoning, we generate K = 5 candidate answers (conclusions). In backward verification, each candidate conclusion is verified P = 10 times, and the maximum token length of each decoding is 168. After the LLM generates the output, we select only the part of the text that conforms to the conclusion format; Appendix A.1 shows the specific strategy for each task. In addition, to ensure a fair comparison, we ran each experiment three times and report the average result.
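Putting the pieces together, the whole pipeline with the reported hyperparameters (K = 5 forward samples, P = 10 verification samples) can be sketched as follows. This is our illustrative composition under stated assumptions, not the released code: `sample_completion` is a hypothetical LLM sampling call, and for brevity only the first numeric condition is masked.

```python
import re

ANSWER_RE = re.compile(r"The answer is\s*(-?[\d.]+)")

def self_verify(question, sample_completion, k=5, p=10):
    """Forward reasoning (K samples) then backward verification (P samples
    per candidate); returns the candidate with the highest score."""
    # Forward reasoning: sample K completions and collect candidate answers.
    candidates = set()
    for _ in range(k):
        m = ANSWER_RE.search(sample_completion(f"Q: {question}\nA:"))
        if m:
            candidates.add(m.group(1).rstrip("."))

    # Backward verification: mask the first numeric condition, append the
    # candidate conclusion, and ask the model to re-predict the mask P times.
    cond = re.search(r"\d+(?:\.\d+)?", question)
    scores = {}
    for cand in candidates:
        masked = (question[:cond.start()] + "'X'" + question[cond.end():]
                  + f" The answer is {cand}. What is the answer of 'X'?")
        preds = (ANSWER_RE.search(sample_completion(masked)) for _ in range(p))
        scores[cand] = sum(1 for m in preds
                           if m and m.group(1).rstrip(".") == cond.group())
    return max(scores, key=scores.get)
```

In practice each stage would carry its own CoT exemplars in the prompt, and all numeric conditions would be masked in turn as described in the method section.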

Result and Analysis
The main experimental results are shown in Table 1.
The table shows that the proposed self-verification method (SV) improves previous methods on all datasets. Our method achieves new state-of-the-art (SOTA) performance on six of the eight datasets. Appendix A.5 shows specific examples of language model self-verification for each dataset. Additionally, we observe that self-verification leads to an average increase of 2.33% for the high-performing Instruct-GPT model, which indicates that models with strong forward reasoning capabilities also benefit from the self-verification mechanism. The detailed experimental conclusions and analysis are as follows.

The current self-verification method is more suitable for arithmetic reasoning tasks than for other reasoning tasks. We find in Table 1 that the average performance improvement on arithmetic reasoning tasks (1.67%/2.84%↑) is higher than on the other reasoning tasks (0.62%/0.78%↑). We believe the reason is that the required mask conditions are easier to find for arithmetic reasoning tasks, whereas the other reasoning tasks use TFV, which cannot pinpoint the exact conditions. In future work, we will consider targeted condition selection and masking for the other reasoning tasks.
Figure 3: The self-verification ability of models of different sizes.

The self-verification method can be combined with improved methods for forward reasoning. We report the results of combining self-consistency or PAL separately at the bottom of Table 1. Specifically, for self-consistency, we use the top-2 candidate results obtained from self-consistency in the forward reasoning stage and then use self-verification to re-rank them; for PAL, we require the generation of runnable programs in forward reasoning to obtain candidate answers. We find that this approach still achieves better performance than self-consistency alone, demonstrating that self-verification can be combined with a series of existing methods that improve forward computation to achieve further gains. We believe that self-verification re-ranks candidate answers from the perspective of backward validation, providing more robust results.
Larger language models are better reasoners with self-verification. Figure 3 shows GPT-3's capability with parameters from 0.4B to 175B. The experimental results show that when the model is small, its self-verification ability is weak, even worse than the original CoT performance. This result aligns with the few-shot experiment results in Wei et al. (2022). The most likely reason is that self-verification is also an emergent ability that appears in larger models. We think that larger language models are able to generate more robust and accurate results in in-context learning (Ho et al., 2022).

With different numbers of few-shot exemplars, the reasoning ability of models using self-verification is significantly improved. Figure 4 demonstrates the impact of different sample sizes on three arithmetic reasoning datasets. We observe that the self-verification method exhibits greater robustness with smaller samples, even with only 2 shots (retaining 99.6% of the 8-shot performance, while CoT retains only 98.7%). In addition, we find that even with only four samples (2 CoT samples + 2 self-verification samples), self-verification outperforms CoT with eight samples, which highlights the importance of answer verification in scenarios with limited data.
The more verification conditions are used, the better the self-verification reasoning ability. Figure 5 shows the effect of using a single conditional mask on six arithmetic datasets for Condition Mask Verification. As each number in these datasets' inputs can be treated as a condition, we can study the impact of increasing the number of verification conditions. In most experiments, we found that the multi-condition mask performed better than the single-condition mask, and both performed better than the original CoT. These results suggest that the accuracy of verification scores improves as the number of available conditions increases.

Fewer computational resources can also improve performance through self-verification. In Figure 6, we show the results of varying the number P of generations in backward verification. We find that even when P = 2, with only a small increase in computational overhead, there is still an improvement over the CoT baseline. Considering that performance increases only slowly once P reaches 10, we recommend choosing an appropriate value of P (e.g., P = 10) to balance performance and resource consumption.
Masked conditions can guide the LLMs to reason more effectively. As shown in Figure 7, we compared the results of using CMV (Condition Mask Verification) and TFV (True-False Item Verification) for self-verification. We found that CMV generally performs better than TFV. We believe this is because the lack of an explicit goal can lead to underuse of the existing conclusions, so CMV is more helpful in eliciting the self-verification ability of the model. However, due to its simplicity, TFV can be applied to a variety of tasks (including commonsense and logical reasoning, both of which improve over the CoT baseline), making it highly adaptable to different scenarios.

Conclusion
In this study, we show that large language models have a strong self-verification ability, allowing them to accurately assess the conclusions they generate. We propose a novel method that uses self-verification to produce interpretable scores for ranking results in few-shot tasks. Our approach demonstrates the potential of self-verification to improve the accuracy and reliability of large language models in reasoning tasks. By relying on the self-verification ability of large language models, we significantly improved accuracy on three types of reasoning tasks. Our results suggest that self-verification may be an important step toward achieving human-like intelligence in artificial intelligence systems.

Limitations
Our self-verification method relies on large language models. It uses prompts to guide the models in verifying their own results, but it is important to note that these prompts are artificially constructed and may introduce bias. The method's effectiveness is limited by whether the correct answer is present among the candidate conclusions produced by the LLMs, and therefore depends on the model's ability to reason forward correctly. While our method can improve LLMs' accuracy on reasoning tasks, it does not focus on the reasoning process itself but instead on the conclusions reached through reasoning.

A.2 Dataset Details
Our method is evaluated on eight benchmark datasets that cover arithmetic reasoning, commonsense reasoning, and logical reasoning tasks.The statistics of the datasets are shown in Table 5.
We list the details for all datasets used in this paper.
• GSM8K: https://github.com/openai/grade-school-math

The main experiment was run from November 25th to December 10th, the single-condition experiment from November 20th to 25th, the few-CoT-prompts experiment on December 12th, the True-False Item Verification experiment from December 12th to 15th, the different model sizes experiment on December 16th, and the computational resource experiment on December 18th.

A.4 Self-Verification Bias
LLMs' self-verification can find the correct answer, but it may also wrongly judge an incorrect answer.
In Table 6, we present more detailed experimental results on LLMs using self-verification to check their own results. During the forward reasoning step, the LLM generates 1-5 candidate conclusions per sample, which may be correct or incorrect. We then use the LLM with self-verification to verify these conclusions and count the LLM's accuracy in verifying correct and incorrect conclusions. We found that the LLM has higher accuracy when verifying a correct conclusion, but there is room for improvement in its accuracy when verifying an incorrect conclusion. This may be caused by arithmetic errors or chain of thought errors during backward verification. We will address this problem in future work.

A.5 Additional Experiment Results
In Table 4, we show real examples of generation with and without multiple condition masks. We found that if only the first condition mask is used, the score is zero, whereas multiple condition masks yield more accurate verification scores. A further table shows the difference between (A) Conditional Masked Verification and (B) True-False Item Verification; the two differ in whether a condition is masked and in the template of the question.
Finally, we generated some self-verification examples using the Instruct-GPT (code-davinci-002) model, as shown in Table 7.

A.6 Forward reasoning prompt
For fairness, we used the same CoT prompts as Wei et al. (2022) in forward reasoning.
As shown in

Multiple-Choice
Pick out the first capital letter encountered in the text.

True or False
Pick out the first "True" or "False" encountered in the text after removing unnecessary letters.

A: If Shawn now has 9 toys and his parents gave him two each, then he originally had 9 - 2 - 2 = 5, which is consistent with the theme. The answer is True.
Q: "Shawn has 'X' toys. For Christmas, he got two toys each from his mom and dad. He has 9 toys now." What is the answer of 'X'?
A: Shawn started with X toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. X + 4 = 9, X = 9 - 4, X = 5. The answer is 5.

As shown in Table 12, Table 13, Table 14, Table 15.

A.8 True-False Item Verification prompt
In order to test True-False Item Verification, we designed some prompts to guide the LLMs to consider whether the conclusion is wrong.
As shown in Table 16, Table 17.
EXAMPLE

Q: Tim wanted to make lemonade for a pool party. For a gallon of lemonade, his recipe called for 1 cup of fresh lemon juice. He found that 6 lemons would yield 1 cup of juice. He figured he would need to make 4 gallons of lemonade for the party. His best friend Allen asked if Tim could make an extra gallon for him that was twice as tart as the other gallons.
How many lemons will Tim need?
A: The answer is 36. (That's the right answer.)

True-False Item Verification
If Tim will need 36 lemons, and he will need to make 4 gallons of lemonade for the party, so he will need 4 * 6 = 24 lemons; he will need to make an extra gallon for him that was twice as tart as the other gallons, so he will need 24 * 2 = 48 lemons; 48 != 36, which is different from the theme. The answer is False. [✗]

Conditional Masked Verification
Tim wanted to make lemonade for a pool party. For a gallon of lemonade, his recipe called for X cup of fresh lemon juice. He found that 6 lemons would yield 1 cup of juice. He figured he would need to make 4 gallons of lemonade for the party. His best friend Allen asked if Tim could make an extra gallon for him that was twice as tart as the other gallons. Tim will need 36.0 lemons. X * 6 * 4 + X * 6 * 2 = 36, 36 * X = 36, X = 1. The answer is 1.
[✓]

Table 6: We generated five groups of candidate answers using Instruct-GPT for each of the datasets. Then, we used Instruct-GPT's self-verification to judge them one by one. We want to know: 1) whether the LLM judges a correct conclusion to be correct, and 2) whether the LLM judges an incorrect conclusion to be incorrect.
Table 7: Example outputs by self-verification for each datasets.

Dataset Question
Self-Verification

GSM8K Q: Matthew has a collection of 12 unique toy soldiers. He wants to sell them for a fair price. He found a buyer who is willing to pay for half his collection $5 per toy, and for the other half 'X' per toy. If Matthew would agree to that offer, he would earn 72.0 dollars.
A: Matthew has 12 unique toy soldiers. He wants to sell them for a fair price. He found a buyer who is willing to pay for half his collection $5 per toy, and for the other half $X per toy. If Matthew would agree to that offer, he would earn 72.0 dollars. 12 / 2 = 6, 6 * 5 = 30, 12 - 6 = 6, 6 * X = 42, X = 42 / 6, X = 7. The answer is 7.

A: Sam grew 4 watermelons. But the rabbits ate X watermelons. So, we can write the following equation: 4 - X = 1, X = 4 - 1, X = 3. The answer is 3. [✓]

Q: Melanie, Benny, Sandy, and Jessica each have 'X' dozen Pokemon cards. They have 432 pokemon cards in all.
A: Melanie has X dozen Pokemon cards. Benny has X dozen Pokemon cards. Sandy has X dozen Pokemon cards. Jessica has X dozen Pokemon cards. So, they have 4X dozen Pokemon cards in total. X * 4 = 432, X = 432 / 4, X = 108. The answer is 108.
[✗] (GT: 9)

AddSub Q: Karin's science class weighed plastic rings for an experiment. They found that the orange ring weighed 'X' ounce, the purple ring weighed 0.3333333333333333 ounce, and the white ring weighed 0.4166666666666667 ounce. The total weight of the plastic rings is 0.8333333333333334.

[✓]
Q: There were 6 roses in the vase. Mary cut some roses from her flower garden. There are now 'X' roses in the vase. She cut 10 roses.
A: There were originally 6 roses in the vase. Mary cut 10 roses from her flower garden. So there are now X roses in the vase. <code> 6 - 10 = X X = -4 </code> The answer is -4. [✗] (GT: 16)

A: Luke had 3 new cards and X old cards. He would use 4.0 pages. So, we can write the following equation: 3 + X = 3 * 4.0, 3 + X = 12, X = 12 - 3, X = 9. The answer is 9.

Continued on next page
[✓] Q: Will bought 7 boxes of chocolate candy and gave 3 to his little brother.If each box has 'X' pieces inside it, Will still have 16 pieces.
A: Will bought 7 boxes of chocolate candy. Each box has X pieces inside it. He gave 3 boxes to his little brother. So he has 4 boxes left. 7 * X = Y, Y - 3 * X = 16, Y = 16 + 3 * X, Y = 16 + 3 * X. The answer is 16.
A: Jake has 3 fewer peaches and 10 more apples than Steven. Steven has X peaches and 15 apples. Steven has 3.0 more apples than peaches. X + 3 = 15, X = 15 - 3, X = 12. The answer is 12. [✓]

Q: An industrial machine can make 6 shirts a minute. It worked for 5 minutes yesterday and for 'X' minutes today. The machine made 72 shirts today.
The answer is 7.
[✗] (GT: 12)

CommonsenseQA Q: If there is a pond with trees around it, it likely located in forest.
A: The forest is likely located where there is a pond with trees around it. The answer is Yes. [✓] Q: The flasks was used to distill elements, it is used in chemistry lab.
A: The answer should be the place where the flask is used. But the chemistry lab is not a suitable place where the flask is used.

PROMPT FOR MATH WORD PROBLEMS

Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
A: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6.
The answer is 6.
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
The answer is 5.
Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8.
Q: Shawn has five toys.For Christmas, he got two toys each from his mom and dad.How many toys does he have now?
A: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The answer is 9.
Q: There were nine computers in the server room.Five more computers were installed each day, from monday to thursday.How many computers are now in the server room?
A: There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The answer is 29.
Q: Michael had 58 golf balls. On Tuesday, he lost 23 golf balls. On Wednesday, he lost 2 more. How many golf balls did he have at the end of Wednesday?
Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
The answer is 8.

A: The answer should be the feeling of someone getting divorced who was doing all the work. Of the above choices, the closest feeling is bitterness. So the answer is (c).
A: The Blotter is used to absorb extra ink from a fountain pen. The answer is Yes.
Q: "Television requires cable." Judge whether this statement is normal (yes or no).
A: The television is an electrical appliance; it needs electricity, so it requires cable. The answer is Yes.
Q: "The fox walked from the city into the forest, it was looking for a hen house." Judge whether this statement is normal (yes or no).
A: The hen house is not in the forest, so the fox does not go to the hen house. The answer is No.
Q: "Sammy wanted to go to where the people were. He might go to populated areas." Judge whether this statement is normal (yes or no).
A: There are many people in populated areas, so he really would go there. The answer is Yes.
Q: "The grapes are put in the fruit market just before checking out." Judge whether this statement is normal (yes or no).
A: The answer should be the place where grocery items are placed before checking out. But the fruit market is not a suitable place where grocery items are placed. The answer is No.
Q: "Google Maps and other highway and street GPS services have replaced the united states." Judge whether this statement is normal (yes or no).
A: The United States is a country and Google Maps is a map, so Google Maps cannot replace the United States. The answer is No.
Q: "The wife who was doing all the work felt bitterness before getting a divorce." Judge whether this statement is normal (yes or no).
A: The wife who was doing all the work got divorced. So she felt bitterness. The answer is Yes.
Table 16: Few-shot exemplars for full chain of thought prompt for AQuA when True-False Item Verification.
PROMPT FOR AQUA ALGEBRAIC WORD PROBLEMS
Q: "John found that the average of 15 numbers is 40. If 10 is added to each number then the mean of the numbers is 50." Is it correct (True or False)?
A: The new mean would be 50. The average of 15 numbers is 40; if 10 is added to each number, then the mean of the numbers also increases by 10. 50 - 40 = 10. The answer is True.
Q: "If a / b = 3/4 and 8a + 5b = 22, then the value of a is 3." Is it correct (True or False)?
Q: "A person is traveling at 20 km/hr and reached his destination in 2.5 hr, then the distance is 65 km." Is it correct (True or False)?
A: If 65 km is driven at 20 km/hr, the driving time is 65 km / 20 km/hr = 3.25 hr, but he reached his destination in 2.5 hr. The answer is False.
Q: "There were 9 computers in the server room. Five more computers were installed each day, from Monday to Thursday. There are 29 computers in the server room." Is it correct (True or False)?
A: There are 29 computers in the server room. For each of 4 days, 5 more computers were added. 5 * 4 = 20 computers were added. So there were originally 29 - 20 = 9 computers. The answer is True.
Table 17: Few-shot exemplars for full chain of thought prompt for math word problems when True-False Item Verification. This set of exemplars was used for all math word problem datasets except AQuA.
PROMPT FOR MATH WORD PROBLEMS
Q: 'There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. The grove workers planted 4 trees today.' Is it correct (True or False)?
A: If the grove workers plant 4 trees today and there will be 21 trees after they are done, 21 - 4 = 17, there are 17 trees in the grove, but actually there are 15 trees, 17 != 15, which is different from the theme. The answer is False.
Q: 'If there are 3 cars in the parking lot and 2 more cars arrive, there are 5 cars in the parking lot.' Is it correct (True or False)?
A: If there will be 5 cars in the parking lot, subtract the 2 cars that will arrive, 5 - 2 = 3, so there are 3 cars in the parking lot, which is consistent with the theme. The answer is True.
Q: 'Leah had 32 chocolates and her sister had 42. If they ate 35, they have 39 pieces left in total.' Is it correct (True or False)?
A: If there are 39 pieces of chocolate left and 35 pieces were eaten, Leah and her sister had 39 + 35 = 74 in total. Her sister had 42, so Leah had 74 - 42 = 32, which is consistent with the theme. The answer is True.
Q: 'Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. Jason gave Denny 6 lollipops.' Is it correct (True or False)?
A: If Jason gave Denny 6 lollipops, and Jason now has 12 lollipops, then Jason originally had 6 + 12 = 18 lollipops, 18 != 20, which is different from the theme. The answer is False.
Q: 'Shawn has five toys. For Christmas, he got two toys each from his mom and dad. He has 9 toys now.' Is it correct (True or False)?
A: If Shawn now has 9 toys and his parents gave him two each, then he originally had 9 - 2 - 2 = 5, which is consistent with the theme. The answer is True.
Q: 'There were nine computers in the server room. Five more computers were installed each day, from Monday to Thursday. There are 18 computers in the server room.' Is it correct (True or False)?
A: Now there are 18 computers in the server room. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. So there were 18 - 20 = -2 computers in the server room originally, -2 != 9, which is different from the theme. The answer is False.

A: If Olivia had $8 left and she bought five bagels for $3 each, the bagels cost 5 * 3 = 15, so there was 8 + 15 = 23, which is consistent with the theme. The answer is True.

Figure 2: Example of self-verification. In step one, the LLM generates candidate answers and forms different conclusions. Then, in step two, the LLM verifies these conclusions in turn and computes a verification score by counting the number of correctly predicted masked values.
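The two-step procedure described in the caption can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: `ask_llm` is a placeholder for a sampled LLM call, the conditions are assumed to be given as the question's known values, and predictions are compared by exact match.

```python
from collections import Counter

def self_verification(question, conditions, candidate_answers, ask_llm):
    """Score each candidate answer by backward verification: mask one
    condition at a time, state the candidate answer as a known fact, and
    count how many masked conditions the model re-derives correctly."""
    scores = Counter()
    for answer in candidate_answers:
        for true_value in conditions:
            # Mask this condition with 'X' and append the candidate
            # conclusion as a new condition of the rewritten question.
            masked = question.replace(str(true_value), "'X'", 1)
            prompt = (f"{masked} The answer is {answer}. "
                      f"What is the answer of 'X'?")
            predicted = ask_llm(prompt)
            # Verification score: number of correctly predicted masked values.
            scores[answer] += int(predicted == true_value)
    # Select the candidate answer with the highest verification score.
    return max(scores, key=scores.get)
```

With several sampled CoT answers as candidates, the candidate whose conclusion lets the model recover the most original conditions is returned as the final answer.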
...); Wang et al. (2023a), while smaller models are prone to generating erroneous text, resulting in a lack of self-verification capabilities.

Figure 5: Comparison of problem solve rate (%) between single-condition verification and multiple-condition verification.

Figure 6: The computational cost of the proposed method on GSM8K.

Q: Sammy wanted to go to where the people were. Where might he go? Answer Choices: (a) populated areas (b) race track (c) desert (d) apartment (e) roadblock
A: The answer must be a place with a lot of people. Of the above choices, only populated areas have a lot of people. So the answer is (a).
Q: Where do you put your grapes just before checking out? Answer Choices: (a) mouth (b) grocery cart (c) super market (d) fruit basket (e) fruit market
A: The answer should be the place where grocery items are placed before checking out. Of the above choices, grocery cart makes the most sense for holding grocery items. So the answer is (b).
Q: Google Maps and other highway and street GPS services have replaced what? Answer Choices: (a) united states (b) mexico (c) countryside (d) atlas
A: The answer must be something that used to do what Google Maps and GPS services do, which is to give directions. Of the above choices, only atlases are used to give directions. So the answer is (d).
Q: Before getting a divorce, what did the wife feel who was doing all the work? Answer Choices: (a) harder (b) anguish (c) bitterness (d) tears (e) sadness

Q: 'Michael had 58 golf balls. On Tuesday, he lost 23 golf balls. On Wednesday, he lost 2 more. He had 40 golf balls at the end of Wednesday.' Is it correct (True or False)?
A: If Michael had 40 golf balls on Wednesday, he had 40 + 2 = 42 on Tuesday because he lost 2 golf balls on Wednesday. Because he lost 23 balls on Tuesday, he should have had 42 + 23 = 65 on Monday, but in fact Michael had 58 golf balls originally, which is different from the theme. The answer is False.
Q: 'Olivia has $23. She bought five bagels for $3 each. She has 8 dollars left.' Is it correct (True or False)?
train a classifier to select the best answer from candidate answers, while Roy and Roth (2016) train a global scoring model to guide the search process toward better answers. Shen et al. (2021) proposed jointly training answer generation and ranking with a language model. Cobbe et al. (

...... Q1: "John can read 'X' books a day. John reads every Monday and Tuesday. John would read 168 books in 6 weeks." What is the answer of 'X'?

...... Q2: "John can read 'X' books a day. John reads every Monday and Tuesday. John would read 48 books in 6 weeks." What is the answer of 'X'?

Table 2: Detailed description of answer cleansing.
Q: 'Shawn has five toys. For Christmas, he got two toys each from his mom and dad. He has 9 toys now.' Is it correct (True or False)?

Table 3: An actual generated example selected to further demonstrate the impact of different verification methods.
EXAMPLE 2 (Condition is not required)
Q: Dana worked 9 hours on Friday, 10 hours on Saturday, and 3 hours on Sunday. She earns $13 per hour. How much money did Dana earn in the weekend?
A: Dana earns 13 dollars per hour. She worked 10 hours on Saturday, and 3 hours on Sunday.
Q: Dana worked 'X' hours on Friday, 10 hours on Saturday, and 3 hours on Sunday. She earns $13 per hour. Dana earned $169 in the weekend.
Q: Dana worked 9 hours on Friday, 'X' hours on Saturday, and 3 hours on Sunday. She earns $13 per hour. Dana earned $169 in the weekend.

Table 4: An example of the need to use multiple conditions.
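Rewritten questions like those in the example above, one per masked condition, can be produced mechanically. Below is a minimal sketch assuming every numeric value in the question is a maskable condition; the function name and the regex-based condition detection are illustrative simplifications, not the paper's implementation.

```python
import re

def mask_conditions(question, conclusion):
    """Build one rewritten verification question per numeric condition.

    Each number in the original question is replaced by 'X' in turn,
    while the CoT conclusion is appended as a new known fact.
    """
    rewritten = []
    for match in re.finditer(r"\d+(?:\.\d+)?", question):
        start, end = match.span()
        masked = question[:start] + "'X'" + question[end:]
        rewritten.append(f"{masked} {conclusion} What is the answer of 'X'?")
    return rewritten
```

Applied to the Dana question, this yields one rewritten question for each of the four numbers (9, 10, 3, and 13), matching the multiple-condition setup shown in the table.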

Table 7: Continued from previous page.

Table 8: Few-shot exemplars for full chain of thought prompt for AQuA.
PROMPT FOR AQUA ALGEBRAIC WORD PROBLEMS
Q: John found that the average of 15 numbers is 40. If 10 is added to each number then the mean of the numbers is? Answer Choices:

Table 9: Few-shot exemplars for full chain of thought prompt for Date Understanding.

Table 10: Few-shot exemplars for full chain of thought prompt for math word problems. This set of exemplars was used for all math word problem datasets except AQuA.
Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total? Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny? Q:

Table 11: Few-shot exemplars for full chain of thought prompt for CSQA. There are newlines between the answer choices that are omitted in the table for space reasons.
The answer must require cable. Of the above choices, only television requires cable. So the answer is (c).
The answer must be something in the forest. Of the above choices, only natural habitat is in the forest. So the answer is (b).

Table 15: Few-shot exemplars for full chain of thought prompt for CSQA when backward verification. There are newlines between the answer choices that are omitted in the table for space reasons.