GRACE: Discriminator-Guided Chain-of-Thought Reasoning

In the context of multi-step reasoning, e.g., with chain-of-thought, language models (LMs) can easily assign a high likelihood to incorrect steps. As a result, decoding strategies that optimize for solution likelihood often yield incorrect solutions. To address this issue, we propose Guiding chain-of-thought ReAsoning with a CorrectnEss Discriminator (GRACE), a stepwise decoding approach that steers the decoding process towards producing correct reasoning steps. GRACE employs a discriminator trained with a contrastive loss over correct and incorrect steps, which is used during decoding to score next-step candidates based on their correctness. Importantly, GRACE only requires sampling from the LM, without the need for LM training or fine-tuning. Using models from FLAN-T5 and LLaMA families, we evaluate GRACE over four math and two symbolic reasoning tasks, where it exhibits substantial performance gains compared to greedy decoding, verifiers, and self-consistency in most settings. When further combined with self-consistency, GRACE outperforms all the baselines by sizeable margins. Human and LLM evaluations over GSM8K show that GRACE not only improves the final answer accuracy but also the correctness of the intermediate reasoning. Our implementation can be accessed at \url{https://github.com/mukhal/grace}.


Introduction
Multi-step reasoning spans a set of tasks where a question is answered via a sequence of reasoning steps until a final answer is reached (Creswell and Shanahan, 2022; Wei et al., 2022). While pretrained language models (LMs) have shown impressive performance on a variety of QA tasks, they still struggle with problems that require complex multi-step reasoning (Cobbe et al., 2021; Creswell et al., 2022; Ni et al., 2023).

Figure 1: A math question from GSM8K (Cobbe et al., 2021), a solution prefix, and candidate next steps sorted in descending order by their average token probability according to a few-shot prompted LLaMA 13B. The correct next step is assigned a significantly lower probability than the incorrect ones. GRACE solves this issue by calibrating candidate step likelihoods based on the step correctness.

One reason is that
the next-word prediction objective used for pretraining does not explicitly encourage the LM toward correct step-by-step reasoning. To boost the reasoning abilities of LMs, supervised fine-tuning (SFT) has been performed on gold step-by-step solutions (Uesato et al., 2022; Ho et al., 2022; Fu et al., 2023). However, SFT can easily lead to overfitting to the reference solutions seen during training, resulting in an LM that assigns low probabilities to alternative but correct solutions (Ni et al., 2023). Concurrently, LMs may assign a high probability to invalid sequences, which leads them off track when common decoding strategies such as greedy decoding are used.
While prompting techniques such as scratchpad or chain-of-thought (CoT) (Nye et al., 2021; Wei et al., 2022; Wang et al., 2022) can improve reasoning, they only indirectly affect the sequence probabilities, leaving the aforementioned issue mostly unsolved. To give an example, when prompting LLaMA 13B (Touvron et al., 2023) with a few-shot CoT prompt, a question from GSM8K (Cobbe et al., 2021), and a correct solution prefix, the most probable next-step candidates are incorrect, while the correct step is assigned a much lower probability than the incorrect ones, as shown in Figure 1.
Oversampling techniques have been proposed to alleviate this problem by utilizing multiple sampled solutions. For instance, the sample-then-rank approach uses a verifier model to score a set of randomly sampled solutions based on their correctness (Cobbe et al., 2021; Li et al., 2022). Self-consistency is another technique that aggregates multiple random samples via majority voting over the final answer (Wang et al., 2022). Nonetheless, oversampling methods have two main drawbacks. First, as they rely on temperature sampling from the LM distribution, they are prone to sampling highly likely but incorrect solutions. Second, they exhibit no control over solution decoding, as they are applied over complete solutions and after the decoding is finished. This paper builds on the insight that we can sample correct multi-step solutions by steering the decoding process towards generating correct reasoning steps. Inspired by discriminator-guided controlled generation methods (Yang and Klein, 2021; Dathathri et al., 2020; Krause et al., 2021), we propose GRACE, a guided-decoding method that relies on a correctness discriminator model to nudge the decoding process towards correct steps. Our discriminator is trained at the step level, allowing for finer-grained control over the sampling process compared to the vanilla self-consistency and verifier methods. While recent work (Uesato et al., 2022) relies on human annotations to build a step-level correctness reward model, human annotations are expensive and hard to scale. We address this limitation and propose a 3-step approach to train the correctness discriminator based on access to the correct solutions only, without any step-level human annotations.
We compare GRACE to greedy decoding, self-consistency, and verifiers, and show strong improvements over all of them on six different multi-step reasoning benchmarks with two language model families: FLAN-T5 (Chung et al., 2022) and LLaMA (Touvron et al., 2023). For instance, GRACE outperforms greedy decoding on GSM8K (Cobbe et al., 2021) by 7.4 accuracy points with FLAN-T5 Large and 5.4 points with LLaMA 7B. In addition, when further combining our approach with self-consistency, GRACE outperforms the vanilla self-consistency by 10.2 points on GSM8K and 15.7 points on MultiArith (Roy and Roth, 2015).
In summary, our contributions are as follows: (i) we propose GRACE, a guided stepwise decoding approach that steers multi-step reasoning toward correct steps using a step-level correctness discriminator; (ii) we introduce a 3-step method to train the discriminator from correct reference solutions only, without step-level human annotations; and (iii) we evaluate GRACE on four math and two symbolic reasoning tasks with FLAN-T5 and LLaMA models, showing gains in both final answer accuracy and intermediate step correctness over greedy decoding, self-consistency, and verifiers.

Method
Overview. Our setup follows chain-of-thought reasoning (Nye et al., 2021; Wei et al., 2021), where given a question q (e.g., a math word problem), our goal is to generate a chain of T intermediate reasoning steps s_1, s_2, ..., s_T, s_{T+1}, where s_{T+1} is the final answer. A pretrained language model (LM) is either fine-tuned or prompted in a few-shot manner to generate the chain. We start by formalizing our approach in the next section.

Formalization
Given a problem q and a correct solution prefix s_1, s_2, ..., s_{t-1}, we want to sample a correct next step s_t towards the final answer. We assume access to a judge or a discriminator model D that takes in the problem q, the prefix s_{1:t-1}, and a candidate next step s_t, and outputs a real-valued score D(q, s_{1:t-1}, s_t) that indicates whether s_t is a correct reasoning step at time step t. We also assume access to the language model distribution p_LM(·|q, s_{1:t-1}).
Formally, let c be a binary variable that indicates the correctness of the generated step with respect to the question and the prefix, where we want to sample the next step s_t ~ p(·|s_{1:t-1}, c, q). We can factorize p(s_t|s_{1:t-1}, c, q) as:

p(s_t | s_{1:t-1}, c, q) = p(s_t | s_{1:t-1}, q) p(c | s_{1:t}, q) / p(c | s_{1:t-1}, q)    (1)
∝ p(s_t | s_{1:t-1}, q) p(c | s_{1:t}, q)    (2)
≈ p_LM(s_t | q, s_{1:t-1}) p(c | s_{1:t}, q)    (3)
≈ p_LM(s_t | q, s_{1:t-1}) exp(D(q, s_{1:t-1}, s_t))    (4)

In Equation (3), we substitute p(s_t|s_{1:t-1}, q), the probability of the next step without modeling correctness, with p_LM(s_t|q, s_{1:t-1}). Similarly, in Equation (4), p(c|s_{1:t}, q) is replaced with exp(D(q, s_{1:t-1}, s_t)). This substitution is justified since, by our discriminator's definition, exp(D(q, s_{1:t-1}, s_t)) is proportional to p(c|s_{1:t}, q). By assuming that the prefix s_{1:t-1} is correct, p(c|s_{1:t}, q) depends only on the correctness of s_t, modeled by D(q, s_{1:t-1}, s_t). This factorization echoes the controlled generation method used by FUDGE (Yang and Klein, 2021), but with two notable distinctions. First, we model next-step correctness as opposed to next-token correctness, which is often ill-defined. Second, unlike FUDGE's discriminator, which predicts whether a given attribute will be satisfied in the future, our discriminator evaluates the correctness of a given step s_t with respect to s_{1:t-1}, the solution so far. To summarize, Equation (4) shows that we want to sample an s_t that (i) has high likelihood p_LM(s_t|q, s_{1:t-1}) according to the LM and (ii) is correct with respect to the question and the prefix. Intuitively, this leverages the reasoning capabilities of the LM while maintaining correctness. Throughout the rest of the paper, we will refer to the prefix s_{1:t-1} as r and the next step s_t as s for simplicity.

Discriminator Learning
We use three steps to learn the discriminator function D(q, r, s), which are shown in Figure 2 (top).
• Step 1 - Negative sampling: We collect a set of solutions with at least one incorrect step by sampling solutions from the LM and keeping only those whose final answer is incorrect, which guarantees at least one incorrect step.
• Step 2 - Alignment: Since we have no step-level supervision, we align each sampled incorrect solution with the reference solution via dynamic programming using the Needleman-Wunsch (NW) algorithm (Likic, 2008). The original NW algorithm finds a minimum-cost alignment between two character sequences.
To extend it to our case, we use the cosine distance between the embeddings of two steps as the cost of aligning these two steps. We compute step embeddings via ROSCOE (Golovneva et al., 2023). Formally, given an m-step sampled solution d = {d_1, ..., d_m} and an n-step reference solution g = {g_1, ..., g_n}, the alignment algorithm produces a sequence of l pairs of aligned step indices (x_1, y_1), ..., (x_l, y_l), from which we extract training examples of the form (q, r_k, s_k^+, s_k^-), where s_k^+ is a correct next step and s_k^- is an incorrect next step after the prefix r_k. For an alignment pair (x_i, y_i), three cases are handled (shown in Figure 3): a missing step, an extra step, or two comparable steps; in the comparable case, the intermediate variables of the two steps are compared to determine the correctness of the sampled step. For symbolic reasoning tasks, where there is no intermediate variable, we check whether the two steps entail one another using a pretrained NLI model. Once an incorrect step is found, i.e., DoStepsMatch(d_{x_i}, g_{y_i}) returns False, we exit to guarantee that the prefix in the returned examples is correct. A minimal sketch of this alignment procedure is given after this list.
• Step 3 - Learning: We utilize the max-margin loss objective L_D (Rosasco et al., 2004):

L_D(q, r, s^+, s^-) = max(0, ζ - D(q, r, s^+) + D(q, r, s^-))    (5)

where ζ > 0 is a margin hyperparameter, s^+ is a correct next step, and s^- is an incorrect next step given the question q and the prefix r. We found the max-margin loss to perform better than other alternatives (see Section 5 for an ablation study); an implementation sketch is given below.
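The following is a minimal sketch of the alignment step, assuming a step-embedding function embed (e.g., ROSCOE-style embeddings) and a uniform gap cost; the helper names and default values are illustrative placeholders rather than the exact implementation.

```python
import numpy as np

def align_steps(sampled, reference, embed, gap_cost=1.0):
    """Needleman-Wunsch alignment between sampled and reference solution steps.

    `sampled` and `reference` are lists of step strings; `embed` maps a step to a
    vector. Returns (pairs, total_cost), where each pair holds aligned step indices
    and -1 marks a gap (an extra sampled step or a missing reference step).
    """
    def cost(a, b):  # cosine distance between step embeddings
        va, vb = embed(a), embed(b)
        return 1.0 - float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

    m, n = len(sampled), len(reference)
    dp = np.zeros((m + 1, n + 1))
    dp[:, 0] = np.arange(m + 1) * gap_cost
    dp[0, :] = np.arange(n + 1) * gap_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i, j] = min(
                dp[i - 1, j - 1] + cost(sampled[i - 1], reference[j - 1]),  # comparable steps
                dp[i - 1, j] + gap_cost,   # extra sampled step
                dp[i, j - 1] + gap_cost,   # missing reference step
            )
    # Backtrace to recover the aligned (sampled, reference) index pairs.
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and np.isclose(
                dp[i, j], dp[i - 1, j - 1] + cost(sampled[i - 1], reference[j - 1])):
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and np.isclose(dp[i, j], dp[i - 1, j] + gap_cost):
            pairs.append((i - 1, -1)); i -= 1
        else:
            pairs.append((-1, j - 1)); j -= 1
    return list(reversed(pairs)), float(dp[m, n])
```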
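Below is a minimal PyTorch-style sketch of the max-margin objective in Equation (5), assuming disc_scores_pos and disc_scores_neg hold batched discriminator scores D(q, r, s^+) and D(q, r, s^-) for aligned step pairs (the names are illustrative).

```python
import torch

def max_margin_loss(disc_scores_pos: torch.Tensor,
                    disc_scores_neg: torch.Tensor,
                    zeta: float = 1.0) -> torch.Tensor:
    """Hinge loss pushing D(q, r, s+) above D(q, r, s-) by at least the margin zeta."""
    return torch.clamp(zeta - disc_scores_pos + disc_scores_neg, min=0.0).mean()
```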

Guided Stepwise Decoding
After D is trained, it is employed to guide solution decoding. At each time step t, we use nucleus sampling to sample a pool of J candidates for the next step, s_t^(1), ..., s_t^(J). These candidates represent multiple possible choices for the next step. Each candidate s_t^(i) is then scored using:

score(s_t^(i)) = (1 - β) log p_LM(s_t^(i) | q, s_{1:t-1}) + β D(q, s_{1:t-1}, s_t^(i))    (6)

where β ∈ [0, 1] is a hyperparameter that controls the weight of the discriminator score. The top-scored candidate is selected as the next step, and the process repeats until a final answer is generated. The guided decoding process is shown in Figure 2 (bottom).
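Below is a minimal sketch of this guided stepwise decoding loop. sample_next_steps (nucleus sampling of J candidate steps from the LM), lm_log_prob, and discriminator_score are hypothetical helpers standing in for the LM and discriminator calls; the step score follows Equation (6), and the default values are placeholders.

```python
def grace_decode(question, sample_next_steps, lm_log_prob, discriminator_score,
                 beta=0.8, num_candidates=20, max_steps=10):
    """Stepwise decoding guided by a step-correctness discriminator."""
    prefix = []  # reasoning steps generated so far
    for _ in range(max_steps):
        # 1. Sample a pool of J candidate next steps (e.g., via nucleus sampling).
        candidates = sample_next_steps(question, prefix, n=num_candidates)

        # 2. Score each candidate: (1 - beta) * log p_LM + beta * D(q, r, s).
        def score(step):
            return ((1.0 - beta) * lm_log_prob(question, prefix, step)
                    + beta * discriminator_score(question, prefix, step))

        # 3. Commit the top-scored step and continue until a final answer is produced.
        best = max(candidates, key=score)
        prefix.append(best)
        if "####" in best:  # final-answer marker used in GSM8K-style solutions
            break
    return prefix
```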

Experimental Setup
Tasks. We evaluate our approach on four math and two symbolic reasoning tasks. For math reasoning, we use GSM8K (Cobbe et al., 2021), a common benchmark for complex multi-step reasoning; MathQA-Gain, a subset of MathQA (Amini et al., 2019) that includes math word problems about gain/loss, where each problem is accompanied by a step-by-step Python program; and SVAMP (Patel et al., 2021) and MultiArith (Roy and Roth, 2015), which consist of elementary-level math word problems. For MathQA-Gain, SVAMP, and MultiArith, we use the train-test splits included in the LILA benchmark (Mishra et al., 2022). For symbolic reasoning, we experiment with Coin Flip (CF; Wei et al. 2021; Kojima et al. 2022) and Tracking Shuffled Objects (TSO) from Big-Bench Hard (Srivastava et al., 2022), using the splits by Ho et al. (2022). As SVAMP, MultiArith, CF, and TSO do not include reference step-by-step solutions (only the final answer is provided for each question), we follow recent work on chain-of-thought distillation (Ho et al., 2022; Fu et al., 2023; Hsieh et al., 2023) and prompt GPT-3.5-turbo to generate a step-by-step solution for each question. Details on this process are in Appendix E.1, and dataset statistics are in Appendix E.2.
Sampling, Training, and Decoding. For each task, we sample roughly 100K incorrect solutions for discriminator training using top-k sampling with k = 50 and temperature T = 1.3 for FLAN-T5 and T = 0.7 for LLaMA. The discriminator used in all of our experiments is a FLAN-T5 Large encoder (~340M parameters). For math reasoning tasks, we use an external calculator to compute the results of math operations. The exact details on sampling, training, and decoding are in Appendix A.
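A sketch of the negative-solution collection loop under these settings; sample_solution and extract_answer are hypothetical helpers wrapping the LM sampler and answer parser, and the target count and sampling hyperparameters follow the text above.

```python
import random

def collect_negative_solutions(train_set, sample_solution, extract_answer,
                               target=100_000, top_k=50, temperature=1.3):
    """Sample solutions from the LM and keep those whose final answer is wrong."""
    negatives = []
    while len(negatives) < target:
        ex = random.choice(train_set)  # ex: {"question": ..., "answer": ...}
        sol = sample_solution(ex["question"], top_k=top_k, temperature=temperature)
        if extract_answer(sol) != ex["answer"]:
            negatives.append({"question": ex["question"], "solution": sol})
    return negatives
```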
Baselines. We compare GRACE to greedy decoding, which is the standard decoding method for reasoning tasks (Wei et al., 2022; Li et al., 2022; Fu et al., 2022; Zhou et al., 2022), and to beam search with a beam size of 3. We additionally compare GRACE to self-consistency (SC), where multiple solutions are sampled with a temperature of T = 0.7 and we pick the most frequent answer as the final answer. We sample 40 solutions for experiments with FLAN-T5 and 20 with LLaMA. In addition, we compare to a solution verifier (Cobbe et al., 2021; Li et al., 2022), using a FLAN-T5 Large encoder as the verifier for a fair comparison. We use the verifier checkpoint that achieves the best F1 on a held-out set. We note that self-consistency and verifiers may be applied on top of GRACE by sampling complete solutions using our guided decoding approach and then reranking or applying majority voting over the sampled solutions. Lastly, we compare to LM-only scoring, which ranks steps according to log p_LM alone by setting β = 0 in Equation (6), to demonstrate the utility of including the discriminator when computing a step score.
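For reference, the vanilla self-consistency baseline reduces to the following sketch, assuming a sample_solution helper that draws one full solution at the given temperature and an extract_answer parser (both hypothetical names).

```python
from collections import Counter

def self_consistency(question, sample_solution, extract_answer,
                     n_samples=40, temperature=0.7):
    """Majority vote over the final answers of independently sampled solutions."""
    answers = [extract_answer(sample_solution(question, temperature=temperature))
               for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```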
Language Models. We verify the effectiveness of GRACE on models from two different families and with different sizes, namely FLAN-T5 Large (778M; Chung et al. 2022) and LLaMA (7B, 13B; Touvron et al. 2023). As FLAN-T5 Large performs poorly in the few-shot setting, we fine-tune it over the training set of each task. LLaMA models are not fine-tuned and are used in a few-shot setting with 6 CoT demonstrations (provided in Appendix G).

Results and Discussion
Evaluation of final answer accuracy. We compare the accuracy of the final answers reached by different methods. We first discuss the results on math reasoning in Table 1. With FLAN-T5 Large, GRACE outperforms the baselines on all tasks. For instance, GRACE outperforms greedy decoding by 7.4 and 11.7 points on GSM8K and SVAMP, respectively. When combining our approach with SC, where sampling is done using GRACE and then majority voting is applied on the samples, the accuracy boost over vanilla SC is as large as 6.8 points on SVAMP. With the few-shot prompted LLaMA 7B, a similar trend is observed, as GRACE outperforms greedy decoding and SC on MultiArith and SVAMP. GRACE with SC outperforms the vanilla SC with random sampling by 10.2 and 15.7 points on GSM8K and MultiArith, respectively.
We observe that the verifier approach performs poorly on all tasks except MathQA-Gain. This is likely because the verifier training examples include solutions with the correct final answer but invalid reasoning steps. As a result, the trained verifier cannot distinguish correct from incorrect reasoning. To test this hypothesis, we ran an experiment on GSM8K where we only included the gold trajectories as positive examples and indeed found an improvement in the verifier's performance, albeit still below SC and GRACE.
Moving to symbolic reasoning (shown in Table 2): on TSO, GRACE w/ SC boosts the accuracy of FLAN-T5 Large and LLaMA 13B by 2.6 and 4.6 points, respectively, compared to SC. As for Coin Flip, GRACE w/ SC improves LLaMA 13B's accuracy by 12.8 points compared to the vanilla SC. One might note that LLaMA 13B's performance on TSO (34.4%) is close to random chance (33.3%). This can be explained by observing that LLaMA 13B's performance was already poor (29.8% with SC), and therefore the candidate next steps scored by the discriminator are likely mostly incorrect, explaining why GRACE produces only a marginal improvement. Appendix H shows examples of solutions produced by GRACE on all tasks.
Ultimately, our results show that GRACE can boost both FLAN-T5's and LLaMA's final answer accuracy on different math and symbolic reasoning tasks. Interestingly, in the case of the LLaMA models, we achieve these improvements (i) without any training of the LM and (ii) with a discriminator that has 20x and 38x fewer parameters than the backbone LM for LLaMA 7B and LLaMA 13B, respectively. This points to a promising direction of steering the generations of large LMs via significantly smaller and more efficient discriminators.

Evaluation of intermediate step correctness.
Reaching a correct final answer does not guarantee correct reasoning, since a model can reach the correct answer spuriously (Golovneva et al., 2023; Uesato et al., 2022). Here, we measure whether GRACE boosts the correctness of the reasoning chains compared to the baselines. To do that, we use prefix correctness (PC) following Uesato et al. (2022), which measures whether the steps so far are correct. Inspired by recent work showing that using LLMs for evaluation correlates highly with human judgment (Wang et al., 2023; Liu et al., 2023b; Luo et al., 2023), we measure prefix correctness using LLMs in addition to human evaluation.

Figure 4 (bottom): Final answer accuracy as the discriminator coefficient β in Equation (6) is varied from 0 to 1. Increasing β up to a certain level improves the final answer accuracy, pointing to the benefit of steering the decoding process via the discriminator. The model used here is FLAN-T5 Large and all numbers are averaged over 3 runs.

For LLM
evaluation, we use GPT-3.5-turbo with a few-shot prompt that lets the model predict a binary label of correct or incorrect after each prefix. Details on LLM evaluation, including the prompt used, are in Appendix C.
In addition to PC, which is computed over all solutions regardless of the final answer, we also evaluate the trace error (TE), which is computed exclusively on solutions with a correct final answer and measures the percentage of these solutions that have at least one major mistake. Following Uesato et al. (2022), a major mistake is defined as "a step where the information expressed is incorrect, or it would no longer be possible to reach the correct solution without undoing that step". We evaluate TE using both human and LLM evaluation on 200 questions that were answered correctly by both GRACE and the baselines. LLM-based TE is computed as the percentage of correct solutions with at least one incorrect prefix. For human-based TE, we ask annotators to label whether each solution has such a major mistake, mark the step where the mistake happened, and provide a justification. Details on the human evaluation are in Appendix D. We conduct this evaluation on the GSM8K test set since the reasoning required to solve GSM8K is more complex compared to the other tasks. In terms of prefix correctness, GRACE scores higher than both greedy decoding and self-consistency, by 7.0 and 3.8 points respectively. We also observe significant improvements in trace error with GRACE. Specifically, it reduces trace error from 9.0% with greedy decoding to 5.0% (a 44% reduction), and a similar improvement is seen in the LLM-computed TE.
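To make the two metrics concrete, a small sketch computing PC and TE from per-prefix correctness labels; the data layout (a list of dicts with prefix_labels and final_answer_correct) is an assumption for illustration.

```python
def prefix_correctness(solutions):
    """PC: fraction of prefixes labeled correct, over all evaluated solutions."""
    labels = [lab for sol in solutions for lab in sol["prefix_labels"]]
    return sum(labels) / len(labels)

def trace_error(solutions):
    """TE: among solutions with a correct final answer, the fraction that contain
    at least one incorrect prefix (i.e., at least one major mistake)."""
    correct = [sol for sol in solutions if sol["final_answer_correct"]]
    flawed = [sol for sol in correct if not all(sol["prefix_labels"])]
    return len(flawed) / len(correct)
```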
Our results clearly suggest that guiding the decoding process with GRACE not only improves the correctness of the final answer but also of the intermediate steps.

Analysis
Sample Efficiency. A primary motivation for GRACE is to achieve more step-level control over solution decoding than the solution-level aggregation done by vanilla SC.⁶ Therefore, we expect GRACE to require fewer samples than vanilla SC to reach the same accuracy. To see if this is true, we compare GRACE w/ SC to vanilla SC with different numbers of samples. Figure 4 (top) plots the number of samples against final answer accuracy on four tasks with FLAN-T5 Large. We observe that GRACE is more sample-efficient and yields better accuracy with the same or fewer samples than vanilla SC.
Step Score. We study the effect of the discriminator score coefficient β in Equation (6), used when computing the score of a candidate step, on the reasoning performance. Figure 4 (bottom) shows final answer accuracy as we vary β from 0.0 to 1.0. The plot shows the accuracy improving as β is increased beyond 0, emphasizing the benefit brought by integrating D(q, r, s) into the step score. Interestingly, when increasing β beyond a certain point, the performance drops again, indicating that we should not completely omit p_LM(s|q, r), which represents the LM's learned reasoning abilities.

Alignment. To verify whether our alignment algorithm brings any benefit to the discriminator training, we compare it to a simpler version where steps in the sampled solutions are aligned one-to-one with the corresponding steps in the reference solutions. This naive approach only aligns samples with the same number of steps as the reference solution, since there is no clear way to align samples of different lengths. Figure 6 in Appendix F shows the accuracy on GSM8K and SVAMP when training the discriminator using both alignments. Our alignment approach outperforms the naive alignment by 2.2 and 5.9 points on GSM8K and SVAMP, respectively. These results highlight the advantages of our proposed alignment method in yielding better discriminator training.
⁶ One can compare solution- vs. step-level guidance to sparse vs. intermediate rewards in reinforcement learning (RL). Guiding the solution at the step level is akin to the RL agent receiving rewards from intermediate actions rather than a delayed reward signal at the end of the episode, enabling the agent to learn the task with fewer samples.

Discriminator Loss Function. We compare the max-margin objective in Equation (5) to two different discriminator training objectives. The first is a binary cross-entropy objective, where the model is trained to predict 'correct' or 'incorrect' after each step, similar to Uesato et al. (2022). The probability of correctness is used as the discriminator score in Equation (6). The second is the pairwise ranking loss used to train the reward model for InstructGPT (Ouyang et al., 2022). Table 4 shows accuracy on GSM8K with FLAN-T5 Large when GRACE's discriminator is trained with each of these loss functions. Notably, the binary cross-entropy loss exhibits the lowest accuracy, emphasizing the importance of contrastive training. Moreover, the max-margin objective is comparable to the pairwise ranking loss.
Cross-task Performance. Our approach relies on reference solutions, which may not always be available for all tasks. Therefore, it is valuable to investigate how GRACE performs when the discriminator is applied to a task different from the one it was originally trained on. In Figure 5, we present the results for SVAMP and MultiArith when the discriminator's training task is varied. In this setting, GRACE shows a small relative performance drop of 8% for GSM8K → SVAMP and 7% for GSM8K → MultiArith, while still outperforming greedy decoding and LM-only scoring. However, a more substantial drop of 26.6% is observed in the case of SVAMP → MultiArith. This decrease can be attributed to two key factors: first, SVAMP has a smaller set of training questions (432) in comparison to GSM8K (6.4K), and second, SVAMP questions require simpler reasoning compared to GSM8K.
Discriminator Size. Lastly, we study how the size of the discriminator model impacts the final answer accuracy. More details are in Appendix F.

Related Work
Discriminator-Guided Controlled Generation. Previous work in controlled generation has employed discriminators during decoding to guide generation towards specific attributes, such as sentiment, topic, or lexical constraints (Holtzman et al., 2018; Dathathri et al., 2020; Yang and Klein, 2021; Krause et al., 2021; Khalifa et al., 2021). These discriminators can either update the hidden states of the language model in real-time (Dathathri et al., 2020) or adjust token probabilities (Holtzman et al., 2018; Yang and Klein, 2021; Liu et al., 2023a). Our research takes inspiration from these practices but extends them to multi-step reasoning in two key aspects: control granularity and discriminator training. We direct the decoding of multi-step solutions at the level of reasoning steps to promote their correctness, instead of individual tokens, as correctness is not meaningfully defined at the token level. As for discriminator training, learning a reasoning correctness discriminator is more challenging than learning a topic or sentiment discriminator, as the former requires checking for logical, mathematical, or factual errors in a given reasoning step. To tackle this, we introduce a novel 3-step process for training discriminators without step-level annotations. Most relevant to our work, Li et al. (2022) introduced a step-aware verifier to score sampled solutions, but their technique only applies to fully sampled solutions, unlike our approach which actively guides the decoding process. Yang et al. (2022) used a stepwise verifier to guide the search process for proof generation and relied on heuristics to generate negative examples, unlike GRACE, which samples incorrect solutions from the model.

Multi-step Reasoning. Two main types of approaches have been explored to improve multi-step reasoning: inference-time methods, which do not require additional language model (LM) training, and training-based methods, which require either labeled samples or rewards. Popular inference-time techniques include model prompting such as chain-of-thought (Nye et al., 2021; Wei et al., 2021) and its variants (Zhou et al., 2022; Zhang et al., 2022). While these input-based techniques operate at the LM input side, other methods target the output side. For instance, self-consistency (Wang et al., 2022) employs majority voting on multiple sampled solutions to determine the final answer. An alternative output-based method involves training a verifier model to rank sampled solutions according to correctness (Cobbe et al., 2021). However, verifiers and vanilla self-consistency exhibit no control over solution sampling. We also show in this paper (see Section 4) that verifiers trained on samples from smaller LMs perform very poorly. Training-based methods, on the other hand, focus on crafting learning objectives to teach the LM to reason correctly. For instance, Uesato et al. (2022) trained a model to assess the correctness of the entire reasoning chain, which is then used as a reward model. Ni et al. (2022) proposed training LMs on sampled partially correct solutions to enhance mathematical reasoning.

Conclusion
Language models can easily assign a high probability to incorrect solutions. Existing methods like self-consistency and verifiers that rely on sampling from the LM distribution do not effectively address this issue. This work proposes a guided decoding method that trains a step-level discriminator model, which is then used to steer the solution decoding process toward correct steps. We demonstrate the utility of our approach on six reasoning benchmarks, where it strongly boosts the correctness of the generated solutions.

Limitations and Future Work
There is an overhead incurred by sampling and computing the discriminator step scores during decoding. In addition, GRACE's performance is upper-bounded by the quality of the sampled candidate steps. Our approach also requires access to reference step-by-step solutions for the alignment process. As for future work, leveraging the alignment approach to derive a reward signal for training the language model, and extending GRACE to commercial APIs that do not provide access to the logits, are both relevant directions.
A Sampling, Training, and Decoding Details

Sampling and Discriminator Training. For each task, we sample roughly 80K incorrect solutions for discriminator training using top-k sampling with k = 50 and temperature T = 1.3 for FLAN-T5 and T = 0.7 for LLaMA. The discriminator used in all our experiments is a FLAN-T5 Large encoder (~340M parameters). The step score is computed by applying max-pooling over the hidden states followed by a two-layer MLP with ReLU and tanh non-linearities. The tanh is applied to constrain the scores to the range [-1, 1]. We train the discriminator for 10 epochs with a batch size of 32. We use the Adam optimizer with a learning rate of 1e-4 for GSM8K and 6e-5 for the other tasks. We use ζ = 1.0 as the margin hyperparameter. We monitor the loss on a held-out development set from each task and choose the best checkpoint accordingly.
Interestingly, we found that early stopping based on the loss is a better indicator of the discriminator's performance than using the pairwise classification accuracy i.e., how often the discriminator assigns a higher reward to the correct step than the incorrect one.
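Below is a sketch of a scoring head consistent with the architecture described above (max-pooling over the FLAN-T5 encoder hidden states, a two-layer MLP with ReLU, and a final tanh); the hidden size and module layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DiscriminatorHead(nn.Module):
    """Maps encoder hidden states for (q, r, s) to a scalar score in [-1, 1]."""

    def __init__(self, hidden_size: int = 1024):  # 1024 = d_model of FLAN-T5 Large
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
            nn.Tanh(),  # constrain the score to [-1, 1]
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the encoder
        pooled = hidden_states.max(dim=1).values  # max-pooling over the sequence
        return self.mlp(pooled).squeeze(-1)
```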
Decoding. For stepwise decoding, we sample reasoning steps using nucleus sampling to form the pool of candidate next steps. We continue decoding steps until a final answer is generated or a maximum number of steps is reached. For math reasoning tasks, we use a calculator during decoding to compute the results of math operations. Table 5 shows the stepwise decoding hyperparameters used for each task and language model; these values were found through a grid search over the development set of each task.

B Solution Alignment
Algorithm 2 shows the Needleman-Wunsch algorithm for aligning sampled solutions with the ground-truth solution for a given problem. To filter out low-quality samples, we discard sampled solutions with an alignment cost > 2.0 for all tasks except TSO, where we discard samples with an alignment cost > 6.0.

C LLM Evaluation Details
Before using GPT-3.5 to evaluate our models, we need to measure whether it can reliably assess prefix correctness. To do that, we manually annotate 100 model-generated solutions from GSM8K, which correspond to 280 prefixes in total. We ask human annotators to provide a binary label for each prefix to indicate whether the solution so far will still lead to the correct final answer or not. If a prefix is found to be incorrect, then all the following prefixes in the solution are also incorrect. Interestingly, we found that few-shot prompting GPT-3.5-turbo with 10 demonstrations could predict the prefix correctness with an 88.94% macro F1 score. The few-shot prompt we use is shown in Table 6. We run our evaluation over three different runs for GRACE and self-consistency, randomly sampling 10 different demonstrations for the prompt each time.
You are ChatGPT, a very capable language model that is good at doing math.You are given a math problem, a step-by-step solution to the problem, and a correct solution.After each step in the solution, identify whether the solution so far will lead to the correct final answer or not.If the solution so far is correct, you should generate "-> correct".If the solution is incorrect, you should generate "-> incorrect".I will give you a few examples to get you started.
Q: Siobhan has 2 fewer jewels than Aaron.Aaron has 5 more jewels than half of Raymond's jewels.If Raymond has 40 jewels, how many jewels does Siobhan have?Correct Solution: Half of Raymond's jewels is 40/2 = 20.Since Aaron has 5 more jewels than half of Raymond's jewels, he has 20 + 5 = 25 jewels.If Siobhan has 2 fewer jewels than Aaron, she has 25 -2 = 23 jewels.Solution: Aaron has 5 more jewels than half of Raymond's jewels, meaning he has 40 + 5 = 45 jewels.→ incorrect.Siobhan has 2 fewer jewels than Aaron, meaning she has 45 -2 = 43 jewels.→ incorrect.
Q: A teacher teaches 5 periods a day and works 24 days a month.He is paid $5 per period.If he has been working for 6 months now, how much has he earned in total?Correct Solution: The amount paid to the teacher per day is 5 periods * $5/period = $25 per day.The amount paid for 24 days is $25/day * 24 days = $600.The total amount for 6 months is $600 * 6 = $3600.Solution: The amount paid to the teacher per day is 5 periods * $5/period = $25 per day.→ correct.The amount paid for 24 days is $25/day * 24 days = $600.→ correct.The total amount for 6 months is $600 * 6 = $1800.→ incorrect.
Q: Wynter went to her local town bike shop to buy her sister a bicycle as her birthday gift.While at the shop, Wynter counted 50 bicycles and 20 tricycles.How many wheels in total did the vehicles she saw have?Correct Solution: The bicycles had a total of 50 bikes * 2 wheels/bike = 100 wheels.There were 20 tricycles * 3 wheels/tricycle = 60 wheels for the tricycles.The total number of wheels is 100 wheels + 60 wheels = 160 wheels.Solution: There are 50 bicycles at the shop.→ correct.Each bicycle has 2 wheels.→ correct.So, there are 50 * 2 = 100 wheels.→ correct.There are 20 tricycles at the shop.→ correct.Each tricycle has 3 wheels.→ correct.So, there are 20 * 3 = 60 wheels.→ correct.The total number of wheels is 100 + 60 = 160.→ correct.
Table 6: An example of the few-shot prompt given to GPT-3.5 to predict prefix correctness (described in Section 4), which is used to evaluate GRACE against the baselines. We use 10 manually annotated solutions from GSM8K as in-context learning demonstrations.

D Human Evaluation Details
Annotators are presented with the question, the reference solution, and a generated solution. They are then given the following instructions: "You are given a math problem, the reference solution, and the generated model solution; please indicate the first generated step with a major mistake, if any exist. A major mistake is a step where the information expressed is incorrect, or it would no longer be possible to reach the correct solution without undoing that step." Initially, we asked two annotators to annotate 100 solutions and obtained an inter-annotator agreement of 0.93 by Cohen's kappa coefficient. Since we obtained high agreement, we then asked only one of the annotators to annotate all 400 solutions (200 from GRACE and 200 from greedy decoding).

E Datasets Info

E.1 Step-by-step Reference Generation
To generate reference step-by-step solutions for SVAMP and MultiArith, we prompt GPT-3.5-turbo with the few-shot prompt shown in Table 8. A similar prompt is used for Tracking Shuffled Objects and Coin Flip, but with demonstrations from the corresponding task. For each question, we sample 20 different solutions and filter out the ones that do not reach the correct final answer. We then pick a random solution with the correct final answer as our reference solution. If GPT-3.5-turbo is not able to reach the correct final answer after 5 tries with different demonstrations, we discard that question from the training data.
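A sketch of this sample-and-filter procedure; generate_cot (a wrapper around the GPT-3.5-turbo call with few-shot demonstrations) and extract_answer are hypothetical helpers, and the exact retry structure is an assumption.

```python
import random

def make_reference_solution(question, gold_answer, generate_cot, extract_answer,
                            n_samples=20, max_tries=5):
    """Sample CoT solutions and keep a random one whose final answer matches the gold answer."""
    for _ in range(max_tries):  # each try may use different few-shot demonstrations
        samples = [generate_cot(question) for _ in range(n_samples)]
        correct = [s for s in samples if extract_answer(s) == gold_answer]
        if correct:
            return random.choice(correct)
    return None  # the question is discarded from the training data
```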

E.2 Statistics
Table 7 shows the statistics for the datasets used for our evaluation.

F Further Analysis
Discriminator Size. We study how the size of the discriminator impacts the final answer accuracy. In addition to the FLAN-T5 Large encoder used so far, we run experiments with a FLAN-T5 Base encoder (110M) and a FLAN-T5 Small encoder (30M) as discriminators on GSM8K and MultiArith, with LLaMA 7B as the backbone LM. Figure 7 shows the accuracy on both datasets with different discriminator sizes. For MultiArith, larger discriminator models bring better performance, as expected. Interestingly, with the T5-Base discriminator, GRACE can already surpass self-consistency by 0.7 points, and such a boost is achieved using a discriminator that is 63x smaller than LLaMA 7B. As for GSM8K, we observe a very different trend, where the smaller models (Base and Small) do not perform well. This can be understood in light of GSM8K being a more difficult task with more complex reasoning requirements compared to MultiArith, and therefore a discriminator with sufficient capacity is needed.

G Few-shot prompts
Tables 8, 9, and 10 show the 6-shot prompts used with LLaMA 7B for GSM8K, SVAMP, and MultiArith, respectively. Tables 11 and 12 show the 6-shot prompts used with LLaMA 13B for Coin Flip and Tracking Shuffled Objects, respectively.

H Sample Outputs
Tables 13,14,15,16,17,18,19,20,21 show compare sampled solutions from both GRACE and self-consistency where GRACE reached the correct answer but self-consistency did not.Q: An entrepreneur is crowdfunding a new business effort.He has three different dollar amount levels of financial backing options and each level is ten times as high as the previous one.He needs to raise $12000 to get his business off the ground.He succeeded after getting two backers at the level of financial backing, three at the second level, and ten at the lowest level.How many dollars was the highest level of financial backing?A: Let L represent the lowest level of financial backing.Thus, the second level is 10L and the highest level is 10 * 10L = 100L.The entrepreneur got 2 highest level backers, 3 second level backers, and 10 lowest level backers, so 2 * 100L + 3 * 10L + 10L = 200L + 30L + 10L = 240L = $12000.Therefore, the lowest level of financial backing was L = 12000 / 240 = $50.Thus, the highest level of financial backing was 10 * 10 * 50 = $5000.
Q: 1/3 of the townspeople have received the full COVID vaccine.1/3 are immune because they already recovered from COVID.If 1/6 of the townspeople are both vaccinated and already had COVID, what percent of the town is immune in some way?A: First add the fractions of vaccinated and recovered townspeople: 1/3 + 1/3 = 2/3.Then you need to subtract the fraction that are both vaccinated and recovered, since they were double-counted.To do this, start by converting 2/3 to the same denominator as 1/6: 2/3 * 2/2 = 4/6.Then subtract 1/6 from 4/6: 4/6 -1/6 = 3/6.Then divide the numerator by the denominator and multiply by 100% to express the answer as a percentage: 3/6 * 100% = 50%.
Q: James was doing some shopping in his favorite shop and he saw an offer on shoes that give him a second pair of shoes for half off the original price.He took the offer and took the first pair of shoes for $40 and the second one for $60.At the cashier, the lady told him that the offer applies strictly to the cheaper pair of shoes, and also that he was selected for an extra discount of a fourth off the total amount.How much money did James end up paying for both pairs of shoes?A: The price of the cheaper pair after the discount was $40/2 = $20.The total price of both pairs of shoes $20 + $60 = $80.The final discount was $80/4 = $20.The price he ends up paying was $80 -$20 = $60.
Q: A wildlife team is monitoring the number of birds in a park.There are 3 blackbirds in each of the park's 7 trees.There are also 13 magpies roaming around the park.How many birds are in the park in total?A: In the trees, there are 7 trees * 3 blackbirds per tree = 21 blackbirds.In addition to the magpies, there is a total of 21 blackbirds + 13 magpies = 34 birds in the park.Q: White t-shirts can be purchased in packages of 13.If mom buys 39 white t-shirts, how many packages will she have?A: If the t-shirts come in packages of 13, then mom bought 39/13 = 3 packages of t-shirts.
Q: There are 384 oranges and 192 bananas in Philip's collection.If the bananas are organized into 345 groups and oranges are organized into 16 groups, how big is each group of oranges?A: To find out how big each group of oranges is, we need to divide the total number of oranges by the number of orange groups.So, 384 oranges divided by 16 groups = 24 oranges per group.
Q: In a school, there are 868 girls and the rest are boys.If there are 281 more girls than boys, how many pupils are there in that school?A: Let's use algebra to solve the problem.Let's call the number of boys "b".We know that there are 868 girls and 281 more girls than boys, so: 868 = b + 281.We can solve for "b" by subtracting 281 from both sides: 587 = b.So there are 587 boys in the school.To find the total number of pupils, we can add the number of girls and boys: Total = 868 + 587 = 1455.
Q: Mary is baking a cake.The recipe calls for 9 cups of flour and 5 cups of sugar.She already put in 3 cups of flour and 62 cups of sugar.How many more cups of flour does she need to add? A: The recipe calls for 9 cups of flour and she already put in 3 cups, so she still needs to add 9 -3 = 6 cups of flour.There is no indication that Mary needs to add more sugar, so the 62 cups of sugar are irrelevant to this question.Therefore, she needs to add 6 cups of flour.
Q: Danny collects bottle caps and wrappers.He found 46 wrappers and 50 bottle caps at the park.Now he has 21 bottle caps and 52 wrappers in his collection.How many more bottle caps than wrappers did Danny find at the park?A: Danny found 50 bottle caps and 46 wrappers at the park.Now he has 21 bottle caps and 52 wrappers in his collection.
To find out how many more bottle caps than wrappers he found at the park, we need to subtract the number of wrappers from the number of bottle caps: 50 -46 = 4. Therefore, Danny found 4 more bottle caps than wrappers at the park.
Q: A store had 27 coloring books in stock.They ended up putting them on sale and getting rid of 6 of them.The put the ones they still had onto shelves with 7 on each shelf.How many shelves did they use?A: After selling 6 coloring books, the store had 27 -6 = 21 coloring books left.To figure out how many shelves were used, divide the number of coloring books by the number of books on each shelf: 21 / 7 = 3 shelves.Therefore, the store used 3 shelves to display the coloring books.
Q: Paige was unboxing some of her old winter clothes.She found 6 boxes of clothing and inside each box there were 5 scarves and 5 mittens.How many pieces of winter clothing did Paige have total?A: Each box has 5 scarves and 5 mittens, so each box contains 5 + 5 = 10 pieces of winter clothing.Paige has 6 boxes, so she has a total of 6 x 10 = 60 pieces of winter clothing.
Q: A toy store had 6 giant stuffed bears in stock when they got another shipment with 18 bears in it.The put the bears onto shelves with 6 on each shelf.How many shelves did they use?A: There are now 6 + 18 = 24 giant stuffed bears in the store.If there are 6 bears on each shelf, then the number of shelves required is 24 / 6 = 4 shelves.

Q:
The schools debate team had 26 boys and 46 girls on it.If they were split into groups of 9, how many groups could they make?A: The total number of students on the debate team is 26 + 46 = 72.To split them into groups of 9, we need to divide 72 by 9: 72 / 9 = 8 groups.So they can make 8 groups.
Q: A toy store had 5 giant stuffed bears in stock when they got another shipment with 7 bears in it.The put the bears onto shelves with 6 on each shelf.How many shelves did they use?A: The total number of bears they have is 5 + 7 = 12 bears.To find out how many shelves they need, we divide the total number of bears by the number of bears on each shelf: 12 bears / 6 bears per shelf = 2 shelves.Therefore, they used 2 shelves.
Q: Frank had 11 dollars.At the store, he spent $3 on a new game.If he got another 14 dollars for his allowance, how much money does he have now?A: Frank had 11 dollars and spent 3 dollars, leaving him with 11 -3 = 8 dollars.He then received 14 dollars for his allowance, bringing his total amount of money to 8 + 14 = 22 dollars.QUESTION: Larry loves taking care of animals.He has 3 cats.He has 3 times as many dogs as cats.He has 2 fewer rabbits than dogs.He has a fish tank with three times the number of fish as rabbits.He also has a collection of gerbils that's 1/3 the number of fish he has.How many pets does Larry have?SELF-CONSISTENCY: Larry has 3 * 3 = 9 dogs.Larry has 9 -2 = 7 rabbits.Larry has 3 * 7 = 21 fish in the tank.Larry has a collection of gerbils that's 1/3 of the fish he has, so he has 21 / 3 = 7 gerbils.Larry has 3 + 9 + 7 + 21 = 40 pets.#### 40 GRACE: Larry has 3 * 3 = 9 dogs.Larry has 9 -2 = 7 rabbits.Larry has 3 * 7 = 21 fish in his tank.Larry has 21 / 3 = 7 gerbils.Larry has a total of 3 + 9 + 7 + 21 + 7 = 47 pets.#### 47.
QUESTION: James decides to run 3 sprints 3 times a week.He runs 60 meters each sprint.How many total meters does he run a week?SELF-CONSISTENCY: He runs 3*60= 180 meters a week.So he runs a total of 180*7= 1260 meters a week.#### 1260 GRACE: He runs 3*60= 180 meters a week.So he runs a total of 180*3= 540 meters a week.#### 540.
The LM used here is FLAN-T5 Large. Each sentence in the solutions above represents a reasoning step.

GRACE:
The company has 50 employees.20% of the employees are management.20% of the management is 10 people.30% of the management oversee the entire company.10 people * 30% = 3 people.Thus, 3 people oversee the entire company.#### 3.
QUESTION: Melanie found a blueprint online for a toothpick sculpture she wanted to make.It requires 200 toothpicks.Melanie's mom puts toothpicks in her sandwiches when she serves them for lunch.Melanie started saving them and has saved 10 toothpicks each week for the past 12 weeks.If she continues saving toothpicks at the same rate, how many more weeks will it take her to collect 200 toothpicks?SELF-CONSISTENCY: 10 toothpicks per week * 12 weeks = 120 toothpicks saved.If Melanie continues saving toothpicks at the same rate, she will save 120 toothpicks per week * 12 weeks = 1440 toothpicks saved.200 toothpicks = 7.2 weeks.#### 7.2.
QUESTION: An 8-year old child wants to buy a toy car which costs $12.He already has $4 savings.How many days will it take him to save the remaining amount of money if he promises to save $2 daily from his allowance?SELF-CONSISTENCY: The amount of money he has now is $4.The amount of money he needs to save is $12 -$4 = 8 dollars.The amount of money he saves each day is $2.Therefore, the number of days it will take him to save the remaining amount of money is 8 days * 2 = 16 days.#### 16.

GRACE:
The amount of money he needs to save is $12 -4 = 8 dollars.He can save $2 daily from his allowance, so he needs to save $8 / 2 = 4 dollars each day.It will take him 4 days to save the remaining amount of money.#### 4.   et al., 2021) where GRACE reaches the correct answer, unlike self-consistency.
The LM used here is FLAN-T5 Large. Each sentence in the solutions above represents a reasoning step.
QUESTION: Edward earned 9 dollars for each lawn he mowed.If he had 6 lawns to mow, but forgot to mow 2 of them, how much money did he actually earn?SELF-CONSISTENCY: Edward mowed 6 lawns for a total of 9 dollars.The total amount of money he should have earned is 6 x 9 = 54 dollars.However, he only mowed 5 lawns, so he actually earned 54-2 = 52 dollars.#### 52.GRACE: Edward earned 9 dollars for each lawn he mowed, so if he had 6 lawns to mow, he would have earned 6 x 9 = 54 dollars.However, he forgot to mow 2 lawns, so he only mowed 6-2 = 4 lawns.Therefore, he actually earned 4 x 9 = 36 dollars.#### 36.
QUESTION: Will invited 9 friends to a birthday party, but 4 couldn't come.If he wanted to buy enough cupcakes so each person could have exactly 8, how many should he buy?SELF-CONSISTENCY: Will invited 9 people to the party, but 4 of them couldn't come.If he wants to buy enough cupcakes so that each person has 8, he needs to buy 9 x 8 = 72 cupcakes.#### 72.GRACE: Will had 9 friends and 4 of them couldn't come, so he had 9-4 = 5 friends at the party.If he wants to buy 8 cupcakes for each person, then he needs to buy 5 x 8 = 40 cupcakes.#### 40.
QUESTION: Edward and his friend were buying trick decks from the magic shop for 6 dollars each. How much did they spend if Edward bought 3 decks and his friend bought 6 decks?SELF-CONSISTENCY: Edward bought 3 decks, so he spent 6 dollars, and his friend bought 6 decks, so he spent 6 dollars.The total amount spent was 6 dollars for Edward and 6 dollars for his friend, for a total of 12 dollars.#### 12. GRACE: Edward bought 3 decks and his friend bought 6 decks, so the total number of decks bought is 3+6 = 9 decks.The cost of each deck is 6 dollars, so the total cost is 9 x 6 = 54 dollars.#### 54.

QUESTION:
The school cafeteria had 14 apples.If they used 13 to make lunch for the students and then bought 49 more, how many apples would they have?SELF-CONSISTENCY: The cafeteria started with 14 apples and used 13 of them to make lunches for the students.After that, they bought 49 more apples, bringing their total number of apples to 14 + 49 = 63 apples.#### 63.GRACE: The school cafeteria had 14 apples and used 13 of them to make lunch for the students, leaving them with 14-13 = 1 apples.They then bought 49 more apples, bringing their total number of apples to 1 + 49 = 50 apples.#### 50.
QUESTION: There were 6 girls and 48 boys trying out for the schools basketball team.If only 7 of them got called back, how many students didn't make the cut?SELF-CONSISTENCY: There were 6 girls and 48 boys trying out for the schools basketball team.If only 7 of them got called back, then the total number of students who tried out was 6+48 = 54.To find out how many students didn't make the cut, we divide the total number of students who tried out by the number of students who were called back: 54 / 7 = 7.7 students didn't make the cut.#### 7.7.GRACE: There were 6 girls and 48 boys, so there were 6+48 = 54 students trying out for the basketball team.Only 7 were called back, so 54-7 = 47 students did not make the cut.#### 47.  , 2015) where GRACE reaches the correct answer, unlike selfconsistency.The LM used here is LLaMA7B.Each sentence in the solutions above represents a reasoning step.
QUESTION: A coin is heads up.Krishna flips the coin.Catalina does not flip the coin.Eileen does not flip the coin.Teddy flips the coin.Is the coin still heads up?Note that "flip" here means "reverse".SELF-CONSISTENCY: Krishna flips the coin, so the coin is now tails up.Catalina does not flip the coin, so the coin remains tails up.Eileen does not flip the coin, so the coin remains tails up.Teddy flips the coin, so the coin is now heads up.So, the answer is no, the coin is not heads up.#### tails.GRACE: Krishna flips the coin.The coin is now tails up.Catalina does not flip the coin.The coin is still tails up.Eileen does not flip the coin.The coin is still tails up.Teddy flips the coin.The coin is now heads up.#### heads.
QUESTION: A coin is heads up.Nubia flips the coin.Sarah does not flip the coin.Jalen does not flip the coin.Kris flips the coin.Is the coin still heads up?Note that "flip" here means "reverse".SELF-CONSISTENCY: Nubia flips the coin, so the coin is now tails up.Sarah does not flip the coin, so the coin remains tails up.Jalen does not flip the coin, so the coin remains tails up.Kris flips the coin, so the coin is now heads up.So, the coin is still tails up.#### tails.GRACE: Nubia flips the coin.The coin is now tails up.Sarah does not flip the coin.The coin is still tails up.Jalen does not flip the coin.The coin is still tails up.Kris flips the coin.The coin is now heads up.#### heads.
QUESTION: A coin is heads up.Leticia flips the coin.Jacinto flips the coin.Natasha does not flip the coin.Raphael flips the coin.Is the coin still heads up?Note that "flip" here means "reverse".SELF-CONSISTENCY: Leticia flips the coin.The coin is now tails up.Jacinto flips the coin.The coin is now heads up.Natasha does not flip the coin.The coin is still tails up.Raphael flips the coin.The coin is now heads up.#### heads.GRACE: Leticia flips the coin.The coin is now tails up.Jacinto flips the coin.The coin is now heads up.Natasha does not flip the coin.The coin is still heads up.Raphael flips the coin.The coin is now tails up.#### tails.
QUESTION: A coin is heads up.Mickey does not flip the coin.Dom flips the coin.Lilly flips the coin.Eloy does not flip the coin.Is the coin still heads up?Note that "flip" here means "reverse".SELF-CONSISTENCY: Mickey does not flip the coin, so the coin remains heads up.Dom flips the coin, so the coin is now tails up.Lilly flips the coin, so the coin is now heads up again.Eloy does not flip the coin, so the coin is still tails up.#### tails.GRACE: Mickey does not flip the coin.The coin is still heads up.Dom flips the coin.The coin is now tails up.Lilly flips the coin.The coin is now heads up.Eloy does not flip the coin.The coin is still heads up.#### heads.
QUESTION: A coin is heads up.Ashish flips the coin.Tracey flips the coin.Varun does not flip the coin.Emil flips the coin.Is the coin still heads up?Note that "flip" here means "reverse".SELF-CONSISTENCY: Ashish flips the coin, so it is now tails up.Tracey flips the coin, so it is now heads up again.
Varun does not flip the coin, so it remains tails up.Emil flips the coin, so it is now heads up again.#### heads.GRACE: Ashish flips the coin.The coin is now tails up.Tracey flips the coin.The coin is now heads up.Varun does not flip the coin.The coin is still heads up.Emil flips the coin.The coin is now tails up.#### tails.

Figure 2: Top: The three-step process to train the discriminator. (1) Sampling solutions with different mistakes from a given language model, keeping only the solutions with incorrect final answers. (2) Aligning the sampled solutions with the reference solutions to identify incorrect steps. (3) Learning the discriminator with a max-margin loss that assigns high scores to correct steps and low scores to incorrect steps. Bottom: The guided stepwise decoding process using the trained discriminator. Given the question and the prefix, we sample a pool of candidate next steps and use the discriminator to score them as in Equation (6). The top-scored step is then selected and added to the prefix. This process repeats until a final answer is generated.
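To make the bottom half of Figure 2 concrete, here is a minimal sketch of the guided stepwise decoding loop. The callables `sample_next_steps` and `score_step` are hypothetical stand-ins for the LM step sampler and the Equation (6) scoring rule; only the control flow mirrors the figure, not the released implementation's API.

```python
# Minimal sketch of the guided stepwise decoding loop in Figure 2 (bottom).
# `sample_next_steps` and `score_step` are hypothetical callables standing in
# for the LM's step sampler and the Eq. (6) scoring rule (which relies on the
# trained discriminator); they are not part of the released implementation.

def guided_stepwise_decoding(question, sample_next_steps, score_step,
                             num_candidates=20, max_steps=10):
    prefix = []  # reasoning steps selected so far
    for _ in range(max_steps):
        # 1. Sample a pool of candidate next steps from the LM.
        candidates = sample_next_steps(question, prefix, n=num_candidates)
        # 2. Score each candidate step (Eq. (6)).
        scored = [(score_step(question, prefix, step), step)
                  for step in candidates]
        # 3. Greedily select the top-scored step and append it to the prefix.
        _, best_step = max(scored, key=lambda pair: pair[0])
        prefix.append(best_step)
        # Stop once a final answer is generated ("####" in these examples).
        if "####" in best_step:
            break
    return prefix
```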

Figure 3: An example of the alignment produced by our alignment algorithm (described in Algorithm 2). The question and the reference solutions come from GSM8K (Cobbe et al., 2021). The "-" designates an empty step placeholder. There are three possible cases when aligning a reference solution with a sampled solution: missing, extra, and comparable steps. In the comparable case, the intermediate variables (underlined) are compared to determine the correctness of the sampled step.
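The snippet below is a rough Needleman-Wunsch-style sketch of how reference and sampled steps can be globally aligned, with gaps corresponding to the missing/extra cases in Figure 3. The `similarity` callable and the gap penalty value are illustrative assumptions, not the exact choices of Algorithm 2.

```python
# Rough sketch of a Needleman-Wunsch-style step alignment (cf. Algorithm 2).
# `similarity` is a hypothetical step-similarity function (e.g., cosine
# similarity of sentence embeddings); GAP marks missing/extra steps.

GAP = "-"  # empty step placeholder, as in Figure 3

def align_steps(reference_steps, sampled_steps, similarity, gap_penalty=-0.5):
    n, m = len(reference_steps), len(sampled_steps)
    # DP table of global alignment scores.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        score[0][j] = j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1][j - 1] + similarity(reference_steps[i - 1],
                                                     sampled_steps[j - 1])
            score[i][j] = max(match,
                              score[i - 1][j] + gap_penalty,   # missing step
                              score[i][j - 1] + gap_penalty)   # extra step
    # Backtrace to recover aligned (reference, sampled) pairs.
    aligned, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + similarity(
                reference_steps[i - 1], sampled_steps[j - 1]):
            aligned.append((reference_steps[i - 1], sampled_steps[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap_penalty:
            aligned.append((reference_steps[i - 1], GAP))  # missing in sample
            i -= 1
        else:
            aligned.append((GAP, sampled_steps[j - 1]))    # extra in sample
            j -= 1
    return list(reversed(aligned))
```

Aligned pairs with neither side equal to GAP correspond to the "comparable" case, where intermediate values can then be compared to label the sampled step as correct or incorrect.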
and includes math word problems about gain/loss. Each problem is accompanied by a step-by-step Python program. SVAMP (Patel et al., 2021) and MultiArith (Roy and Roth, 2015) consist of elementary-level math word problems. For MathQA-Gain, SVAMP, and MultiArith, we use the train-test splits included in the LILA benchmark (Mishra et al., 2022). As for symbolic reasoning tasks, we experiment with Coin Flip (CF; Wei et al., 2021; Kojima et al., 2022) and Tracking Shuffled Objects (TSO) from Big-Bench Hard (Srivastava et al., 2022), and we use the splits by Ho et al. (2022).

Figure 5: Cross-task performance over SVAMP and Multi-Arith. GRACE's final answer accuracy is shown when the discriminator is trained on different tasks. Results are averaged over 3 runs.
Multi-step reasoning. Two main types of approaches have been explored to improve multi-step reasoning: inference-time methods, which do not require additional language model (LM) training, and training-based methods, which require either labeled samples or rewards. Popular inference-time techniques include model prompting such as chain-of-thought (Nye et al., 2021; Wei et al., 2021) and its variants (Zhou et al., 2022; Zhang et al., 2022). While these input-based techniques operate on the LM input side, other methods target the output side. For instance, self-consistency (Wang et al., 2022) employs majority voting over multiple sampled solutions to determine the final answer. An alternative output-based method trains a verifier model to rank sampled solutions according to their correctness (Cobbe et al., 2021). However, verifiers and vanilla self-consistency exhibit no control over solution sampling. We also show in this paper (see Section 4) that verifiers trained on samples from smaller LMs perform very poorly. Training-based methods, on the other hand, focus on crafting learning objectives that teach the LM to reason correctly. For instance, Uesato et al. (2022) trained a model to assess the correctness of the entire reasoning chain, which is then used as a reward model. Ni et al. (2022) proposed training LMs on sampled partially correct solutions to enhance mathematical reasoning.
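For reference, self-consistency reduces to a majority vote over sampled final answers. A minimal sketch, assuming solutions end with a "#### <answer>" marker as in the examples in this appendix, and with `sample_solution` as a hypothetical LM sampler:

```python
# Minimal sketch of self-consistency (Wang et al., 2022): sample several
# solutions and take a majority vote over their extracted final answers.

from collections import Counter

def self_consistency(question, sample_solution, num_samples=20):
    answers = []
    for _ in range(num_samples):
        solution = sample_solution(question)  # hypothetical LM sampler
        if "####" in solution:
            # Final answers follow the "####" marker in these examples.
            answers.append(solution.split("####")[-1].strip())
    # The most frequent final answer wins the vote.
    return Counter(answers).most_common(1)[0][0] if answers else None
```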

Figure 6: Dev set accuracy over GSM8K and SVAMP with FLAN-T5Large under two solution alignment methods: our NW algorithm outperforms the naive alignment by 1.9% on GSM8K and 5.9% on SVAMP, highlighting the effectiveness of our proposed alignment method. Results are averaged over 3 runs.

Figure 7: GRACE's accuracy on GSM8K and MultiArith with different discriminator sizes with LLaMA7B. Discriminator size matters: larger discriminators have more capacity to model step correctness. The complexity of the task matters: a larger discriminator is required for GSM8K than for Multi-Arith to produce an observable performance boost.

QUESTION: Alice, Bob, and Claire are on the same team in a soccer match. At the start of the match, they are each assigned to a position: Alice is playing goalkeeper, Bob is playing left winger, and Claire is playing right midfielder. As the game progresses, pairs of players occasionally swap positions. First, Claire and Alice trade positions. Then, Bob and Claire trade positions. Finally, Alice and Bob trade positions. At the end of the match, Claire is playing Which choice is true? Answer choices: (A) goalkeeper, (B) left winger, (C) right midfielder. SELF-CONSISTENCY: First, Claire and Alice trade positions. This means that Claire is now playing goalkeeper and Alice is playing right midfielder. Then, Bob and Claire trade positions. This means that Bob is now playing goalkeeper and Claire is playing left winger. Finally, Alice and Bob trade positions. This means that Alice is now playing goalkeeper and Bob is playing right midfielder. At the end of the match, Claire is playing right midfielder. Therefore, the correct answer is (C). #### C GRACE: First, Claire and Alice trade positions. This means that Claire is now playing goalkeeper and Alice is playing right midfielder. Then, Bob and Claire trade positions. This means that Bob is now playing goalkeeper and Claire is playing left winger. Finally, Alice and Bob trade positions. This means that Alice is now playing goalkeeper and Bob is playing right midfielder. At the end of the match, Claire is playing left winger. Therefore, the answer is (B). #### B

QUESTION: Alice, Bob, and Claire are holding a white elephant gift exchange. At the start of the event, they are each holding a present of a different color: Alice has a red present, Bob has a black ball, and Claire has a white present. As the event progresses, pairs of people swap gifts. First, Bob and Claire swap their gifts. Then, Bob and Alice swap their gifts. Finally, Claire and Bob swap their gifts. At the end of the event, Bob has the Which choice is true? Answer choices: (A) red present, (B) black ball, (C) white present. SELF-CONSISTENCY: First, Bob and Claire swap their gifts. This means that Bob now has the white present, and Claire has the black ball. Then, Bob and Alice swap their gifts. This means that Alice now has the black ball, and Bob has the red present. Finally, Claire and Bob swap their gifts again. This means that Claire now has the red present, and Bob has the white present. At the end of the event, Bob has the white present. Therefore, the answer is (C). #### C GRACE: First, Bob and Claire swap their gifts. This means that Bob now has the white present, and Claire has the black ball. Then, Bob and Alice swap their gifts. This means that Bob now has the red present, and Alice has the white present. Finally, Claire and Bob swap their gifts again. This means that Claire now has the red present, and Bob has the black ball. At the end of the event, Bob has the black ball, so the answer is (B). #### B

Table 20: Example outputs from the Tracking Shuffled Objects task where GRACE reaches the correct answer, unlike self-consistency. The LM used here is FLAN-T5Large. Each sentence represents a single reasoning step.

QUESTION: Alice, Bob, and Claire are playing a game. At the start of the game, they are each holding a ball: Alice has a white ball, Bob has a purple ball, and Claire has a pink ball. As the game progresses, pairs of players trade balls. First, Bob and Alice swap balls. Then, Bob and Claire swap balls. Finally, Bob and Alice swap balls. At the end of the game, Bob has the Which choice is true? Answer choices: (A) white ball,
(B) purple ball, (C) pink ball. SELF-CONSISTENCY: First, Bob and Alice swap balls. This means that Bob now has the white ball and Alice has the purple ball. Then, Bob and Claire swap balls. This means that Bob now has the pink ball and Claire has the white ball. Finally, Bob and Alice swap balls again. This means that Bob now has the purple ball and Alice has the pink ball. At the end of the game, Bob has the purple ball. Therefore, the answer is (A). #### A. GRACE: First, Bob and Alice swap balls. This means that Bob now has the white ball and Alice has the purple ball. Then, Bob and Claire swap balls. This means that Bob now has the pink ball and Claire has the white ball. Finally, Bob and Alice swap balls again. This means that Bob now has the purple ball and Alice has the pink ball. At the end of the game, Bob has the purple ball. Therefore, the answer is (B). #### B.

QUESTION: Alice, Bob, and Claire are playing a game. At the start of the game, they are each holding a ball: Alice has a orange ball, Bob has a white ball, and Claire has a purple ball. As the game progresses, pairs of players trade balls. First, Bob and Claire swap balls. Then, Alice and Bob swap balls. Finally, Claire and Bob swap balls. At the end of the game, Bob has the Which choice is true? Answer choices: (A) orange ball, (B) white ball, (C) purple ball. SELF-CONSISTENCY: First, Bob and Claire swap balls. This means that Bob now has the purple ball, and Claire has the white ball. Then, Alice and Bob swap balls. This means that Alice now has the white ball, and Bob has the orange ball. Finally, Claire and Bob swap balls. This means that Claire now has the orange ball, and Bob has the purple ball. At the end of the game, Bob has the orange ball. Therefore, the answer is (A). #### A. GRACE: First, Bob and Claire swap balls. This means that Bob now has the purple ball, and Claire has the white ball. Then, Alice and Bob swap balls. This means that Alice now has the white ball, and Bob has the orange ball. Finally, Claire and Bob swap balls. This means that Claire now has the orange ball, and Bob has the white ball. At the end of the game, Bob has the white ball. Therefore, the answer is (B). #### B.

QUESTION: Alice, Bob, and Claire are playing a game. At the start of the game, they are each holding a ball: Alice has a red ball, Bob has a pink ball, and Claire has a black ball. As the game progresses, pairs of players trade balls. First, Claire and Alice swap balls. Then, Alice and Bob swap balls. Finally, Alice and Claire swap balls. At the end of the game, Claire has the Which choice is true? Answer choices: (A) red ball, (B) pink ball, (C) black ball. SELF-CONSISTENCY: First, Claire and Alice swap balls. This means that Claire now has the red ball, and Alice has the black ball. Then, Alice and Bob swap balls. This means that Alice now has the black ball, and Bob has the pink ball. Finally, Alice and Claire swap balls. This means that Alice now has the black ball, and Claire has the red ball. At the end of the game, Claire has the red ball. Therefore, the answer is (A). #### A. GRACE: First, Claire and Alice swap balls. This means that Claire now has the red ball, and Alice has the black ball. Then, Alice and Bob swap balls. This means that Alice now has the pink ball, and Bob has the red ball. Finally, Alice and Claire swap balls. This means that Alice now has the black ball, and Claire has the pink ball. At the end of the game, Claire has the pink ball. Therefore, the answer is (B). #### B.
QUESTION: Alice, Bob, and Claire are playing a game. At the start of the game, they are each holding a ball: Alice has a white ball, Bob has a brown ball, and Claire has a black ball. As the game progresses, pairs of players trade balls. First, Alice and Claire swap balls. Then, Alice and Bob swap balls. Finally, Bob and Claire swap balls. At the end of the game, Claire has the Which choice is true? Answer choices: (A) white ball, (B) brown ball, (C) black ball. SELF-CONSISTENCY: First, Alice and Claire swap balls. This means that Alice now has the black ball and Claire has the white ball. Then, Alice and Bob swap balls. This means that Alice now has the brown ball and Bob has the white ball. Finally, Bob and Claire swap balls. This means that Bob now has the white ball, and Claire has the black ball. At the end of the game, Claire has the white ball. Therefore, the answer is (A). #### A. GRACE: First, Alice and Claire swap balls. This means that Alice now has the black ball, and Claire has the white ball. Then, Alice and Bob swap balls. This means that Alice now has the brown ball, and Bob has the white ball. Finally, Bob and Claire swap balls. This means that Bob now has the black ball, and Claire has the brown ball. At the end of the game, Claire has the black ball. Therefore, the answer is (C). #### C.

Table 1: Final answer accuracy on four multi-step reasoning tasks. Self-consistency and verifier results use 40 samples for FLAN-T5Large experiments and 20 samples for LLaMA. The discriminator used with GRACE is a T5Large encoder. FLAN-T5Large results are aggregated over 5 runs and LLaMA over 3 runs. Absolute improvements by GRACE vs. greedy decoding and by GRACE w/ self-consistency (SC) vs. vanilla self-consistency are shown in parentheses. GRACE w/ self-consistency outperforms the baselines on all tasks.

Table 2: Final answer accuracy on Coin Flip (CF) and Tracking Shuffled Objects (TSO). FLAN-T5Large results are averaged over 5 runs and LLaMA13B over 3 runs. We do not show the results of FLAN-T5Large on Coin Flip, as the fine-tuned FLAN-T5Large already achieves near-perfect accuracy.

Table 3 shows the LLM and human evaluation results comparing GRACE to greedy decoding. We use fine-tuned SimCSE embeddings to compute the alignment for every task except Coin Flip, where we use the vanilla SimCSE (Gao et al., 2021) embeddings instead.
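The snippet below sketches how step similarity can be computed with SimCSE embeddings for the alignment. The specific checkpoint name is an assumption for illustration and may differ from the one used in our experiments.

```python
# Sketch of scoring step similarity with SimCSE sentence embeddings.
# The checkpoint below is an illustrative assumption, not necessarily the
# one used for the alignment in our experiments.

import torch
from transformers import AutoModel, AutoTokenizer

_name = "princeton-nlp/sup-simcse-bert-base-uncased"
_tokenizer = AutoTokenizer.from_pretrained(_name)
_model = AutoModel.from_pretrained(_name)

def step_similarity(step_a, step_b):
    inputs = _tokenizer([step_a, step_b], padding=True, truncation=True,
                        return_tensors="pt")
    with torch.no_grad():
        # SimCSE uses the pooler output as the sentence embedding.
        emb = _model(**inputs, return_dict=True).pooler_output
    return torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0).item()
```

A similarity function of this form can be plugged directly into the alignment sketch shown earlier, e.g., as the `similarity` argument of `align_steps`.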

Table 7: Number of examples for each split in the datasets used.

Table 8: The 6-shot prompt used with LLaMA7B for GSM8K.

Paul sold some books in a garage sale, leaving him with 27 books. If he had 136 books initially, how many books did he sell?
A: To find out how many books Paul sold, we need to subtract the number of books he has now from the number he had initially: 136 - 27 = 109 books. Therefore, Paul sold 109 books in the garage sale.
Q:

Table 14: Example outputs from GSM8K (Cobbe et al., 2021) where GRACE reaches the correct answer, unlike self-consistency. The LM used here is LLaMA7B. Each sentence in the solutions above represents a reasoning step.

QUESTION: Brenda's mother made cookies for guests. If she prepared 38 cookies and each of them had 19 cookies. How many guests did she prepare cookies for? SELF-CONSISTENCY: If each guest had 19 cookies, then the total number of cookies Brenda's mother prepared is 38 x 19 = 722 cookies. #### 722 GRACE: To find the total number of cookies Brenda's mother prepared, we need to multiply the number of guests by the number of cookies per guest: 38 / 19 = 2 guests. Therefore, Brenda's mother prepared cookies for 2 guests. #### 2

QUESTION: Because of the decision Sofia asked the students to suggest specific types of food. If 144 students suggested adding mashed potatoes 467 suggested adding bacon to the menu and 79 suggested adding tomatoes. How many more students suggested mashed potatoes than those that suggested tomatoes? SELF-CONSISTENCY: 144 students suggested adding mashed potatoes and 467 suggested adding bacon, so the difference in number of students who suggested mashed potatoes and tomatoes is 144 - 467 = -323. Therefore, 323 more students suggested mashed potatoes than tomatoes. #### 323 GRACE: The number of students who suggested mashed potatoes is 144, and the number of students who suggested tomatoes is 79. To find out how many more students suggested mashed potatoes than tomatoes, we subtract the number of students who suggested tomatoes from the number of students who suggested bacon: 144 - 79 = 65. Therefore, 65 more students suggested mashed potatoes than tomatoes. #### 65

QUESTION: Each basket of peaches has 19 red peaches and 4 green peaches. If there are 15 such baskets. Jerry had 4 action figures and 22 books on a shelf in his room. Later he added 6 more action figures to the shelf. How many action figures were on his shelf in all? SELF-CONSISTENCY: Jerry had 4 action figures and 22 books on the shelf, so there were 4 + 22 = 26 action figures on the shelf. He later added 6 more action figures, so there were now 26 + 6 = 32 action figures on the shelf. #### 32 GRACE: Jerry started with 4 action figures and added 6 more, so there were now 4 + 6 = 10 action figures on the shelf.

Table 18: Example outputs from MultiArith (Roy and Roth, 2015) where GRACE reaches the correct answer, unlike self-consistency. The LM used here is LLaMA7B. Each sentence in the solutions above represents a reasoning step.

Table 19: Example outputs from the Coin Flip task where GRACE reaches the correct answer, unlike self-consistency. The LM used here is LLaMA13B. Each sentence represents a single reasoning step.

QUESTION: Alice, Bob, and Claire are on the same team in a soccer match. At the start of the match, they are each assigned to a position: Alice is playing right midfielder, Bob is playing goalkeeper, and Claire is playing left winger. As the game progresses, pairs of players occasionally swap positions. First, Claire and Bob trade positions. Then, Bob and Alice trade positions. Finally, Claire and Bob trade positions. At the end of the match, Alice is playing Which choice is true? Answer choices: (A) right midfielder, (B) goalkeeper, (C) left winger. SELF-CONSISTENCY: First, Claire and Bob trade positions. This means that Claire is now the goalkeeper and Bob is the left winger. Then, Bob and Alice trade positions. This means that Alice is now the goalkeeper and Bob is the right midfielder. Finally, Claire and Bob trade positions again. This means that Claire is now the right midfielder and Bob is the left winger. At the end of the match, Alice is playing as the right midfielder. Therefore, the answer is (A). #### A GRACE: First, Claire and Bob trade positions. This means that Claire is now the goalkeeper and Bob is the left winger. Then, Bob and Alice trade positions. This means that Alice is now the left winger and Bob is the right midfielder. Finally, Claire and Bob trade positions again. This means that Claire is now the right midfielder and Bob is the goalkeeper. At the end of the match, Alice is playing left winger. Therefore, the answer is (C). #### C.

QUESTION: Alice, Bob, and Claire are on the same team in a soccer match. At the start of the match, they are each assigned to a position: Alice is playing left midfielder, Bob is playing right midfielder, and Claire is playing goalkeeper. As the game progresses, pairs of players occasionally swap positions. First, Bob and Alice trade positions. Then, Alice and Claire trade positions. Finally, Bob and Alice trade positions. At the end of the match, Bob is playing Which choice is true? Answer choices: (A) left midfielder, (B) right midfielder, (C) goalkeeper. SELF-CONSISTENCY: First, Bob and Alice trade positions. This means that Bob is now playing left midfielder and Alice is playing right midfielder. Then, Alice and Claire trade positions. This means that Claire is now playing right midfielder and Alice is playing goalkeeper. Finally, Bob and Alice trade positions again. This means that Bob is now playing left midfielder and Alice is playing goalkeeper. At the end of the match, Bob is playing left midfielder, so the answer is (A). #### A GRACE: First, Bob and Alice trade positions. This means that Bob is now playing left midfielder and Alice is playing right midfielder. Then, Alice and Claire trade positions. This means that Claire is now playing right midfielder and Alice is playing goalkeeper. Finally, Bob and Alice trade positions again. This means that Bob is now playing goalkeeper and Alice is playing left midfielder. Therefore, at the end of the match, Bob is playing goalkeeper. The answer is (C). #### C.

Table 21: Example outputs from the Tracking Shuffled Objects task where GRACE reaches the correct answer, unlike self-consistency. The LM used here is LLaMA13B. Each sentence represents a single reasoning step.