Interpretable Math Word Problem Solution Generation via Step-by-step Planning

Solutions to math word problems (MWPs) with step-by-step explanations are valuable, especially in education, to help students better comprehend problem-solving strategies. Most existing approaches only focus on obtaining the final correct answer. A few recent approaches leverage intermediate solution steps to improve final answer correctness but often cannot generate coherent steps with a clear solution strategy. Contrary to existing work, we focus on improving the correctness and coherence of the intermediate solution steps. We propose a step-by-step planning approach for intermediate solution generation, which strategically plans the generation of the next solution step based on the MWP and the previous solution steps. Our approach first plans the next step by predicting the math operation needed to proceed, given the history steps, and then generates the next step, token-by-token, by prompting a language model with the predicted math operation. Experiments on the GSM8K dataset demonstrate that our approach improves the accuracy and interpretability of the solution on both automatic metrics and human evaluation.


Introduction
Arithmetic math word problems (MWPs) consist of natural language statements describing real-world scenarios that involve numerical quantities, followed by a question asking for an unknown value. Solving MWPs requires parsing the textual statements and carrying out the corresponding calculations (Kumar et al., 2022). MWPs are an important educational tool that helps assess and improve student knowledge in basic mathematical concepts and skills (Walkington, 2013; Verschaffel et al., 2020). They also represent a long-standing interest in artificial intelligence (AI) research since correctly solving them serves as a key benchmark task for testing and improving the mathematical reasoning skills of AI models (Feigenbaum and Feldman, 1995; Bommasani et al., 2021; Cobbe et al., 2021; Lewkowycz et al., 2022).
There is a large body of literature that focuses on automatically solving MWPs. Earlier works took a modular approach that first analyzes unconstrained natural language and then maps intricate text patterns onto mathematical vocabulary (Sundaram et al., 2022). As a result, this approach relies heavily on hand-crafted rules to fill the gap between natural language and symbolic mathematical vocabulary (Sundaram et al., 2022). Recent works leverage advances in natural language processing and take a neural network-based, end-to-end approach, where a neural network encodes a numerical representation of the MWP (and the underlying equation), from which a decoder generates the final answer (Zou and Lu, 2019; Wang et al., 2017; Chen et al., 2020; Shen et al., 2021; Shao et al., 2022; Jie et al., 2022). Unfortunately, the vast majority of these works focus on generating and predicting a single final answer, since answer correctness is often the only evaluation metric. Therefore, these works do not provide any insights or explanations into how the models arrive at the answer. As a result, it is often difficult, if not entirely impossible, to explain the model's behavior, especially when it produces a wrong answer. The lack of interpretability of these methods makes it challenging to analyze them and unsafe to use them in real-world applications.
This interpretability issue has attracted increasing interest in MWP solving research. Recent works have shifted to designing models that generate not only the final answer for an MWP but also the intermediate steps. The ability to generate intermediate steps not only enables researchers to investigate model behavior but also enables new applications. For example, in personalized education and intelligent tutoring systems, these models have the potential to generate detailed, personalized solution steps as feedback to improve student understanding of the mathematical concepts and resolve misconceptions (Walkington, 2013; Karpicke, 2012; Koedinger et al., 2015). The recent GSM8K (Cobbe et al., 2021) dataset contains MWPs that come with 2 to 8 intermediate steps described in natural language, which provides us a good resource to study step-by-step solution generation. Many works apply (large) language models (LMs) on this dataset and achieve high accuracy in final answer generation, without studying the quality of intermediate steps (Chowdhery et al., 2022; Lewkowycz et al., 2022; Uesato et al., 2022; Kojima et al., 2022). These works use verifiers, the self-consistency decoding strategy (majority votes), chain-of-thought prompting, or calculators; see Section 4 for a detailed discussion.
However, existing LMs are still prone to generating incorrect intermediate steps despite yielding the correct final answer. The models are not competent at numerical reasoning, possibly because they generate intermediate steps word by word (or token by token) and cannot look far ahead. As a result, they only use shallow heuristics in word occurrence and lack multi-step mathematical reasoning capabilities, which solving an MWP requires. A recent study that experiments on GPT-4 also points out that the architecture of next-word prediction precludes any "inner dialog" and cannot really plan ahead (Bubeck et al., 2023).

Contributions
In this paper, we study the problem of generating accurate and high-quality intermediate solution steps with natural language explanations via step-by-step planning using LMs. We formulate this problem as a controllable generation problem where the LM aims to generate the correct intermediate solution at each solution step, given the MWP and previous solution steps. This problem is particularly challenging since the generated solution steps need to be accurate, i.e., each intermediate step must be mathematically valid and on the path to the correct answer. We need an approach different from widely-adopted, attribute-controlled generation approaches for topic or sentiment, where the attribute is nuanced and cannot be matched exactly (Dathathri et al., 2020; Krause et al., 2020; Shirish Keskar et al., 2019).
To overcome these challenges, we introduce a planning-LM approach, where we plan the strategy for the next solution step and then use the plan to guide LMs to generate the step. Since symbols and patterns are crucial to the effectiveness of chain-of-thought prompting (Madaan and Yazdanbakhsh, 2022), we design plans in the form of mathematical operations to prompt the model to generate the next intermediate step. We summarize our contributions as follows.
[C1] We explore the use of a planning approach for step-by-step solution generation for MWPs. To the best of our knowledge, our work is the first to focus on generating high-quality intermediate solution steps via LMs.
[C2] We first predict the mathematical operation applied in the next solution step using a small model and then apply a carefully-constructed prompt to control an LM to generate the next solution step. Our approach can be extended to many downstream applications due to its interpretability and high controllability.
[C3] We evaluate our planning-LM approach on the GSM8K dataset to demonstrate its effectiveness, both quantitatively and qualitatively. With minimal additional parameters (0.02%), it outperforms existing approaches on both final answer accuracy and intermediate step quality. Moreover, by manually changing the math operation prompt, we can control our approach to generate different correct solution paths for the same MWP.

Notation
We first define all of the terms and components in our approach. We define an MWP as $Q = \{q_1, q_2, \ldots, q_n\}$, where $q_i$ represents a token, which is either a numerical value, a mathematical operator, or a word/sub-word. The corresponding step-by-step solution is $S = \{S_1, S_2, \ldots\}$, where $S_i$ denotes the $i$-th step of the solution. Each step $S_i = \{s^i_1, s^i_2, \ldots\}$ consists of a sequence of tokens. Next, we define our prompt in two parts. The first part is the textual instruction prompt, which contains words that LMs can understand, and the second part is the mathematical operation prompt, which is a special token that tells the LM which mathematical operation to perform in the next solution step. We denote the instruction prompt as $P = \{p_1, p_2, \ldots\}$, where $p_i$ represents a word/sub-word token, and the operation prompt as $O = \{o\}$, where $o$ is a categorical variable indicating the math operation token. We define $H_i$ as the solution context, i.e., the history at step $S_i$, which consists of the problem $Q$ and all previous steps, $\{S_1, \ldots, S_{i-1}\}$. $M$ denotes the base LM and $e$ is its corresponding token embedding function. Finally, we define $f$ as the prompt embedding function. Both $e$ and $f$ map tokens into $\mathbb{R}^K$, where $K$ is the hidden state dimension of the LM.

Figure 1: An overview of our step-by-step MWP solution generation approach. Planning-LM first predicts the next-step operation hint (a-1) and then controls the next-step generation via the predicted operation hint (a-2). Panel (b) shows the overall generation process given the question Q.

Methodology
We now define our MWP solution generation task and detail the specifics of our approach. Given a question $Q$, our task is to generate a step-by-step solution $S = \{S_1, S_2, \ldots\}$, with each step consisting of a combination of textual and mathematical tokens, to reach the final answer. We formulate the problem as a step-wise controllable generation task using prompt-based LM fine-tuning. Figure 1 shows an overview of our approach, including its two main components. First, we utilize the MWP and the solution history to plan and predict the mathematical operation to apply in the next step. Second, we use the predicted operation prompt together with the instruction prompt to guide the next-step generation process. Our key technical challenges are (i) how to learn a solution planning strategy to transition from step to step and (ii) once we have the next operation, how to design and apply prompts that guide the generative LM to generate a next step that follows the plan.

Operation Prediction
Our first step is to predict the mathematical operation to be applied in the next step. To achieve this, we concatenate the solution history $H$ and a crafted instruction prompt $P$ (e.g., "What is the next operation?") followed by the special token "[cls]" as input to a (not necessarily large) LM. We encode solution history tokens with a vocabulary embedding function $e_\beta$ and instruction prompt tokens with a separate prompt embedding function $f_\theta$; β and θ are the parameters of these parts, i.e., the embedding layer in an LM. Then, we obtain the representation of the solution history as the final layer hidden state of the LM, i.e., $M$. To predict the operation action of the next step, we use a one-layer, fully-connected network as the classifier, with weight $w_\gamma$, to obtain an operation score vector $s \in [0, 1]^{|O|}$ over the valid math operations, where $|O|$ is the number of operation classes and γ is the set of parameters for the classifier. Since we need to use an LM for step generation, introducing a separate LM for operation prediction leads to a large number of parameters. Therefore, we use the same LM for both operation planning and solution step generation. The objective function for operation planning is the cross-entropy loss on operators, i.e., $\mathcal{L}_{CE} = -\sum_{i} t_i \log(s_i)$, where $s_i$ is the score of operation class $i$ and $t_i$ is an indicator such that $t_i = 1$ when $i$ is the true label and $t_i = 0$ otherwise. We obtain true labels by extracting mathematical operations from each step of the solution in the training data, which we detail below in Section 2.3.
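The classifier and its loss can be illustrated with a minimal sketch. It assumes a softmax over the classifier outputs (the text only states that scores lie in [0, 1]), adds a bias term for generality, and uses made-up numbers for the hidden state and weights; the function names are hypothetical and not taken from the authors' code.

```python
import math

def operation_scores(hidden, weights, bias):
    """Map a hidden-state vector to one score per operation class."""
    # One fully-connected layer: logits[c] = <weights[c], hidden> + bias[c].
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    # Softmax (an assumption), shifted by the max logit for numerical stability.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, true_class):
    """-sum_i t_i * log(s_i), where t is the one-hot indicator of true_class."""
    return -math.log(scores[true_class])

# Toy example: K = 3 hidden units and |O| = 4 operation classes (made-up values).
hidden = [0.5, -1.0, 2.0]
weights = [[0.1, 0.0, 0.3],
           [0.2, 0.1, -0.1],
           [0.0, 0.5, 0.2],
           [-0.3, 0.2, 0.1]]
bias = [0.0, 0.0, 0.0, 0.0]

scores = operation_scores(hidden, weights, bias)
predicted = max(range(len(scores)), key=scores.__getitem__)
loss = cross_entropy(scores, true_class=0)
```

In the approach described above, `hidden` would be the final-layer hidden state of the shared LM rather than a hand-written vector.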

Controllable Step Generation
Once we have the predicted operation $O$, we append the corresponding prompt to the instruction prompt $P$ to form our final prompt for step generation. Our task becomes a controllable generation task: given the history $H$ and the prompt $[P; O]$ that plans the next step, our goal is to generate the next step $S$ token-by-token. We generate each step autoregressively, conditioning every token on the history, the prompt, and the tokens already generated in that step. The overall step-by-step solution $S$ with $N$ steps is then generated one step at a time in the same way. The step generation objective is the negative log-likelihood of the ground-truth step tokens, where the set of parameters includes the previously defined β, θ, γ and the LM parameters ψ. β and ψ are fine-tuned while θ and γ are learned from scratch. We also investigate two ways to position the prompt in the LM input: as a prefix, where we place it at the beginning, i.e., the input is given by $[P; O; H]$, and as an infix, where we append the prompt after the history, i.e., the input is given by $[H; P; O]$.
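One way to write out the step likelihood, the solution likelihood, and the training objective described in this subsection is sketched below. It follows the symbols of the Notation section; the step length $T_i$, the per-step subscripts on $O_i$ and $H_i$, and the exact factorization are assumptions made for illustration and should not be read as the authors' exact equations.

```latex
% T_i: number of tokens in step S_i (assumed symbol).
% O_i: operation prompt used for step i; H_i: history at step i.
\begin{align}
p\big(S_i \mid H_i, [P; O_i]\big)
  &= \prod_{j=1}^{T_i} p\big(s^i_j \mid H_i, [P; O_i], s^i_1, \ldots, s^i_{j-1}\big), \\
p\big(S \mid Q, O_1, \ldots, O_N\big)
  &= \prod_{i=1}^{N} p\big(S_i \mid H_i, [P; O_i]\big), \\
\mathcal{L}_{\mathrm{LM}}
  &= -\sum_{i=1}^{N} \sum_{j=1}^{T_i}
     \log p_{\beta,\theta,\gamma,\psi}\big(s^i_j \mid H_i, [P; O_i], s^i_1, \ldots, s^i_{j-1}\big).
\end{align}
```

With the infix layout, the conditioning context for each token is the sequence $[H_i; P; O_i]$ followed by the tokens of the step generated so far; with the prefix layout it is $[P; O_i; H_i]$.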

Prompt Design
Our prompt consists of two parts: the instruction prompt gives the LM general instructions on what to generate, while the operation prompt provides specific guidelines for the mathematical calculation involved in the next step. For the instruction prompt, we apply prompt mining (Yuan et al., 2021) to find good instructions, i.e., word tokens that are the most informative for the LM to accomplish the desired task. See Section D.2 for details. For the operation prompt, we extract 20 common operations from the training data, such as one-step addition [n + n], subtraction [n − n], multiplication [n * n], etc., and use them as prompts. We note that these operators are easy to find and can be automatically extracted, which means that there is no need to manually create labels to train the operation prediction LM. The instruction tokens and operation action tokens form the entire vocabulary of the prompt function $f_\theta$. The prompt function is a two-layer perceptron with a ReLU activation function.
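To illustrate how operation labels such as [n * n] could be extracted automatically, the sketch below masks every number on the left-hand side of an equation with the placeholder n. The regular expression and the function name `operation_template` are assumptions for illustration; the paper does not specify the extraction code.

```python
import re

# A token is either a number (integer or decimal, optional thousands separators)
# or one of the arithmetic operators and parentheses.
TOKEN = re.compile(r"(?P<num>\d+(?:,\d{3})*(?:\.\d+)?)|(?P<op>[-+*/()])")

def operation_template(equation: str) -> str:
    """Turn an equation such as '4*2 = 8' into the operation label '[n * n]'."""
    # Only the left-hand side describes the operation being applied.
    lhs = equation.split("=")[0]
    # Numbers become the placeholder n; operators are kept; other text is ignored.
    tokens = ["n" if m.group("num") else m.group("op")
              for m in TOKEN.finditer(lhs)]
    return "[" + " ".join(tokens) + "]"

labels = [operation_template(eq) for eq in ("4*2 = 8", "8+4 = 12", "12/6 = 2")]
```

Applied to the equations of a solution step, this produces labels in the same bracketed form as the operation prompts listed above.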

Optimization
Although our entire approach can be trained together in an end-to-end way, we found that optimizing the operation prediction model and fine-tuning the LM/prompts for step generation asynchronously leads to better performance. Our intuition is that the operation predictor is a high-level decision-making policy for the entire solution while the LM generation process is a low-level (token-by-token) decision-making process for the current step. Optimizing these two modules simultaneously may cause inconsistency since the operation predictor may make a decision based on LM parameters that also need to be updated. Therefore, we first optimize the parameters of the generation LM and prompts with the step generation task loss, using ground truth operation labels, which we extract from the mathematical part of each step in the training data. Then, we alternate between two phases: freezing both the LM $M$ and the prompt function $f$ while tuning the operation predictor, and freezing the operation predictor while tuning the LM and the prompt function. In this way, we can ensure that the whole model converges in a stable manner (Wang et al., 2020).

Table 1: Planning-LM outperforms fine-tuning LMs for both small and medium-sized GPT-2. Moreover, Planning-LM with a small GPT-2 achieves performance comparable to fine-tuning medium GPT-2, implying that our method can make a smaller model rival larger ones fine-tuned in the traditional way.

Experiments
We now detail a series of experiments that we conducted to validate the effectiveness of our proposed planning-LM approach on step-by-step MWP solution generation. Since our focus is on MWP solution generation with explanations, GSM8K (Cobbe et al., 2021) is a good fit for our purpose. This dataset contains 8.5K high-quality and linguistically diverse MWPs, where each MWP has 2-8 solution steps. See Section C for details on data preprocessing.

Automated Metrics
We need a variety of different metrics to understand the effectiveness of our planning-LM approach.
For the final answer, we use the solve rate metric to evaluate whether the model generates the final correct answer to each MWP. Since generating meaningful steps is also key, we use the BLEU metric (Papineni et al., 2002) to evaluate language generation quality. For intermediate steps, we use the equation match accuracy (ACC-eq) metric to evaluate whether a generated step contains a math expression (including numbers) that matches the ground truth. Since LMs generate math equations as strings, we decompose the equation string into tokens and calculate the token level match rate instead of the overall string match. We also use the operation match accuracy (ACC-op) metric to evaluate whether a generated step's operation label matches the ground truth.
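The two step-level metrics can be sketched as follows. The tokenization and the choice to normalize by the longer token sequence are assumptions, since the text only states that the match is computed at the token level; the function names are hypothetical.

```python
import re

def equation_tokens(equation: str) -> list[str]:
    """Split an equation string into number and operator tokens."""
    return re.findall(r"\d+(?:\.\d+)?|[-+*/=()]", equation)

def acc_eq(generated: str, reference: str) -> float:
    """Token-level match rate between a generated and a reference equation."""
    gen, ref = equation_tokens(generated), equation_tokens(reference)
    if not gen and not ref:
        return 1.0
    matches = sum(g == r for g, r in zip(gen, ref))
    # Dividing by the longer length penalizes missing and extra tokens alike.
    return matches / max(len(gen), len(ref))

def acc_op(generated_ops: list[str], reference_ops: list[str]) -> float:
    """Fraction of steps whose operation label equals the reference label."""
    pairs = list(zip(generated_ops, reference_ops))
    if not pairs:
        return 0.0
    return sum(g == r for g, r in pairs) / len(pairs)

# The first chain-of-thought step in Table 4 uses 2*2 where 4*2 is needed.
partial = acc_eq("2*2 = 4", "4*2 = 8")
```

A token-level rate gives partial credit to a step that applies the right operation to one wrong operand, which an exact string match would score as zero.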

Human Evaluation
Our proposed planning-LM framework cannot be accurately evaluated using only automated metrics since text similarity metrics such as BLEU do not accurately reflect the mathematical validity of intermediate solution steps. To address this limitation, we implemented a human evaluation protocol with three metrics: reasoning strategy, clear explanation, and overall preference. Ten raters with a good understanding of fundamental mathematics concepts evaluated 50 randomly selected MWPs using the protocol, where their task was to compare two different step-by-step solutions. Each MWP received at least three ratings. The full evaluation template can be found in Section G.

Experimental Settings
We conduct two experiments to verify the effectiveness of our planning-LM framework. In the first, single-step experiment, we input the question and ground-truth solution steps to the model, let it generate the next step, and calculate the ACC-eq and ACC-op metrics for each generated step. Since some of the steps are too short, yielding a high variance in BLEU scores, we concatenate all generated steps and calculate the overall BLEU metric between the ground truth solution and this true history-informed solution. In the second, all-step experiment, we only provide the model with the MWP and ask it to generate all solution steps. We then calculate the solve rate metric to evaluate whether the final answer is correct. We choose GPT-2 (117M parameters) and GPT-2-medium (345M) as our base models and compare the generation results between LM fine-tuning and planning-LM. Meanwhile, we perform another experiment using the ground truth operation prompt as input for planning-LM to generate the next step. The result, an upper bound on the performance of planning-LM, reflects the effectiveness of low-level token-by-token generation in each step, while ACC-eq and ACC-op reflect the effectiveness of high-level mathematical operation planning across steps. We also conduct the above experiments on encoder-decoder LMs: T5-base (220M) and T5-large (770M). The decoder architecture is the same as that of the GPT-2 models, but instead of treating the question as history input, T5 contains an extra encoder to encode the question and uses cross-attention over the question to generate results.
To fairly compare planning-LM with other works on LLM prompting such as chain-of-thought, instead of prompt-tuning on a relatively small LM, we adapt our approach for in-context learning. We select five examples with a specific format $(Q, P, O_1, S_1, P, O_2, S_2, \ldots)$, i.e., the question followed by a number of prompt-operation-solution triples. We use the examples with GPT-3 ("text-davinci-003") for in-context learning. An example of the prompt we use is shown in Table 6.
Table 1 shows the experimental results for all prompt-tuning based approaches across the two experiments. We see that planning-GPT-2 and planning-T5 with our operation classifier outperform chain-of-thought-tuning on both GPT-2 and T5. We also observe that a similar trend holds for the larger models, GPT-2-medium and T5-large. We highlight that with the planning component, which introduces only around 10K new parameters for the MWP solving task, a base GPT-2 model with 117M parameters performs similarly to a much larger base GPT-2-medium model with 345M parameters. This observation shows that our planning approach is highly parameter-efficient for MWP solving. The other observation is that our approach seems to adapt better to decoder-only LMs than to encoder-decoder LMs, even ones with more parameters; T5-base yields almost the same performance as GPT-2 despite having roughly twice as many parameters.
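The in-context example format (Q, P, O_1, S_1, P, O_2, S_2, ...) described in this section can be assembled with a short helper. This is only a sketch: the "Question:" label, the line-based layout, and the function names are assumptions, and the actual prompt in Table 6 may be formatted differently. The question text is abbreviated from the first example in Table 4.

```python
INSTRUCTION = "What is the next operation?"  # instruction prompt P

def format_example(question, steps, instruction=INSTRUCTION):
    """Format one example as Q followed by (P, O_i, S_i) triples."""
    lines = ["Question: " + question]
    for operation, solution_step in steps:
        # P, then the operation prompt O_i, then the solution step S_i.
        lines += [instruction, operation, solution_step]
    return "\n".join(lines)

def build_prompt(examples, new_question, instruction=INSTRUCTION):
    """Concatenate solved examples and end with the new question plus P."""
    blocks = [format_example(q, steps, instruction) for q, steps in examples]
    blocks.append("Question: " + new_question + "\n" + instruction)
    return "\n\n".join(blocks)

examples = [(
    "Antoine doubles the 2 pounds of onions in his recipe, at $2.00 a pound, "
    "and buys 2 boxes of beef stock at $2.00 a box. The soup serves 6. "
    "What is the cost per serving?",
    [
        ("[n * n]", "Antoine needs 4 pounds of onions, which costs 4*2 = 8 dollars."),
        ("[n * n]", "He also needs 2 boxes of beef stock, which costs 2*2 = 4 dollars."),
        ("[n + n]", "The total cost for the ingredients is 8+4 = 12 dollars."),
        ("[n / n]", "The cost per serving is 12/6 = 2 dollars."),
    ],
)]

prompt = build_prompt(examples, "Judy teaches 5 dance classes every weekday and "
                                "8 classes on Saturday. How many classes per week?")
```

Ending the prompt with the instruction leaves the model to produce the next operation first and the corresponding step second, mirroring the plan-then-generate order of the fine-tuned variant.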

Prompt-tuning
To validate the effectiveness of each component in our planning-LM approach, we conduct an ablation study on four different components: using prefix or infix prompts, fixed or fine-tuned mathematical operation prompts, instruction prompt mining, and the operation predictor. We see that using infix prompts, fine-tuned mathematical prompts, and the operation predictor improves performance the most across different settings. We also see that infix prompts are significantly better than prefix prompts, which is different from the observation made in prior work (Li and Liang, 2021). One possible explanation is the incompatibility between prefix prompting and step-by-step generation: prefix prompts put the most important instruction at the front of the LM input, making all generated tokens attend to it, which leads to higher operation prediction accuracy but worse generation performance on other tokens.

In-context learning
We conduct experiments by giving in-context prompting examples to GPT-3 in different formats; the results are shown in Table 3. We see that planning-LM yields the best solve rate, significantly higher than other approaches. We further analyze the human evaluation results in Section 3.5. Figure 2 shows the distributions of participants' selections on human evaluation metrics for the generated solutions. We see that solutions generated by planning-LM are significantly better than those produced by chain-of-thought on all three metrics, indicating that our approach leads to solutions with more precise language and better problem-solving strategies.
Table 4 shows two examples that compare the full step-by-step solutions generated by our planning-LM approach and chain-of-thought prompting. For Example 1, we see that although chain-of-thought happens to produce the correct answer, the reasoning starts to fall apart at Step 3. It generated the correct final answer only because the question mentioned rounding the answer to the nearest integer; however, its intermediate answer 1.33 is wrong. For Example 2, the answer generated by chain-of-thought does not have detailed wording explanations, whereas planning-LM's solution has details of each step of the solving strategy, making the solution much easier to understand.

Qualitative Analysis
Perhaps surprisingly, we observe that planning-LM can generate multiple solutions if it predicts a different math operation in the next step compared to the ground truth solution. Therefore, we conduct a follow-up experiment by giving the model a hand-crafted plan via operation prompts to see whether it can generate an alternative correct solution strategy.
Table 5 further demonstrates that our approach can generate multiple correct solution paths for the same problem. Among the four strategies we used, feeding Plans I and II enables the model to generate the correct final answer; the generated solutions follow the operation steps given, indicating that the model has some reasoning ability and can extract some meaningful patterns from data. Plan III results in a flawed solution and Plan IV fails since we do not have an operation class that matches the step. For Plan III, the first step, [n + n + . . .], is not seen often enough in the training data. For Plan IV, (n + n) × n is not seen in the training data either. However, we note that in this case, using the closest operation, [n + n × n], results in a solution that gets very close to the correct final answer. These results suggest that a better representation of the operation prompt is crucial for future work since our current approach is limited to a finite number of predefined operations; a prompt operation generator rather than a classifier could be a better choice for a wide variety of mathematical operations. We also note that this flexibility gives our planning-LM approach the potential to be useful in real-world applications. For example, these solution plan controls may encourage math students to explore different solution strategies and be more creative.

Related work
MWP solver. A large body of recently proposed MWP solvers parses an MWP into its underlying equation, which has been a very active research area with a plethora of related work. These works differ mainly in their technical approaches, which broadly fall into three categories. First, some works explore MWP solving via reinforcement learning (RL), which rewards the model when it generates the correct answer (Huang et al., 2018; Wang et al., 2018b). RL methods generally require a sizable dataset and can be unstable to train, which may not be suitable for most MWP datasets that are only of modest sizes. Second, some works exploit the combination of symbolic and neural-network-based approaches, e.g., by combining pre-defined symbolic patterns such as solution templates (Wang et al., 2019) and symbolic tree structures of equations (Xie and Sun, 2019; Li et al., 2020; Qin et al., 2020; Wang et al., 2018a). These methods can be significantly constrained by these patterns and it may be challenging to generalize them to other MWPs whose solutions are not expressed by these patterns. Lastly, some works build on large LMs (LLMs) via special fine-tuning or inference techniques. Chain-of-thought prompting prompts LLMs to generate intermediate steps before reaching the final answer. Cobbe et al. (2021) fine-tune a model as a verifier and apply the verifier to rank outputs in the decoding phase. Self-consistency decoding uses a majority vote among outputs to select the best answer. Lewkowycz et al. (2022) fine-tune an LLM on a large collection of math-specific datasets, combining existing techniques. There are also some extension works based on chain-of-thought, such as least-to-most prompting, which decomposes a complicated question into single-hop questions, and STaR (Zelikman et al., 2022), which iteratively generates rationales using a small number of examples and a large dataset. Our work differs from previous studies by not only prioritizing the final solution accuracy but also emphasizing the generation quality of individual solution steps. Additionally, we introduce a novel hierarchical planning method for fine-tuning, in contrast to previous approaches that rely solely on ordinary language modeling techniques.

Table 4: Demonstrations of generated solutions comparing planning-LM and chain-of-thought. Question 1 shows that an intermediate step of chain-of-thought has wrong reasoning but still reaches the final answer. Question 2 shows that planning-LM results in a better reasoning strategy since the calculation process is simple and more concrete.

1. Question: Antoine's french onion soup recipe calls for 2 pounds of onions. He likes to double that amount. His soup serves 6 people. The onions are currently on sale for $2.00 a pound. He also needs 2 boxes of beef stock, that are also on sale for $2.00 a box. What is the cost per serving? (Round to the nearest integer.)
Chain-of-thought: The cost of the onions is 2 pounds times $2.00 per pound, which is 2*2 = 4 dollars. The cost of the beef stock is 2 boxes times $2.00 per box, which is 2*2 = 4 dollars. The total cost of the ingredients is 4+4 = 8 dollars. The cost per serving is 8 divided by 6 servings, which is 8/6 = 1.33 dollars. Rounded to the nearest integer, the cost per serving is $2.
PlanningLM: Antoine needs 4 pounds of onions, which costs 4*2 = 8 dollars. He also needs 2 boxes of beef stock, which costs 2*2 = 4 dollars. The total cost for the ingredients is 8+4 = 12 dollars. The cost per serving is 12/6 = 2 dollars.

2. Question: Judy teaches 5 dance classes, every day, on the weekdays and 8 classes on Saturday. If each class has 15 students and she charges $15.00 per student, how much money does she make in 1 week?
Chain-of-thought: On the weekdays, Judy teaches 5 classes per day, with 15 students in each class. This means she has 5*15 = 75 students per day. At a rate of $15 per student, she makes 75*15 = 1125 dollars per day. For the weekdays, she makes 5*1125 = 5625 dollars. On Saturday, Judy teaches 8 classes with 15 students in each class. This means she has 8*15 = 120 students. At a rate of $15 per student, she makes 120*15 = 1800 dollars. For the week, Judy makes 5625+1800 = 7425 dollars. Rounded to the nearest integer, the cost per serving is $2.
PlanningLM: Judy teaches 5 classes per weekday, so 5*5 = 25 classes. Judy teaches 8 classes on Saturday, so 8*1 = 8 classes. The total number of classes she teaches in a week is 25+8 = 33 classes. Each class has 15 students, so 15*33 = 495 students. Judy charges $15.00 per student, so 495*15 = $7,425.

Table 5 (fragment):
Result: The first train covers 80 * 2 = 160 miles. On the second day, the trains covered 2 trains * 150 miles/train = 300 miles. Total distance covered is 160 + 300 = 460 miles. Answer is 460.
Plan III: [n + n + . . .] -> n + n -> [Ans]
Result: The total distance covered in the two days is 80 + 80 + 150 = 310 miles. The total distance covered in one day is 310 + 150 = 460 miles. Answer is 460.
Plan IV: [n + n * n] -> [Ans]
Result: The total distance covered by trains in the two days is 150 + 80 * 2 = 310 miles. Answer is 310.
Controllable text generation. Given the rise of LLMs, controllable generation methods that guide these large models to generate desirable content and avoid potential pitfalls such as bias (Bender et al., 2021) have been a recent research trend. These controllable generation methods generally fall into two categories. Works in the first category modify the token distribution at each time step to achieve controllable generation via gradient-based methods (Dathathri et al., 2020), external classifier-based methods (Krause et al., 2020; Liu et al., 2021a), or resampling (Clark et al., 2020; Bhattacharyya et al., 2021; Bakhtin et al., 2021). Works in the second category fine-tune the LLM via either language modeling (Shirish Keskar et al., 2019) or reinforcement learning (Khalifa et al., 2021). These works focus on controllable generation for natural language and study nuanced control attributes such as topic and sentiment that can only be matched implicitly. In contrast, our work focuses on both natural and mathematical language, which involves control attributes, e.g., math operation hints in the form of equations, that need to be matched exactly.
In this paper, we addressed the new problem of performing fine-grained, step-by-step controllable solution generation for math word problems. We proposed an approach combining planning and language models to generate interpretable solution steps. Our approach leverages pre-trained language models in two ways: at each step, we plan the mathematical operation to be applied and then use this plan as a prompt to control the token-by-token generation of the step. We demonstrated that with minimal additional parameters introduced, our approach significantly improves math word problem-solving performance over simply fine-tuning language models. We also showed that due to the interpretability and high controllability of operation prompts, we can use our approach to generate solutions with alternative strategies by giving it different solution plans. Future work can further explore generating an entire solution path by predicting math operators for each step and revising the plan after each step is generated. We can also explore the application of our approach in real-world educational settings, e.g., for open-ended answer scoring (Lan et al., 2015).

Limitations
First, our work applies hand-crafted action labels as operation hints, which limits its ability to represent more complex operation steps. In future work, we can use a generator instead of a classifier to generate a more flexible set of operation prompts, making them more representative and meaningful. Second, because our approach is highly controllable, a wrong operation prediction will mislead the generation of the corresponding intermediate step. To mitigate this drawback, we can apply a verifier (Cobbe et al., 2021) to evaluate the reliability of the generated operation prompts. When the reliability is low, we discard the operation prompt to prevent it from guiding the model onto an incorrect path.

Ethics Statement
Currently, most existing work leverages the ability of large, pre-trained language models to generate intermediate reasoning steps either to understand the models' behaviors (e.g., models' moral judgments (Jin et al., 2022)) or to improve their problem-solving accuracy (e.g., MWP solving (Lewkowycz et al., 2022)). Little work focuses on the quality of the generated intermediate reasoning steps themselves. These generated steps have potentially significant real-world applications, such as providing feedback automatically in large-scale education scenarios, but they are not yet of high enough quality to be readily used in practice. Our work contributes to the important direction of making such generated intermediate steps more accurate, coherent, and high-quality. However, language models equipped with our approach may still generate unreasonable intermediate steps, even though they improve upon existing approaches. These unreasonable steps may mislead students while they are learning, which poses a potential risk to their usage. As a result, more work is required before our approach can be readily deployed in practice. We believe that, in its current form, our work is best suited for use with experts, i.e., education subject matter experts or instructors, to help them write solution steps for new MWPs more efficiently.

A Hyper-parameters
We use a learning rate of 5e-5, a batch size of 8, and 10 epochs for all training processes. We set "what is the next operation?" as our instruction prompt. During token generation, we use greedy decoding and apply a calculator to avoid calculation errors. Model training is carried out on an NVIDIA RTX 3090 GPU.
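As an illustration of the calculator, the sketch below shows one possible way to recompute the result inside a GSM8K-style annotation such as «3*30=90». It is a minimal sketch and not necessarily the implementation used in our experiments: the names `safe_eval`, `format_number`, and `apply_calculator` are hypothetical, and we assume that only the four basic arithmetic operators appear in the annotations.

```python
import ast
import operator
import re

# Binary operators supported by this sketch (assumption: basic arithmetic only).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr.strip(), mode="eval").body)


def format_number(value: float) -> str:
    """Render 90.0 as '90' and keep genuine decimals (rounded to 6 places)."""
    return str(int(value)) if float(value).is_integer() else str(round(value, 6))


# «expression=result», optionally followed by the repeated result.
ANNOTATION = re.compile(r"«([^=»]+)=([^»]*)»(?:-?\d+(?:\.\d+)?)?")


def apply_calculator(step: str) -> str:
    """Overwrite every annotated result in a step with the computed value."""

    def fix(match):
        result = format_number(safe_eval(match.group(1)))
        return f"«{match.group(1)}={result}»{result}"

    return ANNOTATION.sub(fix, step)
```

For example, a step generated with the incorrect annotation «3*30=80»80 would be rewritten to «3*30=90»90, so arithmetic slips made by the language model do not propagate to later steps.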

B Setting for generation
To obtain consistent results and a fair comparison, we apply greedy decoding (temperature 0) for all generation during testing.

C Data preprocessing
We detail here how we obtain the operation label for each step. For each MWP, we split the solution into steps at the period symbol "." that ends each sentence. We restrict ourselves to the top-20 most frequent mathematical operations after merging operations that have a similar meaning; e.g., [n + n + n] and [n + n + n + n] are both labeled as "multi-step addition" to avoid highly infrequent operations. See Table 7 for full descriptions of the operation actions. We use the NLTK and spaCy packages for data preprocessing.
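The preprocessing described above can be sketched as follows. This is a simplified illustration that uses regular expressions rather than NLTK or spaCy; the function names, the fallback label "[no operation]", and the exact merging rule are assumptions made for the example, not the full label set of Table 7.

```python
import re

# The expression part of a GSM8K-style annotation such as «3*30=90».
ANNOTATION = re.compile(r"«([^=»]+)=[^»]*»")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def split_steps(solution: str) -> list[str]:
    """Split a solution at sentence-final periods.

    Splitting only where a period is followed by whitespace keeps
    decimals such as 1.5 intact. Abbreviations like "e.g. " would
    still be split; a sentence tokenizer handles those cases better.
    """
    parts = re.split(r"(?<=\.)\s+", solution.strip())
    return [part for part in parts if part]


def operation_label(step: str) -> str:
    """Map the first annotated expression in a step to an operation label."""
    match = ANNOTATION.search(step)
    if match is None:
        return "[no operation]"  # hypothetical fallback label
    # Replace every number with the placeholder n, e.g. 3*30 -> n*n.
    template = NUMBER.sub("n", match.group(1)).replace(" ", "")
    operators = [ch for ch in template if ch in "+-*/"]
    # Merge additions with three or more operands into a single class.
    if len(operators) >= 2 and set(operators) == {"+"}:
        return "multi-step addition"
    spaced = re.sub(r"([+\-*/])", r" \1 ", template).strip()
    return f"[{spaced}]"
```

Applied to the step "3 poodles will take 3*30=«3*30=90»90 minutes.", `operation_label` returns "[n * n]", while a step annotated with «1+2+3=6» falls into the merged "multi-step addition" class.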

D More Details of prompt tuning design

D.1 Operation prompts
We initialize the embedding of each math operation token with the original pre-trained LM's embedding of the corresponding mathematical operator token instead of initializing it randomly (Liu et al., 2021c). For example, we initialize the operation action token [n + n] with the embedding of the "+" token in the pre-trained model. For operation classes that contain multiple operations, we initialize the embedding to the mean of the embeddings of all operations involved. We do this because initializing a new token with related embeddings has been shown to be effective at speeding up the training of LM-based models (Li and Liang, 2021;Zhong et al., 2021;Lester et al., 2021;Hambardzumyan et al., 2021;Liu et al., 2021b).
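The initialization rule can be illustrated with the following minimal sketch. It uses plain Python lists in place of real embedding tensors; the dictionary `pretrained`, the function names, and the choice to average over the distinct operators found in a label are assumptions made for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def init_operation_embedding(label, pretrained):
    """Initialize a new operation token from pre-trained operator embeddings.

    `pretrained` maps an operator token such as "+" to its embedding.
    A label with one operator copies that embedding; a label with several
    distinct operators receives the mean of their embeddings.
    """
    operators = []
    for ch in label:
        if ch in pretrained and ch not in operators:
            operators.append(ch)
    if not operators:
        raise ValueError(f"no known operator in label {label!r}")
    return mean_vector([pretrained[op] for op in operators])


# Toy two-dimensional embeddings, for illustration only.
toy_pretrained = {"+": [1.0, 0.0], "*": [0.0, 1.0]}
single = init_operation_embedding("[n + n]", toy_pretrained)     # copy of "+"
mixed = init_operation_embedding("[n + n * n]", toy_pretrained)  # mean of "+" and "*"
```

In a real model, the resulting vector would be written into the row of the embedding matrix that belongs to the newly added operation token.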

D.2 Prompt mining through paraphrasing
For the instruction prompt, finding good prompts is an art that takes time and experience (Liu et al., 2021b). Thus, we apply prompt mining through paraphrasing: we start with a seed prompt (e.g., "The next step operation is: ") and paraphrase it into a set of candidate prompts with similar meaning (Yuan et al., 2021). Then, we tune the model with these candidates, treating the choice of prompt as a hyper-parameter, and select the one that performs best on the target task. We find that anchor tokens (e.g., "?") are helpful and lead to good performance, which is consistent with prior work (Liu et al., 2021c).
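Treating the instruction prompt as a hyper-parameter amounts to a simple selection loop, sketched below. The function `select_prompt` and the scoring callback are hypothetical; in practice, `score_fn` would tune the model with the given prompt and return its validation performance.

```python
def select_prompt(candidates, score_fn):
    """Return the candidate prompt with the highest score, and that score.

    `score_fn` is a caller-supplied function (an assumption of this sketch)
    that maps a prompt string to a number where higher is better.
    """
    if not candidates:
        raise ValueError("at least one candidate prompt is required")
    scored = [(score_fn(prompt), prompt) for prompt in candidates]
    best_score, best_prompt = max(scored, key=lambda pair: pair[0])
    return best_prompt, best_score


# Placeholder scores for illustration only; these are not results from our experiments.
placeholder_scores = {
    "The next step operation is: ": 0.0,
    "what is the next operation?": 1.0,
}
best_prompt, best_score = select_prompt(list(placeholder_scores), placeholder_scores.get)
```

Because each candidate requires a separate tuning run, the candidate set produced by paraphrasing is best kept small.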

E Architecture for text-to-text language modeling
See Figure 3.

F In-context Prompting design
At a rate of 30 minutes per poodle, 3 poodles will take 3*30=«3*30=90»90 minutes.

: [/=] If it takes half as much time to groom a terrier as it takes to groom a poodle, it takes 30/2=«30/2=15»15 minutes per terrier.
: [end] boxed{210}

QUESTION: Darnell has 1000 square feet of fabric that he's using to make mini flags. He makes square flags that are 4 feet by 4 feet, wide rectangular flags that are 5 feet by 3 feet, and tall rectangular flags that are 3 feet by 5 feet. He has already made 16 square flags, 20 wide flags, and 10 tall flags. How many square feet of fabric does he have left?
He has used 150 square feet for the tall flags because ten times 15 equals «15*10=150»150.

The total number of visitors after Tuesday will be 100+50 = «100+50=150»150 visitors.
: [end] boxed{250}

QUESTION: James decides to build a tin house by collecting 500 tins in a week. On the first day, he collects 50 tins. On the second day, he manages to collect 3 times that number. On the third day, he collects 50 tins fewer than the number he collected on the second day. If he collects an equal number of tins on the remaining days of the week, what's the number of tins he collected each day for the rest of the week?
SOLUTION: On the second day, he collected 3 times the number of tins he collected on the first day, which is 3*50 = «3*50=150»150 tins.

: [/=] Since the total number of days left in the week is 4, he'll need to collect 200/4 = «200/4=50»50 tins per day to reach his goal
: [end] boxed{50}

QUESTION: Lilah's family gallery has 400 photos. On a two day trip to the Grand Canyon, they took half as many photos they have in the family's gallery on the first day and 120 more photos than they took on the first day on the second day. If they added all these photos to the family gallery, calculate the total number of photos in the gallery.

SOLUTION:
:[/=] On their first day at the grand canyon, the family took half as many photos as the ones they have in the gallery, meaning they took 1/2*400 = «400/2=200»200 photos.
: [+=] The total number of photos, if they add the ones they took on the first day to the family's gallery, is 400+200 = «400+200=600»600
: [+=] After adding the photos they took on the second day to the galley, the number of photos will be 600+320 = «600+320=920»920
: [end] boxed{920}

G Human Evaluation template

I Examples of control generation
Table 8 shows the steps generated when different operation prompts are applied to the same input. We observe that when the operation prompt is logical and aligned with solving the question, the generated result follows the guidance given by the operation prompt with higher probability. In contrast, when the operation prompt does not make sense, there is a chance that the generated result will not obey its directions.

Figure 3: An overview of the architecture of a text-to-text transformer-based language model such as T5. The question is not treated as history; instead, it is fed into an encoder and interacts with the generated step tokens through cross-attention.