Self-Polish: Enhance Reasoning in Large Language Models via Problem Refinement

To enhance the multi-step reasoning capabilities of large language models, researchers have extensively explored prompting methods, notably the Chain-of-Thought (CoT) method, which explicitly elicits human-like rationales. However, they have inadvertently overlooked the potential of enhancing model reasoning performance by formulating higher-quality problems. In this work, we start from the problem side and propose Self-Polish (SP), a novel method that facilitates the model's reasoning by guiding it to progressively refine the given problems to be more comprehensible and solvable. We also explore several automatic prompting variants and propose the Self-Polish prompt bank for the community. SP is orthogonal to all other prompting methods on the answer/reasoning side, such as CoT, allowing for seamless integration with state-of-the-art techniques for further improvement. Thorough experiments show that the proposed method attains notable and consistent effectiveness on five reasoning benchmarks across different models. Furthermore, our method also showcases impressive performance on robustness evaluation. Code and prompts are available at https://github.com/WooooDyy/Self-Polish.


Introduction
Large language models (LLMs) have achieved impressive performance on a variety of NLP tasks (Brown et al., 2020; Otter et al., 2021; Chowdhery et al., 2022), but their capability to perform multi-step reasoning is considered a limitation, one that cannot be tackled solely by scaling up the model size (Rae et al., 2021; Srivastava et al., 2022). To address this challenge, many prompting methods have been proposed to elicit reasoning in LLMs. Chain-of-Thought (CoT) is a breakthrough method that teaches a language model to imitate the step-by-step reasoning process of humans to solve a reasoning task (Wei et al., 2022b). Much follow-up work has explored variants of CoT to improve the quality of the rationales of LLMs (Kojima et al., 2022; Fu et al., 2022; Zhou et al., 2022a). There is also a line of work that optimizes the rationales for better consistency and continuity (Wang et al., 2022; Li et al., 2022; Zelikman et al., 2022; Zheng et al., 2023), and a representative one is Self-Consistency (SC). SC generates diverse reasoning paths and answers, and then leverages a majority-vote strategy to get the most consistent answer (Wang et al., 2022). Despite the boosted reasoning performance of the aforementioned methods, they focus on the answer/reasoning side, and little emphasis has been placed on the problem side. The formulation and quality of the problem description are crucial factors for human understanding and model comprehension (Shou and Smithson, 2015; Faruqui and Das, 2018; Chu et al., 2020). LLMs often exhibit poor reasoning performance when confronted with low-quality real-world reasoning problems, which may be excessively long, ambiguous, unclear in focus, or contain irrelevant information (Zellers et al., 2018; Shi et al., 2023; Ye and Durrett, 2022). To tackle this challenge, we consider refining problems into a better formulation.

Figure 1: Schematic comparison between Self-Polish and other representative approaches for reasoning with prompting. Previous paradigms enhance the reasoning capability of LLMs from the answer/reasoning side, while our method starts from the problem side and refines problems to be simpler and more comprehensible for models.
In this work, we propose Self-Polish (Figure 1, right), which leverages LLMs themselves to refine reasoning problems, without training, for better reasoning performance. We first present several principles for refined problems: concise, clear, well-focused, and absent of irrelevant information. To achieve this goal, we propose the Self-Polish Prompt Bank, which includes several feasible solutions as outlined in the following text. An intuitive strategy is to reformulate problems via instruction following (Sanh et al., 2022; Ouyang et al., 2022), which we call zero-shot problem refining. Next, we include demonstrations in the prompts (Brown et al., 2020; Chowdhery et al., 2022) to enable models to better internalize and apply the principles, which is defined as in-context problem refining. During the construction of the demonstrations, we incorporate a curated collection of problem-refining patterns, e.g., eliminating irrelevant information, rearranging the logic structure, and organizing local conditions into new ones in parallel. Moreover, we explore automatic prompting methods to construct enhanced prompts and mitigate manual effort, based on the criteria of complexity (complexity-based Self-Polish) or diversity (automatic Self-Polish). To further enhance the reliability and consistency of the generated problems, we propose to progressively refine problems until obtaining a convergent answer.
Experiments show that our method consistently improves the reasoning performance of various models (i.e., Text-davinci-002, Text-davinci-003, and GPT-3.5-Turbo) on five benchmarks (Table 1 & Figure 3). Moreover, the proposed method is orthogonal to all other state-of-the-art reasoning-side prompting methods, making it convenient to combine with them for further improvement. Detailed experiments demonstrate that the performance of reasoning-side methods can be significantly boosted when integrated with SP (Table 2 & Table 3). Self-Polish also showcases exceptional performance on robustness evaluation (Figure 4).
In summary, we make the following contributions: 1. We propose a novel method, Self-Polish, to improve the reasoning performance and robustness of LLMs.
2. We demonstrate the effectiveness of our method when applied alone or combined with other prompting approaches on five benchmarks with different models.
3. We believe that the proposed Self-Polish represents an important step in enhancing LLMs' reasoning capabilities by shifting the perspective from the answer/reasoning side to the problem side. We hope it can inspire future research in this field.

Related Work
Multi-step reasoning. Multi-step reasoning tasks have posed significant challenges for language models (Rae et al., 2021; Bommasani et al., 2021; Qiao et al., 2022), and such reasoning is considered an emergent ability of LLMs (Schaeffer et al., 2023). It is in these tasks that the effectiveness of few-shot prompting begins to surpass that of full-training-set fine-tuning (Lewkowycz et al., 2022). Moreover, this capability is considered important for building more complex artificial intelligence such as large language model-based agents (LLM-based agents) (Xi et al., 2023). Our work represents a significant stride in enhancing the ability of language models to perform multi-step reasoning tasks by facilitating models' comprehension and processing of the given reasoning problems.
Reasoning with prompting. Prompting strategies have substantially improved the reasoning ability of LLMs (Qiao et al., 2022; Lewkowycz et al., 2022). An important line of work in this area is Chain-of-Thought (CoT) prompting, which elicits the reasoning ability of models by prompting them to imitate the step-by-step reasoning process of humans (Wei et al., 2022b; Kojima et al., 2022; Fu et al., 2022; Zhou et al., 2022a). Another line of work focuses on optimizing the rationales for better consistency and continuity (Wang et al., 2022; Li et al., 2022; Zelikman et al., 2022; Zheng et al., 2023). A representative one is Self-Consistency (SC), which samples multiple reasoning paths and generates the most consistent answer by majority vote (Wang et al., 2022). Different from Self-Polish, the aforementioned strategies emphasize improving the quality of rationales from the answer/reasoning side. Our method is a problem-side method, so it is orthogonal to all of them and can be combined with them for further improvement.

Figure 2: An example illustrating the framework and problem-refining patterns of Self-Polish. In the first refining iteration, the irrelevant information "Ada bought 2000 tomatoes from the grocery store." is removed. In the second iteration, the conditions are reordered for easier calculation of the number of beads required for each type of beaded product. In the third iteration, local conditions are combined in parallel to form new conditions (the total number of beads required for necklaces and bracelets).
See Appendix A for more related work and the detailed differences between Self-Polish and Least-to-Most (Zhou et al., 2022a).

Self-Polish Prompting
In this section, we first revisit previous prompting paradigms aimed at solving reasoning problems. Next, we describe the proposed Self-Polish method in detail.

Revisiting Paradigms of Reasoning Problem Solving
In the context of enhancing the capabilities of LLMs, prompting has emerged as one of the most popular approaches owing to its training-free nature and effectiveness (Qiao et al., 2022; Lewkowycz et al., 2022). Here, we formalize several representative paradigms. See Figure 1 for a schematic comparison between them and our method.
Standard. The prompt contains k× [Problem, Answer] pairs, followed by the test problem.
Chain-of-Thought (Wei et al., 2022b). The prompt contains k× [Problem, Rationale, Answer] tuples, followed by the test problem. This method teaches models to generate rationales and answers, achieving significant improvement in reasoning. Auto-CoT (Zhang et al., 2022) and Complex-CoT (Fu et al., 2022) are two automatic variants that construct CoT demonstrations according to the criteria of problem diversity and reasoning complexity, respectively.
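As a rough illustration of these prompt formats, the following Python sketch assembles Standard and CoT prompts from demonstration tuples; the Q:/A: template and helper names are our own illustrative choices, not the exact prompts used in this work.

# Illustrative sketch of the Standard and Chain-of-Thought prompt formats.
# The Q:/A: template and helper names are assumptions, not the released prompts.

def build_standard_prompt(demos, test_problem):
    """demos: k [Problem, Answer] pairs, followed by the test problem."""
    parts = [f"Q: {problem}\nA: {answer}" for problem, answer in demos]
    parts.append(f"Q: {test_problem}\nA:")
    return "\n\n".join(parts)

def build_cot_prompt(demos, test_problem):
    """demos: k [Problem, Rationale, Answer] tuples, followed by the test problem."""
    parts = [
        f"Q: {problem}\nA: {rationale} The answer is {answer}."
        for problem, rationale, answer in demos
    ]
    parts.append(f"Q: {test_problem}\nA:")
    return "\n\n".join(parts)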
Least-to-Most (Zhou et al., 2022a). The prompt teaches models to decompose the original problem into a series of sub-problems and to solve them sequentially, with the answer to each sub-problem used to tackle the next one.

Principles of refined problems. We expect refined problems to satisfy four principles: (1) conciseness: the problems should not be excessively long, so that they remain easy to follow; (2) clarity: the problems should be unambiguous and clearly phrased; (3) focus: making it evident what the question is asking; (4) absence of irrelevant information: the problems should be free from extraneous details that could cause confusion or distractions.

Construction of Refining Prompts
Zero-shot Self-Polish. It is difficult to internalize the aforementioned principles within the model via training, due to the tedious process of constructing a corresponding dataset and potential catastrophic-forgetting problems (Goodfellow et al., 2014; Parisi et al., 2019). So we turn to training-free strategies.
As LLMs demonstrate emergent instruction-following abilities (Schaeffer et al., 2023; Sanh et al., 2022; Wei et al., 2022a), a simple and intuitive strategy for refining problems is to prompt LLMs with an instruction. In the instruction, we guide the model to rewrite a new version of the original reasoning problem that is more understandable and easier to answer, without omitting any useful information. The prompt contains [Instruction, Original Problem], and the model responds with a newly generated problem. Next, we can adopt any prompting method in Section 3.1 to get the answer to the new problem, and we take this answer as the final one. We conduct preliminary validation experiments, and the results are illustrated in Table 1. Zero-shot refining can consistently improve reasoning performance on various benchmarks.
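As a minimal sketch, zero-shot refining could be implemented as a single completion call. The instruction wording below paraphrases the description above, and the legacy openai (<1.0) SDK call is an assumption about the setup, not the released code.

# Minimal sketch of zero-shot problem refining. The instruction text is a
# paraphrase of the description above; the legacy openai<1.0 SDK call is an
# assumption about the experimental setup.
import openai

REFINE_INSTRUCTION = (
    "Rewrite a new version of the following reasoning problem so that it is "
    "more understandable and easier to answer. Never omit any useful information."
)

def zero_shot_refine(problem: str, model: str = "text-davinci-003") -> str:
    prompt = f"{REFINE_INSTRUCTION}\n\nOriginal problem: {problem}\nNew problem:"
    response = openai.Completion.create(
        model=model, prompt=prompt, temperature=0, max_tokens=256
    )
    return response["choices"][0]["text"].strip()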
In-context Self-Polish. As empirical results show that zero-shot refining provides only limited performance gains, especially on difficult datasets, we add demonstrations to the prompt to enable models to better internalize and apply the design principles. Specifically, demonstrations are formulated as [Original Problem, New Problem] pairs, and we incorporate a curated collection of problem-refining patterns in the demonstrations: (1) remove irrelevant information, as in the first iteration in Figure 2; (2) rearrange the logic structure and group relevant conditions together to better match the reasoning logic of the model, as in the second iteration in Figure 2; (3) summarize local conditions into new ones in parallel, as in the third iteration in Figure 2. Results in Table 1 show that in-context problem refining yields more performance gain than zero-shot refining.
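In-context refining then only changes how the prompt is built: demonstrations of [Original Problem, New Problem] pairs embodying the patterns above are prepended before the test problem. A minimal sketch follows; the single demonstration pair is taken from the Figure 2 example, and the template itself is illustrative.

# Minimal sketch of in-context problem refining. The demonstration pair is
# taken from the Figure 2 example; the prompt template is illustrative.
REFINE_DEMOS = [
    (
        "Kylie makes 10 beaded necklaces on Monday and 2 beaded necklaces on "
        "Tuesday. Then Kylie makes 5 beaded bracelets on Wednesday. 20 beads "
        "are needed to make one beaded necklace. 10 beads are needed to make "
        "one beaded bracelet. Ada bought 2000 tomatoes from the grocery store. "
        "How many beads does Kylie use in total to make her jewelry?",
        "Kylie makes 12 beaded necklaces, and each beaded necklace needs 20 "
        "beads. She also makes 5 beaded bracelets, and each beaded bracelet "
        "needs 10 beads. How many beads does Kylie use in total to make her "
        "jewelry?",
    ),
]

def build_refine_prompt(demos, problem: str) -> str:
    parts = [f"Original problem: {o}\nNew problem: {n}" for o, n in demos]
    parts.append(f"Original problem: {problem}\nNew problem:")
    return "\n\n".join(parts)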
Automatic Self-Polish. This is an automatic variant of in-context problem refining. We draw inspiration from Zhang et al. (2022) and construct the refining prompt according to the diverse semantics of problems with k-means clustering. The underlying hypothesis is that a diverse set of demonstrations can cover a broad semantic space of problems, so the model can locate relevant reference demonstrations for more test examples. Table 1 shows that Auto-SP also yields significant improvement.
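A sketch of the diversity-based selection is below, assuming sentence-transformers embeddings and scikit-learn k-means; the embedding model is our assumption, since the paper does not name one for this step. The selected problems would then be paired with hand-refined versions to form the [Original Problem, New Problem] demonstrations.

# Sketch of diversity-based demonstration selection for Auto-SP, following
# the k-means recipe of Zhang et al. (2022). The embedding model is an
# assumption; the paper does not specify one here.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def select_diverse_demos(problems: list[str], k: int = 8) -> list[str]:
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(problems)  # one vector per problem
    km = KMeans(n_clusters=k, n_init=10).fit(embeddings)
    demos = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        # pick the problem closest to each cluster centroid
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        demos.append(problems[members[np.argmin(dists)]])
    return demos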
Complexity-based Self-Polish. This is another variant of in-context problem refining that automatically selects refining demonstrations. We draw inspiration from Fu et al. (2022) and construct the refining prompt according to the complexity of each problem. The underlying hypothesis is that the refining ability of the model can generalize from complex problems to simpler ones. Table 1 demonstrates that Complex-SP can also yield substantial performance gain.
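A corresponding sketch of complexity-based selection is below; it uses the word-count proxy for complexity mentioned later in Section 4, although Fu et al. (2022) also use the number of reasoning steps, so this scoring function is one plausible choice rather than the definitive one.

# Sketch of complexity-based demonstration selection for Complex-SP. Word
# count serves as the complexity proxy (as in the task comparison in
# Section 4); reasoning-step count is another option from Fu et al. (2022).
def select_complex_demos(problems: list[str], k: int = 8) -> list[str]:
    return sorted(problems, key=lambda p: len(p.split()), reverse=True)[:k]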

Progressively Refining Framework
To enhance the consistency and reliability of the refined problems, we propose a progressive framework with two stages: the problem-solving stage (Section 3.1) and the problem-refining stage (Section 3.2). The two stages are executed alternately until the return condition is satisfied.
Return condition & answer selection. There are two situations that terminate the iterative process. The first is when the last two answers are the same, indicating convergence of the answer; in this case, we can directly return the answer. The second is when the number of iterations exceeds the maximum count T = 2 (one iteration means one round of problem refinement; a bigger T can yield a larger performance gain, as discussed in Section 5.1, but we set T = 2 to balance computational efficiency and performance). In such cases, we have multiple options for selecting the final answer: the answer to the original problem, the answer to the first generated problem, the answer to the last generated problem, or a majority-voting approach (Wang et al., 2022), which will be discussed in our ablation study in Section 5.1. Here we choose the answer to the last generated problem by default. As shown in Table 1, adding progressive refining to our method brings further improvement across different prompt-construction approaches.
The overall framework is shown in Algorithm 1 in Appendix B.
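For concreteness, a minimal Python sketch of this loop under the stated return condition is given below; solve() stands for any answer/reasoning-side method from Section 3.1 and refine() for any refining prompt from Section 3.2, both hypothetical stand-ins rather than the released implementation. With T = 2 and the default "last" strategy, this matches the configuration used in the main experiments.

# Minimal sketch of the progressive refining loop (cf. Algorithm 1). The
# solve() and refine() callables are hypothetical stand-ins for an
# answer-side method (e.g., CoT) and a refining prompt, respectively.
from collections import Counter

def self_polish(problem, solve, refine, T=2, strategy="last"):
    problems = [problem]
    answers = [solve(problem)]                 # answer to the original problem
    for _ in range(T):                         # one iteration = one refinement
        problems.append(refine(problems[-1]))
        answers.append(solve(problems[-1]))
        if answers[-1] == answers[-2]:         # convergence: return the answer
            return answers[-1]
    # No convergence within T iterations: fall back to a selection strategy.
    if strategy == "original":
        return answers[0]                      # answer to the original problem
    if strategy == "first":
        return answers[1]                      # answer to the first refinement
    if strategy == "vote":
        return Counter(answers).most_common(1)[0][0]  # majority vote
    return answers[-1]                         # default: "last one"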

Experiments
In this section, we conduct experiments to demonstrate the effectiveness and robustness of SP.

Experimental Setups
Models. We employ three GPT-series models, namely text-davinci-002, text-davinci-003, and GPT-3.5-Turbo (Brown et al., 2020; Ouyang et al., 2022), as they are widely recognized and accessible to the public, ensuring the reproducibility of our research. Our experiments are based on OpenAI's API. All methods use greedy decoding (i.e., temperature = 0) for stable responses.

Datasets. We evaluate the performance of our method on five reasoning datasets: GSM8K (Cobbe et al., 2021), AQuA (Ling et al., 2017), SVAMP (Patel et al., 2021), MultiArith (Roy and Roth, 2015), and MathQA (Amini et al., 2019). These datasets are widely evaluated in prior studies on multi-hop reasoning (Wei et al., 2022b; Fu et al., 2022; Zhou et al., 2022a). We evaluate on the whole test sets of AQuA and GSM8K. For the other datasets, we adopt the split from Mishra et al. (2022) or randomly select 500 test instances, and perform 3 restarts for stable results.
Prompts. For the sake of generalizability, GSM8K, SVAMP, and MultiArith share the same Self-Polish prompts constructed from GSM8K; AQuA and MathQA share the same Self-Polish prompts constructed from AQuA. See Appendix F for the SP prompts. The prompts for the standard few-shot prompting method are from Wei et al. (2022b).

Experimental Results
Standard few-shot setting. Figure 3 shows the results of evaluating performance in the standard few-shot setting. We find that: (1) Our method consistently improves reasoning performance by a large margin across multiple models and datasets, indicating its capability to enhance model understanding of problems.
(2) On relatively weaker models, automated prompting methods like Auto-SP and Complex-SP yield more gains compared to in-context SP. However, on stronger models, the differences in performance gain between the three approaches are not significant, revealing that stronger models are less sensitive to prompts.
Combining Self-Polish with other prompting strategies. Table 2 shows the evaluation results when combining our method with other state-of-the-art reasoning-side prompting strategies.
There are several critical and interesting observations: (1) Generally, SP yields substantial performance gains for all reasoning-side methods, revealing that when the model is able to better comprehend problems, both its step-by-step reasoning capabilities and problem decomposition abilities can be significantly enhanced.
(2) Whether on the reasoning side or the problem side, the complexity-based approach performs the best. This indicates that LLMs have the ability to generalize from complex tasks to simple ones, both in terms of reasoning and problem refinement. (3) As Fu et al. (2022) stated, the average number of words in problems, i.e., GSM8K (46.9), AQuA (51.9), SVAMP (32.1), MultiArith (31.2), and MathQA (60.1), can serve as a proxy for measuring the reasoning complexity of each task. We find that the more challenging the task, the higher the improvement achieved by SP, highlighting its suitability for intricate reasoning tasks. It is noteworthy that when combined with the CoT-series methods, our approach has limited improvement on MultiArith. This could be because the task itself can already be well solved by CoT and is relatively simple. Excessive refinement of simple problems carries the risk of information loss or semantic alteration, leading to a decline in performance, as depicted in Figure 9.
Robustness evaluation. GSM-IC (Shi et al., 2023) is an adversarial arithmetic reasoning dataset with distracting information in the problems designed to fool the model, so it is well-suited for evaluating the robustness of models. It has two splits: GSM-IC-2step, which contains problems that require two reasoning steps to solve, and GSM-IC-mstep, which contains problems that require more than two reasoning steps. As shown in Figure 4, our method enhances the robustness and reliability of various models across different prompting techniques, shielding them from the interference of low-quality problems.

Figure 4: Evaluation results on GSM-IC (Shi et al., 2023). Self-Polish (SP) enhances the robustness and reliability of various models when combined with different prompting techniques.

Ablation Studies
As mentioned in Section 3.3, the maximum number of iterations T and the strategy for selecting the final answer when convergence is not achieved are two main components of Self-Polish. Here we perform ablation studies on them.
Maximum number of iterations T. As shown in Figure 5(a) and Figure 5(c), for both the Standard and CoT methods, larger iteration counts lead to higher convergence accuracy ("Converge" in the figures), which aligns with common knowledge and further demonstrates the effectiveness of our method: by gradually optimizing problems, we enable the model to handle them more easily. But when T is too large, the performance of SP may drop, indicating that excessive rewriting can degrade the quality of problems. We set T = 2 not only for the sake of efficiency, but also because it achieves competitive performance, especially when combined with CoT-series methods.
Final answer selection strategies. We can easily observe that with a smaller T, the "Last One" strategy tends to have an advantage, while as the iteration count increases, other strategies become more effective, even outperforming "Last One". This is intuitive: after multiple rewriting iterations, the semantic meaning of a problem may deviate significantly from the original one.

Distribution of actual iteration counts. Figure 5(a) and Figure 5(c) show that the actual number of iterations T_actual does not grow significantly as the maximum iteration count T increases, revealing that SP can achieve a converged answer on most problems within a few iterations. To verify this, we illustrate the distribution of T_actual with T = 5 in Figure 5(b) and Figure 5(d). T_actual exhibits a long-tail distribution, with only a few samples exceeding the maximum count. This finding provides evidence that our method is highly efficient and consumes few additional computational resources.

Further Improvement for Self-Consistency
Self-Consistency is a prompting method that samples multiple reasoning paths and generates a consistent answer by a majority-vote strategy (Wang et al., 2022). Here, we combine the Self-Polish and Self-Consistency methods to investigate whether there is further performance improvement. We conduct experiments on two difficult datasets (i.e., GSM8K and AQuA) with temperature = 0.7 for diversity, following Wang et al. (2022).
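As a rough sketch of the combination, the problem is refined first and SC's sampling and voting are then applied to the refined problem; sample_answer() is a hypothetical stand-in for one stochastic CoT completion.

# Sketch of combining Self-Polish with Self-Consistency: refine first, then
# sample multiple reasoning paths at temperature 0.7 and take the majority
# vote. sample_answer() is a hypothetical stand-in for one stochastic CoT call.
from collections import Counter

def sp_plus_sc(problem, refine, sample_answer, n_paths=10):
    refined = refine(problem)                   # problem-side refinement
    votes = [sample_answer(refined, temperature=0.7) for _ in range(n_paths)]
    return Counter(votes).most_common(1)[0][0]  # majority vote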
Results in Table 3 demonstrate that SP provides a substantial performance gain for SC in both the Auto-CoT and Complex-CoT manners. Moreover, an increase in the number of reasoning paths leads to a corresponding improvement in performance, showing the advantage of the voting strategy.

Case Study
To further demonstrate the effectiveness of the proposed problem-refining patterns and how our method embodies the proposed principles, we conduct a case study, shown in Figure 6. More cases can be found in Appendix D (Figure 7 and Figure 8).
From Figure 6, we observe that removing irrelevant information (i.e., "Grover's neighbor made a salary of $10 last year.") can help the model avoid distractions and facilitates accurate reasoning. Next, rearranging the problem conditions and grouping pertinent conditions together can help the model generate more effective novel deductions during reasoning (e.g., resulting in the streamlined computation of the total number of face masks in Refined Problem 2).

Figure 6: A case of Self-Polish on GSM-IC with Chain-of-Thought. The case is with Text-davinci-003. The irrelevant information "Grover's neighbor made a salary of $10 last year." is removed. In the second iteration, the condition "Each box has 20 face masks." is moved forward, and the model can calculate the total number of masks more easily when performing reasoning.
Additionally, summarizing local conditions into new ones can effectively simplify complex problems, enabling the model to handle them with greater ease. This is demonstrated in the first iteration of Figure 7 and the second iteration of Figure 8. Furthermore, the second iteration in Figure 7 highlights how our approach can explicitly and precisely define the problem in a formal manner. Specifically, in Refined Problem 2 of Figure 7, the model accurately identifies the two teams as "Team A" and "Team B" instead of referring to them as "one team" and "the other team", and it is then able to clearly specify the exact question to be asked. This significantly reduces the model's burden of understanding during the reasoning process, enhancing its overall performance.

Conclusion

In this work, we focus on enhancing reasoning in large language models. We present a novel prompting method called Self-Polish which progressively refines the given reasoning problems to facilitate model comprehension and processing. It demonstrates impressive effectiveness, robustness, and reliability on various benchmarks across different models, and can seamlessly integrate with other state-of-the-art methods. We hope it can motivate future research in this field.

Limitations
Despite the significant enhancement in reasoning performance achieved by our approach, this work still has limitations. Firstly, our criterion for convergence is based on obtaining two identical answers rather than assessing whether the problem itself has been sufficiently optimized. Future work could involve designing methods that enable the model to autonomously determine whether a problem has reached its optimal form. Secondly, we have explored two approaches to automatically construct problem-refining prompts (i.e., Auto-SP and Complex-SP). In the future, it would be beneficial to incorporate more techniques for automatically generating instructions or selecting demonstrations. Thirdly, although our designed patterns for problem refining have proven highly effective, they do not encompass all possible scenarios in the real world. In the future, it is conceivable to incorporate additional patterns to further expand the scope of applicability.
A More Related Work

In-context learning. It is demonstrated that a large language model can learn patterns from a few input-output examples in the context (input) to perform the task for an unseen inference-time example (Brown et al., 2020; Chowdhery et al., 2022), and such ability is referred to as in-context learning (ICL). Recent studies have further highlighted the impressive performance of ICL on reasoning tasks (Wei et al., 2022b; Fu et al., 2022; Zhou et al., 2022a). In our research, we capitalize on this capability to generate new formulations of problems by injecting rephrasing patterns into the demonstrations.
Instruction following. LLMs can learn to perform unseen tasks solely through the comprehension of task-specific natural language instructions (Sanh et al., 2022; Wei et al., 2022a; Chung et al., 2022; Ouyang et al., 2022). There is also work showing that combining instructions with in-context learning provides further benefits, and that few-shot demonstrations can be viewed as a special kind of instruction that arouses implicit abilities in LLMs (Chung et al., 2022; Zhou et al., 2022b; Qiao et al., 2022).
Comparison with LtM. The work most similar to ours may be Least-to-Most (LtM), which decomposes the original problem into a series of sub-problems that need to be solved sequentially (Zhou et al., 2022a). However, LtM is a variant of CoT, and there are differences in motivation and operation between LtM and SP. Firstly, LtM is an answer/reasoning-side approach that emphasizes the decomposition of a complex problem into sub-problems, while we emphasize refining the original problem to make it more understandable. Secondly, in LtM, sub-problems are solved sequentially, requiring the answer to the previous sub-problem to tackle the next one, which can lead to fragility in the reasoning chain. In contrast, our method allows local related conditions to be combined in parallel to form new conditions.

B The Algorithm of Self-Polish
See Algorithm 1 for the overall framework of Self-Polish.

C Implementation Details
We set the maximum iteration count to T = 2. Note that a bigger maximum iteration count may lead to better performance, but here we set it to 2 to achieve a trade-off between computational efficiency and effectiveness.
When combining with other reasoning-side methods (i.e., CoT, LtM, Complex-CoT, and Auto-CoT) on MultiArith and SVAMP, we set the answer selection strategy to "selecting the answer to the original problem" because these datasets are relatively easy for these prompting methods. In cases where refinement is not necessary, rewriting easy problems may result in the loss of critical information or alter the semantics of the original problem. In other settings, we set the answer selection strategy to "selecting the answer to the last problem".

D More Cases and Examples
Here we list more cases of Self-Polish in Figure 7 and Figure 8. We also list a failure case of excessive problem refining in Figure 9.

[Original Problem 0]: Two track teams are competing against each other in a 4 by 400 meter relay; a race where each competing team has four members that each run 400 meters, or one lap, around a standard track. One of the two teams is very well-rounded and each of their members will run their 400 meter leg in precisely 55 seconds. The other team is less well-rounded; their first runner will run their 400 meter leg in 60 seconds, then each subsequent runner will be 3 seconds faster than the previous runner. Using this information, how many seconds will the faster team win by?

[Refined Problem 1]: Two track teams are competing against each other in a 4 by 400 meter relay. One team has four members that each run 400 meters in 55 seconds, while the other team has four members that run 400 meters in 60, 57, 54, and 51 seconds respectively. How many seconds will the faster team win by?

[Rationale 1]: The faster team has four members that run 400 meters in 55, 60, 57, and 54 seconds respectively. The slower team has four members that run 400 meters in 51 seconds each. So the faster team will win by 55 - 51 = 4 seconds. The answer is 4.

[Refined Problem 2]: Two track teams are competing against each other in a 4 by 400 meter relay. Team A has four members that each run 400 meters in 55 seconds, while Team B has four members that run 400 meters in 60, 57, 54, and 51 seconds respectively. How many seconds faster will Team A finish the relay compared to Team B?

[Rationale 2]: Team A has four members that each run 400 meters in 55 seconds. Team B has four members that run 400 meters in 60, 57, 54, and 51 seconds respectively. So the total time for Team A is 4 * 55 = 220 seconds, and the total time for Team B is 60 + 57 + 54 + 51 = 222 seconds. Team A will finish the relay 2 seconds faster than Team B. The answer is 2.

E Sensitivity to Number and Order of Demonstrations
As widely recognized, in-context learning is highly sensitive to the number and order of demonstrations within the prompt (Min et al., 2022; Lu et al., 2022; Liu et al., 2022). In this regard, we investigate whether our problem-refining process is sensitive to these variables via experiments on GSM8K with Text-davinci-003. We randomly select 200 examples from the test set. For a specific shot number, we randomly select five sets of demonstrations.
For each set of demonstrations, we obtain performance results in five different orders. We observe that in the standard manner, increasing the number of demonstrations leads to improved performance. However, in the CoT manner, the performance converges when the number of shots is equal to 5, demonstrating impressive sample efficiency. Additionally, in the standard manner, our method is not sensitive to the order of demonstrations, while it is highly sensitive to the order of demonstrations in the CoT manner.

Figure 8: A case of Self-Polish on GSM8K with Chain-of-Thought. In the first iteration, the model translates "every" to "1". In the second iteration, the model summarizes relevant conditions and specifies the total cost for Charlie.

Table 5: More results on the MATH dataset, using Chain-of-Thought as the answer-side method (In-context SP: 24.5).

F Prompts of Self-Polish
The in-context Self-Polish prompt for AQuA and MathQA is in Table 6. The Auto-SP prompt for AQuA and MathQA is in Table 7 and Table 8. The Complex-SP prompt for AQuA and MathQA is in Table 9 and Table 10. The in-context Self-Polish prompt for GSM8K, SVAMP, and MultiArith is in Table 11. The Auto-SP prompt for GSM8K, SVAMP, and MultiArith is in Table 12 and Table 13. The Complex-SP prompt for GSM8K, SVAMP, and MultiArith is in Table 14 and Table 15.

G More Results on the MATH Dataset
As Table 5 shows, we also conduct Self-Polish on the MATH dataset (Hendrycks et al., 2021), where our approach demonstrates promising results. Specifically, we randomly select 200 samples for testing and use Chain-of-Thought as the answer-side method.

Original Question: Susy goes to a large school with 800 students, while Sarah goes to a smaller school with only 300 students. At the start of the school year, Susy had 100 social media followers. She gained 40 new followers in the first week of the school year, half that in the second week, and half of that in the third week. Sarah only had 50 social media followers at the start of the year, but she gained 90 new followers the first week, a third of that in the second week, and a third of that in the third week. After three weeks, how many social media followers did the girl with the most total followers have?
New Question: At the start of the school year, Susy had 100 social media followers and Sarah had 50 social media followers. Susy gained 40 followers in the first week, 20 in the second week, and 10 in the third week. Sarah gained 90 followers in the first week, 30 in the second week, and 10 in the third week. After three weeks, how many social media followers did the girl with the most total followers have?

Original Question: Sam bought a dozen boxes, each with 30 highlighter pens inside, for $10 each box. He rearranged five of these boxes into packages of six highlighters each and sold them for $3 per package. He sold the rest of the highlighters separately at the rate of three pens for $2. How much profit did he make in total, in dollars?
New Question: Sam bought 12 boxes for $10 each, and each contains 30 highlighter pens. 1 package contains 6 highlighters. He rearranged five of these boxes into packages and sold them for $3 per package. He sold the remaining highlighters separately at the price of $2 for every three pens. How much profit did Sam make in total, in dollars?

Original Question: In a certain school, 2/3 of the male students like to play basketball, but only 1/5 of the female students like to play basketball. What percent of the population of the school do not like to play basketball if the ratio of the male to female students is 3:2 and there are 1000 students?
New Question: In a certain school, there is a total of 1000 students, with 3 male students for every 2 female students. So there are 600 male students, and 2/3 of the male students like to play basketball; there are 400 female students, but only 1/5 of the female students like to play basketball. What percent of the population of the school do not like to play basketball?

Figure 3: Evaluating Self-Polish on various benchmarks with different models. Self-Polish consistently improves reasoning performance across multiple models and benchmarks.

Figure 5: Ablation studies and the distribution of actual iteration counts. (a) and (c) illustrate the performance (vertical axis on the left) when using different final-answer selection strategies and different maximum iteration counts T. "Converge" denotes the performance computed as N_conv / N_all, where N_conv is the number of examples that are answered correctly with converged answers and N_all is the number of all test examples. We also include a line representing the average actual iteration count at each value of T (vertical axis on the right). In (b) and (d), we show the distribution of actual iteration counts when we set T = 5.


[Original Problem 0]: Grover bought 3 boxes of face masks. He plans to sell them for $0.50 each. Grover's neighbor made a salary of $10 last year. If each box has 20 face masks, and Grover bought the 3 boxes for $15, how much will be his total profit?

[Refined Problem 1]: Grover bought 3 boxes of face masks. He plans to sell them for $0.50 each. If each box has 20 face masks, and Grover bought the 3 boxes for $15, how much will be his total profit?

[Rationale 1]: Grover bought 3 boxes of face masks. Each box has 20 face masks. He bought the 3 boxes for 15 dollars. He plans to sell them for 0.50 dollars each. So his total profit will be (3

[Refined Problem 2]: Grover bought 3 boxes of face masks. Each box has 20 face masks. Grover bought the 3 boxes for $15. He plans to sell the face masks for $0.50 each. How much will be his total profit?

[Rationale 2]: Grover bought 3 boxes of face masks. Each box has 20 face masks. So he bought 3 * 20 = 60 face masks in total. He bought the 3 boxes for 15 dollars. He plans to sell the face masks for 0.50 dollars each. So he will sell the 60 face masks for 60 * 0.50 = 30 dollars. His total profit will be 30 - 15 = 15 dollars.

Figure 7: A case of Self-Polish on GSM8K with Chain-of-Thought. In the first iteration, some irrelevant information is removed and the average time each member of the second team consumes is clarified. In the second iteration, the model accurately identifies the two teams as "Team A" and "Team B" instead of referring to them as "one team" and "the other team", and it explicitly states what the question to be asked is, reducing the burden of understanding on the model during the reasoning process.

Figure 9: A failure case of Self-Polish on MultiArith with Chain-of-Thought. The semantics of the second generated problem have deviated from the semantics of the original problem, and the model cannot answer correctly.
Original Question: The average weight of a, b and c is 45 kg. If the average weight of a and b is 40 kg and that of b and c is 45 kg, then the weight of b is? Answer Choices: (A) 31 kg (B) 32 kg (C) 33 kg (D) 35 kg (E) none of these
New Question: The average weight of a, b and c is 45 kg, which means the total weight of a, b and c is 135 kg. If the average weight of a and b is 40 kg, which means the total weight of a and b is 80 kg, then the weight of c is 55 kg. The average weight of b and c is 45 kg, which means the total weight of b and c is 90 kg. What is the weight of b? Answer Choices: (A) 31 kg (B) 32 kg (C) 33 kg (D) 35 kg (E) none of these

Original Question: The compound and the simple interests on a certain sum at the same rate of interest for two years are Rs. 11730 and Rs. 10200 respectively. The sum is? Answer Choices: (A) rs.17037 (B) rs.17000 (C) rs.17276 (D) rs.170287 (E) rs.171881
New Question: A sum of money earns compound interest and simple interest at the same rate for two years. The compound interest is Rs. 11730 and the simple interest is Rs. 10200. What is the sum of money? Answer Choices: (A) rs.17037 (B) rs.17000 (C) rs.17276 (D) rs.170287 (E) rs.171881

[Original Problem 0] (Figure 2): Kylie makes 10 beaded necklaces on Monday and 2 beaded necklaces on Tuesday. Then Kylie makes 5 beaded bracelets on Wednesday. 20 beads are needed to make one beaded necklace. 10 beads are needed to make one beaded bracelet. Ada bought 2000 tomatoes from the grocery store. How many beads does Kylie use in total to make her jewelry?

Table 2: Evaluation results when combining Self-Polish with other answer/reasoning-side prompting strategies. The results are with Text-davinci-003. The best performance for each answer-side strategy on one task is underlined. The best performance for each task is in bold.

Algorithm 1: Self-Polish Prompting. Input: language model G, problem set S, prompt P_refine of the problem-side refining method, prompt P_answer of the answer/reasoning-side method, maximum iteration count T, answer selection strategy Z.

[Original Problem 0]: Charlie wants to sell beeswax candles. For every pound of beeswax, he can make 10 tapered candles. One pound of beeswax and the wicks cost $10.00 in supplies. If he sells each candle for $2.00 each, what is his net profit if he makes and sells 20 candles?

[Rationale 1]: Charlie needs 1 pound of beeswax and wicks to make 10 tapered candles, which cost him $10.00 in supplies. He sells each candle for $2.00. So he makes a profit of $2.00 per candle. If he makes and sells 20 candles, his net profit is 20 * 2 = 40 dollars. The answer is 40 dollars.

[Answer 1]: 40.

[Original Problem 0]: A teacher had 38 worksheets to grade. If she graded 4, but then another 15 were turned in, how many worksheets would she have to grade?

[Refined Problem 1]: A teacher had 38 worksheets to grade. After grading 4 worksheets, 15 more were turned in. How many worksheets does the teacher now have to grade?

[Rationale 1]: The teacher had 38 worksheets to grade originally. After grading 4 worksheets, 15 more were turned in. So the teacher now has 38 + 15 = 53 worksheets to grade. The answer is 53.

[Answer 1]: 53.

[Rationale 2]: The teacher had 38 worksheets to grade originally. After grading 4 worksheets, 15 additional worksheets were turned in. So the teacher now has 38 + 15 = 53 worksheets to grade in total. The answer is 53.

[Answer 2]: 53.

Table 4: Sensitivity to the number and order of problem-refining demonstrations. "Mean" represents the average performance for the current number of shots, while "order deviation" represents the average standard deviation introduced by different demonstration orders. The results are with Text-davinci-003. On the problem side, we leverage in-context SP.

Table 10: Continuation of the Complex-SP prompt for AQuA and MathQA.