Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data

Chain-of-thought (CoT) prompting advances the reasoning abilities of large language models (LLMs) and achieves superior performance on complex reasoning tasks. However, most CoT studies rely on carefully designed, human-annotated rationale chains to prompt LLMs, posing challenges for real-world applications where labeled data is available without rationale chains. This paper proposes a new strategy, Automate-CoT (Automatic Prompt Augmentation and Selection with Chain-of-Thought), that bypasses human engineering of CoT by automatically augmenting rationale chains from a small labeled dataset and then pruning low-quality chains to construct a candidate pool of machine-generated rationale chains based on the labels. Finally, it selects the optimal combination of several rationale chains from the pool for CoT prompting, employing a variance-reduced policy gradient strategy to estimate the significance of each example. Automate-CoT enables quick adaptation of the CoT technique to different tasks. Experimental results demonstrate the effectiveness of our method, which achieves competitive results on arithmetic reasoning (+2.7%), commonsense reasoning (+3.4%), symbolic reasoning (+3.2%), and non-reasoning tasks (+2.5%). The code is available at https://github.com/SHUMKASHUN/Automate-CoT.


Introduction
The recent success of large language models (LLMs) has shown that properly prompted LLMs demonstrate emergent capabilities on complex understanding and question-answering tasks (Wei et al., 2022a). In particular, with the recently proposed chain-of-thought (CoT) prompting (Wei et al., 2022b), LLMs are capable of solving reasoning tasks including arithmetic reasoning, commonsense reasoning, and symbolic reasoning. The basic idea of CoT prompting is to add a few rationale chains to the answer as exemplars that illustrate the intermediate reasoning steps. Following CoT, several recent studies improve it by leveraging self-consistency (Wang et al., 2023), explanation learning (Lampinen et al., 2022), complexity-based prompting (Fu et al., 2023), self-training (Huang et al., 2022), a voting verifier (Li et al., 2022a), and bootstrapping (Zelikman et al., 2022). However, most of them are constrained to a few fixed human-written exemplars, which require significant human effort to create and adapt to new datasets. The annotation process is nontrivial because humans need to not only select the questions but also carefully design the reasoning steps for each question. In the process of searching for the perfect exemplars, we identify four critical factors that affect the performance of chain-of-thought prompting and require large human effort to deal with: (1) order sensitivity (Zhao et al., 2021): the order combination of the exemplars; (2) complexity (Sugawara et al., 2018; Lai et al., 2021; Fu et al., 2023): the number of reasoning steps of the rationale chains; (3) diversity: the combination of exemplars of different complexity levels; (4) style sensitivity (Papadopoulos et al., 2010): the writing/linguistic style of the rationale chains. A detailed analysis of the four factors is covered in Section 2. All of these sensitivities make human-based prompt engineering costly and motivate us to find an automatic and task-agnostic way to adapt chain-of-thought exemplars to any downstream task.

In this paper, we address these problems with a CoT augmentation and selection process that finds suitable exemplars automatically. This can be divided into three steps: (1) Augment: the language model generates multiple pseudo-chains for query questions automatically. (2) Prune: we rely on the assumption that generating correct reasoning is a necessary condition for generating correct answers. This assumption is natural because the answer is generated after several reasoning steps; when a correct answer is generated, the rationale chain of these steps is most likely correct, contributing to the final correctness. We therefore prune the pseudo-chains according to the consistency between generated and ground-truth answers to reduce noise. (3) Select: given that all the data have been annotated with rationale paths, we apply a variance-reduced policy gradient strategy (Williams, 1992; Dong et al., 2020; Zhou et al., 2021; Diao et al., 2022) to estimate the gradients and optimize the selection process, finding the most helpful chain-of-thought for each task. Compared to prior manually written CoT, Automate-CoT finds optimal and diverse CoT exemplars automatically and is adaptable to any task without human effort. Compared with Auto-CoT (Zhang et al., 2023), which samples diverse questions by clustering and generates rationale chains, Automate-CoT considers and mitigates the aforementioned sensitivity issues while achieving a greater performance boost on each task. Automate-CoT is a fully automatic pipeline for finding better chain-of-thought prompts, mitigating the sensitivity issues of manually written exemplars, and further improving performance by a large margin. Experimental results demonstrate the effectiveness of Automate-CoT on arithmetic reasoning (+2.7%), commonsense reasoning (+3.4%), symbolic reasoning (+3.2%), and non-reasoning tasks (+2.5%).

Motivation
Recent studies observed sensitivity issues with GPT-3's few-shot learning caused by different selections of in-context examples, such as order instability (Zhao et al., 2021; Zhang et al., 2022; Liu et al., 2022; Lu et al., 2022). Based on their findings, we first investigate whether these sensitivities still exist in chain-of-thought methods. We then explore other factors that not only affect performance but also require human effort to deal with. We conclude with the following four factors:

• Order Sensitivity: Different orders of few-shot exemplars may have a huge impact on performance in traditional few-shot prompting (Lu et al., 2022). Thus we conduct experiments on GPT-3 to test whether such sensitivity exists in chain-of-thought methods. Although Manual-CoT (Wei et al., 2022b) reports that human-written CoT is robust to order changes (<2%) with the LaMDA model, we observed that performance fluctuates with different orders of chain-of-thought exemplars. For the GSM8K dataset, we simply shuffle the order of the exemplars in Manual-CoT 10 times at random; the lowest accuracy is 59.8%, which is 3.3% lower than the average accuracy (63.1%) they report, suggesting that order sensitivity still exists.

Figure 1: The performance across different numbers of hops (reasoning steps of rationale chains) on GSM8K. Manual-CoT refers to the human-written chain-of-thought by Wei et al. (2022b). Complex-CoT refers to the chain-of-thought using 9-hop rationale chains.
• Complexity: We define complexity as the number of hops (reasoning steps) in an exemplar, where more steps indicate greater complexity. We observe that human-written CoT tends to be simple (≤3 hops), achieving good accuracy on simple math questions while suffering on complex questions, as shown in Figure 1. In addition, a previous study (Fu et al., 2023) suggested that using all complex exemplars can improve CoT performance. However, in our experiments (Figure 1), we found that Complex-CoT improves accuracy on complex questions but performs poorly on simple questions. Therefore, we conjecture that the inconsistency between the hops of the provided exemplars and the hops required by the real question causes the performance drop, suggesting that determining the appropriate complexity level of exemplars is crucial.
• Diversity: Based on the above discovery about complexity, a natural question is what combination of exemplars of different complexity levels is most effective. However, testing various combinations is challenging for humans and requires significant effort to determine the optimal one. In our experiments (Figure 1), we found that a combination of exemplars of different complexity levels outperforms CoT with all complex exemplars, suggesting a complexity-diversity trade-off.
• Style Sensitivity: Previous research in educational psychology found that different learning styles limit the cognitive benefit students gain from prompting (Papadopoulos et al., 2010). We further argue that students with specific learning styles benefit to varying degrees from different styles of prompting. In addition, empirical evidence from Manual-CoT (Wei et al., 2022b) shows that different annotators can cause up to a 28.2% accuracy difference on a symbolic reasoning task, supporting our conjecture. As a result, some bad styles may lead to a huge performance drop. However, humans cannot determine the performance of a particular style beforehand, so it requires trial and error against the validation set, which further increases the effort of writing chain-of-thought exemplars.

In light of this empirical evidence, we are motivated to design a framework that not only augments rationale chains but also selects helpful chains adaptively. With this framework, we expect to bypass the order and style sensitivities and reach a better complexity-diversity trade-off without human effort, finally boosting performance.

Approach
Our approach receives a training dataset $D$ containing $n$ questions $Q = \{q_1, q_2, \ldots, q_n\}$ and $n$ answers $A = \{a_1, a_2, \ldots, a_n\}$. The overall architecture of our approach is shown in Figure 2. In this section, we start with a detailed description of the augment and prune operations and end with an illustration of the select operation.

Augment and Prune
Inspired by Wang et al. (2022), who show that generated rationale chains are of comparable quality to human-annotated ones, we automatically generate rationale chains to augment the candidate exemplars. Given $m$ fixed rationale chains $C = \{c_1, c_2, \ldots, c_m\}$ and a question $q$, we ask the large language model $G$ to generate $k$ rationale chains for each $q$. A larger $k$ forms a larger pool, and post-processing can further improve the quality of the pool; considering cost and efficiency, we choose $k = 1$ for our experiments. Our method works well even without $C$ (i.e., $m = 0$), in which case it is based on zero-shot prompting. We then prune the incorrect chains and keep only those with the correct final answer, i.e., those whose final answer is consistent with the ground-truth answer. After pruning, we obtain a pool of $K$ high-quality exemplars.
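As a minimal sketch of this step (with hypothetical generate_fn and parse_fn helpers standing in for the LLM API wrapper and the answer parser; neither name is from our released code):

from typing import Callable, Dict, List

def augment_and_prune(
    questions: List[str],
    answers: List[str],
    seed_chains: List[str],
    generate_fn: Callable[[str], str],   # wraps the LLM API (user-supplied)
    parse_fn: Callable[[str], str],      # extracts the final answer from a chain
    k: int = 1,
    pool_size: int = 100,
) -> List[Dict[str, str]]:
    """Build a pool of machine-generated rationale chains whose final
    answers agree with the ground-truth labels (augment, then prune)."""
    prefix = "\n\n".join(seed_chains)  # the m fixed exemplars C (may be empty)
    pool = []
    for q, gold in zip(questions, answers):
        for _ in range(k):  # k generations per question; we use k = 1
            chain = generate_fn(f"{prefix}\n\nQ: {q}\nA:")
            # Prune: keep a chain only if its final answer matches the label.
            if parse_fn(chain) == gold:
                pool.append({"question": q, "chain": chain, "answer": gold})
                break
        if len(pool) >= pool_size:
            break
    return pool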

Select
With a large pool of high-quality exemplars, we cannot directly apply all of them due to four considerations: (1) context length limit: the maximum length is 2,048 tokens for GPT-3, so we cannot feed too many exemplars into the model; (2) fair comparison: existing studies usually take 4-8 question-answer pairs as exemplars, following Wei et al. (2022b); (3) sensitivity: model performance may be sensitive to the contexts (Jiang et al., 2020), orders (Lu et al., 2022), and lengths (Lester et al., 2021) of prompts, as observed in the prompt learning literature; (4) adaptation: different downstream tasks may require different exemplars. Therefore, a natural idea is to select the most suitable 4-8 exemplars automatically. The process can be viewed as optimizing a supervised model with latent variables. For each chain-of-thought index $i$, we initialize a latent variable $j_i \sim \mathrm{Cat}(\mathbf{p}_i)$. The random variable $j_i$ is sampled with probability distribution $\mathbf{p}_i = [p_{i,1}, \ldots, p_{i,N}]$ over the $N$ candidate demonstration indexes, where $\mathbf{p}_i \in \mathcal{C}$ and $\mathcal{C} = \{\mathbf{p} : \|\mathbf{p}\|_1 = 1, \mathbf{0} \preceq \mathbf{p} \preceq \mathbf{1}\}$. Since the $\mathbf{p}_i$ are independent of each other, the joint probability of the whole input exemplar set is $P(T) = \prod_{i=1}^{n} P(t_i) = \prod_{i=1}^{n} p_{i,j_i}$, where $T$ represents the full few-shot exemplars and $t_i$ denotes the $i$-th exemplar. The loss is formulated as $\mathcal{L}(G([T, S]), y)$, where $S$ is the current question (the user's query) and $y$ is the label. However, directly updating the prompts by back-propagating through $\nabla_{\mathbf{p}_i} \mathcal{L}(G([T, S]), y)$ is not possible because the gradients are inaccessible. We resort to the variance-reduced policy gradient estimator (VR-PGE) (Williams, 1992; Dong et al., 2020; Zhou et al., 2021; Diao et al., 2022), a reinforcement learning method that optimizes the loss function via forward propagation:

$$\mathcal{L} = \mathbb{E}_{T \sim P(T)}\big[\mathcal{L}(G([T, S]), y)\big]$$

and estimates the gradient of $\mathbf{p}_i$ by:

$$\hat{\nabla}_{\mathbf{p}_i}\mathcal{L} = \frac{1}{I-1} \sum_{k=1}^{I} \Big( \mathcal{L}(T^{(k)}) - \frac{1}{I}\sum_{j=1}^{I}\mathcal{L}(T^{(j)}) \Big)\, \nabla_{\mathbf{p}_i} \log P\big(t_i^{(k)}\big)$$

where $T^{(k)}, k = 1, \cdots, I$ are sampled independently from $P(T)$. Therefore, the exemplar distribution $\mathbf{p}_i$ can be updated by a projected stochastic gradient descent algorithm:

$$\mathbf{p}_i \leftarrow \mathrm{proj}_{\mathcal{C}}\big(\mathbf{p}_i - \eta\,\hat{\nabla}_{\mathbf{p}_i}\mathcal{L}\big)$$

where $\eta$ is the learning rate, $I$ is the sample size, and $\mathrm{proj}_{\mathcal{C}}$ is the projection calculation (details are presented in Appendix A).
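In code, the selection loop looks roughly as follows. This is a minimal numpy sketch under our own simplifications: plain projected SGD instead of the AdamW optimizer used in our implementation, and a hypothetical loss_fn that assembles the sampled exemplars, queries the LLM, and returns the answer-token cross-entropy.

import numpy as np
from typing import Callable

def project_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection onto the probability simplex (proj_C)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def select_exemplars(
    num_slots: int,                          # number of exemplar slots (4-8)
    pool_size: int,                          # N candidates in the pool
    loss_fn: Callable[[np.ndarray], float],  # queries the LLM, returns CE loss
    steps: int = 50, I: int = 4, eta: float = 1e-3,
) -> np.ndarray:
    p = np.full((num_slots, pool_size), 1.0 / pool_size)  # one Cat(p_i) per slot
    for _ in range(steps):
        # Draw I independent exemplar combinations T^(k) ~ P(T).
        samples = [np.array([np.random.choice(pool_size, p=p[i])
                             for i in range(num_slots)]) for _ in range(I)]
        losses = np.array([loss_fn(idx) for idx in samples])
        baseline = losses.mean()             # variance-reducing baseline
        grad = np.zeros_like(p)
        for loss, idx in zip(losses, samples):
            for i, j in enumerate(idx):
                # d log p_{i,j_i} / d p_{i,j} = 1/p_{i,j_i} at the sampled index.
                grad[i, j] += (loss - baseline) / ((I - 1) * p[i, j])
        for i in range(num_slots):           # projected SGD update
            p[i] = project_simplex(p[i] - eta * grad[i])
    return p.argmax(axis=1)                  # final combination (arg max p_i)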

Experimental Settings
In this section, we first introduce the settings of the eleven datasets and their corresponding evaluation metrics (§4.1). Then the baseline models (§4.2) and implementation details (§4.3) are presented in the following two subsections. Full details about the experimental settings are given in Appendix B.

Datasets and Evaluation Metrics
Following Wei et al. (2022b), we conduct our experiments on eight reasoning tasks, including five math word problem datasets: GSM8K, ASDiv, SVAMP, AQuA, and SingleOp; two commonsense reasoning datasets: CommonsenseQA (CSQA) and StrategyQA; and one symbolic reasoning task: Last Letter Concatenation (Letter (4)). We also generalize our method to non-reasoning tasks, including one question-answering task (OpenBookQA), one natural language inference task (e-SNLI), and one sentiment analysis task (SST-2). The detailed statistics of the datasets are listed in Table 5. The evaluation metric for all tasks is exact match accuracy. First, we pre-process the predictions to remove all special symbols; for example, "$100,000" is processed to "100000". Then we check whether it has the same value as the ground truth to calculate the exact match accuracy.
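For illustration, the normalization can be sketched as follows; the exact set of stripped symbols is our assumption, based on the "$100,000" example:

import re

def normalize(ans: str) -> str:
    """Strip special symbols before exact-match comparison."""
    ans = ans.strip().lower()
    return re.sub(r"[^0-9a-z./-]", "", ans)  # drops "$", ",", spaces, etc.

def exact_match(pred: str, gold: str) -> bool:
    p, g = normalize(pred), normalize(gold)
    try:  # compare numeric values when both parse as numbers
        return float(p) == float(g)
    except ValueError:
        return p == g

# e.g., exact_match("$100,000", "100000") -> True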
We utilize the public APIs from OpenAI's services and test with text-davinci-002 and code-davinci-002.

Implementation
Augment and Prune: Following Wei et al. (2022b) and Wang et al. (2022), we keep the same number of exemplars (4-8), as listed in Table 5. For the main experiments, we augment and prune a pool of 100 high-quality exemplars for all datasets. For entries that prior work did not report, we obtain the results from DIVERSE (Li et al., 2022b).
Select: Both the training and validation sets have a size of 100 to reach a performance-cost trade-off. Then, utilizing the log probabilities returned by API calls, we calculate the cross-entropy loss of the answer token. Finally, we optimize the latent variables with AdamW (Loshchilov and Hutter, 2019) for 5 epochs with a learning rate of 1 × 10^-3 and a batch size of 10. After optimization, we choose the exemplar combination (arg max $\mathbf{p}_i$) with the highest validation accuracy for further evaluation on the test set. By default, we query the language model once to get the answer. Under the self-consistency setting, similar to Wang et al. (2023), we query the language model 40 times and choose the most consistent answer as the final answer.
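A sketch of the loss computation, assuming the legacy Completion response layout (a logprobs object with tokens and token_logprobs fields); the fallback penalty for a gold token that never appears is our own choice:

def answer_token_loss(response: dict, answer_token: str) -> float:
    """Cross-entropy of the gold answer token from API-returned logprobs."""
    lp = response["choices"][0]["logprobs"]
    for tok, logp in zip(lp["tokens"], lp["token_logprobs"]):
        if tok.strip() == answer_token:
            return -logp        # cross-entropy = negative log-likelihood
    return 1e9                  # gold token never generated: large loss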
Hyper-parameter Setting: Under the few-shot setting, we set max_tokens = 256 for all augmentation, selection, and inference. In addition, we set logprobs = 5 during training. Moreover, we set temperature = 0.7 for evaluation under self-consistency and temperature = 0 for all other cases.
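For reference, these settings map onto the legacy Completion endpoint roughly as follows (a sketch using the pre-1.0 openai SDK; the wrapper and its defaults are illustrative, not our exact code):

import openai  # legacy (pre-1.0) SDK interface; shown as an assumption

def query(prompt: str, self_consistency: bool = False, train: bool = False):
    """Decoding configuration described above for the davinci-002 engines."""
    return openai.Completion.create(
        engine="code-davinci-002",
        prompt=prompt,
        max_tokens=256,                           # few-shot setting
        temperature=0.7 if self_consistency else 0.0,
        logprobs=5 if train else None,            # needed for the CE loss
        n=40 if self_consistency else 1,          # 40 samples, majority vote
    )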

Experimental Results
The experimental results are shown in Table 1. We discuss our results in three parts based on the task categories. Automate-CoT's results are averaged over three runs, and the variance over different runs is reported in Appendix Table 7. Overall, Automate-CoT achieves superior results on all tasks. With text-davinci-002, Automate-CoT outperforms Manual-CoT and SC by 2.6% and 3.7% on average, and the gains hold on code-davinci-002 as well (up to 3.4%), demonstrating that our method is effective on different task types. More surprisingly, the improvement on Letter (4) is significant, demonstrating our method's robustness to out-of-distribution data.
Non-Reasoning Tasks: Automate-CoT also achieves strong results on the question answering (OpenBookQA), natural language inference (e-SNLI), and sentiment analysis (SST-2) tasks, with improvements of 2.8%, 3.4%, and 1.3%, respectively. The results show that our method generalizes to various task types and is not limited to reasoning tasks.

Additional Experiments and Analysis
We further conduct several experiments to evaluate the effectiveness of Automate-CoT and analyze the contribution of each module. Since queries to text-davinci-002 are limited and expensive, most additional experiments are conducted with code-davinci-002.

Effects of Selection Algorithm
After obtaining a large pool of exemplars, a natural question is how well random selection from the pool, regardless of order, performs.

In Figure 3, we compare the accuracy obtained by random selection, human-written exemplars (Manual-CoT), and our Automate-CoT. For random selection, we randomly sample exemplars from the pool and combine them regardless of order to form the prompts. We repeat this process five times and report the accuracy with error bars. The results show that random selection suffers from high variance and relatively low accuracy compared to Manual-CoT and Automate-CoT. Surprisingly, we observed that the average performance of random selection from model-generated exemplars can outperform Manual-CoT on some datasets (e.g., GSM8K, CSQA). This also suggests that manual prompt engineering must be done carefully in terms of difficulty, diversity, and style. In conclusion, simply selecting exemplars from the pool at random can easily yield much lower accuracy than the manually written method. However, our Automate-CoT consistently outperforms both random selection and Manual-CoT, which shows the effectiveness of our method.

Effects of Pool Size
We further conduct a set of experiments to test different pool sizes. As shown in Figure 4, if the pool size is limited to only 10, the performance of Automate-CoT is worse than or comparable with Manual-CoT. It turns out that with a small pool, Automate-CoT is unable to select a combination good enough to beat the carefully designed Manual-CoT. However, Automate-CoT outperforms Manual-CoT once the pool size reaches 20 or larger. The trend shows that performance improves as the pool size increases. This is intuitive and matches our hypothesis: as the pool size increases, there are more complex and diverse exemplars to choose from. We expect performance to keep increasing, but since more queries to GPT-3 are time-consuming and expensive, we limited these additional experiments to a maximum pool size of 150.

Effects of Chain Complexity
It is observed that exemplars written by humans are rather simple, so we further explore how chain complexity affects performance. We randomly pick

Table 3: The performance of Automate-CoT in the zero-shot setting compared with other baselines. Light gray highlights our main model, which uses a manually constructed chain-of-thought and is not intended for comparison; we list it here only for reference.

Zero-shot Setting
Instead of using 4-8 manually written exemplars to generate the chains, we simply add "Let's think step by step." and let LLMs generate the chains. We test the results under the text-davinci-002 model on GSM8K, SVAMP, and Letter (4) and compare them with Zero-shot-CoT, Manual-CoT, and Auto-CoT. Surprisingly, we observe that the results are comparable to and can even slightly outperform Manual-CoT and Auto-CoT, as shown in Table 3. The results further demonstrate that our method can effectively select a suitable combination of exemplars even from a pool that may contain low-quality chains.
In conclusion, if a dataset already has manually written chains, our method can be applied to boost performance. If a dataset does not have manually written chains, our method can still be used to achieve accuracy higher than manually written chains would provide, demonstrating the superiority of our method.

Ablation Study
In this section, we conduct ablation experiments to verify the advantage of the generated prompts with respect to each of the four factors.
Advantage over Order Factor: The advantages of Automate-CoT regarding the order factor can be viewed in two ways. Firstly, determining a good order requires large human effort, trying many different orders on validation sets; Automate-CoT instead constructs the exemplars automatically, without further adjustment, and achieves a good result. Secondly, Automate-CoT is less affected by order sensitivity. We further conduct an experiment comparing Automate-CoT's selected exemplars with random permutations of those exemplars, as shown in Table 4. We observe that order sensitivity still exists and that our selected exemplars perform better than all 5 random permutation runs, demonstrating that Automate-CoT can automatically choose a good order without any human effort.
Advantage over Complexity Factor: As discussed under the complexity factor in Section 2, the complexity of manually written chains is quite low (less than or equal to 3 hops). It would require more human effort to design complex rationales. However, Automate-CoT can automatically augment and select examples of different complexity, reaching a better accuracy trade-off between simple questions and complex questions (Appendix Table 9).

Advantage over Diversity Factor
The diversity of Manual-CoT or Complex-CoT is limited. For example, every exemplar of Complex-CoT has the same complexity, and every exemplar of Manual-CoT ranges from 1-3 hops, as illustrated in the motivation section. However, Automate-CoT can automatically select the combination of complexities that best suits the dataset. For example, our selected exemplars on GSM8K have an average of 5.4 hops and range from 3 to 8 hops, as shown in Appendix G. The set contains both simple and complex exemplars, which reaches the best performance.
Advantage over Style Factor: Our extensive experience with multiple experiments indicates that a good linguistic style is typically formal and detailed. This style entails (1) explicit and logical connection words (e.g., "so", "that means"), (2) detailed reasoning steps within a single sentence, (3) symbols where appropriate (e.g., using the $ symbol to denote monetary values), and (4) minimal use of abbreviations. We further conduct an ablation experiment to test whether our method can choose the examples with better style. Firstly, we use Automate-CoT to select 8 rationale exemplars. Then we copy this set and manually edit its written/linguistic style to be worse, while keeping the order, complexity, and diversity the same, which gives an edited counterpart of each exemplar.

Related Work
In this section, we first review recent progress on prompt-based learning (§8.1) and chain-of-thought prompting (§8.2), and then discuss black-box optimization methods (§8.3).

Black-box Optimization
Nowadays, large language models provide services as commercial APIs deployed in the cloud, such as OpenAI's GPT-3 (Brown et al., 2020).

Conclusion
In this paper, we proposed a chain-of-thought optimization method consisting of three steps: augment, prune, and select. Automate-CoT first generates rationale chains according to the standard CoT process with several exemplars, then prunes incorrect ones according to the consistency between the predicted answer and the ground-truth answer. Finally, we apply a variance-reduced policy gradient strategy to estimate the gradients and optimize the latent variables to select better CoTs. Experimental results demonstrate the effectiveness of our method on arithmetic reasoning, commonsense reasoning, symbolic reasoning, and non-reasoning tasks.

Limitations
It is shown that Automate-CoT demonstrates superior performance over previous chain-of-thought prompting methods. However, despite these exciting results, there are still some limitations to our current work, as well as potential opportunities for future research.
Comparison with Fine-tuning: Our main baselines include the original chain-of-thought (Wei et al., 2022b) and self-consistency (Wang et al., 2023), which are manually written prompting methods. In addition, we also compare with clustering-based and retrieval-based methods for selecting prompt exemplars, such as Auto-CoT (Zhang et al., 2023), BM25 (Robertson, 2009), and PromptPG (Lu et al., 2023). As large language models are dominating the field, the performance of training large language models on such labeled data might be interesting. However, it is not covered in this study due to our prompting setting and limited resources.

A Algorithm Details
In this section, we provide more details about the derivation of equation (1) in Section 3.2. Given the loss function

$$\mathcal{L} = \mathbb{E}_{T \sim P(T)}\big[\mathcal{L}(G([T, S]), y)\big],$$

we can estimate the gradient of $\mathbf{p}_i$ by:

$$\nabla_{\mathbf{p}_i}\mathcal{L} = \mathbb{E}_{T \sim P(T)}\big[\mathcal{L}(G([T, S]), y)\,\nabla_{\mathbf{p}_i} \log P(t_i)\big] \quad (5)$$

The $j$-th component of $\nabla_{\mathbf{p}_i} \log P(t_i)$ can be solved explicitly. When $j = j_i$:

$$\big[\nabla_{\mathbf{p}_i} \log P(t_i)\big]_j = \frac{\partial \log p_{i,j_i}}{\partial p_{i,j_i}} = \frac{1}{p_{i,j_i}} \quad (6)$$

When $j \neq j_i$, equation (6) is calculated by:

$$\big[\nabla_{\mathbf{p}_i} \log P(t_i)\big]_j = \frac{\partial \log p_{i,j_i}}{\partial p_{i,j}} = 0$$

Therefore, we adopted a variance-reduced policy gradient estimator (VR-PGE), as described in Williams (1992); Dong et al. (2020); Zhou et al. (2021), to mitigate the high-variance issue of PGE. The estimated gradient is calculated by:

$$\hat{\nabla}_{\mathbf{p}_i}\mathcal{L} = \frac{1}{I-1} \sum_{k=1}^{I} \Big( \mathcal{L}(T^{(k)}) - \frac{1}{I}\sum_{j=1}^{I}\mathcal{L}(T^{(j)}) \Big)\, \nabla_{\mathbf{p}_i} \log P\big(t_i^{(k)}\big)$$

where $T^{(k)}, k = 1, \cdots, I$ are sampled independently from $P(T)$. Thus, the prompt token distribution $\mathbf{p}_i$ can be updated by a projected stochastic gradient descent algorithm:

$$\mathbf{p}_i \leftarrow \mathrm{proj}_{\mathcal{C}}\big(\mathbf{p}_i - \eta\,\hat{\nabla}_{\mathbf{p}_i}\mathcal{L}\big)$$

where $\eta$ is the learning rate of prompt learning, $I$ is the sample size, and $\mathrm{proj}_{\mathcal{C}}$ is the projection calculation.
The detailed training procedure of our VR-PGE algorithm is displayed in Algorithm 1.

B.1 Datasets and Evaluation Metrics
Following Wei et al. (2022b), we conduct our experiments on eight reasoning tasks, including five math word problem datasets: GSM8K, ASDiv, SVAMP, AQuA, and SingleOp; two commonsense reasoning datasets: CommonsenseQA (CSQA) and StrategyQA; and one symbolic reasoning task: Last Letter Concatenation (Letter (4)). We also generalize our method to non-reasoning tasks, including one question-answering task (OpenBookQA), one natural language inference task (e-SNLI), and one sentiment analysis task (SST-2). The detailed statistics of the datasets are listed in Table 5.
To make a fair comparison with our baselines, we use the same number of exemplars as Wei et al. (2022b) and Wang et al. (2022), as shown in Table 5. We keep the same setting for the evaluation split as well.
By default, we use the test split for evaluation; for datasets that do not have publicly available test-set labels, we evaluate on the validation set instead. In addition, for last letter concatenation, since the model has already achieved almost 100% accuracy under the in-distribution setting, we only test the out-of-distribution (OOD) setting, Letter (4), where prompts are 2-letter examples and test examples are 4-letter.
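For illustration, both splits can be generated programmatically (a sketch; the name lists are placeholders rather than the ones used to build the benchmark):

import random

FIRST = ["Elon", "Bill", "Amy", "Lady", "Larry"]
LAST = ["Musk", "Gates", "Adams", "Gaga", "Page"]

def make_example(num_words: int):
    """Build a last-letter-concatenation example with num_words words."""
    words = [random.choice(FIRST + LAST) for _ in range(num_words)]
    question = f'Take the last letters of the words in "{" ".join(words)}" and concatenate them.'
    answer = "".join(w[-1] for w in words)
    return question, answer

# Letter (2) prompts: make_example(2); Letter (4) OOD tests: make_example(4)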
The evaluation metric for all tasks is exact match accuracy. First, we pre-process the predictions to remove all special symbols; for example, "$100,000" is processed to "100000". Then we check whether it has the same value as the ground truth to calculate the exact match accuracy.

B.2 Baselines
In our experiments, the following three methods serve as the main baselines:

• Chain-of-thought (Manual-CoT) (Wei et al., 2022b): standard chain-of-thought prompting, which provides manually written intermediate reasoning steps.

• Self-consistency (SC) (Wang et al., 2023): an improved version of CoT. Instead of greedy decoding, it samples a diverse set of reasoning paths and chooses the most common answer.

• Auto-CoT (Zhang et al., 2023): an automatic exemplar construction method that applies clustering techniques to sample questions and then generates chains.

Our experiments are conducted with two popular large language models:

• GPT-3 (Brown et al., 2020): we test an advanced version of GPT-3, text-davinci-002, which corresponds to the InstructGPT (Ouyang et al., 2022) model.

• CodeX (Chen et al., 2021): we test code-davinci-002, which has better code representation ability.

We utilize the public APIs directly from OpenAI's services. In our main experiments, we test on both the text-davinci-002 and code-davinci-002 engines. However, in additional experiments, we mainly test on code-davinci-002 for two reasons: (1) it was the most capable model available at the time we conducted our experiments, consistent with the observations in previous studies (Wei et al., 2022b; Wang et al., 2023; Miao et al., 2020); (2) compared to the costly text-davinci-002, it was free of charge because we were in the initial limited beta period during our experiments.

B.3 Implementation
Augment and Prune: Following Wei et al. (2022b) and Wang et al. (2022), we keep the same number of exemplars (4-8), as listed in Table 5. For the main experiments, we augment and prune a pool of 100 high-quality exemplars for all datasets. Firstly, pool-construction questions are randomly sampled and then fed to LLMs to construct model-generated answers with rationale chains. Given that some datasets only have a test split, we use the pool built from GSM8K and transfer it to these datasets for further inference. For arithmetic reasoning tasks, pool-construction questions are randomly sampled from the training splits of GSM8K and AQuA. For CSQA and StrategyQA, exemplars are randomly sampled from the official training split (Talmor et al., 2019) and the question-only set from the BIG-bench collaboration (Srivastava et al., 2022). For letter concatenation, exemplars are randomly sampled from the 2-letter set. After the pool is constructed, we use the labels to prune incorrect model-generated exemplars and retain 100 high-quality exemplars.

Select: The train and validation sets are also randomly sampled following the same rule as above, except for the Letter (4) dataset. Since the LLM has already reached almost 100% accuracy on the 2-letter set, we choose to optimize the model on the 3-letter OOD set; thus the train and validation sets are randomly sampled from the 3-letter set. Both the train and validation sets have a size of 100 to reach a performance-cost trade-off. Then, utilizing the log probabilities returned by API calls, we calculate the cross-entropy loss of the answer token. Finally, we optimize the latent variables with AdamW (Loshchilov and Hutter, 2019) for 5 epochs with a learning rate of 1 × 10^-3 and a batch size of 10. After optimization, as shown in the inference stage of Figure 2, we choose the exemplar combination (arg max $\mathbf{p}_i$) with the highest validation accuracy for further evaluation on the test set. By default, we query the language model once to get the answer. Under the self-consistency setting, similar to Wang et al. (2023), we query the language model 40 times and choose the most consistent answer as the final answer.

Hyper-parameter Setting: Under the few-shot setting, we set max_tokens = 256 for all augmentation, selection, and inference. In addition, we set logprobs = 5 during training. Moreover, we set temperature = 0.7 for evaluation under self-consistency and temperature = 0 for all other cases. Under the zero-shot setting (§6.5), we keep the same hyper-parameters as Kojima et al. (2022), first using max_tokens = 128 for generating the rationale chains and then max_tokens = 32 for generating the answers to construct the pool. The hyper-parameters for selection and evaluation are the same as in the few-shot setting above.

C.1 Experiments under ChatGPT
To further verify the effectiveness of Automate-CoT, we conduct experiments on gpt-3.5-turbo. Automate-CoT shows consistent improvements on each task: 2.8% on arithmetic reasoning, 3.9% on commonsense reasoning, 3.2% on symbolic reasoning, and 2.8% overall, as shown in Table 6.

C.2 Comparison with Retrieval Methods
We also compare Automate-CoT with the simple retrieval method BM25 (Robertson, 2009) and the reinforcement learning-based retrieval method PromptPG (Lu et al., 2023). We first implemented a BM25 selection method and tested its performance on all the datasets. The results are shown in Table 6. They indicate that retrieval-based methods only select examples with meanings similar to the query question while overlooking diversity. As shown in the table, the average performance of the BM25 retrieval-based method even degrades by 1% compared to Manual-CoT and is 3.8% lower than Automate-CoT. A similar phenomenon is observed in Auto-CoT (Zhang et al., 2023): with similar questions being sampled for test questions, Retrieval-Q-CoT is negatively affected by being misled by similarity.
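The BM25 baseline can be approximated with the rank_bm25 package as follows (a sketch of the general recipe, not the exact implementation used in our experiments):

from rank_bm25 import BM25Okapi  # third-party package; an assumption here

def bm25_select(pool_questions, test_question, k=8):
    """Retrieve the k pool exemplars most similar to the test question."""
    corpus = [q.lower().split() for q in pool_questions]
    bm25 = BM25Okapi(corpus)
    return bm25.get_top_n(test_question.lower().split(), pool_questions, n=k)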
In addition, we also compare with PromptPG (Lu et al., 2023), a dynamic example-selection baseline.

We adopt the same setting as ours for PromptPG, where the number of training examples is 100, the size of the candidate pool is 100, and the backbone model is gpt-3.5-turbo. Further, we keep the same prompt format as the original chain-of-thought and ours. The other settings are consistent with those provided in their original code. The results are shown in Table 6 and indicate that Automate-CoT outperforms PromptPG.

C.3 Comparison with Clustering Methods
We further conduct additional experiments to compare Automate-CoT with methods that select demonstration exemplars through clustering. We use K-Means as the clustering method and create k clusters according to the number of exemplars specified in Table 5. We then use these k representative exemplars as the demonstration exemplars to prompt the language models. The results are shown in Table 6. They indicate that clustering-based methods can select examples with different semantic meanings and generally perform better than Manual-CoT. However, complexity and diversity are overlooked: for example, most of the selected few-shot exemplars on GSM8K have around 3-4 hops, so complex and moderately difficult questions are missed. As a result, clustering generally performs worse than Automate-CoT, with a 2.6% gap.
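A sketch of this clustering baseline (the sentence encoder choice is our assumption; any question embedding would do):

import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def kmeans_select(pool_questions, k=8):
    """Pick one representative exemplar per cluster (closest to centroid)."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    emb = encoder.encode(pool_questions)
    km = KMeans(n_clusters=k, n_init=10).fit(emb)
    picks = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(emb[idx] - km.cluster_centers_[c], axis=1)
        picks.append(pool_questions[idx[dists.argmin()]])
    return picks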

C.4 Variance Report
Since Automate-CoT's results in Table 1 are averaged over three runs, we also report the variance in Table 7.

E Additional Analysis
We list some additional analyses here that could not be included in the main sections because of the page limit.

E.1 Effects of Several Tricks
Previous studies have found that tricks like adding "Let's think step by step." before each rationale chain and replacing "Q:" with "Question:" (Fu et al., 2023; Kojima et al., 2022) can boost performance on top of Manual-CoT. Following their settings, we also test Automate-CoT with these tricks on GSM8K as an additional experiment. With the tricks, Automate-CoT further boosts accuracy to 69.8% (+2.2%) under the normal setting and 83.0% (+0.6%) under the self-consistency setting.
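Concretely, the two tricks amount to a small change in prompt assembly (a sketch; the exemplar dict fields are our own naming):

def build_prompt(exemplars, test_question):
    """Assemble a CoT prompt with the two tricks applied."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Question: {ex['question']}\n"                      # "Q:" -> "Question:"
            f"A: Let's think step by step. {ex['chain']}"        # prepend the trigger
        )
    parts.append(f"Question: {test_question}\nA: Let's think step by step.")
    return "\n\n".join(parts)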

F Exact Match Number over Each Hop
The exact match number over each hop in Figure 1 is reported in Table 9.

Q: The output of a factory was increased by 10% to keep up with rising demand. To handle the holiday rush, this new output was increased by 20%. By approximately what percent would the output now have to be decreased in order to restore the original output? Answer Choices: (a) 20% (b) 24% (c) 30% (d) 32% (e) 79%
A: Let the initial output be O. After a 10% increase it will be 1.1O, and after a 20% increase on this new output the latest output will be 1.1O * 1.20 = 1.32O. Now we have to decrease the output by some percentage so that the new output is the same as the starting output (O). So, 1.32O * (1 - x/100) = O => x = 24.24%. So, the answer will be (b). The answer is (b).

Q: In a graduate physics course, 70 percent of the students are male and 30 percent of the students are married. If two-sevenths of the male students are married, what fraction of the male students is single? Answer Choices: (a) 2/7 (b) 1/3 (c) 1/2 (d) 2/3 (e) 5/7
A: Let us assume there are 100 students, of which 70 are male and 30 are female. If 30 are married then 70 will be single. Now it is given that two-sevenths of the male students are married, which means 2/7 of 70 = 20 males are married. If 30 is the total number of married students and 20 of them are males, then the remaining 10 will be married females. Total males = 70, married males = 20, so single males = 70 - 20 = 50. We need the fraction of male students who are single, i.e., single male students / total male students = 50/70 = 5/7. The answer is (e).

A: The answer should be something that John would be unable to hold his tools with if he injured it. Of the above choices, only hands make sense. So the answer is (e).

Q: What is required to be good at playing tennis? Answer Choices: (a) tennis racket (b) skill (c) competition (d) tennis court (e) legs
A: The answer must be something that is required to be good at playing tennis. Of the above choices, only skill is required to be good at playing tennis. So the answer is (b).

Q: What is someone who gets angry after getting drunk likely to participate in? Answer Choices:

A: The answer must be a type of sensor that would set off an alarm when something is near. Of the above choices, only proximity sensor makes sense. So the answer is (d).

Q: The student needed to get some new pencils, where did he go? Answer Choices: (a) store (b) classroom (c) backpack (d) cabinet (e) teacher
A: The answer should be the place where pencils are sold. Of the above choices, only store makes sense. So the answer is (a).

Table 12: One example of selected model-generated exemplars with rationale chains. This set of exemplars is trained and selected on CommonsenseQA. Note that there are newlines between the answer choices, which are omitted in the table to save space.

Table 15: One example of selected exemplars with rationale chains. This set of exemplars is trained and selected on OpenBookQA.
Figure 2: Illustration of our proposed approach. The left and middle parts of the figure show the two steps of our method: (1) Augment and Prune and (2) Select. The right part illustrates the training stage (top) and the inference stage (bottom), respectively.

Figure 3: Comparison between Random Selection, Manual-CoT, and Automate-CoT on six datasets.

Figure 4: The performance across different pool sizes of Automate-CoT compared with Manual-CoT. Pool size refers to the number of exemplars in the pool.
A: Ali started with 180 seashells. He gave away 40 to his friends and 30 to his brothers. That means he gave away 40 + 30 = 70 seashells. He has 180 - 70 = 110 seashells left. He sold half of them, so he has 110 / 2 = 55 seashells left. The answer is 55.

Q: Marvin and Tina were selling candy bars to help fund their class trip. The candy bars cost $2 each. Marvin sold 35 candy bars total. Tina sold three times the number of candy bars as Marvin. How much more money did Tina make for the class trip selling candy bars compared to Marvin? (5-hop)
A: Marvin sold 35 candy bars. Tina sold 3 times as many, so she sold 35 * 3 = 105 candy bars. Each candy bar cost $2, so Marvin made 35 * $2 = $70. Tina made 105 * $2 = $210. The difference is $210 - $70 = $140. The answer is 140.
(a) violence (b) falling down (c) vomiting (d) vomiting (e) lower standards
A: The answer should be something that someone who gets angry after getting drunk is likely to do. Of the above choices, the closest answer is violence. So the answer is (a).

Q: James felt himself losing consciousness. He was frightened. He didn't want this. He was scared of what? Answer Choices: (a) death (b) coma (c) peace (d) forgetfulness (e) loss of life
A: The answer should be something that James would be scared of. Of the above choices, the closest thing is death. So the answer is (a).

Q: Sally lost her kite because she wasn't careful. She thought that there was more string on the spool, but it slipped out of her fingers when she reached the what? Answer Choices: (a) child's hand (b) the last straw (c) hobby shop (d) (e) toy store
A: The answer should be the end of the string. Of the above choices, the end of the line makes the most sense. So the answer is (d).

Q: The sensor would judge the distance then set off an alarm; the installation expert explained it was called what kind of sensor? Answer Choices: (a) near (b) closeness (c) here (d) proximity (e) this

Q: As a car approaches you in the night (a) the headlights become more intense (b) the headlights recede into the dark (c) the headlights remain at a constant (d) the headlights turn off
A: The headlights become more intense as the car approaches. The answer is (a).

Q: Many animals that give birth to live young have (a) gills (b) scales (c) exoskeletons (d) legs
A: Animals that give birth to live young are mammals. Mammals have hair and give milk to their young. The answer is (a).

Q: A person is lost in a dense forest and needs to find their home. They know their home is to the south, and they are headed north. They can find home by using a (a) northern-directing device (b) northern light reader (c) northeastern winds (d) north central credit
A: The person needs to find south, so they need a northern-directing device. The answer is (a).

Q: When the weather changes as it does from Christmas to Easter, (a) the air may chill (b) the ground may freeze (c) the plants may die (d) the ground may warm
A: The weather changes from Christmas to Easter, so the ground may warm. The answer is (d).

Table 1: The performance of Automate-CoT and the comparison against existing models on eleven downstream tasks. Manual-CoT and SC represent the chain-of-thought (Wei et al., 2022b) and self-consistency (Wang et al., 2023) methods. Bold denotes the best among code-davinci-002-based methods and underline denotes the best among text-davinci-002-based methods. *: Prior Best is the best performance before CoT came out. a: Cobbe et al. (2021).


Table 4: Comparison of different permutation orders of Automate-CoT's selected exemplars.
Now we have 16 exemplars, say S = [A1, A2, B1, B2, ..., H1, H2], where A-H represent the No. 1-8 exemplars, subscript 1 represents the originally selected exemplars, and subscript 2 represents the edited ones. Then, Automate-CoT selects 8 exemplars from the previous 16. Note that we limit Automate-CoT to select exactly one of [A1, A2], one of [B1, B2], and so on, and keep the same order A-H. Subsequently, when we perform the Automate-CoT algorithm, we observe that Automate-CoT is able to successfully select the original exemplars S1. Furthermore, we find that the selected exemplars outperform the non-selected exemplars by 2%.
Style Definition: Another limitation of this work is that it does not provide a rigorous definition of what constitutes good versus bad linguistic style. While we have observed several patterns of good and bad style during numerous experiments, and the results show that Automate-CoT is able to mitigate the style sensitivity of Manual-CoT, we cannot determine what a perfect style entails. As such, we acknowledge that defining what constitutes good versus bad linguistic style is a challenging task and an important area for further exploration and development.

Algorithm 1: The black-box optimization procedure. Require: input batch S, label batch Y, parameters of the categorical distributions p_1, ..., p_n, prediction model G, loss function L. For each optimization step, for k = 1, ..., I: sample a prompt T^(k) ~ P(T) and compute the loss L(G([T^(k), S]), Y); then estimate the gradient as in the VR-PGE equation above and update each p_i by projected gradient descent.

Table 5: Dataset statistics. #EX.: the number of few-shot chain-of-thought exemplars used to prompt each task. #EVAL.: the number of evaluation data. EVAL. SPLIT: evaluation split. TRANSFERRED: a checkmark means that the exemplars are generated and trained on other datasets and then applied to this task. ♣: SingleOp is a subset of MAWPS (Koncel-Kedziorski et al., 2016). ♦: CSQA, StrategyQA, and SST-2 do not have publicly available test set labels, so we simply follow the setting of Wei et al. (2022b) and Wang et al. (2022) to evaluate performance on the validation set. ♥: Following Wang et al. (2022), we evaluate the first 1,000 data points for a fair comparison.

Table 6: The overall performance of Automate-CoT under gpt-3.5-turbo and the comparison with retrieval-based and clustering-based exemplar selection methods.

Table 7: The variance of the results in Table 1 over 3 runs. (SC) denotes the self-consistency setting.
It is observed that Automate-CoT achieves quite a low variance, especially compared to the large variance of Manual-CoT shown in §2 Motivation.

D Additional Comparison with Fine-tuning

Since our method uses a training-based pipeline, we also compare it with fine-tuning large language models in terms of the number of parameters, training cost, estimated total training cost, and required training set size. As shown in the study of Cobbe et al. (2021), fine-tuning GPT-3 requires thousands (e.g., 8,000) of training examples to be effective, while Automate-CoT only needs 100 training examples. In addition, fine-tuning has a larger training and inference cost than Automate-CoT because it not only requires a one-off fine-tuning cost but also has a higher unit price for subsequent usage. For Automate-CoT, under the gpt-3.5-turbo setting, direct usage is $0.0015 / 1K tokens for input and $0.002 / 1K tokens for output. With 3 training epochs, a training set size of 100, a validation set size of 100, an input length of around 750 tokens, and an average output length of 150 tokens, it takes about (750/1000 · 0.0015 + 150/1000 · 0.002) · 100 · 10 · 3 + (750/1000 · 0.0015 + 150/1000 · 0.002) · 100 · 3 = $4.7. For fine-tuning, given that the training price of gpt-3.5-turbo is $0.008 / 1K tokens, and supposing the average length of training examples is 300 tokens, training a whole training set of 8,000 examples for 3 epochs takes about 300/1000 · 8000 · 3 · 0.008 = $57.6, which costs 12x more than Automate-CoT. It is also worth noting that subsequent usage of the fine-tuned gpt-3.5-turbo is $0.012 / 1K tokens for input and $0.016 / 1K tokens for output, while Automate-CoT retains the normal cost, which is 8x less than fine-tuning.
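The arithmetic above can be reproduced with a few lines (using the stated prices and token counts):

# Cost estimates from the text (prices per 1K tokens).
in_rate, out_rate = 0.0015, 0.002            # gpt-3.5-turbo usage
per_call = 750 / 1000 * in_rate + 150 / 1000 * out_rate   # ~$0.001425 per query
automate_cot = per_call * 100 * 10 * 3 + per_call * 100 * 3
print(f"Automate-CoT: ${automate_cot:.1f}")  # -> $4.7

ft_rate = 0.008                              # fine-tuning price per 1K tokens
fine_tune = 300 / 1000 * 8000 * 3 * ft_rate
print(f"Fine-tuning: ${fine_tune:.1f}")      # -> $57.6 (about 12x more)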

Table 8: Comparison between fine-tuning and Automate-CoT on GSM8K. The cost is copied from the OpenAI official website.

Q: Ralph is going to practice playing tennis with a tennis ball machine that shoots out tennis balls for Ralph to hit. He loads up the machine with 175 tennis balls to start with. Out of the first 100 balls, he manages to hit 2/5 of them. Of the next 75 tennis balls, he manages to hit 1/3 of them. Out of all the tennis balls, how many did Ralph not hit?

A: James earns $20 an hour at his main job. He earns 20% less at his second job; that means he earns $20 - 20% of $20 = $20 - $4 = $16 an hour at his second job. He works 30 hours at his main job and half that much at his second job, so he works 30 / 2 = 15 hours at his second job. In total he works 30 + 15 = 45 hours. He earns $20 an hour at his main job and $16 an hour at his second job. In total he earns $20 * 30 + $16 * 15 = $600 + $240 = $840. The answer is 840.

Q: Leah earned $28 working odd jobs around the neighborhood. She spent a seventh of it on a milkshake and put half of the rest in her savings account. She left the remaining money in her wallet. Her dog got ahold of her wallet and shredded all the money inside but $1. How many dollars did Leah lose?
A: Leah earned $28. She spent a seventh of it on a milkshake, so she spent $28 / 7 = $4 on the milkshake. She put half of the rest in her savings account, so she put ($28 - $4) / 2 = $12 in her savings account. She left the remaining money in her wallet and lost all the money in her wallet except $1. So she lost $28 - $4 - $12 - $1 = $11. The answer is 11.

Q: Sam and Jeff had a skipping competition at recess. The competition was split into four rounds. Sam completed 1 more skip than Jeff in the first round. Jeff skipped 3 fewer times than Sam in the second round. Jeff skipped 4 more times than Sam in the third round. Jeff got tired and only completed half the number of skips as Sam in the last round. If Sam skipped 16 times in each round, what is the average number of skips per round completed by Jeff?

Q: Ali had a collection of seashells. He started with 180 seashells. He then gave away 40 seashells to his friends. He also gave 30 seashells to his brothers. If he sold half of the remaining seashells, how many seashells did he have left?

Table 10: One example of selected model-generated exemplars with rationale chains of average hops = 5.4. This set of exemplars is trained and selected on GSM8K and transferred to other arithmetic reasoning tasks.

Q: A person can walk at a constant rate of 8 mph and can bike at a rate of 16 mph. If he wants to travel 64 miles in 8 hours using bike and walking at their constant rates, how much distance would he require to walk? Answer Choices:

Table 11: One example of selected model-generated exemplars with rationale chains. Note that there are newlines between the answer choices, which are omitted in the table to save space.

Q: John was punching and punching at the wall but succeeded only in bloodying his knuckles. This was bad. He would be unable to hold his tools if he injured what? Answer Choices: (a) hostility (b) anger (c) nose (d) fists (e) hands