Democratizing Reasoning Ability: Tailored Learning from Large Language Model

Large language models (LLMs) exhibit impressive emergent abilities in natural language processing, but their democratization is hindered by their huge computation requirements and closed-source nature. Recent research on advancing open-source smaller LMs by distilling knowledge from black-box LLMs has obtained promising results in instruction-following ability. However, the reasoning ability, which is more challenging to foster, is relatively rarely explored. In this paper, we propose a tailored learning approach to distill such reasoning ability to smaller LMs and facilitate the democratization of this exclusive ability. In contrast to merely employing the LLM as a data annotator, we exploit its potential as a reasoning teacher by building an interactive multi-round learning paradigm. This paradigm enables the student to expose its deficiencies to the black-box teacher, which can then provide customized training data in return. Further, to exploit the reasoning potential of the smaller LM, we propose self-reflection learning to motivate the student to learn from self-made mistakes. Learning from self-reflection and from the LLM are both tailored to the student's learning status, thanks to their seamless integration with the multi-round learning paradigm. Comprehensive experiments and analysis on mathematical and commonsense reasoning tasks demonstrate the effectiveness of our method. The code will be available at https://github.com/Raibows/Learn-to-Reason.


Introduction
Large language models (LLMs) with emergent abilities have achieved remarkable success across a wide range of tasks, deeply changing the landscape of both research and applications in natural language processing (Brown et al., 2020; Chen et al., 2021; Chowdhery et al., 2022; OpenAI, 2023). Wei et al. (2022a,b) argue that emergent abilities, particularly in reasoning, only exist in LLMs whose parameters commonly exceed 100B. Nevertheless, a line of research (Touvron et al., 2023a,b; Taori et al., 2023; Zeng et al., 2023) has indicated that smaller LMs with about 7B parameters after supervised fine-tuning, such as Vicuna (Chiang et al., 2023), can be comparable to LLMs in following human instructions, while still falling short in reasoning. In this paper, we aim to harness the untapped reasoning potential of smaller LMs to democratize this important emergent ability.

Chain-of-Thought (CoT) prompting elicits LMs to generate intermediate reasoning steps (i.e., a rationale) before reaching the final answer, significantly improving complex reasoning ability (Wei et al., 2022b; Kojima et al., 2022a; Chung et al., 2022; Wang et al., 2023a). However, it is challenging to prompt smaller LMs to generate reasoning steps, since such ability appears to be exclusive to LLMs (Wei et al., 2022a,b; Chowdhery et al., 2022), which indicates the necessity of utilizing data annotated with rationales to cultivate smaller LMs' reasoning ability. Unfortunately, most existing reasoning datasets lack high-quality rationale annotations, and manually labeling them can be costly. Inspired by the success of collecting instruction data from LLMs (e.g., ChatGPT) for instruction tuning smaller LMs (Wang et al., 2023b; Taori et al., 2023; Touvron et al., 2023a,b), we propose to leverage the rationales generated by LLMs to train smaller LMs to learn to use CoT for reasoning.
Recently, teaching smaller LMs to reason with the help of LLMs has gained increasing attention. Most of these works (Ho et al., 2023; Magister et al., 2023; Fu et al., 2023b; Shridhar et al., 2023) can be summarized in two main steps: (1) employing LLMs to generate rationales for annotating the training data; (2) fine-tuning smaller LMs on these data to enable reasoning with CoT. This approach can be viewed as a distant variant of black-box knowledge distillation (Jianping et al., 2021). However, these methods only employ LLMs to annotate data for training smaller LMs, without leveraging the smaller LMs to assist the LLMs in return. As a consequence, the LLMs are not aware of the weaknesses of the smaller LMs, which prevents them from exercising their powerful ability to analyze and provide targeted feedback, and undermines the effectiveness of the reasoning distillation.
To this end, we propose a multi-round interactive learning paradigm to exploit the potential of the black-box LLM as a reasoning teacher. In each round of learning, the student (i.e., the smaller LM) first provides its learning status to the teacher LLM, which can then provide customized rationales as feedback to the student. The data annotated with these rationales serves as our customized training data. Such a paradigm is natural, as it is in line with how we human beings learn from teachers.
Beyond learning from the teacher, another crucial paradigm for human learning lies in self-reflection on self-made mistakes. In parallel, recent studies (Huang et al., 2022; Shinn et al., 2023; Madaan et al., 2023; Pan et al., 2023) have shown that LLMs can self-improve by reflecting on their own mistakes. Therefore, we exploit the reasoning potential of the smaller LM by eliciting it to self-reflect on its mistakes. These mistakes can complement the correct rationales collected from the teacher LLM to teach the student LM to distinguish bad and good reasoning steps, thereby enhancing its reasoning ability.
Putting them together, as briefly presented in Fig. 1, we propose a tailored multi-round learning paradigm based on the student's learning status and deficiencies, including learning from the LLM's customized training data and from self-reflection. In summary, our contributions are three-fold: 1) A multi-round learning paradigm is introduced to enable the student LM to provide feedback to the teacher LLM, which can then offer customized training data in response, building an interaction between the smaller LM and the black-box LLM.
2) We propose self-reflection learning that motivates the student to learn from mistakes. Together with learning from customized training data, it can be seamlessly integrated into the multi-round learning paradigm.
3) Experiments and analysis on mathematical and commonsense reasoning tasks demonstrate the effectiveness of our method in distilling the reasoning ability from LLMs to smaller LMs.

Related Work
Emergence in LLM LLMs show emergent abilities across a wide range of NLP tasks (Brown et al., 2020; Chowdhery et al., 2022; Wei et al., 2022a,b; OpenAI, 2023), among which the reasoning ability is the most noteworthy, as it requires the model to perform multi-hop reasoning like human beings. Smaller LMs (< 100B) are often considered to fall significantly short in reasoning, highlighting the superiority of LLMs in this aspect (Wei et al., 2022a). In this paper, we aim to democratize such emergent reasoning ability to smaller LMs.
CoT Prompting CoT prompts LMs to solve reasoning tasks by generating intermediate rationales on the way to the answer, which has greatly improved reasoning performance (Wei et al., 2022b; Kojima et al., 2022b; Wang et al., 2023a). However, according to the reasoning performance curve (Wei et al., 2022a), the CoT reasoning performance of smaller LMs is far from satisfactory, since the generation of rationales is challenging for them.

Distilling Knowledge from LLM Fine-tuning smaller LMs to follow instructions with high-quality data collected from LLMs shows the feasibility of distilling knowledge from LLMs (Taori et al., 2023; Chiang et al., 2023; Xu et al., 2023). This procedure can also be viewed as a distant variant of black-box distillation (Hinton et al., 2015; Jianping et al., 2021). However, these works aim to improve the instruction-following ability of smaller LMs, while the reasoning ability that we focus on is often overlooked. Some recent studies (Ho et al., 2023; Fu et al., 2023b; Shridhar et al., 2023) propose to employ LLMs to annotate rationales for training smaller student LMs towards reasoning, without considering the student's feedback to the teacher. In contrast, we exploit the potential of the black-box LLM as a teacher rather than a data annotator by proposing a multi-round learning paradigm. This paradigm enables mutual feedback between the LLM and the smaller LM, so the teacher LLM can offer training data tailored to the student LM's learning status. Besides, we propose self-reflection learning to motivate the student LM to learn from its mistakes.

Method
As shown in Fig. 2, we propose a multi-round learning paradigm that motivates the student LM and the teacher LLM to learn from each other's feedback in an interactive manner. Specifically, each round of learning consists of three key steps: (1) The student LM undergoes an "exam" on the training set to collect mistakes, i.e., the wrongly generated rationales. Existing works (Fu et al., 2023b; Ho et al., 2023; Shridhar et al., 2023; Magister et al., 2023) merely provide the sample question for the LLM to collect annotated rationales, neglecting the importance of the student's feedback, even though such feedback is crucial in knowledge distillation (Fu et al., 2021; Pham et al., 2021; Ren et al., 2023). (2) Therefore, we curate a prompt integrated with the student's wrong rationale to ask the teacher LLM to generate customized feedback for the student. (3) In the last step, the student learns to reason via training on the tailored training data collected from the LLM and via self-reflection on its self-made mistakes. These steps are iterated until convergence to improve the reasoning ability of the student LM.

Undertaking an Exam
Given a dataset D_train = {(x, y)}, where x is the question and y is the answer, the correct rationale r is often not provided. During CoT inference, the input is the question x, and the student LM's generated output f(x) = [r̂, ŷ] is the concatenation of the generated rationale r̂ and the answer ŷ. The answer is often at the end of the output. The student LM undertakes an "exam" on the training set D_train to evaluate its learning status and to collect the mistakes D_neg, i.e., the samples with wrong rationales and answers:

D_neg = {(x, r̂, ŷ) | f(x) = [r̂, ŷ], ŷ ≠ y, (x, y) ∈ D_train}    (1)

For each question, we collect up to 4 wrong rationales through decoding with a sampling strategy. The collected mistake set D_neg, reflecting the student's learning status and weaknesses, is used for the following two purposes: (1) As the feedback for the teacher LLM to generate rationales tailored for the student.
(2) As the negative contrastive samples for the student to learn from self-reflection.
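To make this step concrete, here is a minimal sketch of the mistake-collection loop, assuming GPT-J as the student via Hugging Face Transformers; the temperature value and the `extract_answer` helper (sketched in Appendix A.2) are illustrative assumptions rather than specifics from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
student = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6b", torch_dtype=torch.float16
).cuda()

def collect_mistakes(question, gold_answer, num_samples=4, max_new_tokens=128):
    """Run the "exam" on one question: sample rationales and keep the wrong ones (D_neg)."""
    prompt = f"Question: {question}\nReasoning:"
    inputs = tokenizer(prompt, return_tensors="pt").to(student.device)
    outputs = student.generate(
        **inputs,
        do_sample=True,                    # sampling decoding for diverse rationales
        temperature=0.7,                   # illustrative value, not reported in the paper
        num_return_sequences=num_samples,  # up to 4 rationales per question
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    mistakes = []
    for seq in outputs:
        text = tokenizer.decode(seq[inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        if extract_answer(text) != str(gold_answer):  # only wrong rationales enter D_neg
            mistakes.append((question, text))
    return mistakes
```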

Student's Feedback to LLM
We expect the black-box LLM to be a reasoning teacher instead of a data annotator. Thus, we propose to provide the student's feedback to help the teacher LLM generate customized training data that effectively targets the student's weaknesses. In detail, we devise a prompt template T, shown in Fig. 3, which integrates both the question x and the student's feedback (i.e., the wrong rationale r̂). The student's feedback can not only (1) assist the teacher in identifying deficiencies in the student's reasoning, but also (2) serve as a wrong demonstration example that helps the LLM increase the chance of generating correct rationales. Besides, to improve the LLM's accuracy and reduce the cost of calling APIs, we follow Zelikman et al. (2022) by adding a hint that explicitly tells the LLM the golden answer to the question.
For each sample (x, r̂, ŷ) ∈ D_neg, we request the LLM with T(x, r̂, y) to generate 4 rationales, and only those containing correct answers are retained, since training with diverse reasoning paths can boost the reasoning performance of smaller LMs (Ho et al., 2023; Fu et al., 2023b). Each collected rationale, together with its question and answer, is denoted as (x, r, y) and extends the original data to the customized training data D_train.
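The request itself can be sketched as below. The paper's exact template wording lives in its Fig. 3 and Table 12, so the prompt string here is only a paraphrase, and the (legacy) OpenAI SDK call is our assumption about the client.

```python
import openai  # assumes the legacy (<1.0) openai SDK interface

def build_prompt(question, wrong_rationale, gold_answer):
    """Paraphrased version of the template T(x, r̂, y); see the paper's Fig. 3 for the real wording."""
    prompt = f"Question: {question}\n"
    if wrong_rationale is not None:  # student's feedback; omitted in the initial round
        prompt += f"A student's wrong answer: {wrong_rationale}\n"
    prompt += f"Hint: the correct answer is {gold_answer}.\n"
    prompt += "Please give the correct step-by-step reasoning and the final answer."
    return prompt

def collect_teacher_rationales(question, wrong_rationale, gold_answer, n=4):
    """Request n diverse rationales and retain only those reaching the gold answer."""
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": build_prompt(question, wrong_rationale, gold_answer)}],
        n=n,               # 4 diverse rationales per request
        temperature=1.0,   # illustrative; diversity matters here
    )
    rationales = [choice.message.content for choice in resp.choices]
    return [r for r in rationales if extract_answer(r) == str(gold_answer)]
```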

Tailored Learning
The reasoning ability of the student LM f can be improved via tailored learning from both self-reflection and the teacher's customized training data.
Learning from Self-Reflection We propose to learn from the mistakes D_neg to simulate the self-reflection process of humans, which can help the student LM to identify the quality of different rationales. This utilization can be defined in multiple forms (e.g., likelihood ranking); here we adopt a simple triplet loss to encourage the model to learn different representations for good and bad rationales. Specifically, the wrong reasoning path [x, r̂, ŷ] ∈ D_neg and the correct reasoning path [x, r′, y] ∈ D_train are utilized as the negative and positive contrastive samples, respectively. The hidden state of the last token is used as the representation of the whole reasoning path, denoted as h_x^(r,y). Finally, the self-reflection learning objective is defined as follows:

L_SR = max(cos(h_x^(r,y), h_x^(r̂,ŷ)) - cos(h_x^(r,y), h_x^(r′,y)) + ρ, 0)    (2)

where cos denotes the cosine similarity function, and ρ, set to 1.0, is the margin. (x, r, y) ∈ D_train is the anchor sample, whose positive and negative samples are randomly sampled from D_train and D_neg with the same question x, respectively.
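A minimal PyTorch sketch of our reading of this objective follows: a cosine-similarity triplet loss with margin ρ = 1.0 over last-token hidden states.

```python
import torch
import torch.nn.functional as F

def last_token_hidden(model, tokenizer, text):
    """Represent a whole reasoning path by the final-layer hidden state of its last token."""
    ids = tokenizer(text, return_tensors="pt").to(model.device)
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]  # shape: [hidden_size]

def self_reflection_loss(h_anchor, h_pos, h_neg, rho=1.0):
    """Triplet loss over cosine similarities, as we read Eq. (2)."""
    sim_pos = F.cosine_similarity(h_anchor, h_pos, dim=-1)
    sim_neg = F.cosine_similarity(h_anchor, h_neg, dim=-1)
    return torch.clamp(sim_neg - sim_pos + rho, min=0.0)
```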
Learning from Customized Feedback The LLM's generated rationales are tailored to the student's weaknesses, thanks to the student's previous feedback. These collected rationales are merged into the training set D_train as the customized feedback for the student, which is used to fine-tune the student LM f. In addition, we add several fixed demonstrations "demo" listed in Table 15 to the prefix of each input sample, since recent research (Min et al., 2022; Zelikman et al., 2022; Fu et al., 2023b) has shown that training with demonstration examples can improve the in-context learning ability of LMs. The training objective is as follows:

L_LM = -E_{(x,r,y)∼D_train} [ log P_f([demo, x, r, y]) ]    (3)

where the square brackets represent string concatenation. This process can directly help the student LM learn to generate intermediate reasoning steps and master the CoT skill.
Algorithm 1: Multi-round tailored learning
Require: the student LM f, the teacher LLM, the training data D_train, the template T in Fig. 3
1: Initialize f_0 with pre-trained weights and set the learning round count t ← 0
2: repeat
3:   t ← t + 1
4:   Infer on D_train with f and collect the mistakes (x, r̂, ŷ) ∈ D_neg by Eq. (1)
5:   if t ≤ 1 then
6:     Collect the rationale r for each sample of D_train from the teacher LLM with T(x, null, y)
7:   else
8:     Collect the rationale r for each sample of D_neg from the teacher LLM with T(x, r̂, y)
9:   end if
10:  Optimize the weights of f_t using Eq. (4)
11: until convergence

Joint Learning The final optimization incorporates the learning from both self-reflection and the LLM's customized feedback. The contrastive learning loss in Eq. (2) and the language modeling loss in Eq. (3) are combined as follows:

L = L_LM + λ · L_SR    (4)

where λ controls the impact of self-reflection learning, balancing the two learning objectives.
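A sketch of the combined objective, reusing the helpers above; the `sample` field names are illustrative, and the demonstration prefix is plain string concatenation as in Eq. (3).

```python
def joint_loss(model, tokenizer, demo, sample, lam=0.5):
    """L = L_LM + lambda * L_SR, our reading of Eq. (4)."""
    x, r, y = sample["question"], sample["rationale"], sample["answer"]
    text = f"{demo}Question: {x}\nReasoning: {r}\nAnswer: {y}"  # [demo, x, r, y]
    ids = tokenizer(text, return_tensors="pt").to(model.device)
    lm_loss = model(**ids, labels=ids["input_ids"]).loss        # Eq. (3): causal LM loss

    # contrastive paths sharing the same question x, sampled upstream
    h_anchor = last_token_hidden(model, tokenizer, text)
    h_pos = last_token_hidden(model, tokenizer, sample["positive_path"])
    h_neg = last_token_hidden(model, tokenizer, sample["negative_path"])
    sr_loss = self_reflection_loss(h_anchor, h_pos, h_neg)      # Eq. (2)

    return lm_loss + lam * sr_loss
```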

Multi-round Learning
As depicted in Fig. 2, we adopt a multi-round learning paradigm to iteratively cultivate the reasoning ability of the student LM. Multiple rounds of learning keep the teacher LLM updated on the student's learning status, allowing it to offer more customized training data. Based on the student's learning status, the customized training data and self-made mistakes are adjusted in each round and tailored to the student's specific deficiencies.
The untrained student LM has nearly no reasoning ability, resulting in noisy generations that are unhelpful as feedback to the teacher LLM. Consequently, to prepare the data required by the initial round, we directly request the teacher LLM to generate rationales for the entire training set, excluding the noisy feedback from the student. In the subsequent rounds, we adhere to the procedures outlined in Sections 3.1 to 3.3: (1) the student LM takes an "exam" to reveal its deficiencies and collect mistakes; (2) the teacher LLM is requested to generate customized training data based on the student's feedback; (3) the student is trained via learning both from self-reflection and from the teacher's customized feedback. These steps are repeated until the student's performance reaches a plateau. The whole paradigm is summarized in Algorithm 1.
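Putting the three steps together, the round loop can be sketched as follows; `fine_tune`, `exam`, `evaluate`, and `teacher_request` are hypothetical wrappers around the pieces sketched earlier, and the plateau check is a simplification of the early-stopping suggestion discussed later.

```python
def multi_round_learning(student, teacher_request, train_set, max_rounds=4):
    """Sketch of Algorithm 1. train_set holds (question, answer) pairs."""
    # Initial round: rationales for the entire training set, with no student feedback
    d_train = [(x, r, y) for x, y in train_set for r in teacher_request(x, None, y)]
    best_acc = 0.0
    for _ in range(max_rounds):
        fine_tune(student, d_train)           # joint learning with Eq. (4)
        d_neg = exam(student, train_set)      # step (1): collect wrong rationales
        # step (2): customized rationales only for the mistaken samples
        d_train += [(x, r, y) for x, wrong, y in d_neg
                    for r in teacher_request(x, wrong, y)]
        acc = evaluate(student, train_set)    # step (3) happened inside fine_tune
        if acc <= best_acc:                   # stop at the performance plateau
            break
        best_acc = acc
    return student
```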

Tasks & Datasets
Mathematical Task We adopt three math word problem datasets to evaluate mathematical reasoning ability. GSM8K is a primary school level mathematical dataset (Cobbe et al., 2021). MultiArith is a multi-step arithmetic reasoning dataset (Roy and Roth, 2015). SVAMP is created by applying chosen variations over examples sampled from existing datasets (Patel et al., 2021).

Commonsense Task
We use two closed-ended question answering datasets to evaluate commonsense reasoning ability. CSQA (Talmor et al., 2019) is a multi-choice commonsense question answering dataset. StrategyQA (Geva et al., 2021) implicitly requires reasoning steps and strategies to answer yes-no questions.

Models & Baselines
Models Following previous works (Ho et al., 2023; Zelikman et al., 2022; Hu et al., 2023), we mainly utilize the publicly available GPT-J (Wang and Komatsuzaki, 2021), which has about 6B parameters, as our student LM. Considering the pricing and availability, we select ChatGPT, a popular black-box 175B LLM provided by OpenAI, as our teacher LLM.
Baselines To demonstrate the effectiveness of our method, we compare with the following baselines: (1) the teacher LLM and the student LM (w/o fine-tuning), to show the effectiveness of distilling reasoning ability from the LLM. (2) Methods without the help of LLMs, including the student fine-tuned to directly generate answers without rationales, and STaR (Zelikman et al., 2022), which self-iteratively trains the LM to generate rationales and answers with very few annotated data. These are compared to highlight the importance of high-quality rationales in teaching smaller LMs. (3) Three concurrent works which all use LLMs to help train smaller LMs to reason: an LM fine-tuned on CoT data (Magister et al., 2023), Specializing smaller LMs for mathematical reasoning (Fu et al., 2023b), and the LLM-Adapter (Hu et al., 2023) tuned on CoT data. (4) Our one-round distillation method, to demonstrate the superiority of the proposed multi-round learning paradigm.

Experimental Setup
The student is fine-tuned with a learning rate of 1e-6 for 10 epochs using AdamW (Loshchilov and Hutter, 2019) by default. Without any heavy tuning, λ in Eq. (4) is set to 0.5 to control the impact of self-reflection. A CoT prompt accompanied by a fixed 3-shot demonstration is used for most datasets to balance efficiency and performance. Some prompts follow previous research (Zelikman et al., 2022). We use greedy decoding to generate the rationale and answer for evaluation. More implementation details are in Appendix A.
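For reference, these defaults translate roughly to the following PyTorch setup (a sketch; the batch size, scheduler, and DeepSpeed settings are not specified here):

```python
from torch.optim import AdamW

optimizer = AdamW(student.parameters(), lr=1e-6)  # default learning rate reported above
num_epochs = 10                                   # default number of fine-tuning epochs
lam = 0.5                                         # weight of the self-reflection term in Eq. (4)
```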

Main Results
The evaluation results are presented in Table 1.

Effect of Distillation
From the results of the smaller LM with and without distillation, it is evident that the reasoning performance of the smaller LM can be significantly improved by distilling the reasoning ability from the LLM. Although the student LM falls short in mathematical reasoning, it achieves performance comparable to the teacher LLM in commonsense reasoning while being 20x smaller in size.
Importance of Rationales CoT can significantly improve reasoning performance, which shows the necessity of high-quality rationales in teaching smaller LMs. Though STaR performs well on CSQA, which often involves only single-step reasoning, its self-generated rationales encounter difficulties when applied to other multi-step reasoning tasks. Conversely, nearly all distillation methods beat STaR in mathematical reasoning, which indicates that the LLM's generated rationales can often better guide the smaller LM to reason.
Comparison with Concurrent Works Compared to concurrent distillation works (Hu et al., 2023; Fu et al., 2023b; Magister et al., 2023), our method consistently achieves better performance across all datasets, which demonstrates the success of customized feedback from the black-box LLM. For GSM8K, in contrast to Specializing, which trains an 11B model with 130k rationales, our method yields better performance with a 6B model and only 54k rationales, significantly reducing the cost of model training and data collection.
Effect of Multi-round & Self-reflection Compared with our one-round distillation approach, multi-round learning leads to an average improvement of 5.1 in accuracy, indicating the success of building the interaction between teacher and student. Besides, self-reflection learning can further exploit the reasoning potential of the student LM. Another advantage is that self-reflection can be seamlessly integrated into multi-round learning, pushing the boundaries of reasoning.

Contribution of Student's Feedback
To validate the contribution of the student's feedback to the LLM, an ablation study is conducted by removing this feedback from the requesting prompt template (Fig. 3). Results in Table 3 show that the student's feedback can first help the teacher LLM generate more accurate and tailored rationales (larger # Success), which is in turn beneficial to the student's learning (higher Accuracy). Note that, cooperating with our multi-round learning paradigm, the cumulative gains of the student's feedback can be substantial. Further, we present a case study of the teacher LLM's generated rationales in Table 2, which shows that the LLM often responds with improved rationales when the student's feedback is taken into account. An excerpt from Table 2 (StrategyQA):

Student's Feedback: Haiku is a form of poetry that has three lines and a 5-7-5 syllable count. A tweet is 140 characters long. So, the answer is No.

Teacher's Response: A haiku can be written in a single tweet if it follows the 5-7-5 syllable pattern within 140 characters. For example, "Autumn moonlight. A worm digs silently into the chestnut. Dewdrops on a leaf." The poem is considered a Haiku even if it is shorter, so it's possible to fit a whole Haiku in a single tweet. So, the answer is Yes.

Teacher's Response w/o Feedback: A Haiku is a traditional form of poetry that is comprised of three lines. The first line is five syllables, the second line is seven syllables and the last line is five syllables. So, one cannot write a whole Haiku in a single tweet as it exceeds the character limit. So, the answer is No.
For StrategyQA, the teacher LLM even gives a counterexample to the student's wrong answer, indicating that the LLM can provide customized training data based on the student's feedback.

Effect of Self-Reflection
First, to intuitively understand the effect of self-reflection learning, Fig. 4 visualizes the latent space representations of generated rationales. It shows that self-reflection effectively clusters correct rationales and wrong ones separately, helping the model distinguish between them. Moreover, we compare the distance and preference differences in Table 4, which indicates that self-reflection contributes to aligning the preference of the student LM with correct reasoning paths while moving it away from self-made wrong ones. Fig. 5 illustrates the effect of self-reflection learning on reasoning performance. The observation is consistent with the findings in Table 1 that self-reflection learning can help improve reasoning ability when λ < 0.5. However, excessive emphasis on self-reflection learning (i.e., a larger value of λ) typically leads to poorer performance and instability, especially on the MultiArith dataset. We conjecture that it has a negative impact on the learning of the teacher's training data.
To verify the above hypothesis, we plot the loss curve in Fig. 6. It shows that excessive emphasis on self-reflection learning (higher λ) can result in underfitting of the teacher's training data within a limited number of training steps. Consequently, the reasoning performance of the student LM could be significantly decreased due to not being fully converged. In general, a small value of λ is preferred to achieve a balanced learning approach that incorporates both the teacher's rationales and self-made mistakes.

Analysis of Multi-round Learning
We examine each learning round of the student LM, as detailed in Table 5. The error rate and accuracy typically decrease and increase, respectively, as the learning rounds progress. This is because each round of learning aims to enhance the student LM in solving the questions that were not learned well in the previous round. Additionally, inspired by recent research on employing the LLM as an evaluator (Chiang and Lee, 2023; Fu et al., 2023a; Liu et al., 2023), we instruct GPT-4 (OpenAI, 2023) to automatically evaluate the quality of generated rationales. From the results in Table 6, we find that the quality of both the generated correct rationales and the wrong ones is enhanced as the learning rounds progress. However, the gains in reasoning performance reach a plateau after several rounds of training. This can be attributed as follows: (1) For GSM8K, the most challenging task, the student is reaching its capacity after 3 rounds of learning, still not performing well (49.2 ER). (2) For SVAMP and CSQA, relatively easy tasks, the student achieves good performance on the training set after the 2nd round, leading to a small ER. Consequently, the prepared data for the next round will be relatively scarce, which is unlikely to further help improve the student.
We conduct a 4th round of learning on GSM8K to justify the above analysis, where the ER remains unsatisfactory (51.8 ER) despite a marginal improvement (+1.4 ∆) in accuracy. Besides, the results of the 3rd round on the SVAMP and CSQA datasets show that there are no more gains after the 2nd round. Thus, we suggest early stopping in multi-round learning once the student nearly reaches its plateau. By prior estimation of the task difficulty and observing the performance gains in each round, we can avoid excessive parameter tuning on the number of learning rounds and balance reasoning performance against training costs.

Feasibility Study
To further benefit the community concerned with individually affordable computation resources, we conduct a feasibility study using different LMs spanning from 760M to 2.7B parameters. The tested models cover two common LM architectures, i.e., encoder-decoder and decoder-only. The results shown in Table 7 first suggest that the reasoning abilities of these small LMs can all be enhanced with the proposed self-reflection learning.

Table 7: Results of our method with various LM sizes. "760M", "770M", "1.3B" and "2.7B" refer to T5-Large (Raffel et al., 2020), GPT-2 Large (Radford et al., 2019), OPT-IML (Iyer et al., 2023) and GPT-Neo (Gao et al., 2020; Black et al., 2021), respectively. The indentation means the modifications are based on the up-level indentation.

With self-reflection, student LMs often achieve satisfying performance with just one round of learning on commonsense tasks. Moreover, we find that our multi-round learning can generally further improve performance in mathematical reasoning. However, there are no more gains on StrategyQA, as it heavily relies on the memorization of commonsense knowledge mostly acquired during pre-training rather than on complex reasoning. Another piece of evidence is that increasing the model size does not seem to contribute to performance on this dataset. Besides, the relatively limited capacity of these smaller LMs may also restrict the gains from additional rounds of learning.

Conclusion
In this paper, we propose a tailored learning approach to cultivate the reasoning ability of smaller LMs, aiming to democratize the emergent reasoning ability of LLMs. First, we propose a multi-round interactive learning paradigm that enables the teacher LLM to provide customized training data according to the student's feedback. Next, we propose self-reflection learning to motivate the student to distinguish correct rationales from wrong ones. Further, learning from the LLM's customized feedback and from self-reflection complement each other within the proposed multi-round learning paradigm. The empirical results on mathematical and commonsense reasoning tasks demonstrate the success of unleashing the reasoning potential of smaller LMs. We believe that these findings can benefit the open-source and NLP communities in the era of LLMs.

Limitations
In this section, we discuss the limitations of our method with integrity while offering potentially useful advice for future research.
1) Our experiments primarily utilize ChatGPT and GPT-J (Wang and Komatsuzaki, 2021) as the teacher LLM and student LM, respectively, due to the considerations of availability and costs.
Although fine-tuning GPT-J on the outputs of ChatGPT boosts its reasoning performance, a substantial gap still remains between them. It would be valuable to validate our findings using more powerful LMs (e.g., LLaMA (Touvron et al., 2023a,b)). And training better foundation LMs should be the primary task for the open-source community, since imitating proprietary LLMs may be a false promise (Gudibande et al., 2023).
2) We have demonstrated the importance of the student's feedback in distilling knowledge from the black-box LLM, but without extensively engineering the feedback prompt templates (e.g., explicitly instructing the LLM to act as a teacher). Richer interactions (e.g., using reinforcement learning to connect the LLM and the smaller LM) can be explored in the future.
3) Our self-reflection learning is currently defined in a straightforward triplet-loss form. However, the core of self-reflection is learning from mistakes. Thus, the training objective can be defined in various other ways, such as a ranking loss or verbal critiques, which are expected to further help smaller LMs reflect on and learn from mistakes.
4) Evaluating the correctness of a generated rationale is mainly based on the final answer. Though most existing works (Zelikman et al., 2022; Ho et al., 2023; Fu et al., 2023b; Shridhar et al., 2023) in this field adopt this simple criterion, we call attention to developing more trustworthy criteria to evaluate the quality of rationales. Potential methods include using GPT-4 (OpenAI, 2023) or a process reward model (Lightman et al., 2023) for automatic evaluation.

Ethics Statement
Risk in using closed-source LLMs Though the datasets used for evaluation are publicly available, the annotated rationales in this paper are collected from the closed-source ChatGPT provided by OpenAI.
Open-source LLMs (e.g., LLaMA) have boomed in recent months, and it is noteworthy that many of them use the outputs of closed-source LLMs (e.g., Alpaca and Vicuna are trained on ChatGPT's outputs) for further improvement. According to Sec. 2 "Usage Requirements" of OpenAI's terms of use, there exists a prohibition against using "output from the Services to develop models that compete with OpenAI". However, beyond the terms of use, the crucial matter lies in determining ownership of the copyright pertaining to the outputs of generative AI. As of today, the copyright status of generative AI outputs remains ambiguous, both in scholarly circles and in legal contexts. Compelling evidence indicates that these closed-source LLMs are trained on numerous copyrighted materials, such as books, academic publications, etc. Thus, we think that at least the authors of the training data that directly supports an LLM's outputs hold the copyright, as opposed to the LLM service provider. The prompt creators may also hold the copyright if their prompts substantially influence the LLM's outputs. For the open-source and research communities, we call for a responsible discussion about data collection.
Social Impact This paper explores how to utilize the LLM as a teacher to enhance the reasoning performance of smaller LMs, which can help democratize these emergent abilities for the benefit of broader communities (e.g., math education).Furthermore, we firmly believe that the utilization of LLMs can be a significant area of interest in natural language processing applications and research.

A Implementation Details
The code will be made publicly available after the anonymous reviewing period.

A.1 Data Preparation
The dataset statistics are shown in Table 8. Following Ho et al. (2023), the data of SVAMP (Patel et al., 2021), MultiArith (Roy and Roth, 2015) and StrategyQA (Geva et al., 2021) is split with a ratio of 70:30 for training and evaluation, while the GSM8K (Cobbe et al., 2021) and CSQA (Talmor et al., 2019) datasets follow their original splits. In mistake collection, we use sampling decoding to prompt the student LM to generate 4 rationales for each sample, and only the wrong ones are collected. In rationale collection, the teacher LLM is requested to generate 4 diverse rationales for each question, and only the correct ones are collected. An example of using the student's feedback to request the LLM with the template of Fig. 3 is shown in Table 12. The decoding generation configs are listed in Table 9.

A.2 Training & Evaluation
Hyperparameters Experiments are performed with the help of Transformers (Wolf et al., 2020) and DeepSpeed (Rajbhandari et al., 2020). In addition, from pilot experiments, we empirically find that assigning less weight (0.1) to the fixed demonstration examples than to the input sample helps the model focus on the input sample and yields better performance, which can be investigated in the future.
Evaluation We use a simple yet effective CoT prompt template as follows:

Question: x \n Reasoning: r \n Answer: y    (5)

where \n is the line break symbol, x is the question, and r and y are the expected reasoning steps and answer, respectively. Greedy decoding is adopted for the generation of the student LM, though beam search may further improve performance. Answer extraction during evaluation simply takes the first valid token after "Answer:", which avoids complex post-processing.
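The extraction rule could look roughly as follows; the token-cleaning details are illustrative assumptions.

```python
def extract_answer(generated: str) -> str:
    """Prediction = the first valid token after "Answer:" (the paper's criterion)."""
    parts = generated.split("Answer:", 1)
    if len(parts) < 2:           # the model never produced an answer marker
        return ""
    for token in parts[1].split():
        cleaned = token.strip(".,$%")
        if cleaned:              # first non-empty token is taken as the answer
            return cleaned
    return ""
```

For example, extract_answer("Reasoning: ... Answer: 150.") returns "150".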

B Generalization Results
Generalization experiments are conducted to evaluate the generalization ability of the student LM, as shown in Table 11. The results reveal the following insights: (1) the in-domain generalization performance is enhanced after the reasoning distillation, while the out-of-domain (OOD) performance usually decreases slightly. This finding is consistent with Fu et al. (2023b), although our method surpasses theirs in terms of OOD performance.
(2) The in-domain performance can be further improved by employing our multi-round learning paradigm. And we surprisingly find that, in some cases, the OOD performance can also be improved via multi-round learning. This can be attributed to the customized training data of the later rounds possibly assisting the model in generalizing its reasoning abilities to other domains. (3) The student LM trained on the GSM8K dataset exhibits the most significant improvements in in-domain reasoning performance. Note that GSM8K is the most challenging among these mathematical datasets. Consequently, it is reasonable to expect gains on the other datasets if the student can already tackle the difficult problems.

C Case Study
Contribution of Student's Feedback Additional examples of the LLM's generated rationales are presented in Table 13. We observe that the teacher LLM, ChatGPT, is capable of generating more detailed and precise reasoning steps when provided with the student's feedback (i.e., the wrong solution). These detailed reasoning steps can help address the student's deficiencies and thereby improve reasoning performance in the subsequent round of learning. Although both rationales, with and without feedback, are correct, their quality can vary. More precise and customized rationales can help the student better understand its own mistakes, especially coupled with our self-reflection learning, which is beneficial for the student's reasoning learning.
Multi-round Learning To better understand the impact of each learning round, we conduct a case study in Table 14. First, it is clear that the student LM initialized with pre-trained weights (i.e., the 0th round) is powerless to generate meaningful answers for the mathematical reasoning task, which may confuse the teacher LLM. Thus, we do not utilize this noisy feedback when preparing the training data for the initial round. Second, the LLM's generated response is often tailored to the student's current deficiencies, thus effectively improving the student's reasoning performance in the next round of learning. Third, a single round of distillation may not enable the student to solve challenging questions. However, with the help of our multi-round learning paradigm, the student can have the opportunity to tackle such challenging questions.

GSM8K Example#1
Question: The number of short students in a class is 2/5 of the total number of students. If there are 90 tall students, and the class has 400 students, calculate the total number of students with average height.
Ground Truth: 150

Student's Generation (0th round, wrong): The number of short students is 2/5 of the total number of students. The number of tall students is 2/5 of the total number of students. The number of students with average height is 2/5 of the total number of students.
Teacher's Feedback (0th round): The number of short students in the class is 2/5 of the total number of students, which means there are (2/5) * 400 = 160 short students. If there are 90 tall students, then the number of students with average height can be calculated by subtracting the number of short and tall students from the total number of students. The total number of students with average height is 400 - 160 - 90 = 150 students.
Student's Generation (1st round, wrong): There are 90 tall students in the class. There are 2/5 of 400 students, which is 2/5 * 90 = 36 short students. So the total number of students in the class is 90 + 36 = 126.
Teacher's Feedback (2nd round): There are 90 tall students in the class. Short students are 2/5 of the total number of students. The total number of students in the class is 400. Short students: 2/5 * 400 = 160. We know the total number of students in the class is 400, so 90 + 160 = 250 is the number of tall and short students. Therefore, the total number of students with average height is 400 - 250 = 150.
Student's Generation (wrong): The number of short students in a class is 2/5 of the total number of students. So, 2/5 * 400 = 160. The number of tall students in a class is 90. So, the total number of students with average height is 400 - 160 - 90 = 170.
Teacher's Feedback: The total number of students in the class is 400. The number of short students is 2/5 of the total number of students, 2/5 * 400 = 160. The number of tall students is 90. The total number of short and tall students is 160 + 90 = 250. So the total number of students with average height is 400 - 250 = 150.
Student's Generation (4th round, correct): The number of short students is 2/5 of the total number of students, which is 2/5 * 400 = 160. The total number of students with short height is 160 + 90 = 250. The total number of students with average height is 400 - 250 = 150.

CSQA Example#1
Question: Where do students keep rubber bands? (a) trees (b) drug store (c) trash (d) bedside table (e) classroom
Ground Truth: (e)

Teacher's Feedback (0th round): The answer should be a place where students would keep rubber bands. Bedside tables are often where students keep their rubber bands. Therefore, the answer is bedside table (d).
Teacher's Feedback (1st round): The answer should be a place where students might keep rubber bands. Students might keep rubber bands in their desks in the classroom. Therefore, the answer is classroom (e).
Student's Generation (1st round, correct): The answer should be a place where students would keep rubber bands. Rubber bands are often used for school projects, so students would keep them in their classroom. Therefore, the answer is classroom (e).

Table 14: Case study of the student LM's generation and the teacher LLM's feedback in multi-round learning. The teacher can generate more tailored training data thanks to being aware of the student's deficiencies, and the student can gradually improve its reasoning ability with the help of this customized feedback.

Figure 1: Tailored learning from LLM. In contrast to previous works that merely adopt a), we propose b) and c) to further improve the reasoning distillation.

Figure 2: Overview of the proposed multi-round learning paradigm. (1) The student LM first undertakes an "exam" to gather mistakes (i.e., wrong rationales) made by itself. (2) These mistakes are subsequently utilized as the student's feedback to the teacher LLM, which in turn can generate training data (i.e., correct rationales) as the teacher's customized feedback to the student. (3) Finally, the student learns to improve its reasoning via self-reflection on self-made mistakes and assimilation of the customized training data from the teacher LLM. The trained student LM initiates the next round of learning by repeating the three steps until the performance plateau is reached.

Figure 3: The prompt template T for asking the teacher LLM to generate customized rationales. The part colored in golden is the integrated student feedback.

Figure 4: The t-SNE visualization (van der Maaten and Hinton, 2008) of latent space representations of rationales generated on the GSM8K dataset.

Figure 5: The effect of the self-reflection learning weight λ on the reasoning performance.

Figure 6: The training loss of Eq. (3) in the initial round of the student LM with different weights λ on the MultiArith dataset. We also observe that the loss of Eq. (2) converges for all values of λ.

Table 1: Accuracy (%) on various reasoning tasks with different methods. "LLM-Adapter" refers to results of GPT-J using the LoRA adapter (Hu et al., 2022). "Specializing" refers to results of FlanT5-XXL (Chung et al., 2022), which has about 11B parameters. "CoT Fine-tuned" refers to results of T5-11B (Raffel et al., 2020) fine-tuned on CoT data from GPT-3 175B (Brown et al., 2020). * denotes results from the original paper. Indentation means the modifications are based on the up-level indentation. The best performance among small LMs is marked in bold.
Question: Olaf had 150 toy cars at first. Grandpa gave Olaf twice as many toy cars as the uncle. Dad gave Olaf 10 toy cars, 5 less than Mum. Auntie gave Olaf 6 toy cars, 1 more than the uncle. How many toy cars does Olaf have in total, finally?
Student's Feedback: Olaf has 150 toy cars. His family gives him twice as many toy cars as the uncle. The uncle gives Olaf 10 toy cars. Dad gives Olaf 10 toy cars. Auntie gives Olaf 6 toy cars. So, Olaf has 150 + 10 + 10 + 6 = 236 toy cars.

Table 2: Case study of the impact of the student's feedback on the teacher LLM's response. The incorrect reasoning steps and answers are colored in red. More cases are listed in Appendix C.

Table 3: Ablation study of the student's feedback in the prompt template. "# Success" counts the teacher-generated rationales containing correct answers; "Accuracy" is the resulting student performance.

Table 4: "Distance" measures the Euclidean distance between correct and wrong reasoning paths in latent space. "Preference" is the likelihood ratio of correct reasoning paths to wrong ones. Higher is better for both.

Table 5: Observation of the student LM in each round of learning. "Initial" refers to the model w/o distillation. "#Data" represents the size of the training samples. "ER" refers to the error rate on the train set. "Acc" denotes the initial accuracy of the student LM, and "∆" indicates its performance change after each round.

Table 6: GPT-4 scores for the student LM's generated rationales in each round of learning. The score is given based on the accuracy and quality of the reasoning path. "Correct" and "Wrong" stand for rationales with correct answers and wrong answers, respectively.

Table 9: Generation configs for collecting the student's self-made mistakes and the rationales from the teacher LLM.