Large Language Models Are Reasoning Teachers

Recent works have shown that chain-of-thought (CoT) prompting can elicit language models to solve complex reasoning tasks, step-by-step. However, prompt-based CoT methods are dependent on very large models such as GPT-3 175B which are prohibitive to deploy at scale. In this paper, we use these large models as reasoning teachers to enable complex reasoning in smaller models and reduce model size requirements by several orders of magnitude. We propose Fine-tune-CoT, a method that generates reasoning samples from very large teacher models to fine-tune smaller models. We evaluate our method on a wide range of public models and complex tasks. We find that Fine-tune-CoT enables substantial reasoning capability in small models, far outperforming prompt-based baselines and even the teacher model in many tasks. Additionally, we extend our method by leveraging the teacher model’s ability to generate multiple distinct rationales for each original sample. Enriching the fine-tuning data with such diverse reasoning results in a substantial performance boost across datasets, even for very small models. We conduct ablations and sample studies to understand the emergence of reasoning capabilities of student models. Our code implementation and data are available at https://github.com/itsnamgyu/reasoning-teacher.


Introduction
Language models (LMs) have demonstrated remarkable performance in a wide range of downstream tasks, mainly attributed to their scalability enabled by the Transformer architecture (Vaswani et al., 2017) and the availability of web-scale training data. Previous works on language models have followed the paradigm of pre-training on a large corpus and then fine-tuning on downstream tasks (Raffel et al., 2020; Devlin et al., 2018).

Figure 1: Fine-tune-CoT. We consider a method consisting of multiple stages. First, a large teacher model is prompted to answer questions using multi-step reasoning, without relying on correct examples. That is, the teacher employs zero-shot chain-of-thought reasoning to generate output. We then use the resulting reasoning samples (consisting of the question and teacher output) to fine-tune a much smaller student model.

Recently, large language models (LLMs) have demonstrated in-context generalization capabilities: performing downstream tasks simply by conditioning on a few in-context exemplars or plain natural language task descriptions (Brown et al., 2020; Sun et al., 2021).
LMs have also demonstrated the ability to solve complex tasks when prompted to generate intermediate rationales. Standard prompting methods, which use few-shot exemplars of question-answer pairs or zero-shot instructions, have been shown to be insufficient for downstream tasks which require multiple reasoning steps (Chowdhery et al., 2022). However, recent works have demonstrated the possibility of eliciting complex reasoning abilities through chain-of-thought (CoT) prompting (Wei et al., 2022b; Kojima et al., 2022).

Figure 2: Overview of Fine-tune-CoT.

Original sample
Question: A pet store had 56 puppies. In one day they sold 24 of them and put the rest into cages with 4 in each cage. How many cages did they use?
Answer: 8.

Step 1. Reasoning Generation (large GPT-3 175B teacher LM)
Prompt (Zero-shot-CoT): Q: A pet store had 56 puppies. In one day they sold 24 of them and put the rest into cages with 4 in each cage. How many cages did they use? A: Let's think step by step.
Completion (generated): The store started with 56 puppies. 24 of them were sold, so that means that there are now 32 puppies left. Since there are 4 puppies in each cage, that means that the store now has 8 cages.

Step 2. Curation / Step 3. Fine-tuning (small student LM)
Prompt: A pet store had 56 puppies. In one day they sold 24 of them and put the rest into cages with 4 in each cage. How many cages did they use? ###
Completion: The store started with 56 puppies. 24 of them were sold, so that means that there are now 32 puppies left. Since there are 4 puppies in each cage, that means that the store now has 8 cages.

Caption: Step 1: a very large teacher model is prompted to solve complex questions (yellow) by generating multi-step reasoning explanations (green). Step 2: generated completions are filtered based on the correctness of the final prediction (red). The question, rationale, and answer are used to compose a reasoning sample comprised of the prompt and a multi-step completion. Step 3: the curated reasoning samples are used to fine-tune a small, lightweight student to exhibit reasoning capabilities. The use of an LM-based teacher enables diverse reasoning: generating multiple distinct rationales for each original sample to enrich the fine-tuning data. This boosts the performance of student models without requiring human annotation.
A major drawback of prompt-based CoT reasoning methods is their reliance on extremely large LMs, spanning hundreds of billions of parameters (Wei et al., 2022b; Kojima et al., 2022). These models are prohibitive to deploy at scale due to overwhelming computational requirements and inference costs (Wei et al., 2022b). Therefore, we strive to enable complex reasoning in small models for use in real-world applications.
In this light, we propose an approach named Fine-tune-CoT, which aims to utilize the CoT reasoning capabilities of very large LMs to teach small models how to solve complex tasks. To elaborate, we apply existing zero-shot CoT prompting (Kojima et al., 2022) to generate rationales from very large teacher models, and use them to fine-tune smaller student models, as shown in Figure 2. We note that, similar to standard prompting, vanilla fine-tuning has been shown to be often inadequate for training LMs to solve complex reasoning tasks. While there have been attempts to fine-tune small models with explicit reasoning steps to tackle this issue, they require arduous reasoning annotation and often also task-specific training setups (Nye et al., 2021; Cobbe et al., 2021). Our approach, on the other hand, can be readily applied to novel downstream tasks, owing to the remarkable zero-shot reasoning ability of LM-based teachers (Kojima et al., 2022), without hand-crafted reasoning annotations or task-specific engineering. In essence, our method preserves the versatility of prompt-based CoT without demanding excessively large models.
We propose an extension to our method, referred to as diverse reasoning, which maximizes the teaching effects of Fine-tune-CoT by generating multiple reasoning solutions for each training sample. This can be achieved simply through repeated stochastic sampling. Diverse reasoning is motivated by the intuition that multiple reasoning paths can be used to solve complex, type-2 tasks (Evans, 2010). We posit that such diversity in reasoning paths as well as linguistic templates can substantially aid in fine-tuning for complex reasoning.
We perform empirical evaluations of Fine-tune-CoT and diverse reasoning on various tasks and model sizes using publicly available GPT-3 models. Our fine-tuning approach elicits notable reasoning performance in small models on complex tasks, whereas previous prompt-based methods achieve near-random performance. We show that small models under Fine-tune-CoT even outperform their very large teachers in some tasks. With diverse reasoning, we find that the performance of Fine-tune-CoT is highly scalable, leading to high sample efficiency and notable reasoning performance even with few-shot training examples. We conduct thorough sample studies and ablations of Fine-tune-CoT and its performance on a multitude of datasets, while demonstrating its value on much smaller models. In doing so, we shed light on important nuances of fine-tuning on CoT reasoning that have not been considered in previous works.

Related Work
Downstream transfer in language models Much previous work established a "pre-train and fine-tune" paradigm for enhancing large language models' performance on downstream tasks (Radford et al., 2018; Raffel et al., 2020; Dong et al., 2019; Vaswani et al., 2017; Devlin et al., 2018). However, given that fine-tuning requires a very large dataset of task-specific labeled examples and often does not generalize well to out-of-distribution settings, it is not always easily applicable (Liu et al., 2021; Hendrycks et al., 2020).
More recent literature exhibits a paradigm shift towards "prompting" the model to predict the desired output (Liu et al., 2021; Raffel et al., 2020). Large LMs can exhibit strong performance in this setting (Brown et al., 2020). For smaller models to perform similarly, additional engineering is usually required (Gao et al., 2021; Schick and Schütze, 2021b; Schick et al., 2020). In more complex tasks, the performance of very large LMs under prompting can be boosted with chain-of-thought (CoT) prompting (Wei et al., 2022b). This approach is preceded by the idea of using samples with explicit reasoning steps for fine-tuning a model, which however usually requires human reasoning annotation and task-specific training setups (Nye et al., 2021; Cobbe et al., 2021).

Chain-of-thought
In few-shot CoT prompting, the model is fed examples of step-by-step reasoning in natural language. It can then generate intermediate reasoning steps leading to a problem solution. This improves performance on a wide range of tasks (Wang et al., 2022b). Additionally, LLMs can perform well in an unsupervised, task-agnostic setting using zero-shot CoT (Kojima et al., 2022). This requires no fine-tuning or task-specific conditioning, and substantially outperforms standard zero-shot learning, and sometimes even few-shot learning, on a wide number of tasks. Yet, prior work has shown that CoT requires extremely large models for optimal performance (Hoffmann et al., 2022; Chowdhery et al., 2022). In our work, we contrast this by showing how to utilize CoT reasoning methods for smaller models by fine-tuning them on rationales generated by a very large model. Using various LLM-generated explanations for fine-tuning smaller models has been explored successfully in prior work (Li et al., 2022a). A similar approach to ours is also mentioned in Huang et al. (2022); however, we note that the focus of this concurrent work lies on using few-shot CoT to self-generate fine-tuning examples by and for very large proprietary models. The authors provide a brief glimpse into using zero-shot CoT to generate reasoning examples for fine-tuning smaller distilled models, but the results are limited to one dataset and very large models that are also inaccessible to the general community. In contrast, we provide a rich set of results and qualitative/quantitative analysis on a wide range of datasets, using open-source models that are small and accessible to everyone.
Knowledge distillation Typically, knowledge distillation (KD) refers to training small models derived from large models in order to reduce model size and latency while preserving accuracy and capacity to generalize (Hinton et al., 2015; Sanh et al., 2019). Essentially, KD is a form of model compression, making efficient deployment to capacity-limited devices possible (Bucilua et al., 2006). We note that our work could also be considered a distant variant of KD (see Gou et al. (2021) for a survey), similar to works on improving prompt-based methods such as Yoo et al. (2021); Schick and Schütze (2021b,a); Zelikman et al. (2022). It most closely resembles a form of data-free distillation (Micaelli and Storkey, 2019; Nayak et al., 2019; Shen et al., 2021), where otherwise inaccessible transfer data is synthetically generated from a large teacher model. Similarly, sequence-level distillation, i.e., training a smaller student model on the output of beam search from a larger teacher, can make neural machine translation more efficient without significant performance loss (Kim and Rush, 2016). Related KD approaches have also been used to improve the performance of non-autoregressive translation by training on output generated by an autoregressive translation model (Zhou et al., 2020). Despite being similar in spirit, our method still distinguishes itself from such previous work. The role of the teacher model in our method is to teach the notion of intermediate reasoning. It is not the specific output that is the main supervising signal for reasoning, but rather the generation's structure.

Chain of Thought Fine-Tuning
We propose Fine-tune-CoT, a task-agnostic approach to enable chain-of-thought reasoning in small language models. The core idea is to generate reasoning samples from very large teacher models using prompt-based CoT methods and subsequently fine-tune small student models on the generated samples. This approach preserves the advantages of task-agnostic prompt-based CoT methods while overcoming their reliance on prohibitively large models. To maximize versatility, we use the recent Zero-shot-CoT prompting method (Kojima et al., 2022) on teacher models, as it does not require any hand-annotated reasoning explanations. We note that our approach is not limited to this way of prompting the teacher model. In the following, we characterize Fine-tune-CoT in three distinct steps, as also shown in Figure 2.
Step 1. Reasoning generation First, we utilize a large teacher model to generate CoT reasoning explanations for a given task. Consider a standard sample S_i consisting of a question q_i and its true answer a_i. Using Zero-shot-CoT, we prompt the teacher model to generate a reasoning explanation, or rationale, r̂_i to solve question q_i and make a final answer prediction â_i. The resulting text sequence, including the prompt and generations, takes the following form: "Q: <q_i>. A: Let's think step by step. <r̂_i> Therefore, the answer is <â_i>".
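As an illustration, the two prompting stages of Zero-shot-CoT can be sketched as below. The teacher completions themselves would come from a large LM (e.g., via the OpenAI API); this hypothetical helper only constructs the prompt text for each stage.

```python
def zero_shot_cot_prompt(question, rationale=None):
    """Build the two-stage Zero-shot-CoT prompt (Kojima et al., 2022).

    Stage 1 (rationale is None): append the reasoning trigger to elicit
    the rationale r_i from the teacher.
    Stage 2 (rationale given): append the answer-extraction trigger to
    obtain the final answer prediction a_i-hat.
    """
    stage1 = f"Q: {question} A: Let's think step by step."
    if rationale is None:
        return stage1
    return f"{stage1} {rationale} Therefore, the answer is"
```

The teacher's stage-1 completion is fed back in as `rationale` to produce the stage-2 prompt, yielding the full "Q: ... A: Let's think step by step. ... Therefore, the answer is ..." sequence described above.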
Step 2. Curation To prepare fine-tuning samples, we filter the generated samples and reformat them into prompt-completion pairs. For filtering, we simply compare the final prediction of the teacher model â_i with the ground-truth answer a_i, following previous works (Zelikman et al., 2022; Huang et al., 2022). For all instances i where â_i = a_i, we repackage (S_i, r̂_i, â_i) into a reasoning sample S'_i = (p_i, c_i), a prompt-completion pair. Since our method is aimed at training efficient task-specific models, we use a special-character-based text format to minimize token usage. Specifically, p_i and c_i take the forms "<q_i> ###" and "<r̂_i> --> <a_i> END", respectively. We note that filtering based on answer predictions does not ensure the correctness of the rationales, especially for multi-choice questions where random guessing is likely. However, this has not been addressed in previous works. We provide an analysis of rationale filtering in Subsection 4.2, where we discuss a trade-off between sample quantity and quality.
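The curation step above can be sketched as follows. The record field names and the JSONL serialization are our own illustrative assumptions (fine-tuning APIs such as OpenAI's expect prompt-completion pairs in JSONL), not the official implementation.

```python
import json

def curate_reasoning_samples(records):
    """Filter teacher generations by answer correctness and repackage
    them into prompt-completion pairs using the special-character
    format "<q> ###" / "<r> --> <a> END" described in the text.

    records: list of dicts with keys "question", "rationale",
    "answer" (ground truth a_i), and "prediction" (teacher's a_i-hat).
    """
    pairs = []
    for r in records:
        if r["prediction"] == r["answer"]:  # answer-based filtering
            pairs.append({
                "prompt": f"{r['question']} ###",
                # leading space is a common completion-format convention
                "completion": f" {r['rationale']} --> {r['answer']} END",
            })
    return pairs

def to_jsonl(pairs):
    """Serialize curated pairs as JSONL for a fine-tuning API."""
    return "\n".join(json.dumps(p) for p in pairs)
```

Note that, as discussed above, this filter only checks the final answer; an incorrect rationale that happens to end in the right answer would still pass.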
Step 3. Fine-tune Finally, we fine-tune a small pre-trained student model on the assembled reasoning samples using the widely accessible OpenAI API. We use the same training objective as that used during pre-training, i.e., the autoregressive language modeling objective, or next-token prediction (Radford et al., 2018).

Diverse reasoning
To maximize the sample efficiency of Fine-tune-CoT, we can generate multiple reasoning explanations for each training sample, thereby augmenting the fine-tuning data. We refer to this as diverse reasoning. In detail, for a given sample S_i, instead of applying Zero-shot-CoT with greedy decoding to obtain a single explanation-answer pair (r̂_i, â_i), we use a stochastic sampling strategy, i.e., temperature sampling with large T, to obtain D distinct generations {(r̂_ij, â_ij)}_{j=1}^{D}. Subsequent reasoning sample curation and fine-tuning then proceed as before. We refer to D as the degree of reasoning diversity. Diverse reasoning is motivated by the intuition that multiple reasoning paths can be used to solve complex tasks, i.e., type-2 tasks (Evans, 2010). In sample studies, we confirm that diverse reasoning samples contain various reasoning paths as well as linguistic templates, which can also be observed in the fine-tuned students. This is similar to Wang et al. (2022b); Zelikman et al. (2022); Huang et al. (2022), where diverse reasoning paths are generated and marginalized to find the optimal answer. Diverse reasoning also draws parallels with Yoo et al. (2021), which utilizes the generative power of LLMs to augment training data with synthesized samples.
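Diverse reasoning amounts to repeating the stage-1 generation D times with temperature sampling. A minimal sketch, where `generate` is a hypothetical callable wrapping the teacher model (e.g., an OpenAI completion call), and T = 0.7 follows the setting used for diverse reasoning in our experiments:

```python
def diverse_reasoning(question, generate, degree=8, temperature=0.7):
    """Collect D = `degree` stochastic rationale-answer generations per
    question via temperature sampling. `generate(question, temperature)`
    is assumed to return one (rationale, answer) pair from the teacher;
    with T > 0 the generations are generally distinct."""
    return [generate(question, temperature=temperature)
            for _ in range(degree)]
```

The resulting generations are then curated and fine-tuned on exactly as in Steps 2 and 3; filtering is applied per generation, so a question can contribute up to D reasoning samples.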
Experiments

Models The student models are 25-500x smaller than the very large teacher model, and thus considerably more feasible for real-world deployment. Table 1 provides a summary of the models used in our study.
Baseline methods We provide a comparison of our method, Fine-tune-CoT, with three baseline methods: Zero-shot, Fine-tune, and Zero-shot-CoT (Kojima et al., 2022). Fine-tune refers to fine-tuning with original training samples {(q_i, a_i)}_i, specifically using prompts and completions of the form "<q_i> ###" and "<a_i> END". An LM's fine-tune performance represents its ability to learn to solve complex tasks using only question-answer pairs, without explicit supervision on reasoning. Zero-shot-CoT performance represents task-agnostic prompt-based CoT performance. The taxonomy of methods used in our study is outlined in Table 2 for clarity.

Results
In this section, we present results on the reasoning performance of small models using Fine-tune-CoT, across various model scales. We also demonstrate how the accuracy of our method scales with the number of diverse reasoning samples.

In a sample study, we further identify the overall strengths and weaknesses of our method, demonstrating failure patterns and factors for success. For more details on this line of analysis, we refer to Appendix C.
Fine-tuning elicits complex reasoning in small models Table 3 summarizes the accuracy of student models using the proposed Fine-tune-CoT, compared to the existing task-agnostic prompting baseline, Zero-shot-CoT (Kojima et al., 2022), as well as standard zero-shot prompting and fine-tuning using standard samples without any reasoning. While Zero-shot-CoT exhibits remarkable performance on the very large 175B model (Kojima et al., 2022), it fails to enable complex reasoning in smaller models, including the 6.7B model, showing near-negligible performance across all tasks. On the other hand, Fine-tune-CoT elicits notable reasoning performance on the same tasks, demonstrating significant gains over Zero-shot-CoT using smaller models. For complex arithmetic, Fine-tune-CoT achieves a notable 33% accuracy on MultiArith while Zero-shot-CoT only reaches 5%. For two commonsense reasoning tasks, Fine-tune-CoT is shown to outperform the near-random performance of Zero-shot-CoT by 37%p and 5%p, respectively. Fine-tune-CoT performance is most notable in relatively simple tasks, including other reasoning tasks (Date Understanding, Tracking Shuffled Objects) and symbolic reasoning (Last Letter Concatenation, Coin Flip), where Zero-shot-CoT fails to overcome random-guess performance.
Small models can outperform very large teachers in reasoning Table 3 also shows that Fine-tune-CoT is highly effective on small models. In two tasks with less complexity, Shuffled Objects and Coin Flip, Fine-tune-CoT is shown to outperform the 175B teacher model using 1.3B and 6.7B parameters, respectively, i.e., reducing the number of required parameters by approx. 25-100x. We also find that Fine-tune-CoT with the very small 0.3B model consistently outperforms the 6.7B model under Zero-shot-CoT.
Fine-tune vs Fine-tune-CoT Similarly, we find that Fine-tune-CoT outperforms vanilla fine-tuning across a wide range of tasks, as shown in Table 3. This is most pronounced in Date Understanding and Shuffled Objects, for which fine-tuning with standard question-answer pairs results in random-guess performance. This is especially evident in Tracking Shuffled Objects, where Fine-tune-CoT accuracy surpasses that of vanilla Fine-tune by nearly twofold. Moreover, Fine-tune-CoT has the unique advantage of being able to benefit from multiple teacher-generated reasoning paths for a given question, as we discuss in the following paragraph.
Diverse reasoning substantially improves Fine-tune-CoT performance To examine the learning effects of diverse reasoning, we apply Fine-tune-CoT using 1-64 reasoning explanations per sample across three model scales on SVAMP. Table 4 shows that diverse reasoning can significantly improve the performance of student models using Fine-tune-CoT. With 64 reasoning explanations per sample, the 0.3B model shows a near five-fold improvement, surpassing the baseline Fine-tune-CoT performance of the 6.7B model. Moreover, we find that diverse reasoning can boost the performance of Fine-tune-CoT to surpass that of vanilla fine-tuning, across all model sizes.
Fine-tune-CoT with diverse reasoning is highly sample efficient We consider applying diverse reasoning in the few-shot data regime to maximize the utility of data samples for Fine-tune-CoT. Using as few as 32 samples, we see that the performance of the 6.7B model scales with the reasoning diversity, improving the efficacy of Fine-tune-CoT. We do, however, observe limited performance using very few samples, i.e., 8, even when using diverse reasoning.

Analysis
In this section, we bring to attention several critical nuances in fine-tuning for CoT which have not been addressed in previous or concurrent work (Zelikman et al., 2022; Li et al., 2022a; Huang et al., 2022). We provide a set of analyses to shed light on these issues.
Templated datasets Upon inspection, we found that many datasets contain groups of samples which share common templates. This brings into question the validity of a naive sample-wise data split, as it has the potential to leak the same templates into both the train and test sets. To investigate whether the student models are truly learning to reason rather than matching simple patterns, we manually group samples by template and evaluate Fine-tune-CoT using a template-wise data split. Table 5 shows the performance of Fine-tune-CoT when using a sample-wise vs template-wise split, using the same train-test ratio of 70:30. While student performance is typically lower with a template-wise split, it still significantly outperforms random-guess performance, as well as the zero-shot and vanilla fine-tuning baselines shown in Table 3. This reaffirms that Fine-tune-CoT is able to elicit complex reasoning capabilities in small language models.
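A template-wise split can be sketched as below. In the study the template groups were assigned manually; here `key` is a hypothetical function returning a template identifier, and the train ratio is applied at template granularity, which only approximates an exact 70:30 sample ratio.

```python
import random
from collections import defaultdict

def template_wise_split(samples, key, train_ratio=0.7, seed=0):
    """Split samples so that no template appears in both the train and
    test sets, in contrast to a naive sample-wise split.

    key(sample) -> hashable template identifier.
    """
    groups = defaultdict(list)
    for s in samples:
        groups[key(s)].append(s)
    templates = sorted(groups)
    random.Random(seed).shuffle(templates)
    n_train = int(len(templates) * train_ratio)
    train = [s for t in templates[:n_train] for s in groups[t]]
    test = [s for t in templates[n_train:] for s in groups[t]]
    return train, test
```

Because whole template groups are assigned to one side, a student that merely pattern-matches templates seen during training gains no advantage on the test set.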
Rationale filtering It is possible for the teacher model to answer correctly despite incorrect reasoning, especially in multi-choice questions where the random-guess probability is significant. To investigate the potential impact of a better filtering scheme (as opposed to our baseline answer-based filtering), we manually annotate the correctness of rationales from the teacher model and evaluate student performance when fine-tuning on correct samples, filtered based on answer predictions, vs golden samples, hand-picked based on the correctness of rationales. We find that 28% of correct samples have incorrect rationales, significantly more than the random-guess performance of 17.12%, indicating the importance of filtering. Surprisingly, however, we find that answer-based filtering outperforms the more stringent human filtering by 5-11%, given the same initial samples. When we match the number of samples post-filtering (via undersampling), we do find that fine-tuning on golden samples outperforms that on correct samples by 5-8%. These results suggest that there is a trade-off between the quality and quantity of reasoning samples which must be addressed when considering sample-filtering methods.
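The matched-count comparison relies on undersampling the larger filtered set. A minimal sketch of that matching step (the original selection procedure may differ; a fixed seed is assumed here for reproducibility):

```python
import random

def undersample(samples, n, seed=0):
    """Randomly undersample to n items so that two filtering schemes
    (e.g., answer-based "correct" vs hand-picked "golden") can be
    compared at matched sample counts."""
    if len(samples) <= n:
        return list(samples)
    return random.Random(seed).sample(samples, n)
```

With counts matched this way, any remaining performance gap between the two fine-tuned students can be attributed to sample quality rather than quantity.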
Sequence length Following the canonical setting for Zero-shot-CoT, we initially limit the maximum sequence length, or max tokens, allowed for the teacher-generated rationales and student reasoning predictions, denoted L_r and L_p, to 128, following Kojima et al. (2022). However, we find that this can be insufficient in many datasets. Allowing for longer inference, we observe that model performance improves significantly on AQUA and commonsense reasoning tasks (Appendix Table 9). Sample inspection shows that rationales with over ~500 tokens are typically repetitive or too digressive. To investigate the effect of the max length L_r of the teacher rationale on fine-tuning, we compare student performance using L_r ∈ {128, 512} (Table 7). The effect of L_r on student performance varies across datasets, and increased L_r does not necessarily improve student performance on tasks that require longer rationales, such as AQUA. Finally, we examine the length distribution of the generated rationales from the teacher model and from students trained on short (L_r = 128) and long (L_r = 512) reasoning samples, respectively (Appendix Figure 4). We find that the distribution is different for each dataset. Notably, while the distributions from the long students were similar to that of the teacher, the generated rationales from the short students were typically limited to fewer than ~128 tokens. These findings are in line with the intuition that different tasks require different lengths of rationales, and suggest that careful consideration is needed in determining parameters related to sequence length.

Discussion
Versatility and accessibility of Fine-tune-CoT Owing to the versatility of the underlying prompt-based generation methods, e.g., Zero-shot-CoT, our method can be readily applied to any complex task without any task-specific engineering. In fact, it is possible to generate samples from very large teacher models using readily and publicly available APIs such as those provided by OpenAI, without the need for proprietary or otherwise specialized models. Fine-tuning and inference on student models can also be performed on much more accessible hardware, in contrast to working with very large models. This can reduce long-term computational costs and minimize environmental impact, while making our method fully accessible to a wide community.
Optimizing for efficiency in single tasks We note that being able to use small models to solve tasks also implies that these models do not need to excel at everything. When using large models, which are highly expensive to train and fine-tune, the requirement for their efficient use is that they perform well on any task. In contrast, using small models allows for a more task-specific approach, with optimization where needed, given that these models are cheap to train and easy to deploy.
As an example, we can consider optimal generation length, which could differ between tasks (see Subsection 4.2).
Towards concise answers Sample studies show that rationales output by student models may occasionally be repetitive and digressive. This is undesirable not only in terms of inference-time efficiency, but also for the interpretability and utility of the rationales. As a minor optimization to inference computation, we construct our fine-tuning sample templates using special-character-based delimiters instead of the natural language used in concurrent work (Huang et al., 2022) to minimize sequence length. Preliminary findings showed this had no significant impact on reasoning performance. More importantly, it is desirable to train student models to generate answers that are concise in substance. Subsection 4.2 hints at the possibility for this, showing that fine-tuning on shorter reasoning samples causes the student model to also produce shorter rationales. We believe this is an important direction for future work.
Vanilla fine-tuning versus Fine-tune-CoT for complex tasks Fine-tuning on vanilla training samples actually shows meaningful performance in complex tasks. While the model is initially unable to solve tasks that require complex reasoning, it is able to learn this capability from simple question-answer samples which do not explicitly contain reasoning explanations. This suggests that fine-tuning, i.e., the simple training objective of maximizing the likelihood of next-token prediction, enables the model to infer the reasoning process for solving tasks based on the final answer. We note that this success does not negate the motivation for using Fine-tune-CoT, seeing that our method still outperforms vanilla fine-tuning on the majority of benchmarks (and even more when we include diverse reasoning). Fine-tune-CoT also gives intermediate, easy-to-trace reasoning steps instead of a complete black box. This also fits with our previous observation that different sets of tasks may require different approaches for optimal performance when using small models.
Reasoning in small language models Table 4 shows that Fine-tune-CoT performance scales with the diversity of reasoning and the amount of available samples. We note here that this also factors into a trade-off between quantity and quality when it comes to samples used for fine-tuning student models: as we have found in the corresponding ablation study (Subsection 4.2), having fewer but perfectly curated reasoning samples is not necessarily as helpful as having a larger amount of reasoning samples that might not always be fully correct. This hints at the true role of the fine-tuning samples when it comes to enabling CoT reasoning in small models: it appears that learning to use the reasoning process of the teacher model from a larger number of observations as cues is more important than learning from a smaller amount of perfect reasoning. That is, the student model imitates the teacher's process of splitting large tasks into smaller sub-tasks, without having to rely on the teacher's actual predictions in the training set. Intuitively, the student model might not have the same memorization and abstraction skills as the large teacher, but with enough cues to work with (i.e., a large enough amount of fine-tuning samples demonstrating how to use reasoning to get to an answer), it can in fact gain capabilities reminiscent of larger models. This assumption is supported by the observation that student models using Fine-tune-CoT can in fact generalize to previously unseen tasks in the way they perform intermediate reasoning.
Limitations and future work We note that the performance of our method is currently not state-of-the-art. However, it can benefit from advances in teacher models, as well as from different prompting methods used in teacher models. For example, future work should include a wider array of teachers, such as the highly versatile ChatGPT, or text-davinci-003, which builds on InstructGPT. Furthermore, previous work shows that Few-shot CoT (Wei et al., 2022b) can improve accuracy over Zero-shot-CoT by a wide margin, e.g., going from 78.7% to 93.0% when using only eight in-context reasoning samples on MultiArith (Kojima et al., 2022). Both of these avenues are promising for future work. Another potential improvement may lie in using a different knowledge distillation method, such as sequence-level distillation, which trains on the output of beam search in the teacher (Kim and Rush, 2016).

Conclusion
In this work, we have demonstrated how the power of large language models can be used to teach much smaller student models how to reason step-by-step. We do this by prompting a large model for chain-of-thought rationales and using its completions as samples for a smaller model to fine-tune on. Our results show that this method significantly improves the performance of small models on a range of different tasks with high sample efficiency, and can even reach or exceed teacher performance in many cases. We add to these findings with a rich set of ablation studies and analyses. By leveraging publicly available models with zero-shot prompting, we demonstrate a task-agnostic method to elicit reasoning performance in small models, accessible to the broader community.
A Experimental Details

A.1 Generation
We use the publicly available OpenAI API to generate reasoning samples and reasoning predictions with our teacher and student models, respectively.
Maximum sequence length For the maximum sequence length of teacher-generated rationales r̂_i, we use L_r = 128, following Kojima et al. (2022), unless stated otherwise. For the maximum sequence length of the student model predictions, we use L_p = 1024, unless stated otherwise. We retroactively applied L_p = 1024 as the default after discovering that L_p = 128 is insufficient for many tasks, as discussed in Subsection 4.2.

Sampling temperature We use a sampling temperature of T = 0 for all generations, except diverse reasoning, to obtain deterministic results. For diverse reasoning, we use T = 0.7 to obtain unique generations, following a similar approach from Wang et al. (2022b).

A.2 Fine-tuning
We use the publicly available OpenAI API to fine-tune student models based on GPT-3. We use the default parameters provided by the API.
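For illustration, curated reasoning samples can be assembled into the prompt-completion JSONL format expected by the legacy OpenAI fine-tuning API roughly as follows. The "###" prompt separator and "END" stop sequence here are illustrative conventions for delimiting prompts and completions, not necessarily the exact delimiters used in our released code.

```python
import json

def build_finetune_file(samples, path):
    """Write fine-tuning records in the legacy OpenAI JSONL format.

    samples: iterable of (question, rationale, answer) triples that survived
    correctness filtering against the ground-truth answer.
    """
    with open(path, "w") as f:
        for question, rationale, answer in samples:
            record = {
                # Separator marks the end of the prompt at inference time.
                "prompt": f"{question} ###",
                # Completion contains the rationale followed by the answer
                # and a stop sequence.
                "completion": f" {rationale} --> {answer} END",
            }
            f.write(json.dumps(record) + "\n")
```

The resulting file would then be uploaded to the fine-tuning API, with "END" registered as a stop sequence at inference time.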

A.3 Answer cleansing
We follow the method used in Kojima et al. (2022) to cleanse answers generated by the models in order to assess their correctness.
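A simplified sketch of such answer cleansing might look as follows; the exact per-task rules of Kojima et al. (2022) are more involved, so this is only an illustration of the idea.

```python
import re

def cleanse_answer(text, answer_type):
    """Extract a canonical answer string from raw model output.

    answer_type selects a task family: "number" for arithmetic tasks,
    "choice" for multiple-choice tasks (A-E), "yesno" for binary tasks.
    """
    text = text.strip()
    if answer_type == "number":
        # Keep the first number, with thousands separators stripped.
        matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        return matches[0] if matches else ""
    if answer_type == "choice":
        # Keep the first multiple-choice letter that appears.
        match = re.search(r"[ABCDE]", text)
        return match.group(0) if match else ""
    if answer_type == "yesno":
        match = re.search(r"\b(yes|no)\b", text.lower())
        return match.group(1) if match else ""
    return text
```

The cleansed string is then compared against the ground-truth answer to decide whether a generated sample is correct.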

B Datasets
We provide a summary of the datasets used in our experiments in Table 8. We consider the 10 datasets from Kojima et al. (2022), used to measure reasoning performance. For Last Letter Concatenation and Coin Flip, we use the publicly available data provided by Kojima et al. (2022).

C Sample Study
To understand the strengths and weaknesses of our method, we randomly choose 50 samples per dataset and analyze the reasoning performance of Fine-tune-CoT. To do so, we compare its generations for these 50 samples with (1) the output of the large teacher model, (2) a student model using Zero-shot-CoT, and (3) a student model using fine-tuning without chain-of-thought reasoning. We show representative examples in Tables 11-14.

C.1 Weaknesses and error analysis
For our analysis of the method's weaknesses, we look at datasets where we find particularly poor performance compared to other methods, in particular fine-tuning. We summarize our observations below. First, we observe that GSM8K and AQuA are too difficult for a small student model, especially given that even the teacher model achieves below 50% accuracy on both. In fact, even the samples that the models answer correctly are dominated by flawed reasoning and merely coincidentally correct answers, owing to the high complexity of the tasks (Table 11a,b). For AQuA in particular, we note that while we occasionally find meaningful reasoning in the 6.7B student model, students clearly cannot learn to solve the tasks sufficiently well. A similar, if less salient, issue arises for StrategyQA. Here, the teacher also performs only 3% above the random-guess accuracy of 50%. While the smaller student models actually manage to improve on this performance, in particular vanilla fine-tuning, the errors arising in Fine-tune-CoT often look very similar to those of the large teacher model. None of the models can identify the salient pieces of information and put them together well enough to generate mostly correct answers. Very often, the models instead merely recall information related to the question, but cannot synthesize an answer from it (Tables 11c, 12a).
Next, we note that small models exhibit weak arithmetic skills. This has already been discussed in previous literature, where calculation capability has been found to scale with model size (Wei et al., 2022a). Especially in SingleEq (Table 12b) and AddSub (Table 12c), a majority of errors arise simply from wrong calculations rather than from flawed reasoning. This is also a major factor in our method's poor performance on SVAMP as well as GSM8K; even correct multi-step reasoning cannot compensate for the fact that the model's arithmetic tends to go wrong already on intermediate steps (Tables 12d, 13a). The teacher model does better on these tasks, given its larger size.
We furthermore find that models seem sensitive to how a question is formulated. This is noticeable in all datasets, in particular in SVAMP. We especially observe this issue when there is redundant information present in the question (Table 13b). Such cases elicit wrong reasoning, or lead the model to become stuck on the question, similarly to what usually happens with Zero-shot-CoT in the student model. Other common sources of error are hidden variables in the first part of the task (i.e., tasks that force the model to calculate a previously unknown value described in the first sentence; see Table 13c), or overloaded words (e.g., "landing"; see Table 13d). We also observe samples where the model gets stuck on an intermediate result (Table 14a). This observation fits with previous findings that language models have a recency bias (Zhao et al., 2021).
Meanwhile, when looking at our method's performance on CommonsenseQA, we note that its reasoning skills are in fact not the issue for many of the tasks in this sample set. We find that the student model using Fine-tune-CoT can generate logical reasoning paths for many of the samples that are marked as false (Table 14b). Rather, the exact answer is often very subjective, making it very difficult to guess the correct output from logical reasoning alone (Table 14c). CommonsenseQA is thus not an ideal benchmark when judged on accuracy alone, but it gives insight into how well the model can reason. Also, comparing the negative samples from this dataset with the negative samples from StrategyQA, we note that while the sources of the eventual error are usually different (in StrategyQA, synthesizing an answer from a number of different facts is more of an issue), the generated reasoning itself is very often correct for both of these datasets.
Importantly, we note that for each dataset, there seems to be a difference between "easy" and "hard" instances. When we consider the accuracy of the teacher and other student models (using fine-tuning or Zero-shot-CoT) on tasks where our method fails, we find that it is always lower than on tasks where our method succeeds. That is, successes tend to be aligned across the different methods, and so are failures. We hypothesize that factors like content bias may play a role here; language models have been found to fail depending on the context and content of a task, in a way similar to human reasoners (Dasgupta et al., 2022). We can identify samples that hint at this issue when we look at questions whose phrasing seems contradictory or counterintuitive (Table 14d). Additionally, previous work shows that GPT-3 exhibits a performance gap between instances including terms that are frequent in the pretraining corpus and instances including less frequent terms (Razeghi et al., 2022). This kind of leakage can contribute to uneven performance on a multitude of (especially numerical) tasks across different methods and model sizes, as the presence of more frequent terms makes it easier to perform calculations. We surmise that the observed differences in accuracy stem from the various sources of error for each method; note that, e.g., fine-tuning has much less room for error than Fine-tune-CoT, which can additionally make mistakes in intermediate reasoning steps.

C.2 Strengths
Having analyzed the main sources of error, we now focus on the datasets that elicit good performance from our method. As arithmetic errors are one of the main reasons for the poor performance of small student models, it comes as little surprise that our method performs best on datasets that are mainly text-based and do not require actual calculation skills, such as Date Understanding, Coin Flip, Shuffled Objects, and Last Letter Concatenation. These datasets also have very clear patterns in their tasks, which helps Fine-tune-CoT perform well by providing cues on how to solve a specific task. We note that, in contrast, classic fine-tuning does not have an advantage on these datasets, achieving significantly lower accuracy than Fine-tune-CoT on all four. The same is also true for MultiArith, which we have used as a benchmark in the main text. While arithmetic errors cause the absolute accuracy of our method to be lower than the teacher's, it significantly outperforms fine-tuning on MultiArith even without diverse reasoning. Indeed, we find that even in the presence of arithmetic errors, our model reasons correctly in many cases. We surmise that the heavily patterned nature of the tasks in MultiArith helps the student model understand what is asked of it, eliciting correct reasoning. Additionally, we note that the presence of such patterns in successful datasets does not mean that our method overfits to existing templates. In our template-split analysis (Subsection 4.2), we in fact show that while tasks look similar to one another in certain datasets such as Date Understanding, the student model's reasoning does not rely on simply matching templates or memorizing particular solutions. This implies that our method can generalize to previously unseen tasks; the patterns in the datasets do not produce overfitting, but can be surmised to act as cues for the model's understanding of its current task.
We can further compare Fine-tune-CoT with purely prompt-based Zero-shot-CoT in the student model. Here, we observe that the reasoning skills of a student using Fine-tune-CoT, having been trained on many reasoning samples, can overcome the smaller model capacity (which proves to be completely prohibitive for Zero-shot-CoT to have any success on the various tasks). Where Zero-shot-CoT fails to reason and is simply prompted to repeat the question or come up with answers that only vaguely pertain to the question, our method can in fact reason, even when it cannot ultimately arrive at the correct answer. This is particularly noticeable in the two QA datasets (CommonsenseQA and StrategyQA). While our method does not always produce the required ground-truth answer, such that it effectively still underperforms vanilla fine-tuning, it can put together a logical path from the question to the prediction, and thus elicits performance similar to the teacher's despite the small model size. Fine-tune-CoT hence combines many of the advantages of vanilla fine-tuning and CoT reasoning on smaller models.

Figure 4: Distribution of the length of generated reasoning sequences from the 175B teacher model and fine-tuned 6.7B student models on four datasets. Student (Short) refers to baseline students fine-tuned on reasoning samples with a maximum rationale sequence length of L_r = 128, and Student (Long) refers to students fine-tuned on longer reasoning samples with L_r = 512.

Figure 2 :
Figure 2: Detailed overview of the proposed Fine-tune-CoT method. Step 1: a very large teacher model is prompted to solve complex questions (yellow) by generating multi-step reasoning explanations (green). Step 2: generated completions are filtered based on the correctness of the final prediction (red). The question, rationale, and answer are used to compose a reasoning sample comprising the prompt and a multi-step completion. Step 3: the curated reasoning samples are used to fine-tune a small, lightweight student to exhibit reasoning capabilities. The use of an LM-based teacher enables diverse reasoning: generating multiple distinct rationales for each original sample to enrich the fine-tuning data. This boosts the performance of student models without requiring human annotation.

Figure 3 :
Figure 3: Sample efficiency of Fine-tune-CoT with diverse reasoning. Accuracy (%) of Fine-tune-CoT for 6.7B student models on SVAMP using the full dataset or few-shot data, across varying degrees of diverse reasoning.

Table 1 :
List of models used in our experiments.

Table 2 :
Taxonomy of methods used in our experiments.

Table 3 :
Fine-tune-CoT performance. Accuracy (%) of baseline zero-shot and fine-tune methods with and without CoT reasoning for student models on 12 tasks. 'Random' refers to random-guess performance derived from the number of choices in tasks comprised of multiple-choice questions, i.e., the performance of a model that is only capable of outputting a random answer in the correct format.

Table 4 :
Diverse reasoning performance. Accuracy (%) of Zero-shot-CoT, Fine-tune-CoT, and Fine-tune-CoT with diverse reasoning samples for student models on SVAMP. 'Usage' refers to the ratio (%) of original training samples that are used for fine-tuning, i.e., had at least one correct reasoning output from the teacher.
on simple question-answer pairs without explicit reasoning examples. Nevertheless, Fine-tune-CoT performance shows a more reliable scaling curve with model size and demonstrates a clear advantage in tasks that require multiple steps, such as Tracking Shuffled Objects.

Table 5 :
Sample-wise vs. template-wise split. Accuracy (%) of Fine-tune-CoT for student models on two moderately templated datasets when using a sample-wise vs. template-wise train-test split.

Table 7 :
Longer reasoning samples for fine-tuning.
Accuracy (%) of Fine-tune-CoT for student models on four datasets which require longer rationales, when trained on reasoning samples with maximum rationale sequence lengths of L_r = 128 and 512.

Table 8 :
Description of datasets used in our study.

Table 9 :
Ablation on maximum sequence length. Accuracy (%) of Zero-shot-CoT on the teacher model and Fine-tune-CoT on student models, based on maximum sequence length. Values in parentheses refer to the percentage of generated rationales that were completed within the allotted maximum sequence length.