Symbolic Chain-of-Thought Distillation: Small Models Can Also “Think” Step-by-Step

Chain-of-thought prompting (e.g., "Let's think step-by-step") primes large language models to verbalize rationalizations for their predictions. While chain-of-thought can lead to dramatic performance gains, benefits appear to emerge only for sufficiently large models (beyond 50B parameters). We show that orders-of-magnitude smaller models (125M-1.3B parameters) can still benefit from chain-of-thought prompting. To achieve this, we introduce Symbolic Chain-of-Thought Distillation (SCoTD), a method to train a smaller student model on rationalizations sampled from a significantly larger teacher model. Experiments across several commonsense benchmarks show that: 1) SCoTD enhances the performance of the student model in both supervised and few-shot settings, especially for challenge sets; 2) sampling many reasoning chains per instance from the teacher is paramount; and 3) after distillation, student chain-of-thoughts are judged by humans as comparable to the teacher's, despite the student having orders of magnitude fewer parameters. We test several hypotheses regarding which properties of chain-of-thought samples are important, e.g., diversity vs. teacher likelihood vs. open-endedness. We release our corpus of chain-of-thought samples and code.


Introduction
Empirical scaling laws suggest that the accuracy of Large Language Models (LLMs) on benchmark tasks can be improved by increasing model size and pre-training data volume (Hoffmann et al., 2022). Beyond these training-time improvements, however, an inference-time strategy dubbed "chain-of-thought" (CoT) prompting (sometimes called "self-rationalization" or "prompting with explanations"; we use these terms interchangeably in this paper), i.e., eliciting verbalizations of predictive processes via key-phrases like "Let's think step-by-step" (Kojima et al., 2022), can similarly improve performance, e.g., Suzgun et al. (2022) demonstrate additional performance gains on a hard subset of the BigBench tasks (BIG-bench collaboration, 2022) using chain-of-thought.

Figure 1: Symbolic Chain-of-Thought Distillation (SCoTD) applied to a student model ranging in size from 125M to 1.3B parameters. We show that fine-tuning on a (diverse and potentially filtered) corpus of expert chain-of-thought demonstrations from a teacher model is an effective strategy to make smaller models capable of chain-of-thought reasoning.
However, chain-of-thought prompting has only been shown to be beneficial for models of sufficient scale (e.g., with more than 60B parameters (Wei et al., 2022b)). In this work, we study whether small language models can be "taught" the capacity for chain-of-thought reasoning by larger language models. We adopt a simple strategy, which we call Symbolic Chain-of-Thought Distillation (SCoTD): first, we sample chain-of-thought rationales from a large language model given (unlabeled) input instances from a dataset; then, we train a smaller language model to predict the sampled rationale and sampled label. This process follows the "symbolic knowledge distillation" paradigm of West et al. (2022), wherein corpora are sampled from a larger language model to serve as training data for a smaller one.

CommonsenseQA. Q: The man working in the attic swatted away a bee, but soon the single bee was an entire what? (a) swarm; (b) pack; (c) countryside; (d) soft drink; (e) field of flowers
GPT-3: The answer must be related to bees, but also connected with being numerous. Of the above answers, only swarm fits the scenario. So the answer is: (a)
OPT-1.3B: The answer must be a swarm of bees. Of the above choices, only soft drink is used to describe a swarm of bees. So the answer is: (c)
OPT-1.3B + SCoTD: The answer must be the name of something that has bees. Of the above choices, only swarms have bees. So the answer is: (a)

QuaRel. Q: Steve is waiting at the gym ... He starts hitting around a hockey puck with his stick. The gym has a carpeted floor and more friction than the ice rink he goes to. Which surface is rougher? (A) carpet (B) ice rink
GPT-3: When something is smoother, it is easier to slide on and easier to pass through. So the carpet is rougher. So the answer is: (A)
OPT-1.3B: _________ is rougher than carpet. Thus, the gym floor is rougher than the ice rink. So the answer is: (A)
OPT-1.3B + SCoTD: When something is rougher, it has more friction. Thus, the gym has more friction than the ice rink he goes to. So the answer is: (A)

OpenBookQA. Q: Magnets may be attracted to (a) clay pots (b) roofing nails (c) paper plates (d) plastic cutlery
GPT-3: Magnets are attracted to metal objects. These objects include roofing nails. So the answer is: (b)
OPT-1.3B: Magnets are attracted to clay pots, roofing nails, paper plates, plastic cutlery. So the answer is: (d)
OPT-1.3B + SCoTD: Magnets may be attracted to some metals, but not to clay pots, roofing nails, paper plates or plastic cutlery. So the answer is: (b)

Table 1: Few-shot chain-of-thoughts produced by GPT-3 (code-davinci-002, the teacher model), OPT-1.3B (the un-distilled student model), and OPT-1.3B + SCoTD (ours), the student model trained using Symbolic Chain-of-Thought Distillation. §3 shows this process significantly improves the task accuracy of the student model in a variety of settings, and in §3.1.1, human evaluations show that, even when the un-distilled student model happens to get the multiple-choice question correct (see QuaRel example), humans tend to prefer OPT-1.3B + SCoTD.
We find that through SCoTD, smaller language models learn to self-rationalize and perform significantly better on 3 commonsense QA tasks compared to learning without rationalizations. This result holds for both supervised and few-shot settings, and across student models of varying scales (125M-1.3B parameters). Performance gains are especially pronounced when applying distilled chain-of-thought models to difficult scenarios like: contrast sets (Gardner et al., 2020) (§3.4; SCoTD significantly outperforms supervised learning on labels) and fully held-out tasks (§3.5; few-shot SCoTD significantly outperforms in-context learning).
Key to the success of this process is sampling a relatively large number of rationales per example from the teacher model (e.g., 30 rationales/example) (Figure 2). This differs from the common prior practice of training with one rationale per example (Camburu et al., 2018; Li et al., 2022a). In ablation studies, we investigate several competing hypotheses about which factors within the corpus matter most: we filter the corpus to CoTs that are assigned high probability by GPT-3, vs. filtering to CoTs that are diverse, vs. filtering to CoTs that explain more open-ended input instances.
While diversity and high probability are reasonable filters that perform well on average, the "null hypothesis" of random downsampling also performs well, suggesting that the sheer volume of rationales is itself a key contributing factor.
We will release code and the corpus of sampled chain-of-thoughts at https://github.com/allenai/cot_distillation.

Symbolic Chain-of-Thought Distillation
Our primary goal is to improve the accuracy of a (relatively small) student language model $S$ on a target classification task $D_{\text{Test}} = \{(x_i, y_i)\}$. We assume access to 1) an (unlabeled) training set $D_{\text{Train}} = \{x_i\}$; and 2) a large teacher language model $T$ (e.g., GPT-3 (Brown et al., 2020)), capable of generating chain-of-thoughts in a few-shot fashion.
Our first step is to curate a set of labeled chain-of-thoughts to serve as few-shot prompts for $T$. For each target task, we sample a small number (e.g., 10) of examples $x_i$ from $D_{\text{Train}}$, provide a gold classification label $y_i$, and manually author a chain-of-thought $z_i$ for each, to form the prompt set $P = \{(x_i, y_i, z_i)\}$. Then, for each $x_i$ in $D_{\text{Train}}$, we sample $N$ chain-of-thoughts $\tilde{z}_i^j$ along with the resulting predictions $\tilde{y}_i^j$ from the teacher model, i.e.,
$$\tilde{z}_i^j, \tilde{y}_i^j \sim T(z, y \mid P, x_i), \quad j = 1, \dots, N.$$
The result of this sampling is a corpus $C = \{(x_i, \{(\tilde{z}_i^j, \tilde{y}_i^j)\}_{j=1}^{N})\}$ of teacher-predicted chain-of-thoughts/labels. Depending on the experimental setting (details in §3), we sometimes filter the entries of $C$; e.g., in the fully supervised case where $D_{\text{Train}}$ instances have associated labels, we discard samples for which the teacher model predicted an incorrect label. Next, we train the student model using the standard language modeling loss, i.e., we maximize
$$\mathbb{E}_{(x, \tilde{z}, \tilde{y}) \sim C}\left[\log S(\tilde{z}, \tilde{y} \mid x)\right].$$

After fine-tuning the student model on the corpus sampled from the teacher, to evaluate the model on a test instance $(x_{\text{test}}, y_{\text{test}})$ from the target task, we decode both a chain-of-thought $\tilde{z}_{\text{test}}$ and a predicted label $\tilde{y}_{\text{test}}$ from the student, and compare $\tilde{y}_{\text{test}}$ to the true label $y_{\text{test}}$. We consider two strategies for decoding. (1) Predict the most likely chain-of-thought and label: $\tilde{z}_{\text{test}}, \tilde{y}_{\text{test}} = \arg\max_{z, y} S(z, y \mid x_{\text{test}})$; this can be approximated by greedy decoding or beam search. (2) There may be several valid chain-of-thoughts for a given question, and as a result, large language models distribute probability mass for a given label across many diverse chain-of-thoughts (Wang et al., 2022b). Thus, it can be beneficial to marginalize out the reasoning paths to find the most consistent answer: $\tilde{y}_{\text{test}} = \arg\max_y \mathbb{E}_{z \sim S(z \mid x_{\text{test}})} S(y \mid z, x_{\text{test}})$. This can be approximated by sampling multiple reasoning paths and taking a majority vote among the predicted answers, dubbed "self-consistency" (Wang et al., 2022b). We experiment with both approaches and discuss them in §3.2.
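For concreteness, the following is a minimal Python sketch of the corpus-construction step. The `teacher_sample` function stands in for querying the teacher LM with the prompt set $P$ plus an input (e.g., via an API) and is an assumption for illustration, not the paper's released code.

```python
# Minimal sketch of SCoTD corpus construction (illustrative, not the released code).
# `teacher_sample` is an assumed wrapper that prompts the teacher T with the
# few-shot prompt set P plus an input x, returning one (CoT, label) sample.

N = 30  # chain-of-thoughts sampled per training instance

def build_corpus(train_inputs, prompt_set, teacher_sample, gold_labels=None):
    """Sample N (CoT, label) pairs per instance from the teacher.

    If gold labels are available (supervised setting), discard samples
    whose predicted label is incorrect.
    """
    corpus = []
    for i, x in enumerate(train_inputs):
        for _ in range(N):
            z, y = teacher_sample(prompt_set, x, temperature=1.0)
            if gold_labels is not None and y != gold_labels[i]:
                continue  # supervised filter: keep only label-correct CoTs
            corpus.append({"input": x, "cot": z, "label": y})
    return corpus
```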
We sample from GPT-3 with a temperature of 1.0. For each training example, we sample $N = 30$ rationales. OPT is fine-tuned with a batch size of 32 and a learning rate of $2 \times 10^{-5}$. We use HuggingFace transformers (Wolf et al., 2019), PyTorch (Paszke et al., 2019), and Accelerate for the implementation. Main experiments can be reproduced on one GPU with 48GB of memory.
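Below is a hedged sketch of one student fine-tuning step under the stated hyperparameters (OPT student, learning rate 2e-5). The sequence template (mirroring the "So the answer is:" format of Table 1) and batching details are assumptions, not the paper's released code.

```python
# Sketch of the student fine-tuning step (matches the stated learning rate;
# the sequence template and other details are illustrative assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def training_step(batch_examples):
    """One gradient step on a batch of {"input", "cot", "label"} dicts."""
    texts = [
        f"{ex['input']}\n{ex['cot']} So the answer is: {ex['label']}"
        for ex in batch_examples
    ]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # Mask padding out of the LM loss; HF shifts labels internally.
    labels = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    loss = model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```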

Results in Default SCoTD Setting
We first consider both a few-shot learning setting and a supervised setting. For the few-shot setting, the only labeled examples available to our teacher/student models are contained in the prompt set $P$ (but we use the unlabeled examples and teacher-generated chain-of-thoughts/labels for training). We also consider the supervised setting, where we assume access to labels in $D_{\text{Train}}$. Supervised SCoTD involves simply discarding the samples within $C$ that do not have the correct label prior to fine-tuning the student: for CommonsenseQA, OpenBookQA, and QuaRel, this results in discarding 40.4%, 45.0%, and 34.2% of chain-of-thoughts, respectively. For the few-shot setting, we decode with the self-consistency approach; for the supervised setting, we decode greedily (both introduced in §2; see discussion in §3.2). We compare SCoTD to 2 baselines: 1) Label-Only: the student is fine-tuned on just the label (in the few-shot setting, the label comes from the teacher and could be wrong; in the supervised setting, we use the gold label), instead of the label along with a CoT; 2) Greedy-CoT: we decode a single CoT per training example from $T$ (instead of sampling $N = 30$). For additional reference, Table 2(a) reports the performance of the student (and teacher) in a variety of few-shot settings prior to applying any distillation: No CoT = few-shot prompting with labeled instances from $P$ but no $z_i$; Greedy and Self-Consistency are prompting with CoT but with different decoding strategies (§2).

Figure 2: For three commonsense QA tasks, accuracy (y-axis) improves significantly as the student is trained on more chain-of-thoughts sampled from the teacher (x-axis). Oversampling chain-of-thoughts is sometimes required to improve student performance beyond the supervised label-only baseline, e.g., as in OpenBookQA.
Table 2(b) gives the performance of the student model after distillation in the supervised and few-shot settings. In all cases, distillation significantly improves the student model, and in all-but-one case, learning with CoT outperforms the label-only distillation baseline. While the student model initially fails to perform CoT through prompting (Table 2(a)), it learns to do so through distillation.
The number of samples. In our default setting, we sample $N = 30$ rationales from the teacher $T$ for each (unlabeled) training instance to serve as our distillation corpus $C$. Figure 2 shows the performance of the student model when it is trained on corpora with fewer sampled CoTs per instance: the results suggest that learning from multiple sampled (albeit noisier) rationales per example is more beneficial than learning from one (most likely) rationale. Do even more rationales bring further improvement? We sampled more rationales from GPT-3 to train the student model; however, this does not bring additional gains. With $N = 50$, performance is similar to $N = 30$: the model achieves 67.0 accuracy on OpenBookQA (vs. 67.0), 67.2 on CommonsenseQA (vs. 67.0), and 84.9 on QuaRel (vs. 83.8).

Human Evaluations
While SCoTD improves task accuracy significantly, we additionally conduct human evaluations to assess the generated chain-of-thoughts themselves (see Table 1 for samples). We sample instances from the CommonsenseQA, OpenBookQA, and QuaRel validation sets (300 instances per dataset), and conduct head-to-head human evaluations to assess:

Q1: Does SCoTD result in higher-quality chain-of-thoughts? Test: OPT-1.3B versus OPT-1.3B + SCoTD. Result: Yes. We assess this hypothesis on two subsets of instances: 1) a pure random sample (N=900); and 2) a set of instances for which both models eventually predicted the correct label (N=654). The second setting focuses more closely on the chain-of-thoughts themselves rather than the predictive accuracy of the model. SCoTD is superior in both settings: in the random-sample setting, SCoTD won in 59% of cases (p < .001), whereas in the correctness-controlled setting, SCoTD won in 61% of cases (p < .001). Results hold with p < .05 for each QA dataset individually.
Q2: Does a SCoTD student surpass the much larger teacher? Test: OPT-1.3B + SCoTD versus text-davinci-002. While the task accuracy of the teacher is still higher in most cases, the student-generated CoTs are comparable (see §6 for more discussion of the disparity between CoT quality and task accuracy). We again evaluate on: 1) a pure random sample (N=900); and 2) a correctness-controlled setting (N=659). The 100x-smaller SCoTD student's generations are competitive in both cases; we cannot reject the null hypothesis of the crowd having equal preferences (OPT-1.3B + SCoTD wins in 47% and 51% of cases, respectively; p > .01). Results hold for each dataset individually, as well.

Self-Consistency for the Student
Wang et al. (2022b) find that, for chain-of-thought prompted models, taking a majority vote over a large set of sampled predicted labels (resulting from a diverse range of CoTs) can improve performance. Our results regarding the effectiveness of sampling $N = 30$ rationales from the teacher during SCoTD are similar in spirit: i.e., we also show performance gains from sampling multiple rationalization chains per instance.
A natural question is: does the student model $S$ exhibit the same phenomenon, i.e., can we sample multiple chain-of-thoughts from it and take a majority vote? We find that the student model can benefit from "self-consistency," but not in all cases. In Table 3, we report performance with/without self-consistency (majority vote among 30 sampled reasoning paths with a temperature of 0.7). When training with filtered CoTs (Table 3(a), bottom rows) or training with few CoTs per example (Table 3(b), when #CoTs/Example is small), the student model does not benefit from self-consistency. Only when we train with multiple rationales per example without filtering (the few-shot setting) is self-consistency beneficial, on CSQA and OpenBookQA. Overall, the results show that student models benefit from being shown a diverse/noisy set of rationales, and that self-consistency can be effectively applied after distillation.
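As a concrete illustration, here is a minimal sketch of the self-consistency vote over student samples; `student_sample` is an assumed helper that draws one (CoT, label) pair from the fine-tuned student at the stated temperature.

```python
# Sketch of self-consistency decoding for the student (Wang et al., 2022b style):
# approximate argmax_y E_{z~S(z|x)} S(y|z, x) by majority vote over sampled paths.
from collections import Counter

def self_consistency_predict(student_sample, x_test, k=30, temperature=0.7):
    """Sample k reasoning paths and return the majority-vote label."""
    votes = Counter()
    for _ in range(k):
        _cot, label = student_sample(x_test, temperature=temperature)
        votes[label] += 1
    return votes.most_common(1)[0][0]
```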

SCoTD across Model and Dataset Sizes
We also verify the effectiveness of SCoTD across model and dataset sizes; in these experiments, we consider the supervised setting.

Data scaling. Figure 3 shows the effect of varying the size of $D_{\text{Train}}$ (for simplicity, we show only performance on CSQA as an example). Learning with CoTs is beneficial at all data scales. Interestingly, SCoTD trained with access to only 40% of the labeled data can surpass the direct supervised label-only model trained with 100% of the labeled corpus; this result aligns with the argument of Zaidan et al. (2007) that providing more explanations from the teacher model can be more beneficial than providing more labels.
Student model size scaling. Figure 4 presents results when varying the size of the student model from 125M to 1.3B parameters on CSQA. For all three model sizes, SCoTD outperforms the standard supervised fine-tuning baseline (Label-Only). Sampling multiple rationales per input instance is an effective strategy for all model sizes.

SCoTD on Challenging Contrast Sets
Can learning with explanations help generalization, as hypothesized by Zaidan et al. (2007)? As a preliminary study, we show that SCoTD enables better generalization to contrast sets. Contrast sets (Gardner et al., 2020) were proposed to evaluate a model's robustness to perturbations around the decision boundary, by asking annotators to modify original test instances in small but meaningful ways that (typically) change the gold label.
We experiment on the IMDB sentiment analysis task (Maas et al., 2011) in the supervised setting, and consider the corresponding IMDB contrast set proposed by Gardner et al. (2020). We train two models on the IMDB training set: Label-Only and SCoTD. For efficiency, we sub-sample 100K examples from the training set and truncate input sequences to 700 tokens. As shown in Figure 5, while both models achieve high performance on the original IMDB test set (96.1% vs. 95.5%, with the Label-Only model performing slightly better), the SCoTD model achieves significantly higher performance on the contrast set: 92.0% vs. 81.6%. This result supports the hypothesis of Zaidan et al. (2007) that explanations can support more robust generalization.

SCoTD on Unseen, Out-of-domain Tasks
Large language models can perform few-shot, in-context learning with chain-of-thought prompting, i.e., generating reasonable chain-of-thoughts on unseen tasks given a few demonstrations (Suzgun et al., 2022). We conduct a preliminary experiment, inspired by Min et al. (2021)'s MetaICL, to test whether student models trained with SCoTD acquire the same ability. We train a supervised SCoTD model on ANLI, CommonsenseQA, and OpenBookQA, and evaluate it on SST-2 (Socher et al., 2013), a sentiment analysis task.
The SCoTD model achieves a few-shot accuracy of 79.6% on the validation set (an example prediction is shown in Figure 6). In contrast, a baseline model that learns with no CoT (i.e., a re-implementation of MetaICL trained on the same 3 source tasks) fails to recognize the input/output format of the new task and predicts answers outside the desired label set, achieving (an effective) 0% accuracy on SST-2. This suggests the potential of including CoTs during instruction/in-context tuning (Wei et al., 2022a; Min et al., 2021).

What Factors are Important for Distillation?
An important factor underlying the performance gains highlighted in §3 was the number of chain-of-thoughts we sampled from the teacher model per instance (more samples = better; Figure 2). Here we ask: is data volume the key contributing factor to the performance improvement? Or are specific aspects of the chain-of-thought samples key?
We design several filters to identify potentially important examples/CoTs among the correct rationales. We apply these filters (introduced below) to $C'$, the corpus sampled from the teacher (with wrong CoTs dropped); each filter operationalizes a different hypothesis about which factors are important to distill. We control for dataset size when filtering, i.e., all filtered corpora have the same number of training CoTs. We downsample with a budget of 5 CoTs per instance on average (in rare cases we end up with fewer, when an instance has fewer than 5 correct CoTs). Then, we train the same student model on each of the filtered corpora, and compare on downstream tasks. If a student model trained on filtered corpus A tends to outperform the student model trained on filtered corpus B, then we argue that the property that produced corpus A is more important. The hypotheses we consider are:

Null hypothesis: data volume. As a null hypothesis, we randomly sub-sample 5 CoTs per instance; this filter operationalizes the assumption that an arbitrary set of samples is sufficient.

Diversity. For each instance, we compute S-BERT embeddings (Reimers and Gurevych, 2019) of each of the chain-of-thoughts, and cluster the resulting embeddings using hierarchical clustering into $k = 5$ clusters. Then, we randomly sample a single chain-of-thought from each cluster: the resulting sample covers all clusters, and thus represents a diverse and representative subset.
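A minimal sketch of this diversity filter follows, assuming the sentence-transformers and SciPy libraries; the specific S-BERT checkpoint and linkage method are assumptions (the paper does not name them here).

```python
# Sketch of the diversity filter: one CoT per hierarchical cluster of
# S-BERT embeddings (illustrative; checkpoint and linkage are assumptions).
import random
from scipy.cluster.hierarchy import fcluster, linkage
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-mpnet-base-v2")  # assumed S-BERT checkpoint

def diversity_filter(cots, k=5):
    """Cluster an instance's CoTs into k clusters; keep one CoT per cluster."""
    if len(cots) <= k:
        return list(cots)
    embeddings = embedder.encode(cots)
    cluster_ids = fcluster(linkage(embeddings, method="average"),
                           t=k, criterion="maxclust")
    kept = []
    for c in set(cluster_ids):
        members = [z for z, cid in zip(cots, cluster_ids) if cid == c]
        kept.append(random.choice(members))
    return kept
```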
Teacher likelihood. For each instance, we keep the 5 CoT samples with the highest per-token log-likelihood according to the teacher model.
Open-endedness. Some instances in each dataset lead to a broader range of chain-of-thought samples than others. For example, on CommonsenseQA, the question "What form of alcohol is made from grapes?" leads to a narrower range of rationalizations than "Why might someone purposefully be going into trance?" We hypothesize that open-ended instances could benefit from relatively more sampled rationales. We sort instances into quintiles based on the number of unique bi-grams in their corresponding 30 CoTs; for high-ranking instances (more unique CoT bi-grams, like the "trance" example above), we keep more rationales, and for low-ranking instances, we keep fewer. We keep 1, 3, 5, 7, and 9 rationales for instances in the respective bins (thus controlling for the total number of CoTs).
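This quintile-based budgeting could be implemented as in the following sketch; the whitespace tokenization and tie-breaking details are assumptions.

```python
# Sketch of the open-endedness filter: rank instances by unique CoT bi-grams,
# then assign per-quintile sampling budgets of 1/3/5/7/9 (avg. 5 per instance).
import random

def unique_bigrams(cots):
    """Count unique bi-grams across an instance's CoTs (whitespace tokenized)."""
    grams = set()
    for cot in cots:
        toks = cot.split()
        grams.update(zip(toks, toks[1:]))
    return len(grams)

def open_endedness_filter(instance_cots, budgets=(1, 3, 5, 7, 9)):
    """instance_cots: one list of CoTs per instance; returns filtered lists."""
    order = sorted(range(len(instance_cots)),
                   key=lambda i: unique_bigrams(instance_cots[i]))
    quintile = max(1, len(order) // len(budgets))
    filtered = [None] * len(instance_cots)
    for rank, i in enumerate(order):
        b = budgets[min(rank // quintile, len(budgets) - 1)]
        filtered[i] = random.sample(instance_cots[i],
                                    min(b, len(instance_cots[i])))
    return filtered
```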
Results. Figure 7 reports the accuracy of the student model when fine-tuned on the different subsampled corpora for the three tasks we consider. Overall, random subsampling is a strong baseline, but we see some evidence that diversity among the rationales is important. None of the models trained on the sub-sampled data approach the model trained on the full 30x/instance CoT set, suggesting that the sheer volume of CoTs is a key driving force behind the performance improvement.

Related Work
Chain-of-thought prompting. As an extension of few-shot prompting (Brown et al., 2020), chain-of-thought prompting elicits intermediate reasoning steps from a model before it commits to a final answer (Wei et al., 2022b; Kojima et al., 2022).

Learning with explanations. Hase and Bansal (2022) discuss how explanations can serve as inputs (Talmor et al., 2020), targets (Hendricks et al., 2016; Fidler et al., 2017; Camburu et al., 2018; Zhou et al., 2020; Narang et al., 2020; Kayser et al., 2021; Wiegreffe et al., 2022), and priors (Zhang et al., 2016; Srivastava et al., 2018) for machine learning models. Chain-of-thought extends earlier efforts that treat explanations as intermediate structures generated at inference time (Rajani et al., 2019). Most related to our work is Li et al. (2022a), who also learn with GPT-3-generated explanations; we show that multiple samples improve significantly over their single-sample method, and we use chain-of-thought prompting at inference time rather than predicting explanations and labels via independent multitasking.
Contemporaneous work. Several contemporaneous papers, Huang et al. (2022), Magister et al. (2022), and Ho et al. (2022), also show that smaller models can benefit from large models' chains of thought. We contribute beyond these by: 1) showing that sampling a large number of chain-of-thoughts is paramount; 2) exploring transfer performance to challenge sets/unseen tasks; and 3) providing analyses that address which factors are important in the teacher corpus.

Conclusion
We demonstrate the effectiveness of Symbolic Chain-of-Thought Distillation (SCoTD): a method that enables smaller language models to effectively use chain-of-thought-style reasoning. We validate the method across several downstream tasks, different student model sizes, different levels of supervision, and in difficult settings (challenge sets, unseen tasks). Our ablations shed light on which factors are particularly important to distill in these chain-of-thoughts. Our concrete recommendations are: 1) sampling multiple and diverse CoTs for each input instance; and 2) performing self-consistency when the teacher CoTs are noisy. Several promising avenues for future work include: 1. exploring SCoTD for generation tasks in addition to classification tasks; 2. scaling up the number of source tasks in §3.5 to generalize to more tasks; 3. using the down-sampling setup introduced in §4 to explore additional hypotheses about what other factors may be important in CoTs.

Limitations
Several limitations of our study include: 1. only English-language chain-of-thoughts/tasks are considered; 2. reliance on GPT-3, which is a closed-source product with an unknown training set (which could itself include some explanations); and 3. focusing on only a single type of student model, OPT.
More broadly, learning from and with explanations carries some specific risks related to automation bias. A model might rationalize its predictions using a seemingly coherent string of natural-language steps, but even if it ultimately gets the prediction correct, there is no guarantee that the predicted output actually results from a process represented by that rationalization. A user might assign excessive confidence to the system based on the chain-of-thought. We observed many cases where the chain-of-thought seemed promising, only for the model to make an incorrect prediction in the final few tokens. Caution should be taken when displaying chain-of-thoughts to users.

Figure 3: Performance on CSQA with different amounts of training instances, from using only 20% of the $x$ in $D_{\text{Train}}$ to using the full set (x-axis). The orange line is the Label-Only baseline. The bottom blue line (marked 1x) is SCoTD with only 1 sampled rationale per instance; above it are SCoTD with 5, 10, 20, and 30 sampled rationales per instance, respectively.

Figure 4: Performance on CSQA with three different student model sizes.

Figure 5: Performance of SCoTD vs. label-only supervision on the original and contrast IMDB datasets, along with sample predictions from SCoTD.

Figure 6: An example SCoTD prediction on the unseen SST-2 task (§3.5).

Figure 7: Downsampling ablations: we subset our chain-of-thought distillation corpus $C$ with a fixed budget according to different criteria. In general, keeping a diverse set of rationales performs well, though a random sample often performs well too.
Table 2: (a) Performance of the student (and teacher) models prior to distillation; (b) performance of the student model after distillation.
Table 3: Student performance with and without self-consistency. Self-consistency is most helpful under the few-shot setting, where we train with unfiltered and noisy CoTs.