MCC-KD: Multi-CoT Consistent Knowledge Distillation

Large language models (LLMs) have showcased remarkable capabilities in complex reasoning through chain of thought (CoT) prompting. Recently, there has been a growing interest in transferring these reasoning abilities from LLMs to smaller models. However, achieving both diversity and consistency in rationales presents a challenge. In this paper, we focus on enhancing these two aspects and propose Multi-CoT Consistent Knowledge Distillation (MCC-KD) to efficiently distill reasoning capabilities. In MCC-KD, we generate multiple rationales for each question and enforce consistency among the corresponding predictions by minimizing the bidirectional KL-divergence between the answer distributions. We investigate the effectiveness of MCC-KD with different model architectures (LLaMA/FlanT5) and various model scales (3B/7B/11B/13B) on both mathematical reasoning and commonsense reasoning benchmarks. The empirical results not only confirm MCC-KD's superior performance on in-distribution datasets but also highlight its robust generalization ability on out-of-distribution datasets.


Introduction
Recently, large language models (LLMs) such as ChatGPT have exhibited impressive emergent capabilities, showcasing their competence in various tasks, including those demanding complex reasoning. While directly providing answers without generating intermediate steps may lead to errors and limited interpretability, chain of thought (CoT) (Wei et al., 2022) prompting enables LLMs to break down reasoning tasks into a series of intermediate steps, guiding the model to generate the subsequent steps before arriving at the final answer. The effectiveness of CoT prompting has been demonstrated on diverse reasoning tasks (Kojima et al., 2022).
Despite the effectiveness of CoT prompting, recent studies (Wei et al., 2022; Magister et al., 2022; Fu et al., 2023) have shown that these reasoning capabilities only manifest in language models with over 100 billion parameters, such as PaLM (540B) (Chowdhery et al., 2022) and GPT-3 (175B) (Brown et al., 2020). These LLMs with massive parameter sizes require significant computational resources during both training and inference, which restricts their deployment on resource-limited platforms. While LLMs could be accessed through API calls, there are still several challenges to overcome, including network instability, difficulty in customizing the models, and privacy concerns.
Therefore, an alternative approach is to deploy smaller language models such as LLaMA-7B/13B (Touvron et al., 2023) and FlanT5-XL/XXL (Chung et al., 2022), which have fewer than 13 billion parameters. Through knowledge distillation (KD) (Hinton et al., 2015), the reasoning capabilities can be transferred from LLMs to these smaller models. However, traditional KD techniques require the teacher model to provide output logits or hidden-layer features, which cannot be readily applied to LLMs due to the limited accessibility of their internals. One potential solution is to leverage rationales generated by LLMs to train smaller models, thereby acquiring their reasoning abilities (Ho et al., 2022; Magister et al., 2022; Fu et al., 2023).
However, these methods for rationale distillation also face several challenges. Firstly, the limited diversity in reasoning paths may lead to a dilemma where the student model simply mimics the superficial style of the teacher model's outputs (Gudibande et al., 2023) or overfits the training data, resulting in limited generalization capabilities. Secondly, despite the existence of multiple rationales leading to the same answer for each given question (as depicted in Figure 1), these methods neglect the consistency among different rationales in reaching the predicted answer when training the student model. Such oversights can undermine the stability of student models during training and impair their generalization capabilities.
To address these challenges, we propose Multi-CoT Consistent Knowledge Distillation (MCC-KD), a novel solution that incorporates two pivotal characteristics. Firstly, this approach leverages multiple diverse rationales for each given question and aims to improve their consistency in predicting the answer. This improvement is expected to enhance the stability and generalizability of the student models. Secondly, we introduce a similarity-based method to facilitate the selection of diverse rationales. MCC-KD draws inspiration from real-world teaching scenarios, where presenting multiple distinct solutions to one problem benefits the student's learning process. With these inherent advantages, MCC-KD enables the smaller models to acquire reasoning capabilities from larger models through effective knowledge distillation.
We conduct extensive experiments with LLaMA (Touvron et al., 2023) and FlanT5 (Chung et al., 2022) on both mathematical reasoning and commonsense reasoning benchmarks. The empirical results demonstrate the effectiveness and superiority of MCC-KD over previous CoT-based knowledge distillation methods. For example, MCC-KD achieves an accuracy improvement from 38.01 to 41.58 on the GSM8K (Cobbe et al., 2021) dataset with LLaMA-7B. Moreover, the generalization experiments reveal that MCC-KD achieves a substantial accuracy improvement, raising the performance from 47.69 to 49.52 on the out-of-distribution dataset ASDiv (Miao et al., 2020) using FlanT5-XXL. These findings provide compelling evidence of the effectiveness and robustness of MCC-KD.

Related Work
In this section, we briefly review the related work on chain of thought and knowledge distillation.

Chain of Thought
The idea of using natural language rationales to solve mathematical problems through a series of intermediate steps is first pioneered by Ling et al. (2017). Then it has been further shown that natural language rationales or intermediate steps can improve language models' performance (Yao et al., 2021; Hase and Bansal, 2022) and robustness (Chen et al., 2022) on various reasoning tasks. Following this idea, chain of thought (Wei et al., 2022) prompting enables LLMs to generate CoTs or rationales themselves using in-context learning (Min et al., 2022) and few-shot prompting (Brown et al., 2020), thereby enhancing the models' capabilities to solve complex reasoning tasks. Wang et al. (2022b) introduce a multi-round voting mechanism to further improve CoT prompting. On the other hand, Kojima et al. (2022) propose zero-shot-CoT prompting that leverages zero-shot prompting to guide LLMs, revealing their capabilities to generate CoTs or rationales without the need for manually-written contextual prompts. However, Hoffmann et al. (2022) and Chowdhery et al. (2022) unveil that CoT prompting requires the model's parameters to reach a certain scale to be effective.

Knowledge Distillation
Knowledge distillation (KD) (Hinton et al., 2015) aims to train smaller models by distilling knowledge from larger models, reducing model size while preserving high performance and generalization abilities. However, existing methods, such as response-based KD (Hinton et al., 2015; Turc et al., 2019), feature-based KD (Sun et al., 2019), and relation-based KD (Park et al., 2021), all require access to the internal parameters of the teacher model, which is often impractical for LLMs.
Considering that many LLMs are capable of generating high-quality rationales, an alternative approach to knowledge distillation is to leverage the rationales generated by LLMs as distillation training data. Motivated by this, previous works (Shridhar et al., 2022; Hsieh et al., 2023; Ho et al., 2022; Magister et al., 2022) employ LLMs as teacher models to generate chain of thought data as rationales, using the data to transfer the reasoning capabilities into smaller student models. Furthermore, recent work (2023) explores a teacher-feedback mechanism relying on LLMs to generate rationales for challenging instructions, aiding student models to learn from difficult samples. Wang et al. (2022a) introduce a pipeline consisting of a rationalizing module and a reasoning module, resembling the teacher-student architecture, but it still requires LLMs to generate rationales for smaller models during inference. The key distinction between these previous works and ours lies in that we explore the consistency among diverse rationales when training the student model to improve its stability.

Method
We introduce Multi-CoT Consistent Knowledge Distillation (MCC-KD), an approach designed to enhance the generalization and robustness of smaller student models during knowledge distillation. In particular, MCC-KD enforces consistency among diverse chains of thought (CoTs) generated by the teacher LLMs in three key steps. Firstly, given an input question, we utilize zero-shot CoT prompting (Kojima et al., 2022) to obtain multiple CoTs as rationales from an LLM such as GPT-3.5. Secondly, we filter out rationales that exhibit limited diversity. Lastly, we impose constraints on the outputs from multiple rationales to facilitate their consistency. The overall framework of MCC-KD is illustrated in Figure 2.

Rationale Extraction
Following Kojima et al. (2022), we prompt the teacher model to generate rationales for each question. Formally, let x represent a question, r denote a rationale generated by the teacher, and a indicate the final answer predicted by the teacher from x and r. A resulting training sample is constructed by concatenating x, r, and a in the following format: <x> <r> Therefore, the answer is <a>.
To ensure the abundance and diversity of the generated rationales, we employ a combination of techniques. Firstly, we increase the sampling temperature τ (τ = 1.3 in this study) along with the number of sampling iterations. This approach aids in generating a greater variety of rationales. Secondly, to ensure the accuracy of the generated samples, we conduct verification for both the answer and the rationale. For answer verification, we compare the predicted answer a with the ground-truth answer to confirm its correctness. For rationale verification, we find that the correctness of rationales typically aligns with the correctness of answers under most circumstances.
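As a minimal sketch of this construction (the helper names are our own, not from the paper), each training sample concatenates the question, rationale, and answer, and a sampled rationale is kept only when the teacher's predicted answer matches the ground truth:

```python
def build_sample(question: str, rationale: str, answer: str) -> str:
    """Concatenate <x>, <r>, and <a> in the paper's training format."""
    return f"{question} {rationale} Therefore, the answer is {answer}."

def answer_verified(predicted: str, gold: str) -> bool:
    """Answer verification: keep the rationale only if the teacher's
    predicted answer matches the ground-truth answer."""
    return predicted.strip() == gold.strip()
```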

Rationale Filtering
The diversity of reasoning rationales plays a crucial role in transferring reasoning capabilities from teacher LLMs to student models (Gudibande et al., 2023). However, based on our observations, the teacher model still tends to generate similar rationales even with different sampling temperatures (as shown in Table 16 in Appendix B.3). In order to obtain more diverse rationales, we develop an N-gram-based filtering strategy. For each rationale, we convert it into a set of N-gram (specifically, 3-gram in this study) segments. Subsequently, we calculate the Jaccard similarity among these sets. To be more specific, considering that there are M rationales {r_1, r_2, ..., r_M} extracted for the input question, each rationale r_i is decomposed into a set S_i of segments. We then compare each pair of segment sets using the Jaccard similarity score to identify the most similar rationales:

(k, l) = argmax_{i ≠ j} |S_i ∩ S_j| / |S_i ∪ S_j|, (1)

where k and l represent the indices of the selected rationale pair, r_k and r_l, respectively. Subsequently, we randomly retain one of the two rationales while discarding the other. This iterative process continues until we retain a predefined number, denoted as K (set to 5 in our experiments), of rationales for the given question.
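The filtering loop above can be sketched as follows (a simplified illustration with word-level 3-grams; for determinism this sketch always drops the second rationale of the most similar pair, whereas the paper retains one of the two at random):

```python
from itertools import combinations

def ngram_set(rationale: str, n: int = 3) -> set:
    """Decompose a rationale into its set of word-level n-gram segments."""
    tokens = rationale.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two segment sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_rationales(rationales, k: int = 5, n: int = 3):
    """Iteratively discard one rationale from the most similar pair
    until only k rationales remain per question."""
    pool = list(rationales)
    while len(pool) > k:
        sets = [ngram_set(r, n) for r in pool]
        # Find the most similar pair of rationales (indices i < j).
        i, j = max(combinations(range(len(pool)), 2),
                   key=lambda p: jaccard(sets[p[0]], sets[p[1]]))
        pool.pop(j)  # paper: keep one of the pair at random
    return pool
```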

Multi-CoT Consistent Distillation
As previously discussed, LLMs typically generate multiple valid rationales for a given input question. This work is built on the assumption that ensuring consistency among the predicted answers is crucial when training the student model with these rationales. For the input question and the K retained rationales (r_1, r_2, ..., r_K) after filtering, we ensure consistency in the predictions of the student model by minimizing the variations in the probabilities of the answers from these different rationales.
Single-token answer We first consider the scenario where the answer consists of a single token, where the prediction corresponds to a probability distribution over the vocabulary. For a given rationale r_i, let p represent the predicted distribution obtained for the answer, and for another rationale r_j, let q represent the predicted distribution. In order to ensure consistency between these two rationales, we apply the bidirectional KL-divergence between their corresponding distributions as our training objective:

L_kl(p, q) = 1/2 * Σ_{v=1}^{V} [ p_v log(p_v / q_v) + q_v log(q_v / p_v) ], (2)

where V denotes the size of the vocabulary.
Multi-token answer As for an answer consisting of T tokens, each token having its own distribution, we define P = {p_1, p_2, ..., p_T} as the set of predicted distributions for the answer obtained through rationale r_i, where p_t represents the probability distribution of the t-th token in the answer. Similarly, we use Q = {q_1, q_2, ..., q_T} to represent the predicted distributions obtained through rationale r_j. To achieve multi-CoT consistency, we calculate the bidirectional KL-divergence for each token according to Equation 2 and take the average divergence to obtain the training objective:

L_kl(P, Q) = 1/T * Σ_{t=1}^{T} L_kl(p_t, q_t). (3)

Pairwise rationale sampling Since there are K rationales available for each question, we randomly select two distinct ones from the set of rationales in each training epoch to compute the L_kl loss.
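These steps can be sketched in a few lines (pure Python over explicit probability lists for clarity; a real implementation would operate on model logits with a tensor library, and the 1/2 averaging of the two KL directions is one common convention for the bidirectional divergence):

```python
import math
import random

def bidirectional_kl(p, q):
    """Bidirectional KL over the vocabulary:
    0.5 * (KL(p || q) + KL(q || p))."""
    kl_pq = sum(pv * math.log(pv / qv) for pv, qv in zip(p, q) if pv > 0)
    kl_qp = sum(qv * math.log(qv / pv) for pv, qv in zip(p, q) if qv > 0)
    return 0.5 * (kl_pq + kl_qp)

def consistency_loss(P, Q):
    """Average the per-token bidirectional KL over the T answer tokens."""
    assert len(P) == len(Q)
    return sum(bidirectional_kl(p, q) for p, q in zip(P, Q)) / len(P)

def sample_pair(rationales):
    """Pairwise rationale sampling: pick two distinct rationales
    from the K retained ones in each training epoch."""
    return random.sample(rationales, 2)
```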

Overall Objective
The overall objective function is defined as a combination of the cross-entropy loss (L_ce) in traditional causal language modeling, computed on rationale and answer tokens, and the multi-CoT consistency loss (L_kl), which ensures consistency in the model's answer distribution. The objective function can be represented as follows:

L = L_ce + α * L_kl, (4)

where α is a hyperparameter used to adjust the strength of the KL-divergence constraint.

Experimental Setup
In this section, we present the datasets and backbone models utilized in our experiments.

Datasets
To evaluate our method, we adopt both mathematical reasoning and commonsense reasoning tasks, following Ho et al. (2022) and Fu et al. (2023).

Backbone Models
We use GPT-3.5-Turbo as the teacher model and prompt it to generate chain-of-thought samples (rationales), which are then filtered following the process introduced in Section 3.

Baseline Methods
In order to evaluate the effectiveness of MCC-KD, we conduct experiments and compare its performance with existing CoT-based distillation methods (Ho et al., 2022; Fu et al., 2023; Magister et al., 2022), which utilize LLMs as teacher models to generate rationales and distill their reasoning abilities directly into smaller student models (checkpoints available at https://huggingface.co/models). For a fair comparison, we also implement a baseline method called Vanilla KD on our training datasets. Vanilla KD is a CoT-based distillation method that does not incorporate diversity filtering or the multi-CoT consistency constraint.

Results and Analysis
In this section, we present the main results, ablation studies, and additional experiments.

Main Results
The main results for mathematical and commonsense reasoning tasks are provided in Table 2. The baseline results are obtained from their respective original papers (Ho et al., 2022; Magister et al., 2022; Fu et al., 2023). We evaluate the performance of the teacher model through our own testing. It can be observed that MCC-KD outperforms current baseline methods on all mathematical reasoning tasks, namely GSM8K, ASDiv, and SVAMP, when compared with models of similar size. These results highlight significant improvements achieved by MCC-KD. The performance gap between the FlanT5 (Vanilla KD) that we implement and Fu et al. (2023) can be explained by the difference in teacher model selection: Fu et al. (2023) utilize code-davinci-002 as the teacher model, while we utilize GPT-3.5-Turbo. For the commonsense reasoning tasks, MCC-KD surpasses current baseline methods and even exceeds the performance of the teacher model on the CommonsenseQA dataset. This outcome clearly demonstrates the effectiveness of MCC-KD in addressing commonsense reasoning tasks. Notably, the distilled models are able to generate reasoning paths directly, eliminating the necessity for any CoT prompting throughout our experiments.

Ablation Study
This ablation study aims to examine the influence of the components in MCC-KD. Results are averaged over three runs using randomly selected seeds.
Multi-CoT consistency constraint To assess the impact of the multi-CoT consistency constraint, we perform ablation experiments on variants of MCC-KD using LLaMA-7B without the consistency constraint. As presented in Table 3, we observe a significant decrease in performance on both mathematical and commonsense reasoning tasks when the consistency constraint is removed.

Rationale filtering
We then explore the effectiveness of rationale filtering in MCC-KD. In the setting without rationale filtering, we randomly sample 5 rationales for each question, maintaining the same quantity as the setting with rationale filtering. Additionally, we ensure that the correctness rate of the selected rationales remains consistent between both settings. As demonstrated in Table 3, we observe a noticeable decline in performance after removing the rationale filtering process, highlighting the critical importance of rationale diversity.
Student model architecture Furthermore, we examine the effectiveness of MCC-KD on two distinct model architectures: FlanT5-XL for the encoder-decoder Transformer architecture (Vaswani et al., 2017) and LLaMA-7B for the decoder-only architecture. We compare MCC-KD with the vanilla knowledge distillation (KD) approach. As presented in Table 2, MCC-KD consistently enhances performance in comparison to vanilla KD across model architectures.

Out-of-Distribution Generalization
In line with the work of Fu et al. (2023), we explore the ability of MCC-KD to enhance the generalization capabilities of models. We apply MCC-KD to the in-distribution mathematical reasoning dataset (GSM8K) and select the optimal checkpoints for evaluation on out-of-distribution mathematical reasoning datasets (ASDiv, SVAMP, MultiArith, SingleEq, and AddSub). Similarly, we assess the generalization performance on commonsense reasoning tasks using both the in-distribution dataset (CommonsenseQA) and out-of-distribution datasets (StrategyQA, Date Understanding, Tracking Shuffled Objects, and Coin Flip). The results are presented in Table 4 and Table 5, respectively. In mathematical reasoning, MCC-KD demonstrates further improvements in models' generalization capabilities compared to the findings of Fu et al. (2023) as well as the vanilla KD approach.
In commonsense reasoning, MCC-KD exhibits a consistent trend of enhancing generalization capabilities when compared to the vanilla KD approach.

Diversity of Rationales
In this section, we delve into the significance of rationale diversity within the context of MCC-KD. We contend that diversity among the rationales generated by the teacher model is a crucial factor for effective reasoning learning by the student model. To measure the degree of diversity among rationales, we employ the Jaccard similarity metric, where a higher score indicates greater similarity and hence lower diversity. By manipulating the diversity of training instances, we assess the efficacy of MCC-KD. In our experiments, we utilize the LLaMA-13B model as the student model and incorporate two rationales for each question during training. As illustrated in Figure 3, the performance of MCC-KD improves with increasing diversity among the rationales, as observed on both the ASDiv and SVAMP development sets.

The Number of Rationales
We also examine the performance of MCC-KD with varying numbers of rationales on the SVAMP and ASDiv datasets. Note that for each training epoch, our method randomly selects two distinct rationales from a set of K rationales per question as the training instances. Hence, we modify the value of K for each question to assess its impact. To ensure sufficient training, we empirically set the number of training epochs to 24. The student model employed in these experiments is LLaMA-7B. As depicted in Figure 4, we observe that as the number of rationales increases, the model's performance on both the ASDiv and SVAMP datasets improves correspondingly. Specifically, when the number of rationales is increased from 2 to 5, there is a significant enhancement in performance on both datasets. However, when the number is further increased from 5 to 10, the performance gains become less pronounced. Therefore, taking into account computational efficiency, we opt to use 5 rationales in our experiments.

Rationale Correctness
Ensuring that LLMs generate completely accurate rationales poses a challenge, especially when dealing with complex reasoning datasets. Striking a balance between a higher correctness rate and a larger data quantity is crucial, considering the expenses associated with LLM API calls. To investigate the impact of the correctness rate of teacher-generated rationales, we randomly select two rationales from the raw rationales as training instances for each question. We approximate the correctness of rationales by comparing the predicted answers with the ground-truth answers. Figure 5 illustrates the impact of the correctness rate of the generated rationales on the ASDiv and SVAMP datasets, using LLaMA-7B as the student model. As depicted in the figure, there is minimal performance difference when the correctness rate exceeds 90%. However, a significant degradation in model performance is observed when the correctness rate falls to around 80% or below. To maintain high performance, we ensure a correctness rate of over 90% throughout our experiments.

Consistency Weight
In the training objective of MCC-KD, we introduce the hyperparameter α to balance the multi-CoT consistency constraint. To investigate its impact on the performance of LLaMA-13B, we present results for two different values of α on the GSM8K and CommonsenseQA datasets in Figure 6. The results indicate that our method prefers a smaller α of 0.01 on the mathematical reasoning task of GSM8K, while it favors a larger α of 0.1 on the commonsense reasoning task of CommonsenseQA. Therefore, we empirically select α values close to 0.01 for the mathematical reasoning datasets (GSM8K, ASDiv, and SVAMP), and close to 0.1 for the commonsense reasoning dataset (CommonsenseQA) throughout our experiments.

Combining with Self-Consistency
We note that self-consistency (Wang et al., 2022b) also employs a strategy for maintaining consistency among diverse rationales, which shares similarities with our method. However, self-consistency requires LLMs to generate multiple rationales and answers during the inference phase, determining the final answer based on the highest vote count. In contrast, MCC-KD imposes consistency constraints on diverse rationales during the training phase.
In this section, we investigate the incorporation of self-consistency (SC, 5 rationales for voting) and MCC-KD on the GSM8K dataset, employing the LLaMA-7B model. As observed in Table 6, while both MCC-KD and self-consistency improve upon the Vanilla KD approach, the performance of MCC-KD can be further enhanced through the application of self-consistency. Note that MCC-KD and self-consistency work in distinct phases, making direct comparisons of their results inappropriate.
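For reference, the SC voting step reduces to a majority vote over the answers decoded from the sampled rationales (a minimal sketch; answer parsing from generated text is omitted):

```python
from collections import Counter

def self_consistency_vote(answers):
    """Return the most frequent answer among those decoded from
    multiple sampled rationales (5 in our SC setting)."""
    return Counter(answers).most_common(1)[0][0]
```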

Conclusion
In this paper, we propose Multi-CoT Consistent Knowledge Distillation (MCC-KD) to transfer reasoning capabilities from large language models (LLMs) to smaller models. The primary objective is to address the diversity and consistency challenges present in existing knowledge distillation methods for this purpose. Our approach leverages multiple rationales for each given question and focuses on improving their consistency in predicting the answer. Extensive experiments are conducted using different model architectures, including LLaMA and FlanT5, and a range of model scales, such as 3B, 7B, 11B, and 13B. The experiments cover both mathematical and commonsense reasoning benchmarks. The results clearly demonstrate the superior performance of MCC-KD on both in-distribution and out-of-distribution tasks. These findings confirm that MCC-KD enhances the stability and generalizability of the student models.

Limitations
There are three potential limitations of our work. First, the reliance on LLMs for generating rationales introduces costs associated with API calls. Second, there still exists a significant gap between the student model and the teacher model on mathematical reasoning tasks, requiring future efforts to reduce this disparity. Third, this work explores only a single teacher model, overlooking the potential benefits and insights that could arise from considering different LLMs as teachers.

A.1 Model Configurations
We employ four distinct models, namely FlanT5-XL, FlanT5-XXL, LLaMA-7B, and LLaMA-13B, as our student backbone models. The FlanT5 models utilize an encoder-decoder Transformer architecture, whereas the LLaMA models adopt a decoder-only Transformer architecture. Note that the FlanT5 models have undergone instruction tuning, whereas the LLaMA models have not. Based on our empirical observations, the FlanT5 models exhibit stronger performance on commonsense reasoning tasks, while the LLaMA models excel at mathematical reasoning tasks. The configurations of these models are provided in Table 7.

A.2 Experiment Settings
We train all the student models on GeForce RTX 4090 GPUs using the model parallelism technique. To further accelerate training and save memory, we utilize quantization techniques and LoRA (Hu et al., 2021). We apply LoRA to the four weight matrices in the attention module (W_q, W_k, W_v, W_o) and the three weight matrices in the MLP module. Similar to the objective function proposed by Hinton et al. (2015), we search for the temperature parameter in the KL function.
Throughout all experiments, we use Adam (Kingma and Ba, 2014) as our optimizer, and we set the learning rate to 1e-5 for most datasets. With LLaMA-7B as the backbone model, we employ gradient accumulation, with a mini-batch size of 2 and 2 accumulation steps. For different datasets, the value of α ranges from 0.01 to 0.1. See Table 8 for more details.

A.3 Smaller Student Model
Prior endeavors (Sanh et al., 2019; Sun et al., 2019) in the field of knowledge distillation have primarily focused on using smaller models as the recipients of knowledge transfer. The primary focus of our research resides in the transfer of reasoning capabilities from LLMs to more compact counterparts, employing a chain-of-thought (CoT) prompting approach. Notably, existing research (Wei et al., 2022; Fu et al., 2023; Kojima et al., 2022) has demonstrated that the capacity for intricate reasoning within LLMs is typically inherent to larger models with over 100 billion parameters.
We conduct an experiment utilizing FlanT5-base (250M) on the SVAMP and ASDiv datasets. The results presented in Table 9 indicate that when employing a smaller model as the student, the improvements achieved through knowledge distillation are relatively modest in comparison to those observed with larger counterparts. This observation suggests that the limited improvement attained with a smaller student model stems primarily from the model's inherent reasoning capacity rather than the effectiveness of the distillation method.

B.1 Mathematical Reasoning Datasets
Since the MultiArith (Roy and Roth, 2015), SingleEq (Koncel-Kedziorski et al., 2015), and AddSub (Hosseini et al., 2014) datasets are too small, we use them for our out-of-distribution experiments only (Table 11). We do not train on the AQuA (Ling et al., 2017) dataset.

B.2 Commonsense Reasoning Datasets
For commonsense reasoning, we select CommonsenseQA (Talmor et al., 2019) as our training dataset. We perform a random split of 1221 instances from the original training set to create the test set; this number corresponds to the size of CommonsenseQA's original development set (Table 10). We do not use Date Understanding, Tracking Shuffled Objects, or Coin Flip as training datasets due to their small size (Table 11), but we use them to evaluate the model's out-of-distribution performance. The StrategyQA (Geva et al., 2021) dataset may contain private or personal information which GPT-3.5-Turbo refuses to answer (Table 12). For the CommonsenseQA dataset, GPT-3.5-Turbo may also refuse to answer certain questions (Table 13). However, the proportion of such cases is much smaller compared with StrategyQA, and the rationales still provide useful information even when the teacher model refuses to give a choice. For all commonsense reasoning datasets, we utilize a choices-answer format for each question. The CommonsenseQA dataset has five choices per question, and Tracking Shuffled Objects has three choices per question. See Table 14 for more details. There are no choices for the answers in the Coin Flip and StrategyQA datasets. Since the answers in Coin Flip and StrategyQA are in the form of "Yes" or "No", we transform them into a binary choices-answer format, for example: (A) Yes and (B) No. Examples of the transformed datasets can be seen in Table 12 and Table 15.
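The Yes/No transformation described above is straightforward; a minimal sketch (the helper name is ours):

```python
def to_binary_choices(question: str, answer: str):
    """Convert a Yes/No item (Coin Flip, StrategyQA) into the binary
    choices-answer format used for the other commonsense datasets."""
    q = f"{question} Answer Choices: (A) Yes (B) No"
    a = "(A) Yes" if answer.strip().lower() == "yes" else "(B) No"
    return q, a
```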

B.3 Diversity of Generated Rationales
When the sampling temperature is too low (e.g., τ = 1), the teacher model tends to generate the same rationales for the same question repeatedly. On the other hand, when the temperature is too high (e.g., τ = 1.5), the quality of the generated rationales decreases significantly. Hence, we choose an appropriate temperature of τ = 1.3 to balance the generation of diverse rationales while maintaining their quality. However, we observe that even with sampling temperature τ = 1.3, the teacher model may still occasionally generate somewhat similar rationales, as shown in Table 16. We argue that diversity among teacher-generated rationales is a crucial factor, as higher diversity implies a greater variety of solutions. We therefore believe that similarity between rationales should be minimized for effective distillation.
Question: Is Tange Sazen hypothetically an ideal choice for a secretary job? Answer Choices: A. Yes B. No
Response: Sorry, as an AI language model, I cannot provide a definite answer as to whether Tange Sazen would be an ideal choice for a secretary job as I do not have sufficient knowledge on that individual. However, if provided with their relevant details, I could list out their qualities that may make them suitable for a secretary position.

Question: Do Bing (search engine) searches earn the searcher more than competitors do? Answer Choices: A. Yes B. No
Response: As an AI language model, I do not have access to current market data or the ability to keep information about such kind of advertisement policies which exactly pay if someone searches.
Table 12: Examples of questions that GPT-3.5-Turbo refuses to answer from the StrategyQA dataset.
Question: Where do you put your grapes just before checking out? Answer Choices: A. mouth B. grocery cart C. super market D. fruit basket E. fruit market
Response: When grocery shopping and you find yourself with grapes you want to purchase, it is customary to put them in your grocery cart. The cart is usually provided by the store as a means to transport your selected items. Some stores may also offer fruit baskets or bags specifically made for holding grapes to prevent them from rolling around or getting squished. In the case of a fruit market, you may place your grapes in a basket provided by the vendors. As an AI language model, I do not have a personal choice, but I can provide information based on facts.
Question: Before getting a divorce, what did the wife feel who was doing all the work? Answer Choices: A. harder B. anguish C. bitterness D. tears E. sadness
Response: It is difficult to determine the exact emotions felt by the wife who was doing all the work before getting a divorce as everyone's experiences and emotions are unique. However, she may have felt overwhelmed and frustrated by her spouse's perceived lack of effort or help in shouldering the household responsibilities. She may have also felt resentment and bitterness towards her spouse for not sharing the workload and contributing equally. Ultimately, these feelings could have led to sadness, anguish, or tears as the relationship deteriorated and ended in a divorce. As an AI language model, I do not have personal choices or emotions.
Table 13: Examples of questions that GPT-3.5-Turbo refuses to answer from the CommonsenseQA dataset.

Figure 1: An example question from the GSM8K dataset (Cobbe et al., 2021) and the rationales generated by GPT-3.5-Turbo. Colored texts indicate the probabilities of the shared answer across different rationales.

Figure 2: Overview of the MCC-KD framework. First, an LLM is leveraged to generate multiple rationales. Subsequently, a filtering process is employed to preserve highly diversified rationales. The student model is then trained while enforcing consistency of the final answer predictions across the various rationales.
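The consistency objective sketched in Figure 2 minimizes, per the abstract, the bidirectional KL-divergence between the answer distributions obtained from different rationales. A pure-Python sketch of that term follows; in practice it would be computed over model output distributions in a deep learning framework, and the smoothing constant `eps` is an assumption.

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete distributions,
    given as lists of probabilities. eps avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def bidirectional_kl(p, q):
    """Symmetric consistency term KL(p || q) + KL(q || p), illustrating
    the objective applied to answer distributions of paired rationales."""
    return kl(p, q) + kl(q, p)


# Example: answer distributions over five choices from two rationales.
p = [0.7, 0.1, 0.1, 0.05, 0.05]
q = [0.6, 0.2, 0.1, 0.05, 0.05]
loss = bidirectional_kl(p, q)  # zero iff the two distributions agree
```

Minimizing this term pushes the student to assign the same answer probabilities regardless of which rationale it conditions on.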

Figure 3: The performance of MCC-KD on the ASDiv and SVAMP development sets with different rationale diversities. JS stands for Jaccard similarity.

Figure 4: The performance of MCC-KD on ASDiv and SVAMP with different numbers of rationales.

Figure 5: The performance of MCC-KD on the ASDiv and SVAMP datasets with different correctness rates of the teacher-generated rationales.

Table 1: Model configurations. For FlanT5 models, the encoder and decoder have input lengths of 196 and 384, respectively. For LLaMA models, the input length is 512. To optimize memory usage and accelerate training, we leverage LoRA (Hu et al., 2021) and employ varying data precision. All models are trained on multiple GPUs and adopt a greedy decoding strategy.
We retain K = 5 rationales for each question across all our training datasets. As for the student models, we employ the instruction-tuned FlanT5-XL/XXL (3B/11B) (Chung et al., 2022) and LLaMA-7B/13B (Touvron et al., 2023), which are initialized with pre-trained weights obtained from Hugging Face 1 . To accelerate training and conserve GPU memory, we apply LoRA (Hu et al., 2021) throughout all of our experiments. The model configurations are summarized in Table 1, and additional details regarding the settings can be found in Appendix A.

Table 2: Results on mathematical and commonsense reasoning tasks. The reported results are averaged over three runs using randomly selected seeds. Baseline results from other studies (Ho et al., 2022; Magister et al., 2022; Fu et al., 2023) are included, while the performance of GPT-3.5-Turbo is assessed through our own evaluation.

Table 3: Results of the ablation study of multi-CoT consistency and diversity filtering on the development sets.

Table 4: Out-of-distribution performance of MCC-KD on mathematical reasoning. We train our models on the in-distribution dataset (GSM8K) and evaluate the best checkpoints on the out-of-distribution datasets (ASDiv, SVAMP, MultiArith, SingleEq, and AddSub).

Table 5: Out-of-distribution performance of MCC-KD on commonsense reasoning. We train our models on the in-distribution dataset (CommonsenseQA) and evaluate the best checkpoints on the out-of-distribution datasets (StrategyQA, Date Understanding, Tracking Shuffled Objects, and Coin Flip).

Table 6: Results of combining our MCC-KD approach and the self-consistency (SC) method.
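Self-consistency decodes multiple reasoning paths and takes a majority vote over the final answers. A minimal sketch of that voting step, not the authors' evaluation code, is:

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over the final answers extracted from multiple
    sampled reasoning paths (the self-consistency decoding strategy)."""
    return Counter(answers).most_common(1)[0][0]


# Example: five sampled rationales for the Figure 1 question might
# yield these (illustrative) final answers:
pred = self_consistency(["144,000", "144,000", "48,000", "144,000", "144,000"])
```

Combining this with MCC-KD simply applies the vote to the distilled student's sampled outputs.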

Table 8: Hyperparameter settings and training cost of our method with LLaMA-7B on different datasets.

Table 10: Statistics of the training, development, and test sets. We do not use the full dataset, as it has 100,000 examples, which is too costly to infer with an LLM.

Table 11: Datasets used to evaluate the model's out-of-distribution generalization ability, and their sizes.