The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning

Language models (LMs) with fewer than 100B parameters are known to perform poorly on chain-of-thought (CoT) reasoning, in contrast to large LMs, when solving unseen tasks. In this work, we aim to equip smaller LMs with the step-by-step reasoning capability by instruction tuning with CoT rationales. To achieve this goal, we first introduce a new instruction-tuning dataset called the CoT Collection, which augments the existing Flan Collection (including only 9 CoT tasks) with an additional 1.84 million rationales across 1,060 tasks. We show that CoT fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables smaller LMs to have better CoT capabilities on unseen tasks. On the BIG-Bench-Hard (BBH) benchmark, we report an average improvement of +4.34% (Flan-T5 3B) and +2.60% (Flan-T5 11B) in zero-shot task accuracy. Furthermore, we show that instruction tuning with the CoT Collection allows LMs to possess stronger few-shot learning capabilities on 4 domain-specific tasks, resulting in an improvement of +2.24% (Flan-T5 3B) and +2.37% (Flan-T5 11B), even outperforming ChatGPT utilizing demonstrations up to the maximum input length by a +13.98% margin. Our code, the CoT Collection data, and model checkpoints are publicly available.


Introduction
Language models (LMs) pre-trained on massive text corpora can adapt to downstream tasks in both zero-shot and few-shot learning settings by incorporating task instructions and demonstrations (Brown et al., 2020; Wei et al., 2021; Sanh et al., 2021; Mishra et al., 2022; Wang et al., 2022b; Iyer et al., 2022; Liu et al., 2022b; Chung et al., 2022; Longpre et al., 2023; Ye et al., 2023). One approach that has been particularly effective in enabling LMs to excel at a multitude of tasks is Chain-of-Thought (CoT) prompting, which makes LMs generate a rationale to derive their final prediction in a sequential manner (Wei et al., 2022b; Kojima et al., 2022; Zhou et al., 2022; Zhang et al., 2022; Yao et al., 2023).
While CoT prompting works effectively for large LMs with more than 100 billion parameters, it does not necessarily confer the same benefits to smaller LMs (Tay et al., 2022; Suzgun et al., 2022; Wei et al., 2022a; Chung et al., 2022). The requirement of a large number of parameters consequently results in significant computational cost and accessibility issues (Kaplan et al., 2020; Min et al., 2022; Liu et al., 2022b; Mhlanga, 2023; Li et al., 2023).
Recent work has focused on empowering relatively smaller LMs to effectively solve novel tasks as well, primarily through fine-tuning with rationales (denoted as CoT fine-tuning) and applying CoT prompting on a single target task (Shridhar et al., 2022; Ho et al., 2022; Fu et al., 2023). However, solving a single task does not adequately address the issue of generalization to a broad range of unseen tasks. While Chung et al. (2022) leverage 9 publicly available CoT tasks during instruction tuning to solve multiple unseen tasks, the imbalanced ratio compared to the 1,827 tasks used for direct fine-tuning results in poor CoT results across smaller LMs (Longpre et al., 2023). In general, the community still lacks a comprehensive strategy to fully leverage CoT prompting to solve multiple unseen novel tasks in the context of smaller LMs.
To bridge this gap, we present the COT COLLECTION, an instruction-tuning dataset that augments the Flan Collection (Longpre et al., 2023) with 1.84 million rationales across 1,060 tasks. We fine-tune Flan-T5 (3B & 11B) using the COT COLLECTION and denote the resulting model as CoT-T5. We perform extensive comparisons of CoT-T5 and Flan-T5 under two main scenarios: (1) zero-shot learning and (2) few-shot learning.
In the zero-shot learning setting, CoT-T5 (3B & 11B) outperforms Flan-T5 (3B & 11B) by +4.34% and +2.60% on average accuracy across 27 datasets from the Big Bench Hard (BBH) benchmark (Suzgun et al., 2022) when evaluated with CoT prompting. During ablation experiments, we show that CoT fine-tuning T0 (3B) (Sanh et al., 2021) on a subset of the CoT Collection, specifically the 163 training tasks used in T0, shows a performance increase of +8.65% on average accuracy across 11 datasets from the P3 evaluation benchmark. Moreover, we translate 80K instances of the COT COLLECTION into 5 different languages (French, Japanese, Korean, Russian, Chinese) and observe that CoT fine-tuning mT0 (3B) (Muennighoff et al., 2022) on each language results in a 2x ∼ 10x performance improvement on average accuracy across all 5 languages from the MGSM benchmark (Shi et al., 2022).
In the few-shot learning setting, where LMs must adapt to new tasks with a minimal number of instances, CoT-T5 (3B & 11B) exhibits a +2.24% and +2.37% improvement on average compared to using Flan-T5 (3B & 11B) as the base model on 4 different domain-specific tasks. Moreover, it demonstrates a +13.98% and +8.11% improvement over ChatGPT (OpenAI, 2022) and Claude (Anthropic, 2023), which leverage ICL with demonstrations up to the maximum input length.
Our contributions are summarized as follows:
• We introduce COT COLLECTION, a new instruction dataset that includes 1.84 million rationales across 1,060 tasks that could be used for applying CoT fine-tuning to LMs.
• With COT COLLECTION, we fine-tune Flan-T5, denoted as CoT-T5, which shows a nontrivial boost in zero-shot and few-shot learning capabilities with CoT Prompting.
• For ablations, we show that CoT fine-tuning could improve the CoT capabilities of LMs in low-compute settings by using a subset of COT COLLECTION and training on (1) a smaller number of tasks (T0 setting; 163 tasks) and (2) a smaller number of instances in 5 different languages (French, Japanese, Korean, Russian, Chinese; 80K instances).
Related Works

Chain-of-Thought (CoT) Prompting

Wei et al. (2022b) propose Chain-of-Thought (CoT) prompting, a technique that triggers the model to generate a rationale before the answer. By generating a rationale, large LMs show improved reasoning abilities when solving challenging tasks. Kojima et al. (2022) show that by appending the phrase 'Let's think step by step', large LMs could perform CoT prompting in a zero-shot setting. Other work proposes variants of CoT prompting, such as automatically composing CoT demonstrations (Zhang et al., 2022) and performing a fine-grained search through multiple rationale candidates with a tree search algorithm (Yao et al., 2023). While large LMs could solve novel tasks with CoT prompting, Chung et al. (2022) and Longpre et al. (2023) show that this effectiveness does not necessarily hold for smaller LMs. In this work, we aim to equip smaller LMs with the same capabilities by instruction tuning on a large amount of rationales.

Improving Zero-shot Generalization
Previous work shows that instruction tuning enables generalization to multiple unseen tasks (Wei et al., 2021; Sanh et al., 2021; Aribandi et al., 2021; Ouyang et al., 2022; Wang et al., 2022b; Xu et al., 2022). Other work proposes to improve instruction tuning by enabling cross-lingual generalization (Muennighoff et al., 2022), improving label generalization capability (Ye et al., 2022), and training modular, expert LMs (Jang et al., 2023). Meanwhile, a line of work shows that CoT fine-tuning could improve the reasoning abilities of LMs on a single seen task (Zelikman et al., 2022; Shridhar et al., 2022; Ho et al., 2022; Fu et al., 2023). As a follow-up study, we CoT fine-tune on 1,060 instruction tasks and observe a significant improvement in zero-shot generalization on multiple tasks.

Improving Few-Shot Learning
For adapting LMs to new tasks with a few instances, recent work proposes advanced parameter-efficient fine-tuning (PEFT) methods, where a small number of trainable parameters are added (Hu et al., 2021; Lester et al., 2021; Liu et al., 2021, 2022b; Asai et al., 2022; Liu et al., 2022c). In this work, we show that a simple recipe of (1) applying LoRA (Hu et al., 2021) to a LM capable of performing CoT reasoning and (2) CoT fine-tuning on a target task results in strong few-shot performance.

Figure 1: An illustration of the overall task groups and dataset sources from which we obtained the instances to augment the rationales in COT COLLECTION. Compared to the 9 datasets that provide publicly available rationales (included within 'Flan-T5 ExQA', 'Flan-T5 Arithmetic', 'Flan-T5 MCQA', 'Flan-T5 NLI' from the red box), we generate ∼51.29 times more rationales (1.84 million rationales) and ∼117.78 times more task variants (1,060 tasks).

The COT COLLECTION
Despite its effectiveness for CoT fine-tuning, rationale data still remains scarce. To the best of our knowledge, recent work mostly relies on 9 publicly available NLP datasets for fine-tuning with rationales (Zelikman et al., 2022; Shridhar et al., 2022; Chung et al., 2022; Ho et al., 2022; Longpre et al., 2023; Fu et al., 2023). This is due to the difficulty of gathering human-authored rationales (Kim et al., 2023). To this end, we create COT COLLECTION, an instruction-tuning dataset that includes 1.84 million rationales augmented across 1,060 tasks. In this section, we explain the datasets we select to augment with rationales and how we perform the overall augmentation process.
Broad Overview Given an input X = [I, z] composed of an instruction I and an instance z, along with the answer y, we obtain a rationale r by applying in-context learning (ICL) with a large LM. Note that this differs from previous works, which focused on generating new instances z using large LMs (West et al., 2022; Liu et al., 2022a; Kim et al., 2022; Honovich et al., 2022; Wang et al., 2022a; Taori et al., 2023; Chiang et al., 2023), while we extend it to generating new rationales r.
Source Dataset Selection As a source dataset from which to extract rationales, we choose the Flan Collection (Longpre et al., 2023), consisting of 1,836 diverse NLP tasks from P3 (Sanh et al., 2021), SuperNaturalInstructions (Wang et al., 2022b), Flan (Wei et al., 2021), and some additional dialogue & code datasets. We choose 1,060 tasks, narrowing our focus according to the following criteria:
• Generation tasks with long outputs are excluded, since the total token length of appending r and y exceeds the maximum output token length (512 tokens) during training.
• Datasets that are not publicly available, such as DeepMind Coding Contests and DrRepair (Yasunaga and Liang, 2020), are excluded.
• Datasets where the input and output do not correspond to each other in the Hugging Face datasets library (Lhoest et al., 2021) are excluded.
• When a dataset appears in common across different sources, we prioritize using the task from P3 first, followed by SNI, and then Flan.
• During preliminary experiments, we find that for tasks such as sentiment analysis, sentence completion, coreference resolution, and word disambiguation, rationales generated by large LMs are very short and uninformative. We exclude these tasks to prevent negative transfer during multitask learning (Aribandi et al., 2021; Jang et al., 2023).
Creating Demonstrations for ICL We first create prompts to apply in-context learning (ICL) with large LMs for augmenting the instances of the selected tasks with rationales. Preparing demonstrations D_t for each task t would be the most straightforward approach, but it becomes infeasible as the number of tasks grows. Instead, we assign each task t to T_k, a family of tasks that shares a similar task format, such as multiple-choice QA, closed-book QA, and dialogue generation. Each family of tasks shares D_{T_k}, which consists of 6 ∼ 8 demonstrations. These 6 ∼ 8 demonstrations for each task group T_k are manually created by 3 of the authors of this paper. Specifically, given 136 instances sampled from the Flan Collection, two annotators are assigned to write a rationale, and a third annotator conducts an A/B test between the two options. We manually create D_{T_k} across k = 26 task groups. We include the prompts for all of the different task groups in Appendix D.

Rationale Augmentation
We use OpenAI Codex to augment rationales. Formally, given (X_i^t, y_i^t), the i-th instance of a task t, the goal is to generate the corresponding rationale r_i^t. During preliminary experiments, we found that ordering the label in front of the rationale within the demonstrations D_{T_k} was crucial for generating good-quality rationales. We conjecture this is because placing the label in front of the rationale loosens the need for the large LM to solve the underlying task, letting it focus only on generating a rationale. However, we also found that for some tasks, such as arithmetic reasoning, large LMs fail to generate good-quality rationales. To mitigate this issue, we apply filtering to the augmented rationales. We provide the criteria used for the filtering phase and the filtered cases in Appendix B. We also include an analysis of the diversity and quality of COT COLLECTION compared to the existing 9 CoT tasks and human-authored rationales in Appendix A.

The use of Codex was largely due to a limited academic budget (OpenAI supported Codex at no cost for researchers up to June 2023). Moreover, other LLM services such as Bard (Google, 2023) and Claude (Anthropic, 2023) were not available during the period of COT COLLECTION augmentation. To address the concern of reproducibility, an analysis of the quality of rationales from Codex, Bard, and Claude is included in Appendix A.
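To make the augmentation procedure concrete, the following is a minimal sketch of the ICL call described above, assuming the legacy OpenAI completion API in place of the now-retired Codex; the demonstration string DEMOS and the helper augment_rationale are illustrative names, and the bracketed field format follows the prompt examples in Appendix D.

```python
# Minimal sketch of rationale augmentation via ICL (assumption: legacy
# openai.Completion API; "code-davinci-002" (Codex) is no longer served,
# so any strong completion model can stand in).
import openai

# 6-8 manually written demonstrations for one task group, with the label
# placed BEFORE the rationale, as described above.
DEMOS = """[Instruction and Question] ...
[Answer] ...
[Rationale] ...
"""

def augment_rationale(instruction: str, instance: str, answer: str) -> str:
    prompt = (
        DEMOS
        + f"[Instruction and Question] {instruction}\n{instance}\n"
        + f"[Answer] {answer}\n[Rationale]"
    )
    resp = openai.Completion.create(
        model="code-davinci-002",
        prompt=prompt,
        max_tokens=256,
        temperature=0.7,
        stop=["[Instruction and Question]"],
    )
    return resp.choices[0].text.strip()
```

Because the ground-truth answer is supplied before the rationale slot, the model only needs to explain the answer, not derive it, which is what makes the label-first ordering effective.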

Experiments
For our main experiments, we use Flan-T5 (Chung et al., 2022) as our base model.

Evaluation
We evaluate under two different evaluation methods: Direct Evaluation and CoT Evaluation. For Direct Evaluation on classification tasks, we follow previous work using verbalizers, choosing the option with the highest probability through a comparison of logit values (Schick and Schütze, 2021; Sanh et al., 2021; Ye et al., 2022; Jang et al., 2023), and measure accuracy. For generation tasks, we directly compare the LM's prediction with the answer and measure the EM score.
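As a concrete illustration, below is a minimal sketch of Direct Evaluation on a classification task, assuming each option is scored by the total log-likelihood the seq2seq LM assigns to its verbalized form; the checkpoint name is a placeholder for the evaluated model.

```python
# Minimal sketch of Direct Evaluation: score each verbalized option by its
# log-likelihood under the model and pick the highest-scoring one.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

def option_loglik(prompt: str, option: str) -> float:
    inputs = tok(prompt, return_tensors="pt")
    labels = tok(option, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(**inputs, labels=labels).loss  # mean NLL per label token
    return -loss.item() * labels.size(1)  # total log-likelihood of the option

def direct_predict(prompt: str, options: list[str]) -> str:
    return max(options, key=lambda o: option_loglik(prompt, o))
```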
When evaluating with CoT Evaluation, smaller LMs including Flan-T5 often do not generate any rationales, even with the trigger phrase 'Let's think step by step'. Therefore, we adopt a hard constraint requiring the LM to generate r_i^t with a minimum length of 8 tokens. For classification tasks, we divide the process into two steps, where the LM first generates r_i^t, and then verbalizers are applied with an indicator phrase '[ANSWER]' inserted between r_i^t and the possible options. For generation tasks, we extract the output coming after the indicator phrase. Accuracy is used as the metric for classification tasks, while EM is used for generation tasks.
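Below is a minimal sketch of this two-step CoT Evaluation, reusing tok, model, and direct_predict from the previous sketch; the exact prompt composition is an assumption.

```python
# Minimal sketch of CoT Evaluation: generate a rationale of at least 8 tokens,
# then either re-score options after the '[ANSWER]' indicator (classification)
# or extract the text following it (generation).
def cot_predict(prompt: str, options: list[str] | None = None) -> str:
    inputs = tok(prompt, return_tensors="pt")
    gen = model.generate(**inputs, min_new_tokens=8, max_new_tokens=256)
    output = tok.decode(gen[0], skip_special_tokens=True)
    if options is None:
        # Generation task: the prediction is whatever follows the indicator.
        return output.split("[ANSWER]")[-1].strip()
    # Classification task: condition on the generated rationale plus the
    # indicator phrase, then apply the verbalizers via logit comparison.
    rationale = output.split("[ANSWER]")[0].strip()
    return direct_predict(prompt + " " + rationale + " [ANSWER]", options)
```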

Zero-shot Generalization
In this subsection, we show how training with COT COLLECTION could effectively improve the LM's ability to solve unseen tasks. We use three different experimental setups, each testing a different aspect:
• Setup #1: training on the entire 1,060 tasks in COT COLLECTION and evaluating the reasoning capabilities of LMs with the BIG-Bench Hard (BBH) benchmark (Suzgun et al., 2022).
• Setup #2: training only on the 163 tasks that T0 (Sanh et al., 2021) used for training (a subset of the COT COLLECTION), and evaluating the linguistic capabilities of LMs with the P3 evaluation benchmark (Sanh et al., 2021).
• Setup #3: training with a translated subset of COT COLLECTION for each of five different languages and evaluating how LMs could perform CoT reasoning in multilingual settings using the MGSM benchmark (Shi et al., 2022).

Setup #1: CoT Fine-tuning with 1,060 CoT Tasks
We first perform experiments with our main model, CoT-T5, by training Flan-T5 on the entire COT COLLECTION and evaluating on the BBH benchmark (Suzgun et al., 2022). In addition to evaluating Flan-T5, we compare the performance of different baselines: (1) T5-LM (Raffel et al., 2020): the original base model of Flan-T5, (2) T0 (Sanh et al., 2021): an instruction-tuned LM trained with the P3 instruction dataset, (3) Tk-Instruct (Wang et al., 2022b): an instruction-tuned LM trained with the SNI instruction dataset, and (4) GPT-3 (Brown et al., 2020): a pre-trained LLM with 175B parameters. For ablation purposes, we also train T5-LM with COT COLLECTION (denoted as 'T5 + CoT FT'). Note that the FLAN Collection includes 15 million instances, hence is ∼8 times larger compared to our COT COLLECTION.

The results on the BBH benchmark are shown across Table 1 and Table 2. In Table 1, CoT-T5 (3B & 11B) achieves a +4.34% and +2.60% improvement over Flan-T5 (3B & 11B) with CoT Evaluation. Surprisingly, while CoT-T5-3B's CoT performance improves by +4.34% at the cost of a 0.96% degradation in Direct Evaluation, CoT-T5-11B's Direct Evaluation performance even improves, resulting in a +2.57% total average improvement. Since COT COLLECTION only includes instances augmented with rationales, these results show that CoT fine-tuning could improve the LM's capabilities regardless of the evaluation method. Also, T5-3B + COT FT and T5-11B + COT FT outperform FLAN-T5-3B and FLAN-T5-11B by a +1.45% and +3.89% margin, respectively, when evaluated with CoT Evaluation. Moreover, T5-3B + COT FT outperforms ∼4 times larger models such as T0-11B and Tk-Instruct-11B in both Direct and CoT Evaluation. The overall results indicate that (1) CoT fine-tuning on a diverse number of tasks enables smaller LMs to outperform larger LMs and (2) training with the FLAN Collection and the CoT Collection provides complementary improvements to LMs under different evaluation methods; CoT-T5 obtains good results across both evaluation methods by training on both datasets.
In Table 2, CoT-T5-11B obtains the same or better results on 15 out of 23 tasks when evaluated with Direct Evaluation, and on 17 out of 23 tasks when evaluated with CoT Evaluation, compared to Flan-T5-11B. Interestingly, Vicuna (Chiang et al., 2023), a LM trained on long-form dialogues between users and GPT models, performs much worse compared to both CoT-T5 and Flan-T5. We conjecture that training on instruction datasets from existing academic benchmarks constituting the CoT Collection and Flan Collection is more effective in enabling LMs to solve reasoning tasks than training on chat data.

Table 3: Evaluation performance on 11 different unseen P3 datasets (Sanh et al., 2021) categorized into 4 task categories. We report the Direct performance of the baselines since they were not CoT fine-tuned on instruction data. The best comparable performances are bolded and the second best underlined. We exclude Flan-T5 and CoT-T5 since they were trained on tasks from FLAN and SNI that overlap with the P3 evaluation datasets, breaking the unseen-task assumption.
Setup #2: CoT Fine-tuning with 163 CoT Tasks (T0 Setup) To examine whether the effect of CoT fine-tuning depends on a large number of tasks and instances, we use the P3 training subset of the COT COLLECTION, consisting of 644K instances from 163 tasks, and apply CoT fine-tuning to T0 (3B) (Sanh et al., 2021) and T5-LM (3B) (Raffel et al., 2020). Note that T0 is trained with 12M instances, hence a ∼18.63 times larger training set. Then, we evaluate on the P3 evaluation benchmark, which consists of 11 different NLP datasets. In addition to the baselines from the previous section (T5-LM, T0, and GPT-3), we also include LMs trained in the same T0 setup for comparison: (1) RoE (Jang et al., 2023): a modular expert LM that retrieves different expert models depending on the unseen task, (2) KiC (Pan et al., 2022): a retrieval-augmented model that is instruction-tuned to retrieve knowledge from a KB memory, and (3) Flipped (Ye et al., 2022): an instruction-tuned model trained to generate the instruction in order to resolve the LM overfitting to the output label.
The results are shown in Table 3. Surprisingly, T5-3B + COT FT outperforms T0-3B by a +8.24% margin when evaluated with CoT Evaluation, while using ∼18.63 times fewer instances. This supports that CoT fine-tuning is data efficient, being effective even with a smaller number of instances and tasks. Moreover, T0-3B + COT FT improves over T0-3B by +8.65% on average accuracy. When compared with T0-11B, which has ∼4 times more parameters, it achieves better performance on sentence completion and word sense disambiguation (WSD) tasks, and obtains similar performance on natural language inference and coreference resolution tasks.
Setup #3: Multilingual Adaptation with CoT Fine-tuning In previous work, Shi et al. (2022) proposed MGSM, a multilingual reasoning benchmark composed of 10 different languages. In this subsection, we conduct a toy experiment to examine whether CoT fine-tuning could enable LMs to reason step-by-step in multilingual settings as well, using a subset of 5 languages (Korean, Russian, French, Chinese, Japanese) from MGSM.
In Table 4, current smaller LMs can be divided into three categories: (1) Flan-T5, a LM that is CoT fine-tuned with mostly English instruction data, (2) mT5 (Xue et al., 2021), a LM pre-trained on diverse languages, but neither instruction tuned nor CoT fine-tuned, and (3) mT0 (Muennighoff et al., 2022), a LM that is instruction-tuned on diverse languages, but not CoT fine-tuned. In relatively underrepresented languages such as Korean, Japanese, and Chinese, all three LMs get close to zero accuracy.
A natural question arises as to whether training a multilingual LM that could reason step-by-step in different languages is viable. As preliminary research, we examine whether CoT fine-tuning on a single language with a small amount of CoT data could enable LMs to avoid near-zero scores on subsets of MGSM such as Korean, Chinese, and Japanese. Since there is no publicly available multilingual instruction dataset, we translate 60K ∼ 80K instances from COT COLLECTION for each of the 5 languages using ChatGPT (OpenAI, 2022), and CoT fine-tune mT5 and mT0 on each of them.
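As an illustration of this translation step, here is a minimal sketch assuming the legacy ChatCompletion endpoint; the prompt wording is our own assumption, not the exact one used to build the translated subsets.

```python
# Minimal sketch of translating a CoT instance with ChatGPT (legacy
# openai.ChatCompletion API; the prompt wording is illustrative).
import openai

def translate_instance(text: str, target_language: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                f"Translate the following instruction, rationale, and answer "
                f"into {target_language}, preserving the step-by-step "
                f"reasoning:\n\n{text}"
            ),
        }],
    )
    return resp.choices[0].message.content
```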
The results are shown in Table 4. Across all 5 different languages, CoT fine-tuning brings about non-trivial gains in performance. Even for relatively low-resource languages such as Korean, Japanese, and Chinese, CoT fine-tuning on the specific language allows the underlying LM to perform mathematical reasoning in the target language, which is considered very difficult (Shi et al., 2022). Considering that only a very small number of instances were used for language-specific adaptation (60K ∼ 80K), CoT fine-tuning shows potential for efficient language adaptation. However, it is noteworthy that we limited our setting to training and evaluating on a single target language, without exploring the cross-lingual transfer of CoT capabilities among varied languages. The chief objective of this experiment was to ascertain whether introducing a minimal volume of CoT data could facilitate effective adaptation to the target language, specifically when addressing reasoning challenges. To date, no hypothesis has suggested that training with CoT in various languages could enable cross-lingual transfer of CoT abilities among different languages. We identify this as a promising avenue for future exploration.

Few-shot Generalization
In this subsection, we show how CoT-T5 performs in a few-shot adaptation setting, where a limited number of instances from the target task can be used for training, which is often the case in real-world scenarios.

Dataset Setup
We choose 4 domain-specific datasets from the legal and medical domains: LEDGAR (Tuggener et al., 2020), CaseHold (Zheng et al., 2021), MedNLI (Romanov and Shivade, 2018), and PubMedQA (Jin et al., 2019). To simulate a few-shot setting, we randomly sample 64 instances from the train split of each dataset. We report the average accuracy across 3 runs with different random seeds. We augment rationales for the 64 training instances using the rationale augmentation procedure described in Section 3, utilizing the MCQA prompt from the P3 dataset. In an applied setting, practitioners could instead obtain rationales written by human experts.
Training Setup We compare Flan-T5 & CoT-T5 at the 3B and 11B scales and explore 4 different approaches for few-shot adaptation: (1) regular fine-tuning, (2) CoT fine-tuning, (3) LoRA fine-tuning, and (4) LoRA CoT fine-tuning. When applying LoRA, we use a rank of 4 and train for 1K steps, following Liu et al. (2022b). This results in training 2.35M parameters for 3B-scale models and 4.72M parameters for 11B-scale models. We also include Claude (Anthropic, 2023) and ChatGPT (OpenAI, 2022) as ICL baselines by appending demonstrations up to the maximum context length (4k tokens for ChatGPT and 9k tokens for Claude). Specifically, for CoT prompting, the demonstrations are sampled from the 64 augmented rationales.
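A minimal sketch of this LoRA configuration using the peft library is shown below; the target modules and alpha are assumptions, since the text only specifies the rank (4) and the number of training steps.

```python
# Minimal sketch of the LoRA fine-tuning setup (rank 4); target_modules and
# lora_alpha are assumptions, not taken from the paper.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")
config = LoraConfig(
    r=4,                        # rank used in the paper
    lora_alpha=16,              # assumption: a common default
    target_modules=["q", "v"],  # assumption: T5 attention projections
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # ~2.35M trainable parameters at 3B scale
```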
Effect of LoRA The experimental results are shown in Table 5. Overall, CoT fine-tuning CoT-T5 integrated with LoRA obtains the best results. Surprisingly, for Flan-T5, applying full fine-tuning obtains better performance than its counterpart using LoRA fine-tuning. However, when using CoT-T5, LoRA achieves higher performance than full fine-tuning. We conjecture this to be the case because introducing only a few parameters enables CoT-T5 to maintain the CoT ability acquired during CoT fine-tuning.
Fine-tuning vs. CoT Fine-tuning While CoT fine-tuning obtains similar or lower performance compared to regular fine-tuning for Flan-T5, CoT-T5 achieves higher performance with CoT fine-tuning than Flan-T5 does with regular fine-tuning. This results in CoT-T5 in combination with CoT fine-tuning showing the best performance in the few-shot adaptation setting.
Fine-tuning vs. ICL Lastly, fine-tuning methods obtain overall better results than ICL methods utilizing much larger, proprietary LLMs. We conjecture this to be the case due to the long input length of legal and medical datasets, which makes appending all 64 available demonstrations impossible. While increasing the context length could serve as a temporary solution, inference cost would still increase quadratically in proportion to the input length, which makes ICL computationally expensive.

Analysis of CoT Fine-tuning
In this section, we conduct experiments to address the following two research questions:
• For practitioners, is it more effective to augment CoT rationales across diverse tasks or across more instances for a fixed number of tasks?
• During CoT fine-tuning, does the LM maintain its performance on in-domain tasks without any catastrophic forgetting?

Scaling the number of tasks & instances
In our main experiments, we used a large number of instances (1.84M) across a large number of tasks (1,060) to apply CoT fine-tuning. A natural question arises: "Is it more effective to increase the number of tasks or the number of instances?" To address this question, we conduct an experiment of randomly sampling a small number of instances within the COT COLLECTION and comparing the BBH performance with (1) a baseline that is only CoT fine-tuned with the existing 9 CoT tasks and (2) COT-T5, which fully utilizes all 1.84M instances. Specifically, we sample 10K and 100K instances within the COT COLLECTION, and for the 9 CoT tasks, we use all 180K instances. As with COT-T5, we use Flan-T5 as our base model with the same training configuration and evaluation setting (CoT Eval) during these experiments.
The results are shown in Figure 3, where, surprisingly, using only 10K instances across 1,060 tasks obtains better performance than using 180K instances across 9 tasks. This shows that maintaining a wide range of tasks is more crucial than increasing the number of instances.

In-domain Task Accuracy of CoT-T5
It is well known that LMs fine-tuned on a wide range of tasks suffer from catastrophic forgetting (Chen et al., 2020; Jang et al., 2021, 2023), a phenomenon where an LM improves its performance on newly learned tasks while its performance on previously learned tasks diminishes. While COT-T5 uses the same tasks as its base model (Flan-T5), we also check whether CoT fine-tuning on a wide range of tasks could possibly harm performance. For this purpose, we use the test sets of 5 tasks within the COT COLLECTION, namely ANLI-R1, ANLI-R2, ANLI-R3, RTE, and Winogrande. Note that this differs from Setup #2 in the main experiments in that we use different base models (T0 vs. Flan-T5), and the tasks were already used for CoT fine-tuning.
Results are shown in Figure 4, where COT-T5 consistently improves in-domain accuracy on the learned tasks as well. However, we conjecture that this is because we used the exact same tasks that Flan-T5 was trained on to CoT fine-tune COT-T5. Adding tasks that were not used to train Flan-T5 and COT-T5 could show different results, and we leave additional exploration of catastrophic forgetting during CoT fine-tuning to future work.

Conclusion
In this work, we show that augmenting rationales from instruction-tuning data using LLMs (OpenAI Codex) and applying CoT fine-tuning could improve the reasoning capabilities of smaller LMs. Specifically, we construct COT COLLECTION, a large-scale instruction-tuning dataset with 1.84M CoT rationales extracted across 1,060 NLP tasks. With our dataset, we CoT fine-tune Flan-T5 and obtain CoT-T5, which shows better zero-shot generalization performance and serves as a better base model when training with a small number of instances. We hope COT COLLECTION could be beneficial in the development of future strategies for advancing the capabilities of LMs with CoT fine-tuning.

Limitations
Recently, there has been a lot of focus on distilling the ability to engage in dialogues with long-form outputs in the context of instruction following (Taori et al., 2023; Chiang et al., 2023). Since our model COT-T5 is not trained to engage in dialogues with long-form responses from LLMs, it does not necessarily possess the ability to be applied in chat applications. In contrast, our work focuses on improving zero-shot and few-shot capabilities by training on academic benchmarks (COT COLLECTION, Flan Collection), where LMs trained with chat data fall short. Utilizing both long-form chat data from LLMs along with instruction data from academic tasks has been addressed in subsequent work (Wang et al., 2023). Moreover, various applications have been introduced using the FEEDBACK COLLECTION to train advanced chat models.
Also, since COT-T5 uses Flan-T5 as a base model, it does not have the ability to perform step-by-step reasoning in diverse languages. Exploring how to efficiently and effectively train on CoT data from multiple languages is also a promising and important line of future work. While Shi et al. (2022) have shown that large LMs with more than 100B parameters have the ability to write CoT in different languages, our results show that smaller LMs achieve nearly zero accuracy when solving math problems in different languages. While CoT fine-tuning shows some improvement, a more comprehensive strategy for integrating the ability to write CoT in diverse languages would be crucial.
In terms of reproducibility, it is concerning that proprietary LLMs can be shut down, as in the case of Codex, the LLM we used for rationale augmentation. We provide additional analysis on how different LLMs could be used for this process in Appendix A. Also, there is room for improvement regarding the quality of our dataset by using more powerful LLMs such as GPT-4 and better prompting techniques such as Tree of Thoughts (ToT) (Yao et al., 2023). This was examined by later work: Mukherjee et al. (2023) used GPT-4 to augment 5 million rationales, and Yue et al. (2023) mixed Chain-of-Thought and Program-of-Thought (PoT) rationales during fine-tuning. Using rationales extracted with Tree of Thoughts (Yao et al., 2023) could also be explored in future work.

A Analysis of COT COLLECTION
Non-cherry-picked rationales within COT COLLECTION are shown in Table 8. We perform an analysis regarding the quality, diversity, and reproducibility of the rationales within the COT COLLECTION.

Diversity of Rationales
To look into the diversity of COT COLLECTION, we use the Berkeley Neural Parser (Kitaev and Klein, 2018; Kitaev et al., 2019) to parse the rationales. More specifically, the verb closest to the root of the parse tree is extracted along with its noun object. We compare this with the rationales from the 9 CoT datasets used in Chung et al. (2022). As shown in Figure 5, COT COLLECTION includes diverse textual formats compared to the 9 existing CoT datasets, which have a high proportion assigned to 'answer question' and 'consider following'.
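A minimal sketch of this verb-object extraction is shown below; for brevity it approximates the constituency-parse procedure with spaCy's dependency parse rather than the Berkeley Neural Parser, and rationales is a hypothetical list of rationale strings.

```python
# Minimal sketch: extract the root verb and its direct object from each
# rationale (approximation of the paper's procedure via spaCy dependencies).
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def root_verb_object(rationale: str):
    doc = nlp(rationale)
    for sent in doc.sents:
        root = sent.root
        if root.pos_ in ("VERB", "AUX"):
            objs = [c.lemma_ for c in root.children if c.dep_ in ("dobj", "obj")]
            if objs:
                return (root.lemma_, objs[0])
    return None

rationales = ["..."]  # hypothetical: the rationale strings to analyze
counts = Counter(p for p in map(root_verb_object, rationales) if p)
print(counts.most_common(20))  # e.g., ('answer', 'question')
```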

Quality of Rationales
To ensure the quality of COT COLLECTION, we use ROSCOE (Golovneva et al., 2022), a suite of metrics designed to evaluate rationales under different criteria within semantic alignment, semantic similarity, logical inference, and language coherence. We compare with the human-authored rationales obtained during prompt creation in Section 3. The 13 ROSCOE scores are shown in Table 6. The results show that COT COLLECTION includes CoT rationales that are faithful, less repetitive, informative, and logical even when compared to human-authored rationales. Yet, we find that machine-generated rationales tend to have higher perplexity, leading to lower language coherence scores. We conjecture this is because including diverse textual formats may result in relatively higher perplexity (Holtzman et al., 2019).
Is COT COLLECTION Reproducible? One could doubt whether COT COLLECTION is reproducible due to the usage of an OpenAI model in the process of CoT rationale augmentation. In this section, we test different LLMs by generating 150 rationales randomly sampled from COT COLLECTION, and compare the ROSCOE scores (Golovneva et al., 2022) in order to assess their quality. We use Bard (Google, 2023) and Claude (Anthropic, 2023) for comparison with OpenAI Codex. The comparison of quality is shown in Figure 6. The results show that different LLMs are able to produce high-quality rationales in terms of semantic alignment and language coherence.

B Filtering COT COLLECTION
Filtering After generating multiple rationales, we filter them to ensure high quality. We apply the following criteria to filter instances (a minimal sketch implementing these filters is shown after this list):
• We exclude rationales that do not include the ground truth answer when split by whitespace. While a rationale that doesn't include the answer isn't necessarily a bad rationale, we found it effective to exclude inconsistent ones.
• We exclude CoT rationales that exceed the maximum output length, where we constrain the sum of r and y to be shorter than 512 tokens.
• We exclude rationales that are identical to previously augmented ones during our process.
• We exclude rationales that include repetitive sentences within the context.
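The following is a minimal sketch of the four filters above; the tokenizer choice and the sentence-level repetition check are assumptions that simplify the actual procedure.

```python
# Minimal sketch of rationale filtering (tokenizer and repetition heuristic
# are assumptions).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")

def keep_rationale(rationale: str, answer: str, seen: set) -> bool:
    # 1) Rationale must contain the ground-truth answer as a whitespace token.
    if answer not in rationale.split():
        return False
    # 2) Rationale + answer must fit within the 512-token output budget.
    if len(tokenizer(rationale + " " + answer).input_ids) >= 512:
        return False
    # 3) Drop exact duplicates of previously accepted rationales.
    if rationale in seen:
        return False
    # 4) Drop rationales that repeat a sentence verbatim.
    sentences = [s.strip() for s in rationale.split(".") if s.strip()]
    if len(sentences) != len(set(sentences)):
        return False
    seen.add(rationale)
    return True
```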

C Training and Evaluation Details of CoT-T5

We mostly follow the fine-tuning details of Chung et al. (2022) to train CoT-T5. The hyperparameters used for training CoT-T5 are shown in Table 7. We find that the 3B and 11B sized LMs converge well using different optimizers: while CoT-T5-3B tends to converge well using AdamW, CoT-T5-11B is better optimized using Adafactor. For both sizes, we train for 1 epoch on COT COLLECTION, which takes 1 day (3B) and 7 days (11B) when 8 A100 (80GB) GPUs are used. For both settings, we use a gradient accumulation step of 8.
During evaluation, we found that using nucleus sampling (Holtzman et al., 2019) with p = 0.8 and no_repeat_ngram_size = 3 was very effective in generating good-quality rationales.
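A minimal sketch of this decoding configuration with the transformers generate API is shown below; the checkpoint name is a placeholder for the evaluated model.

```python
# Minimal sketch of the evaluation-time decoding configuration.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

inputs = tok("Q: ... Let's think step by step.", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.8,               # nucleus sampling, as described above
    no_repeat_ngram_size=3,  # avoid repeated trigrams in the rationale
    max_new_tokens=256,
)
print(tok.decode(out[0], skip_special_tokens=True))
```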

[Table 8 content: rows of (Question & Instruction, Answer, Rationale) triples drawn from COT COLLECTION, covering multiple-choice QA, extractive QA, closed-book QA, formal logic, NLI, and arithmetic tasks.]

[Figure 2 content: the MCQA prompt demonstrations, each formatted as [Instruction and Question], [Answer], and [Rationale], with the ground-truth answer placed before the rationale.]

Figure 2: MCQA Prompt used to augment rationales from the P3 dataset. Through ICL, the large LM generates a rationale that is conditioned on the ground-truth label.

Figure 3: Scaling plot of increasing the number of instances within the COT COLLECTION compared to using the existing 9 CoT datasets. Even with a smaller number of instances, maintaining a wider range of tasks is crucial to improving the CoT abilities of an underlying LLM.

Figure 4: In-domain task accuracy with CoT evaluation. CoT fine-tuning with the COT COLLECTION also improves accuracy on in-domain tasks.

Figure 5: The top 20 common root verbs (inner circle) and their top 4 noun objects (outer circle) within the rationales of the 9 CoT tasks used in Chung et al. (2022) (left side) and our COT COLLECTION with 1,060 tasks (right side).

Figure 6: Comparison of the quality of rationales generated by OpenAI Codex, Bard, and Claude, measured with ROSCOE.

Table 1: Evaluation performance on all 27 unseen datasets from the BBH benchmark, including generation tasks. All evaluations are held in a zero-shot setting. The best comparable performances are bolded and the second best underlined.

Table 2: Evaluation performance on 23 unseen classification datasets from the BBH benchmark. Scores of Vicuna, ChatGPT, Codex (teacher model of COT-T5), and GPT-4 are obtained from Chung et al. (2022) and Mukherjee et al. (2023). Evaluations are held in a zero-shot setting. The best comparable performances are bolded and the second best underlined among the open-sourced LMs.


Table 5: Evaluation performance on 4 domain-specific datasets. FT denotes fine-tuning, COT FT denotes CoT fine-tuning, and COT PT denotes CoT prompting. The best comparable performances are bolded and the second best underlined. For few-shot adaptation, we use 64 randomly sampled instances from each dataset.

Table 8: Examples of rationales within COT COLLECTION, including tasks such as Multiple-choice QA (MCQA), Extractive QA (ExQA), Closed-book QA (CBQA), Formal Logic, Natural Language Inference (NLI), and Arithmetic.