Self-Instruct: Aligning Language Models with Self-Generated Instructions

Large "instruction-tuned" language models (i.e., finetuned to respond to instructions) have demonstrated a remarkable ability to generalize zero-shot to new tasks. Nevertheless, they depend heavily on human-written instruction data that is often limited in quantity, diversity, and creativity, thereby hindering the generality of the tuned model. We introduce Self-Instruct, a framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations. Our pipeline generates instruction, input, and output samples from a language model, then filters invalid or similar ones before using them to finetune the original model. Applying our method to the vanilla GPT3, we demonstrate a 33% absolute improvement over the original model on Super-NaturalInstructions, on par with the performance of InstructGPT-001, which was trained with private user data and human annotations. For further evaluation, we curate a set of expert-written instructions for novel tasks, and show through human evaluation that tuning GPT3 with Self-Instruct outperforms using existing public instruction datasets by a large margin, leaving only a 5% absolute gap behind InstructGPT-001. Self-Instruct provides an almost annotation-free method for aligning pretrained language models with instructions, and we release our large synthetic dataset to facilitate future studies on instruction tuning.


Introduction
The recent NLP literature has witnessed a tremendous amount of activity in building models that can follow natural language instructions (Mishra et al., 2022; Wei et al., 2022; Sanh et al., 2022; Wang et al., 2022; Ouyang et al., 2022; Chung et al., 2022, i.a.). These developments are powered by two key components: large pretrained language models (LMs) and human-written instruction data. PROMPTSOURCE (Bach et al., 2022) and SUPER-NATURALINSTRUCTIONS (Wang et al., 2022) are two notable recent datasets that use extensive manual annotation for collecting instructions to construct T0 (Bach et al., 2022; Sanh et al., 2022) and Tk-INSTRUCT (Wang et al., 2022). However, this process is costly and often suffers from limited diversity, given that most human-written instructions tend to describe popular NLP tasks, falling short of covering a true variety of tasks and different ways to describe them. Given these limitations, continuing to improve the quality of instruction-tuned models necessitates alternative approaches for supervising the instruction-tuning process.

Footnote 1: Unless otherwise specified, our comparisons are with the text-davinci-001 engine. We focus on this engine since it is the closest to our experimental setup: supervised fine-tuning with human demonstrations. The newer engines are more powerful, though they use more data (e.g., code completion or the latest user queries) or algorithms (e.g., PPO) that are difficult to compare with.
Footnote 2: Code and data are available at https://github.com/yizhongw/self-instruct.
In this work, we introduce SELF-INSTRUCT, a semi-automated process for instruction-tuning a pretrained LM using instructional signals from the model itself. The overall process is an iterative bootstrapping algorithm (see Figure 1), which starts with a limited seed set of manually-written instructions (175 in our study) that are used to guide the overall generation. In the first phase, the model is prompted to generate instructions for new tasks. This step leverages the existing collection of instructions to create more broad-coverage instructions that define (often new) tasks. Given the newly-generated set of instructions, the framework also creates input-output instances for them, which can later be used for supervising the instruction tuning. Finally, various measures are used to prune low-quality and repeated instructions before adding them to the task pool. This process can be repeated for many iterations until reaching a large number of tasks.
[Figure 1: A high-level overview of SELF-INSTRUCT. The process starts with a small seed set of tasks (one instruction and one input-output instance for each task) as the task pool. Random tasks are sampled from the task pool and used to prompt an off-the-shelf LM to generate both new instructions and corresponding instances (Step 1: instruction generation; Step 2: classification task identification; Step 3: instance generation, input-first or output-first; Step 4: filtering). Low-quality or similar generations are filtered out before the rest are added back to the task pool. The resulting data can later be used for instruction tuning of the language model itself to follow instructions better. Tasks shown in the figure, e.g., "Give me a quote from a famous person on this topic." and "Find out if the given text is in favor of or against abortion.", are generated by GPT3. See Table 10 for more creative examples.]

To evaluate SELF-INSTRUCT empirically, we run this framework on GPT3 (Brown et al., 2020), which is a vanilla LM (§4). The iterative SELF-INSTRUCT process on this model leads to about 52K instructions, paired with about 82K instance inputs and target outputs. We observe that the resulting data provides a diverse range of creative tasks, and over 50% of them have less than 0.3 ROUGE-L overlap with the seed instructions (§4.2). On this resulting data, we build GPT3 SELF-INST by fine-tuning GPT3 (i.e., the same model used for generating the instruction data). We evaluate GPT3 SELF-INST in comparison to various other models on both typical NLP tasks included in SUPER-NATURALINSTRUCTIONS (Wang et al., 2022), and a set of new instructions that are created for novel usage of instruction-following models (§5). The SUPERNI results indicate that GPT3 SELF-INST outperforms GPT3 (the original model) by a large margin (+33.1%) and nearly matches the performance of InstructGPT 001. Moreover, our human evaluation on the newly created instruction set shows that GPT3 SELF-INST demonstrates a broad range of instruction-following ability, outperforming models trained on other publicly available instruction datasets and leaving only a 5% gap behind InstructGPT 001.
In summary, our contributions are: (1) SELF-INSTRUCT, a method for inducing instruction-following capability with minimal human-labeled data; (2) we demonstrate its effectiveness via extensive instruction-tuning experiments; and (3) we release a large synthetic dataset of 52K instructions and a set of manually-written novel tasks for building and evaluating future instruction-following models.

Related Work
Instruction-following language models. A series of works have found evidence that vanilla language models can be effective at following general language instructions if tuned with annotated "instructional" data: datasets containing language instructional commands and their desired outcomes based on human judgment (Weller et al., 2020; Mishra et al., 2022; Wang et al., 2022; Wei et al., 2022; Sanh et al., 2022; Ouyang et al., 2022; Parmar et al., 2022; Scialom et al., 2022; Chung et al., 2022; Luo et al., 2022; Puri et al., 2022; Yin et al., 2022; Chakrabarty et al., 2022; Lin et al., 2022; Gupta et al., 2022; Muennighoff et al., 2022). Additionally, these works show a direct correlation between the size and diversity of the "instructional" data and the generalizability of the resulting models to unseen tasks. Since these developments depend on human-annotated "instructional" data, this poses a bottleneck for progress toward more generalizable models (see, for example, Fig. 5a in Wang et al., 2022). Our work aims to tackle this bottleneck by reducing the dependence on human annotators.
Additionally, despite the remarkable performance of models like InstructGPT (Ouyang et al., 2022), their construction process remains quite opaque. In particular, the role of data has remained understudied due to limited transparency and the limited data released by the major corporate entities behind these key models. Addressing such challenges necessitates the creation of a large-scale, public dataset covering a broad range of tasks.
Instruction-following models have also been of interest in the multi-modal learning literature (Fried et al., 2018;Shridhar et al., 2020;Min et al., 2022;Weir et al., 2022). SELF-INSTRUCT, as a general approach to expanding data, can potentially also be helpful in those settings; however, this is out of the scope of this work.
Language models for data generation and augmentation. A variety of works have relied on generative LMs for data generation (Schick and Schütze, 2021;Wang et al., 2021;Liu et al., 2022;Meng et al., 2022) or augmentation (Feng et al., 2021;Yang et al., 2020;Mekala et al., 2022). For example, Schick and Schütze (2021) propose to replace human annotations of a given task with prompting large LMs and use the resulting data for fine-tuning (often smaller) models in the context of SuperGLUE tasks (Wang et al., 2019). While our work can be viewed as a form of "augmentation," our work differs from this line in that it is not specific to a particular task (say, QA or NLI). In contrast, a distinct motivation for SELF-INSTRUCT is to bootstrap new task definitions that may not have been defined before by any NLP practitioner (though potentially still important for downstream users).

Self-training.
A typical self-training framework (He et al., 2019; Xie et al., 2020; Du et al., 2021; Amini et al., 2022; Huang et al., 2022) uses trained models to assign labels to unlabeled data and then leverages the newly labeled data to improve the model. In a related line, Zhou et al. (2022a) use multiple prompts to specify a single task and propose to regularize via prompt consistency, encouraging consistent predictions over the prompts. This allows either finetuning the model with extra unlabeled training data, or direct application at inference time. While SELF-INSTRUCT has some similarities with the self-training literature, most self-training methods assume a specific target task as well as unlabeled examples under it; in contrast, SELF-INSTRUCT produces a variety of tasks from scratch.
Knowledge distillation. Knowledge distillation (Hinton et al., 2015; Sanh et al., 2019; West et al., 2021; Magister et al., 2022) often involves the transfer of knowledge from larger models to smaller ones. SELF-INSTRUCT can also be viewed as a form of "knowledge distillation"; however, it differs from this line in the following ways: (1) the source and target of distillation are the same, i.e., a model's knowledge is distilled to itself; and (2) the content of distillation is in the form of an instruction task (i.e., instructions that define a task, and a set of examples that instantiate it).
Bootstrapping with limited resources. A series of recent works use language models to bootstrap some inferences using specialized methods. NPPrompt (Zhao et al., 2022) provides a method to generate predictions for semantic labels without any fine-tuning. It uses a model's own embeddings to automatically find words relevant to the label of the data sample and hence reduces the dependency on manual mapping from model prediction to label (verbalizers). STAR (Zelikman et al., 2022) iteratively leverages a small number of rationale examples and a large dataset without rationales, to bootstrap a model's ability to perform reasoning. Self-Correction (Welleck et al., 2022) decouples an imperfect base generator (model) from a separate corrector that learns to iteratively correct imperfect generations and demonstrates improvement over the base generator. Our work instead focuses on bootstrapping new tasks in the instruction paradigm.

Instruction generation.
A series of recent works (Zhou et al., 2022b; Ye et al., 2022; Singh et al., 2022; Honovich et al., 2022) generate instructions of a task given a few examples. While SELF-INSTRUCT also involves instruction generation, a major difference is that ours is task-agnostic; we generate new tasks (instructions along with instances) from scratch.

Method
Annotating large-scale instruction data can be challenging for humans because it requires 1) creativity to come up with novel tasks and 2) expertise for writing the labeled instances for each task. In this section, we detail our process for SELF-INSTRUCT, which refers to the pipeline of generating tasks with a vanilla pretrained language model itself and then conducting instruction tuning with this generated data in order to align the language model to follow instructions better. This pipeline is depicted in Figure 1.

Defining Instruction Data
The instruction data we want to generate contains a set of instructions {I_t}, each of which defines a task t in natural language. Each task has one or more input-output instances (X_t, Y_t). A model M is expected to produce the output y given the task instruction I_t and the instance input x: M(I_t, x) = y, for (x, y) ∈ (X_t, Y_t). Note that the instruction and instance input do not have a strict boundary in many cases. For example, "write an essay about school safety" can be a valid instruction that we expect models to respond to directly, while it can also be formulated as "write an essay about the following topic" as the instruction, and "school safety" as an instance input. To encourage diversity in the data format, we allow instructions that do not require additional input (i.e., x is empty).
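To make the data format concrete, here is a minimal sketch of this schema in Python; the class and field names are our own illustration, not part of a released implementation.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Instance:
        input: str   # may be empty for instructions that need no additional input
        output: str  # the expected response y = M(I_t, x)

    @dataclass
    class Task:
        instruction: str  # the natural-language instruction I_t
        is_classification: bool = False
        instances: List[Instance] = field(default_factory=list)

    # "Write an essay about school safety" can be stored either as a standalone
    # instruction with an empty input, or as a template instruction plus input:
    t1 = Task("Write an essay about school safety.", instances=[Instance("", "...")])
    t2 = Task("Write an essay about the following topic.",
              instances=[Instance("school safety", "...")])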

Automatic Instruction Data Generation
Our pipeline for generating the instruction data consists of four steps: 1) instruction generation, 2) identifying whether the instruction represents a classification task or not, 3) instance generation with the input-first or the output-first approach, and 4) filtering low-quality data.
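Viewed end to end, the four steps compose into a single loop. The following is a structural sketch only: each step is passed in as a callable, and all names are our illustration rather than released code.

    def self_instruct_pipeline(task_pool, generate_instructions, is_classification,
                               generate_instances, passes_filters, target_size=52000):
        # Grow the task pool until it is large enough; each round runs steps 1-4.
        while len(task_pool) < target_size:
            for instruction in generate_instructions(task_pool):        # step 1
                clf = is_classification(instruction)                     # step 2
                instances = generate_instances(instruction, clf)         # step 3
                if passes_filters(instruction, instances, task_pool):    # step 4
                    task_pool.append({"instruction": instruction,
                                      "instances": instances})
        return task_pool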
Instruction Generation. SELF-INSTRUCT is based on the finding that large pretrained language models can be prompted to generate novel instructions when presented with some existing instructions in the context. This provides us with a way to grow the instruction data from a small set of seed human-written instructions. We propose to generate a diverse set of instructions in a bootstrapping fashion. We initiate the task pool with 175 tasks (1 instruction and 1 instance for each task) written by our authors. At every step, we sample 8 task instructions from this pool as in-context examples. Of the 8 instructions, 6 are from the human-written tasks, and 2 are from the model-generated tasks in previous steps, to promote diversity. The prompting template is shown in Table 6.
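A minimal sketch of this sampling and prompt assembly; the numbered "Task k:" layout approximates the template in Table 6:

    import random

    def build_instruction_prompt(seed_instructions, model_instructions):
        # 8 in-context instructions: 6 human-written + 2 model-generated
        # (fewer model-generated ones are available in early iterations).
        n_model = min(2, len(model_instructions))
        demos = random.sample(seed_instructions, 8 - n_model)
        demos += random.sample(model_instructions, n_model)
        random.shuffle(demos)
        lines = ["Come up with a series of tasks:"]
        for i, instruction in enumerate(demos, start=1):
            lines.append(f"Task {i}: {instruction}")
        lines.append(f"Task {len(demos) + 1}:")  # the LM continues from here
        return "\n".join(lines)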
Classification Task Identification. Because we need two different approaches for classification and non-classification tasks, we next identify whether the generated instruction represents a classification task or not. We prompt vanilla GPT3 few-shot to determine this, using 12 classification instructions and 19 non-classification instructions from the seed tasks. The prompting template is shown in Table 7.
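A sketch of how such a few-shot prompt can be assembled; the per-example wording ("Is it classification?") is our approximation, not the exact template in Table 7:

    def build_clf_identification_prompt(clf_seeds, non_clf_seeds, new_instruction):
        # clf_seeds: 12 classification seed instructions;
        # non_clf_seeds: 19 non-classification seed instructions.
        lines = ["Can the following task be regarded as a classification task "
                 "with finite output labels?", ""]
        for inst in clf_seeds:
            lines += [f"Task: {inst}", "Is it classification? Yes", ""]
        for inst in non_clf_seeds:
            lines += [f"Task: {inst}", "Is it classification? No", ""]
        lines += [f"Task: {new_instruction}", "Is it classification?"]
        return "\n".join(lines)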
Instance Generation. Given the instructions and their task type, we generate instances for each instruction independently. This is challenging because it requires the model to understand what the target task is, based on the instruction, figure out what additional input fields are needed and generate them, and finally complete the task by producing the output. We found that pretrained language models can achieve this to a large extent when prompted with instruction-input-output in-context examples from other tasks. A natural way to do this is the Input-first Approach, where we can ask a language model to come up with the input fields first based on the instruction, and then produce the corresponding output. This generation order is similar to how models are used to respond to instruction and input, but here with in-context examples from other tasks. The prompting template is shown in Table 8.
However, we found that this approach can generate inputs biased toward one label, especially for classification tasks (e.g., for grammar error detection, it usually generates grammatical input). Therefore, we additionally propose an Output-first Approach for classification tasks, where we first generate the possible class labels, and then condition the input generation on each class label. The prompting template is shown in Table 9. We apply the output-first approach to the classification tasks identified in the former step, and the input-first approach to the remaining non-classification tasks.
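A minimal sketch of the output-first path, assuming a generic complete(prompt) -> text LM call; the prompt strings are illustrative stand-ins for the actual template in Table 9:

    def generate_instances_output_first(instruction, complete):
        # Elicit the class labels first, then one input per label, so that
        # the generated inputs are not skewed toward a single label.
        labels_text = complete(f"Task: {instruction}\nPossible class labels:")
        labels = [l.strip("- ").strip() for l in labels_text.splitlines() if l.strip()]
        instances = []
        for label in labels:
            inp = complete(f"Task: {instruction}\nClass label: {label}\nInput:")
            instances.append({"input": inp.strip(), "output": label})
        return instances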

Filtering and Postprocessing.
To encourage diversity, a new instruction is added to the task pool only when its ROUGE-L overlap with every existing instruction is less than 0.7. We also exclude instructions that contain certain keywords (e.g., images, pictures, graphs) that usually cannot be processed by language models. When generating new instances for each instruction, we filter out instances that are exactly the same, or those with the same input but different outputs.
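These filters can be sketched as follows, using the rouge_score package (our tooling assumption; the 0.7 threshold and the example keywords come from the description above):

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    BLOCKLIST = ["image", "picture", "graph"]  # example keywords

    def keep_instruction(candidate, pool_instructions):
        # Reject instructions mentioning content LMs cannot process, and
        # instructions too similar (ROUGE-L >= 0.7) to anything in the pool.
        if any(word in candidate.lower() for word in BLOCKLIST):
            return False
        return all(scorer.score(existing, candidate)["rougeL"].fmeasure < 0.7
                   for existing in pool_instructions)

    def dedup_instances(instances):
        # Collapse exact duplicates; drop inputs that appear with
        # conflicting outputs.
        by_input = {}
        for ex in instances:
            by_input.setdefault(ex["input"], set()).add(ex["output"])
        return [{"input": k, "output": next(iter(v))}
                for k, v in by_input.items() if len(v) == 1]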

Finetuning the LM to Follow Instructions
After the creation of the large-scale instruction data, we use this data to finetune the original language model (i.e., SELF-INSTRUCT). To do this, we concatenate the instruction and instance input as a prompt and train the model to generate the instance output in a standard supervised way. To make the model robust to different formats, we use multiple templates to encode the instruction and instance input together. For example, the instruction can be prefixed with "Task:" or not, the input can be prefixed with "Input:" or not, "Output:" can be appended at the end of the prompt, and different numbers of line breaks can be put in the middle, etc.
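A sketch of such template randomization, assuming these particular prefix/suffix choices (the "Task:", "Input:", and "Output:" variants are the examples named above):

    import random

    def encode_example(instruction, inp):
        # Randomly vary the prompt format so the tuned model becomes robust
        # to it: optional field prefixes and a variable number of line breaks.
        prompt = random.choice(["", "Task: "]) + instruction
        sep = "\n" * random.randint(1, 2)
        if inp:  # instructions without additional input skip this field
            prompt += sep + random.choice(["", "Input: "]) + inp
        if random.random() < 0.5:
            prompt += sep + "Output:"
        return prompt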

SELF-INSTRUCT Data from GPT3
In this section, we apply our method for inducing instruction data to GPT3 as a case study. We use the largest GPT3 language model ("davinci" engine) accessed through the OpenAI API. The parameters for making queries are described in Appendix A.1. Here we present an overview of the generated data. Table 1 describes the basic statistics of the generated data. We generate a total of over 52K instructions, and more than 82K instances corresponding to these instructions after filtering.

Diversity
To study what types of instructions are generated and how diverse they are, we identify the verb-noun structure in the generated instructions. We use the Berkeley Neural Parser (Kitaev and Klein, 2018; Kitaev et al., 2019) to parse the instructions, and then extract the verb that is closest to the root of the parse tree as well as its first direct noun object. 26,559 of the 52,445 instructions contain this structure; the other instructions usually contain more complex clauses (e.g., "Classify whether this tweet contains political content or not.") or are framed as questions (e.g., "Which of these statements are true?"). We plot the top 20 most common root verbs and their top 4 direct noun objects in Figure 2, which account for 14% of the entire set. Overall, we see quite diverse intents and textual formats in these instructions.
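Our analysis uses the Berkeley Neural Parser; as an illustrative substitute, the same verb-object extraction can be approximated with a spaCy dependency parse (requires the en_core_web_sm model to be downloaded):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def root_verb_and_object(instruction):
        # Take the root verb and its first direct object, if both exist.
        doc = nlp(instruction)
        root = next((tok for tok in doc if tok.dep_ == "ROOT"), None)
        if root is None or root.pos_ != "VERB":
            return None  # e.g., instructions framed as questions
        dobj = next((c for c in root.children if c.dep_ == "dobj"), None)
        return (root.lemma_, dobj.lemma_) if dobj is not None else None

    print(root_verb_and_object("Write a poem about the ocean."))  # expected: ('write', 'poem')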
We further study how the generated instructions differ from the seed instructions that are used to prompt the generation. For each generated instruction, we compute its highest ROUGE-L overlap with the 175 seed instructions. We plot the distribution of these ROUGE-L scores in Figure 3, indicating a decent number of new instructions that do not have much overlap with the seeds. We also demonstrate diversity in length of the instructions, instance inputs, and instance outputs in Figure 4.

Quality
So far, we have shown the quantity and diversity of the generated data, but its quality remains uncertain. To investigate this, we randomly sample 200 instructions and randomly select 1 instance per instruction. We ask an expert annotator (a co-author of this work) to label whether each instance is correct or not, in terms of the instruction, the instance input, and the instance output. Evaluation results in Table 2 show that most of the generated instructions are meaningful, while the generated instances may contain more noise (to a reasonable extent). However, we found that even though the generations may contain errors, most of them are still in the correct format or even partially correct, which can provide useful guidance for training models to follow instructions. We list a number of good and bad generations in Table 10 and Table 11, respectively.

Experimental Results
We conduct experiments to measure and compare the quality of models under various instruction tuning setups. We first describe our models and other baselines, followed by our experiments.

Is the input appropriate for the instruction? 79%
Is the output a correct and acceptable response to the instruction and input? 58%
All fields are valid: 54%

Table 2: Data quality review for the instruction, input, and output of the generated data. See Table 10 and Table 11 for representative valid and invalid examples.

GPT3 SELF-INST : fine-tuning GPT3 on its own instruction data
With the generated instruction data, we conduct instruction tuning for the GPT3 model itself (the "davinci" engine). As described in §3.3, we use various templates to concatenate the instruction and input, and train the model to generate the output. This finetuning is done through the OpenAI finetuning API. We use the default hyper-parameters, except that we set the prompt loss weight to 0, and we train the model for 2 epochs. We refer the readers to Appendix A.2 for additional finetuning details. The resulting model is denoted as GPT3 SELF-INST.

Baselines
Off-the-shelf language models. We evaluate T5-LM (Lester et al., 2021; Raffel et al., 2020) and GPT3 (Brown et al., 2020) as the vanilla LM baselines (only pre-training, no additional fine-tuning). These baselines indicate the extent to which off-the-shelf LMs are capable of following instructions immediately after pretraining.

Publicly-available instruction-tuned models.
T0 and Tk-INSTRUCT are two instruction-tuned models proposed in Sanh et al. (2022) and Wang et al. (2022) respectively, and both are demonstrated to be able to follow instructions for many NLP tasks. Both of these models are finetuned from the T5 (Raffel et al., 2020) checkpoints and are publicly available. For both of these models, we use their largest version with 11B parameters.
Instruction-tuned GPT3 models. We evaluate InstructGPT (Ouyang et al., 2022), which is developed by OpenAI based on GPT3 to follow human instructions better and has been found by the community to have impressive zero-shot abilities. There are various generations of these models, where newer ones use more expansive data or algorithmic novelties. For our SUPERNI experiments in §5.3, we only compare with their text-davinci-001 engine, because their newer engines are trained with the latest user data and are likely to have already seen the SUPERNI evaluation set. For our human evaluation of these models on newly written instructions, we include their 001, 002 and 003 engines for completeness. Additionally, to compare SELF-INSTRUCT training with other publicly available instruction tuning data, we further finetune the GPT3 model with data from PROMPTSOURCE and SUPERNI, which are used to train the T0 and Tk-INSTRUCT models. We call them T0 training and SUPERNI training for short, respectively. To save the training budget, we sampled 50K instances (but covering all of their instructions) for each dataset, a size comparable to the instruction data we generated. Based on the findings from Wang et al. (2022) and our early experiments, reducing the number of instances per task does not degrade the model's generalization performance to unseen tasks.
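One way to realize this coverage-preserving subsampling (our illustration; the exact procedure is not specified above) is to take one instance per instruction first and then fill the remaining budget at random:

    import random
    from collections import defaultdict

    def subsample_covering_instructions(instances, budget=50_000):
        # One instance per instruction first (full coverage), then fill
        # the rest of the budget uniformly at random.
        by_instruction = defaultdict(list)
        for idx, ex in enumerate(instances):
            by_instruction[ex["instruction"]].append(idx)
        chosen = {random.choice(idxs) for idxs in by_instruction.values()}
        rest = [i for i in range(len(instances)) if i not in chosen]
        random.shuffle(rest)
        chosen.update(rest[: max(0, budget - len(chosen))])
        return [instances[i] for i in sorted(chosen)]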

Experiment 1: Zero-Shot Generalization on SUPERNI benchmark
We first evaluate the models' ability to follow instructions on typical NLP tasks in a zero-shot fashion. We use the evaluation set of SUPERNI (Wang et al., 2022), which consists of 119 tasks with 100 instances in each task. In this work, we mainly focus on the zero-shot setup, i.e., the model is prompted with the definition of the task only, without in-context demonstration examples. For all our requests to the GPT3 variants, we use the deterministic generation mode (temperature 0 and no nucleus sampling) without specific stop sequences.
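For reference, such a deterministic query with the legacy (pre-v1) OpenAI Python client looks roughly as follows; the max_tokens value is our assumption:

    import openai  # legacy (pre-v1) client

    def query_deterministic(prompt, engine="davinci", max_tokens=512):
        # Greedy decoding: temperature 0, no nucleus sampling (top_p=1),
        # and no stop sequences.
        response = openai.Completion.create(
            engine=engine,
            prompt=prompt,
            temperature=0,
            top_p=1,
            max_tokens=max_tokens,
        )
        return response["choices"][0]["text"]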

Results.
We make the following observations from the results in Table 3. SELF-INSTRUCT boosts the instruction-following ability of GPT3 by a large margin. The vanilla GPT3 model basically cannot follow human instructions at all. Upon manual analysis, we find that it usually generates irrelevant and repetitive text and does not know when to stop generating. Compared with other models that are not specifically trained for SUPERNI, GPT3 SELF-INST achieves better performance than T0 or the GPT3 finetuned on the T0 training set, which required tremendous human labeling effort. Notably, GPT3 SELF-INST also nearly matches the performance of InstructGPT 001, which is trained with private user data and human-annotated labels. Models trained on the SUPERNI training set still achieve better performance on its evaluation set, which we attribute to the similar instruction style and formatting. However, we show that SELF-INSTRUCT still brings additional gains when combined with the SUPERNI training set, proving its value as complementary data.

Experiment 2: Generalization to User-oriented Instructions on Novel Tasks
Despite the comprehensiveness of SUPERNI in collecting existing NLP tasks, most of these NLP tasks were proposed for research purposes and are skewed toward classification. To better assess the practical value of instruction-following models, a subset of the authors curate a new set of instructions motivated by user-oriented applications. We first brainstorm different domains where large LMs may be useful (e.g., email writing, social media, productivity tools, entertainment, programming), then craft instructions related to each domain along with an input-output instance (again, input is optional). We aim to diversify the styles and formats of these tasks (e.g., instructions may be long or short; input/output may take the form of bullet points, tables, code, equations, etc.). In total, we create 252 instructions with 1 instance per instruction. We believe this set can serve as a testbed for evaluating how instruction-based models handle diverse and unfamiliar instructions.

[Figure 5: Human evaluation of GPT3, GPT3 + T0 Training, GPT3 + SuperNI Training, GPT3 SELF-INST (with and without SuperNI training), and InstructGPT-001/002/003 on the user-oriented instruction set. Each response is rated as one of: correct and satisfying response; acceptable response with minor imperfections; responds to the instruction but has significant errors; irrelevant or invalid response.]

Evaluating performance on this set is challenging because different tasks require different expertise. Indeed, many of these tasks cannot be measured by automatic metrics or even be judged by normal crowdworkers (e.g., writing a program, or converting first-order logic into natural language). To get a more faithful evaluation, we asked the authors of the instructions to judge model predictions. The evaluators were asked to rate the output based on whether it accurately and effectively completes the task. We implemented a four-level rating system for categorizing the quality of the models' outputs, defined as follows:

• RATING-A: The response is valid and satisfying.
• RATING-B: The response is acceptable but has minor errors or imperfections that can be improved.
• RATING-C: The response is relevant and responds to the instruction, but it has significant errors in the content. For example, GPT3 might generate a valid output first, but continue to generate other irrelevant things.
• RATING-D: The response is irrelevant or invalid, including repetition of the input, totally irrelevant output, etc.
Results. Figure 5 shows the performance of the GPT3 model and its instruction-tuned counterparts on this newly written instruction set. As anticipated, the vanilla GPT3 language model is largely unable to respond to instructions, and all instruction-tuned models demonstrate comparatively higher performance. Nonetheless, GPT3 SELF-INST (i.e., the GPT3 model fine-tuned with SELF-INSTRUCT) outperforms those counterparts trained on T0 or SUPERNI by a large margin, demonstrating the value of the generated data despite the noise. Compared with InstructGPT 001 (cf. footnote 1), GPT3 SELF-INST is quite close in performance: if we count acceptable responses with minor imperfections (RATING-B) as valid, GPT3 SELF-INST is only 5% behind InstructGPT 001. Lastly, our evaluation confirms the impressive instruction-following ability of the InstructGPT 002 and InstructGPT 003 models. Although there are many factors behind this success, we conjecture that future work can largely benefit from improving the quality of our generated data by using human annotators or training a reward model to select better generations, similar to the algorithm used in Ouyang et al. (2022).

Example Predictions from GPT3 SELF-INST
We present a selection of user-oriented tasks, the corresponding responses produced by GPT3 SELF-INST, and the annotator ratings in Table 4. We see that even for responses rated as RATING-C, the model demonstrates extensive steps in solving the task, even though its final output is incorrect.

Why does SELF-INSTRUCT work?
It is worthwhile to reflect on the role that high-quality human feedback plays in enabling the recent successes of instruction-tuning LMs. Here are two extreme hypotheses:

(H1) Human feedback is a necessary and indispensable aspect of instruction-tuning, as LMs need to learn about issues that were not quite learned during pre-training.

(H2) Human feedback is an optional aspect of instruction-tuning: LMs are already quite familiar with instructions from their pre-training, and observing human feedback is merely a lightweight process for aligning their pre-training distribution/objective, which might be replaceable with a different process.

While the reality probably lies somewhere in between these two extremes, we conjecture that it is closer to H2, particularly for larger models. This intuition, that LMs already know much about language instructions, is a key motivation for SELF-INSTRUCT and is also supported by its empirical success.

Broader Impact
Beyond the immediate focus of this paper, we believe that SELF-INSTRUCT may help bring more transparency to what happens "behind the scenes" of widely-used instruction-tuned models like InstructGPT. Unfortunately, such industrial models remain behind API walls as their datasets are not released, and hence there is little understanding of their construction and of why they demonstrate impressive capabilities. The burden now falls on academia to better understand the source of success in these models and strive for better, yet open, models. We believe our findings in this paper demonstrate the importance of diverse instruction data, and our large synthetic dataset can be the first step toward higher-quality data for building better instruction-following models.

Limitations of SELF-INSTRUCT
Here, we discuss some limitations of this work to inspire future research in this direction.
[Table 4: Examples in the User-oriented Instructions dataset (§5.4) and predictions from GPT3 SELF-INST. The right column indicates one of the four quality ratings assigned to the model's response, with "A" indicating "valid and satisfying" responses (highest) and "D" indicating "irrelevant or invalid" responses (lowest). Recoverable fragments include a first-order-logic-to-natural-language task (model outputs: "There is someone (x) such that if he is a soldier, then he is general of everyone." and "There exists a soldier such that for every general, he is a general."; rated C) and the following task:

Instruction: Design an outline for a blog post based on the given information and list the sections accordingly.
Input: This blog post is going to be about making a list of last-minute gift ideas. Also, the tone of the post is going to be relaxed and casual.
Output:
- Section: Movie tickets
- Section: A fun game console
- Section: Personalized items with photos/artwork
- ... (more sections)
- Takeaway: Don't stress about running out of time to buy or make a gift.]

Dependence on large models. Because of SELF-INSTRUCT's dependence on the inductive biases extracted from LMs, it might work best for larger models. If true, this may create barriers to access for those who may not have large computing resources. We hope future studies will carefully examine the gains as a function of model size or various other parameters. It is worthwhile to note that instruction-tuning with human annotation also suffers from a similar limitation: gains of instruction-tuning are higher for larger models (Wei et al., 2022).

Reinforcing LM biases.
A point of concern for the authors is the unintended consequences of this iterative algorithm, such as the amplification of problematic social biases (stereotypes or slurs about genders, races, etc.). Relatedly, one observed challenge in this process is the algorithm's difficulty in producing balanced labels, reflecting the model's prior biases. We hope future work will hash out such details to better understand the pros and cons of the approach.
Conclusion

We introduce SELF-INSTRUCT, a task-agnostic method for improving the instruction-following capabilities of language models via their own generation of instruction data (instruction, input, and output samples) and bootstrapping with it. Our method conducts instruction-tuning of the original model on the pruned subset of generated samples. Experimenting with vanilla GPT3, we observe a 33% absolute improvement over the original model on SUPER-NATURALINSTRUCTIONS. This performance is on par with InstructGPT 001, which is trained with private user data and expensive human annotations. Furthermore, we curate a set of expert-written instructions for novel tasks. Human evaluation on this set shows that tuning GPT3 with SELF-INSTRUCT outperforms using existing public instruction datasets by a large margin, leaving only a 5% absolute gap behind InstructGPT 001. We hope SELF-INSTRUCT can serve as a first step toward aligning pretrained language models to follow human instructions, and future work can build on top of this data to improve instruction-following models.

A.1 Querying the GPT3 API
We use different sets of hyper-parameters when querying the GPT3 API for different purposes. These hyper-parameters are found to work well with the GPT3 model ("davinci" engine) and the other instruction-tuned GPT3 variants. We list them in Table 5.

A.2 Finetuning GPT3
GPT3 SELF-INST and some of our baselines are finetuned from the GPT3 model ("davinci" engine with 175B parameters). We conduct this finetuning via OpenAI's finetuning API. While the details of how the model is finetuned with this API are not currently available (e.g., which parameters are updated, or what the optimizer is), we tune all our models with the default hyper-parameters of this API so that the results are comparable. We only set the "prompt_loss_weight" to 0 since we find this works better in our case, and every finetuning experiment is trained for two epochs to avoid overfitting the training tasks. Finetuning is charged based on the number of tokens in the training file. Finetuning GPT3 SELF-INST from the GPT3 model cost $338.
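For reference, a sketch of creating such a job with the legacy (pre-v1) fine-tunes API; the training file name is hypothetical:

    import openai  # legacy (pre-v1) client; uses the original /v1/fine-tunes endpoint

    # Upload the training data: one {"prompt": ..., "completion": ...} object per line.
    training_file = openai.File.create(
        file=open("self_instruct_train.jsonl", "rb"),
        purpose="fine-tune",
    )
    # Default hyper-parameters except prompt_loss_weight=0 and 2 epochs.
    job = openai.FineTune.create(
        training_file=training_file["id"],
        model="davinci",
        n_epochs=2,
        prompt_loss_weight=0,
    )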

B Prompting Templates for Data Generation
SELF-INSTRUCT relies on a number of prompting templates in order to elicit the generation from language models. Here we provide our four templates for generating the instruction (Table 6), classifying whether an instruction represents a classification task or not (Table 7), generating non-classification instances with the input-first approach (Table 8), and generating classification instances with the output-first approach (Table 9).
Come up with a series of tasks:

Table 6: Prompt used for generating new instructions. 8 existing instructions are randomly sampled from the task pool for in-context demonstration. The model is allowed to generate instructions for new tasks until it stops its generation, reaches its length limit, or generates "Task 16" tokens.
Can the following task be regarded as a classification task with finite output labels?

Table 7: Prompt used for classifying whether a task instruction is a classification task or not.

Table 8: Prompt used for the input-first approach of instance generation. The model is prompted to generate the instance first, and then generate the corresponding output. For instructions that don't require additional input, the output is allowed to be generated directly.
Given the classification task definition and the class labels, generate an input that corresponds to each of the class labels. If the task doesn't require input, just generate the correct class label.
Task: Classify the sentiment of the sentence into positive, negative, or mixed.
Class label: mixed
Sentence: I enjoy the flavor of the restaurant but their service is too slow.
Class label: Positive
Sentence: I had a great day today. The weather was beautiful and I spent time with friends.
Class label: Negative
Sentence: I was really disappointed by the latest superhero movie. I would not recommend it.

Task: Given a dialogue, classify whether the user is satisfied with the service. You should respond with "Satisfied" or "Unsatisfied".
Class label: Satisfied
Dialogue:
- Agent: Thank you for your feedback. We will work to improve our service in the future.
- Customer: I am happy with the service you provided. Thank you for your help.
Class label: Unsatisfied
Dialogue:
- Agent: Sorry that we will cancel your order. You will get a refund within 7 business days.
- Customer: oh that takes too long. I want you to take quicker action on this.

Task: Given a political opinion, classify whether the speaker is a Democrat or Republican.

Task: Does the document supports the claim? Answer with "Support" or "Unsupport".
Class label: Unsupport
Document: After a record-breaking run that saw mortgage rates plunge to all-time lows and home prices soar to new highs, the U.S. housing market finally is slowing. While demand and price gains are cooling, any correction is likely to be a modest one, housing economists and analysts say. No one expects price drops on the scale of the declines experienced during the Great Recession.
Claim: The US housing market is going to crash soon.
Class label: Support
Document: The U.S. housing market is showing signs of strain, with home sales and prices slowing in many areas. Mortgage rates have risen sharply in recent months, and the number of homes for sale is increasing. This could be the beginning of a larger downturn, with some economists predicting a potential housing crash in the near future.
Claim: The US housing market is going to crash soon.

Table 9: Prompt used for the output-first approach of instance generation. The model is prompted to generate the class label first, and then generate the corresponding input. This prompt is used for generating the instances for classification tasks.

C Task and Instance Examples from the Generated Instruction Data
[Table 10: Representative valid tasks generated by GPT3, with columns "Instruction + Instance Input", "Target Output", and "Validity". As discussed in §4, these generated tasks cover a broad range of formats, text types, and underlying expertise, while being correct on more than half of all the generated tasks.]

Instruction: Given a set of words w = {w1, w2, ..., wn} where each word wi has li letters, write a function that takes a target length l and returns a new set of words s such that |s| = l and |{wi : wi ∈ s}| ≤ l/|w|.
Target Output:

    def wordSubsetSum(w, l):
        s = []
        for i in range(len(w)):
            if len(s) + 1 < l:
                s.append(w[i])
        return s

Instruction: The house on fire! go!
Target Output:
- Go to the kitchen and take a pan.
- Take water from the tap.
- Pour water into the pan.
- Put out the fire with the pan.

Instruction: A number of people are lined up for an interview. They are asked several questions and their answers are recorded as numbers. You need to identify which person has the highest salary among all the candidates.
Input: Person 1: 2, 3, 4, 5, 6; Person 2: 7, 8, 9, 10, 11; Person 3: 12, 13, 14, 15, 16; Person 4: 17, 18, 19, 20, 21; Person 5: 22, 23, 24, 25, 26; Person 6: 27, 28, 29, 30, 31
Target Output: Highest salary

Instruction: Given a list of words, generate a story about a group of people who live on a planet. The story should contain at least 10 sentences. Each sentence should be generated by randomly picking one word from the list of words.
Target Output: The story starts with a group of people who live on a planet. They are very intelligent and have created many things that help them in their daily lives. One day, they decided to create a machine that can make food for them. The machine was able to produce food, but it also produced something else. It produced a liquid that could be used as fuel for the machines. The people were happy because they didn't need to go out and look for food or fuel anymore. They had everything they needed right there in their homes. However, one day, the machine stopped working. The people tried to fix it but couldn't figure out what went wrong. They tried to find someone who could help them but no one knew how to fix the machine. After a few days, the people realized that they would die if they didn't do anything about it. So, they decided to send some of their best scientists to space to find a new planet where they could live. The scientists traveled for a long time before they found a planet that looked like a good place to live. They sent a message back to the people telling them about the planet. The people were excited and started preparing for the journey. They built spaceships and flew to the new planet. When they arrived, they saw that the planet was beautiful. There were trees and rivers and mountains. The people were so happy that they finally found a new home. They started building houses and planting crops. They lived happily ever after.

Instruction: What is your salary?
Target Output:

    class Solution(object):
        def getSalary(self):
            return self.salary

Instruction: A person wants to buy a car from another person. He has some bitcoins, but he doesn't want to reveal his identity. How can they make the transaction happen?
Target Output: The buyer and seller should agree on a price for the car. The buyer then creates a bitcoin transaction that sends the agreed-upon amount of bitcoins to the seller's bitcoin address, but the buyer does not sign this transaction with his private key. Instead, he gives the unsigned transaction to the seller. The seller can now sign the transaction using his private key and broadcast it to the network. Once the transaction is confirmed by the network, the buyer has no way to get his money back, so he will be forced to go through with the purchase.

Table 11: Representative invalid tasks generated by GPT3. In the original table, a validity column indicates which fields are problematic. As discussed in §4.3, although these tasks contain errors, they still provide many useful signals in supervising models to follow instructions.