Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation

Instruction tuning has emerged to enhance the capabilities of large language models (LLMs) to comprehend instructions and generate appropriate responses. Existing methods either manually annotate or employ LLMs (e.g., the GPT series) to generate data for instruction tuning. However, they often overlook associating instructions with existing annotated datasets. In this paper, we propose Dynosaur, a dynamic growth paradigm for the automatic curation of instruction-tuning data. Based on the metadata of existing datasets, we use LLMs to automatically construct instruction-tuning data by identifying relevant data fields and generating appropriate instructions. By leveraging existing annotated datasets, Dynosaur offers several advantages: 1) it reduces the API cost for generating instructions (e.g., it costs less than $12 USD to call GPT-3.5-turbo to generate 800K instruction-tuning samples); 2) it provides high-quality data for instruction tuning (e.g., it performs better than Alpaca and Flan on Super-NI and Longform with comparable data sizes); and 3) it supports the continuous improvement of models by generating instruction-tuning data when a new annotated dataset becomes available. We further investigate a continual learning scheme for learning with the ever-growing instruction-tuning dataset, and demonstrate that replaying tasks with diverse instruction embeddings not only helps mitigate forgetting but also generalizes better to unseen tasks. Code and data are available at https://github.com/WadeYin9712/Dynosaur.


Introduction
Instruction tuning (Sanh et al., 2022; Ouyang et al., 2022; Wei et al., 2022) enables large language models (LLMs) (Raffel et al., 2020; Brown et al., 2020; Touvron et al., 2023) to provide appropriate output according to input instructions. Existing approaches compile instruction-tuning datasets mainly by 1) manual annotation or 2) distillation from larger LLMs. For example, SUPER-NATURALINSTRUCTION (SUPER-NI) (Wang et al., 2022b) and DOLLY (Databricks, 2023) recruit experts to manually annotate task instructions and related task data. Despite their high quality, this approach is labor-intensive and costly (Honovich et al., 2022a). Recent efforts (Wang et al., 2022a; Taori et al., 2023) leverage GPT-series models to distill instruction-tuning data for training smaller models. However, subsequent studies (Gudibande et al., 2023) argue that these methods merely help smaller models learn to mimic the style of teacher LLMs without inheriting their true capabilities, such as factuality and problem-solving skills. We suspect this is mainly because the instructions are not grounded in actual data.
In this paper, we propose DYNOSAUR, a dynamic growth paradigm that converts high-quality annotations from dataset repositories into instruction-tuning data. In particular, DYNOSAUR generates instructions based on the metadata of existing datasets in the dynamically growing Huggingface Datasets Platform (Lhoest et al., 2021). As shown in Figure 1, metadata covers essential information about a dataset, including the dataset description ("A collection of ... ebooks ..."), dataset name ("Gutenburg_English"), data fields ("title", "text", ..., "issued"), and dataset annotations. Guided by metadata, our method can generate multiple tasks applicable for forming instruction-tuning data with instances in NLP datasets. We leverage LLMs to harvest task instructions and their corresponding input/output fields with a single prompt. Prompted with a dataset description involving ebooks and data fields about book publication information, LLMs can synthesize instructions such as "Given a Gutenburg passage, generate its title" and "Predict the year when the book is published based on book title and authors". These instructions reflect the original data domain and use multiple dataset components.
In the meantime, LLMs also determine which data fields should be used to construct the corresponding task inputs/outputs according to the generated instructions. As illustrated in Figure 1, LLMs capture the corresponding input fields "title" and "author" and the output field "issued" for the generated task of predicting issued years given book title and authors. Subsequently, all the data under the "title" and "author" fields are used as the final inputs of the generated task, and the data under "issued" are treated as the final outputs. Suppose we generate N instructions based on the metadata of a dataset that contains M instances; our method can then synthesize N × M instruction-tuning instances.
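As an illustration of this expansion step, the sketch below pairs each generated task with every instance of its source dataset; the task dictionary keys ("instruction", "input_fields", "output_field") are hypothetical names for illustration, not our exact data structures.

```python
# Sketch: expand N generated tasks over M dataset instances into N x M examples.
def expand_task(task, instances):
    """Pair one generated task with every dataset instance."""
    examples = []
    for instance in instances:
        examples.append({
            "instruction": task["instruction"],
            # values of the designated input fields (e.g., "title", "author")
            "input": {field: instance[field] for field in task["input_fields"]},
            # value of the designated output field (e.g., "issued")
            "output": instance[task["output_field"]],
        })
    return examples

def expand_all(tasks, instances):
    # N tasks x M instances -> N * M instruction-tuning examples
    return [example for task in tasks for example in expand_task(task, instances)]
```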
DYNOSAUR offers several advantages:

Low Conversion Cost. As DYNOSAUR leverages existing annotated data, it reduces the number of queries to larger LLMs for generating instructions. For example, it costs only $11.5 USD to query GPT-3.5-turbo (OpenAI, 2023) and generate 800K instruction-tuning instances based on annotated datasets. In contrast, both ALPACA and INSTRUCTION GPT-4 cost around $500 USD to generate a significantly smaller dataset of 52K instances. Despite the lower cost of querying LLMs, DYNOSAUR generates high-quality data by effectively leveraging existing annotations.
Effectiveness of Instruction-Tuning Data. We evaluate data effectiveness by studying whether models trained with DYNOSAUR can achieve competitive performance on SUPER-NI, LONGFORM (Köksal et al., 2023) and USER-INSTRUCTION-252 (Wang et al., 2022a). On SUPER-NI, both T5-3B and LLAMA-7B models fine-tuned with DYNOSAUR outperform ALPACA with comparable data sizes.

Supporting Continuous Model Improvement. An ever-growing instruction-tuning dataset provides an opportunity to continuously improve instruction-following models. Suppose we have a model trained with K tasks (M_K) and newly obtain L training tasks. How can we train M_K with the L new tasks so as to 1) achieve better generalization on unseen tasks and the L new tasks and 2) suffer less from forgetting the previous K training tasks? We propose several continual learning strategies specifically for instruction tuning, which select replay tasks based on the diversity of instruction and data representations. Experiments with SUPER-NI and DYNOSAUR show that replaying is effective for improving generalization and mitigating forgetting. Moreover, once the L new tasks are used for training, replaying the previous tasks with the least similar instructions to the L tasks performs best.

Collection of DYNOSAUR Data
In this section, we introduce how we construct the DYNOSAUR dataset. As shown in Figure 1, we first collect metadata from existing datasets, then prompt an LLM to create tasks based on the metadata, and finally filter out invalid ones.

Metadata Collection
Metadata contains key information about an NLP dataset that contributes to instruction-tuning data generation. It covers the following elements:

Dataset Name. The dataset name sometimes provides useful information that helps us identify the domain and task category of a dataset. For example, dataset names with "bio" usually indicate that the dataset is in the biological domain; names with "nli" may suggest that the dataset was originally designed for natural language inference tasks.
Dataset Description. The dataset description offers more detailed information about the motivation for building a dataset, a summary of its contents, and its supported tasks. It helps the LLM create instructions by supplying extra information about the dataset domain and the initial dataset design.
All metadata components are collected from the Huggingface Datasets Platform. We only collect metadata from datasets whose licenses allow adaptation. More details are in Appendix A.
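As a rough sketch (not our exact collection script), metadata of this kind can be pulled with the Huggingface `datasets` library; license filtering and error handling are omitted, and the dataset/config names are placeholders.

```python
from datasets import load_dataset, load_dataset_builder

def collect_metadata(dataset_name, config=None, n_examples=3):
    """Collect the metadata components used for instruction generation (sketch)."""
    builder = load_dataset_builder(dataset_name, config)
    info = builder.info
    fields = list(info.features) if info.features is not None else []
    # A handful of annotated instances are also shown to the LLM.
    examples = load_dataset(dataset_name, config, split=f"train[:{n_examples}]")
    return {
        "dataset_name": dataset_name,
        "description": info.description,
        "data_fields": fields,
        "annotations": [dict(ex) for ex in examples],
        "license": info.license,  # used to keep only datasets whose license allows adaptation
    }

# e.g., collect_metadata("glue", "mrpc")
```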

Instruction and Input/Output Field Generation
For each dataset with processed metadata, we then deploy an LLM to generate multiple tasks associated with it. For each task, the LLM generates a specific task instruction and designates its input/output fields simultaneously. As exemplified in Figure 1, the LLM is expected to generate the instruction "Given a Gutenburg passage, generate its title", its input field "text", and its output field "title".
To accomplish this, we harness the power of in-context learning (Brown et al., 2020). Concretely, we wrap the information of each dataset into a dictionary format and manually construct four demonstrations. Due to the length limitation of the LLM, we use two of them each time as part of the input. Depending on whether dataset descriptions are incorporated in the input prompt, we consider the following two configurations:

Description-Aware Generation. To maximize the utilization of the information present in the dataset description, we include the metadata of the two demonstration datasets as well as the new dataset for which we plan to generate tasks as input. The benefit is that the LLM can infer the underlying purpose of the dataset creation and thereby generate tasks most aligned with the original intent. In this setup, the LLM generates new tasks with the input prompt being "Now given a dictionary as input, please help us to generate new tasks. You may stop when there is no more plausible task." and the requirements being "Note that the input and output fields should not be duplicated and should both appear in [data fields]. Each task should still be a dictionary, containing no text or explanations outside the dictionary." The full prompt is shown in Appendix B. This setting, however, still has limitations: first, comprehensive metadata may not be available for certain datasets; second, the LLM exhibits a proclivity towards dataset descriptions, leading to homogenization of the generated tasks. To mitigate these issues, we additionally introduce the following setup.
Description-Unaware Generation. To fully exploit the annotations and distinct data fields, we exclude the dataset description from the input, thereby allowing the LLM to freely generate diverse task instructions and input/output fields. In this scenario, the dataset can be perceived as a description-less database, with the LLM generating diverse potential tasks based on the valid fields within it. For instance, the data fields in a Wikipedia-based QA dataset may encompass "title", "context", "question", and "answers".
By integrating these two settings, we preserve the original intent of all datasets while leveraging the creativity of the LLM to delve deeper into the inherent potential of existing data.
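The sketch below illustrates how the two configurations differ only in whether the dataset description is kept in the prompt; the demonstration formatting, prompt wording, and use of the OpenAI Python client are simplified assumptions (the exact prompt is given in Appendix B).

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GUIDE = ("Now given a dictionary as input, please help us to generate new tasks. "
         "You may stop when there is no more plausible task.")
REQUIREMENTS = ("Note that the input and output fields should not be duplicated and should "
                "both appear in the data fields. Each task should still be a dictionary, "
                "containing no text or explanations outside the dictionary.")

def build_prompt(metadata, demonstrations, use_description=True):
    """Description-aware generation keeps the dataset description; the unaware variant drops it."""
    meta = dict(metadata)
    if not use_description:
        meta.pop("description", None)
    demos = "\n\n".join(json.dumps(d) for d in demonstrations)  # two manual demonstrations per call
    return f"{demos}\n\n{GUIDE}\n{REQUIREMENTS}\n\nInput: {json.dumps(meta)}\nTasks:"

def generate_tasks(metadata, demonstrations, use_description=True):
    prompt = build_prompt(metadata, demonstrations, use_description)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content  # task dictionaries to be parsed and validated
```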

Post-Processing
Filtering Invalid Tasks. Even though we describe the requirements for a valid task in the prompt, the LLM sometimes neglects them and generates invalid tasks. We filter out tasks with three criteria: 1) tasks with non-existent data fields (for instance, a task with the output field "content" is invalid given the data in Figure 1); 2) tasks with more than one output field; 3) tasks whose input/output fields overlap. Moreover, we remove duplicate tasks created during the description-aware and -unaware generation.
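A minimal sketch of this validity filter, assuming each generated task has been parsed into a dictionary with hypothetical keys "instruction", "input_fields", and "output_field":

```python
def is_valid_task(task, data_fields):
    """Apply the three filtering criteria described above (sketch)."""
    in_fields = list(task.get("input_fields", []))
    out = task.get("output_field")
    out_fields = out if isinstance(out, (list, tuple)) else [out]
    # 2) exactly one output field
    if len(out_fields) != 1:
        return False
    # 1) every referenced field must exist in the dataset
    if any(f not in data_fields for f in in_fields + out_fields):
        return False
    # 3) input and output fields must not overlap
    if out_fields[0] in in_fields:
        return False
    return True

def dedupe_tasks(tasks):
    """Drop duplicate tasks produced by the description-aware and -unaware runs."""
    seen, kept = set(), []
    for t in tasks:
        key = (t["instruction"], tuple(t.get("input_fields", [])), str(t.get("output_field")))
        if key not in seen:
            seen.add(key)
            kept.append(t)
    return kept
```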
Organizing Instruction Data. We organize the instruction data in the form of "instruction", "input", and "output". Given an instance of a dataset and a generated task containing the instruction, input fields, and output field, the "instruction" is the generated instruction and the "output" is the value of the output field. If there is only one input field, the "input" is the value of that field; otherwise, the "input" describes all the input fields with the format "The [field name] is [value of the field]."

Adding Label Spaces for Classification Tasks. As we only show the LLM a few dataset instances, it does not know the entire label space when generating a classification task. As a result, the generated instruction may not adequately convey the label space. To overcome this issue, we automatically add the label space information to the instruction of classification tasks. We simply treat a task with fewer than 10 distinct outputs as a classification task, and add "Answers must be one of [distinct outputs]." to the end of the instruction. We also discard classification tasks with extremely imbalanced distributions (e.g., only one distinct output value) in this step.
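Concretely, a sketch of the formatting and label-space steps could look as follows (field names and helper names are illustrative):

```python
from collections import Counter

def format_example(task, instance):
    """Organize one (task, instance) pair into the instruction/input/output format (sketch)."""
    in_fields = task["input_fields"]
    if len(in_fields) == 1:
        inp = str(instance[in_fields[0]])
    else:
        inp = " ".join(f"The {f} is {instance[f]}." for f in in_fields)
    return {"instruction": task["instruction"],
            "input": inp,
            "output": str(instance[task["output_field"]])}

def add_label_space(task, outputs, max_labels=10):
    """Treat tasks with fewer than `max_labels` distinct outputs as classification and
    append the label space to the instruction; drop degenerate label distributions."""
    distinct = sorted(Counter(outputs))
    if len(distinct) <= 1:          # extremely imbalanced, e.g., a single distinct output value
        return None                 # discard the task
    if len(distinct) < max_labels:
        task = dict(task)
        task["instruction"] += f" Answers must be one of {distinct}."
    return task
```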

Statistics and Cases
In total, we collect 2,911 English datasets from the Huggingface Datasets Platform as of Feb 23, 2023. We then feed them to GPT-3.5-turbo (OpenAI, 2023) and generate 13,610 tasks, of which 5,740 are valid and distinct. For each task, we sample up to 200 instances, resulting in 801,900 instances that form the DYNOSAUR dataset. The diversity of the instructions is shown in Figure 3. Following the approach of Wang et al. (2022a), we plot the top 20 most prevalent root verbs and their top 4 direct nouns, each of which appears at least 5 times. The instructions are quite diverse, especially considering that we only use a total of 4 demonstrations.
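The verb-noun statistics follow the analysis style of Wang et al. (2022a); a possible implementation with spaCy is sketched below (illustrative, not necessarily the exact script behind Figure 3).

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def verb_noun_pairs(instructions):
    """Extract the root verb and its direct-object noun from each instruction."""
    pairs = []
    for doc in nlp.pipe(instructions):
        root = next((t for t in doc if t.dep_ == "ROOT" and t.pos_ == "VERB"), None)
        if root is None:
            continue
        dobj = next((c for c in root.children if c.dep_ == "dobj"), None)
        if dobj is not None:
            pairs.append((root.lemma_, dobj.lemma_))
    return Counter(pairs)

# Counter.most_common() then yields the top root verbs and their most frequent direct nouns.
```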
Figure 2 shows examples of datasets and their corresponding tasks. The dataset name, dataset description, data fields, and annotations are all used by the LLM to design the tasks. In Example 1, the LLM infers from the dataset name that it is about anaphor agreement and includes this information in the instruction. In Example 2, the LLM creates a paraphrase identification task by understanding the relationship between the fields "sentence1" and "sentence2" implied in the dataset description. Under the description-unaware setting, as in Example 3, tasks can be generated based on the names of the data fields.

Experiments
We conduct two sets of experiments to evaluate the quality of DYNOSAUR. We first evaluate models trained with DYNOSAUR on SUPER-NI and LONGFORM to examine its ability to solve NLP tasks. Then we run a human evaluation to examine if DYNOSAUR helps in user-oriented situations.
To alleviate the effect of data size disparity, instead of training models with the entire DYNOSAUR, we sample a subset that shares a similar data scale with other instruction-tuning datasets. Specifically, we select 681 tasks from DYNOSAUR as training tasks and sample at most 100 instances for each selected task, resulting in 66,695 instances in total. For the SUPER-NI training set, we likewise select 681 tasks, which are 90% of all SUPER-NI training tasks and amount to 67,825 instances. The remaining 10% of tasks are held out as the validation set for the SUPER-NI evaluation experiments. We also sample 67K instances from PROMPTSOURCE and FLAN.
During task selection for SUPER-NI, we ensure that all selected tasks have categories distinct from the SUPER-NI test tasks. Concretely, we use GPT-3.5-turbo as a task category classifier to categorize each task into one of the 76 task categories in SUPER-NI and avoid selecting tasks that fall into test task categories. Details about fine-tuning hyperparameters and training task selection are given in Appendices C, E.1 and E.2. Following the original evaluations on SUPER-NI and LONGFORM, we use ROUGE-L (Lin, 2004) and METEOR (Banerjee and Lavie, 2005) as the metrics.
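Both metrics can be computed with the Huggingface `evaluate` package; the snippet below sketches the scoring step and is not our exact evaluation harness.

```python
import evaluate

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

def score(predictions, references):
    """Compute ROUGE-L and METEOR over model generations (sketch)."""
    r = rouge.compute(predictions=predictions, references=references)
    m = meteor.compute(predictions=predictions, references=references)
    return {"rouge_l": r["rougeL"], "meteor": m["meteor"]}

# e.g., score(["the book was issued in 1852"], ["issued in 1852"])
```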
For all evaluation experiments, we follow the Self-Instruct paper's setting and exclude all positive and negative examples written in SUPER-NI instructions. This ensures a fair comparison with datasets whose instructions contain no examples, such as ALPACA, INST. GPT-4 and DOLLY.

This suggests that DYNOSAUR can be considered a useful supplement to existing instruction-tuning data to further enhance model generalizability.

DYNOSAUR vs. Other Instruction-Tuning Datasets on LONGFORM. To further compare DYNOSAUR with other instruction-tuning datasets constructed from existing data, we evaluate them on LONGFORM, a recently released instruction-tuning benchmark for evaluating models' instruction-following ability on long text generation tasks. LONGFORM is equally unseen to all these datasets. As shown in Table 3, DYNOSAUR largely outperforms the other three datasets SUPER-NI, PROMPTSOURCE, and FLAN. In particular, with LLAMA-7B as the base model, DYNOSAUR outperforms the other datasets by 7.5-12.8 METEOR points. LLAMA-7B trained with DYNOSAUR even surpasses other 11B instruction-tuned models such as T0++ and Flan-T5 by a large margin.
Ablation Studies. We first evaluate how well models perform when using only description-aware or only description-unaware instructions as training data. As shown in Table 4, considering both types of instructions produces better results than relying on either alone. We also study whether there is a performance drop after we remove the label space descriptions from the instructions. From Table 4, the performance drops by 2.6 and 4.2 ROUGE-L for T5-3B and LLAMA-7B, respectively.
DYNOSAUR vs. Larger Models. From Table 1, we observe that T5-3B and LLAMA-7B with DYNOSAUR are comparable with some larger models. For example, our models are competitive with T0++, which is trained with orders of magnitude more data, and with 175B GPT-3 w/ SELF-INSTRUCT. This further demonstrates the effectiveness of DYNOSAUR and implies the decent quality of its data.

Human Evaluation on User Instructions
Experimental Settings. We conduct human evaluation on USER-INSTRUCTION-252, a user-oriented dataset that tests generation quality in practical domains such as email writing. As there is no test category constraint, we resample 67K instances from all task categories in DYNOSAUR.
We fine-tune LLAMA-7B with the resampled data, keeping the fine-tuning hyperparameters the same as in the SUPER-NI evaluation. We recruit annotators from Amazon Mechanical Turk and ask them to compare two models' outputs in terms of helpfulness, honesty, and harmlessness (the three criteria proposed by Askell et al. (2021)). See details about sampling tasks for the USER-INSTRUCTION-252 evaluation and the human evaluation interface in Appendices E.3 and F.
DYNOSAUR as Augmentation Data for Automatically Generated Instructions. Admittedly, compared to automatically generated instructions whose seed tasks are closer to those of daily usage, DYNOSAUR is built upon data from existing NLP tasks and is less oriented toward user scenarios. However, DYNOSAUR can be used as a supplement to automatically generated instructions. As shown in Table 2a, training together with DYNOSAUR data outperforms training solely on ALPACA or INSTRUCTION GPT-4 in the majority of aspects. In particular, harmlessness gains a steady boost after incorporating DYNOSAUR.
DYNOSAUR vs. SUPER-NI. We also compare DYNOSAUR with SUPER-NI, as both are constructed from existing task data. Table 2b shows that the model trained with DYNOSAUR exceeds the one trained with SUPER-NI on all three aspects. Moreover, DYNOSAUR is a more effective addition to automatically generated instructions such as INST. GPT-4 than SUPER-NI is.

Unveiling More Benefits of DYNOSAUR
Beyond the evident advantages in data quality, which correspondingly enhance model performance, we elucidate the additional merits of DYNOSAUR from three perspectives: the validity of data, the cost-efficiency in data construction, and the potential for dynamic data expansion.
Data Validity. We conduct a human evaluation to scrutinize the validity of DYNOSAUR. We randomly select 200 task instructions and recruit evaluators from Amazon Mechanical Turk to confirm the data validity. Each evaluator is instructed to choose from four options for each sample: "completely reasonable", "incorrect input", "incorrect

Continual Learning with Dynamically Growing Datasets
As DYNOSAUR can expand over time as new tasks come in, an important question is how to adapt an instruction-tuned model to new tasks without suffering from catastrophic forgetting. In this section, we examine continual learning as an approach for learning instruction-following models with dynamically growing datasets. We focus on one of the common continual learning techniques (Biesialska et al., 2020), replay methods, which select previously trained tasks for further training stages. We aim to provide an analysis of how to most effectively select the tasks to replay. We want to answer the following questions: 1) Do we need to replay history tasks? 2) Should we replay tasks based on instructions or data? 3) Which tasks should we replay?

Replay Methods. We compare the following replay strategies: 1) No Replay: train models without any replay tasks; 2) Instr. Diverse: replay the tasks from the last stage that diverge most from those in the current stage based on instruction representations; 3) Instr. Similar: replay the tasks from the last stage that are most similar to tasks in the current stage; 4) Instr. Support: replay the most representative tasks in the last stage; 5) Data Diverse: replay diverse tasks based on the similarity of example data. Suppose there are L tasks in the current stage and K tasks in the previous stage. We use Sentence Transformer (Reimers and Gurevych, 2019) based on RoBERTa-large (Liu et al., 2019) to obtain the instruction representation matrices I_c ∈ R^{L×d} for the current stage and I_p ∈ R^{K×d} for the previous stage, where d is the representation dimension. We then compute the cosine similarity between I_c and I_p, and within I_p itself: S_cp = cos(I_c, I_p) ∈ R^{L×K}, S_pp = cos(I_p, I_p) ∈ R^{K×K}. Instr. Diverse replays the tasks with the smallest column sums in S_cp, Instr. Similar replays the tasks with the largest column sums in S_cp, and Instr. Support replays the tasks with the largest row sums in S_pp.
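The sketch below shows one way to implement the three instruction-based selection rules; the Sentence Transformer checkpoint name and the number of replayed tasks are assumptions for illustration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-roberta-large-v1")  # a RoBERTa-large based checkpoint

def select_replay_tasks(prev_instructions, curr_instructions, strategy, n_replay):
    """Pick which previous-stage tasks to replay, following the rules described above (sketch)."""
    I_p = encoder.encode(prev_instructions, normalize_embeddings=True)  # K x d
    I_c = encoder.encode(curr_instructions, normalize_embeddings=True)  # L x d
    S_cp = I_c @ I_p.T  # L x K cosine similarities (embeddings are unit-normalized)
    S_pp = I_p @ I_p.T  # K x K
    if strategy == "instr_diverse":    # least similar to the current stage
        scores = -S_cp.sum(axis=0)
    elif strategy == "instr_similar":  # most similar to the current stage
        scores = S_cp.sum(axis=0)
    elif strategy == "instr_support":  # most representative of the previous stage
        scores = S_pp.sum(axis=1)
    else:
        raise ValueError(strategy)
    return np.argsort(scores)[::-1][:n_replay]  # indices of previous-stage tasks to replay
```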
Related Work

One branch of instruction-tuning data is constructed from existing human annotations. The instructions in PROMPTSOURCE (Bach et al., 2022) and FLAN (Wei et al., 2022) are created with human-designed templates for limited task categories. NI (Mishra et al., 2022) and SUPER-NI (Wang et al., 2022b) are annotated by NLP practitioners from GitHub and NLP courses. More recent attempts distill instruction-tuning data from LLMs. The methods proposed in Self-Instruct (Wang et al., 2022a) and Unnatural Instructions (Honovich et al., 2022a) generate novel tasks by prompting LLMs with seed instruction-tuning tasks. Other works (Honovich et al., 2022b; Zhou et al., 2022) study instruction generation from input/output data. Another line of work simply uses structured metadata as instructions (Yin et al., 2023). Different from those works, when we generate DYNOSAUR instructions, the inputs/outputs for the generated tasks are unknown to the LLMs: the LLMs need to generate instructions from metadata and simultaneously determine which parts of the dataset annotations are task inputs/outputs.

Conclusions
We propose DYNOSAUR, an automatic paradigm for instruction-tuning data construction. We utilize metadata from existing NLP datasets and generate various tasks upon them. The generation cost of DYNOSAUR is significantly lower than that of other methods, while models trained on DYNOSAUR data outperform models trained on existing human-curated and machine-generated instruction datasets on SUPER-NI and LONGFORM. Taking advantage of the dynamically growing nature of DYNOSAUR, we further explore replay methods specific to instruction tuning that are effective in mitigating forgetting.

Limitations
Limited Language Scope. DYNOSAUR is only built upon English datasets in Huggingface Datasets, even though multilingual NLP datasets make up a large proportion of the platform. We plan to curate a multilingual version of DYNOSAUR and conduct comprehensive experiments to evaluate generalization in multilingual settings.
Errors in Generated Instruction Data. Although the data validity of DYNOSAUR is high, 16% of the data is still invalid. We conduct an error analysis (Appendix D) on the 200 instances used for human evaluation in §3.3 and observe several types of errors that have not yet been resolved. We expect to seek better methods for improving the quality of generated instruction data in future work.
Limited Sampled Dataset Instances. Due to data storage limits, we only sample at most 200 instances from each dataset for instruction-tuning data generation. We plan to consider more available instances from the selected datasets and further scale up DYNOSAUR.

Difficulty in Evaluation. It is hard to comprehensively assess the capabilities of instruction-tuned models (Zheng et al., 2023). We make our best effort to evaluate models on SUPER-NI, a large-scale benchmark with diverse tasks, along with a human evaluation on user instructions.

Ethics Statement
Our work is based on annotations of existing datasets. As these data may contain selection or annotation bias, such bias may be inherited by our paradigm. We recruit annotators for the human evaluation of data validity and for task category classification from Amazon Mechanical Turk. All annotators are fairly paid approximately $12 per hour.

Figure 1: Overall pipeline of collecting DYNOSAUR data. "d" in Step 4 denotes each instance in the Gutenberg dataset.

Figure 2: Examples of datasets and generated tasks. We only demonstrate one task per dataset for simplicity. We highlight the parts of the metadata that benefit instruction generation.

Figure 3: The top 20 most prevalent root verbs and their top 4 direct nouns in the instructions of DYNOSAUR.

Table 2: Human evaluation of LLAMA-7B with user instructions. The percentages in the column labeled with dataset name A indicate how many of the generations produced by the model trained with A are better than those produced by the model trained with the other dataset B on USER-INSTRUCTION-252. "Tie" means that the generations of the two models have similar quality.

Table 3: Results on LONGFORM. The performance of models marked with ‡ is as reported in Köksal et al. (2023). Note that the suffix "-11B" on the listed baselines indicates that their base model size is 11B.

Table 5: The generation cost of different instruction-tuning datasets. * indicates that the cost estimation for INSTR. GPT-4 only involves output data generation, as it uses the same instructions and input data as ALPACA.
We design three metrics to quantify to what extent models generalize to new tasks, how well models perform on the training tasks in the current stage, and how much models forget the previously trained tasks: Test, ROUGE-L on the test set of SUPER-NI, which represents unseen tasks; Holdout, ROUGE-L on the holdout data of training tasks in the current stage; Previous, ROUGE-L on the holdout data of training tasks in previous stages. As mentioned in §3.3, 16% of DYNOSAUR data are invalid. To avoid evaluating models on invalid holdout data, we do not report Holdout and Previous results for DYNOSAUR experiments.

Continual learning results of T5-3B trained with SUPER-NI: we divide the training set into three stages, and for each stage we report ROUGE-L on the test set, the holdout data in the current stage, and the holdout data in previous stages.

Table 6: Continual learning results of T5-3B trained with SUPER-NI and DYNOSAUR. "Full" denotes training with the entire SUPER-NI and DYNOSAUR at once.