Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLMs

Most NLP tasks are modeled as supervised learning and thus require labeled training data to train effective models. However, manually producing such data at sufficient quality and quantity is known to be costly and time-intensive. Current research addresses this bottleneck by exploring a novel paradigm called zero-shot learning via dataset generation. Here, a powerful LLM is prompted with a task description to generate labeled data that can be used to train a downstream NLP model. For instance, an LLM might be prompted to "generate 500 movie reviews with positive overall sentiment, and another 500 with negative sentiment." The generated data could then be used to train a binary sentiment classifier, effectively leveraging an LLM as a teacher to a smaller student model. With this demo, we introduce Fabricator, an open-source Python toolkit for dataset generation. Fabricator implements common dataset generation workflows, supports a wide range of downstream NLP tasks (such as text classification, question answering, and entity recognition), and is integrated with well-known libraries to facilitate quick experimentation. With Fabricator, we aim to support researchers in conducting reproducible dataset generation experiments using LLMs and help practitioners apply this approach to train models for downstream tasks.


Introduction
In recent years, natural language processing (NLP) has witnessed remarkable progress due to the introduction of pre-trained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Conneau and Lample, 2019; He et al., 2021). These PLMs are typically fine-tuned on large human-annotated datasets, resulting in state-of-the-art performance in tasks such as text classification, token classification, and question answering. However, real-world applications of this approach face the bottleneck that sufficient amounts of human-annotated data are often unavailable and too costly to produce manually, especially when domain expertise is required.

Dataset generation with teacher LLMs. Recently, a paradigm called zero-shot learning via dataset generation (Meng et al., 2022; Ye et al., 2022a,b) has emerged, potentially obviating the need for human-annotated data. This approach leverages the generation capability of large language models (LLMs) to create class-conditioned texts guided by label-descriptive prompts and, optionally, few-shot examples of instances of the desired classes. The generated dataset is then used to train a smaller student PLM.
Refer to Figure 1 for an illustration of this process: in this example, an LLM is instructed to write 500 positive and 500 negative movie reviews. To guide the process, we include an example of a positive and a negative review in the prompt. With this generated dataset, a binary sentiment classifier can then be trained.

Generation Targets

FABRICATOR supports three generation targets.

1. Generate unlabeled data. The first generation target is to produce unlabeled data. For instance, during the development of a question answering system, we might require a corpus of example questions or a corpus of texts on a particular topic. For this scenario, users provide a prompt w (such as "Generate a text in the domain of history that contains facts someone can ask questions about."), and the auto-regressive LLM G_θ generates an appropriate text x_g.
2. Generate label-conditioned data. The second generation target is to generate data belonging to a pre-defined class, as in classification tasks. The LLM generates a text x_g corresponding to a specific label y from a set of labels.
As discussed in the introduction, one example is to generate training data for a binary sentiment classifier. To achieve this, one must define a set of labels (y ∈ {positive, negative}) and a prompt w_y such as "Generate a <y> movie review: ". The generated sequence x_g is then paired with the label y to form a training pair (x_g, y) for fine-tuning.
3. Annotate unlabeled data. The third generation target applies when an unlabeled text dataset for a domain is already available and only the training labels are missing. For instance, a corpus of movie reviews might already exist, but without sentiment labels.
In FABRICATOR, researchers can add labels to an existing corpus by extending prompt w with fixed label options y to form w_y, such as "Annotate the movie review either as: positive, negative." The generated label y is then paired with the unlabeled data point x_u to form a data pair (x_u, y).
The generation targets defined above are executed multiple times to generate a corpus of a specified size. The prompt may also be extended to include few-shot examples of each class, as shown in Figure 1. The prompt can also handle multiple inputs (for example, for tasks like textual similarity) using pre-defined interfaces in FABRICATOR. In all cases, the correct prompt is composed and executed in our backend.
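To make the three targets concrete, the prompt composition can be sketched in plain Python. This is a simplified stand-in; build_prompt and its argument names are ours for illustration, not the FABRICATOR API:

```python
def build_prompt(task_description, fewshot_examples=None, unlabeled_example=None):
    """Compose a generation prompt for the three targets described above:
    1. unlabeled generation: only a task description,
    2. label-conditioned generation: a task description with the label filled in,
    3. annotation: a task description plus the unlabeled data point to label.
    Few-shot examples, if given, are rendered between instruction and input."""
    parts = [task_description]
    for ex in fewshot_examples or []:
        parts.append(f"text: {ex['text']}\nlabel: {ex['label']}")
    if unlabeled_example is not None:
        parts.append(f"text: {unlabeled_example}\nlabel:")
    return "\n\n".join(parts)

# 1. Generate unlabeled data (prompt w)
p1 = build_prompt("Generate a text in the domain of history that contains "
                  "facts someone can ask questions about.")

# 2. Generate label-conditioned data (prompt w_y, label already filled in)
p2 = build_prompt("Generate a positive movie review: ")

# 3. Annotate an existing unlabeled data point
p3 = build_prompt("Annotate the movie review either as: positive, negative.",
                  fewshot_examples=[{"text": "A joyful ride.",
                                     "label": "positive"}],
                  unlabeled_example="Two hours I will never get back.")
```

Note how the annotation variant ends with an open "label:" field, so the LLM's continuation becomes the label that is paired with the unlabeled data point.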

Classes and Concepts
As Figure 2 illustrates, the key module in our approach is the DatasetGenerator class, which acts as an orchestrator between the LLM (PromptNode), the prompt (BasePrompt), and, optionally, the few-shot examples and unlabeled datasets.
The generate() function within the DatasetGenerator class converts the BasePrompt and the provided few-shot and unlabeled data into a processable prompt for the LLM. The method offers various arguments to steer the generation process. Users can specify parameters such as the maximum number of API calls, the sampling strategy for few-shot examples (uniform vs. stratified), or the number of few-shot examples to use in a single prompt. Our repository contains documentation with details on all available customization options.
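The two few-shot sampling strategies can be illustrated with a self-contained sketch (our simplified stand-in, not the actual FABRICATOR implementation): uniform sampling draws from the whole pool regardless of labels, while stratified sampling balances the drawn examples across classes.

```python
import random
from collections import defaultdict

def sample_uniform(pool, k, rng):
    # Uniform: draw k few-shot examples from the whole pool, ignoring labels.
    return rng.sample(pool, k)

def sample_stratified(pool, k, rng):
    # Stratified: draw examples round-robin across labels so that the
    # selected few-shot examples stay balanced over classes.
    if k > len(pool):
        raise ValueError("cannot sample more examples than the pool holds")
    by_label = defaultdict(list)
    for ex in pool:
        by_label[ex["label"]].append(ex)
    for examples in by_label.values():
        rng.shuffle(examples)
    labels, sampled, i = sorted(by_label), [], 0
    while len(sampled) < k:
        bucket = by_label[labels[i % len(labels)]]
        if bucket:
            sampled.append(bucket.pop())
        i += 1
    return sampled

pool = [{"text": f"review {i}", "label": "positive" if i % 2 else "negative"}
        for i in range(10)]
rng = random.Random(0)
strat = sample_stratified(pool, 4, rng)  # two examples per label
```

With a binary label set, stratified sampling of 4 examples always yields two per class, whereas uniform sampling may return an imbalanced selection.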

HuggingFace Interoperability through Dataset Class
FABRICATOR operates on the Dataset class from HuggingFace's DATASETS library. By default, generate() produces the generated data as a Dataset instance. This allows generated datasets to be directly used in existing training scripts of the TRANSFORMERS library (Wolf et al., 2020) and to be shared among researchers via the HuggingFace dataset hub.
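As a library-free illustration of the columnar layout behind such a Dataset object, generated rows can be converted into the mapping that datasets.Dataset.from_dict() accepts (the helper name is ours):

```python
def rows_to_columns(rows):
    # Convert generated row dicts into the columnar mapping that, e.g.,
    # datasets.Dataset.from_dict() accepts.
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return columns

generated = [{"text": "Loved it.", "label": "positive"},
             {"text": "Awful.", "label": "negative"}]
cols = rows_to_columns(generated)
# cols == {"text": ["Loved it.", "Awful."], "label": ["positive", "negative"]}
```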
An existing dataset may also be used as input to the generate() method. Since the DATASETS library supports a wide range of standard benchmarks and their formats, existing datasets can be easily loaded and used as input. For instance, in some generation workflows, we would like to add labels to an existing corpus or use its instances as few-shot examples within a prompt.

Prompt Class
Prompting is crucial when operating on large language models as it guides the auto-regressive generation process. While in the simplest case a prompt is a single textual string, we find that many scenarios require more complex prompts and customization options. For instance, when including few-shot examples in a prompt, one must decide how many examples to include in each prompt and how these are sampled (uniform vs. stratified) from the available few-shot data across different prompt calls. Similarly, the complexity increases for tasks such as textual entailment (requiring multiple inputs) and entity recognition (potentially requiring transformation of token-level BIOES tags into span-level prompting queries).
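The token-to-span transformation mentioned for entity recognition can be sketched as follows. This is a simplified version handling plain BIO tags; FABRICATOR's actual handling may differ:

```python
def bio_to_spans(tokens, tags):
    """Convert token-level BIO tags into (entity_text, label) spans, which
    are easier to request from an LLM in a natural-language prompt."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # "O" (or a stray I- tag) closes any open span
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = ["Angela", "Merkel", "visited", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
spans = bio_to_spans(tokens, tags)
# spans == [("Angela Merkel", "PER"), ("Paris", "LOC")]
```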
To address these challenges, FABRICATOR introduces a simple yet powerful BasePrompt class that offers clear interfaces for customizing prompts for various dataset generation tasks. The interface includes attributes to specify pre-defined label options for label-conditioned generation, and supports few-shot examples and unlabeled datasets by selecting the dataset columns that provide the generation input and few-shot information in the prompt.
Since the prompt class directly operates on the dataset columns, FABRICATOR enables a sophisticated and flexible prompt design. To illustrate, when performing a textual similarity task, the user can specify the first sentence and the label as the few-shot information and prompt the LLM to generate a second sentence corresponding to the given sentence and label.
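Such a textual similarity prompt could be composed as in the following sketch. The column names sentence1/sentence2/label are illustrative, and the function is our stand-in, not the BasePrompt API:

```python
def similarity_prompt(instruction, rows):
    # Render selected dataset columns into the prompt, one block per row.
    # Completed few-shot rows carry a "sentence2" value; the final row
    # leaves it blank so the LLM completes it.
    blocks = [instruction]
    for row in rows:
        blocks.append(
            f"sentence1: {row['sentence1']}\n"
            f"label: {row['label']}\n"
            f"sentence2: {row.get('sentence2', '')}".rstrip()
        )
    return "\n\n".join(blocks)

rows = [
    {"sentence1": "A man plays a guitar.", "label": "similar",
     "sentence2": "Someone is playing an instrument."},
    {"sentence1": "A dog sleeps on the couch.", "label": "dissimilar"},
]
prompt = similarity_prompt(
    "Given sentence1 and the label, generate a matching sentence2.", rows)
```

The resulting prompt ends with an open "sentence2:" field, so the LLM's continuation yields the second sentence conditioned on both the first sentence and the label.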

LLMs
The LLM interface must be stable and ideally compatible with both models hosted as APIs and self-hosted LLMs. We leverage the HAYSTACK framework (Pietsch et al., 2019; https://github.com/deepset-ai/Haystack), specifically its PromptNode class, for interactions with LLMs. The PromptNode implementation allows users to select and use LLMs from various model providers, including HuggingFace, OpenAI, Azure, Anthropic, and Cohere.

Example Script
In Listing 1, we introduce an example script in which FABRICATOR is used to generate additional movie reviews for training a binary sentiment classification model (refer to generation workflow 2 as defined in Section 2.1). To implement this, we define:

• a pre-processed few-shot dataset (dataset, line 6) with labels in natural language form (e.g., 0 becomes "negative"); these examples are used to augment the generation prompt,
• a prompt template (prompt, line 8) specifying the instruction to the LLM,
• an LLM to use as teacher model (prompt_node, line 14),
• a DatasetGenerator to execute the generation process with all parameters (generator, line 20).
The prompt is configured in the constructor of the BasePrompt class (lines 8-12): we set a task_description with a placeholder for the label_options that we provide as a separate argument. We also specify the column of the loaded dataset for which to predict labels.
We then define a teacher LLM (lines 14-18) and pass dataset, prompt, and LLM to the DatasetGenerator orchestrator class (lines 20-27). Here, we specify a few-shot strategy that uniformly samples one label from the "label" column during generation, so that either a positive or a negative review is generated. Upon completion, the generate() function returns the annotated Dataset instance.
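The overall flow of this workflow can be mimicked with a small, self-contained stand-in: a stubbed teacher "LLM" replaces a real PromptNode, and all names are ours for illustration, not FABRICATOR's:

```python
import random

def fake_llm(prompt):
    # Stand-in for a teacher LLM: returns a canned review per label.
    return ("Loved every minute of it." if "positive" in prompt
            else "A tedious, forgettable film.")

def generate(task_description, label_options, fewshot, max_prompt_calls, rng):
    # Orchestration loop: sample a label uniformly, compose the prompt with
    # matching few-shot examples, call the LLM, and collect (text, label) pairs.
    dataset = []
    for _ in range(max_prompt_calls):
        label = rng.choice(label_options)
        prompt = task_description.format(label=label)
        for ex in (e for e in fewshot if e["label"] == label):
            prompt += f"\ntext: {ex['text']}"
        dataset.append({"text": fake_llm(prompt), "label": label})
    return dataset

fewshot = [{"text": "A joyful ride.", "label": "positive"},
           {"text": "Two wasted hours.", "label": "negative"}]
data = generate("Generate a {label} movie review:", ["positive", "negative"],
                fewshot, max_prompt_calls=4, rng=random.Random(0))
```

Swapping fake_llm for a real model call and returning the rows as a HuggingFace Dataset yields the same shape of output as the library's workflow.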

Experiments
To illustrate how FABRICATOR could be used in research, we conduct an exploratory evaluation of two scenarios: (1) how models trained on generated datasets compare to models trained on human-annotated datasets, and (2) whether few-shot examples in the prompt improve generated datasets.
To do so, we train smaller PLMs on generated datasets and evaluate them on the human-labeled test split of the respective benchmark. For question answering, we fine-tune a roberta-base PLM (Liu et al., 2019). For all other tasks, we fine-tune a bert-base-uncased PLM (Devlin et al., 2019). The hyperparameters are listed in Appendix A.2. We report the score and standard deviation averaged over 5 random seeds for each experiment.

Experiment 1: Comparison of Generated and Human-Annotated Datasets
We re-annotate existing benchmark datasets with generated labels in the first experiment. This experiment aims to measure the difference in accuracy between downstream task models trained on human-annotated data and models trained on generated data. We evaluate text classification, textual similarity, and extractive question answering tasks.

Experimental setup. We conduct this evaluation on 5 datasets spanning 3 NLP tasks: We use IMDB (Maas et al., 2011), a binary sentiment classification benchmark, and TREC-6 (Li and Roth, 2002), a 6-class question type categorization dataset, to evaluate text classification. We use the 2-class MRPC (Dolan and Brockett, 2005) and the 3-class SNLI (Bowman et al., 2015) datasets to evaluate textual similarity. Finally, we use SQuAD-v2 (Rajpurkar et al., 2016) to evaluate extractive question answering. We use generation prompts augmented by 2 examples per prompt, sampled from 6 possible few-shot examples per class.

Results (Table 1). For all datasets, we compare a generated dataset of 50, 500, 1k, and the full dataset (limited to 10k if it is larger) to gold-annotated data of the same size. For question answering, models need to be trained on at least 1k examples to obtain representative results, so we do not report scores for 50 or 500 examples for SQuAD. We find that for simple tasks such as binary sentiment classification (IMDB), models trained on the annotations by LLMs achieve similar accuracy on the gold-labeled test split (↓1.0 pp. in accuracy with 10k training examples). However, as the complexity of the datasets increases (text classification with more classes and extractive question answering), we observe that the performance of models trained on LLM-annotated datasets falls short (↓19.0 pp. for SNLI and ↓16.3 pp. for SQuAD, with 10k training examples).
These performance gaps indicate that the usefulness of LLMs as teacher models depends on the specific task. In the next section, we present an experiment that explores how to close this gap by using additional few-shot examples.

Experiment 2: Impact of Few-Shot Examples
In the second example experiment, we re-annotate TREC-6 using a varying number of few-shot examples. This experiment aims to determine whether adding few-shot examples for each class improves dataset generation with FABRICATOR. We investigate two variables: (1) the total number of few-shot examples available per class, and (2) how many of these are included per prompt. For instance, there might be 8 few-shot examples available in total, but only 3 are randomly sampled to be included in each prompt call.
Results (Table 2). We note a generally positive trend: both increasing the number of available few-shot examples (column "# few-shot examples per class") and increasing the number of examples used in each prompt (column "# examples per class used in prompt") improve model performance.
In particular, we find many settings that outperform the numbers of our previous experiment (where we sampled 2 examples per prompt out of a total of 6), highlighted in bold in Table 2. However, we also find that improvements become uneven when the number of examples per class used in the prompt is increased above 3, indicating that prompts should not be overloaded with too many examples.
Related Work

However, we note a lack of accessible frameworks that facilitate straightforward and reproducible dataset generation using teacher LLMs. While existing open-source toolkits like OpenPrompt (Ding et al., 2022) partially extend to dataset generation scenarios, our approach stands apart by providing lightweight, dedicated interfaces for the introduced generation tasks, supporting a wide range of LLMs through HAYSTACK, and integrating with HuggingFace DATASETS for easy evaluation.
Prompt-based learning (Liu et al., 2021; Gao et al., 2021; Schick and Schütze, 2021a; Le Scao and Rush, 2021) is another line of research that has proven useful in improving downstream tasks in zero- and few-shot settings by leveraging LLMs' pre-training objectives (Brown et al., 2020; Ouyang et al., 2022; Zhang et al., 2022; Scao et al., 2023; Touvron et al., 2023). However, the availability of training data in low-resource scenarios is still crucial (Perez et al., 2021; Sahu et al., 2022). Therefore, our method also seeks to fill this gap by providing a comprehensive and easily reproducible dataset generation toolkit.

Conclusion
We introduced FABRICATOR, a user-friendly library for dataset generation utilizing LLMs. With FABRICATOR, researchers gain access to a highly customizable interface that enables efficient research on zero-shot and few-shot learning via dataset generation. Further, we implemented various baselines using generated datasets to illustrate potential applications of our repository, and we plan to support further downstream tasks in the future. We believe that FABRICATOR will be a valuable tool for the NLP community, facilitating advancements in dataset generation and fostering research in various natural language processing domains.

Limitations
While our paper aims to address dataset creation for a wide range of downstream tasks, it is important to acknowledge certain limitations in our study. Firstly, during our repository's evaluation phase, we could only test and assess a subset of tasks due to resource and time constraints. Our evaluation may only cover a portion of the tasks researchers and practitioners commonly encounter in their work. Future work must expand the evaluation to include a broader range of tasks to provide a more comprehensive understanding of the repository's effectiveness.
Additionally, despite our best efforts in designing the repository layout to be versatile and adaptable, there might be specific tasks or domains where our repository's structure or features may not be directly applicable. We acknowledge that the landscape of downstream tasks is diverse and constantly evolving, which may require tailored approaches or extensions to our existing framework. We encourage open-source contributions and active engagement from the community to address these limitations. By involving a more comprehensive range of perspectives and expertise, we aim to consistently improve the repository and enhance its suitability for various task requirements.
Furthermore, while we have endeavored to provide thorough documentation and guidelines within the repository, there is always a possibility of overlooked issues or unforeseen challenges that may arise during dataset creation.

Ethics Statement
While large language models have shown remarkable advancements in natural language understanding and generation, their capabilities also raise important ethical considerations. One prominent concern is the potential for hallucination, where the models may generate false or misleading information. This aspect can have serious implications, especially when datasets are created for critical domains such as medicine, law, or journalism. It is crucial to exercise caution and verify the accuracy and reliability of outputs generated by our repository, particularly when making decisions that have real-world consequences.
Another ethical concern is the presence of biases in language models, which can perpetuate and amplify societal prejudices and inequalities. These biases can arise from biased training data or biased patterns in human-generated text that the models learn from. Since our repository is in an early stage, we emphasize the need to carefully inspect created datasets to identify and rectify biases that may be present.
To ensure a responsible dataset creation process, it is essential to engage in thorough data validation, including identifying and addressing potential biases, checking data sources for reliability and credibility, and involving diverse perspectives in dataset collection and annotation processes. Moreover, continuous monitoring and auditing of the models' outputs and performance can help identify and rectify any ethical concerns arising during deployment.

A.1 Screencast
A screencast about the FABRICATOR framework can be found on Vimeo.

A.2 Hyperparameters for Experiments
We used AdamW (Loshchilov and Hutter, 2019) as our optimizer with a batch size of 16. Further, we used a linear warm-up for 10% of the optimization steps. We fine-tune roberta-base for question answering with a learning rate of 1e-5 for two epochs without early stopping. For the bert-base-uncased PLM, we fine-tune using a learning rate of 2e-5 for either 5 (if the training data has more than 1000 examples), 10 (if the training data has at least 500 but fewer than 1001 examples), or 20 epochs (if the training data has fewer than 501 examples). Further, across all experiments, we use 10% of the data as a validation split for model selection.

The results are depicted in Table 3. We observe significant performance drops compared to the re-annotation experiments for TREC from Section 3.1. For instance, using 10k generated examples achieves a performance level similar to using 50 human-annotated examples (compare to Table 1). However, we note that we performed no prompt optimization techniques or hyperparameter searches in any experiment. Additionally, we generated a uniform distribution of classes, while the gold-labeled dataset is skewed towards certain categories. It is worth mentioning that this class distribution information may not be available in real-world few-shot settings.

A.4 Impact of Few-Shot Examples on Label-Conditioned Generation
In this experiment, we generated 500 label-conditioned data pairs for the TREC dataset, following the approach described in Section 3.

A.5 Instruction-tuning open-source models
In this experiment, we compare the annotation performance of OpenAI's GPT-3.5 with an instruction-tuned open-source LLaMA model. To conduct this evaluation, we choose the token classification task on the CoNLL-03 dataset (Tjong Kim Sang and De Meulder, 2003), which requires one label for each token in the input, making it a structured prediction task.
The results are shown in Table 5. We observe that prompting on the dataset as-is often yields unusable annotation outputs, primarily due to imprecise formatting. To address this, we convert the token-level labels into spans and prompt the LLM to extract all named entities for the relevant categories. We then transform the found entities back into token-level tags by searching for the annotations as substrings of the input text. We compare the performance of this approach with an instruction-tuned LLaMA model on the entire training split of CoNLL-03 by letting both LLMs annotate the validation set.
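The back-conversion from extracted entities to token-level tags can be sketched as follows; this is our simplified approximation of the substring search described above, not the exact procedure:

```python
def spans_to_bio(tokens, entities):
    """Map extracted (entity_text, label) pairs back to token-level BIO tags
    by matching the entity's tokens against the input tokens."""
    tags = ["O"] * len(tokens)
    for text, label in entities:
        ent_tokens = text.split()
        for i in range(len(tokens) - len(ent_tokens) + 1):
            if tokens[i:i + len(ent_tokens)] == ent_tokens and tags[i] == "O":
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + len(ent_tokens)):
                    tags[j] = f"I-{label}"
                break  # only tag the first untagged occurrence
    return tags

tokens = ["Angela", "Merkel", "visited", "Paris", "."]
predicted = spans_to_bio(tokens, [("Angela Merkel", "PER"), ("Paris", "LOC")])
# predicted == ["B-PER", "I-PER", "O", "B-LOC", "O"]
```

Entities whose text does not match any token window are simply skipped, which mirrors why imprecisely formatted LLM outputs can become unusable annotations.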
Unlike the previous evaluation, we did not train and evaluate a smaller PLM on the gold-labeled test set. Instead, we assess the agreement between the gold-annotated validation split and the annotations made by the LLM. Our findings indicate that the annotation quality of instruction-tuned LLMs can significantly improve over OpenAI's GPT, as evident from the higher F1 score. This finding suggests that instruction-tuned models for dataset generation have the potential to facilitate the generation process for complex downstream tasks in future research endeavors.

Table 3: Results on TREC-6 with questions generated by GPT-3.5 using 3 few-shot examples (uniformly sampled from 8 possible few-shot examples per class). We observe that generation performance is worse than with an equally sized human-annotated dataset; however, performance increases with the number of generated examples.

# Training examples     50            500           1000          all
TREC-6 Gold             42.7 ± 9.6    93.8 ± 0.3    95.1 ± 0.6    97.1 ± 0.3
TREC-6 Generated        27.5 ± 11.0   56.2 ± 3.3    57.9 ± 1.6    62.6 ± 3.4

Figure 1: The process of learning via dataset generation. A teacher model (LLM) is prompted to generate 500 movie reviews for each sentiment (positive, negative). A smaller student PLM is trained on the generated dataset.

Figure 2: With FABRICATOR, the generation process involves a prompt template that creates the final prompt using all provided arguments. The generator class creates training examples until the maximum number of prompt calls is reached or the unlabeled dataset is fully annotated. Ultimately, the generator class produces a HuggingFace Dataset instance.

Table 1: Results on re-annotation experiments using 2 few-shot examples per prompt (uniformly sampled from 6 few-shot examples per class). We report accuracy except for SQuAD, where we report F1, and highlight in bold those experiments where generated data yielded similar scores as human-annotated data. We observe that GPT-3.5 is not able to annotate at human-level performance except for simple classification tasks such as IMDB.

Table 2: Results on 500 annotated TREC-6 examples using varying amounts of few-shot examples. We sweep over the number of available few-shot examples and the number of few-shot examples used in the actual prompt. We highlight in bold where increasing few-shot examples improves over the 79.3 TREC-6 score of Experiment 1 (Table 1).
24.6 pp. in accuracy). Additionally, using more examples in a distinct prompt slightly improved the model performance. We encountered one outlier when using 16 examples per class and including five examples in the prompt for generation, which resulted in lower performance than sampling from 8 few-shot examples per prompt. It is important to note that during this experiment, we did not adjust any hyper-parameters of the LLM for generation, such as temperature or top-k sampling.

Table 4: Results on 500 generated TREC-6 examples with different numbers of available few-shot examples per class and of few-shot examples included in the prompt. We observe that more few-shot examples result in better performance on the gold-annotated test split.

…  | … ± 1.5 | 58.8 ± 1.0 | 58.2 ± 1.0 | 64.0 ± 2.0
16 | –       | 58.3 ± 0.8 | 59.8 ± 2.5 | 58.7 ± 1.1 | 54.8 ± 1.5

Table 5: Comparison of instruction-tuned LLaMA models with 3-shot GPT-3.5, based on the training split of CoNLL-03. We report accuracy and span-level F1 score for the annotation of the validation split. *: We convert tag sequences to spans in order to prompt the LLM with strings rather than sequences. However, 38% of the validation split annotations have different lengths after tokenization; these have been filtered out for a fair comparison.