LLMaAA: Making Large Language Models as Active Annotators

Prevalent supervised learning methods in natural language processing (NLP) are notoriously data-hungry, demanding large amounts of high-quality annotated data. In practice, acquiring such data is a costly endeavor. Recently, the superior few-shot performance of large language models (LLMs) has propelled the development of dataset generation, where the training data are solely synthesized from LLMs. However, such an approach usually suffers from low-quality issues and requires orders of magnitude more labeled data to achieve satisfactory performance. To fully exploit the potential of LLMs and make use of massive unlabeled data, we propose LLMaAA, which takes LLMs as annotators and puts them into an active learning loop to determine what to annotate efficiently. To learn robustly with pseudo labels, we optimize both the annotation and training processes: (1) we draw k-NN examples from a small demonstration pool as in-context examples, and (2) we adopt the example reweighting technique to assign training samples with learnable weights. Compared with previous approaches, LLMaAA features both efficiency and reliability. We conduct experiments and analysis on two classic NLP tasks, named entity recognition and relation extraction. With LLMaAA, task-specific models trained from LLM-generated labels can outperform the teacher within only hundreds of annotated examples, which is much more cost-effective than other baselines.


Introduction
Large language models (LLMs) have exhibited remarkable few-shot performance in a wide range of tasks, with only a few demonstrations and well-designed prompts (Brown et al., 2020; Ding et al., 2022; Liu et al., 2023). However, with rapid advancements come vast potential risks in adopting LLMs for widespread downstream production applications. One of the main concerns is data privacy and security. Under the prevalent "Language-Model-as-a-Service" (LMaaS; Sun et al., 2022) setting, users are required to feed their own data, potentially including sensitive or private information, to third-party LLM vendors to access the service, which increases the risk of data leakage (Lyu et al., 2020; Yu et al., 2022; Li et al., 2023). Besides, LLMs usually consume abundant tokens through continuous API requests, where the marginal cost and latency become substantial in large-scale or real-time applications, hindering LLMs' practical deployment in cost-sensitive scenarios (Goyal et al., 2020; Cao et al., 2023).

Figure 1: Comparing LLMAAA with other frameworks. We actively acquire annotations from the LLM for efficiency, requiring little human effort.
On the other hand, training task-specific models (TAMs) for NLP tasks necessitates extensive amounts of labeled data. Due to the superior generative capacity of LLMs, some researchers attempt to synthesize training data with text generation (Meng et al., 2022; Ye et al., 2022), as depicted in Figure 1. However, the generated text usually struggles with low-quality issues and may exhibit domain shifts from test data (Gao et al., 2023). To exploit the abundant unlabeled corpus, an alternative is to employ LLMs as annotators, which generate labels in a zero-shot or few-shot manner. While this approach seems promising, it is important to acknowledge that LLM-generated labels inevitably contain noise, especially when applied to challenging tasks and domain-specific data (Agrawal et al., 2022; Kazemi et al., 2023). Besides, larger models come with heavier expenses, and it is also crucial to reduce the annotation cost when the budget is restricted.
To enhance the reliability (i.e., accuracy) of TAMs' performance as well as to ensure data efficiency in annotation cost, we propose LLMAAA, an innovative framework that integrates active learning into the LLM annotation process, i.e., making LLMs act as Active Annotators. By exploring different active acquisition strategies, LLMAAA enables the LLM to annotate the more informative instances that benefit model performance most. To train TAMs reliably, we optimize both the annotation and training processes within the LLMAAA framework. First, we employ prompt engineering techniques to enhance LLMs' performance by (1) selecting k-NN samples from a demonstration pool as in-context examples, and (2) building fine-level descriptions aligned with natural language for unnatural labels (e.g., category labels in the RE task). The valuable contextual information helps improve the quality of LLM annotations substantially. During training, we adopt the automatic reweighting technique (Ren et al., 2018) to assign learnable weights to the silver (i.e., LLM-annotated) training samples. This strategy allows the model to prioritize more informative and representative samples while simultaneously reducing the impact of noisy annotations.
We evaluate LLMAAA on two practical NLP tasks: named entity recognition (NER) and relation extraction (RE). Experiments show that: (1) with small-scale gold data (~100 examples) serving for demonstration and validation, the trained TAMs can outperform their teacher LLMs within hundreds of silver samples via LLMAAA; (2) our approach is significantly more data-efficient than prevalent data generation methods, which usually require large-scale synthetic training data (sizes varying from 10k to 200k; Ye et al., 2022; Gao et al., 2023). These results confirm the potential of LLMAAA as a practical and cost-efficient solution that makes LLMs good annotators. The TAMs created through our framework offer advantages in terms of task-specific performance, data privacy, and inference cost, releasing the capacity of LLMs for real-world productivity.
We summarize our contributions as follows:
• We propose LLMAAA, a framework to employ LLMs as annotators, featuring both efficiency and reliability.
• LLMAAA is capable of training TAMs that outperform teacher LLMs within hundreds of annotated samples, on classic NLP tasks like NER and RE.
• LLMAAA sheds light on a practical substitution for LLMs, with a cost-effective, privacy-ensured, yet well-performing solution.

Related Work
LLM and In-Context Learning Large language models (LLMs), usually pretrained on large-scale corpora to capture rich linguistic patterns and generate coherent text (Brown et al., 2020; Raffel et al., 2020; Chowdhery et al., 2022; OpenAI, 2023; Touvron et al., 2023), have shown remarkable performance in a wide range of NLP tasks (Min et al., 2021; Zhao et al., 2023). With the proposal of in-context learning (Brown et al., 2020), prompt engineering has been extensively explored to steer LLMs' behavior toward desired outcomes. These techniques design specific prompts or instructions to guide models' outputs (Ding et al., 2022; Liu et al., 2023), in either rule-based (Shin et al., 2020) or learning-based (Lester et al., 2021) manners. A recent trend focuses on the strong reasoning capabilities of LLMs and enhances their performance on complex tasks with chain-of-thought (CoT) prompting (Wei et al., 2023a). In general, prompt engineering improves the controllability and performance of LLMs in few-shot and zero-shot settings (Zhong et al., 2021), and enables LLMs to solve specific tasks, e.g., information extraction (Wei et al., 2023b; Wang et al., 2023).
Dataset Synthesis Supervised learning methods in NLP are often limited by the scarcity of high-quality annotated data. To mitigate this, recent studies harvest training data with LLMs, either by annotation or by generation. Following the first line of research, Feng et al. (2021) and Chen et al. (2023) employ LLMs as unsupervised annotators to generate dialogue datasets. Recently, AnnoLLM (He et al., 2023) makes LLMs' performance on par with crowdsourced annotators via chain-of-thought prompting and self-generated explanations. As a contemporary work, Bansal and Sharma (2023) use LLMs for annotation in the domain-transfer setting. Under the formulation of active learning, they propose a new metric, conditional informativeness, that works well with noisy labels. Among generation-based methods, Wang et al. (2021) first use LLMs with few-shot prompts to generate training data. Schick and Schütze (2021) attempt to generate labeled text counterparts and text pairs for semantic textual similarity tasks. ZEROGEN (Ye et al., 2022) and SUNGEN (Gao et al., 2023) further extend this practice to zero-shot learning by training small models on zero-shot LLM-generated datasets. However, these approaches still suffer from the low-quality and domain-shift issues of synthetic data, and none of them consider the cost efficiency of data generation via LLMs.
Labor Efficiency and Active Learning Active learning is a technique proposed to minimize the annotation cost during the labeling process (Settles, 2009; Ren et al., 2021). A popular setting for active learning is the pool-based paradigm, which aims to select the most beneficial samples from an unlabeled data pool based on criteria including uncertainty (Lewis and Gale, 1994; Houlsby et al., 2011; Gal et al., 2017), diversity (Huang et al., 2010; Sener and Savarese, 2018), and hybrid objectives (Du et al., 2017; Yang et al., 2017; Ash et al., 2020; Margatina et al., 2021). The selected samples are annotated by human annotators and then added to the labeled dataset iteratively.

LLM as Active Annotator
To exploit LLMs' superior few-shot performance and leverage abundant unlabeled data, we take the LLM as an annotator and train task-specific models for inference. An ideal process should be both efficient and reliable: we want to learn TAMs robustly with minimal LLM-generated labels.
Concretely, our solution is to make LLMs act as Active Annotators. As shown in Figure 2, LLMAAA comprises three key components: (1) an LLM annotator that generates pseudo labels for given data, (2) an active acquisition mechanism for efficient data selection, and (3) an automatic reweighting technique to ensure robust learning with noisy labels. LLMAAA iterates over the three stages to gradually produce stronger TAMs.

Optimizing LLM as Better Annotator
In-context learning (i.e., PROMPTING) enables an LLM to conduct few-shot inference without finetuning. Given a manually designed prompt template $\mathcal{T}(\cdot, \cdot)$, a demonstration set $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{k}$, and a query example $x_q$, PROMPTING first builds a sentence $\mathcal{T}(\mathcal{S}, x_q)$, conditioned on which the LLM then generates a text sequence
$$\tilde{y}_q = \mathrm{LLM}\left(\mathcal{T}(\mathcal{S}, x_q)\right).$$
Finally, $\tilde{y}_q$ is mapped to the label space $\mathcal{Y}$.
Despite these decent abilities, previous studies show that the design of task-specific prompts has a large impact on performance, varying between near state-of-the-art and random guess (Gao et al., 2021; Lu et al., 2022b). Finding the best prompts for given tasks and given data points is intractable. However, several principles turn out to be effective compared with plain instructions.

k-NN Example Retrieval Liu et al. (2022) propose a k-NN retrieval strategy, which first embeds the demonstration pool D_demo and the query sample into vector representations, and then retrieves the nearest k neighbors of the query to form its exemplars. The rationale is that semantically similar examples may help the LLM answer the query better. Following their practice, we use Sentence-BERT (Reimers and Gurevych, 2019, 2020) to build the representations.
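For concreteness, this retrieval step might be sketched as follows with the sentence-transformers library; the encoder checkpoint and function names are illustrative assumptions, not the paper's exact implementation.

    from sentence_transformers import SentenceTransformer
    import numpy as np

    # Illustrative encoder choice, not necessarily the paper's checkpoint.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def retrieve_demos(query: str, demo_pool: list[str], k: int = 5) -> list[str]:
        # Embed pool and query, then return the k most similar pool examples.
        pool_emb = encoder.encode(demo_pool, normalize_embeddings=True)
        query_emb = encoder.encode([query], normalize_embeddings=True)[0]
        sims = pool_emb @ query_emb  # cosine similarity on unit vectors
        return [demo_pool[i] for i in np.argsort(-sims)[:k]]

The retrieved examples are then inserted into the prompt as demonstrations before the query.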
Label Verbalizer In classification tasks, the surface forms of labels may induce difficulties and ambiguities. Taking relation classification as an example, the label "per:parents" can indicate either "subject is the parent of object" or "object is the parent of subject", depending on its definition. To address this problem, we utilize a label verbalizer to transform surface forms into natural language descriptions with pre-defined templates (Sainz et al., 2021; Lu et al., 2022a), serving as fine-level guidance. The semantic templates we use are shown in Table 7.
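In practice, such a verbalizer reduces to a template lookup. The templates below are hypothetical stand-ins for those in Table 7, following the convention visible in Table 2 that, for per:parents, the object is the parent of the subject.

    # Hypothetical templates; the paper's actual templates are in Table 7.
    VERBALIZER = {
        "per:parents":  "{obj} is the parent of {subj}.",
        "per:children": "{obj} is the child of {subj}.",
        "NA":           "{subj} and {obj} have no known relation.",
    }

    def verbalize(label: str, subj: str, obj: str) -> str:
        # Map a surface-form label to a natural-language description.
        return VERBALIZER[label].format(subj=subj, obj=obj)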

Active Data Acquisition
Active learning (AL) seeks to reduce labeling efforts by strategically choosing which examples to annotate. We consider the standard pool-based setting, assuming that a large pool of unlabeled data D_pool is available. The AL loop starts with a seed labeled set D_labeled. At each iteration, we train a model M on D_labeled and then use an acquisition function f(·, M) to acquire a batch B of b examples from D_pool. We then query the LLM annotator to label B. The labeled batch is removed from the pool D_pool and added to the labeled set D_labeled, serving as training data for the next iteration. The process is repeated t times.
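One iteration of this loop can be sketched in pseudocode as follows; train, acquire, and llm_annotate are placeholder callables standing in for the components described above, not the paper's actual API.

    def llmaaa_loop(pool, seed_labeled, train, acquire, llm_annotate, b=50, t=9):
        # Pool-based AL loop: train a TAM, acquire b informative examples,
        # have the LLM annotate them, and grow the silver training set.
        labeled = list(seed_labeled)
        pool = list(pool)
        for _ in range(t):
            model = train(labeled)                  # fit TAM on current silver set
            batch = acquire(model, pool, b)         # pick b informative examples
            labeled += [(x, llm_annotate(x)) for x in batch]
            pool = [x for x in pool if x not in batch]  # remove annotated examples
        return train(labeled)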
Active acquisition strategies generally maximize either uncertainty or diversity. On one hand, uncertainty-based methods leverage model predictions to select hard examples. On the other hand, diversity-based methods exploit the heterogeneity of sampled data. We cover some common strategies for thorough comparison, illustrated with the classification task for simplicity.
Random We consider random selection as a baseline, which samples uniformly from D_pool. Typically pool data and test data share the same distribution, so the sampled batch is expected to be i.i.d. with the test data.
Maximum Entropy Entropy is one of the most widely used estimations of uncertainty (Settles, 2009). Data for which the model M has the highest predictive entropy are sampled for annotation:
$$x^{*}=\underset{x \in \mathcal{D}_{\text{pool}}}{\arg\max}\; -\sum_{y} P_{M}(y \mid x) \log P_{M}(y \mid x).$$

Least Confidence Culotta and McCallum (2005) propose to sort examples by the probability assigned by M to the predicted class ŷ, which samples
$$x^{*}=\underset{x \in \mathcal{D}_{\text{pool}}}{\arg\max}\;\left(1-P_{M}(\hat{y} \mid x)\right).$$

K-Means Diversity sampling intends to select batches of data that are heterogeneous in the feature space. Following Yuan et al. (2020), we apply k-means clustering to the l2-normalized embeddings of M, and sample the nearest neighbors of the k cluster centers.
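As an illustration, the two uncertainty scores can be computed from a TAM's softmax outputs as follows; this is a minimal PyTorch sketch with our own function names, covering only the uncertainty-based strategies.

    import torch

    def entropy_score(probs: torch.Tensor) -> torch.Tensor:
        # Maximum-entropy acquisition: higher score = more uncertain.
        # probs: (batch, num_classes) softmax outputs of the TAM.
        return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

    def least_confidence_score(probs: torch.Tensor) -> torch.Tensor:
        # Least-confidence acquisition: 1 - P(predicted class).
        return 1.0 - probs.max(dim=-1).values

    def select(probs: torch.Tensor, b: int, score_fn=least_confidence_score):
        # Indices of the b highest-scoring pool examples to send to the LLM.
        return score_fn(probs).topk(b).indices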

Robust Learning with Noisy Labels
LLM annotators inevitably produce noisy labels, especially on harder tasks and domain-specific data. To stay robust against training label bias, we adopt the automatic reweighting technique (Ren et al., 2018): at each training step, per-example weights are obtained by a one-step meta-optimization that follows the negative gradient of the loss on a small clean validation set with respect to the examples' weight perturbations; the weights are then truncated at zero and normalized to sum to one. We opt for two simple yet effective models as TAMs, and leave other choices for future study.
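The reweighting step described above might be sketched as follows, assuming a simple classification TAM and PyTorch's torch.func API; this is our illustration of Ren et al. (2018), not the authors' code.

    import torch
    import torch.nn.functional as F
    from torch.func import functional_call

    def example_weights(model, x_train, y_train, x_val, y_val, inner_lr=1e-3):
        # One step of automatic reweighting (Ren et al., 2018), simplified.
        # Each training example's weight reflects how much up-weighting it
        # would decrease the validation loss after one virtual SGD step.
        params = dict(model.named_parameters())

        # Per-example training losses, weighted by a zero perturbation eps.
        eps = torch.zeros(x_train.size(0), requires_grad=True)
        logits = functional_call(model, params, (x_train,))
        losses = F.cross_entropy(logits, y_train, reduction="none")
        weighted_loss = (eps * losses).sum()

        # Virtual SGD step on the parameters, keeping the graph so we can
        # later differentiate through it with respect to eps.
        grads = torch.autograd.grad(weighted_loss, list(params.values()),
                                    create_graph=True, allow_unused=True)
        updated = {n: (p - inner_lr * g if g is not None else p)
                   for (n, p), g in zip(params.items(), grads)}

        # Validation loss under the virtually updated parameters.
        val_logits = functional_call(model, updated, (x_val,))
        val_loss = F.cross_entropy(val_logits, y_val)

        # "Truncate weights to zero, and normalize to one."
        w = -torch.autograd.grad(val_loss, eps)[0]
        w = torch.clamp(w.detach(), min=0)
        if w.sum() > 0:
            return w / w.sum()
        return torch.full_like(w, 1.0 / w.numel())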

Named Entity Recognition
Formulation NER aims to extract entities {e_i} from text x, where each e_i is a continuous span of the sequence with a predefined type. We consider the flat scenario (i.e., no overlapping entities), in which NER can be reformulated as a sequence tagging problem with BIO labels.
To smoothly adapt uncertainty-based acquisition functions from classification to sequence tagging, we provide three options for pooling token-level scores: average, sum, and max. In practice, we adopt the average and sum operations for better empirical performance.
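For example, per-token uncertainty scores can be pooled into sentence-level scores as follows; the sketch uses our own naming and assumes non-negative token scores such as entropy or least confidence.

    import torch

    def pool_token_scores(token_scores: torch.Tensor, mask: torch.Tensor,
                          mode: str = "average") -> torch.Tensor:
        # Pool per-token uncertainty scores (batch, seq_len) into one score
        # per sentence; mask marks real tokens (1) vs. padding (0).
        token_scores = token_scores * mask
        if mode == "sum":
            return token_scores.sum(dim=-1)
        if mode == "max":
            return token_scores.max(dim=-1).values
        # "average": normalize by the true sequence length
        return token_scores.sum(dim=-1) / mask.sum(dim=-1).clamp_min(1)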
Model Following Devlin et al. (2019), we leverage BERT to convert tokens into vectorized features, and use a linear classifier with activation to predict the class-wise BIO label for each token.

Relation Extraction
Formulation Given a subject entity e_subj and an object entity e_obj in a sentence, RE classifies their relation into a predefined set R ∪ {NA}.
Model We use the same model architecture as Baldini Soares et al. (2019), which first encloses the entity spans with special tokens [E] and [/E], then encodes the sentence with BERT. The concatenated embeddings of the subject and object are fed into a linear classifier with activation for the final prediction.
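A sketch of this architecture might look as follows; the activation choice and the exact marker handling are simplified assumptions on our part.

    import torch
    import torch.nn as nn
    from transformers import BertModel

    class REModel(nn.Module):
        # Simplified entity-marker RE model: encode with BERT, then classify
        # from the concatenated subject/object marker representations.
        def __init__(self, num_relations: int, encoder: str = "bert-base-cased"):
            super().__init__()
            self.bert = BertModel.from_pretrained(encoder)
            hidden = self.bert.config.hidden_size
            self.classifier = nn.Sequential(
                nn.Linear(2 * hidden, hidden), nn.Tanh(),
                nn.Linear(hidden, num_relations),
            )

        def forward(self, input_ids, attention_mask, subj_pos, obj_pos):
            # subj_pos / obj_pos: position of the [E] marker opening each span.
            h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
            rows = torch.arange(h.size(0), device=h.device)
            pair = torch.cat([h[rows, subj_pos], h[rows, obj_pos]], dim=-1)
            return self.classifier(pair)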

Setup
Dataset We experiment with three different NLP datasets: Chinese OntoNotes 4.0 (Weischedel et al., 2011) and English CoNLL03 (Tjong Kim Sang and De Meulder, 2003) for NER, and Re-TACRED (Stoica et al., 2021) for RE. For Re-TACRED, we select a subset describing personal relationships and balance the NA relation instances to the original proportion. Details of dataset statistics are described in Appendix A. We report precision, recall, and micro F1 for both tasks.
Baselines We compare LLMAAA with the following baselines: (1) PROMPTING: prompt-based direct inference on test data, using the same prompt engineering techniques as LLMAAA's teacher LLMs. (2) SUPERVISED: TAMs trained on the clean-labeled data D_val used in LLMAAA's demonstration/validation. (3) ZEROGEN (Ye et al., 2022): a zero-shot data synthesis method via text generation. (4) FEWGEN: a data synthesis method that enhances ZEROGEN with in-context examples uniformly sampled from the demonstration pool.
Implementation We use ChatGPT as the LLM annotator for main experiments, and adopt BERT (Devlin et al., 2019; Cui et al., 2021) as the TAM's encoder. We also explore other LLM annotators, GPT-3 (Brown et al., 2020) and GPT-4 (OpenAI, 2023), in § 6. We randomly sample 100 examples from the original validation sets as gold data, reusing the same set for demonstration D_demo and validation D_val. We use the original training sets as D_pool, randomly initialize the seed labeled set D_labeled with a size of 50, and acquire 50 samples per batch for 9 iterations, which yields 500 silver annotated samples in total. We generate 500 and 5,000 samples via ZEROGEN and FEWGEN for comparison. TAMs under all settings are trained three times with different random seeds, and we report the mean and standard deviation in the results. The training process and hyperparameters are detailed in Appendix B.

Table 1: Evaluation results for LLMAAA and other baselines across three different datasets, using ChatGPT as the LLM backbone. We report the mean and standard deviation of three separate runs for each method. Since we set the temperature to 0 in PROMPTING, its results are deterministic and we only run the evaluation once. We also denote the amount of data (gold/silver) that each TAM used for training.

We follow consistent principles in prompt design. Empirically, we find that in-context examples bring marginal benefit to RE, while the label verbalizer is a technique specifically designed for classification. Therefore, we apply k-NN example retrieval to NER and the label verbalizer to RE separately. We set k to 5 for all experiments, including FEWGEN. Refer to Appendix B.3 for full prompts.

Overall Results
Table 1 presents our main experimental results. LLMAAA with least confidence as the acquisition function outperforms all comparative baselines across all datasets, with 74.00%, 82.84% and 80.79% F1 scores on Chinese OntoNotes 4.0, English CoNLL03 and the Re-TACRED subset, respectively.
Compared with PROMPTING (i.e., the LLM annotator), LLMAAA shows steady improvement (4% in average score) with TAMs of far fewer parameters and lower inference latency, indicating that LLMAAA provides a decent substitute for LLMs in real-world deployments. LLMAAA also surpasses SUPERVISED, where TAMs are trained on clean-labeled but smaller-scale data. This suggests that LLMAAA is capable of deriving rich knowledge, beyond the limited demonstration/validation set, from unlabeled data, which benefits generalization.
We also notice that generation-based methods, i.e., ZEROGEN and FEWGEN, fail to establish on-par results, even with 10× more data in the zero-shot setting. We argue that the text-generation abilities of LLMs are exaggerated in complex scenarios. To demystify the illusion, we devise a case study on Re-TACRED, as shown in Table 2. ZEROGEN tends to generate simple templated sentences that deviate from the news domain, i.e., the original corpus of Re-TACRED. These results may induce low-quality and domain-shift issues that hamper TAMs' performance. FEWGEN's performance improves with in-context examples; however, it still lags far behind LLMAAA. In contrast, exploiting the unlabeled data effectively alleviates these problems with much higher efficiency, where only hundreds of annotated samples are sufficient for satisfactory performance.

Effects of Prompt Engineering
Though ChatGPT can follow human instructions well in general, it still struggles with difficult tasks and domain-specific data. We compare the inference performance of plain instructions with optimized prompts in Table 3. Without the k-NN example retrieval module (i.e., in the zero-shot manner), the LLM annotator is unable to extract entities well in the NER task, shown by a drastic drop in F1 scores (21% on OntoNotes and 25% on CoNLL03). This result highlights the need for demonstrations where LLMs' zero-shot performance is unsatisfactory. In addition, the label verbalizer helps align unnatural labels with natural language descriptions, which improves performance in RE (from 70.94% to 73.77% in F1). These findings emphasize that prompt engineering is crucial for building strong annotators, and that incorporating similar and aligned contexts contributes to better inference.

Accelerating with Active Learning
Figure 3 shows LLMAAA's performance with different active learning strategies across all datasets. Uncertainty-based methods, i.e., maximum entropy and least confidence, perform significantly better than the random baseline, with faster convergence and higher F1 scores at the end of the iterations. It is worth noting that (1) uncertainty-based methods achieve on-par performance with random selection using only 30%~40% of the training data, and (2) they surpass PROMPTING consistently within 500 LLM-annotated training samples. In summary, uncertainty-based active learning strategies enable LLMAAA to be more efficient and more capable.
Though k-means clustering encourages diversity in the feature space, it only outperforms random sampling on OntoNotes, while yielding similar results on CoNLL03 and Re-TACRED. This suggests that finetuned BERT may require more training data to learn informative representations, and that such a diversity-based method may fail in low-resource environments, e.g., at early iterations of the loop.

Reweighting Helps Robust Training
Figure 4 depicts the learning trials with and without the automatic reweighting technique. We observe that reweighting training samples consistently helps improve performance across all datasets and methods. This finding shows that the training process of TAMs is more noise-tolerant with automatic reweighting, even with only a small-scale clean-labeled set (100 samples) serving for validation.
In particular, the performance gain from automatic reweighting is more prominent on OntoNotes and Re-TACRED, and diminishes on CoNLL03. We argue that automatic reweighting plays a crucial role when the LLM annotators are relatively poor (as on OntoNotes and Re-TACRED). In such scenarios, the online approximation against the validation set serves as an effective estimation of the unbiased data distribution, and helps prevent TAMs from overfitting noisy labels.

LLMAAA with Different Annotators
To verify the universal effectiveness of LLMAAA, we further investigate its performance with other LLM annotators, i.e., GPT-3 (Brown et al., 2020) and GPT-4 (OpenAI, 2023). Due to budgetary considerations, we restrict these experiments to OntoNotes. The precision, recall and F1 scores are shown in Table 4. The results indicate that LLMAAA benefits from better annotators with continuous improvements, and more importantly, TAMs trained by LLMAAA outperform their LLM annotators consistently. The student outperforms a weak teacher by a large margin (27% in F1 for GPT-3). As the teacher grows stronger, this gap narrows. This trend meets our expectations: since student TAMs are trained with a fixed budget of data (500 samples), enhancing the capabilities of teacher LLMs gradually approaches the performance ceiling of the students. A larger annotation budget and more powerful TAMs can help extend this limit; we leave this exploration for future research.

Why Can Students Outperform Teachers?
An interesting observation across our experiments is that student TAMs trained with generated labels can outperform their teacher LLMs, i.e., LLMAAA > PROMPTING, even without sample reweighting, as shown in Figure 4. Such results partially align with previous findings in knowledge distillation (Wang, 2021; Song et al., 2021) and pseudo-label-based learning (Lee, 2013; Sanyal et al., 2022; Min et al., 2023), which share similar yet slightly different settings with LLMAAA.
We attempt to further explain the phenomenon in a simplified setting, where we consider a binary classification task that predicts y for x ~ D(x), where D(x) is discrete as in language space. For simplicity, we let y = 1 denote the correct label and y = 0 otherwise. We first make the natural assumption that the teacher's performance is above chance, i.e., its accuracy p > 0.5. Querying the teacher for a target sample x_t generates a pseudo label y_t ~ Bernoulli(p). If the student is a universal function approximator S(x; θ) that outputs a scalar as the probability that ŷ = 1, then minimizing the cross-entropy loss reaches its optimum at S(x; θ) = p. Usually we predict with the heuristic that ŷ = 1 if S(x; θ) > 0.5. Since p > 0.5 by assumption, we have ŷ = 1, which means that S always predicts correctly. This toy case nonetheless explains how an ordinary teacher can raise better students. Though teacher LLMs are deterministic for a specific x when the temperature is set to 0, their predictions are still statistically random over D(x), where the same conclusion holds.
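A quick simulation makes the toy case concrete; the teacher accuracy p = 0.7 and the number of queries are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    p, n = 0.7, 5000  # assumed teacher accuracy and number of annotations

    # Teacher annotations for one input x: 1 = correct label, 0 = wrong.
    pseudo_labels = rng.binomial(1, p, size=n)

    # A sufficiently expressive student minimizing cross-entropy converges
    # to S(x) = E[y_t] = p; the empirical mean plays that role here.
    s_hat = pseudo_labels.mean()
    prediction = int(s_hat > 0.5)  # thresholded prediction

    print(f"S(x) = {s_hat:.3f}")   # ~0.7 > 0.5
    print("student predicts correctly:", prediction == 1)  # True, despite ~30% label noise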
We shall point out that the above discussion considers a much-relaxed setting, where we attempt to provide an intuitive understanding of why students outperform teachers in the hard-label distillation problem. We leave rigorous theoretical analysis for future work.

Conclusion
In this work, we propose LLMAAA, a framework that uses LLMs as active annotators to address the challenge of data scarcity in NLP tasks. With active learning strategies, LLMAAA allows LLMs to label the more informative samples that promote TAM performance efficiently. We also optimize for reliability within the framework, using prompt engineering techniques and automatic reweighting to improve annotation quality and to reduce the impact of noisy labels, respectively. Experiments on NER and RE tasks demonstrate the effectiveness of LLMAAA. The evaluation results highlight the efficiency and reliability of LLMAAA. Trained with just hundreds of LLM-annotated samples, TAMs are able to outperform their teacher LLMs substantially. Besides, LLMAAA is much more efficient than prevalent data generation methods, which usually require orders of magnitude more synthetic training data. These findings reveal that LLMAAA offers a cost-effective, privacy-ensured, yet well-performing solution for applying LLMs in practical scenarios.

Limitations
Although LLMAAA demonstrates success in transferring and exceeding LLMs' capabilities with cheaper TAMs, it does come with certain limitations. The main difference between the setting of LLMAAA and previous zero-shot generation-based methods, e.g., ZEROGEN and SUNGEN, is that we use an unlabeled data pool D_pool and oracle-annotated data D_demo/D_val to provide extra knowledge. However, we shall point out that unlabeled text is readily available in many real-world scenarios, so the pool-based assumption is practical. Additionally, in complex tasks where zero-shot inference fails (like NER in our experiments), it is costly yet necessary to incorporate demonstrations for LLMs. In LLMAAA, we strive to minimize human effort by restricting the oracle-annotated data to a small scale (100 samples) and exploiting the same data for demonstration and validation. Another bottleneck is the model capacity of teacher LLMs and student TAMs. On one hand, a weak teacher (e.g., GPT-3) is unable to teach excellent students that are ready for applications. On the other hand, TAMs are bounded by their architectures. When the teacher surpasses this ceiling, it becomes theoretically impossible for students to outperform teachers. Despite these cases, we are optimistic that LLMAAA is effective in most situations.
We adopt the proprietary GPT family as annotators in our experiments, which is provided by OpenAI in a black-box manner. Though powerful, this practice may raise several concerns, e.g., the potential exposure of test data. Nevertheless, given the comprehensive analysis in § 6.1, we believe it does not affect the effectiveness of our method.

Ethics Statement
This work utilizes publicly available benchmark datasets, and we respect and adhere to their licenses and agreements. Our proposed method involves the use of LLMs for data annotation, as discussed in GPT3Mix (Yoo et al., 2021). This paradigm still poses several challenges, such as potential biases or toxic content in the generated data. Therefore, it is crucial to exercise caution when employing our method to invoke LLMs for generating data, and when utilizing TAMs trained on such generated data. Applying our work to downstream tasks such as NER and RE may result in issues such as mis-extraction and false information, and may fail in some cases. When employing our method, it is essential to consider using debiasing (Schick et al., 2021) or manual checking to mitigate these concerns.

A Datasets

Re-TACRED (Stoica et al., 2021) is a revised version of TACRED (Zhang et al., 2017), a large-scale crowdsource-annotated RE dataset. It originally has 40 relation types. Including all of these types would lead to much longer prompts, which may exceed the API length limit and receive responses with higher latency. Therefore, we opt to select a subset of relations that describe personal relationships for study. We keep all such relation instances in the training/validation/testing sets, and balance the NA relation instances to the original proportion. The statistics for each relation type are shown in Table 5.
For all three datasets, we randomly sample 100 examples from the original validation sets and reuse the same data for demonstration D_demo and validation D_val. We use the full training sets as the initial D_pool, from which we randomly sample active learning's seed labeled set D_labeled with a size of 50.

B Implementations

B.1 LLM Inference APIs
We access the OpenAI APIs via the Azure service. The API we use for each model is listed in Table 6. Since ChatGPT and GPT-4 continue to be updated, they may generate different responses over time, even when the temperature is 0.

B.2 Training Task-Specific Models
For all experiments that train TAMs for inference (i.e., LLMAAA, ZEROGEN, FEWGEN and SUPERVISED), we repeat each with three random seeds, resulting in different parameter initializations and random data sampling. We report the mean and standard deviation in our results.
We use bert-base-cased (Devlin et al., 2019) as the TAMs' encoder with a learning rate of 5e-5 for English data (CoNLL03 and Re-TACRED), and chinese-bert-base-wwm (Cui et al., 2021) with a learning rate of 2e-5 for Chinese data (OntoNotes 4.0). The learning rate of the other parameters (i.e., the linear classifiers) is set to 1e-4. We optimize the models via AdamW (Loshchilov and Hutter, 2019) with ϵ = 1e-6, under a linear warmup schedule for the first 6% of steps. We train all TAMs with a batch size of 8 for 40 epochs and take the checkpoint with the highest validation performance for final prediction.
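In code, this optimization setup roughly corresponds to the following sketch; the parameter grouping by name is our assumption of one reasonable implementation, not the authors' exact code.

    import torch
    from transformers import get_linear_schedule_with_warmup

    def build_optimizer(model, encoder_lr=5e-5, head_lr=1e-4, total_steps=1000):
        # AdamW with separate learning rates for the BERT encoder and the
        # linear classifier, plus linear warmup over the first 6% of steps.
        enc = [p for n, p in model.named_parameters() if "bert" in n]
        head = [p for n, p in model.named_parameters() if "bert" not in n]
        optimizer = torch.optim.AdamW(
            [{"params": enc, "lr": encoder_lr},
             {"params": head, "lr": head_lr}],
            eps=1e-6,
        )
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=int(0.06 * total_steps),
            num_training_steps=total_steps,
        )
        return optimizer, scheduler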

B.3 Prompts
The full prompts we use for annotation are shown in Table 7. For Re-TACRED, we provide prompts both with and without verbalized labels. To add a demonstration, we insert each sample's text into the input and its label into the output. The target sample is added as the last input, and the last output is left blank for prediction.
We also show the prompts for generation in Table 8. We use them similarly to annotation. In the zero-shot setting, to help models generate desired outputs, we use a default example to inform LLMs about the output format.

C Annotation Examples
We show two annotation examples of correct/partially wrong annotations from the CoNLL 2003 NER dataset in Listing 1. The first example is exactly correct, while the second contains hallucinations that do not exist in the ground truth: "April", "March", and "Thursday".

Figure 2: LLMAAA puts the LLM annotator in an active learning iteration, which mainly consists of three novel components: (1) an LLM annotator optimized with prompt engineering that generates pseudo labels, (2) an active acquisition mechanism for efficient data selection, and (3) an automatic reweighting technique to ensure robust learning with noisy labels. The annotation and training stages run iteratively and gradually produce labeled data for task-specific models.

Figure 3: LLMAAA's performance with different active acquisition strategies, shown by F1 scores. The dashed lines denote PROMPTING's results. For each method, we report the mean and standard deviation of three runs initialized with different random seeds.

Figure 4: TAMs' performance when trained with and without the automatic reweighting technique.

NER annotation prompt (excerpt): "You are a highly intelligent and accurate news domain named-entity recognition (NER) system. You take passage as input and your task is to recognize and extract specific types of named entities in that given passage and classify into a set of following predefined entity types: [PER, LOC, ORG, MISC]. Your output format must be in json form of: [{"span": span, "type": type}, ...]"

Table 2: A case study of generated data with ZEROGEN on Re-TACRED. We leverage ChatGPT as the text generator, and the full prompts we use can be found in Appendix B.3.
per:parents — "Mary (subj)'s father is Adam (obj)." / "Tom (subj)'s mother, Mary (obj), lives in New York." / "Michelle Obama (subj)'s parents are Fraser C. Robinson III and Marian Shields Robinson (obj)."
per:children — "Mike (subj)'s son is named Jack (obj)." / "Lily (subj)'s children are Alex and Bella (obj)." / "Sarah (subj) has a daughter named Emily (obj)."

Table 3: Comparison results between plain instructions and optimized prompts, in F1 scores.

Table 4: Results on Chinese OntoNotes 4.0 for PROMPTING and LLMAAA with different LLMs. LLMAAA uses least confidence as the acquisition function, and annotates 500 samples for TAM training.