Large Language Model Is Not a Good Few-shot Information Extractor, but a Good Reranker for Hard Samples!

Large Language Models (LLMs) have made remarkable strides in various tasks. Whether LLMs are competitive few-shot solvers for information extraction (IE) tasks, however, remains an open problem. In this work, we aim to provide a thorough answer to this question. Through extensive experiments on nine datasets across four IE tasks, we demonstrate that current advanced LLMs consistently exhibit inferior performance, higher latency, and larger budget requirements than fine-tuned SLMs under most settings. We therefore conclude that LLMs are not effective few-shot information extractors in general. Nonetheless, we illustrate that, with appropriate prompting strategies, LLMs can effectively complement SLMs and tackle challenging samples that SLMs struggle with. Moreover, we propose an adaptive filter-then-rerank paradigm to combine the strengths of LLMs and SLMs: SLMs serve as filters and LLMs as rerankers. By prompting LLMs to rerank a small portion of difficult samples identified by SLMs, our preliminary system consistently achieves promising improvements (2.4% F1 gain on average) on various IE tasks, with an acceptable time and cost investment.


Introduction
Large Language Models (LLMs; Brown et al., 2020; Chowdhery et al., 2022; Touvron et al., 2023) have shown remarkable abilities on various NLP applications such as factual question answering (Yu et al., 2023; Sun et al., 2023), arithmetic reasoning (Chen et al., 2022a; Qian et al., 2023) and logical reasoning (Jung et al., 2022; Pan et al., 2023). Given the reasoning, memorization, instruction-following and few-shot adaptation capabilities emerging from LLMs, a compelling question arises: can LLMs be used to boost performance on few-shot information extraction (IE) tasks?
To answer this question, we conduct an extensive empirical study comparing LLMs using in-context learning (ICL) against fine-tuned Small Language Models (SLMs). We fairly evaluate SLM-based and LLM-based methods across nine datasets spanning four common IE tasks: (1) Named Entity Recognition, (2) Relation Extraction, (3) Event Detection and (4) Event Argument Extraction. For each dataset, we explore four to six settings to cover typical low-resource extents, from 1-shot to 20-shot or even more. Given the potential sensitivity of LLMs' performance to the prompt context, we meticulously consider variations in instruction, demonstration number and selection strategy, prompt format, etc. Our study reveals that LLMs excel over SLMs only when annotations are extremely limited, i.e., when both the label types and the samples per label are extremely scarce. With more (e.g., hundreds of) samples, SLMs significantly outperform LLMs. Furthermore, LLMs incur greater inference latency and cost than fine-tuned SLMs. Hence, we claim that current LLMs are not good few-shot information extractors in general.
We further investigate whether LLMs and SLMs exhibit different abilities to handle various types of samples. We categorize samples according to their difficulty, measured by SLMs' confidence scores, and compare LLMs' and SLMs' results within each group. We find that LLMs are good at hard samples but bad at easy ones. We posit that the knowledge and reasoning abilities of LLMs enable them to handle hard samples (which are simply beyond SLMs' capabilities) well. Nevertheless, LLMs demonstrate a strong predisposition to false-positive predictions on negative samples. Since most negative samples are easy samples (which can be solved readily by SLMs), the performance of LLMs on easy samples sometimes collapses and is usually much worse than that of fine-tuned SLMs.
Leveraging these findings, we pursue an approach that incorporates LLMs and SLMs within a single system and combines their merits. To this end, we propose a novel filter-then-rerank framework. The basic idea is that SLMs serve as a filter and LLMs as a reranker. Specifically, SLMs first predict and determine the difficulty of each sample. If the sample is a hard one, we further pass the top-N most likely candidate labels from the SLMs to the LLMs for reranking. Otherwise, we take the SLM prediction as the final decision. By providing easy and hard samples with different solution strategies, our system exploits each model's strengths so that they complement each other. Moreover, it reranks only a small subset of samples, minimizing the extra latency and budget for calling LLMs. With a modest cost increase, our framework yields a consistent F1 improvement, averaging 2.4% higher than previous methods on various few-shot IE tasks. To the best of our knowledge, this is the first successful attempt to use LLMs to enhance few-shot IE tasks.

LLMs for Information Extraction
Recent studies have increasingly explored Information Extraction (IE) tasks using LLMs. Drawing inspiration from instruction tuning (Wei et al., 2022a), several methods (Wadhwa et al., 2023; Wang et al., 2023a; Lu et al., 2023) transform annotated samples into instruction-answer pairs and then fine-tune LLMs, such as FlanT5 (Chung et al., 2022), on them. Nonetheless, this method necessitates a vast range of samples with diverse schemas and often yields suboptimal results in low-resource scenarios. In the context of few-shot IE tasks, prevalent strategies bifurcate into two main streams. The first perceives LLMs as efficient annotators (Ding et al., 2023; Josifoski et al., 2023): these methods produce a plethora of pseudo-labeled samples through LLMs and leverage the enhanced annotations to train SLMs. The second employs LLMs at inference time using the ICL paradigm, which is the focus of our subsequent discussion.

Few-shot IE with ICL
Regarding few-shot IE tasks, recent studies intensively compare the performance of SLMs and LLMs but yield inconsistent conclusions. Some studies favor LLMs as competent few-shot extractors (Agrawal et al., 2022; Wang et al., 2023b; Li et al., 2023; Zhang et al., 2023a; Wadhwa et al., 2023), while others dispute this claim (Jimenez Gutierrez et al., 2022; Qin et al., 2023; Wei et al., 2023; Gao et al., 2023). This discrepancy leaves the question of whether LLMs perform competitively on few-shot IE tasks unresolved, hindering progress in this domain.
We attribute this disagreement to the absence of a comprehensive and unified benchmark. Existing studies usually vary in tasks, datasets, and few-shot settings. Furthermore, some studies rely on overly simplistic datasets (Jimenez Gutierrez et al., 2022; Li et al., 2023) and may exaggerate the effectiveness of LLMs. Driven by these observations, our research undertakes comprehensive experiments across four IE tasks and nine datasets with various schema complexities (from coarse-grained to fine-grained) and low-resource settings.
In addition to the empirical study, we develop an innovative filter-then-rerank paradigm to combine the strengths of both LLMs and SLMs. It utilizes prompting strategies akin to QA4RE (Zhang et al., 2023a), transforming IE tasks into multiple-choice questions. However, our method stands apart by integrating SLMs and LLMs within a single framework. This integration (1) makes our paradigm applicable to various IE tasks by providing candidate spans in the text and (2) achieves promising performance under low-resource IE scenarios.

Large LMs vs. Small LMs
In this section, we compare the performance between LLMs and SLMs to evaluate whether LLMs perform competitively.

Task, Dataset and Evaluation
We run experiments on nine widely used datasets across four IE tasks. (1) Named Entity Recognition (NER): CONLL03 (Tjong Kim Sang and De Meulder, 2003), OntoNotes (Weischedel et al., 2013) and FewNERD (Ding et al., 2021). (2) Relation Extraction (RE): TACRED (Zhang et al., 2017) and TACREV (Alt et al., 2020). (3) Event Detection (ED): ACE05 (Doddington et al., 2004), MAVEN (Wang et al., 2020) and ERE (Song et al., 2015). (4) Event Argument Extraction (EAE): ACE05, ERE and RAMS (Ebner et al., 2020). With label numbers ranging from 4 to 168, we assess LLMs' performance under different schema complexities. See their details in Appendix A.1.
Few-shot Set We construct few-shot datasets from the original datasets above. For the training and validation sets, we adopt a K-shot sampling strategy, i.e., sampling K samples for each label type. See more details in Appendix A.2. For the test set, we downsample the original test sets to reduce the cost of LLMs. We randomly sample 500 sentences for RE tasks, and 250 sentences for the other tasks. We ensure that each label has at least one corresponding sample to avoid the absence of rare labels.
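The K-shot construction is simplest for RE, where each sentence carries exactly one relation (see Appendix A.2), so sampling reduces to picking K sentences per label. A minimal sketch, with an illustrative data layout and made-up label names:

```python
import random
from collections import defaultdict

def k_shot_sample_re(dataset, k, seed=0):
    """Sample K sentences per relation label (RE: one relation per sentence)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sent, label in dataset:
        by_label[label].append((sent, label))
    subset = []
    for label, items in by_label.items():
        rng.shuffle(items)
        subset.extend(items[:k])  # labels with fewer than K sentences keep all of them
    return subset

# Toy usage with hypothetical labels
data = [("s1", "per:title"), ("s2", "per:title"),
        ("s3", "org:founded"), ("s4", "no_relation")]
few_shot = k_shot_sample_re(data, k=1)
```

For NER/ED, where sentences carry multiple labels, this per-label selection no longer works directly, which motivates the greedy algorithm in Appendix A.2.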
Evaluation We adopt the micro-F1 score for the NER, RE and ED tasks. For the EAE task, we follow previous work (Wang et al., 2023b) and adopt the head-F1 score, which only considers matching of the head word rather than the whole content of a text span. We report scores averaged over 5 sampled train/validation sets unless otherwise stated.

Large Language Models
As detailed in Appendix C, we evaluate the ICL abilities of LLMs. Given labeled sentences D = {(s_i, y_i)} and a test sentence s, our goal is to predict structured information y from s using a frozen LLM L. We feed the LLM a prompt P_{E,I,f}(D, s). We give examples of prompts on the four IE tasks in Figure 1. The prompts consist of three parts: the instruction I (colored green in Figure 1), the demonstration f(E(D, s)) (demo; colored blue) and the question f(s) (colored black). Here E denotes the demo selector and E(D, s) ⊂ D denotes the sentences selected as the demo for predicting s. The prompt format f refers to the template that converts the demo E(D, s) and the sample s into the input context for the LLM. The LLM then generates f(y) (colored red), from which we can readily parse the extraction results y.
Models L: We explore six LLMs from two sources. (1) OpenAI models: we employ ChatGPT, CODEX (Chen et al., 2022a) and InstructGPT (Ouyang et al., 2022) for the main experiments. We also evaluate GPT-4 in Appendix D.3. (2) Open-source models: we use LLaMA-13B (Touvron et al., 2023) and its instruction-tuned counterpart, Vicuna-13B (Chiang et al., 2023).
Instruction I: The instruction (1) describes the task and (2) enumerates all possible labels for reference. We adopt the instructions shown in Figure 1.
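The assembly of P_{E,I,f}(D, s) and the parsing of f(y) amount to plain string manipulation. A minimal sketch; the "Sentence:/Answer:" template and the "span: label" output format below are illustrative assumptions, not the exact templates of Figure 1:

```python
def build_prompt(instruction, demos, test_sentence):
    """Assemble an ICL prompt: instruction I, demos f(E(D, s)), question f(s)."""
    parts = [instruction, ""]
    for sent, answer in demos:            # demos already selected by E(D, s)
        parts.append(f"Sentence: {sent}")
        parts.append(f"Answer: {answer}")
        parts.append("")
    parts.append(f"Sentence: {test_sentence}")
    parts.append("Answer:")
    return "\n".join(parts)

def parse_ner_answer(generation):
    """Parse a generation like 'John: person; Paris: location' into (span, label) pairs."""
    generation = generation.strip()
    if generation.lower().startswith("no entities"):
        return []
    pairs = []
    for chunk in generation.split(";"):
        if ":" in chunk:
            span, label = chunk.split(":", 1)
            pairs.append((span.strip(), label.strip()))
    return pairs

demo = [("John lives in Paris.", "John: person; Paris: location")]
prompt = build_prompt("Identify the entities in each sentence.", demo, "Mary visited Rome.")
parsed = parse_ner_answer("Mary: person; Rome: location")
```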
Demo selector E: The maximum input length of LLMs usually limits the number of sentences in demos even under few-shot settings. Therefore, for each test sentence s, we require a demo retriever E(D, s) that selects a small subset of D as the demo sentences. Following previous methods (Liu et al., 2022; Su et al., 2022), we retrieve demos according to their sentence-embedding similarity to the test sample. Due to budget constraints, we execute InstructGPT and GPT-4 only once per setting. We do not conduct the EAE task on CODEX since it had become unavailable at that time.
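The retriever E(D, s) can be sketched as a nearest-neighbour lookup over precomputed sentence embeddings; the sentence encoder itself is abstracted away here, and any embedding model could stand in:

```python
import numpy as np

def retrieve_demos(test_emb, train_embs, k):
    """Return indices of the k training sentences most similar to the test
    sentence, by cosine similarity over precomputed embeddings."""
    a = test_emb / np.linalg.norm(test_emb)
    b = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = b @ a                      # cosine similarity to every training sentence
    return np.argsort(-sims)[:k].tolist()

# Toy usage: 2-d stand-ins for real sentence embeddings
train_embs = np.array([[0.0, 1.0], [1.0, 0.1], [1.0, 0.0]])
nearest = retrieve_demos(np.array([1.0, 0.0]), train_embs, k=2)
```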

Main Results
We summarize the main experimental outcomes in Figure 2. Among the proprietary LLMs, ChatGPT performs better on the NER and EAE tasks but worse on the RE and ED tasks. InstructGPT and CODEX demonstrate comparable performance across these tasks.
LLMs show limited inference speed. We compare the inference speed of different methods and show the results in Table 1. We observe that LLMs are much slower than SLMs since they have many more parameters, longer input contexts and extra response delays (when external APIs are used).

Analysis on Prompt Sensitivity
Previous work (Lu et al., 2022b) indicates that the efficacy of LLMs on specific tasks can be significantly influenced by the construction of the prompt.
To ensure that LLMs' suboptimal outcomes are not erroneously ascribed to inappropriate prompt designs, we meticulously examine the impact of diverse prompt variations along four axes: instruction format, demo number, demo selector and prompt format. We leave comprehensive details of the variants and their results to Appendix E.2-E.5, and illustrate the salient findings in Figure 3. Our findings include: (1) diverse instruction strategies yield comparable results on IE tasks; (2) increasing the number of samples in demonstrations does not unequivocally enhance performance; and (3) the selection strategy for demonstrations matters, and retrieval based on sentence-embedding similarity performs best among the strategies we test.

[Figure: example of an LLM-generated rationale, concluding that The New Yorker, a well-known American magazine published since 1925 and known for long-form journalism, commentary, and satire, is a media/newspaper organization.]
Footnote 7: These two tasks require a variable number of (label, span) tuples; furthermore, the length of each span is also variable.
To mitigate the drawbacks of LLMs mentioned above, we propose a filter-then-rerank paradigm to integrate both SLMs and LLMs within the same system. This paradigm uses SLMs as filters to select the top-N candidate labels, after which LLMs rerank them to make the final decisions. By using SLM-generated candidate answers, the focus of LLMs shifts from the sentence level (i.e., identifying all entities/events in the sentence) to the sample level (i.e., judging a single provided entity/event candidate). Each question now corresponds to a single sample, allowing us to reframe the prompts as multiple-choice questions (MCQ; shown in Figure 4). Under this format, each candidate label is converted into a choice by pre-defined templates. We claim the filter-then-rerank paradigm is more likely to elicit the power of LLMs and smoothly solve few-shot IE tasks because: (1) LLMs are more familiar with MCQ prompts than IE-format prompts (Zhang et al., 2023a); and (2) this paradigm reduces the label scope significantly, since N is usually much smaller than the number of fine-grained labels.
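Converting the top-N candidates into an MCQ prompt can be sketched as follows; the verbalizer templates here are hypothetical stand-ins for the pre-defined ones shown in Figure 4:

```python
def build_mcq(sentence, span, candidates, templates):
    """Turn the top-N candidate labels for a span into an MCQ prompt.
    `templates` maps each label to a natural-language verbalization of the span."""
    letters = "ABCDEFGH"
    lines = [f"Sentence: {sentence}",
             f"Which option best describes '{span}'?"]
    for letter, label in zip(letters, candidates):
        lines.append(f"({letter}) {templates[label].format(span=span)}")
    lines.append("Answer:")
    return "\n".join(lines)

templates = {  # hypothetical verbalizers
    "ORG": "{span} is an organization.",
    "PER": "{span} is a person.",
    "None": "{span} is not a named entity.",
}
prompt = build_mcq("He reads The New Yorker.", "The New Yorker",
                   ["ORG", "PER", "None"], templates)
```

Note that a None choice is always included so the LLM can reject spurious candidates rather than being forced to pick a positive label.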

LLMs are Hard Sample Solvers
Our filter-then-rerank paradigm, unfortunately, presents unsatisfactory performance when applied to all samples (and even suffers longer latency, since the LLM reranks candidates for every sample). Given LLMs' abilities in memorization and reasoning, however, we still believe that LLMs have the potential to solve some, if not most, IE samples effectively. We hypothesize that LLMs are more proficient than SLMs on hard samples. These samples are characterized by their need for external knowledge or sophisticated reasoning strategies, areas where LLMs can leverage their extensive parametric knowledge and inherent reasoning mechanisms. In contrast, SLMs often falter on such samples, constrained by their restricted modeling capacities.
We leverage an unsupervised metric from SLMs to evaluate the difficulty of samples. Given a sample x in the sentence s, we define the highest probability across all labels as the confidence score: c(x) = max_{l ∈ L} P_SLM(l | x; s), where L denotes the label set and P_SLM(l | x; s) the probability, computed by the SLM, that the span x (in the sentence s) refers to label l. We classify samples with low confidence scores as hard samples.
Otherwise, we view them as easy samples. We conduct experiments to confirm our hypothesis that LLMs excel on hard samples. We group samples by confidence scores and compare two methods within each group: (a) SLM-based methods without LLM reranking, and (b) SLMs as the filter and LLMs as the reranker. Method (b) differs from (a) by adding a single LLM to rerank the top-N SLM predictions, using MCQ prompts.
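This grouping follows directly from the definition: the confidence score is the SLM's maximum label probability, and a threshold splits easy from hard. A minimal sketch (the probabilities below are made up for illustration):

```python
def confidence(label_probs):
    """c(x) = max_{l in L} P_SLM(l | x; s)."""
    return max(label_probs.values())

def split_easy_hard(samples, tau):
    """samples: list of (sample_id, {label: prob}) pairs from the SLM filter."""
    easy, hard = [], []
    for sid, probs in samples:
        (easy if confidence(probs) >= tau else hard).append(sid)
    return easy, hard

samples = [
    ("x1", {"PER": 0.97, "ORG": 0.02, "None": 0.01}),   # confident -> easy
    ("x2", {"PER": 0.40, "ORG": 0.35, "None": 0.25}),   # uncertain -> hard
]
easy, hard = split_easy_hard(samples, tau=0.9)
```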
The results in Figure 5 substantiate our assumption. (1) LLM-based reranking (blue lines) enhances performance on hard samples (left areas in the figure). We provide a detailed analysis of specific challenging instances where LLM rerankers prove advantageous in Appendix F.1. These instances demonstrate the efficacy of LLMs in harnessing external knowledge and complex reasoning to rectify erroneous predictions initially made by SLMs (red lines). (2) Conversely, LLM-based reranking impedes performance on easy samples (right areas), resulting in significant degradation, particularly for very easy samples (rightmost areas). In conclusion, LLMs exhibit greater proficiency than SLMs in handling hard samples, yet they underperform relative to SLMs on easy samples.

Why LLMs Fail on Easy Samples
In this section, we investigate why LLMs (relatively) fail on easy samples. As shown in Table 2, we observe significantly higher negative-sample ratios among easy samples across diverse IE tasks. In other words, most negative samples are easy samples for SLMs. Here we refer to samples labeled as None as negative samples. We speculate that the proficiency of SLMs on negative samples stems from their ability to adeptly discern apparent patterns during the fine-tuning stage. Therefore, SLMs can predict negative samples with (relatively) high confidence and accuracy. Due to LLMs' predisposition to false-positive predictions on negative samples, however, the performance of LLMs on easy samples collapses. We attribute such false-positive predictions to (1) hallucination and (2) span-boundary mismatch. We detail these two kinds of mistakes with cases in Appendix F.2.

Adaptive Filter-then-rerank Paradigm
The above findings can be summarized as follows: (1) SLMs generally outperform LLMs, especially with more training samples and fine-grained labels. (2) SLMs are much more time- and cost-efficient. (3) LLMs serve as powerful rerankers on hard samples that challenge SLMs. Based on these findings, we propose a simple, efficient, and effective adaptive reranker that combines the strengths of SLMs and LLMs.

Method
Our adaptive filter-then-rerank approach, shown in Figure 6, uses supervised SLMs as a filter to make preliminary decisions. Samples with confidence scores exceeding a threshold τ are viewed as easy samples; the others are hard samples. For easy samples, we retain the SLM predictions as the final results. For hard samples, the top-N predictions from the SLMs are reranked by LLMs using ICL. Here the LLMs employ MCQ prompts (Figure 4) containing demos and the sample to be reranked. The LLMs then generate the final answer and optionally provide an explanation.
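The routing logic can be summarized in a few lines; `slm_predict` and `llm_rerank` are placeholder hooks for the fine-tuned filter and the ICL reranker, not actual APIs:

```python
def adaptive_predict(sample, slm_predict, llm_rerank, tau, n=3):
    """Filter-then-rerank: keep confident SLM predictions; for hard samples,
    pass the top-n SLM candidates (plus None) to the LLM reranker."""
    probs = slm_predict(sample)                      # {label: probability}
    ranked = sorted(probs, key=probs.get, reverse=True)
    if probs[ranked[0]] >= tau:                      # easy sample: trust the SLM
        return ranked[0]
    candidates = ranked[:n]
    if "None" not in candidates:                     # always offer a negative option
        candidates.append("None")
    return llm_rerank(sample, candidates)            # hard sample: LLM decides

# Toy stand-ins for the SLM filter and the LLM reranker
def slm_easy(sample):  return {"A": 0.95, "B": 0.03, "None": 0.02}
def slm_hard(sample):  return {"A": 0.50, "B": 0.30, "C": 0.15, "None": 0.05}
def pick_second(sample, candidates):  return candidates[1]

easy_out = adaptive_predict("s", slm_easy, pick_second, tau=0.9)
hard_out = adaptive_predict("s", slm_hard, pick_second, tau=0.9)
```

With τ tuned on the validation set, most samples exit at the easy branch, so the LLM is invoked only for the small hard fraction.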

Experimental Setup
We conduct experiments on FewNERD for the NER task, TACREV for the RE task and ACE05 for the ED task. We employ the top-performing SLM-based methods from Section 3 (FSLS or KnowPrompt) as the filter, and Vicuna-13B, InstructGPT or GPT-4 as the reranker. The threshold τ for determining sample difficulty is optimized on the validation set. For hard samples, the top-3 SLM predictions and None (if not included) are fed to LLMs for reranking.

Figure 6: The overall architecture of our adaptive filter-then-rerank paradigm. We color easy samples in orange and hard samples in pink. For easy samples, the final predictions come directly from the SLM-based methods. For hard samples, the top-N predictions from SLMs are fed into LLMs in the format of multiple-choice questions (pink box). The question is paired with demos (green box). LLMs rerank these N candidates and generate the final prediction.
Each LLM prompt contains 4-shot demos. See demo examples in Appendix G.1. We follow the templates in Lu et al. (2022a) for TACREV and carefully design the others. See these templates in Appendix G.2. We adopt chain-of-thought reasoning (Wei et al., 2022b), i.e., prefacing the answer with an explanation, to facilitate the LLMs' reranking procedure.
Baseline We compare our method with two kinds of baselines to validate its effectiveness.
(1) LLMs with ICL: We follow the prompts in Section 3.3 and conduct experiments on three LLMs.
(2) Supervised SLMs: We follow the previous SoTA methods shown in Section 3.4 (FSLS or KnowPrompt). We additionally combine two SLMs via ensembling or reranking (i.e., replacing the LLM with another SLM as the reranker) to verify that the improvements from our SLM-LLM integrated system are not solely due to ensemble effects.

Main Results
Table 3 shows that our filter-then-rerank method consistently improves performance across three datasets and nine settings. For instance, with InstructGPT, reranking provides an average F1 gain of 2.4% without the SLM ensemble (Lines 4 vs. 7).
With ensembled SLMs as the filter, our method still achieves 2.1% gains on average (Lines 5 vs. 8). This confirms (1) the effectiveness of the LLM reranking and (2) that its gains are distinct from and (almost) orthogonal to the SLM ensemble.

Analysis
Few makes a big difference Our method selectively reranks hard samples. Table 4 shows that (1) only a minor fraction (0.5%~10%) of samples are deemed hard and reranked by LLMs.
(2) Despite their limited quantity, reranking yields a substantial performance boost on these samples (10%~25% absolute F1 gains). This uplift on a small subset significantly enhances the overall performance.
GPT-4 is more aggressive From Tables 3 and 4, GPT-4 generally improves more on hard samples, yet InstructGPT surpasses GPT-4 on the NER and RE tasks when evaluated overall. This discrepancy arises from GPT-4's aggressive reranking, which introduces more true positives; InstructGPT, in contrast, focuses more on reducing false positives.
Few makes a small cost Figure 7 demonstrates that our method reduces budget and latency by approximately 80%~90% compared to direct ICL. This reduction is due to (1) fewer LLM calls (only for hard samples) and (2) shorter prompts (fewer candidate labels and demos).

Ablation Study
We investigate the effectiveness of the modules in the adaptive filter-then-rerank system by removing each of them in turn: (1) CoT: We exclude the explanation for each example in the demo.

[Table: per-dataset F1 before/after LLM reranking and the ratio of reranked samples, for GPT-4 and InstructGPT.]
... hence cutting inference costs. (4) The performance collapses without a filter to identify sample difficulty, reiterating the need for an integrated SLM-LLM system in which the two models complement each other.

Conclusion
Through an extensive empirical study on nine datasets spanning four IE tasks, we find that LLMs, despite their superiority in extremely low-resource scenarios, are not effective few-shot information extractors in general. They struggle with IE-related prompts, have limited demonstration capacity, and incur high inference costs. However, LLMs significantly improve performance on hard samples when combined with SLMs. Building on these insights, we propose an adaptive filter-then-rerank paradigm that leverages the strengths of SLMs and LLMs while mitigating their limitations. This approach consistently achieves promising results, with an average 2.4% F1 gain across multiple few-shot IE tasks, while minimizing latency and budget costs.

Limitations
We did work hard to find better prompts to elicit the power of LLMs on few-shot IE tasks in Section 3.5, exploring various kinds of LLMs, demonstration strategies and prompt formats. We find that different prompt variants do not significantly impact in-context learning abilities. As an empirical study, we acknowledge the potential existence of a lottery prompt superior to those we explored. However, it seems unlikely that an improved prompt would substantially alter our conclusions.
Another common risk when evaluating LLMs on public benchmarks is their potential memorization of the tested samples. To mitigate such potential contamination, we use earlier, stable versions of these models rather than the newer, updated ones (for example, gpt-4-0314 instead of gpt-4). Even if such contamination causes the abilities of LLMs to be overestimated, our primary conclusions remain unchanged, because we find that LLMs are NOT good few-shot information extractors.
Regarding our adaptive filter-then-rerank paradigm, a key limitation lies in how to assess sample difficulty. In this work, we employ a simple unsupervised metric, i.e., the maximum probability from SLMs. This is predicated on the assumption that SLMs are well calibrated (Guo et al., 2017), which is obviously imperfect. We envision that calibrating SLM-based filters or developing an advanced difficulty metric could substantially enhance LLM rerankers' performance. We leave this for future work.

A Datasets
A.1 Full Datasets
We construct few-shot IE datasets and conduct the empirical study on nine datasets spanning four tasks, with label numbers ranging from 4 to 168. We show their statistics in Table 6.

A.2 Details of Few-shot IE Datasets
Sampling Algorithm for Train/Valid Datasets.
We downsample sentences from the original training dataset to construct the few-shot training and validation datasets. We adopt a K-shot sampling strategy in which each label has (at least) K samples. We use 6 K-values (1, 5, 10, 20, 50, 100) for RE tasks and 4 K-values (1, 5, 10, 20) for the other tasks. For the RE task, each sentence has exactly one relation, so we simply select K sentences for each label. For the NER, ED and EAE tasks, each sentence may contain more than one entity/event/argument. Since our sampling is at the sentence level, exact sampling, i.e., finding exactly K samples for each label, is NP-complete and unlikely to admit a practical solution. Therefore, following Yang and Katiyar (2020), we adopt a greedy sampling algorithm to select sentences for the NER and ED tasks, as shown in Algorithm 1. Note that the actual sample number for each label can be larger than K under this strategy. For all three tasks, we additionally sample negative sentences (without any defined labels) so that the ratio of positive sentences (with at least one label) to negative sentences is 1:1. The statistics of the curated datasets are listed in Table 7.

Algorithm 1 Greedy Sampling
Require: shot number K, original full dataset D = {(X, Y)} tagged with label set E
1: Sort E by frequency in {Y} in ascending order
2: S ← ∅, Counter ← dict()
3: for y ∈ E do
4:   Counter[y] ← 0
5: end for
6: for y ∈ E do
7:   while Counter[y] < K do
8:     Sample a sentence (X, Y) ∈ D \ S such that y ∈ Y
9:     S ← S ∪ {(X, Y)}
10:    Counter[y′] ← Counter[y′] + 1 for every y′ ∈ Y
11:  end while
12: end for
13: return S

Based on the subsets constructed above, we optionally further split them into training and validation sets. For few-shot datasets with more than 300 sentences, we additionally split off 10% of the sentences as the validation set and keep the remaining sentences as the training set. Otherwise, we do not construct a validation set and instead conduct 5-fold cross validation to avoid overfitting.
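Algorithm 1's greedy strategy can be sketched in Python as follows; the sentence/label-multiset layout is an assumption for illustration, and ties in label frequency are broken arbitrarily:

```python
from collections import Counter

def greedy_k_shot(dataset, labels, k):
    """Greedy sentence-level K-shot sampling (in the style of Yang and
    Katiyar, 2020): process labels from rarest to most frequent; for each,
    add sentences containing it until that label has at least K samples.
    Counts for every label in an added sentence are updated, so frequent
    labels may end up with more than K samples."""
    freq = Counter(l for _, ls in dataset for l in ls)
    order = sorted(labels, key=lambda l: freq[l])      # ascending frequency
    counts = Counter()
    selected, remaining = [], list(dataset)
    for y in order:
        for item in list(remaining):
            if counts[y] >= k:
                break
            if y in item[1]:
                selected.append(item)
                remaining.remove(item)
                counts.update(item[1])                 # every label in the sentence counts
    return selected

# Toy usage: sentences tagged with one or more labels
dataset = [("s1", ["PER"]), ("s2", ["PER", "ORG"]),
           ("s3", ["ORG"]), ("s4", ["LOC"])]
subset = greedy_k_shot(dataset, ["PER", "ORG", "LOC"], k=1)
```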

B Details on SLMs
We adopt five representative supervised methods to evaluate the ability of SLMs on few-shot IE tasks.
(1) Fine-tuning (FT): add a classifier head on SLMs to predict the labels of each sentence/word.
(2) FSLS (Ma et al., 2022a): the state-of-the-art extraction-based method for the few-shot NER task. Ma et al. (2023) also validate its competitive performance on few-shot ED tasks.
(5) UIE (Lu et al., 2022c): a competitive unified generation-based method for few-shot IE tasks. We introduce the implementation details below.
Fine-tuning/FSLS. We implement these two methods ourselves. We use RoBERTa-large (Liu et al., 2019) as the backbone. We adopt the Automatic Mixed Precision (AMP) training strategy to save memory. We run each experiment on a single NVIDIA V100 GPU. We train each model with the AdamW (Loshchilov and Hutter, 2019) optimizer with a linear scheduler and 0.1 warm-up steps.
We set the weight-decay coefficient to 1e-5 and the maximum gradient norm to 1.0. We set the batch size to 64, the maximum input length to 192, the number of training steps to 500 and the learning rate to 5e-5.
KnowPrompt. We implement this method based on the original source code and use RoBERTa-large as the backbone. We set the maximum number of epochs to 10 for the 50- and 100-shot datasets, and to 50 for the other datasets. We keep all other hyperparameters at their defaults and run each experiment on a single NVIDIA V100 GPU.
PAIE. We implement this method based on the original source code and use BART-large (Lewis et al., 2020) as the backbone. We keep all hyperparameters at their defaults for the ACE and RAMS datasets. For the ERE dataset, we set the number of training steps to 1000, the batch size to 16 and the learning rate to 2e-5. We run each experiment on a single NVIDIA V100 GPU.
UIE. We implement this method based on the original source code and use T5-large (Raffel et al., 2020) as the backbone. We run each experiment on a single NVIDIA Quadro RTX8000 GPU. We set the batch size to 4 with 4000 training steps, the maximum input length to 800 and the learning rate to 1e-4.

C LLMs Implementations
Regarding our empirical study, we explore the ICL abilities of LLMs on few-shot IE tasks. We mainly use five LLMs from two sources. (1) OpenAI models: ChatGPT, CODEX and InstructGPT. (2) Open-source models: LLaMA-13B (Touvron et al., 2023) and its instruction-tuned counterpart, Vicuna-13B (Chiang et al., 2023). We describe their implementation details in the sections below.

C.1 Open-source Models
We implement multiple ICL approaches on LLaMA-13B and Vicuna-13B without fine-tuning.
We set the maximum input length to 2048 and the batch size to 1, and run each experiment on a single NVIDIA V100 GPU, leveraging the Accelerate framework and fp16 inference to save memory. We set the maximum output length to 96 and the sampling temperature to 0 (i.e., greedy decoding). We set both frequency_penalty and presence_penalty to 0.

C.2 OpenAI Models
We implement multiple ICL approaches on OpenAI models by calling their official APIs. We set the maximum input length to 3600 for all tasks and models; the only exception is CODEX on RE tasks, where we set the maximum input length to 7000. We set the maximum output length to 32 for the RE task, and to 96 for the other three tasks. We set the sampling temperature coefficient to 0, i.e., greedy decoding.

D Pivot Experiments on LLMs

D.1 Sampling Temperature
Existing prompt-engineering discussions suggest setting the sampling temperature t = 0 for tasks with structured outputs, including IE tasks. We validate this conclusion in Table 8, which shows that the generation quality at t = 0 is much higher than at t ≠ 0. We therefore set t = 0 in all main experiments and do not take self-consistency (Wang et al., 2023c) into account.

D.2 Automatic Chain-of-thought
We additionally investigate whether rationales can facilitate LLMs' performance on few-shot IE tasks. Since there exist no golden rationales for these datasets, we bootstrap rationales from the LLM itself. For example, given the sentence "DSC and Traction Control on all Speed3 models is also standard.", we would feed the LLM the query "Could you explain why Speed3 is a kind of car?". We then insert the bootstrapped rationales between the sentences and their ground-truth answers. If a sentence has no positive labels, however, we do not query the LLM and keep the original format of the vanilla ICL approach. Here we prompt InstructGPT to generate the rationales with temperature t = 0.7. We compare the performance with and without Auto-CoT in Table 9. We are frustrated to find that Auto-CoT degrades performance by a large margin. We speculate this degradation can be attributed to three main reasons. (1) The rationales increase the length of each sample and thus decrease the overall number of examples in demos. (2) There is an obvious discrepancy between sentences with and without positive labels: rationales are only provided for sentences with positive labels, because it is hard to explain why a sentence does not contain any label.
(3) Some auto-generated rationales are of low quality, especially for RE tasks. We will explore better strategies to exploit auto-generated rationales in future work.

D.3 Results of GPT-4

We tend to minimize GPT-4 calls due to its high price. Thus we use the 20-/100-shot settings on each dataset to compare GPT-4's performance with that of other LLMs. Table 10 reveals that GPT-4 does not significantly outperform other LLMs, except on OntoNotes and MAVEN. Even on these datasets, however, GPT-4 still falls behind supervised SLMs by a significant margin. Consequently, the exclusion of GPT-4 does not undermine the conclusions drawn from our main experiments, and we omit it from the rest of our empirical study.

E Auxiliary Experiments

E.1 LLMs struggle on Fine-grained Datasets
Based on the results shown in Figure 2, we additionally provide a quantitative analysis to show that LLMs struggle with fine-grained datasets. Under the 5-shot setting, we compare the performance difference between LLMs (ChatGPT) and SLMs (SoTA few-shot models) across datasets. For each IE task, we observe a clear negative correlation between the label number (row 2) and the performance difference (row 5). In other words, with more label types, LLMs tend to perform relatively worse than SLMs. We therefore conclude that LLMs struggle on fine-grained datasets.

E.2 Finding Better Instruction
To investigate whether LLMs benefit from complex instructions, we explored six instruction variants ranging from simple to complex. Taking the NER task as an example, we illustrate them below.

Instruction0: [empty]
Instruction1: Identify the entities expressed by each sentence, and locate each entity to words in the sentence. The possible entity types are: [Type_1], [Type_2], ..., [Type_N]. If you do not find any entity in this sentence, just output 'Answer: No entities found.'

Instruction2: Identify the entities expressed by each sentence, and locate each entity to words in the sentence. The possible entity types are: If you do not find any entity in this sentence, just output 'Answer: No entities found.'

Instruction3: Assume you are an entity-instance annotator. Given a sentence, you need to (1) identify the word or phrase about the entity in the sentence, and (2) classify its entity type. The possible entity types are listed as below: [Type_1], [Type_2], ..., [Type_N]

Given a sentence, you need to (1) identify the word or phrase about the entity in the sentence, and (2) classify its entity type. The possible entity types are listed as below:

Regarding these six instructions, we evaluate the performance of ChatGPT on four 20-shot IE tasks. As shown in Table 12, there is no significant correlation between instruction complexity and LLMs' performance. Even the prompt without instruction (I0) leads to comparable, if not better, results than prompts with complex instructions. Therefore, we use the simple instruction (I1) in our main experiments.

E.3 Do More Samples in Demos Help?
We wonder whether longer demos grant LLMs more powerful ICL abilities. Thus we investigate the impact of increasing the number of demonstrations on LLMs' performance in Figure 8. We observe that: (1) The performance on the RE task consistently improves with more demos, indicating its potential to benefit from additional annotations. (2) The NER and ED tasks reach a plateau or degrade with increased demo numbers, suggesting that they are limited even before reaching the maximum input length. (3) Open-source LLMs, i.e., LLaMA and Vicuna, have more limited capacities for leveraging demos compared to OpenAI models, with their performance stagnating or even collapsing with only a few (2-4) demos.

E.4 Finding Better Demo Selection Strategy
The maximum input length of LLMs usually limits the number of sentences in demos, even under few-shot settings. For each test sentence s, we require a demo retriever E(D, s) that selects a subset of D as the demo sentences. Following previous work, we consider three commonly-used strategies: (1) Random: uniformly sampling sentences from D; (2) Sentence embedding: retrieving the sentences most similar to s by embedding similarity; (3) Efficient Prompt Retriever (EPR; Rubin et al., 2022): retrieving with a neural retriever R trained on D. For each test sentence s, we pre-retrieve M similar sentences. Then we score each sentence in D by its likelihood, where f denotes the prompt format adopted and L the scoring LM. We randomly select positive samples s′_i(pos) from the top-K_D sentences and hard negative samples s′_i(hard-neg) from the bottom-K_D ones. Then we train R by in-batch contrastive learning (Chen et al., 2020). For each sentence s′_i within the batch, there is 1 positive sentence s′_i(pos) and 2B−1 negative sentences {s′_j(hard-neg)}_{j=1}^{B} ∪ {s′_j}_{j≠i}. Here we set M = 40, K_D = 5, f as the text prompt, the batch size B = 128, and the scoring LM L as FLAN-T5-xl. Table 13 reports the F1-scores of the different selection strategies. We find that both sentence embedding and EPR surpass random sampling by a large margin. Given the simplicity of sentence embedding, we adopt it, rather than EPR, as our selection strategy in the main experiments.
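As a concrete reference, the sentence-embedding strategy can be sketched as a cosine-similarity nearest-neighbour lookup over precomputed embeddings. The encoder itself is abstracted away here, and the array shapes and names are our own illustration.

```python
import numpy as np

def select_demos(test_emb, pool_embs, k):
    """Return the indices of the k pool sentences most similar to the test
    sentence under cosine similarity, most similar first."""
    t = test_emb / np.linalg.norm(test_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ t                   # cosine similarity to each pool sentence
    return np.argsort(-sims)[:k]   # top-k indices, descending similarity
```

For example, with a 2-dimensional toy pool, `select_demos(np.array([1.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]), 2)` returns the two pool sentences closest in direction to the query.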

F.2 Easy Samples
As discussed in Section 4.3, we attribute the inferior performance of LLMs on easy samples to their false-positive predictions on negative samples. Upon examining the incorrect cases, we speculate that two main reasons are responsible: (1) LLM hallucination and (2) span boundary mismatch. We detail them with specific cases below.
LLM Hallucination. This is the most distinctive kind of mistake made by LLMs. Without task-specific fine-tuning on IE tasks, LLMs (with in-context learning approaches) sometimes hallucinate and provide meaningless or even confusing explanations or answers. For instance, LLMs might predict a verb, a preposition or even special symbols as an entity.
Case 1: LLMs predict the word died as the entity person-other (Ground truth: None). Sentence: Flavius Rumoridus (died 5th century AD) was a Roman soldier who was appointed consul in AD 403 in the Western Roman Empire at the same time as the eastern emperor Theodosius II served in the same capacity in the East.

Candidate Choices:
(a) died do/does not belong to any known entities.
(b) died is a kind of geopolitical entity.
(c) died is a person, but not affiliated with following professions: actor, artist, athlete, author, director, politician, scholar, soldier.
Analysis: The sentence provides information about Flavius Rumoridus, who was a Roman soldier and appointed consul in AD 403. The sentence also mentions that he died in the 5th century AD, which is a statement about his death. Therefore, the correct answer is (c), because the sentence does not provide any information about Flavius Rumoridus being affiliated with any of the professions listed in option (c).
Correct Answer: (c)

Case 2: LLMs predict the word of as the entity organization-other (Ground truth: None). Sentence: During these years he studied at the South Kensington Schools of the Royal College of Art.
Candidate Choices: (a) of do/does not belong to any known entities.
(b) of is an educational institute.
(c) of pertains to an organization that does not fall under the categories of company, educational institution, government, media, political party, religion, sports league, sports team, band or musical group.
Analysis: In the given sentence, of is a preposition that indicates the relationship between South Kensington Schools and Royal College of Art. None of the candidate choices directly address the correct relationship between these entities or state that they are not applicable. Therefore, the correct answer is (c): of does not pertain to any known entities mentioned in the choices provided.
Correct Answer: (c)

Candidate Choices: (a) 65th do/does not belong to any known entities.
(b) 65th is a kind of award.
(c) 65th refers to some event except attack, election, natural disaster, protest, revolution and sports.
Analysis: The sentence mentions the 65th British Academy Film Awards, which is an event that recognizes outstanding achievements in the film industry. Therefore, option (c) is incorrect, as it excludes events like award ceremonies. Option (a) is also incorrect, as the number 65th is associated with the event.
Hence, the correct answer is (b) as 65th refers to the specific edition of the BAFTA Awards.

Correct Answer: (b)
G Details on Adaptive Filter-then-rerank Paradigm

G.1 Demonstration Examples
We convert few-shot IE tasks into multiple-choice questions in our filter-then-rerank paradigm. We show 4 examples used in the demonstrations for the FewNERD dataset in Table 16, for the TACREV dataset in Table 17, and for the ACE05 datasets in Table 18.
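The adaptive control flow of the paradigm can be sketched as follows. The threshold τ, the function names, and the candidate format are illustrative assumptions rather than the paper's exact implementation; `slm_predict` stands in for the SLM filter and `llm_rerank` for the multiple-choice LLM query.

```python
def filter_then_rerank(samples, slm_predict, llm_rerank, tau=0.9, top_n=3):
    """Keep high-confidence SLM predictions; route hard (low-confidence)
    samples to the LLM, which reranks the SLM's top-N candidate labels
    via a multiple-choice question."""
    results = []
    for s in samples:
        candidates = slm_predict(s)      # [(label, prob), ...], sorted desc.
        best_label, best_prob = candidates[0]
        if best_prob >= tau:
            results.append(best_label)   # easy sample: trust the filter
        else:
            top_labels = [label for label, _ in candidates[:top_n]]
            results.append(llm_rerank(s, top_labels))  # hard sample
    return results
```

Because only samples below the confidence threshold reach the LLM, the extra latency and cost are restricted to the small hard subset.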

G.2 Template
In our filter-then-rerank paradigm, we utilize templates that convert candidate labels into question options. We list the templates for the FewNERD dataset in Table 19, for the TACREV dataset in Table 20, and for the ACE05 datasets in Table 21.
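A minimal sketch of this label-to-option conversion is shown below. The template strings paraphrase the Table 19 style for FewNERD, and the dictionary keys and lettering scheme are our own simplification.

```python
# Illustrative templates in the style of Table 19 (FewNERD); not exhaustive.
TEMPLATES = {
    "None": "{ent} does not belong to any known entities.",
    "art-writtenart": "{ent} is a kind of written art.",
    "organization-media/newspaper": "{ent} is a media/newspaper organization.",
}

def to_choices(entity, candidate_labels):
    """Render candidate labels as lettered multiple-choice options."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    return [f"({letters[i]}) " + TEMPLATES[label].format(ent=entity)
            for i, label in enumerate(candidate_labels)]
```

For instance, `to_choices("The New Yorker", ["None", "art-writtenart"])` yields the "(a) ... (b) ..." option list fed to the LLM reranker.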

Event Detection
We have fought every institution, every law firm ... it 's cost us a fortune but now we 're going on a massive expansion.

Conflict.Attack ✗
None ✓ In the given sentence, the word fought is used to describe the action of the speaker fighting against various institutions and law firms. This does not involve any physical violence or court proceedings, so the word fought does not trigger any known event.

Figure 1: Examples of prompts used. The green, blue and black parts in the top boxes represent the instruction, demonstration (demo) and test sentence in the prompt, respectively. The red parts represent the outputs from LLMs. We plot only 1 example for convenience of visualization; the actual demo number is usually much larger than 1.

Figure 2: Overall results of SLM-based methods (dashed lines) and LLM-based methods (solid lines) on nine datasets across four IE tasks. The black horizontal dashed lines represent the SoTA performance on the full dataset.

Figure 5: Relationship between confidence scores and performance with/without LLM reranking. We adopt RoBERTa-large as the filter and InstructGPT as the reranker.
(a) she is the other family member of lawyer (b) she is a lawyer (c) she has no known relations to lawyer
Analysis: The word 'she' refers to someone who was upset while recounting certain events in court. The word 'lawyer' refers to someone who denied a news report about that same person weeping in court. There is no information in the sentence to indicate that the two individuals are related in any way. Answer: (c)
Adrien said he met the Baptists' leader, Laura Silsby of Meridian, Idaho, in Port-au-Prince on Jan 26. (a) Laura Silsby lives in the state or province Meridian (b) Laura Silsby lives in the city Meridian (c) Laura Silsby was born in the city Meridian (d) Laura Silsby has no known relations to Meridian
Analysis:
The last hostage, Italian engineer Eugenio Vagni, was released early Sunday.

Figure 7: The financial and time costs over 500 sentences. InstructGPT serves as the reranker.

Figure 8: Relationship between demo number and F1-score on three datasets. Note that the x-axis in each subfigure represents the number of demos (not the shot value K) during ICL. We adopt sentence embedding as the demo selection strategy and the text prompt in this experiment.


Filter-then-rerank prompt example: Read the following sentences and identify the entity type of "The New Yorker" quoted by <t>. Sentence: In 2004 Gourevitch was assigned to cover the 2004 U.S. presidential election for "<t> The New Yorker <t>". Candidate Choices: (a) The New Yorker does not belong to any known entities. (b) The New Yorker is a broadcast program. (c) The New Yorker is a kind of written art. (d) The New Yorker is a media/newspaper organization.

Table 2: Comparative ratios of negative to positive samples across various datasets and subsets. We set a fixed threshold τ here for simplicity.

Table 3: Overall results of LLM-based ICL methods, SLM-based supervised methods, and our proposed filter-then-rerank (SLM+LLM) methods. The best results are in bold face and the second best are underlined. All results except InstructGPT and GPT-4 are averaged over 5 runs, and sample standard deviations are given in round brackets.

Table 5: Ablation study on three datasets. The filter is the ensembled SLMs and the reranker is GPT-4.

Table 6: Statistics of the nine datasets used. Note that #mentions for event detection tasks refers to the number of trigger words, while #mentions for event argument extraction tasks refers to the number of arguments.

Table 7: Statistics of the few-shot training sets. We set different random seeds and generate 5 training sets for each setting; we report their average statistics.

Table 8: F1-scores across different t values. Experiments run on 10-shot settings with CODEX.

Table 9: The F1-score difference between with and without Auto-CoT. We generate rationales with InstructGPT, then adopt the ICL w/ Auto-CoT approach and use CODEX as our backbone for inference.

Table 11: Performance comparison between LLMs (ChatGPT) and SLM-based methods on datasets with various schema complexities.

Table 13: F1-scores of the three demo-selection strategies. Experiments run on 20-shot settings with ChatGPT.
Table 15 showcases some hard examples that benefit from our LLM reranking. In accordance with our intuition, we observe that the LLM rerankers correct two kinds of erroneous predictions made by SLMs: (1) a lack of external knowledge, as in the first (Triptolemus is a figure in Greek mythology) and third (Minas Gerais is a state instead of a city) examples; and (2) limited reasoning abilities, as in the second (his wife's children are his children) and fourth (the word "fought" in this sentence does not involve any physical violence) examples.
LLMs predict Baron, a subspan of Baron Carl Hurleman, as the entity person-other (the ground-truth entity type of Baron is None, and the ground-truth entity type of Baron Carl Hurleman is person-artist/author).
LLMs predict 65th, a subspan of 65th British Academy Film Awards, as the entity other-awards (the ground-truth entity type of 65th is None, and the ground-truth entity type of 65th British Academy Film Awards is other-awards).

Table 15: Examples of samples corrected by LLM reranking. We sample four examples from the NER, RE and ED tasks. Sentences: the sentences in which the samples are located; we color the samples (entities or trigger words) to be identified. Before: the prediction before LLM reranking, based on SLM-based methods. After: the reranked prediction using LLMs. Rationales: LLM-generated explanations. The sentence states that 'Eliza Samudio, Bruno's ex-girlfriend and his 4-month-old baby boy's mother, disappeared from her hometown of Belo Horizonte, Minas Gerais.' This indicates that Eliza Samudio is from the city of Belo Horizonte, which is located in the state of Minas Gerais. So Eliza Samudio lives in the state or province Minas Gerais.

Table 19: Templates for the FewNERD dataset, where {ent} is the placeholder for the entity type.
person-other: {ent} is a person, but not affiliated with following professions: actor, artist, athlete, author, director, politician, scholar, soldier.
organization-other: {ent} pertains to an organization that does not fall under the categories of company, educational institution, government, media, political party, religion, sports league, sports team, band or musical group.
location-other: {ent} is a geographic location that does not fall under the categories of geopolitical entity, body of water, island, mountain, park, road, railway and transit.
organization-media/newspaper: {ent} is a media/newspaper organization.