CodeIE: Large Code Generation Models are Better Few-Shot Information Extractors

Large language models (LLMs) pre-trained on massive corpora have demonstrated impressive few-shot learning ability on many NLP tasks. A common practice is to recast the task into a text-to-text format such that generative LLMs of natural language (NL-LLMs) like GPT-3 can be prompted to solve it. However, it is nontrivial to perform information extraction (IE) tasks with NL-LLMs since the output of an IE task is usually structured and therefore hard to convert into plain text. In this paper, we propose to recast the structured output in the form of code instead of natural language and utilize generative LLMs of code (Code-LLMs) such as Codex to perform IE tasks, in particular named entity recognition and relation extraction. In contrast to NL-LLMs, we show that Code-LLMs can be well aligned with these IE tasks by designing code-style prompts and formulating these IE tasks as code generation tasks. Experimental results on seven benchmarks show that our method consistently outperforms fine-tuning moderate-size pre-trained models specially designed for IE tasks (e.g., UIE) and prompting NL-LLMs under few-shot settings. We further conduct a series of in-depth analyses to demonstrate the merits of leveraging Code-LLMs for IE tasks.


Introduction
Information extraction (IE) aims to recognize structured information from plain text. It spans various tasks with diverse output structures such as named entity recognition (NER), relation extraction (RE), etc. (Sang and Meulder, 2003; Grishman, 2019; Wang et al., 2021a; Zhong and Chen, 2021; Lu et al., 2022). To express and address these different tasks in a unified framework, recent works propose to linearize the output structures into unstructured text.
* Equal contribution. † Corresponding author.
1 Code is available at https://github.com/dasepli/CodeIE
While this kind of linearizing approach achieves promising results with sufficient training data, it still performs poorly under the few-shot scenario. For instance, compared with full-data training, performance drops by around 20% when applying UIE to the 5-shot NER task CoNLL03 (Lu et al., 2022).
Considering the remarkable few-shot adaptation capabilities of large language models (LLMs) (Brown et al., 2020; Rae et al., 2021; Chowdhery et al., 2022; Hoffmann et al., 2022), we seek to employ them to perform few-shot IE tasks, especially the few-shot NER and RE tasks. Typically, for NLP tasks like text classification, previous works reformulate them into text-to-text generation formats and prompt LLMs of natural language (NL-LLMs) like GPT-3 (Brown et al., 2020) to generate the answer. In contrast, due to the complex structure inside the targets of IE tasks, linearized targets of previous works like "((person: Steve) (organization: Apple))" are usually "unnatural", resulting in a mismatch between the output format at pre-training time and at inference time (see Figure 1(a)). As a consequence, when using these flattening methods to perform IE tasks with pre-trained language models, the predicted outputs are fragile and often require complex decoding strategies to post-process them into valid structures (Lu et al., 2022; Josifoski et al., 2022).
In this paper, we propose to frame these two IE tasks as code generation tasks and leverage LLMs of code (Code-LLMs) to address them. We argue that the abundant structured code knowledge encoded in pre-trained Code-LLMs can benefit these IE tasks. As demonstrated in Figure 1(b), it is easy to convert the text-to-structure IE task into a structure-to-structure code generation task, while transforming it into a text-to-text format can be difficult. Take the example input in Figure 1, "Steve became CEO of Apple in 1998 .": we wrap it into a piece of Python code and formulate the structured entity outputs as Python dictionaries with keys "text" and "type". We compose them into a Python function that is semantically equivalent to the NER example, which is shown as follows:

    def named_entity_recognition(input_text):
        """ extract named entities from the input_text . """
        input_text = "Steve became CEO of Apple in 1998 ."
        entity_list = []
        # extracted named entities
        entity_list.append({"text": "Steve", "type": "person"})
        entity_list.append({"text": "Apple", "type": "organization"})

After demonstrating a few training samples in the same format, we feed the code-style prompt (the lines highlighted in light grey) into Code-LLMs and get the structured prediction.
We conduct experiments on seven benchmarks of NER and RE tasks and carefully analyze the benefits of our approach (named CODEIE). The findings are as follows: 1) Prompting Code-LLMs (e.g., Codex (Chen et al., 2021)) with code-style inputs consistently outperforms fine-tuning UIE, a specially pre-trained model for IE tasks, and prompting NL-LLMs (e.g., GPT-3) under few-shot settings. 2) With the same LLM (either NL-LLM or Code-LLM), the code-style prompt performs better than the linearized text prompt, demonstrating the advantage of representing structured targets with code. 3) With the same prompt (either natural language or code), the Code-LLM (i.e., Codex) achieves better performance than the NL-LLM (i.e., GPT-3), demonstrating the merits of performing IE tasks with Code-LLMs. 4) Compared with natural language prompts, code-style prompts show higher fidelity to the output structures, i.e., the outputs have a lower structural error rate.
The high-level differences between previous moderate-size models, NL-LLMs, and Code-LLMs for IE tasks are summarized in Table 1.

CODEIE
In this section, we first formulate the two IE tasks we focus on, named entity recognition (NER) and relation extraction (RE) in Section 2.1. Then we describe how we recast these structured prediction tasks into code generation tasks (Section 2.2) and prompt Code-LLMs to perform them (Section 2.3) under the few-shot scenario. We use Python language for our code generation tasks since public Python codebases are abundant and Code-LLMs are sufficiently pre-trained on them.

Task Formulation

Figure 2: The way to convert NER and RE into code generation tasks.
The target y of NER is a set of (e, t) pairs, where e is an entity span (e.g., "Steve") and t is the corresponding entity type (e.g., "person"). The entity span is a sequence of tokens from x, and the entity type belongs to a pre-defined entity type set T .
The prediction target y of RE is comprised of a set of triplets (e 1 , r, e 2 ), where e 1 and e 2 are two entity spans from x and r ∈ R is the semantic relation (e.g., "work for") between the two entities.
Here R denotes a pre-defined relation type set. In addition to extracting the relation of entities, we are often interested in also predicting the entity types t 1 and t 2 of entities e 1 and e 2 at the same time.
In the few-shot setting, we are given a small set of annotated samples {(x_i, y_i)}_{i=1}^{n} that consists of k samples per class to compose a k-shot setting.

Formulating IE Tasks into Code Generation Tasks
In order to utilize generative Code-LLMs for IE tasks, we reformulate IE tasks as code generation tasks. The code generation task is to predict the subsequent code sequence given an incomplete piece of code. Hence, we can recast the input and output of the IE task into an incomplete piece of code and the code to be predicted, respectively, such that they can compose a complete piece of code that is semantically equivalent to the original sample while maintaining the syntax of the programming language.
In this work, we mainly use Python functions to represent IE tasks. We wrap the input text x into a code-style prompt x c and represent the output structure y with structured Python elements, such as the list, dictionary, etc. As shown in Figure 2, for NER and RE tasks, we first transform the task name into the name of the Python function and add a docstring to illustrate the goal of the task. We assign the input text string x to a variable input_text. Then we initialize an empty list to save the output and append a descriptive comment like "# extracted named entities" to prompt Code-LLMs to put named entities into the list. We pack the above code as our code prompt x c .
For the structured target y, we utilize the append method of Python list and represent each basic information unit (e.g., a pair for NER tasks or a triplet for RE tasks) as a Python dictionary. Hence, the target y c to be predicted by Code-LLMs is reformulated into a list of dictionaries. For NER, we add Python dictionaries with keys "text" and "type" to the list, where the values of the dictionaries are the corresponding entity span and entity type. For RE, we similarly add dictionaries with keys "rel_type", "ent1_type", "ent1_text", "ent2_type", and "ent2_text" to the list to represent the structured target.
The Code-LLM is expected to complete the list conditioning on the function name, docstring, and input text. Figure 2 shows examples of formulating an original IE sample into a code-style one.
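The conversion described above can be sketched as follows; this is a minimal illustration of the format from the running example, and the helper name build_ner_prompt is ours, not part of the original method.

```python
def build_ner_prompt(input_text, entities=None):
    """Wrap an input sentence (and, for demonstration samples, its gold
    entities) into the Python-function format used as the code prompt.
    For a test sample, pass entities=None so the function body stops at
    the comment line and the Code-LLM completes the list."""
    lines = [
        "def named_entity_recognition(input_text):",
        '    """ extract named entities from the input_text . """',
        f'    input_text = "{input_text}"',
        "    entity_list = []",
        "    # extracted named entities",
    ]
    # For demonstrations, append the gold target as entity_list.append(...)
    # calls, one dictionary per extracted entity.
    for span, etype in (entities or []):
        lines.append(
            f'    entity_list.append({{"text": "{span}", "type": "{etype}"}})'
        )
    return "\n".join(lines)

demo = build_ner_prompt(
    "Steve became CEO of Apple in 1998 .",
    entities=[("Steve", "person"), ("Apple", "organization")],
)
```

The RE variant would follow the same pattern with the five-key dictionaries described above.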
It is worth noting that the design space of code-style prompts is large and hard to fully explore. The formulation described above is a straightforward instance using Python. We also explore several other formulations to recast IE tasks into code generation tasks, which can be found in Appendix A.1.

Prompting Code-LLMs with In-Context Demonstrations
Despite the carefully designed prompt, it is nontrivial to perform IE tasks by prompting Code-LLMs without any samples. Therefore, it is necessary to expose Code-LLMs to a few labeled samples, as in typical few-shot settings.
With the increasing size of pre-trained language models, fine-tuning is becoming more and more expensive or even infeasible since recent LLMs are usually released as black-box APIs. Hence, instead of fine-tuning Code-LLMs on the few-shot dataset, we explore including the labeled samples in the context and performing in-context learning (Brown et al., 2020). We select n samples {(x_i, y_i)}_{i=1}^{n} from the training dataset and convert them to corresponding code-style pairs {(x_i^c, y_i^c)}_{i=1}^{n}. We concatenate them as a string to compose the in-context demonstrations x_1^c ⊕ y_1^c ⊕ ... ⊕ x_n^c ⊕ y_n^c. Given an incoming test sample x, we first convert it to a code prompt x^c and prepend the demonstration context, i.e., x_1^c ⊕ y_1^c ⊕ ... ⊕ x_n^c ⊕ y_n^c ⊕ x^c. After feeding the constructed input into the Code-LLM, we expect an output y^c in the same format as y_1^c, y_2^c, ..., y_n^c (see Figure 2). We find that y^c almost always retains the syntax of the Python language and is easy to convert back to its original structure y.
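The concatenation step is mechanically simple; a sketch (the function name and the blank-line separator are our assumptions, not specified by the paper):

```python
def build_incontext_input(demos, test_prompt, sep="\n\n"):
    """Concatenate code-style demonstration pairs (x_i^c, y_i^c) and the
    test-time code prompt x^c into the single string fed to the Code-LLM.

    demos: list of (x_c, y_c) string pairs, where x_c is the function
           header up to the comment line and y_c is the gold completion.
    test_prompt: the code prompt x^c for the incoming test sample."""
    parts = [x_c + "\n" + y_c for x_c, y_c in demos]
    parts.append(test_prompt)
    return sep.join(parts)
```

The model's generation is then read until it leaves the function body (e.g., at a blank line), yielding y^c.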

Baselines
1) Fine-tuning We fine-tune the base and large versions of two moderate-size pre-trained models: T5 and UIE. T5 is a sequence-to-sequence model pre-trained on large-scale text corpora. UIE is a model further pre-trained from T5 on structured datasets. UIE utilizes the textual structured extraction language (SEL) to express the output structures. We use the same approach and parameters as Lu et al. (2022) when fine-tuning T5 and UIE. 2) Prompting We compare our approach with prompting NL-LLMs, in particular GPT-3. We mainly experiment with text-davinci-002.
We use a text prompt whose format is slightly modified from SEL. As shown in Figure 1(a), given an input text x, the text prompt and output format are like "The text is x. The named entities in the text: " and "((person: ...) (organization: ...))", respectively. See Appendix A.2 for more details of the text prompt. The approach and hyperparameters of NL-LLM prompting and Code-LLM prompting are identical.

Table 3: Experimental performances on NER and RE benchmarks. Our approach is highlighted in light grey. The full-data fine-tuning performances come from UIE. For the few-shot setting, we evaluate T5-base, UIE-base, T5-large and UIE-large with fine-tuning, and GPT-3 and Codex by few-shot prompting with two different prompt types. The text prompt is the structured extraction language (SEL) introduced by UIE. The code format is introduced in Section 2.2. We set the shot number (#shot) and the corresponding sample number (#sample) differently to fit into the GPT-3 maximum length limitation (4,097 tokens).
Few-Shot Setting For each IE task, we randomly sample k training samples for each entity or relation type to construct a k-shot training set. The value of k varies across different datasets to satisfy the maximum length limitation (4097) of GPT-3. To be compatible with datasets that contain samples with empty targets, we regard those empty-target samples as an additional class and include k samples belonging to that class in the training set.
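A sketch of this sampling procedure under the stated rules (the helper name and example schema are ours; note that a sample covering multiple types may be selected more than once, which a fuller implementation would deduplicate):

```python
import random
from collections import defaultdict

def sample_k_shot(dataset, k, seed=0):
    """Sample k training examples per entity/relation type to build a
    k-shot training set; samples with empty targets are treated as one
    additional class, as described above.

    dataset: list of (text, targets) pairs, where targets is a list of
             dicts each carrying a "type" key (empty list = no target)."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for text, targets in dataset:
        if not targets:
            by_type["<empty>"].append((text, targets))
        else:
            for t in {unit["type"] for unit in targets}:
                by_type[t].append((text, targets))
    selected = []
    for t, examples in by_type.items():
        rng.shuffle(examples)
        selected.extend(examples[:k])
    return selected
```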
Evaluation Following previous work (Lu et al., 2022), we use Entity F1 and Relation Strict F1 as the evaluation metrics for NER and RE tasks, respectively. Under these metrics, an entity prediction is correct if its offsets and entity type match the gold entity, and a relation prediction is correct if the relation type is correct and the corresponding offsets and types of its entities are correct. Since few-shot training has high variance, we perform 3 runs with different random seeds for each experiment and report the mean and standard deviation of the metric.
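The Entity F1 criterion above can be made concrete as set matching over (start, end, type) triples; this per-sample sketch is ours (corpus-level micro F1 would aggregate the counts across samples before computing the ratio):

```python
def entity_f1(gold, pred):
    """F1 over entity predictions, where each entity is a hashable
    (start, end, type) triple: a prediction counts as correct only if
    both its offsets and its entity type match a gold entity."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # exact offset + type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```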

Results
LLMs vs. Moderate-sized Models As shown in Table 3, LLMs (GPT-3 and Codex) achieve superior performance over moderate-sized models (T5 and UIE) under few-shot settings, demonstrating a strong few-shot learning ability on IE tasks. In particular, in average performance over the seven considered benchmarks, our proposed CODEIE (Codex + code prompt) achieves the best results, improving over T5-large and T5-base by 132% and 327%, respectively. In addition, under 1-shot learning settings, CODEIE improves the performance of UIE-large by more than 60% on the CoNLL03 and CoNLL04 benchmarks (see Table 6 in the Appendix).
Code Prompt vs. Text Prompt We then compare the performance of code prompts vs. text prompts when using the same LLM, i.e., comparing GPT-3 + text prompt with GPT-3 + code prompt, and comparing Codex + text prompt with Codex + code prompt. We find that prompting LLMs with code yields significant improvements (23% for GPT-3 and 16% for Codex). What is surprising is that the code prompt is even more beneficial to GPT-3, which is not specifically trained on code data.

Table 4: Performance of different prompt designs. "struct lang" and "func def" are the "text" and "code" prompt types respectively in our main experiments.
Code-LLMs vs. NL-LLMs When using the same kind of prompt and comparing the used LLMs, i.e., comparing GPT-3 + text prompt and Codex + text prompt and comparing GPT-3 + code prompt and Codex + code prompt , we find that Codex consistently surpasses GPT-3, demonstrating that code pre-training can be beneficial to IE tasks.

Different Shot Numbers
We further compare these approaches under different shot numbers on CoNLL03 and CoNLL04. As shown in Figure 3, the observed trends still hold when increasing the number of shots.

Different Prompt Designs
The design of the prompt can be an important factor affecting model performance (Min et al., 2022). Hence, we explore additional prompt designs for both the text prompt and the code prompt. The detailed prompt designs can be found in Appendix A. The experimental results are shown in Table 4, from which we find that code prompts consistently outperform text prompts. Hence, the superior performance of code prompts mainly comes from the code style itself rather than a specific instance of prompt design.

Different LLMs
To verify the versatility of the proposed approach and the observed findings, we further conduct experiments with text-davinci-001 version of GPT-3 and code-davinci-001 version of Codex. As shown in Table 7, the previous findings still hold across the two different versions.

Analysis
To take a closer look at the difference between prompting NL-LLMs with textual format input and prompting Code-LLMs with code format input, in this section, we define several informative metrics and conduct in-depth analyses to shed some light on the following question: what contributes to the final performance of CODEIE for IE tasks?

Format Consistency
We can see from Figure 1(a) that an apparent mismatch when using NL-LLMs for IE tasks is the inconsistency between the structured output format at inference time and the natural language these models see at pre-training time, while the format of code-style output aligns well with Code-LLMs. It has been evidenced that adapting pre-trained models to downstream tasks in a manner well aligned with their pre-training paradigm usually achieves better few-shot learning performance. Hence we assume the promising performance of CODEIE partly comes from the better format consistency between the code-style sample and the pre-trained code model.
To verify this hypothesis, given a sample, we compare the perplexities of a pre-trained language model on its text format and a pre-trained code model on its code format. Formally, given a generative model M, the conditional perplexity ppl of a sample (x, y) is

    ppl_M(y|x) = exp( -(1/|y|) * Σ_{j=1}^{|y|} log p_M(y_j | x, y_{<j}) ).

For an original IE sample (x, y), we first transform it to its natural language text pair (x^t, y^t) and its code pair (x^c, y^c), and then compute their conditional perplexities with the language model M_nl and the code model M_c, respectively, i.e., ppl_{M_nl}(y^t | x^t) and ppl_{M_c}(y^c | x^c). A lower conditional perplexity means the output format aligns better with the pre-training distribution of the model.
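Given the per-token conditional log-probabilities of the target under a model (as returned by most scoring APIs), the quantity above reduces to a one-line computation; the helper name is ours:

```python
import math

def conditional_perplexity(token_logprobs):
    """ppl_M(y|x) = exp(-(1/|y|) * sum_j log p_M(y_j | x, y_<j)),
    computed from the per-token log-probabilities of the target y
    under model M (natural logarithms)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```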
Since LLMs usually limit user access to their black-box APIs, we instead utilize two proxy models, T5 (Raffel et al., 2020) and CodeT5 (Wang et al., 2021b), to calculate the perplexities. CodeT5 is a variant of the T5 model that is further pre-trained on code data. We calculate the perplexities on the previous seven datasets with the base versions of the two models, namely T5-base and CodeT5-base. Figure 4 shows the mean perplexities of the two base-version models on the training samples of each task. We observe that the perplexity of text-format outputs measured by T5-base is usually larger than that of code-format outputs measured by CodeT5-base. This suggests that transforming IE samples into code format better aligns with the data distribution of code pre-training and therefore with the pre-trained code language model.

Model Fidelity
Besides the low format consistency of prompting NL-LLMs, we find that NL-LLMs are more likely than Code-LLMs to generate outputs with structural and semantic errors when performing few-shot IE tasks. In other words, Code-LLMs seem to be more faithful to the demonstrated few-shot samples than NL-LLMs. To quantitatively measure model fidelity, we define two metrics: Structure Fidelity Structure fidelity measures how faithful the model is to the structure of the demonstrations provided in the context. This can be simply measured by calculating the structural error rate, which is the proportion of generated samples with structural errors. In particular, we construct a parser with a series of hand-written rules to transform the model-generated outputs back to the desired format and filter out samples with invalid structures. Figure 5 demonstrates the structure fidelity of different models with different prompts on the seven benchmarks. Results show that the outputs generated by GPT-3 and Codex using text prompts are fragile, while using code prompts tends to generate nearly zero structurally erroneous samples. Besides, with the same text prompt, Codex tends to generate fewer structurally erroneous samples than GPT-3, demonstrating its superior understanding ability on general structured input rather than being limited to existing programming languages.
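The paper's parser is rule-based and not released in the text, but a minimal sketch of such a structural check for the NER code format might look as follows (the regex and helper name are our assumptions):

```python
import ast
import re

# A completion line is expected to look exactly like:
#   entity_list.append({"text": ..., "type": ...})
APPEND_RE = re.compile(r'entity_list\.append\((\{.*\})\)\s*$')

def parse_generated_entities(generated_code):
    """Parse a model completion back into (span, type) pairs. Return
    None if any non-empty line violates the expected structure, so the
    caller can count the sample toward the structural error rate."""
    entities = []
    for line in generated_code.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        m = APPEND_RE.match(line)
        if m is None:
            return None  # structurally invalid line
        try:
            d = ast.literal_eval(m.group(1))
        except (ValueError, SyntaxError):
            return None  # dictionary literal does not parse
        if not isinstance(d, dict) or set(d) != {"text", "type"}:
            return None  # wrong keys
        entities.append((d["text"], d["type"]))
    return entities
```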
Semantic Fidelity Another measurement of model fidelity is semantic fidelity, which is designed for samples that have a valid structure and pass our parser but are semantically incorrect. The difference between semantic fidelity and conventional prediction error is that semantic fidelity mainly considers model behaviours that violate the formulation of the task, e.g., predicting an entity type that does not exist in the given entity type set or extracting an entity span that does not appear in the input text. Some example semantic errors detected in our experiments are listed in Table 5. We report the statistical results of the tasks in Table 8 and Table 9 in the Appendix. We find that GPT-3 generates more semantic errors than Codex, although some of the errors seem "correct" but fall outside the pre-defined class set. In a nutshell, GPT-3 tends to generate free-form results, while Codex is more faithful to the demonstrations provided in the context and is therefore more predictable for IE tasks.
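The two violation types named above can be checked mechanically; this sketch (helper name ours) flags them for a structurally valid NER prediction:

```python
def semantic_errors(input_text, entities, type_set):
    """Flag predictions that violate the task formulation: an entity
    type outside the pre-defined type set, or an entity span that does
    not appear in the input text. Returns the offending (span, type)
    pairs; an empty list means the prediction is semantically faithful."""
    errors = []
    for span, etype in entities:
        if etype not in type_set or span not in input_text:
            errors.append((span, etype))
    return errors
```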

Fine-grained Performance
In addition, we conduct a fine-grained evaluation to compare different approaches. Beyond the F1 score, precision and recall are also important metrics for NER and RE tasks. To investigate how different LLMs and prompting methods affect precision and recall, we report the two metrics in Figure 6. Results show that: (a) the code prompt improves model performance in both precision and recall; (b) compared with GPT-3, Codex achieves higher recall and comparable precision on NER tasks, and achieves both higher precision and recall on RE tasks.

Related Work
Generative Information Extraction Generative information extraction, which frames IE tasks as token generation tasks, has received more attention recently due to its potential to unify different tasks (Josifoski et al., 2022). One line of work designs various ways to linearize entities into a sentence to unify various named entity recognition subtasks. TANL (Paolini et al., 2021) uses augmented language to improve the effectiveness of generative models. Lu et al. (2022) propose a structured extraction language (SEL) and pre-train their UIE model with this language on multiple structured datasets. These works linearize the structured output of IE tasks into text format to align with pre-trained models. Different from them, we propose to recast the structured samples of IE tasks into structured code format and utilize aligned pre-trained code models to perform the tasks.

Code-LLMs for Complex Tasks Recent works show that Code-LLMs perform better on complex tasks like commonsense and symbolic reasoning (Madaan et al., 2022; Cheng et al., 2022), mathematical logic (Suzgun et al., 2022) and event argument prediction (Wang et al., 2022). Different from them, we focus on two mainstream IE tasks, i.e., NER and RE, and conduct in-depth analyses to provide more insights. Gutiérrez et al. (2022) test the performance of GPT-3 on biomedical NER and RE tasks and find it underperforms fine-tuning smaller pre-trained models. A concurrent work (Agrawal et al., 2022) finds that GPT-3 performs well on few-shot clinical IE tasks. We conduct our experiments on more general NER and RE datasets and find GPT-3 can achieve performance comparable to fine-tuning the UIE model. Moreover, we successfully employ LLMs of code with better performance on these IE tasks.

Conclusion
We propose the first work to utilize structured Code-LLMs with code-style prompts to perform few-shot NER and RE tasks. Experiments show our approach consistently surpasses the UIE models and the NL-LLM counterparts under the few-shot setting. We conduct extensive analyses and find that the performance gains come from better format consistency and model fidelity, among other factors. We believe these analyses can facilitate future work. In future work, we will employ CODEIE on more IE tasks in different domains and inspect its robustness.

Limitations
Though our approach demonstrates better performance than the baseline models, how to design a good code-format prompt has not been fully explored. Besides, we mainly conduct experiments on the black-box GPT-3 and Codex models, which are not open-sourced, and querying the GPT-3 API incurs a monetary cost. The use of LLMs also brings environmental costs. Another limitation of our approach is that Code-LLMs are mainly trained on programming language datasets with English annotations; exploring our method on non-English datasets (e.g., Chinese datasets) is left for future work.