Learning In-context Learning for Named Entity Recognition

Named entity recognition in real-world applications suffers from the diversity of entity types, the emergence of new entity types, and the lack of high-quality annotations. To address the above problems, this paper proposes an in-context learning-based NER approach, which can effectively inject in-context NER ability into PLMs and recognize entities of novel types on-the-fly using only a few demonstrative instances. Specifically, we model PLMs as a meta-function λ_(instruction, demonstrations, text).M, and a new entity extractor can be implicitly constructed by applying new instruction and demonstrations to PLMs, i.e., (λ.M)(instruction, demonstrations) → F, where F is a new entity extractor F: text → entities. To inject the above in-context NER ability into PLMs, we propose a meta-function pre-training algorithm, which pre-trains PLMs by comparing the (instruction, demonstration)-initialized extractor with a surrogate golden extractor. Experimental results on 4 few-shot NER datasets show that our method can effectively inject in-context NER ability into PLMs and significantly outperforms the PLMs+fine-tuning counterparts.


Introduction
Named entity recognition (NER) aims to detect and classify named entities in text, such as People, Disease, and Movie. Traditional NER methods (Lample et al., 2016; Li et al., 2020c; Yan et al., 2021) have achieved remarkable success when entity types are pre-defined and massive high-quality annotations are provided. Unfortunately, real-world NER still suffers from the diversity of entity types (e.g., the extraction of Movie is very different from that of Disease), the emergence of new entity types (e.g., a new Virus type), and the lack of high-quality annotations. To address these problems, recent studies often employ few-shot learning techniques, including fine-tuning-based and metric-based methods. Fine-tuning-based methods extract entities of new types by adjusting model weights using new instances (Ma et al., 2022a; Chen et al., 2022a; Das et al., 2022). The main drawbacks of these methods are that re-training is often expensive (especially for large-scale models) and new entity types cannot be addressed on-the-fly. Metric-based methods are free from updating parameters and identify entities by learning to compare query instances with support instances (or prototypes) (Yang and Katiyar, 2020; Tong et al., 2021). These methods are limited to their matching architectures and are sensitive to domain shift, since they do not fully exploit the information of the target domain (Ma et al., 2022c).
Figure 1: Illustration of in-context NER, which uses instruction, demonstrations, and text as input to identify entities. The in-context learning model can be regarded as a meta-function that takes instruction and demonstrations as input and produces an entity extractor capable of identifying the desired entities (Akyürek et al., 2022).
In this paper, we propose an in-context learning-based NER approach, which can effectively address the above problems by injecting in-context NER ability into PLMs and then recognizing entities of new types on-the-fly using only a few demonstrative instances. Specifically, we model PLMs as a meta-function (Akyürek et al., 2022) for NER, λ_(instruction, demonstrations, text).M, and a new entity extractor can be implicitly constructed by applying new instruction and demonstrations to PLMs, i.e., (λ.M)(instructions, demonstrations) → F, where F is a new entity extractor F: text → entities. For example, in Figure 1, our method can construct entity extractors for the new Disease and Virus types on-the-fly by applying PLMs to demonstrations such as "Text: Cancer is a leading cause of death worldwide. Entities: Cancer is disease". Furthermore, we propose a meta-function pre-training algorithm to inject the above in-context NER ability into PLMs. The algorithm pre-trains PLMs by comparing the implicitly (instruction, demonstration)-constructed extractor with an explicitly fine-tuned surrogate golden extractor. The comparison ensures that the meta-function (λ.M) will generate an entity extractor F from instructions and demonstrations as accurately as possible.
The proposed method can seamlessly leverage the powerful language understanding and generation capabilities of large-scale PLMs (Brown et al., 2020), effectively address diverse and new entity types through in-context learning, and only requires a couple of demonstrations for each entity type. Compared to fine-tuning methods, our method does not require expensive retraining, and new entity types can be extracted on-the-fly, with no need to adjust model weights. Compared with metric-based methods, our method can dynamically exploit the information entailed in instruction and demonstrations rather than being limited to a fixed metric space.
To verify the effectiveness of our method, we further pre-train PLMs using a large-scale distantly annotated NER dataset from Wikipedia and Wikidata. Experimental results on 4 few-shot NER benchmarks show that our method can effectively inject in-context NER ability into PLMs and significantly outperforms the PLMs+fine-tuning counterparts.
In general, this paper's main contributions are:
• We propose an in-context NER method that can effectively extract entities of novel types on-the-fly.
• We design a meta-function pre-training algorithm, which models PLMs as a meta-function and injects in-context NER ability into PLMs by comparing the (instruction, demonstration)-constructed extractor with a surrogate golden extractor.
• How to inject in-context ability into small models is an important research direction of NLP in the big-model era. Our work can open up new directions for future works.

Related work
Few-shot NER Few-shot learning is a promising technique for low-resource NER. Currently, there are two main categories of FS-NER methods: fine-tuning-based methods and metric-based methods. Fine-tuning-based FS-NER methods re-train NER models using new instances. Metric-based methods identify entities by pre-training to compare query instances with support instances (Snell et al., 2017; Fritzler et al., 2019; Yang and Katiyar, 2020; Tong et al., 2021; Ji et al., 2022) using given NER datasets. FS-NER is a challenging task, and several improvements have been proposed to enhance its performance. These include leveraging label information (Hou et al., 2020; Wang et al., 2021a; Lu et al., 2022b; Ma et al., 2022a; Chen et al., 2022a), designing new paradigms such as decomposition methods (Ji et al., 2022; Ma et al., 2022c), prompt-based methods (Cui et al., 2021; Ma et al., 2022b), and demonstration-based methods (Lee et al., 2022; Zhang et al., 2022a), and proposing new learning strategies like meta-learning (Li et al., 2020a,b; de Lichy et al., 2021; Ma et al., 2022c), contrastive learning (Das et al., 2022), and self-training (Huang et al., 2021; Wang et al., 2021b). In this paper, we address FS-NER via in-context learning (Gutiérrez et al., 2022), which empowers PLMs with in-context learning ability so that entities of new types can be extracted on-the-fly.
In-context learning The in-context learning ability has been observed in large-scale PLMs such as GPT-3 (Brown et al., 2020), and has been widely applied to different tasks such as chain-of-thought reasoning. Recent studies aim to enhance in-context learning by selecting valuable demonstrations (Rubin et al., 2022), optimizing the order of demonstrations (Lu et al., 2022a), and calibrating output distributions (Zhao et al., 2021). Some studies try to replicate in-context learning in smaller models (Min et al., 2022a,b; Chen et al., 2022b; Chan et al., 2022). Furthermore, there are efforts to understand the underlying mechanisms of in-context learning (Akyürek et al., 2022), which suggest that it can be compared to a meta-function and facilitates implicit fine-tuning (Dai et al., 2022; von Oswald et al., 2022). Inspired by these studies, this paper regards in-context NER as a meta-function. To enhance the in-context NER ability of pre-trained language models (PLMs), we propose an effective pre-training algorithm. Unlike MetaICL (Min et al., 2022a), which only transforms multi-task learning into the form of in-context learning for pre-training, our approach also includes meta-function pre-training (Section 4.3) based on the underlying mechanisms of in-context learning.
Figure 2: The formats of input and output of in-context few-shot NER. The input is formed by instruction, demonstrations, and text.

In-context Named Entity Recognition
This section describes how to recognize entities through in-context NER. In in-context learning, the model will read the information of target entity types from both instruction and demonstrations, and then extract entities of target types within the text. In this way, new entity types can be extracted on-the-fly, without the need for model retraining.
Concretely, this paper formulates in-context NER as a sequence-to-sequence generation process. The input X = [I; D; T] consists of instruction I, demonstrations D, and text T, while the output is a list of extracted entities Y = [e_1, ..., e_n]. Figure 2 shows an example, where an in-context NER model will identify that the target entity types are Disease and Virus, distill knowledge about these types from the demonstrations (e.g., the context patterns of a disease), and finally recognize "SARS-CoV-2" as virus and "COVID-19" as disease using the above knowledge. The details are described as follows.
Instruction The instruction is a sequence of target entity types, guiding the model to extract what entity types (Min et al., 2022a). The instruction for target entity types {l 1 , . . . , l n } is I ="Target types: l 1 ; . . . ; l n ". For example, in Figure 2 the instruction is "Target types: disease; virus".
Demonstrations Demonstrations provide the intra-class knowledge of target entity types (e.g., entity semantics and context patterns) and illustrate the form of outputs. As shown in Figure 2, the demonstrations contain the illustrative instances for different target types, and each instance is "Text: {text} Entities: {extractions}", where {extractions} are entities presented in the {text}.
Extractions The output of the extraction process is a list of entities, denoted as Y = [e_1, ..., e_n], where e_i is the i-th extracted entity. Each extraction e is represented as "ENTITY is type". For instance, in Figure 2, the extraction "COVID-19 is disease." indicates that "COVID-19" is an entity mention with the type "Disease". This natural language-like representation allows us to better utilize the text generation capabilities of pre-trained language models. During inference, we locate all mentions in the text and further output their locations.
Architecture Given the above task formulation, we employ an encoder-decoder network like T5 (Raffel et al., 2020), where the encoder encodes <instruction, demonstrations, text> and the decoder generates all extractions as a tokenized text sequence Y = [y_1, ..., y_n].
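The serialization above can be sketched as follows (a minimal illustration; the prompt strings follow the formats in Figure 2, but the helper name and call are ours):

```python
def build_input(types, demonstrations, text):
    """Serialize instruction, demonstrations, and text into one input sequence."""
    instruction = "Target types: " + "; ".join(types)
    demo_block = " ".join(
        f"Text: {d_text} Entities: {d_entities}"
        for d_text, d_entities in demonstrations
    )
    # The decoder continues after the trailing "Entities:" marker.
    return f"{instruction} {demo_block} Text: {text} Entities:"

x = build_input(
    ["disease", "virus"],
    [("Cancer is a leading cause of death worldwide.", "Cancer is disease.")],
    "SARS-CoV-2 causes COVID-19.",
)
```

Given such an input, the model is expected to generate extractions like "SARS-CoV-2 is virus. COVID-19 is disease.".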
The success of in-context NER depends on two critical abilities: the in-context learning ability and the extraction ability. For in-context learning, the models should be able to implicitly construct accurate extractors of new entity types by following the instruction and capturing the knowledge in demonstrations. In this way, we can see a PLM as a meta-function, i.e., a function of extractors whose input is (instruction, demonstrations) and whose output is an entity extractor. For extraction, the models should be able to locate specific spans and categorize them into target entity types. The following section demonstrates how to inject such an in-context learning ability into PLMs and construct an effective in-context NER model.

Meta-Function Pre-training for In-Context NER
In this section, we explain how to inject in-context NER ability into pre-trained language models (PLMs). Although large-scale PLMs like GPT-3 have demonstrated in-context learning ability, this capability is not always controllable or predictable. Additionally, unlike classification and question-answering tasks, which align with the pre-training objective of language models (i.e., producing natural text output), NER requires more complex span extraction and type specification. As a result, Gutiérrez et al. (2022) show that LMs are not well-suited for in-context NER tasks. In this paper, we propose meta-function pre-training, an algorithm that can inject in-context NER ability into PLMs in a controllable and predictable way. Specifically, we model PLMs as a meta-function (Akyürek et al., 2022) for NER, λ_(instruction, demonstrations, text).M, and a new entity extractor can be implicitly constructed by applying new instructions and demonstrations to PLMs, i.e., (λ.M)(instructions, demonstrations) → F, where F is a new entity extractor F: text → entities. Based on the meta-function formulation, we further pre-train PLMs for in-context NER abilities by:
• optimizing PLMs via a meta-function loss, so that the implicitly (instruction, demonstration)-constructed extractor F will be as close as possible to an explicitly fine-tuned surrogate golden extractor;
• optimizing PLMs via an extraction loss, so that the in-context NER model can effectively locate and categorize entities in a text.
The details are described in the following.

Pre-training Settings
Pre-training Corpus Construction To continually pre-train PLMs for in-context NER, we first collect an in-context pre-training NER corpus D_in-context = {x_1, x_2, ..., x_n}, where each x is an in-context NER task represented as a tuple (instruction, demonstrations, text, entities).
Specifically, we sample each in-context NER task x from a traditional NER corpus D_NER, where each NER instance is a (text, entities) pair, as follows:
1. In-context Task Sampling: To construct an in-context NER task x = (instruction, demonstrations, text, entities): (1) we first sample N target entity types from D_NER to form the instruction, and sample K instances for each type to form the demonstrations; (2) we then sample the text and the entities of x either by randomly sampling an instance of the N target entity types, or by randomly sampling from instances of other entity types, i.e., instances whose extractions are NIL. We sample NIL instances because in real-world applications many instances contain no target entities; NIL instances are sampled with a predefined proportion γ.
2. Type Anonymization: To ensure the model relies on in-context demonstrations for entity knowledge and to avoid overfitting to entity type names, we anonymize entity types by randomly substituting them with a set of 99 pre-defined type indicators {<type1>, ..., <type99>}, rather than directly using the original type names such as Disease and Virus; each type name is substituted with probability 80%. We found this anonymization strategy can significantly improve the in-context learning ability of PLMs.
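The two steps above can be sketched as follows (a simplified illustration; function names are ours, and NIL-query sampling with proportion γ is omitted for brevity):

```python
import random

# The 99 pre-defined type indicators used for anonymization.
TYPE_INDICATORS = [f"<type{i}>" for i in range(1, 100)]

def anonymize(types, p=0.8, rng=random):
    """Map each type name to a distinct random indicator with probability p."""
    free = TYPE_INDICATORS.copy()
    rng.shuffle(free)
    return {t: (free.pop() if rng.random() < p else t) for t in types}

def sample_task(corpus_by_type, n_types, k_shots, rng=random):
    """Sample one in-context task: N target types, K demos each, one query."""
    types = rng.sample(sorted(corpus_by_type), n_types)
    demos = {t: rng.sample(corpus_by_type[t], k_shots) for t in types}
    query = rng.choice(corpus_by_type[rng.choice(types)])
    return types, demos, query
```

With p=0.8, roughly four out of five type names in each sampled task are replaced by indicators, so the model cannot rely on memorized type-name semantics.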
Pre-training Loss Based on the in-context pre-training corpus D_in-context, we pre-train our in-context NER model by optimizing the loss:

L = L_extraction + α · L_meta-function

where L_meta-function is the meta-function loss, which ensures PLMs can implicitly generate accurate entity extractors (Section 4.2); L_extraction is the extraction loss, which ensures PLMs have good extraction ability (Section 4.3); and α is the coefficient of the meta-function loss.
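A minimal sketch of how the two losses are combined (the function name is ours; in practice the two terms are computed as described in the following subsections):

```python
def pretraining_loss(extraction_loss, meta_function_loss, alpha=1.0):
    """Combined objective: extraction loss plus the weighted meta-function loss."""
    return extraction_loss + alpha * meta_function_loss
```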

Meta-function Pre-training
As mentioned above, a good in-context NER model should be able to implicitly construct an accurate entity extractor by partially applying PLMs with instruction I and demonstrations D:

F = (λ.M)(I, D)

Figure 3: Overview of our meta-function pre-training.
Our goal is to ensure that the extractor F constructed from (instruction, demonstrations) closely resembles the golden extraction function. To approximate the golden extraction function, we use a surrogate strategy: the surrogate extraction function is the encoder fine-tuned on the demonstrations.
For example, given the instruction and demonstrations in Figure 2, we want PLMs to implicitly build an accurate extractor for Disease and Virus. Therefore, if we knew the golden extraction function F* for the target entity types, we could optimize PLMs for in-context NER ability by minimizing the distance ||F* − F||.
Unfortunately, the golden extraction function F* is unknown. In this paper, we approximate F* using a surrogate extractor, which is the counterpart fine-tuned on the demonstrations D. That is, for each in-context pre-training task x, we first recover all NER (text, entities) instances from x as x′, then fine-tune the model on x′ and use the fine-tuned encoder as the surrogate of F*. The overall meta-function pre-training is shown in Figure 3.
Formally, given instruction I, demonstrations D, and text T, we first feed them into the encoder and obtain the features of I, D, and T:

[l_1, ..., l_n, d_1, ..., d_m, t_1, ..., t_k] = Encoder(I; D; T)

Then we obtain the feature of the implicitly generated function F using the features of instruction I and text T, ignoring the features of D: F = [l_1, ..., l_n, t_1, ..., t_k]. In Figure 3, the feature F can be seen as the output of the Disease and Virus extractor F.
To obtain the feature of the fine-tuned counterpart, we perform a one-step gradient descent on the encoder using the instances in the demonstrations D and get the surrogate encoder, which can be seen as an approximation of the golden F*. Note that this fine-tuning operation is performed on a copy of the model, so it has no impact on the parameters of the original model. In the example in Figure 3, the surrogate encoder is a Disease and Virus extractor. After performing the one-step update, we feed instruction and text [I; T] into the surrogate encoder to get their features:

F′ = [l′_1, ..., l′_n, t′_1, ..., t′_k] = Encoder′(I; T)

where F′ contains the features of instruction I and text T. In the example in Figure 3, the feature F′ can be seen as the estimated output of the golden extractor F* for the Virus and Disease entity types. Then, we pre-train our in-context NER model to be a good meta-function by making the outputs of F and F* consistent, i.e., minimizing the distance between F and F′. The meta-function loss is:

L_meta-function = d(F, F′)

where d(·) is the Euclidean distance. Note that when calculating the gradient of L_meta-function, F′ is treated as a constant. To this end, the meta-function gradient can be estimated as:

∂L_meta-function / ∂θ_encoder ≈ ∂d(F, F′) / ∂θ_encoder

where θ_encoder denotes the parameters of the encoder and X = [I; D; T] is the input. The estimated gradient is used to update the parameters of the encoder.
In this way, the in-context NER models will be trained to be a good meta-function (Akyürek et al., 2022), which can also be seen as an ability for implicit fine-tuning (Dai et al., 2022;von Oswald et al., 2022).
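The procedure above can be sketched numerically as follows. This is a schematic sketch only: a tiny linear map stands in for the Transformer encoder, and a squared-error demonstration loss stands in for the actual fine-tuning objective; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))            # toy "encoder" parameters

def encode(W, X):
    return X @ W                       # stand-in for Transformer encoding

# Token features for [instruction; text] and for the demonstration instances.
IT = rng.normal(size=(5, 8))
D_in, D_tgt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))

# 1) Features of the implicitly constructed extractor F (original encoder).
F = encode(W, IT)

# 2) Surrogate golden extractor: copy the encoder and take one gradient
#    step on a demonstration loss (here: squared error against demo targets).
lr = 0.1
grad = 2 * D_in.T @ (encode(W, D_in) - D_tgt) / D_in.shape[0]
W_surrogate = W - lr * grad            # the original W is left untouched

# 3) Meta-function loss: Euclidean distance to the surrogate's features,
#    which are treated as constants (no gradient flows through F_prime).
F_prime = encode(W_surrogate, IT)
meta_loss = np.linalg.norm(F - F_prime)
```

Minimizing this distance pushes the untouched encoder toward producing, from instruction and demonstrations alone, the same representations its one-step fine-tuned copy produces.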

Extraction Function Pre-training
Besides the in-context learning ability, we also pre-train PLMs to be good extractors via an extraction loss. Given instruction I, demonstrations D, and text T, the sequence-to-sequence entity extractor directly models the generation probability token by token in an auto-regressive way. Formally, we optimize the model parameters θ by minimizing the negative log-likelihood of in-context instances:

L_extraction = −Σ_i log p(y_i | y_<i, X; θ)

And the extraction gradient is computed as:

∂L_extraction / ∂θ = −Σ_i ∂ log p(y_i | y_<i, X; θ) / ∂θ

To learn the above extraction ability, we design two extraction pre-training tasks: an entity extraction task and a pseudo extraction language modeling task.
Entity Extraction Task. This task trains the ability to extract entities from text. We use both the in-context NER setting, whose input is (instruction, demonstrations, text), and the traditional NER setting, whose input is (instruction, text); the output in both settings is the entities. Note that type anonymization is only conducted in the in-context NER setting.
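The token-level negative log-likelihood can be sketched as follows (the function name is ours, and the probabilities stand in for the decoder's per-token distributions):

```python
import math

def extraction_nll(token_probs):
    """Negative log-likelihood of the gold output, summed token by token.

    token_probs: the decoder's probability p(y_i | y_<i, X) assigned to each
    gold token of the linearized entity sequence.
    """
    return -sum(math.log(p) for p in token_probs)

# A confident model assigns high probability to every gold token,
# yielding a small loss; a perfect model (all probabilities 1.0) yields 0.
loss = extraction_nll([0.9, 0.8, 0.95])
```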

Pseudo Extraction Language Modeling Task.
Because there is a mismatch between the entity extraction task and the original language modeling task, and the size of the NER corpus is usually far smaller than the text corpus used for language modeling pre-training, we design a pseudo extraction LM task to bridge this gap. Specifically, we randomly sample unlabeled sentences from the text corpus and automatically build pseudo extraction (instruction, demonstrations, text, pseudo entities) tasks. For instance, given a demonstration sentence such as "I think this movie is cool and I really like it very much" and a text "I do not like it.": (1) We first choose some spans from the demonstrations (such as "this movie" and "like") and designate them as pseudo entities, assigning them random types from the type indicators; for instance, we consider "this movie" as a pseudo entity of type <type2> and "like" as a pseudo entity of type <type14>. (2) The input of the pseudo extraction task contains the instruction "Target types: <type2>; <type14>" and the demonstrations, where the entities ("this movie" and "like") and other random spans ("very much") in the demonstrations are masked. The text, "Text: I do not like it.", is not masked. (3) The output of the pseudo extraction task is "like is <type14>", since the model must learn from the demonstrations that <type14> corresponds to "like". (4) We also use the traditional NER setting, whose input is (instruction, text). In this setting, the entities in the text are masked as in the demonstrations, e.g., "Target types: this movie; like Text: I [MASK1] not [MASK2] it.", and the output is "Entities: [MASK2] is like.".
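A simplified sketch of building one such pseudo task from unlabeled sentences (the function name is ours; the paper samples pseudo-entity spans and extra distractor spans to mask randomly, which is abbreviated here):

```python
import random

def build_pseudo_task(demo_sentence, query_text, pseudo_entities, rng=random):
    """Turn unlabeled sentences into one pseudo extraction task.

    pseudo_entities: spans of demo_sentence designated as pseudo entities;
    each gets a random type indicator and is masked in the demonstration.
    """
    types = {s: f"<type{rng.randint(1, 99)}>" for s in pseudo_entities}
    instruction = "Target types: " + "; ".join(types.values())
    masked = demo_sentence
    for i, span in enumerate(pseudo_entities, 1):
        masked = masked.replace(span, f"[MASK{i}]")
    answers = " ".join(f"{s} is {t}." for s, t in types.items())
    demo = f"Text: {masked} Entities: {answers}"
    return f"{instruction} {demo} Text: {query_text} Entities:"
```

The model must read the demonstration's (masked span, type indicator) pairs to decide which spans of the unmasked query text to extract.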
We can see that the pseudo extraction LM task benefits in-context NER in two ways. Firstly, it can significantly increase the size and diversity of in-context NER pre-training tasks from a large-scale unlabeled corpus. Secondly, this task pre-trains PLMs with a mixture of extraction targets and span prediction, thereby preventing PLMs from overfitting to the extraction task alone.
When pre-training, we transform the NER and language modeling tasks into a uniform format and sample input instances alternately.

Experiments
This section evaluates our method by conducting experiments on few-shot NER settings.

Experimental Settings
Pre-training settings. Following Chen et al. (2022a), we build a large-scale distant NER dataset by aligning Wikipedia and Wikidata. Specifically, the dataset is made from Wikipedia text with hyperlinks to Wikidata, where we label entity types using the linked Wikidata items' attributes. Entity types are gathered from Wikidata's SubclassOf and InstanceOf attributes for each span. We filter out ambiguous and low-frequency types (occurrences < 100k) to obtain higher-quality demonstrations. Finally, we retain 2046 types and 55 million (text, entities) pairs, and use a 40/15 million split for training/validation. We sample 5 million in-context tasks for training and 10k for validation, where for each task the type number N is 10 and the per-type instance number K is 10. We employ the T5-v1.1-large (Raffel et al., 2020) model as the initial model for MetaNER and further pre-train for 500k steps with learning rate 5e-5 and 10k warm-up steps. In this paper, we refer to the pre-trained model as MetaNER.

Few-shot settings. Our experiments follow the standard k-shot NER setting (Huang et al., 2021): for each entity type, we sample k training instances as in-context demonstrations. We evaluate models by micro-F1 and report the average performance over 10 repetitions of each experiment.

Main Results
The experimental results are shown in Table 1. We can see that: 1) Few-shot NER is challenging even for large language models, while MetaNER achieves good in-context NER performance. Compared with the best-performing PLMs, MetaNER achieves an 8.4% F1 improvement. Moreover, due to the gap between the language modeling task and the NER task, large language models achieve poor in-context learning performance on some datasets.
2) Our in-context NER method can achieve robust performance, even under a large sourcetarget domain gap.
Compared with the best-performing metric-based NER models, MetaNER-base and MetaNER achieve 26.8% and 40.7% F1 improvements, respectively. The performance improvement is more significant when the source-target domain gap is larger, i.e., on NCBI-disease (biology domain) and SEC-filings (finance domain).
3) Meta-function pre-training can effectively inject in-context learning ability into both small and large PLMs. Both MetaNER-base and MetaNER achieve impressive performance in 1-shot and 5-shot settings, which verifies that MetaNER can effectively inject in-context NER ability into small PLMs, even though in-context learning has so far been seen as an ability that emerges only in large language models such as GPT-3.

Ablation Studies
To analyze the effects of type anonymization, meta-function pre-training, entity extraction pre-training, and pseudo extraction LM pre-training, we conduct the following ablation experiments: (1) MetaNER w/o MF: remove the meta-function pre-training; (2) MetaNER w/o LM: remove the pseudo extraction LM pre-training; (3) MetaNER w/o anonymization: use the original entity type names in both pre-training and in-context NER, without type anonymization. The results are shown in Table 2, from which we can see that:
1) Meta-function pre-training is critical for in-context learning ability. Removing the meta-function pre-training causes results to drop significantly when the domain gap is larger, i.e., on NCBI-disease. At the same time, meta-function pre-training helps the model make more precise predictions.
2) The pseudo extraction LM task significantly benefits in-context NER. We found that MetaNER w/o LM results in a performance drop compared to MetaNER. We believe this is because, although it uses an automatically constructed pseudo dataset, this task significantly improves the size and diversity of in-context NER tasks while retaining good language modeling ability.
3) Type name anonymization prevents the in-context NER model from overfitting to type names, and therefore enhances the in-context learning ability. Ablating type name anonymization results in a 5.7% performance drop in Table 2. We believe this is because, with original type names, models tend to memorize entity knowledge via those names, and thus do not learn to capture entity knowledge from demonstrations on-the-fly.

Effects of Meta-function Pre-training
One main idea of this paper is that an in-context NER model can be viewed as a meta-function that implicitly builds new entity extractors. To assess whether meta-function pre-training indeed trains a good meta-function, we sample 1000 instances from each dataset and show, in Figure 4, the difference between the (instruction, demonstrations)-initialized entity extractor F and the surrogate entity extractor F′, i.e., ||F − F′|| from Section 4.2. We can see that meta-function pre-training equips PLMs with good meta-function ability, i.e., after pre-training, the (instruction, demonstrations)-initialized entity extractor is significantly closer to its fine-tuned counterpart.
All the models are implemented by us except SDNet.

In-context Learning vs Fine-tuning
MetaNER can also be directly fine-tuned using traditional NER instances. We employ the same fine-tuning approach as previous works (Huang et al., 2021; Lu et al., 2022b; Chen et al., 2022a). Following Lu et al. (2022b), we also implement the rejection mechanism when fine-tuning T5-v1.1-large and MetaNER to achieve better few-shot performance.
To compare in-context NER with fine-tuned NER, Table 3 reports the performance of the fine-tuned counterpart of MetaNER, MetaNER-FT (its training is similar to that of the surrogate entity extractor, but with multi-step gradient descent until convergence), together with several fine-tuned few-shot NER baselines. We can see that: 1) MetaNER is an effective architecture, which achieves good performance in both in-context learning and fine-tuning settings; 2) Currently, fine-tuning can achieve better performance than its in-context learning counterpart. We believe this is because fine-tuned models' parameters can be specialized to specific entity types, whereas in-context learning must generalize to different types on-the-fly, i.e., a generalization-specialization trade-off. We believe this also verifies the reasonableness of using a fine-tuned surrogate extractor to approximate the golden extractor.

Conclusion
In this paper, we propose an in-context learning-based NER approach and model PLMs as a meta-function, which can inject in-context NER ability into PLMs and recognize entities of new types on-the-fly using only a few demonstrative instances. Experimental results show that our method is effective for in-context NER. For future work, we will extend our method to different NLP tasks like event extraction and relation extraction.

Limitations
In-context learning is a useful ability. This paper focuses only on in-context named entity recognition, leaving the learning of other NLP tasks' in-context abilities for future work.
Currently, we learn in-context learning via meta-function pre-training, by comparing an in-context extraction function and a fine-tuned surrogate extraction function at the representation level of their encoders. There are two approximations here: the fine-tuned surrogate extraction function approximates the golden extraction function, and the difference between representations approximates the divergence between functions. We believe both approximations can be further improved for better and faster in-context learning.