Few-NERD: A Few-shot Named Entity Recognition Dataset

Recently, considerable literature has grown up around the theme of few-shot named entity recognition (NER), but little benchmark data has been published that specifically targets this practical and challenging task. Current approaches collect existing supervised NER datasets and re-organize them into the few-shot setting for empirical study. These strategies conventionally aim to recognize coarse-grained entity types with few examples, while in practice, most unseen entity types are fine-grained. In this paper, we present Few-NERD, a large-scale human-annotated few-shot NER dataset with a hierarchy of 8 coarse-grained and 66 fine-grained entity types. Few-NERD consists of 188,238 sentences from Wikipedia, comprising 4,601,160 words, each of which is annotated as context or as part of a two-level entity type. To the best of our knowledge, this is the first few-shot NER dataset and the largest human-crafted NER dataset. We construct benchmark tasks with different emphases to comprehensively assess the generalization capability of models. Extensive empirical results and analysis show that Few-NERD is challenging and the problem requires further research. The Few-NERD dataset and the baselines will be publicly available to facilitate the research on this problem.


Introduction
Named entity recognition (NER), as a fundamental task in information extraction, aims to locate and classify named entities from unstructured natural language. A considerable number of approaches equipped with deep neural networks have shown promising performance (Chiu and Nichols, 2016) on fully supervised NER. Notably, pre-trained language models (e.g., BERT (Devlin et al., 2019a)) with an additional classifier achieve significant success on this task and gradually become the base paradigm. Such studies demonstrate that deep models could yield remarkable results accompanied by a large amount of annotated corpora.
With the emergence of knowledge from various domains, named entities, especially ones that need professional knowledge to understand, are difficult to annotate manually on a large scale. Under this circumstance, studying NER systems that could learn unseen entity types with few examples, i.e., few-shot NER, plays a critical role in this area. There is a growing body of literature that recognizes the importance of few-shot NER and contributes to the task (Hofer et al., 2018; Fritzler et al., 2019; Yang and Katiyar, 2020; Li et al., 2020a; Huang et al., 2020). Unfortunately, there is still no dataset specifically designed for few-shot NER. Hence, these methods collect previously proposed supervised NER datasets and reorganize them into a few-shot setting. Common options of datasets include OntoNotes (Weischedel et al., 2013), CoNLL'03 (Tjong Kim Sang, 2002), WNUT'17 (Derczynski et al., 2017), etc. These research efforts of few-shot learning for named entities mainly face two challenges. First, most datasets used for few-shot learning have only 4-18 coarse-grained entity types, making it hard to construct an adequate variety of "N-way" meta-tasks and learn correlation features. In reality, we observe that most unseen entities are fine-grained. Second, because of the lack of benchmark datasets, the settings of different works are inconsistent (Huang et al., 2020; Yang and Katiyar, 2020), leading to unclear comparisons. To sum up, these methods make promising contributions to few-shot NER, but a dedicated dataset is urgently needed to provide a unified benchmark for rigorous comparisons.
To alleviate the above challenges, we present a large-scale human-annotated few-shot NER dataset, FEW-NERD, which consists of 188.2k sentences extracted from Wikipedia articles, with 491.7k entities manually annotated by well-trained annotators (Section 4.3). To the best of our knowledge, FEW-NERD is the first dataset specially constructed for few-shot NER and also one of the largest human-annotated NER datasets (statistics in Section 5.1). We carefully design an annotation schema of 8 coarse-grained entity types and 66 fine-grained entity types by conducting several pre-annotation rounds (Section 4.1). In contrast, among the most widely used NER datasets, CoNLL'03 has 4 entity types, WNUT'17 has 6 entity types and OntoNotes has 18 entity types (7 of them are value types). The variety of entity types makes FEW-NERD contain rich contextual features with a finer granularity for better evaluation of few-shot NER. The distribution of the entity types in FEW-NERD is shown in Figure 1; more details are reported in Section 5.1. We conduct an analysis of the mutual similarities among all the entity types of FEW-NERD to study knowledge transfer (Section 5.2). The results show that our dataset can provide sufficient correlation information between different entity types for few-shot learning.
For benchmark settings, we design three tasks on the basis of FEW-NERD, including a standard supervised task (FEW-NERD (SUP)) and two few-shot tasks (FEW-NERD (INTRA) and FEW-NERD (INTER)); see Section 6 for more details. FEW-NERD (SUP), FEW-NERD (INTRA), and FEW-NERD (INTER) assess instance-level generalization, type-level generalization and knowledge transfer of NER methods, respectively. We implement models based on the recent state-of-the-art approaches and evaluate them on FEW-NERD (Section 7). Empirical results show that FEW-NERD is challenging in all three settings. We also conduct sets of subsidiary experiments to analyze promising directions of few-shot NER. Hopefully, the research of few-shot NER could be further facilitated by FEW-NERD.

Related Work
As a pivotal task of information extraction, NER is essential for a wide range of technologies (Cui et al., 2017; Li et al., 2019b; Ding et al., 2019; Shen et al., 2020). A considerable number of NER datasets have been proposed over the years. For example, CoNLL'03 (Tjong Kim Sang, 2002) is regarded as one of the most popular datasets, which is curated from Reuters News and includes 4 coarse-grained entity types. Subsequently, a series of NER datasets from various domains are proposed (Balasuriya et al., 2009; Ritter et al., 2011; Weischedel et al., 2013; Stubbs and Uzuner, 2015; Derczynski et al., 2017). These datasets formulate a sequence labeling task and most of them contain 4-18 entity types. Among them, due to the high quality and size, OntoNotes 5.0 (Weischedel et al., 2013) is considered one of the most widely used NER datasets recently.
As approaches equipped with deep neural networks have shown satisfactory performance on NER with sufficient supervision (Lample et al., 2016; Ma and Hovy, 2016), few-shot NER has received increasing attention (Hofer et al., 2018; Fritzler et al., 2019; Yang and Katiyar, 2020; Li et al., 2020a). Few-shot NER is a considerably challenging and practical problem that could facilitate the understanding of textual knowledge for neural models (Huang et al., 2020). Due to the lack of specific benchmarks of few-shot NER, current methods collect existing NER datasets and use different few-shot settings. To provide a benchmark that could comprehensively assess the generalization of models under few examples, we annotate FEW-NERD. To make the dataset practical and close to reality, we adopt a fine-grained schema of entity annotation, which is inspired by and adapted from previous fine-grained entity recognition studies (Ling and Weld, 2012; Gillick et al., 2014; Choi et al., 2018; Ringland et al., 2019).

3 Problem Formulation

3.1 Named Entity Recognition
NER is normally formulated as a sequence labeling problem. Specifically, for an input sequence of tokens $x = \{x_1, x_2, ..., x_t\}$, NER aims to assign each token $x_i$ a label $y_i \in Y$ to indicate either that the token is part of a named entity (such as Person, Organization, Location) or that it does not belong to any entity (denoted as the O class), $Y$ being a set of pre-defined entity types.

Few-shot Named Entity Recognition
N-way K-shot learning is conducted by iteratively constructing episodes. For each episode in training, N classes (N-way) and K examples (K-shot) for each class are sampled to build a support set $S_{train} = \{x^{(i)}, y^{(i)}\}_{i=1}^{N \times K}$, and K examples for each of the N classes are sampled to construct a query set $Q_{train} = \{x^{(j)}, y^{(j)}\}_{j=1}^{N \times K}$, with $S \cap Q = \emptyset$. Few-shot learning systems are trained by predicting labels of the query set $Q_{train}$ with the information of the support set $S_{train}$. The supervision of $S_{train}$ and $Q_{train}$ is available in training. In the testing procedure, all the classes are unseen in the training phase, and by using the few labeled examples of the support set $S_{test}$, few-shot learning systems need to make predictions for the unlabeled query set $Q_{test}$ ($S \cap Q = \emptyset$). However, in a sequence labeling problem like NER, a sentence may contain multiple entities from different classes. It is imperative to sample examples at the sentence level, since contextual information is crucial for sequence labeling problems, especially for NER. Thus the sampling is more difficult than in conventional classification tasks like relation extraction (Han et al., 2018). Some previous works (Yang and Katiyar, 2020; Li et al., 2020a) use greedy-based sampling strategies to iteratively judge if a sentence could be added into the support set, but the constraint becomes increasingly strict during the sampling. For example, in a 5-way 5-shot setting, if the support set already has 4 classes with 5 examples and 1 class with 4 examples, the next sampled sentence must contain only a single entity of the one remaining class to strictly meet the requirement of 5-way 5-shot. This is not suitable for FEW-NERD since it is densely annotated with entities. Thus, as shown in Algorithm 1, we adopt an N-way K∼2K-shot setting in our paper, the primary principle of which is to ensure that each class in S contains K∼2K examples, effectively alleviating the limitations of sampling.
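The K∼2K constraint described above can be illustrated with a minimal Python sketch. It is not the authors' Algorithm 1: the function name `greedy_sample`, the representation of a sentence as the list of its mention types, and the choice to ignore mentions of types outside the episode are all assumptions made for illustration.

```python
import random


def greedy_sample(dataset, target_types, k, seed=0):
    """Greedily build a support set with K to 2K mentions per target type.

    dataset: list of sentences, each given as the list of entity types of
             its mentions (tokens are omitted because only counts matter).
    Returns (indices of the chosen sentences, mention count per type).
    """
    rng = random.Random(seed)
    order = list(range(len(dataset)))
    rng.shuffle(order)
    counts = {t: 0 for t in target_types}
    chosen = []
    for idx in order:
        # Assumption: mentions of types outside the episode are ignored.
        added = {}
        for t in dataset[idx]:
            if t in counts:
                added[t] = added.get(t, 0) + 1
        if not added:
            continue  # no mention of any target type
        if all(counts[t] >= k for t in added):
            continue  # every type in this sentence already has K mentions
        if any(counts[t] + c > 2 * k for t, c in added.items()):
            continue  # would push some type past the 2K ceiling
        chosen.append(idx)
        for t, c in added.items():
            counts[t] += c
        if all(c >= k for c in counts.values()):
            return chosen, counts
    raise ValueError("dataset too small to reach K mentions for every type")
```

Because a sentence is accepted whenever it keeps every class at or below 2K, a sentence with several entities can still be used late in the sampling, which is the relaxation that the strict K-shot requirement does not allow.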
Algorithm 1: Greedy N-way K∼2K-shot sampling algorithm
Input: Dataset X, Label set Y, N, K
Output: output result
1 S ← ∅; // Init the support set
// Init the count of entity types

4 Collection of FEW-NERD

Schema of Entity Types
The primary goal of FEW-NERD is to construct a fine-grained dataset that could specifically be used in the few-shot NER scenario. Hence, schemas of traditional NER datasets such as CoNLL'03 and OntoNotes, which only contain 4-18 coarse-grained types, could not meet the requirements. The schema of FEW-NERD is inspired by FIGER (Ling and Weld, 2012), which contains 112 entity tags with good coverage. On this basis, we make some modifications according to the practical situation. It is worth noting that FEW-NERD focuses on named entities, omitting value/numerical/time/date entity types (Weischedel et al., 2013; Ringland et al., 2019) like Cardinal, Day, Percent, etc. First, we modify the FIGER schema into a two-level hierarchy to incorporate simple domain information (Gillick et al., 2014). The coarse-grained types are {Person, Location, Organization, Art, Building, Product, Event, Miscellaneous}. Then we statistically count the frequency of entity types in the automatically annotated FIGER. After removing entity types with low frequency, 80 fine-grained types remain. Finally, to ensure the practicality of the annotation process, we conduct rounds of pre-annotation and make further modifications to the schema. For example, we combine the types of Country, Province/State, City, Restrict into a class GPE, since it is difficult to distinguish these types only based on context (especially GPEs at different times). For another example, we create a Person-Scholar type, because in the pre-annotation step, we find that there are numerous person entities that express the semantics of research, such as mathematician, physicist, chemist, biologist, paleontologist, but the FIGER schema does not define this kind of entity type. We also conduct rounds of manual denoising to select types with truly high frequency.
Consequently, the finalized schema of FEW-NERD includes 8 coarse-grained types and 66 fine-grained types, which are shown in detail with selected examples in the Appendix.

Paragraph Selection
The raw corpus we use is the entire Wikipedia dump in English, which has been widely used in constructions of NLP datasets (Han et al., 2018; Yang et al., 2018; Wang et al., 2020). Wikipedia contains a large variety of entities and rich contextual information for each entity. FEW-NERD is annotated at the paragraph level, and it is crucial to effectively select paragraphs with sufficient entity information. Moreover, the category distribution of the data is expected to be balanced since the data is applied in a few-shot scenario. This is also a key difference between FEW-NERD and previous NER datasets, whose entity distributions are usually considerably uneven. In order to do so, we construct a dictionary for each fine-grained type by automatically collecting entity mentions annotated in FIGER; the dictionaries are then manually denoised. We develop a search engine to retrieve paragraphs including entity mentions of the distant dictionary. For each entity, we choose 10 paragraphs and construct a candidate set. Then, for each fine-grained class, we randomly select 1000 paragraphs for manual annotation. Eventually, 66,000 paragraphs are selected, covering 66 fine-grained entity types, and each paragraph contains an average of 61.3 tokens.

Human Annotation
As named entities are expected to be context-dependent, annotation of named entities is complicated, especially with such a large number of entity types. For example, in the sentence shown in Table 1, "London is the fifth album by the British rock band Jesus Jones...", London should be annotated as an entity of Art-Music rather than Location-GPE. Such a situation requires that the annotator has basic linguistic training and can make reasonable judgments based on the context.
Annotators of FEW-NERD include 70 annotators and 10 experienced experts. All the annotators have linguistic knowledge and are instructed with detailed and formal annotation principles. Each paragraph is independently annotated by two well-trained annotators. Then, an experienced expert goes over the paragraph for possible wrong or omitted annotations, and makes the final decision. With 70 annotators participating, each annotator spends an average of 32 hours on the annotation process. We ensure that all the annotators are fairly compensated at market price according to their workload (the number of examples per hour). The data is annotated and submitted in batches, and each batch contains 1000∼3000 sentences. To ensure the quality of FEW-NERD, for each batch of data, we randomly select 10% of the sentences and conduct double-checking. If the accuracy of the annotation is lower than 95% (measured at the sentence level), the batch will be re-annotated. Furthermore, we calculate Cohen's Kappa (Cohen, 1960) to measure the agreement between two annotators; the result is 76.44%, which indicates a high degree of consistency.
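For readers unfamiliar with the agreement measure, Cohen's Kappa is $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance. The following is a minimal sketch under the assumption that agreement is measured over aligned label sequences from two annotators; the text above does not state the unit of comparison, and the function name `cohens_kappa` is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of positions with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

For instance, `cohens_kappa(["O", "O", "PER", "LOC"], ["O", "O", "PER", "PER"])` has $p_o = 0.75$ and $p_e = 0.375$, giving $\kappa = 0.6$.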

5 Data Analysis

5.1 Size and Distribution of FEW-NERD
FEW-NERD is not only the first few-shot dataset for NER, but also one of the biggest human-annotated NER datasets. We report the statistics of the number of sentences, tokens, entity types and entities of FEW-NERD and several widely used NER datasets in Table 2, including CoNLL'03, WikiGold, OntoNotes 5.0, WNUT'17 and I2B2. We observe that although OntoNotes and I2B2 are considered large-scale datasets, FEW-NERD is significantly larger than all these datasets. Moreover, FEW-NERD contains more entity types and annotated entities. As introduced in Section 4.2, FEW-NERD is designed for few-shot learning and the distribution should not be severely uneven. Hence, we balance the dataset by selecting paragraphs through a distant dictionary. The data distribution is illustrated in Figure 1, where Location (especially GPE) and Person are the entity types with the most examples. Although utilizing a distant dictionary to balance the entity types could not produce a fully balanced data distribution, it still ensures that each fine-grained type has a sufficient number of examples for few-shot learning.

Knowledge Correlations among Types
Knowledge transfer is crucial for few-shot learning (Li et al., 2019a). To explore the knowledge correlations among all the entity types of FEW-NERD, we conduct an empirical study of entity type similarities in this section. We train a BERT-Tagger (details in Section 7.1) on 70% arbitrarily selected data of FEW-NERD and use 10% of the data to select the model with the best performance (this is actually the setting of FEW-NERD (SUP) in Section 6.1). After obtaining a contextualized encoder, we produce entity mention representations of the remaining 20% of FEW-NERD. Then, for each fine-grained type, we randomly select 100 instances of entity embeddings. We compute the pairwise dot products among the entity embeddings of every two types and average them to obtain the similarities among types, which are illustrated in Figure 2. We observe that entity types sharing an identical coarse-grained type typically have larger similarities, resulting in an easier knowledge transfer. In contrast, although some of the fine-grained types have large similarities, most of them across coarse-grained types share little correlation due to distinct contextual features. This result is consistent with intuition. Moreover, it inspires our benchmark setting from the perspective of knowledge transfer (see Section 6.2).
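The averaging of pairwise dot products can be sketched as follows. This is only an illustration on toy two-dimensional vectors, not the BERT embeddings used in the study; the helper names are hypothetical, and including self-pairs when a type is compared with itself is an assumption, since the text does not specify how the diagonal is computed.

```python
def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def type_similarity(embs_a, embs_b):
    """Average dot product over all pairs, one embedding from each type."""
    total = sum(dot(u, v) for u in embs_a for v in embs_b)
    return total / (len(embs_a) * len(embs_b))


def similarity_matrix(embs_by_type):
    """Return the sorted type names and the matrix of type similarities."""
    types = sorted(embs_by_type)
    matrix = [
        [type_similarity(embs_by_type[s], embs_by_type[t]) for t in types]
        for s in types
    ]
    return types, matrix
```

Since the dot product is bilinear, the average over all pairs equals the dot product of the two mean embeddings, so each cell of such a heat map can also be read as the similarity between two type centroids.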

Benchmark Settings
We collect and manually annotate 188,238 sentences with 66 fine-grained entity types in total, which makes FEW-NERD one of the largest human-annotated NER datasets. To comprehensively exploit such rich information of entities and contexts, as well as evaluate the generalization of models from different perspectives, we construct three tasks based on FEW-NERD (statistics are reported in Table 3).

Table 2: Statistics of widely used NER datasets (recovered rows).
Dataset | # Sentences | # Tokens | # Entities | # Entity Types | Domain
WikiGold (Balasuriya et al., 2009) | 1.7k | 39k | 3.6k | 4 | General
OntoNotes (Weischedel et al., 2013) | 103.8k | 2067k | 161.8k | 18 | General
WNUT'17 (Derczynski et al., 2017) | 4.7k | 86.1k | 3.1k | 6 | SocialMedia
I2B2 (Stubbs and Uzuner, 2015) | 107

Models
Recent studies show that pre-trained language models with deep transformers (e.g., BERT (Devlin et al., 2019a)) have become a strong encoder for NER (Li et al., 2020b). We thus follow the empirical settings and use BERT as the backbone encoder in our experiments. We denote the parameters as $\theta$ and the encoder as $f_\theta$. Given a sequence $x = \{x_1, ..., x_n\}$, for each token $x_i$, the encoder produces contextualized representations as:
$$h = [h_1, ..., h_n] = f_\theta([x_1, ..., x_n]) \quad (1)$$
Specifically, we implement four BERT-based models for supervised and few-shot NER, which are BERT-Tagger (Devlin et al., 2019b), Proto-BERT (Snell et al., 2017), NNShot (Yang and Katiyar, 2020) and StructShot (Yang and Katiyar, 2020).
BERT-Tagger. As stated in Section 6.1, we construct a standard supervised task based on FEW-NERD, thus we implement a simple but strong baseline BERT-Tagger for supervised NER. BERT-Tagger is built by adding a linear classifier on top of BERT and trained with a cross-entropy objective under a full supervision setting.
The first baseline model we implement is Proto-BERT, which is a method based on the prototypical network (Snell et al., 2017) with a BERT (Devlin et al., 2019a) encoder as the backbone. This approach derives a prototype $z$ for each entity type by computing the average of the embeddings of the tokens that share the same entity type. The computation is conducted in the support set $S$. For the i-th type, the prototype is denoted as $z_i$ and the support set is $S_i$:
$$z_i = \frac{1}{|S_i|} \sum_{x \in S_i} f_\theta(x) \quad (2)$$
In the query set $Q$, for each token $x \in Q$, we first compute the distance between $x$ and all the prototypes. We use the $\ell_2$ distance as the metric function, $d(f_\theta(x), z) = \|f_\theta(x) - z\|_2^2$. Then, through the distances between $x$ and all the prototypes, we compute the prediction probability of $x$ over all types. In the training step, parameters are updated in each meta-task. In the testing step, the prediction is the label of the nearest prototype to $x$. That is, for a support set $S_Y$ with types $Y$ and a query $x$, the prediction process is given as
$$y^* = \arg\min_{y \in Y} d(f_\theta(x), z_y) \quad (3)$$
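The prototype computation and nearest-prototype prediction can be sketched in a few lines. The sketch below works on toy vectors that stand in for the encoder outputs $f_\theta(x)$; it omits BERT and training entirely, and the function names are hypothetical.

```python
def mean(vectors):
    """Component-wise average of a list of equally long vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def build_prototypes(support):
    """support: list of (token embedding, label) pairs from the support set."""
    by_label = {}
    for embedding, label in support:
        by_label.setdefault(label, []).append(embedding)
    # One prototype per label: the average of its token embeddings.
    return {label: mean(embs) for label, embs in by_label.items()}


def predict(query_embedding, prototypes):
    """Return the label whose prototype is nearest to the query token."""
    return min(
        prototypes,
        key=lambda label: squared_l2(query_embedding, prototypes[label]),
    )
```

Note that the O class receives a prototype like any other label in this sketch, so context tokens are classified by the same nearest-prototype rule.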

NNShot & StructShot
NNShot and StructShot (Yang and Katiyar, 2020) are the state-of-the-art methods based on token-level nearest neighbor classification. In our experiments, we use BERT as the backbone encoder to produce contextualized representations for fair comparison. Different from the prototype-based method, NNShot determines the tag of one query based on the token-level distance, which is computed as $d(f_\theta(x), f_\theta(x')) = \|f_\theta(x) - f_\theta(x')\|_2^2$. Hence, for a support set $S_Y$ with types $Y$ and a query $x$,
$$y^* = \arg\min_{y \in Y} \min_{x' \in S_y} d(f_\theta(x), f_\theta(x'))$$
With the identical basic structure as NNShot, StructShot adopts an additional Viterbi decoder during the inference phase (Hou et al., 2020) (not in the training phase), where we estimate a transition distribution $p(y'|y)$ and an emission distribution $p(y|x)$ and solve the problem:
$$y^* = \arg\max_{y} \prod_{t=1}^{T} p(y_t|x) \times p(y_t|y_{t-1})$$
To sum up, BERT-Tagger is a well-acknowledged baseline that could produce strong results on supervised NER. Proto-BERT, and NNShot & StructShot respectively use prototype-level and token-level similarity scores to tackle the few-shot NER problem. These baselines are strong and representative models of the NER task. For implementation details, please refer to the Appendix.
We evaluate models by considering the query sets $Q_{test}$ of test episodes. We calculate the precision (P), recall (R) and micro F1-score over all test episodes. Instead of the popular BIO schema, we utilize the IO schema in our experiments, using I-type to denote all the tokens of a named entity and O to denote other tokens.
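To make the role of the Viterbi decoder concrete, the following is a minimal sketch of decoding with given emission and transition probabilities. It is not the StructShot implementation: how the two distributions are estimated is left out, the first token is scored by its emission alone (an assumption), and the function name `viterbi` and the data layout are hypothetical.

```python
import math


def viterbi(emissions, transitions, tags):
    """Find the tag sequence maximizing emission times transition scores.

    emissions:   one dict per token, mapping tag -> p(tag | token)
    transitions: dict mapping (previous tag, tag) -> p(tag | previous tag)
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # Assumption: the first token is scored by its emission only.
    score = {t: log(emissions[0][t]) for t in tags}
    back = []
    for emission in emissions[1:]:
        new_score, pointers = {}, {}
        for t in tags:
            best_prev = max(
                tags, key=lambda p: score[p] + log(transitions[(p, t)])
            )
            new_score[t] = (
                score[best_prev]
                + log(transitions[(best_prev, t)])
                + log(emission[t])
            )
            pointers[t] = best_prev
        score = new_score
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: score[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

With transitions that make entering an entity tag unlikely, a token whose emission only weakly prefers an entity tag is decoded as O, whereas picking the best tag for each token independently would label it as an entity.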
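The IO schema and the entity-level metrics can be illustrated with a short sketch. The helper names are hypothetical, and scoring an entity as correct only when both its span and its type match exactly is an assumption about the evaluation, not a description of the released scripts.

```python
def bio_to_io(tags):
    """Map BIO tags to IO tags, e.g. 'B-Loc-GPE' -> 'I-Loc-GPE'."""
    return ["O" if t == "O" else "I-" + t.split("-", 1)[1] for t in tags]


def io_spans(tags):
    """Return (start, end_exclusive, type) for each maximal run of one I-type."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel closes the last run
        if start is not None and tag != tags[start]:
            spans.append((start, i, tags[start][2:]))
            start = None
        if start is None and tag != "O":
            start = i
    return spans


def micro_prf(gold_seqs, pred_seqs):
    """Micro precision, recall and F1 over exact span-and-type matches."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = set(io_spans(gold)), set(io_spans(pred))
        tp += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

One consequence of the IO schema is that two adjacent entities of the same type are merged into a single span, since there is no B- tag to mark where the second one begins.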

The Overall Results
We evaluate all baseline models on the three benchmark settings introduced in Section 6, including FEW-NERD (SUP), FEW-NERD (INTRA) and FEW-NERD (INTER).
Supervised NER. As mentioned in Section 6.1, we first split FEW-NERD as a standard supervised NER dataset. As shown in Table 4, BERT-Tagger yields promising results on the two widely used supervised datasets. The F1-scores are 91.34% and 89.11%, respectively. However, the model suffers a grave drop in performance on FEW-NERD (SUP) because the number of types of FEW-NERD (SUP) is larger than that of the others. The results indicate that FEW-NERD is challenging in the supervised setting and worth studying.
We further analyze the performance on different entity types (see Figure 3). We find that the model achieves the best performance on the Person type and yields the worst performance on the Product type. For almost all the coarse-grained types, the Coarse-Other type has the lowest F1-score.

Error Analysis
We conduct error analysis to explore the challenges of FEW-NERD; the results are reported in Table 7. We choose the setting of FEW-NERD (INTER) because the test set contains all the coarse-grained types. We analyze the errors of models from two perspectives. Span Error denotes misclassification at the token level. If an O token is misclassified as part of an entity, i.e., I-type, it is an FP case, and if a token with the type I-type is misclassified as O, it is an FN case. Type Error indicates the misclassification of entity types when the spans are correctly classified. A "Within" error means that the entity is misclassified as another type within the same coarse-grained type, while "Outer" means that the entity is misclassified as another type in a different coarse-grained type. As the statistics of type errors may be impacted by the sampled episodes in testing, we conduct 5 rounds of experiments and report the average results. The results demonstrate that the token-level accuracy is not that low since most O tokens could be detected. But an entity mention is considered to be wrong if one token is wrong, which becomes the main reason for the challenge of FEW-NERD. If an entity span could be accurately detected, the models could yield relatively good performance on entity typing, indicating the effectiveness of metric learning.
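The four error categories can be illustrated with a small token-level sketch. It is a simplification: type errors are counted here per token rather than per correctly detected span, the coarse type is assumed to be the first component of labels written like "I-Loc-GPE" (following the label style of Table 1), and the function names are hypothetical.

```python
from collections import Counter


def coarse(tag):
    """Coarse-grained part of an IO tag, e.g. 'I-Loc-GPE' -> 'Loc'."""
    return tag.split("-")[1]


def categorize_errors(gold, pred):
    """Count FP, FN, Within and Outer errors over aligned token tags."""
    counts = Counter()
    for g, p in zip(gold, pred):
        if g == p:
            continue
        if g == "O":
            counts["FP"] += 1      # context token predicted as an entity
        elif p == "O":
            counts["FN"] += 1      # entity token predicted as context
        elif coarse(g) == coarse(p):
            counts["Within"] += 1  # wrong fine type, same coarse type
        else:
            counts["Outer"] += 1   # wrong coarse type
    return counts
```

In this sketch, predicting a different fine-grained type under Art for an Art-Music token counts as "Within", while predicting Loc-GPE for it counts as "Outer".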

Conclusion and Future Work
We propose FEW-NERD, a large-scale few-shot NER dataset with fine-grained entity types. This is the first few-shot NER dataset and also one of the largest human-annotated NER datasets. FEW-NERD provides three unified benchmarks to assess approaches of few-shot NER and could facilitate future research in this area. By implementing state-of-the-art methods, we carry out a series of experiments on FEW-NERD, demonstrating that few-shot NER remains a challenging problem that is worth exploring. In the future, we will extend FEW-NERD by adding cross-domain annotations, distant annotations, and finer-grained entity types. FEW-NERD also has the potential to advance the construction of continual knowledge graphs.

A.1 Processing
We use the dump of English Wikipedia, and extract the raw text with WikiExtractor. The NLTK language tool is used for word and sentence tokenization in the preprocessing stage. As stated in Section 4.2, we develop a search engine to index and select paragraphs with key words in the distant dictionaries. Because searching with a linear scan would be extremely slow, we instead adopt a search engine built on Lucene to conduct efficient indexing and searching.

A.2 More Details of the Schema
As stated in Section 4.1, we use FIGER (Ling and Weld, 2012) as the starting point and make a series of modifications. Besides the modifications mentioned in Section 4.1, we also conduct manual denoising of the automatically annotated data of FIGER. For each entity type and the corresponding automatically annotated mentions, we randomly select 500 mentions and compute the accuracy to obtain the real frequency. For example, statistics report that cemetery is a type with high frequency. However, many of the mentions labeled as cemetery are actually GPE. Similarly, engineer is also affected by noise.

A.3 Interface
The interface is shown in Figure 4, where annotators can conveniently select entity spans and annotate the corresponding coarse and fine types. Annotators can also check the current annotation information on the interface.

B Implementation Details
All four models use BERT base (Devlin et al., 2019a) as the backbone encoder, initialized with the corresponding pre-trained uncased weights. The hidden size is 768, and the numbers of layers and heads are both 12. Models are implemented with the PyTorch framework (Paszke et al., 2019) and Huggingface transformers (Wolf et al., 2020). BERT models are optimized by AdamW (Loshchilov and Hutter, 2019) with a learning rate of 1e-4. We evaluate our implementations of NNShot and StructShot on the datasets used in the original paper, producing similar results. For supervised NER, the batch size is 8, and we train BERT-Tagger for 70000 steps and evaluate it on the test set. For the 5 way 1∼2 and 5∼10 shot settings, the batch sizes are 16 and 4, and for the 10 way 1∼2 and 5∼10 shot settings, the batch sizes are 8 and 1. We train for 12000 episodes, use 500 episodes of the dev set to select the best model, and test it on 5000 episodes of the test set. Most hyper-parameters are from the original settings. We manually tune the hyper-parameter τ in Viterbi for StructShot; the value is 0.320 for the 1∼2 shot settings and 0.434 for the 5∼10 shot settings. All the experiments are conducted with CUDA on NVIDIA Tesla V100 GPUs. With 2 GPUs used, the average time to train 10000 episodes is 135 minutes. The number of parameters of the models is 120M.

C Entity Types
As introduced in Section 4.1 of the main text, FEW-NERD is manually annotated with 8 coarse-grained and 66 fine-grained entity types, and we list all the types in Table 8. The schema is designed for practical use, and we hope it helps readers better understand FEW-NERD. Note that ORG is the abbreviation of Organization, and MISC is the abbreviation of Miscellaneous.

Figure 1: An overview of FEW-NERD. The inner circle represents the coarse-grained entity types and the outer circle represents the fine-grained entity types; some types are denoted by abbreviations.

Figure 2: A heat map illustrating knowledge correlations among types in FEW-NERD. Each small colored square represents the similarity of two entity types.

Figure 3: F1-scores of different entity types on FEW-NERD (SUP). We report the average performance of each coarse-grained entity type in the legends.

Figure 4: Screenshot of the interface used to annotate FEW-NERD.

Table 1: An annotated case of FEW-NERD.
Paragraph: London [Art-Music] is the fifth album by the British [Loc-GPE] rock band Jesus Jones [Org-ShowOrg] in 2001 through Koch Records [Org-Company]. Following the commercial failure of 1997's "Already [Art-Music]" which led to the band and EMI [Org-Company] parting ways, the band took a hiatus before regathering for the recording of "London [Art-Music]" for Koch/Mi5 Recordings, with a more alternative rock approach as opposed to the techno sounds on their previous albums. The album had low-key promotion, initially only being released in the United States [Loc-GPE]. Two EP's were released from the album, "Nowhere Slow [Art-Music]" and "In the Face Of All This [Art-Music]".

Table 3: Statistics of train, dev and test sets for three tasks of FEW-NERD. We remove the sentences with no entities for the few-shot benchmarks.
... $E_{train}$, $E_{dev}$, $E_{test}$, and $E_{train} \cup E_{dev} \cup E_{test} = E$, $E_{train} \cap E_{dev} \cap E_{test} = \emptyset$. Note that all the entity types are fine-grained types. Under this circumstance, instances in the train, dev and test datasets only consist of instances with entities in $E_{train}$, $E_{dev}$, $E_{test}$ respectively. However, NER is a sequence labeling problem, and it is possible that a sentence contains several different entities. To avoid the observation of new entity types in the training phase, we replace the labels of entities that belong to $E_{test}$ with O in the training set. Similarly, in the test set, entities that belong to $E_{train}$ and $E_{dev}$ are also replaced by O. Based on this setting, we develop two few-shot NER tasks adopting different splitting strategies.
FEW-NERD (INTRA). Firstly, we construct $E_{train}$, $E_{dev}$ and $E_{test}$ according to the coarse-grained types. In other words, all the entities in different sets belong to different coarse-grained types.
... $E_{train}$, $E_{dev}$, the coarse-grained types are shared. Specifically, we roughly assign 60% of the fine-grained types of all the 8 coarse-grained types to $E_{train}$, 20% to $E_{dev}$ and 20% to $E_{test}$. The intuition of this setting is to explore if the coarse information will affect the prediction of new entities.

Table 4: Results of BERT-Tagger on previous NER datasets and the supervised setting of FEW-NERD.

Table 6: Performance of state-of-the-art models on FEW-NERD (INTER).
It is also observed that NNShot and StructShot may suffer from the instability of the nearest neighbor mechanism in the training phase, and prototypical models are more stable because the calculation of prototypes essentially serves as regularization.

Table 7: Error analysis of 5 way 5∼10 shot on FEW-NERD (INTER). "Within" indicates "within the coarse types" and "Outer" is "outer the coarse types".