Few-Shot Named Entity Recognition: An Empirical Baseline Study

This paper presents an empirical study on how to efficiently build named entity recognition (NER) systems when only a small amount of in-domain labeled data is available. Based upon recent Transformer-based self-supervised pre-trained language models (PLMs), we investigate three orthogonal schemes to improve model generalization ability in few-shot settings: (1) meta-learning to construct prototypes for different entity types, (2) task-specific supervised pre-training on noisy web data to extract entity-related representations, and (3) self-training to leverage unlabeled in-domain data. On 10 public NER datasets, we perform extensive empirical comparisons over the proposed schemes and their combinations with various proportions of labeled data. Our experiments show that (i) in the few-shot learning setting, the proposed NER schemes significantly improve or outperform the commonly used baseline, a PLM-based linear classifier fine-tuned using domain labels; and (ii) we create new state-of-the-art results in both few-shot and training-free settings compared with existing methods.


Introduction
Named Entity Recognition (NER) involves processing unstructured text, locating and classifying named entities (certain occurrences of words or expressions) into particular categories of pre-defined entity types, such as persons, organizations, locations, medical codes, dates and quantities. NER serves as an important first component for tasks such as information extraction (Ritter et al., 2012), information retrieval (Guo et al., 2009), question answering (Mollá et al., 2006), task-oriented dialogues (Peng et al., 2020a; Gao et al., 2019) and other language understanding applications (Nadeau and Sekine, 2007; Shaalan, 2014).

Figure 1: An overview of methods studied in our paper. Linear classifier fine-tuning is a default baseline that updates an NER model from pre-trained RoBERTa/BERT. We study three orthogonal strategies to improve NER models in the limited labeled data settings.

Deep learning has shown remarkable success in NER in recent years, especially with self-supervised pre-trained language models (PLMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019c). State-of-the-art (SoTA) NER models are often initialized with PLM weights and fine-tuned with standard supervised learning. One classic approach is to add a linear classifier on top of the representations provided by PLMs, and fine-tune the entire model using a cross-entropy objective on domain-specific labels (Devlin et al., 2019). Despite its simplicity, the approach generally results in good performance on benchmarks and serves as a strong baseline in this study. Unfortunately, even with these PLMs, building NER systems remains a labor-intensive, time-consuming task: it requires rich domain knowledge and expert experience to annotate a large corpus of in-domain labeled tokens for the models to reach reasonable accuracy.
However, this contrasts with real-world application scenarios, where only very small amounts of labeled data are available for new domains, such as the medical domain (Hofer et al., 2018). The cost of building NER systems at scale with rich annotations (i.e., hundreds of different enterprise use-cases/domains) can be prohibitively expensive. This draws attention to a challenging but practical research problem: few-shot NER.
To deal with the challenge of few-shot learning, we focus on improving the generalization ability of PLMs for NER from three complementary directions, shown in Figure 1. Instead of limiting ourselves to the classic approach of making use of limited in-domain labeled tokens, (i) we create prototypes as the representations for different entity types, and assign labels via the nearest neighbor criterion; (ii) we continuously pre-train PLMs using web data with noisy labels, which is available in much larger quantities, to improve NER accuracy and robustness; (iii) we tag in-domain unlabeled data with soft labels via self-training (Xie et al., 2020), and perform semi-supervised learning in conjunction with the limited labeled data.

Our contributions include: (i) We present the first systematic study of few-shot NER, a problem that is little explored in the literature. Three distinctive schemes and their combinations are investigated. (ii) We perform comprehensive comparisons of these schemes on 10 public NER datasets from different domains. (iii) Compared with existing methods in few-shot and training-free NER settings, the proposed schemes achieve SoTA performance despite their simplicity. To shed light on future research on few-shot NER, our study suggests that: (i) Noisy supervised pre-training can significantly improve NER accuracy, and we will release our pre-trained checkpoints. (ii) Self-training consistently improves few-shot learning when the ratio of unlabeled to labeled data is high. (iii) The performance of prototype learning varies across datasets. It is useful when the number of labeled examples is small, or when new entity types are given in the training-free setting.
2 Background on Few-shot NER

Few-shot NER is a sequence labeling task, where the input is a text sequence (e.g., a sentence) of length T, X = [x_1, x_2, ..., x_T], and the output is a corresponding length-T labeling sequence Y = [y_1, y_2, ..., y_T], where y ∈ Y is a one-hot vector indicating the entity type of each token from a pre-defined discrete label space. The training dataset for NER often consists of pairwise data D_L = {(X_n, Y_n)}_{n=1}^{N}, where N is the number of training examples. Traditional NER systems are trained in the standard supervised learning paradigm, which usually requires a large number of pairwise examples, i.e., N is large. In real-world applications, the more common scenario is that only a small number of labeled examples are given for each entity type (N is small), because expanding labeled data increases annotation cost and decreases customer engagement. This yields a challenging task: few-shot NER.
Linear Classifier Fine-tuning. Following the recent self-supervised PLMs (Devlin et al., 2019; Liu et al., 2019c), a typical method for NER is to utilize a Transformer-based backbone network to extract the contextualized representation of each token, z = f_{θ_0}(x). A linear classifier (i.e., a linear layer with parameters θ_1 = {W, b} followed by a softmax layer) is applied to project the representation z into the label space, f_{θ_1}(z) = Softmax(Wz + b). In other words, the end-to-end prediction for linear-classifier-based NER can be obtained via a function composition y = f_{θ_1} ∘ f_{θ_0}(x), with trainable parameters θ = {θ_0, θ_1}. The pipeline is shown in Figure 2(a). The model is optimized by minimizing the cross-entropy:

  θ* = arg min_θ Σ_{n=1}^{N} Σ_{t=1}^{T} KL(y_{n,t} || p_{n,t}),    (1)

where the KL divergence between two distributions is KL(p||q) = E_p log(p/q), and the prediction probability vector for each token is

  p = f_{θ_1} ∘ f_{θ_0}(x).    (2)

In practice, θ_1 = {W, b} is always updated, while θ_0 can be either frozen (Liu et al., 2019a,b; Lu, 2019) or updated (Devlin et al., 2019; Yang and Katiyar, 2020).
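As a concrete sketch of the head and its per-token objective (our illustrative reimplementation, not the paper's training code; the dimensions, weights, and token representation below are toy values standing in for real f_{θ_0} outputs):

```python
import math

def softmax(logits):
    # numerically stable softmax
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def linear_classifier(z, W, b):
    # f_{theta_1}(z) = Softmax(Wz + b): project a token representation into the label space
    logits = [sum(w_i * z_i for w_i, z_i in zip(row, z)) + b_j for row, b_j in zip(W, b)]
    return softmax(logits)

def token_loss(probs, gold_index):
    # KL(y || p) with a one-hot target y reduces to -log p[gold]
    return -math.log(probs[gold_index])

# toy example: a 2-dim token representation, 3 entity labels
z = [1.0, -0.5]
W = [[0.2, 0.1], [0.0, 0.3], [-0.1, 0.2]]
b = [0.0, 0.1, 0.0]
probs = linear_classifier(z, W, b)
```

In a real system, z would be the PLM output for a (sub)token and W, b would be learned jointly with the backbone by minimizing the summed token losses.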

Methods
When only a small number of labeled tokens are available, the classical supervised fine-tuning approach runs into difficulties: the model tends to over-fit the training examples and shows poor generalization performance on the testing set (Fritzler et al., 2019). In this paper, we provide a comprehensive study specifically for limited NER data settings, and explore three orthogonal directions shown in Figure 1: (i) How to adapt meta-learning, such as prototype-based methods, for few-shot NER? (ii) How to leverage freely available web data as noisy supervised pre-training data? (iii) How to leverage unlabeled in-domain sentences in a semi-supervised manner? Note that these three directions are complementary to each other and can be used jointly to further extrapolate the methodology space in Figure 1.

Figure 2: (b) A prototype set is constructed via averaging features of all tokens belonging to a given entity type in the support set (e.g., the prototype for Person is an average of three tokens: Mr., Bush and Jobs). For a token in the query set, its distances from different prototypes are computed, and the model is trained to maximize the likelihood of assigning the query token to its target prototype. (c) The Wikipedia dataset is employed for supervised pre-training, whose entity types are related but different (e.g., Musician and Artist are more fine-grained types of Person in the downstream task). The associated types on each token can be noisy. (d) Self-training: An NER system (teacher model) trained on a small labeled dataset is used to predict soft labels for sentences in a large unlabeled dataset. The union of the predicted dataset and the original dataset is used to train a student model.

Prototype-based Methods
Meta-learning (Ravi and Larochelle, 2017) has shown promising results for few-shot image classification (Tian et al., 2020) and sentence classification (Geng et al., 2019). It is natural to adapt this idea to few-shot NER. The core idea is to use an episodic classification paradigm to simulate few-shot settings during model training. Specifically, in each episode, M entity types (usually M < |Y|) are randomly sampled from D_L, together with a support set S (K sentences per type) and a query set Q. We build our method on the prototypical network (Snell et al., 2017), which introduces the notion of prototypes, representing entity types as vectors in the same representation space as individual tokens. To construct the prototype c_m for the m-th entity type, the average of representations is computed over all tokens belonging to this type in the support set S:

  c_m = (1 / |S_m|) Σ_{x ∈ S_m} f_{θ_0}(x),    (3)

where S_m is the token set of the m-th type in S, and f_{θ_0} is defined in (2). For an input token x ∈ Q from the query set, its prediction distribution is computed by a softmax function of the distances between x and all the entity prototypes. For example, the prediction probability for the m-th prototype is:

  p(y = I_m | x) = exp(−d(f_{θ_0}(x), c_m)) / Σ_{m'} exp(−d(f_{θ_0}(x), c_{m'})),    (4)

where I_m is the one-hot vector with 1 in the m-th coordinate and 0 elsewhere, and d(·, ·) is a distance function (e.g., the squared Euclidean distance). We provide a simple example to illustrate the prototype method in Figure 2(b). At each training iteration, a new episode is sampled, and the model parameter θ_0 is updated by plugging (4) into (1). In the evaluation phase, the label of a new token x is assigned using the nearest neighbor criterion arg min_m d(f_{θ_0}(x), c_m).
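A minimal sketch of prototype construction and the evaluation-phase nearest-neighbor rule, assuming squared Euclidean distance for d(·, ·); the vectors below are toy values standing in for PLM token representations:

```python
def build_prototypes(support):
    # support: {entity_type: list of token representation vectors}
    # each prototype is the average of the support tokens of that type
    protos = {}
    for etype, vecs in support.items():
        dim = len(vecs[0])
        protos[etype] = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
    return protos

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_prototype(protos, query_vec):
    # evaluation-phase rule: pick the type whose prototype is closest
    return min(protos, key=lambda t: squared_distance(protos[t], query_vec))

# toy support set: two "Person" tokens and one non-entity token (2-dim features)
support = {"PER": [[1.0, 0.0], [0.8, 0.2]], "O": [[0.0, 1.0]]}
protos = build_prototypes(support)
```

During training, the softmax over negative distances would be plugged into the cross-entropy objective instead of taking a hard argmin.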

Noisy Supervised Pre-training
Generic representations from self-supervised pre-trained language models (Devlin et al., 2019; Liu et al., 2019c) have benefited a wide range of NLP applications. These models are pre-trained with the task of randomly masked token prediction on massive corpora, and are agnostic to the downstream tasks. In other words, PLMs treat each token equally, which is not aligned with the goal of NER: identifying named entities as emphasized tokens and assigning labels to them. For example, in the sentence "Mr. Bush asked Congress to raise to $6 billion", PLMs treat to and Congress equally, while NER aims to highlight entities like Congress and downplay collocated non-entity words like to.
This intuition inspires us to endow the backbone network with an ability to upweight the representations of entities for NER. Hence, we propose to employ the large-scale noisy web data WiFiNE (Ghaddar and Langlais, 2018) for noisy supervised pre-training (NSP). The authors automatically annotated the 2013 English Wikipedia dump by querying anchored strings, as well as the coreference mentions in each wiki page, against Freebase. The WiFiNE dataset is 6.8GB and contains 113 entity types along with over 50 million sentences. Though it introduces inevitable noise (a random subset of 1,000 mentions was manually evaluated and the accuracy of automatic annotations reaches 77% as reported in the paper, largely due to errors in identifying coreferences), this automatic annotation procedure is highly scalable and affordable. The label set of WiFiNE covers a wide range of fine-grained entity types, which are often related to but different from entity types in the downstream datasets. For example, in Figure 2(c), the entity types Musician and Artist in Wikipedia are more fine-grained than Person in a typical NER dataset. The proposed NSP learns representations that distinguish entities from other tokens. This particularly favors few-shot settings, preventing over-fitting via the prior knowledge of extracting entities from various contexts acquired in pre-training.
Two pre-training objectives are considered for NSP: the first uses the linear classifier in (2), the other is the prototype-based objective in (4). For the linear classifier, we found that a batch size of 1024 and a learning rate of 1e-4 work best; for the prototype-based approach, we use the episodic training paradigm with M = 5 and set the learning rate to 5e-5. For both objectives, we train on the whole corpus for one epoch and apply the Adam optimizer (Kingma and Ba, 2015) with a linearly decaying schedule with a warmup ratio of 0.1. We empirically compare both objectives in experiments, and found that the linear classifier in (2) improves pre-training more significantly.
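The warmup-then-decay schedule can be sketched step-wise as follows; this is our reading of a standard linear schedule with a 0.1 warmup ratio, not the exact training code:

```python
def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.1):
    # linear warmup for the first warmup_ratio of steps, then linear decay to zero
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

For example, with 100 total steps and a peak learning rate of 1e-4, the rate ramps up over the first 10 steps, peaks at step 10, and decays linearly to 0 at step 100.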

Self-training
Though manually labeling entities is expensive, it is easy to collect large amounts of unlabeled data in the target domain. Hence, it is desirable to improve model performance by effectively leveraging unlabeled data D_U together with the limited labeled data D_L. We resort to the recent self-training scheme (Xie et al., 2020) for semi-supervised learning. The algorithm operates as follows:

1. Learn a teacher model θ_tea via cross-entropy using (1) with labeled tokens D_L.

2. Generate soft labels using the teacher model on unlabeled tokens:

  ỹ = f_{θ_tea}(x),  for x ∈ D_U.    (5)

3. Learn a student model θ_stu via cross-entropy using (1) on both labeled and unlabeled tokens:

  θ_stu = arg min_θ (1/|D_L|) Σ_{(x,y) ∈ D_L} KL(y || f_θ(x)) + λ_U (1/|D_U|) Σ_{x ∈ D_U} KL(ỹ || f_θ(x)),    (6)

where λ_U is the weighting hyper-parameter.
A visual illustration of the self-training procedure is shown in Figure 2(d). It is optional to iterate from Step 1 to Step 3 multiple times, by initializing θ_tea in Step 1 with the newly learned θ_stu from Step 3. We only perform self-training once for simplicity, which already shows excellent performance.
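The student objective in Step 3 can be sketched as below; this is our illustrative reimplementation with toy probability vectors (the labeled cross-entropy term plus the λ_U-weighted soft-label term):

```python
import math

def kl_div(p, q, eps=1e-12):
    # KL(p || q) = E_p log(p / q); eps guards against log(0)
    return sum(p_i * math.log((p_i + eps) / (q_i + eps)) for p_i, q_i in zip(p, q))

def self_training_loss(gold_onehots, student_l, teacher_soft, student_u, lam_u=0.5):
    # labeled term: cross-entropy of student predictions against one-hot gold labels
    sup = sum(kl_div(y, p) for y, p in zip(gold_onehots, student_l)) / len(gold_onehots)
    # unlabeled term: KL of student predictions against teacher soft labels,
    # weighted by lambda_U
    unsup = sum(kl_div(t, p) for t, p in zip(teacher_soft, student_u)) / len(teacher_soft)
    return sup + lam_u * unsup
```

If the student matches both the gold labels and the teacher's soft labels exactly, the loss vanishes; any disagreement contributes a positive penalty.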

Settings
Methods. Throughout our experiments, the pre-trained RoBERTa-base model is employed as the backbone network. We investigate the following six schemes for the comparative study: (i) LC is the linear classifier fine-tuning method in Section 2, i.e., adding a linear classifier on top of the backbone and directly fine-tuning the entire model on the target dataset; (ii) P indicates the prototype-based method in Section 3.1; (iii) NSP refers to the noisy supervised pre-training in Section 3.2; depending on the pre-training objective, we have LC+NSP and P+NSP; (iv) ST is the self-training approach in Section 3.3, combined with linear classifier fine-tuning and denoted LC+ST; (v) LC+NSP+ST combines all three. We evaluate our methods on 10 public benchmark datasets, covering a wide range of domains. The statistics of these datasets are summarized in Table 1, and detailed descriptions are provided in the Appendix. For each dataset, we conduct three sets of experiments using various proportions of the training data: 5-shot, 10% and 100%. More experimental settings, such as the hyper-parameters and evaluation details, are given in the Appendix.

Comprehensive Comparison Results
To gain thorough insights and benchmark few-shot NER, we first perform an extensive comparative study of the six methods across 10 datasets.
The results are shown in Table 2. We can draw the following major conclusions: (i) Comparing columns 1 and 2 (or columns 3 and 4) clearly shows that noisy supervised pre-training provides better results on most datasets, especially in the 5-shot setting, which demonstrates that NSP endows the model with an ability to extract better NER-related features. (ii) Columns 1 and 3 provide a head-to-head comparison between the linear classifier and the prototype-based method: while the prototype-based method outperforms LC on CoNLL, WikiGold, WNUT17 and MultiWOZ in the 5-shot learning setting, it falls behind LC on the other datasets and on average. This shows that the prototype-based method only yields better results when labeled data is very limited: when both the number of entity types and the number of examples are small. (iii) Comparing columns 5 and 1 (or columns 6 and 2), we observe that self-training consistently works better than directly fine-tuning with labeled data only, suggesting that ST is a useful technique for leveraging in-domain unlabeled data when available.
(iv) Column 6 shows the highest F1-score in most cases, demonstrating that the three proposed schemes are complementary to each other and can be combined to yield the best results in practice.
In Figure 3, we show the learning curve of the average F1-score on the 5-shot CoNLL-2003 test set over 10 repeated experiments. The checkpoint from NSP provides a better initialization than RoBERTa: the NSP variants improve over their counterpart methods at the beginning of learning, and eventually reach a higher F1-score.

Comparison with SoTA Methods
Competitive methods. The current SoTA on few-shot NER includes: (i) StructShot (Yang and Katiyar, 2020), which extends nearest neighbor classification with a decoding process using an abstract tag transition distribution. Both the model and the transition distribution are trained on the source dataset OntoNotes. (ii) L-TapNet+CDT (Hou et al., 2020) is a slot tagging method which constructs an embedding projection space using label name semantics to better separate different classes. It also includes a collapsed dependency transfer mechanism to transfer label dependency information from source domains to target domains. (iii) SimBERT is a simple baseline reported in (Yang and Katiyar, 2020; Hou et al., 2020); it utilizes a nearest neighbor classifier based on the contextualized representations output by pre-trained BERT, without fine-tuning on few-shot examples. The results reported in the StructShot paper use the IO schema instead of the BIO schema, so we report our performance on both for completeness.
For a fair comparison, following (Yang and Katiyar, 2020), we also continue pre-training our model on OntoNotes after the noisy supervised pre-training stage. For each 5-shot learning task, we repeat the experiment by re-sampling few-shot examples 10 times. The results are reported in Table 3. We observe that our proposed methods consistently outperform the StructShot model across all three datasets.

Table 4: F1-score in training-free settings, i.e., predicting novel entity types using nearest neighbor methods. The best results are in bold. † indicates results from (Ziyadi et al., 2020; Wiseman and Stratos, 2019).

Training-free settings. In this setting, a model must predict entity types unseen during training, without any parameter update on the target domain. Our prototype-based method is able to perform such immediate inference. Two recent studies on training-free NER are: (i) Neighbor-tagging (Wiseman and Stratos, 2019), which copies token-level labels from weighted nearest neighbors; (ii) Example-based NER (Ziyadi et al., 2020), which is the SoTA on training-free NER, identifying the starting and ending tokens of unseen entity types.
We observe that our basic prototype-based method, under the training-free setting, does not gain from more given examples. We hypothesize that this is because tokens belonging to the same entity type are not necessarily close to each other, and are often scattered in the representation space. Though it is hard to find one single centroid for all tokens of the same type, we assume that there exist local clusters of tokens belonging to the same type. To resolve this issue, we follow (Deng et al., 2020) and extend our method to a version called Multi-Prototype, creating K/5 prototypes for each type given K examples per type (e.g., 2 prototypes per class are used in the 10-shot setting). The prediction score for a testing token belonging to a type is computed by averaging the prediction probabilities from all prototypes of that type.
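The Multi-Prototype scoring can be sketched as below. This is an illustrative reimplementation: a simple round-robin split of the support tokens stands in for the grouping strategy of (Deng et al., 2020), and the vectors are toy values rather than PLM representations:

```python
import math

def split_prototypes(vectors, num_protos):
    # group the support tokens of one type and average each group;
    # round-robin split here is a stand-in for a real clustering step
    groups = [vectors[i::num_protos] for i in range(num_protos)]
    dim = len(vectors[0])
    return [[sum(v[d] for v in g) / len(g) for d in range(dim)] for g in groups if g]

def type_scores(protos_by_type, query):
    # softmax over negative squared distances to every prototype,
    # then average the probabilities per entity type
    flat = [(t, p) for t, plist in protos_by_type.items() for p in plist]
    negd = [-sum((a - b) ** 2 for a, b in zip(p, query)) for _, p in flat]
    m = max(negd)
    exps = [math.exp(v - m) for v in negd]
    z = sum(exps)
    scores, counts = {}, {}
    for (t, _), e in zip(flat, exps):
        scores[t] = scores.get(t, 0.0) + e / z
        counts[t] = counts.get(t, 0) + 1
    return {t: scores[t] / counts[t] for t in scores}

protos = {
    "PER": split_prototypes([[0.0, 0.0], [0.2, 0.0]], 1),
    "LOC": split_prototypes([[1.0, 1.0]], 1),
}
s = type_scores(protos, [0.1, 0.0])
```

With more than one prototype per type, a query token only needs to be close to one local cluster of its type, rather than to a single global centroid.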
We compare with previous methods in Table 4 and observe that the multi-prototype method not only benefits from more support examples, but also surpasses neighbor tagging and example-based NER by a large margin on two out of three datasets. For the MIT Movie dataset, one entity type can span a large chunk of multiple consecutive words in a sentence, which favors span-based methods like (Ziyadi et al., 2020); for example, an entire quoted span ending in "... anymore come from" is annotated as entity type Quote. The proposed methods in this paper can be combined with the span-based approach to specifically tackle this problem, and we leave it as future work. Further, if slight fine-tuning is allowed, we see that the prototype-based method achieves 0.438 with 5-shot learning in Table 2, better than the 0.395 achieved by example-based NER given 500 examples.

Related Work
General NER. NER is a long-standing problem in NLP. Deep learning has significantly improved recognition accuracy. Early efforts include exploring various neural architectures (Lample et al., 2016) such as bidirectional LSTMs (Chiu and Nichols, 2016) and adding CRFs to capture structure (Ma and Hovy, 2016). Early studies also noticed the importance of reducing annotation labor, employing semi-supervised learning such as clustering (Lin and Wu, 2009) and combining supervised objectives with unsupervised word representations (Turian et al., 2010). PLMs have recently revolutionized NER, where large-scale Transformer-based architectures (Peters et al., 2018; Devlin et al., 2019) are used as backbone networks to extract informative representations. Contextualized string embeddings (Akbik et al., 2018) were proposed to capture subword structures and polysemous words in different usages. Masked words and entities are jointly trained for prediction in (Yamada et al., 2020) with entity-aware self-attention.
These methods are designed for standard supervised learning, and have a limited generalization ability in few-shot settings, as empirically shown in (Fritzler et al., 2019).
Prototype-based methods have recently become popular few-shot learning approaches in the machine learning community. They were first studied in the context of image classification (Vinyals et al., 2016; Sung et al., 2018; Zhao et al., 2020), and have recently been adapted to different NLP tasks such as text classification (Geng et al., 2019; Bansal et al., 2020), machine translation (Gu et al., 2018) and relation classification (Han et al., 2018). The work most closely related to ours is (Fritzler et al., 2019), which explores the prototypical network for few-shot NER, but only utilizes RNNs as the backbone model and does not leverage the power of large-scale Transformer-based architectures for word representations. Our work is similar to (Ziyadi et al., 2020; Wiseman and Stratos, 2019) in that all of them utilize the nearest neighbor criterion to assign entity types, but differs in that (Ziyadi et al., 2020; Wiseman and Stratos, 2019) consider every individual token instance for nearest neighbor comparison, while ours compares against prototypes. Hence, our method is much more scalable as the number of given examples increases.
Supervised pre-training. In computer vision, it is a de facto standard to transfer ImageNet-supervised pre-trained models to small image datasets to pursue high recognition accuracy (Yosinski et al., 2014). The recent work named Big Transfer (Kolesnikov et al., 2019) has achieved SoTA on various vision tasks via pre-training on billions of noisily labeled web images. To gain stronger transfer learning ability, one may combine supervised and self-supervised methods (Li et al., 2020c,b). In NLP, supervised/grounded pre-training has recently been explored for natural language generation (NLG) (Keskar et al., 2019; Zellers et al., 2019; Gao et al., 2020; Li et al., 2020a). These works aim to endow GPT-2 (Radford et al.) with high-level semantic control in language generation, and are often pre-trained on massive corpora consisting of text sequences associated with prescribed codes such as text style, content description, and task-specific behavior. In contrast to NLG, to the best of our knowledge, large-scale supervised pre-training has been little studied for natural language understanding (NLU). There are early studies showing promising results by transferring from medium-sized datasets to small datasets in some NLU applications, e.g., from MNLI to RTE for sentence classification (Phang et al., 2018; Clark et al., 2020), and from OntoNER to CoNLL for NER (Yang and Katiyar, 2020). Our work further scales up supervised pre-training to web data (Ghaddar and Langlais, 2018), roughly 1,000 times larger than (Yang and Katiyar, 2020), showing consistent improvements.
Self-training. Self-training (Scudder, 1965) is one of the earliest semi-supervised methods, and has recently achieved improved performance on tasks such as ImageNet classification (Xie et al., 2020), visual object detection (Zoph et al., 2020), neural machine translation (He et al., 2020) and sentence classification (Mukherjee and Awadallah, 2020; Du et al., 2020). It is shown via object detection tasks in (Zoph et al., 2020) that stronger data augmentation and more labeled data can diminish the value of pre-training, while self-training is always helpful in both low-data and high-data regimes. Our work presents the first study of self-training for NER, and we observe a similar phenomenon: it consistently boosts few-shot learning performance across all 10 datasets.

Conclusion and Future Work
We have presented an empirical study of several directions in few-shot NER. Three foundational methods and their combinations are systematically investigated: prototype-based methods, noisy supervised pre-training and self-training. They are intensively compared on 10 public datasets under various settings. All of them improve the PLM's generalization ability when learning from a few labeled examples, among which supervised pre-training and self-training turn out to be particularly effective. Our proposed schemes achieve SoTA in both few-shot and training-free settings compared with recent studies. We will release our benchmarks and code, in the hope of inspiring future few-shot NER research with more advanced methods to tackle this challenging and practical problem.
For future work, we believe our studies can be combined with other interesting explorations in distantly supervised learning, such as augmentation-based methods (Dai and Adel, 2020) and methods dealing with noisy labels (Meng et al., 2021). It would also be promising for researchers to consider larger pre-trained language models to learn better entity representations.

Ethical Considerations
The dataset WiFiNE (Ghaddar and Langlais, 2018) used in our noisy supervised pre-training stage is a public dataset. It is consistent with the terms of use of its sources and the original authors' intellectual property and privacy rights. As a modified version of the Wikipedia dataset, the collection procedure ensures no ethical concerns, e.g., toxic language and hate speech. The entity types in our pre-training and fine-tuning datasets are common objects observed in daily life, detailed in the Appendix.

A Experimental Settings

[...] (Stubbs and Uzuner, 2015) on the medical domain. The detailed statistics of these datasets are summarized in Table 1.
For each dataset, we conduct three sets of experiments using various proportions of the training data: 5-shot, 10% and 100%. For the 5-shot setting, we sample 5 sentences for each entity type in the training set and repeat each experiment 10 times. For the 10% setting, we down-sample 10 percent of the training set, and for the 100% setting, we use the full training set as labeled data. We only study the self-training method in the 5-shot and 10% settings, using the rest of the training set as the unlabeled in-domain corpus.
Hyper-parameters. We have described the details for noisy supervised pre-training in Section 3.2. For training on target datasets, we set a fixed set of hyper-parameters across all the datasets. For the linear classifier, we set the batch size to 16 for the 100% and 10% settings, and to 4 for the 5-shot setting. For each episode of the prototype-based method, we set the number of sentences per entity type in the support and query sets, (K, K′), to (5, 15) for the 100% and 10% settings, and to (2, 3) for the 5-shot setting. For both training objectives, we set the learning rate to 5e-5 for the 100% and 10% settings, and to 1e-4 for the 5-shot setting. For all training data sizes, we train for 10 epochs, and the Adam optimizer (Kingma and Ba, 2015) is used with the same linear decaying schedule as in the pre-training stage. For self-training, we set λ_U = 0.5.
Evaluation. We follow the standard protocols for NER tasks to evaluate performance on the test set (Sang and Meulder, 2003). Since RoBERTa tokenizes each word into subwords, we generate word-level predictions based on the first word piece of each word. Word-level predictions are then turned into entity-level predictions when calculating the F1-score. Two tagging schemas are typically considered for encoding chunks of tokens into entities: the BIO schema marks the beginning token of an entity as B-X and the consecutive tokens as I-X, with other tokens marked as O. The IO schema uses I-X to mark all tokens inside an entity, and is thus defective as there is no boundary tag between adjacent entities of the same type. In our study, we use the BIO schema by default, but also report performance evaluated with the IO schema for fair comparison with some previous studies.
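The BIO-to-entity conversion performed before computing the F1-score can be sketched as follows; this is an illustrative implementation of the schema described above, not the actual evaluation script:

```python
def bio_to_spans(tags):
    # convert a word-level BIO sequence into (type, start, end_exclusive) spans
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        # a span ends at B-, at O, or at an I- of a different type
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != etype):
            if start is not None:
                spans.append((etype, start, i))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # tolerate an I- tag without a preceding B- (common in noisy predictions)
            start, etype = i, tag[2:]
    return spans
```

Entity-level F1 then counts a predicted span as correct only if both its boundaries and its type match a gold span, which is why the conversion step matters.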

B Dataset statistics
We show the entity types and their corresponding frequencies in the pre-training dataset in Table 5 and in the downstream benchmark datasets in Tables 6 and 7. We see that the entity types for pre-training and fine-tuning are semantically related, but differ in granularity. For example, the location category in the pre-training dataset contains fine-grained entity types like country, city, road, and bridge, while the Onto dataset for fine-tuning only gives a coarse-grained partition into geopolitical locations (countries, cities) and non-geopolitical ones (highways, bridges). Further, for each category of entity types, the pre-training dataset has a much higher frequency than the fine-tuning datasets, allowing the model to learn heterogeneous contextual knowledge before being deployed to a specific domain.