Prompt-Based Metric Learning for Few-Shot NER

Few-shot named entity recognition (NER) targets generalizing to unseen labels and/or domains with few labeled examples. Existing metric learning methods compute token-level similarities between query and support sets, but are not able to fully incorporate label semantics into modeling. To address this issue, we propose a simple method to largely improve metric learning for NER: 1) multiple prompt schemas are designed to enhance label semantics; 2) we propose a novel architecture to effectively combine multiple prompt-based representations. Empirically, our method achieves new state-of-the-art (SOTA) results under 16 of the 18 considered settings, substantially outperforming the previous SOTA by an average of 8.84% and a maximum of 34.51% in relative gains of micro F1. Our code is available at https://github.com/AChen-qaq/ProML.


Introduction
Named entity recognition (NER) is a key natural language understanding task that extracts and classifies named entities mentioned in unstructured texts into predefined categories. Few-shot NER targets generalizing to unseen categories by learning from few labeled examples.
Recent advances for few-shot NER use metric learning methods, which compute token-level similarities between the query and the given support cases. Snell et al. (2017) proposed prototypical networks that learn prototypical representations for target classes. Later, this method was introduced to few-shot NER tasks (Fritzler et al., 2019; Hou et al., 2020). Yang and Katiyar (2020) proposed StructShot, which uses a pretrained language model as a feature extractor and performs Viterbi decoding at inference. Das et al. (2022) proposed CONTaiNER based on contrastive learning. This approach optimizes an objective that characterizes the distance of Gaussian-distributed embeddings under the metric learning framework.
Despite these recent efforts, there remain a few critical challenges for few-shot NER. First, as mentioned above, metric learning computes token-level similarities between the query and support sets. However, the architectures used for computing similarities in previous work are agnostic to the labels in the support set. This prevents the model from fully leveraging the label semantics of the support set to make correct predictions. Second, while prompts have been demonstrated to reduce overfitting in few-shot learning (Schick and Schütze, 2020), due to the more complex sequence labeling nature of NER, the optimal design of prompts remains unclear for few-shot NER.
In light of the above challenges, we explore a better architecture that allows using prompts to fully leverage label semantics. We propose a simple method of Prompt-based Metric Learning (ProML) for few-shot NER, as shown in Figure 1. Specifically, we introduce mask-reducible prompts, a special class of prompts that can be easily reverted to the original input by applying a mask. By performing a masked weighted average over the representations obtained from multiple prompts, our method accepts multiple choices of prompts as long as they are mask-reducible. These prompts improve label efficiency by inserting semantic annotations into the text inputs. As instantiations of this framework, we design an option prefix prompt to provide the model with the candidate label options, and a label-aware prompt to associate each entity with its entity type in the input. As shown in Figure 2, a single prompt provides useful information but has shortcomings. However, with a weighted average, multiple prompts are combined, which fully leverages the label information.
In our experiments, we find that using multiple prompts with the masked weighted average is effective for few-shot NER. Empirically, our method achieves new state-of-the-art (SOTA) results under 16 of the 18 considered settings, substantially outperforming the previous SOTA by an average of 9.12% and a maximum of 34.51% in relative gains of micro F1.

Related Work
Few-Shot NER. Few-shot NER targets generalizing to unseen categories by learning from few labeled examples. Noisy supervised methods (Huang et al., 2020) perform supervised pretraining over large-scale noisy web data such as WiNER (Ghaddar and Langlais, 2017). Self-training methods (Wang et al., 2021) perform semi-supervised training over a large amount of unlabelled data. Alternative to these data-enhancement approaches, metric learning based methods have been widely used for few-shot NER (Fritzler et al., 2019; Yang and Katiyar, 2020; Das et al., 2022). Recently, prompt-based methods (Ma et al., 2021; Cui et al., 2021; Lee et al., 2022) have been proposed for few-shot NER as well. To introduce more fine-grained entity types in few-shot NER, a large-scale human-annotated dataset, Few-NERD (Ding et al., 2021), was proposed. Ma et al. (2022b) and Wang et al. (2022) formulate the NER task as a span matching problem and decompose it into several procedures. Ma et al. (2022b) decomposed the NER task into span detection and entity typing; they separately train two models and finetune them on the test support set, achieving SOTA results on Few-NERD (Ding et al., 2021). Different from the above related works, our approach is a general framework for using prompts in token-level metric learning problems.
Meta Learning. The idea of meta learning was first introduced in few-shot classification tasks for computer vision, attempting to learn from a few examples of unseen classes. Since then, metric-based methods have been proposed, such as matching networks (Vinyals et al., 2016) and prototypical networks (Snell et al., 2017), which compute similarities according to the given support set and learn prototypical representations for target classes, respectively. It has been shown that these methods also enable few-shot learning for NLP tasks such as text classification (Bao et al., 2019; Geng et al., 2019), relation classification (Han et al., 2018), named entity recognition (Fritzler et al., 2019; Yang and Katiyar, 2020; Das et al., 2022), and machine translation (Gu et al., 2018). Our approach also falls into the category of metric-based meta learning and outperforms previous work on NER with an improved architecture.
Label Semantics for NER. There have been some approaches that make use of label semantics (Ma et al., 2022a; Hou et al., 2020). Hou et al. (2020) propose a CRF framework with label-enhanced representations based on the architecture of Yoon et al. (2019). However, they mainly focus on slot tagging tasks, while their performance on NER tasks is poor. Ma et al. (2022a) introduce label semantics by aligning token representations with label representations. Both of them only use label semantics for learning better label representations. In contrast, our approach incorporates label semantics into the inputs so that the model is able to jointly model the label information and the original text samples. This makes the similarity scores dependent on the support set labels and is particularly crucial for metric learning. Our experiments also verify the advantages of our approach compared to previous work using label semantics.
Prompt-Based Approaches for NER. With the emergence of prompt-based methods in NLP research, some prompt-based approaches for few-shot NER have very recently been proposed (Cui et al., 2021; Lee et al., 2022; Ma et al., 2021). However, they use prompts to help with label predictions based on classification heads instead of metric learning. Moreover, some of these methods require searching for templates (Cui et al., 2021), good examples (Lee et al., 2022), or label-aware pivot words (Ma et al., 2021), which makes the results highly dependent on the search quality. Different from these methods, our approach does not rely on a search process. More importantly, another key difference is that we employ prompting in the setting of metric learning.
Task Definition

Few-shot NER

Named entity recognition (NER) is a sequence labeling task. Formally, for a sentence x = [x_1, x_2, ..., x_n] consisting of n tokens, there is a corresponding ground-truth label sequence y = [y_1, y_2, ..., y_n], where each y_i is an encoding of some label indicating the entity type for token x_i. A collection of these (x, y) pairs forms a dataset. Given a new dataset with only a few labeled samples, the model is required to perform quick adaptation. In this paper, we mainly focus on two evaluation protocols and two task formulations, which are explained as follows.

Evaluation protocols
Following Ding et al. (2021); Ma et al. (2022a), we summarize two evaluation protocols as follows.
Episode Evaluation An episode, or a task, is defined as a pair of one support set and one query set (S, Q), each consisting of sentences downsampled from the test set. For an N-way K-shot downsampling scheme, there are N labels in the support set S, where each label is associated with K examples. The query set Q shares the same label set with the support set. Based on the support set, the model is required to predict labels for the query set.
To perform an episode evaluation, a collection of T episodes {(S_t, Q_t)}_{t=1}^T is prepared. The evaluation results are computed within each episode and are averaged over all T episodes.
Low-resource Evaluation Different from few-shot episode evaluation, low-resource evaluation aims to directly evaluate the model on the whole test set. For a test dataset D_T with a label set C_T, a support set S associated with the labels from C_T is constructed by K-shot downsampling such that each label has K examples in S. Based on the support set S, the model is required to predict labels for the query set, which is the rest of the test set D_T. To perform a low-resource evaluation, T different runs of support set sampling are performed and the results are averaged.
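Both protocols rely on K-shot downsampling of a support set. The following is a minimal greedy sketch of such sampling; it is our simplification (the actual downsampling schemes of Ding et al. (2021) and Yang and Katiyar (2020) differ in details), and the data format of token/label pairs is an assumption.

```python
import random

def sample_support(dataset, target_labels, k, seed=0):
    """Greedy K-shot support-set sampling (a simplified sketch):
    keep adding sentences that contain a still-needed label until every
    target label is covered by at least k sentences."""
    rng = random.Random(seed)
    counts = {label: 0 for label in target_labels}
    support = []
    for tokens, labels in rng.sample(dataset, len(dataset)):
        # only take a sentence if it contributes a label we still need
        needed = [l for l in set(labels) if l in counts and counts[l] < k]
        if not needed:
            continue
        support.append((tokens, labels))
        for l in set(labels):
            if l in counts:
                counts[l] += 1
        if all(c >= k for c in counts.values()):
            break
    return support
```

Note that greedy sampling can overshoot K for frequent labels, which is why published schemes add extra pruning steps.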

Task formulation
Following Yang and Katiyar (2020), we formulate few-shot NER tasks in the following two ways.

Tag-Set Extension
To mimic the scenario that new classes of entities emerge in some domain, Yang and Katiyar (2020) propose the tag-set extension formulation. Starting with a standard NER dataset (D_train, D_test) with label set C, they split C into d parts C_1, C_2, ..., C_d; for each label split C_i, a train set is constructed from D_train by masking the labels in C_i to O (representing non-entities), and the corresponding test set is constructed from D_test by masking the labels in C \ C_i to O.

Domain Transfer

Another task formulation is the domain transfer setting. Let D_S be the training set of a standard NER task, and let {D_T^(i)} be the test sets of standard NER tasks from a different domain. The training set D_S is referred to as the source domain, and the test sets {D_T^(i)} constitute various target domains. In this setting, there may exist some overlapping entity classes between the source and target domains, but due to the domain gaps, it is still considered a few-shot setting.
Note that the task formulation is independent of the evaluation protocol, and different combinations will be considered in our experiments.

Prompt Schemas
Motivated by existing prompt-based methods (Liu et al., 2021; Paolini et al., 2021) and the metric learning framework, our ProML provides label semantics by introducing prompts to metric learning models. We propose a simple yet effective prompt class called mask-reducible prompts. Through this class of prompts, we can provide flexible prompts to the model in a way that is consistent with metric learning methods that use token-level similarities as the metric. Starting with this schema, we introduce the two prompts used in ProML: the option prefix prompt and the label-aware prompt.

Mask-Reducible Prompts
Suppose the raw input sequence is x = [x_1, x_2, ..., x_n]. Let f_prompt be a prompt function mapping x to the prompted result x′ = f_prompt(x). We call f_prompt a mask-reducible prompt function if, for every x and its prompted result x′, there exists a binary mask m such that x′[m == 1] = x. Intuitively, this means the prompt construction only inserts tokens, so that we can revert x′ back to x through a simple masking operation. The corresponding prompt of f_prompt is called a mask-reducible prompt.
Given a length-preserving sequence-to-sequence encoder Enc(x; θ), a sequence of input tokens x, and a mask-reducible prompt function f_prompt, we first construct the prompted result x′ = f_prompt(x), then pass the sequence x′ through the encoder to get representations h′ = Enc(x′; θ).
Since Enc(·; θ) is length-preserving, the length of h′ is the same as that of x′, and we can compute h = h′[m == 1] to get the representations for the input tokens, where m is the desired mask that reduces x′ to x (i.e., x′[m == 1] = x).
Through this process, the encoder receives the full prompts as its input while only the representations of raw input tokens are extracted.
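The insertion-and-masking idea can be sketched at the token level as follows. The function names and the ":" separator token are our illustrative choices, not the paper's exact prompt format; in the real model the masking is applied to encoder hidden states rather than tokens.

```python
def option_prefix_prompt(tokens, options):
    """A hypothetical mask-reducible prompt: prepend the label options to
    the input.  Returns the prompted tokens x' and a 0/1 mask m such that
    keeping the positions where m == 1 recovers the original tokens."""
    prefix = options + [":"]          # inserted material, masked out later
    prompted = prefix + tokens
    mask = [0] * len(prefix) + [1] * len(tokens)
    return prompted, mask

def reduce_with_mask(prompted, mask):
    """Revert a mask-reducible prompt: x'[m == 1] == x.  In the model the
    same selection is applied to h' = Enc(x') to keep only the
    representations of the raw input tokens."""
    return [t for t, m in zip(prompted, mask) if m == 1]
```

Because the mask only drops inserted positions, any prompt built purely by insertions satisfies the mask-reducibility condition.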
Prompt A: Option Prefix Prompts An option prefix prompt takes the concatenation of all label annotations as an option prefix to incorporate label semantics into modeling. Formally, for a given set of label options S, the prompt function f_A(x, S) prepends the concatenated options in S to the input x. An example is given in Figure 2, where option prefix prompts reduce the label space and help avoid incorrectly classifying non-entities. The option prefix prompts inform the main model of which labels to predict, which can be used to learn label-dependent representations for computing the similarities.
Prompt B: Label-Aware Prompts A label-aware prompt appends the entity type to each entity occurrence in the input so that the model is aware of such information. While the aforementioned option prefix prompts incorporate global label information, the label-aware prompts introduce local information about each entity. Specifically, let f_B(x, y) be the prompt function. Given a sequence of input tokens x and its ground-truth label sequence y, for each entity e that occurs in x, we obtain its corresponding label E from the sequence y and replace e with a label-appended version "[e|E]" to construct the prompted result x′ = f_B(x, y). Both the entity e and its label E are sequences of tokens. Because the label-aware prompt can only be applied when the ground-truth label is available, in our few-shot learning setting we do not apply this prompt to the query set. An example is given in Figure 2, where label-aware prompts provide full label information in the prompted inputs. More details are explained in the following description of our model architecture.
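A minimal sketch of the label-aware prompt f_B, assuming token-level IO tags as described in the datasets section; the bracket tokens and the contiguous-span entity detection are our simplifications of the "[e|E]" construction.

```python
def label_aware_prompt(tokens, labels):
    """Sketch of f_B: each maximal entity span e with IO tag E is wrapped
    as "[ e | E ]".  Inserted brackets and type annotations get mask 0,
    so the prompt stays mask-reducible."""
    prompted, mask = [], []
    i = 0
    while i < len(tokens):
        if labels[i] == "O":                  # non-entity token, kept as-is
            prompted.append(tokens[i]); mask.append(1)
            i += 1
            continue
        etype = labels[i]
        j = i
        while j < len(tokens) and labels[j] == etype:
            j += 1                            # extend the entity span
        prompted.append("["); mask.append(0)
        for t in tokens[i:j]:
            prompted.append(t); mask.append(1)
        prompted += ["|", etype, "]"]; mask += [0, 0, 0]
        i = j
    return prompted, mask
```

Selecting the positions with mask 1 recovers the original sentence, so this prompt can be combined with the option prefix prompt through the same masking machinery.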
Note that it is possible to design other mask-reducible prompts for NER, which will be naturally handled by our framework. In our study, we find these two prompts work well in practice and use them as instantiations to demonstrate the effectiveness of our framework.

Model and Training
The overall architecture of ProML is shown in Figure 1. Compared to the contrastive learning framework utilized by CONTaiNER (Das et al., 2022), our architecture uses a transformer backbone to encode different prompted inputs separately and employs a masked weighted average to obtain token representations, which we elaborate below. These modifications significantly enhance the performance of our model compared to the baseline method.
At the meta training phase, we sample mini-batches from the training set D_train, where each mini-batch contains a few-shot episode (S_train, Q_train). We obtain the label set associated with the support set S_train and use a lookup dictionary to translate each label id to its natural language annotation. This leads to a set of label annotations S. Then, for an input sequence x = [x_1, x_2, ..., x_l] and its label sequence y = [y_1, y_2, ..., y_l] from the support set S_train, we collect the prompted results p_A = f_A(x, S), p_B = f_B(x, y) and the corresponding masks m_A, m_B. These prompted results are then passed through a pretrained language model PLM, and the average of the outputs from its last four hidden layers is computed as the intermediate representations, which are reduced to the original token positions by the masks, i.e., h_A = PLM(p_A)[m_A == 1] and h_B = PLM(p_B)[m_B == 1]. We then perform a masked weighted average to obtain the token representations h = (1 − ρ) h_A + ρ h_B, where ρ ∈ (0, 1) is a hyperparameter. The token representations for the query set are computed similarly. However, during both training and testing, we only use the option-prefix prompt for the query set, since the ground-truth label sequence is not available at test time. As a result, we do not perform a weighted average for the query set. After obtaining the token representations, two projection layers f_μ, f_Σ are employed to produce two Gaussian embeddings, i.e., the mean and precision parameters of a d-dimensional Gaussian distribution N(μ, Σ) for each token in the query and support sets (Das et al., 2022).
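The masked weighted average itself is a one-liner once both prompt views have been reduced to the original token positions. A minimal sketch, with the caveat that the exact weighting convention (ρ on the label-aware view) is our reading of the ablation description rather than a formula quoted from the paper:

```python
import numpy as np

def combine_representations(h_a, h_b, rho=0.7):
    """Masked weighted average over two prompt views (a sketch).
    h_a, h_b: per-token representations from the option-prefix and
    label-aware prompts, already reduced via their masks to the same
    l x d shape.  rho = 0.7 is the default reported to work well."""
    h_a = np.asarray(h_a, dtype=float)
    h_b = np.asarray(h_b, dtype=float)
    # Assumption: rho weights the label-aware view h_b.
    return (1.0 - rho) * h_a + rho * h_b
```

For the query set only h_a exists, so the model simply uses it directly, matching the "no weighted average for the query set" rule above.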
Given the Gaussian embeddings for samples in both the support and query sets, we compute the distance metrics. Similar to CONTaiNER (Das et al., 2022), for a token x_i from the support set S_train and a token x′_j from the query set Q_train, the distance between the two tokens is defined as the Jensen-Shannon divergence (Fuglede and Topsøe, 2004) of their Gaussian embeddings, i.e., dist(x_i, x′_j) = 1/2 [D_KL(N(μ_i, Σ_i) ∥ N(μ_j, Σ_j)) + D_KL(N(μ_j, Σ_j) ∥ N(μ_i, Σ_i))], where D_KL refers to the Kullback-Leibler divergence.
The similarity between x_i and x′_j is then defined as s(x_i, x′_j) = exp(−dist(x_i, x′_j)). Let S_train, Q_train also denote the collections of all tokens from sentences in S_train, Q_train. For each q ∈ Q_train, the associated loss function is computed as ℓ(q) = − log [ (Σ_{p ∈ X_q} s(q, p) / |X_q|) / Σ_{p ∈ S_train} s(q, p) ], where X_q = {p ∈ S_train | p, q have the same label}. The overall loss function within a mini-batch is the summation of token-level losses, L = Σ_{q ∈ Q_train} ℓ(q).

Nearest Neighbor Inference
At test time, we compute the intermediate representations for tokens from the support and query sets just as we did during the meta training phase. Following CONTaiNER (Das et al., 2022), we no longer use the projection layers f_μ, f_Σ at test time but directly perform nearest neighbor inference using the token representations h. For each query token, according to the Euclidean distance in the representation space, we compute the distance to each entity type as the distance to the nearest support token associated with that entity type, and assign the nearest entity type to the query token. For the K-shot setting where K > 1, we instead use the average distance to the K nearest neighbors associated with each entity type as the distance to that entity type.

Datasets Following Yang and Katiyar (2020), we split OntoNotes 5.0 (Weischedel et al., 2013) into Onto-A, Onto-B, and Onto-C for the tag-set extension formulation. For the domain transfer formulation, we use OntoNotes 5.0 (Weischedel et al., 2013) as the source domain, and CoNLL'03 (Sang and Meulder, 2003), WNUT'17 (Derczynski et al., 2017), I2B2'14 (Stubbs and Uzuner, 2015), and GUM (Zeldes, 2017) as target domains. We also take Few-NERD (Ding et al., 2021), a large-scale human-annotated dataset specially designed for few-shot NER, as one of the tag-set extension tasks. The dataset statistics are presented in Table 3. We adopt the IO tagging scheme, where the label "O" is assigned to non-entity tokens and an entity type label is assigned to entity tokens. We also transform the abbreviated label annotations into plain texts; e.g., [LOC] to [location].
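The nearest-neighbor decision rule above can be sketched as follows; the function and variable names are ours, and token representations are assumed to be plain vectors.

```python
import numpy as np

def predict_entity_type(query_rep, support_reps, support_labels, k=1):
    """Assign the entity type whose k nearest support tokens (by
    Euclidean distance) have the smallest average distance to the
    query token."""
    by_type = {}
    for rep, label in zip(support_reps, support_labels):
        by_type.setdefault(label, []).append(np.asarray(rep, dtype=float))
    q = np.asarray(query_rep, dtype=float)
    best_type, best_dist = None, float("inf")
    for label, reps in by_type.items():
        dists = sorted(np.linalg.norm(q - r) for r in reps)
        d = float(np.mean(dists[:k]))   # average over the k nearest
        if d < best_dist:
            best_type, best_dist = label, d
    return best_type
```

With k = 1 this reduces to plain nearest-neighbor classification over support tokens; the "O" type participates like any other label, which is how non-entities are predicted.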
An existing method that makes use of label semantics, DualEncoder (Ma et al., 2022a), is also reproduced for comparison. The recent prompt-based methods EntLM (Ma et al., 2021) and DemonstrateNER (Lee et al., 2022) are employed as baselines as well. We also compare our model with the recently introduced methods DecomposeMetaNER (Ma et al., 2022b) and ESD (Wang et al., 2022).2 For a fair comparison, we use bert-base-uncased (Devlin et al., 2019) as the PLM encoder and adopt the same pre-trained encoder in all the reproducible experiments of the baseline methods.
Evaluation Protocols Following Das et al. (2022); Yang and Katiyar (2020), we use the low-resource evaluation protocol for the domain transfer tasks and for the tag-set extension tasks Onto-A, Onto-B, and Onto-C. Since Few-NERD (Ding et al., 2021) is specifically designed for episode evaluation, all of our experiments on the Few-NERD dataset are evaluated under the episode evaluation protocol. We follow the N-way K-shot downsampling setting proposed by Ding et al. (2021). For episode evaluation, we conduct 5 different runs of experiments, each of which contains 5000 test episodes.
2 The dataset we used is the Few-NERD arXiv V6 version, while Ma et al. (2022b); Wang et al. (2022) reported the performances in their papers based on an earlier version (i.e., the arXiv V5 version). We take their performances on the latest Few-NERD dataset from their official GitHub repo at https://github.com/microsoft/vert-papers/tree/master/papers/DecomposedMetaNER.
For low-resource evaluation, 10 different runs of support set sampling are performed.

Main Results
The main results of low-resource evaluation and episode evaluation are shown in Tables 1 and 2, respectively. Training details are provided in Appendix A.1. Our method achieves new state-of-the-art (SOTA) results under 16 out of the 18 considered settings. To compare with the previous SOTA across different settings, we collect the relative improvement fractions from all settings and then compute an average and a maximum over these fractions. The result shows that ProML substantially outperforms the previous SOTA by an average of 9.12% and a maximum of 34.51% (from 28% to 37% on GUM 5-shot) in relative gains of micro F1. These results show that our method is effective for few-shot NER tasks.
The generalization difficulties are affected by both the label space and the domain gap.For example, Onto-A, B, and C datasets share the same domain but are constructed to have disjoint label space.CoNLL is a subset of the OntoNotes dataset, so its performance is much better than other domains.
Compared with the other baselines, the performances of prompt-based baselines decrease by a larger margin in the 1-shot settings since they heavily rely on finetuning on support sets.

Ablation Study and Analysis
The ablation study results for prompt choices and averaging weights on all tag-set extension tasks are shown in Tables 4 and 5. We adopt the episode evaluation protocol.

Table 4: Ablation study for ProML. The tuple indicates which prompts are used in the support set and the query set. The variant "A, A" refers to using the option prefix prompt only, in both the support set and the query set. "plain+A (ρ = 0.5), plain" refers to using both the original inputs and option prefix prompts for the support set with an averaging weight ρ = 0.5, while the query set only uses original inputs. "A+B, A" is our ProML method.

With the help of label semantic annotations, the model is able to leverage this information to better learn the representation of each token. In addition, the model does not need to spend much capacity memorizing and inferring the underlying entity types for input tokens, which is crucial in the few-shot setting where labels are scarce.
The performance of the variant "B, plain" is not good, since only the support set leverages label-aware prompts, so there is a gap between the amounts of additional information available for the support and query sets. Thus there is a potential risk that the model only emphasizes these labels in the support inputs while neglecting the semantics of the tokens themselves, causing an overfitting problem. However, after introducing a weighted average, as shown in "plain+B, plain", the performance significantly improves. This observation suggests that the label-aware prompt is useful and that the weighted average mitigates the overfitting by reducing the gap between support and query.
As we will show in the next section, combining the two prompts always leads to the best performance because the model is able to dynamically adapt to the two representations.
Effect of Masked Weighted Average As reported before, a weighted average could reduce the gaps between computing representations for the support set and the query set and make use of the information provided by label-aware prompts.By adjusting the averaging weight ρ, we are able to balance the weights of the two representations for different data distributions.
We compare different averaging settings in Table 4. The option-prefix-only variant "A, A" performs better than "plain+A, plain" because the label option information is provided to both support and query. The performances of "plain+B, plain" and "A+B, A" improve as ρ increases, which is consistent with our motivation. According to Table 4, with a properly selected averaging weight ρ, our ProML outperforms all baselines by a large margin on all tested datasets, which indicates that both prompts contribute to our final performance. Importantly, ρ = 0.7 tends to work well in most of the settings, so it can be used as the default hyperparameter in our framework without tuning.
Visualizing Embedding Space We visualize the token representations from support sets and query sets over several episodes from the test set of Few-NERD INTRA, as Figure 3 shows. We observe that the token representations produced by ProML are concentrated in distinct clusters, with clear decision boundaries between them. In contrast, CONTaiNER seems to learn scattered, less separable features.

Conclusions
We propose ProML, a novel prompt-based metric learning framework for few-shot NER that leverages multiple prompts to guide the model with label semantics. ProML is a general framework consistent with any token-level metric learning method and can be easily plugged into previous methods. We test ProML under 18 settings and find it substantially outperforms previous SOTA results by an average of 9.12% and a maximum of 34.51% in relative gains of micro F1. We perform ablation studies showing that multiple prompt schemas benefit the generalization ability of our model. We present visualizations of the embedding space for unseen entities, showing that, compared with the previous SOTA, ProML learns better representations. We also present case studies and further analysis.

Limitations
Although we discussed different task formulations and evaluation protocols, the few-shot settings are simulated by downsampling following existing works, which is slightly different from real scenarios.
Table 5: Ablation study for ProML (1-shot and 5-shot). The tuple indicates which prompts are used in the support set and the query set. The variant "A, A" refers to using the option prefix prompt only, in both the support set and the query set. "plain+A (ρ = 0.5), plain" refers to using both the original inputs and option prefix prompts for the support set with an averaging weight ρ = 0.5, while the query set only uses original inputs. "A+B, A" is our ProML method. All results in this table are produced by the episode evaluation protocol.

Figure 1 :
Figure 1: An overview of the architecture of our proposed ProML. The prompts associated with the input sequence are passed through a transformer backbone to obtain intermediate representations. A masked weighted average is then applied to produce token-level representations. Following Das et al. (2022), Gaussian embeddings for each token are produced using linear projections. The similarity scores between query tokens and support tokens are then computed according to the distance metric.

Figure 2 :
Figure 2: A manually constructed example to illustrate different prompts. Prompted inputs for the support set are listed at the top, and the tagging results of the query set for 4 prompt combinations are shown at the bottom.
propose the tag-set extension formulation. Starting with a standard NER dataset (D_train, D_test) with label set C, they split C into d parts, namely C_1, C_2, ..., C_d. Then for each label split C_i, a train set D_train^(i) is constructed from D_train by masking the labels in C_i to O (representing non-entities), and the corresponding test set D_test^(i) is constructed from D_test by masking the labels in C \ C_i to O. Domain Transfer Another task formulation is the domain transfer setting. Let D_S be the training set of a standard NER task, and let {D_T^(i)} be the test sets of standard NER tasks from a different domain. The training set D_S is referred to as the source domain, and the test sets {D_T^(i)} constitute various target domains.

Figure 3 :
Figure 3: t-SNE visualization of token representations under the Few-NERD test set for CONTaiNER (left) and ProML (right), where each color represents an entity type (grey for non-entities). We only keep a fraction of 20% of the non-entities to make the visualization clearer.

Table 1 :
Evaluation results of ProML and 8 baseline methods under the low-resource evaluation protocol for both tag-set extension and domain transfer tasks. Results with ⋆ are reported by the original paper, and those with † are reproduced in our experiments. We report the averaged micro-F1 score together with the standard deviation. "Onto-A" denotes the group-A split of the OntoNotes dataset.

Table 2 :
Evaluation results of ProML and 7 baseline methods under the episode evaluation protocol for the Few-NERD dataset. Results with ⋆ are reported by the original paper, and those with † are reproduced in our experiments. We report the averaged micro-F1 score together with the standard deviation.

Table 3 :
Statistics of Datasets

Table 7 :
Case study: an illustration of some cases from the WNUT test set. There are 6 entity types: person (PER), location (LOC), product (PRO), creative work (CW), miscellaneous (MIS), group (GRO). Here the blue color represents correct predictions, while the red color represents mistakes.