Few-shot Named Entity Recognition with Supported and Dependent Label Representations

We explore the problem of few-shot named entity recognition (NER) by introducing two ideas to improve label representations. Recently, the use of token representations with a distance metric has been shown to be effective in few-shot NER, and we take the approach of using label representations along with token representations. Firstly, we add support examples to a label name (e.g., "person; example: Federic Krupp, Gao, Honecker, Bush, Deverow") when obtaining a label representation. Secondly, we estimate transition scores among labels with a bilinear function over label representations. The proposed approach is evaluated on 4 open few-shot NER datasets, and we find that it can improve the performance of one-stage few-shot NER.


Introduction
The advance of large language models (e.g., BERT, GPT) has made it possible to tackle some natural language understanding tasks with few training samples. One such task that has especially gathered attention from researchers is named entity recognition (NER), where a simple nearest neighbor classification using an NER model and a distance metric has been shown to achieve moderate performance in a few-shot setting (Yang and Katiyar, 2020). Ma et al. (2022a) proposed a related but slightly different approach where they prepare an additional BERT to encode labels into representations.
We extend the idea of using label representations to improve few-shot NER. A simple approach to obtaining label representations is to encode just the label names (Ma et al., 2022a); we add randomly sampled label examples to the label names and encode the combined label names and examples to improve the label representations. Figure 1 illustrates our approach of using label examples as the supports of label names. In this approach, an input text and all labels are encoded with a dual encoder architecture.
For each token, similarities against all labels are calculated to decide a label. The extension to add label examples may seem like a naive way to improve label representations, but it follows previous findings on obtaining fine-grained label representations. Firstly, in the context of zero-shot NER, Aly et al. (2021) explored the effectiveness of using label descriptions to encode labels. They found that using the label name is a strong baseline to represent a label and that a label description can further improve zero-shot NER performance depending on its quality. One downside of using label descriptions is that fine-grained descriptions are not always available in an NER dataset. Secondly, using multiple examples to estimate a label is the well-known approach of Prototypical Networks (Snell et al., 2017). In a typical Prototypical Networks setting, a label prototype is represented as an average of multiple examples.
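As a rough illustration of the Prototypical Networks idea mentioned above (a minimal sketch, not the authors' code; the helper names `prototypes` and `classify` are hypothetical), a label prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype:

```python
import numpy as np

def prototypes(embeddings, labels):
    """Compute one prototype per label as the mean of its support embeddings."""
    return {y: embeddings[labels == y].mean(axis=0) for y in np.unique(labels)}

def classify(query, protos):
    """Assign the label whose prototype is nearest in Euclidean distance."""
    return min(protos, key=lambda y: np.linalg.norm(query - protos[y]))

# Toy 2-D example: two labels with two support points each.
emb = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.2, 1.0]])
lab = np.array([0, 0, 1, 1])
protos = prototypes(emb, lab)
pred = classify(np.array([0.1, 0.1]), protos)  # nearest prototype is label 0
```

In a metric-learning setting, the Euclidean distance here could equally be replaced by any similarity function, which is the role `sim` plays in the dual encoder model described later.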
We further extend the approach of using label representations in few-shot NER by estimating the label dependency between two labels. A straightforward approach to modeling label dependencies in NER is to add a Conditional Random Field (CRF; Lafferty et al., 2001) layer after a token encoder (Lample et al., 2016). However, this CRF layer is known to be difficult to transfer since it directly learns a K × K transition matrix over K labels (Yang and Katiyar, 2020; Hou et al., 2020). We instead estimate a transition score between two labels with a trainable function which maps two label representations into a single scalar score. We show that this estimation works quite effectively in the dual encoder architecture.
In summary, the contributions of this paper are the following: 1. We propose an approach to sample support examples to improve label representations for few-shot NER and confirm its effectiveness on 4 few-shot NER datasets.

One-stage Few-shot NER
Like previous token-level approaches (Yang and Katiyar, 2020; Ma et al., 2022a), our approach is in the paradigm of one-stage few-shot NER, where named entities are recognized simply as the labels of input tokens.

Two-stage Few-shot NER
Recently, the paradigm of two-stage few-shot NER (Wang et al., 2022a), where entity spans are extracted in the first stage and their types are recognized in the second stage, has been investigated to extend one-stage few-shot NER. In this paradigm, span or entity prototypes are defined (Wang et al., 2022b; Ji et al., 2022; Ma et al., 2022b; Wang et al., 2022a) to achieve stronger performance at the cost of an additional stage.

Label Representation in NER
The use of label descriptions has also been explored in low-resource NER. Aly et al. (2021) explored the effect of label descriptions in zero-shot NER, and Wang et al. (2021) utilized label descriptions along with entity representations in few-shot and zero-shot NER. Ma et al. (2022a) showed that simply using label names is quite effective in few-shot NER. Our approach extends these ideas by using support samples to improve label representations in few-shot NER.

Dual Encoder Model
We follow the dual encoder approach taken by Aly et al. (2021) and Ma et al. (2022a) as the base architecture of our model. As shown in Figure 1, we prepare an encoder for input tokens and an encoder for labels. Given input tokens u_I, the tokens are encoded with a language model as v = LM_token(u_I). The tokens of the given labels u_L are similarly encoded with another language model as m = LM_label(u_L), and the [CLS] representations are pooled as the label representations l. For each token representation v_n, similarities against all label representations are calculated with a similarity function as o_n = sim(v_n, l). These token-label similarities are used to calculate a loss against the true labels. As done in previous studies, we first pre-finetune this model on a large-scale NER dataset (e.g., OntoNotes 5.0) and then fine-tune it on a few-shot NER dataset.
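The scoring step of the dual encoder can be sketched as follows (a minimal sketch with random vectors standing in for the two BERT encoders; `label_scores` is a hypothetical helper name, and dot product is used as `sim` as in the experiments):

```python
import numpy as np

def label_scores(token_reps, label_reps):
    """Dot-product similarities o_n = sim(v_n, l) for every token-label pair.

    token_reps: (N, D) token representations v from the token encoder.
    label_reps: (K, D) pooled [CLS] representations l from the label encoder.
    Returns an (N, K) score matrix; the argmax over K predicts a label per token.
    """
    return token_reps @ label_reps.T

# Toy example: N = 3 tokens, K = 2 labels, D = 4 embedding size.
rng = np.random.default_rng(0)
v = rng.normal(size=(3, 4))
l = rng.normal(size=(2, 4))
o = label_scores(v, l)
pred = o.argmax(axis=1)  # one label index per token
```

At training time the (N, K) score matrix would be fed to a cross-entropy loss against the true labels; at inference time the scores serve as the emission scores for Viterbi decoding.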

Support Example Sampling
The dual encoder model encodes label tokens to obtain label representations. Our idea improves label representations by extending the label tokens with support examples. Algorithm 1 shows the process to sample support examples S for an input text x and a label y. In the case of the PER label in Figure 1, n = 5 examples (Federic Krupp, Gao, Honecker, Bush and Deverow) are sampled from the entire training data X. These examples are then combined with the label name person using the fixed text snippet "; example:". One exceptional label that needs to be considered in this sampling is the O label. Since O is not a label for a named entity and does not have an entity boundary, we decided to sample a non-entity word from the text x instead.
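The sampling procedure described above can be sketched as follows (a minimal sketch of the idea behind Algorithm 1, not the authors' code; the data layout and helper names are assumptions):

```python
import random

def sample_supports(label, train_mentions, x_tokens, x_entity_spans, n=5, seed=0):
    """Sample n support examples S for label y from the training data X.

    train_mentions: dict mapping a label to its entity mention strings in X.
    For the O label, a single non-entity word is sampled from the input text x
    instead, since O has no entity boundary.
    """
    rng = random.Random(seed)
    if label == "O":
        entity_positions = {i for s, e in x_entity_spans for i in range(s, e)}
        non_entity = [w for i, w in enumerate(x_tokens) if i not in entity_positions]
        return [rng.choice(non_entity)]
    return rng.sample(train_mentions[label], n)

def label_text(name, supports):
    """Combine a label name with supports, e.g. 'person; example: Gao, Bush'."""
    return f"{name}; example: {', '.join(supports)}"
```

The resulting string (e.g., "person; example: Federic Krupp, Gao, Honecker, Bush, Deverow") is what the label encoder receives in place of the bare label name.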

Estimation of Transition Score
Lample et al. (2016) have shown that a CRF layer can be added on top of a token encoder to improve NER. However, the few-shot transfer of this CRF layer is known to be difficult since the prediction score is defined as s(x, y) = Σ_i A_{y_i, y_{i+1}} + Σ_i P_{i, y_i}, where i is a token index, A is a transition matrix and P is an emission matrix. A typical approach to realize A is to prepare a trainable K × K matrix when there are K labels. We estimate the transition score A_{y_i, y_{i+1}} between two labels simply with a bilinear function as A_{y_i, y_{i+1}} = l_{y_i}^T W l_{y_{i+1}} + b, where W is a trainable weight matrix of size D × D, b is a bias, l_y is the representation of label y, and D is the embedding size of the label encoder. We call a score estimated with this approach the Bilinear-transition CRF (BCRF) score.
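A minimal sketch of the BCRF transition score, assuming the bilinear form l_{y_i}^T W l_{y_{i+1}} + b stated above (variable names are illustrative and W, b would be trainable parameters in the actual model):

```python
import numpy as np

def bilinear_transition(l_i, l_j, W, b):
    """A single BCRF transition score: l_i^T W l_j + b, for D-dim label reps."""
    return float(l_i @ W @ l_j + b)

def transition_matrix(label_reps, W, b):
    """All K x K transition scores at once from (K, D) label representations."""
    return label_reps @ W @ label_reps.T + b

# Toy example: K = 3 labels, D = 4 embedding size.
rng = np.random.default_rng(0)
L = rng.normal(size=(3, 4))   # label representations from the label encoder
W = rng.normal(size=(4, 4))   # trainable D x D weight matrix
b = 0.1                       # trainable bias
A = transition_matrix(L, W, b)
```

Because A is computed from label representations rather than stored as a free K × K matrix, it transfers to a new label set as soon as new label representations are available.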
The estimation of transition scores has been investigated in a more resource-rich setting by Hu et al. (2020). Our BCRF score takes a simpler estimation approach since we focus on a resource-poor few-shot setting.

Model Configuration
We first pre-finetuned the dual encoder model (§3.1) on a large-scale NER dataset: the training section of Few-NERD is used for Few-NERD, and OntoNotes 5.0 (Weischedel et al., 2013) is used for CoNLL-2003, WNUT-2017 and i2b2-2014, following Yang and Katiyar (2020). BERT base (cased) is used as the language models, dot product is used as the similarity function of the model, and the IO scheme is used as the tagging scheme of NER. (The values with parentheses in Table 2 are the scores with the downsampling algorithm of Hou et al., 2020.) The Viterbi algorithm is used to decide the best label sequence as in the decoding process of CRF. Further details of the training configuration, label configuration and dataset statistics are shown in §A.1, §A.2 and §A.3, respectively.
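The Viterbi decoding step can be sketched as follows (a minimal sketch, not the authors' implementation; it maximizes the CRF score s(x, y) = Σ_i A[y_i, y_{i+1}] + Σ_i P[i, y_i] from §3.3 over emission scores P and transition scores A):

```python
import numpy as np

def viterbi(emissions, transitions):
    """Return the best label sequence under s(x, y) = sum A + sum P.

    emissions: (N, K) token-label scores P from the dual encoder.
    transitions: (K, K) transition scores A, e.g. from BCRF.
    """
    n, k = emissions.shape
    score = emissions[0].copy()           # best score ending in each label
    back = np.zeros((n, k), dtype=int)    # backpointers
    for i in range(1, n):
        # cand[p, q]: best path ending in label p at i-1, moving to q at i
        cand = score[:, None] + transitions + emissions[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]

# Toy example: 2 tokens, 2 labels. Emissions alone would predict [0, 1],
# but a strong 0 -> 0 transition score flips the decision to [0, 0].
P = np.array([[1.0, 0.0], [0.0, 1.0]])
A = np.array([[5.0, 0.0], [0.0, 0.0]])
best = viterbi(P, A)
```

This is where BCRF affects inference: the transition scores can overrule locally best token-label similarities.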

The Effects of Support Examples and BCRF
We examined the effects of the two ideas with an ablation study on Few-NERD. The -SupEx and -BCRF scores in Table 1 show the results of the ablation study. BCRF and support examples are shown to be effective in all 8 settings. We further confirmed the performance of BCRF without the label encoder by using randomly initialized label embeddings as in Hu et al. (2020) (-SupEx, +BCRF[R]). The low performance of this setting indicates the strength of BCRF combined with the label encoder.

The Effects of n Support Examples
The label encoder encodes n support examples to obtain label representations. In the experiment (§4.3), we chose n = 5 so that our approach can consider enough examples in the 5~10 shot settings of Few-NERD. We additionally tried n = 1, 3 on Few-NERD and found the results to be quite stable regardless of the value of n.

Conclusion
We proposed two ideas to improve label representations that can be effective for few-shot NER. These ideas achieve strong performance compared with previous one-stage approaches and comparable performance to some of the two-stage approaches. As future work, we would like to explore whether these ideas can be applied to span representations, which have shown superiority over the simpler token representations explored in this study.

Limitations
Our approach has shown strong performance on 4 widely used few-shot NER datasets. Additional datasets and transfer settings have been tested in previous studies (Fritzler et al., 2019; Yang and Katiyar, 2020; Wang et al., 2021; Ma et al., 2022a; Das et al., 2022; Ma et al., 2022b; Ji et al., 2022), and our approach can be suboptimal on them. The results of the few-shot domain transfer settings on CoNLL-2003, WNUT-2017 and i2b2-2014 depend on randomly sampled few-shot samples. Since these random samples differ between our approach and previous studies, the comparison is not a fair comparison in an exact manner. This variance in random samples is alleviated in Few-NERD since the episode evaluation uses 5,000 pre-sampled episodes. The evaluation of our approach requires a certain amount of computational resources, especially on Few-NERD. Even though a single episode evaluation can be done quite quickly (e.g., 3 minutes), the full evaluation on Few-NERD takes 3 × 5000 × 8 minutes ≈ 2,000 hours on a single GPU.
The datasets cover multiple domains, including broadcast news (OntoNotes 5.0), broadcast conversation (OntoNotes 5.0), telephone conversation (OntoNotes 5.0), web data (OntoNotes 5.0), social media (WNUT-2017) and clinical narratives (i2b2-2014). Protected health information in the clinical narratives is de-identified, and we made an agreement with the data provider on research and development use of the data.

A.2 Label Configuration
The model uses label names along with support examples to obtain label representations (§3.2). We used the label names defined in Ma et al. (2022a) for CoNLL-2003, WNUT-2017 and i2b2-2014. For example, "person" is used as the label name of "PER" in CoNLL-2003. For Few-NERD, we combined the coarse type and the fine type of a named entity with a hyphen; the type inventory is available in Table 8 of Ding et al. (2021). For example, "Location-GPE" is used for a named entity with the coarse type "Location" and the fine type "GPE". We additionally prepared "start of sentence" and "end of sentence" label names for BCRF, which are used for the first token and the last token of a sentence, respectively.

A.3 Dataset Statistics
Table 3 shows the number of sentences included in OntoNotes 5.0, Few-NERD, CoNLL-2003, WNUT-2017 and i2b2-2014. For OntoNotes 5.0, we used the splits of CoNLL-2012, following the setting of Yang and Katiyar (2020). All datasets are in English. All datasets are designed to evaluate NER, and Few-NERD is specifically designed for few-shot settings. Note that the actual training splits in the experiment (§4) are samples of the training split in Table 3.

BERT base (cased) is used as the language models. The Transformers library and PyTorch are used to implement the proposed model. The number of support examples is set to n = 5. An NVIDIA Tesla V100 with 32GB memory is used to train and evaluate the proposed model. The training time of the proposed model is short: 1-8 minutes for a 100-epoch fine-tuning on a target dataset.

Figure 2 :
Figure 2: The F1 scores on Few-NERD with a varying number of support examples.
Figure 1: The overview of our approach to encode label representations by adding randomly sampled support examples. Encoder is a large language model such as BERT, sim is a similarity function among representations, v_n is the token representation of the n-th token, l is the label representations and BCRF is the Bilinear-transition CRF (§3.3).
Algorithm 1 Support example sampling

Table 1 :
The episode evaluation F1 scores on Few-NERD over 5,000 episodes. The shaded models are the proposed model with alternative configurations (§5): -SupEx is without support examples, -BCRF is trained with cross-entropy loss, +BCRF[R] is trained with BCRF and randomly initialized label embeddings. Bold values are the best scores and underlined values are the second-best scores for each N-way K-shot setting.

Table 2 :
The F1 scores on CoNLL-2003, WNUT-2017 and i2b2-2014. The scores of the proposed models are averages with standard deviations over 10 different K-shot samples. Bold values are the best scores and underlined values are the second-best scores for each K-shot setting with the greedy sampling algorithm.

The standard deviation of the F1 score was largest on INTRA 10-way 1~2 shot, with a value of 47.27 ± 0.33. The more detailed effect of the number of support examples can be confirmed in §A.4.

Table 3 :
The number of sentences included in the datasets of the experiment ( §4).