Template-Based Named Entity Recognition Using BART

There is recent interest in few-shot NER, where the low-resource target domain has a different label set from a resource-rich source domain. Existing methods use a similarity-based metric, but cannot make full use of knowledge transfer through the NER model parameters. To address this issue, we propose a template-based method for NER, treating NER as a language model ranking problem in a sequence-to-sequence framework, where the original sentence and a statement template filled by a candidate named entity span are regarded as the source sequence and the target sequence, respectively. For inference, the model classifies each candidate span based on the corresponding template scores. Our experiments demonstrate that the proposed method achieves a 92.55% F1 score on CoNLL03 (a rich-resource task), and outperforms fine-tuned BERT by 10.88%, 15.34%, and 11.73% F1 score on MIT Movie, MIT Restaurant, and ATIS (low-resource tasks), respectively.


Introduction
Named entity recognition (NER) is a fundamental task in natural language processing, which identifies mention spans in input text according to pre-defined entity categories (Tjong Kim Sang and De Meulder, 2003), such as location, person, organization, etc. The current dominant methods use a sequential neural network such as BiLSTM (Hochreiter and Schmidhuber, 1997) or BERT (Devlin et al., 2019) to represent the input text, with a softmax (Chiu and Nichols, 2016; Strubell et al., 2017; Cui and Zhang, 2019) or CRF (Lample et al., 2016; Ma and Hovy, 2016; Luo et al., 2020) output layer that assigns named entity tags (e.g., organization, person and location) or non-entity tags to each input token. Such a system is illustrated in Figure 2(a).
Neural NER models require large amounts of labeled training data, which are available for certain domains such as news, but scarce in most other domains. Ideally, it would be desirable to transfer knowledge from the resource-rich news domain so that a model can be used in target domains with only a few labeled instances. In practice, however, a challenge is that entity categories can differ across domains. As shown in Figure 1, the system is required to identify location and person in the news domain, but character and title in the movie domain. Both softmax and CRF layers require a consistent label set between training and testing. As a result, given a new target domain, the output layer needs adjustment and training must be conducted again using both source-domain and target-domain data, which can be costly.
A recent line of work investigates the setting of few-shot NER by using distance metrics (Wiseman and Stratos, 2019; Yang and Katiyar, 2020; Ziyadi et al., 2020). The main idea is to train a similarity function based on instances in the source domain, and then make use of the similarity function in the target domain as a nearest neighbor criterion for few-shot NER.
Compared with traditional methods, distance-based methods largely reduce the domain adaptation cost, especially for scenarios where the number of target domains is large. Their performance under standard in-domain settings, however, is relatively weak. In addition, their domain adaptation power is limited in two aspects. First, labeled instances in the target domain are used to find the best hyper-parameter settings for heuristic nearest neighbor search, but not to update the network parameters of the NER model. While less costly, these methods cannot improve the neural representation of cross-domain instances. Second, these methods rely on similar textual patterns between the source domain and the target domain. This strong assumption may hinder model performance when the target-domain writing style differs from the source domain.
To address these issues, we investigate a template-based method that exploits the few-shot learning potential of generative pre-trained language models for sequence labeling. Specifically, as shown in Figure 2, BART (Lewis et al., 2020) is fine-tuned with pre-defined templates filled by the corresponding labeled entities. For example, we can define templates such as "<candidate_span> is a <entity_type> entity", where entity_type can be "person", "location", etc. Given the sentence "ACL will be held in Bangkok", where "Bangkok" has the gold label "location", we can train BART using the filled template "Bangkok is a location entity" as the decoder output for the input sentence. For non-entity spans, we use the template "<candidate_span> is not a named entity", so that negative output sequences can also be sampled. During inference, we enumerate all possible text spans in the input sentence as named entity candidates, classifying them into entities or non-entities based on BART scores for the corresponding templates.
The proposed method has three advantages. First, due to the good generalization ability of pre-trained models (Brown et al., 2020), the network can effectively leverage labeled instances in the new domain for fine-tuning. Second, compared with distance-based methods, our method is more robust even if the target domain and source domain have a large gap in writing style. Third, compared with traditional methods (a pre-trained model with a softmax/CRF output layer), our method can be applied to arbitrary new categories of named entities without changing the output layer, and therefore allows continual learning (Lin et al., 2020).
We conduct experiments in both resource-rich and few-shot settings. Results show that our method gives competitive results with state-of-the-art label-dependent approaches on the news dataset CoNLL03 (Tjong Kim Sang and De Meulder, 2003), and significantly outperforms few-shot methods such as Wiseman and Stratos (2019) and Ziyadi et al. (2020) in few-shot settings. To the best of our knowledge, we are the first to employ a generative pre-trained language model to address a few-shot sequence labeling problem. We release our code at https://github.com/Nealcly/templateNER.

Related Work
Neural methods have given competitive performance in NER. Some methods (Chiu and Nichols, 2016; Strubell et al., 2017) treat NER as a local classification problem at each input token, while others use a CRF (Ma and Hovy, 2016) or a sequence-to-sequence framework (Liu et al., 2019). Cui and Zhang (2019) and Gui et al. (2020) use a label attention network and Bayesian neural networks, respectively. Yamada et al. (2020) use entity-aware pre-training and obtain state-of-the-art results on NER. These approaches are similar to ours in the sense that parameters can be tuned in supervised learning; unlike our method, however, they are designed for prescribed named entity types, which makes their domain adaptation costly when new few-shot entity types arise.
Our work is motivated by distance-based few-shot NER, which aims to minimize domain adaptation cost. Wiseman and Stratos (2019) copy token-level labels from nearest neighbors by retrieving a list of labeled sentences. Yang and Katiyar (2020) improve on Wiseman and Stratos (2019) by using a Viterbi decoder to capture label dependencies estimated from the source domain. Ziyadi et al. (2020) follow a two-step approach (Lin et al., 2019; Xu et al., 2017), which first detects span boundaries and then recognizes entity types by comparing similarity with labeled instances. While not updating the network parameters for NER, these methods rely on similar named entity patterns between the source domain and the target domain. One exception is work that investigates noisy supervised pre-training and self-training using external noisy web NER data; compared with that method, our method does not rely on self-training on external data, yet yields better results.

Figure 2: (a) Traditional sequence labeling method. (b) Inference of the template-based method: candidate spans of "ACL will be held in Bangkok" are scored against entity and non-entity templates (e.g., "Bangkok is a location entity": 0.8; "Bangkok is a person entity": 0.3; "ACL will is not an entity": 0.9). (c) Training of the template-based method. The template used here is "<x_{i:j}> is a <y_k> entity".
There is a line of work using templates to solve natural language understanding tasks. The basic idea is to leverage information from pre-trained models by defining specific sentence templates in a language modeling task. Brown et al. (2020) first use prompts for few-shot learning in text classification tasks. Schick and Schütze (2020) rephrase inputs as cloze questions for text classification, and subsequent work extends this approach by automatically generating label words and templates. Petroni et al. (2019) extract relations between entities from BERT by constructing cloze-style templates. Templates have also been used to construct auxiliary sentences, transforming aspect-based sentiment analysis into a sentence-pair classification task. Our work is in line with exploiting pre-trained language models for template-based NLP. While previous work treats sentence-level tasks as masked language modeling or uses language models to score a whole sentence, our method uses a language model to assign a score to each span given an input sentence. To our knowledge, we are the first to apply a template-based method to sequence labeling.

Background
We give the formal definition of few-shot named entity recognition in Section 3.1 and describe traditional sequence labeling methods in Section 3.2.

Few-Shot Named Entity Recognition
We denote a rich-resource NER dataset as R = {(X^R_1, L^R_1), ..., (X^R_I, L^R_I)} and a low-resource NER dataset as P = {(X^P_1, L^P_1), ..., (X^P_J, L^P_J)}, where X is an input token sequence and L is the corresponding gold label sequence. The number of labelled sequence pairs in the low-resource dataset is quite limited compared with the rich-resource dataset (i.e., J ≪ I). For the low-resource domain, the target label vocabulary V^P (∀ l^P_i, l^P_i ∈ V^P) might be different from the source vocabulary V^R (Figure 1). Our goal is to train an accurate and robust NER model with both R and P for the low-resource domain.

Traditional Sequence Labeling Methods
Traditional methods (Figure 2(a)) regard NER as a sequence labeling problem, where each output label consists of a sequence segmentation component, namely B (beginning of an entity), I (internal word of an entity) or O (not an entity), together with an entity type tag such as "person" or "location". For example, the tag "B-person" indicates the first word of a person entity, and the tag "I-location" indicates a non-initial token of a location entity. Formally, given x_{1:n} = {x_1, ..., x_n}, the sequence labeling method first computes

h_{1:n} = ENCODER(x_{1:n}),

where h_c ∈ R^{d_h} and d_h is the hidden dimension of the encoder. The label distribution of each token x_c is then calculated as

p(y_c | x_c) = softmax(W^{V^R} · h_c + b^{V^R}),    (1)

where W^{V^R} ∈ R^{d_h × |V^R|} and b^{V^R} ∈ R^{|V^R|} are model parameters, and the label estimation l̂_c for x_c is the highest-scoring label. We use BERT (Devlin et al., 2019) and BART (Lewis et al., 2020) as our ENCODER to learn the sequence representation.
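To make the baseline concrete, the following is a minimal PyTorch sketch of the token-level classification head of Eq 1; the encoder checkpoint and the BIO label set are illustrative placeholders rather than our exact experimental configuration.

```python
# Minimal sketch of the traditional sequence labeling baseline (Eq 1).
# The encoder checkpoint and the BIO label set below are illustrative placeholders.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # example label vocabulary V^R

class SequenceLabeler(nn.Module):
    def __init__(self, encoder_name="bert-base-cased", num_labels=len(LABELS)):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        d_h = self.encoder.config.hidden_size
        # W^{V^R} and b^{V^R} of Eq 1; these must be re-initialized for a new label set.
        self.classifier = nn.Linear(d_h, num_labels)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(h)                # (batch, seq_len, |V^R|)
        return torch.log_softmax(logits, dim=-1)   # per-token label distribution

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = SequenceLabeler()
batch = tokenizer("ACL will be held in Bangkok", return_tensors="pt")
log_probs = model(batch["input_ids"], batch["attention_mask"])
print(log_probs.argmax(-1))  # predicted label index for each (sub)token
```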
A standard method for NER domain adaptation is to first train a model using the source-domain data R, and then further tune the model using target-domain instances P, if available. However, since the label sets can differ, the output layer can also differ across domains, so W^{V^P} and b^{V^P} have to be trained from scratch using P. This method does not fully exploit label associations (e.g., the association between "person" and "character"), nor can it be directly used in zero-shot cases, where no labeled data in the target domain is available.

Template-Based Method
We consider NER as a language model ranking problem under a seq2seq framework. The source sequence of the model is an input text X = {x_1, ..., x_n}, and the target sequence T_{y_k, x_{i:j}} = {t_1, ..., t_m} is a template filled with a candidate text span x_{i:j} and an entity type y_k. We first introduce how to create templates in Section 4.1, and then show the inference and training details in Section 4.2 and Section 4.3, respectively.

Template Creation
We manually create templates, each of which has one slot for a candidate_span and another slot for the entity_type label. We define a one-to-one mapping function that transfers the label set L = {l_1, ..., l_|L|} (e.g., l_k = "LOC") to a natural word set Y = {y_1, ..., y_|L|} (e.g., y_k = "location"), and use the words to define entity templates T^+_{y_k} (e.g., "<candidate_span> is a location entity."). In addition, we create a non-entity template T^- for spans that are not named entities (e.g., "<candidate_span> is not a named entity."). This way, we obtain a list of templates, one for each entity type plus the non-entity template. In Figure 2(c), the template T^+_{y_k, x_{i:j}} is "<x_{i:j}> is a <y_k> entity" and T^-_{x_{i:j}} is "<x_{i:j}> is not a named entity", where x_{i:j} is a candidate text span.
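As a minimal illustration, the snippet below fills the two kinds of templates for a candidate span; the label-to-word mapping is an example of the one-to-one mapping function, and only the top-performing template of Table 1 plus the non-entity template are shown.

```python
# Sketch of template creation (Section 4.1). The label-to-word mapping and the
# template strings are illustrative examples.
LABEL_WORDS = {"PER": "person", "LOC": "location", "ORG": "organization", "MISC": "miscellaneous"}

def entity_template(span, label):
    """Fill the positive template T^+ for a candidate span and entity type."""
    return f"{span} is a {LABEL_WORDS[label]} entity"

def non_entity_template(span):
    """Fill the non-entity template T^-."""
    return f"{span} is not a named entity"

print(entity_template("Bangkok", "LOC"))   # Bangkok is a location entity
print(non_entity_template("ACL will"))     # ACL will is not a named entity
```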

Inference
We first enumerate all possible spans in the sentence {x_1, ..., x_n} and fill them into the prepared templates. For efficiency, we restrict spans to n-grams of one to eight words, so 8n templates are created for each sentence. Then, we use the fine-tuned pre-trained generative language model to assign a score to each filled template T_{y_k, x_{i:j}} = {t_1, ..., t_m}, formulated as

f(T_{y_k, x_{i:j}}) = ∑_{c=1}^{m} log p(t_c | t_{1:c-1}, X).    (2)

We calculate a score f(T^+_{y_k, x_{i:j}}) for each entity type and a score f(T^-_{x_{i:j}}) for the non-entity type by employing any pre-trained generative language model to score templates, and assign the text span x_{i:j} the entity type with the largest score. In this paper, we take BART as the pre-trained generative language model.
Our datasets do not contain nested entities. If two spans overlap and are assigned different labels during inference, we choose the span with the higher score as the final decision, to avoid prediction contradictions. For instance, given the sentence "ACL will be held in Bangkok", the n-grams "in Bangkok" and "Bangkok" could be labeled "ORG" and "LOC", respectively, by the local scoring function f(·). In this case, we compare f(T^+_{ORG, "in Bangkok"}) and f(T^+_{LOC, "Bangkok"}), and choose the label with the larger score as the global decision.
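The following sketch illustrates inference with Hugging Face BART, assuming that the span length is capped at eight words and that the summed log-probability of Eq 2 can be recovered from the model's mean cross-entropy loss; the checkpoint name and label words are placeholders, and this is one possible implementation rather than the released code.

```python
# Sketch of inference (Section 4.2): enumerate candidate spans, fill templates,
# and score them with BART. Checkpoint name and label words are illustrative.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").eval()
LABEL_WORDS = {"PER": "person", "LOC": "location", "ORG": "organization"}

def template_score(sentence, template):
    """f(T) = sum_c log p(t_c | t_{1:c-1}, X), cf. Eq 2."""
    enc = tokenizer(sentence, return_tensors="pt")
    tgt = tokenizer(template, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=tgt)
    # out.loss is the mean cross-entropy over target tokens; multiplying by the
    # target length recovers the summed log-probability (up to special tokens).
    return -out.loss.item() * tgt.size(1)

def classify_span(sentence, span):
    """Score every entity template plus the non-entity template for one span."""
    scores = {lab: template_score(sentence, f"{span} is a {word} entity")
              for lab, word in LABEL_WORDS.items()}
    scores["O"] = template_score(sentence, f"{span} is not a named entity")
    best = max(scores, key=scores.get)
    return best, scores[best]

sentence = "ACL will be held in Bangkok"
tokens = sentence.split()
# Enumerate spans of one to eight words and label each one.
for i in range(len(tokens)):
    for j in range(i + 1, min(i + 8, len(tokens)) + 1):
        span = " ".join(tokens[i:j])
        print(span, classify_span(sentence, span))
```

Overlapping spans with conflicting labels can then be resolved by keeping the higher-scoring one, as described above.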

Training
Gold entities are used to create templates during training. Suppose that the entity type of x_{i:j} is y_k. We fill the text span x_{i:j} and the entity type y_k into T^+ to create a target sequence T^+_{y_k, x_{i:j}}. Similarly, if x_{i:j} is a non-entity text span, the target sequence T^-_{x_{i:j}} is obtained by filling x_{i:j} into T^-. We use all gold entities in the training set to construct (X, T^+) pairs, and additionally create negative samples (X, T^-) by randomly sampling non-entity text spans. The number of negative pairs is 1.5 times that of positive pairs. Given a sequence pair (X, T), we feed the input X to the encoder of BART, obtaining hidden representations of the sentence

h^enc = ENCODER(x_{1:n}).    (3)

At the c-th step of the decoder, h^enc and the previous output tokens t_{1:c-1} are taken as inputs, yielding a representation using attention (Vaswani et al., 2017):

h^dec_c = DECODER(h^enc, t_{1:c-1}).    (4)

The conditional probability of the word t_c is defined as

p(t_c | t_{1:c-1}, X) = softmax(W_lm · h^dec_c + b_lm),    (5)

where W_lm ∈ R^{d_h × |V|} and b_lm ∈ R^{|V|}, and |V| is the vocabulary size of pre-trained BART. The cross-entropy between the decoder's output and the gold template is used as the loss function.
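As a rough illustration, the sketch below builds (X, T) training pairs with 1.5 negative samples per positive pair and fine-tunes BART with the cross-entropy loss over the decoder output of Eq 5; the data format, learning rate and label words are illustrative assumptions, not the exact training recipe.

```python
# Sketch of training-pair construction and fine-tuning (Section 4.3). The data
# format, learning rate and label words are illustrative assumptions; the ratio
# of 1.5 negative (non-entity) pairs per positive pair follows the paper.
import random
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
LABEL_WORDS = {"LOC": "location", "ORG": "organization"}

def build_pairs(tokens, gold_entities, neg_ratio=1.5, max_span_len=8):
    """gold_entities: list of (start, end, label) spans with exclusive end index."""
    sentence = " ".join(tokens)
    pairs = [(sentence, f"{' '.join(tokens[i:j])} is a {LABEL_WORDS[lab]} entity")
             for i, j, lab in gold_entities]
    # Negative samples: random non-entity spans, 1.5x the number of positives.
    gold_bounds = {(i, j) for i, j, _ in gold_entities}
    candidates = [(i, j) for i in range(len(tokens))
                  for j in range(i + 1, min(i + max_span_len, len(tokens)) + 1)
                  if (i, j) not in gold_bounds]
    n_neg = min(len(candidates), int(neg_ratio * len(gold_entities)))
    for i, j in random.sample(candidates, n_neg):
        pairs.append((sentence, f"{' '.join(tokens[i:j])} is not a named entity"))
    return pairs

pairs = build_pairs("ACL will be held in Bangkok".split(), [(5, 6, "LOC")])
for src, tgt in pairs:
    enc = tokenizer(src, return_tensors="pt")
    labels = tokenizer(tgt, return_tensors="pt").input_ids
    # Cross-entropy between the decoder output (Eq 5) and the gold template.
    loss = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```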

Transfer Learning
Given a new domain P with only few-shot instances, the label set L_P (Section 4.1) can be different from the one used for training the NER model. We thus fill the templates with the new domain's label set for both training and testing, with the rest of the model and algorithms unchanged. In particular, given a small number of (X^P, T^P) pairs, we create sequence pairs with the method described above for the low-resource domain, and fine-tune the NER model trained on the rich-resource domain. This process has low cost, yet can effectively transfer label knowledge. Because the output of our method is a natural sentence instead of specific labels, both the resource-rich and low-resource label vocabularies are subsets of the pre-trained language model vocabulary (V^R, V^P ⊆ V). This allows our method to make use of label correlations such as "person" and "character", and "location" and "city", enhancing the effect of transfer learning across domains.
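A minimal sketch of this transfer step, assuming a hypothetical checkpoint path for a model fine-tuned on the rich-resource domain and an illustrative target-domain label mapping, is:

```python
# Sketch of cross-domain transfer: reuse the model fine-tuned on the rich-resource
# domain and simply fill templates with the new domain's label words.
# The checkpoint path and label mapping below are hypothetical.
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("checkpoints/template-bart-conll03")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
NEW_LABEL_WORDS = {"ACTOR": "actor", "CHARACTER": "character", "TITLE": "title"}
# Training and inference proceed exactly as before with these label words;
# no output layer needs to be re-initialized, since the target is natural language.
```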

Experiments
We compare template-based BART with several baselines in both resource-rich and few-shot settings. We use CoNLL2003 (Tjong Kim Sang and De Meulder, 2003) as the resource-rich news-domain dataset, and MIT Movie, MIT Restaurant and ATIS as the low-resource target-domain datasets.

Template Influence
There can be different templates for expressing the same meaning.
For instance " candidate_span is a person" can also be expressed by " candidate_span belongs to the person category". We investigate the impact of manual templates using the CoNLL03 development set. Table 1 shows the performance impact of different choice of templates. For instance, " candidate_span should be tagged as entity_type " and " candidate_span is a entity_type entity" give 76.80% and 95.27% F1 score, respectively, indicating the template is a key factor that influences the final performance. Based on the development results, we use the top performing template " candidate_span is a entity_type entity" in our experiments.

CoNLL03 Results
Standard NER setting. We first evaluate performance under the standard NER setting on CoNLL03. The results are shown in Table 2, where state-of-the-art methods are also compared. In particular, sequence labeling BERT gives a strong baseline, with an F1 score of 91.73%. We can see that even though template-based BART is designed for few-shot named entity recognition, it performs competitively in the resource-rich setting as well. For instance, our method outperforms sequence labeling BERT by 1.80% on recall, which shows that it is more effective in identifying named entities, though it also selects some irrelevant spans. Note that although both sequence labeling BART and template-based BART make use of BART decoder representations, there is a large gap between their performances: the latter outperforms the former by 1.30% absolute F1, demonstrating the effectiveness of the template-based method. This observation is consistent with Lewis et al. (2020), who show that BART is not the most competitive model for sequence classification, which may result from the nature of its seq2seq-based denoising autoencoder training, different from the masked language modeling used for BERT.
To explore whether templates are complementary to each other, we train three models using the first three templates reported in Table 1, and adopt an entity-level voting method to ensemble them. Ensembling brings a 1.21% precision increase, which shows that different templates may capture different types of knowledge. Finally, our method achieves a 92.55% F1 score by leveraging the three templates, which is highly competitive with the best reported score. For computational efficiency, we use a single model in the subsequent few-shot experiments.
In-domain few-shot NER setting. We construct a few-shot learning scenario on CoNLL03 by down-sampling, so that the number of training instances for some specific categories is quite limited. In particular, we set "MISC" and "ORG" as the resource-rich entities, and "LOC" and "PER" as the low-resource entities. We down-sample the CoNLL03 training set, yielding 3,806 training instances, which include 3,925 "ORG", 1,423 "MISC", 50 "LOC" and 50 "PER" entities. Since the text style is consistent across the rich-resource and low-resource entity categories, we call this scenario in-domain few-shot NER. As shown in Table 3, sequence labeling BERT and template-based BART show similar performance on the resource-rich entity types, while our method significantly outperforms BERT by 11.26 and 12.98 F1 points on "LOC" and "PER", respectively. This demonstrates that our method has a stronger modeling capability for in-domain few-shot NER, and indicates that the proposed method can better transfer knowledge between different entity categories.

Cross-Domain Few-Shot NER Results
We evaluate model performance when the target entity types are different from those in the source domain, and only a small amount of labeled data is available for training. We simulate cross-domain low-resource data scenarios by randomly sampling training instances from a large training set as the training data for the target domain. We use different numbers of instances for training, randomly sampling a fixed number of instances per entity type (10, 20, 50, 100, 200 and 500 instances per entity type for MIT Movie and MIT Restaurant, and 10, 20 and 50 instances per entity type for ATIS). If an entity type has fewer instances than the fixed number, we use all of them for training. The results of the few-shot experiments on MIT Movie, MIT Restaurant and ATIS are shown in Table 4, where the methods of Wiseman and Stratos (2019) and Ziyadi et al. (2020), among others, are also compared. We first consider a training-from-scratch setting, where no source-domain data is used. Distance-based methods do not suit this setting. Compared with the traditional sequence labeling BERT method, our method makes better use of few-shot data. In particular, with as few as 20 instances per entity type, our method gives an F1 score of 57.1% on MIT Restaurant, higher than BERT trained with 100 instances per entity type.
We further investigate how much knowledge can be transferred from the news domain (CoNLL03). In this setting, we further fine-tune the model that has been trained on the news domain. As can be seen from Table 4, on all three datasets the few-shot learning methods outperform the sequence labeling BERT and BART methods when the number of training instances is small. For example, when there are only 10 training instances per entity type, the method of Ziyadi et al. (2020) gives an F1 score of 40.1% on MIT Movie, compared to 28.3% for BERT, even though BERT requires re-training with a different output layer on both CoNLL03 and MIT Movie. However, as the number of training instances increases, the advantage of the baseline few-shot methods decreases. When the number of instances grows as large as 500, BERT outperforms all existing methods. Our method is effective with both 10 and 500 instances, outperforming both BERT and the baseline few-shot methods. Compared with the distance-based methods (Wiseman and Stratos, 2019; Ziyadi et al., 2020), our method shows more improvement as the amount of target-domain labeled data increases, because distance-based methods only optimize their search thresholds rather than updating their neural network parameters. We can see that the performance of distance-based methods remains almost unchanged as the labeled data increases; for example, one such baseline gains only 1.9% F1 when the number of instances per entity type grows from 10 to 500. Both BERT and our method perform better than training from scratch. Our model gains 6.6, 6.9 and 5.4 F1 points on average on MIT Restaurant, MIT Movie and ATIS, respectively, significantly higher than the 3.1, 1.9 and 4.3 F1 points gained by BERT. This shows that our model is more successful in transferring knowledge learned from the source domain. One possible explanation is that our model makes more use of the correlations between different entity type labels in the vocabulary, as mentioned earlier, which BERT cannot achieve because it treats the output as discrete class labels.

Discussion
Impact of entity frequencies in training data. To explore the relation between recognition accuracy and the frequency of an entity type in training, we split the ATIS test set into three subsets based on entity frequency in training. The most frequent 33% of entities are put into the high-frequency subset, the least frequent 33% into the low-frequency subset, and the rest into the mid-frequency subset. Figure 3 shows the F1 scores of BERT and our method against the number of training instances on the three subsets. As the number of training instances increases, the performance of all models increases. Our method outperforms sequence labeling BERT by a large margin, especially on the mid-frequency and low-frequency subsets, which demonstrates that our method is more robust in few-shot settings.
Continual learning NER. In the continual learning setting (Lin et al., 2020), all the baselines that we have in Table 4

Visualization. We explore why our model works well in the low-resource domain by visualizing the output layer. We train BERT and our method on all four datasets, and use t-SNE (van der Maaten and Hinton, 2008) to project the output layer into two dimensions, where the output layers of sequence labeling BERT and template-based BART are W^{V^R} in Eq 1 and W_lm in Eq 5, respectively. In Figure 5, each dot represents a row of the output matrix (corresponding to a label embedding). We can see that the output layer embeddings of BERT are clustered by dataset, while the vectors of template-based BART are sparsely distributed in the space. This indicates that our output matrix is more domain-independent, and that our method enjoys better generalization ability across different domains.
Error types. We find that most mistakes are caused by the domain distance between the high-resource data and the low-resource NER data. As shown in Figure 5, template-based methods rely on label semantics: if the embedding of a few-shot label word is far from those of the in-domain labels, the model shows lower performance on that label type. Taking 50 examples per entity type on MIT Movie as an example, "ACTOR" is similar to "PERSON" in CoNLL03 and achieves 84.81 F1, whereas the embedding of "SONG" is far from the existing CoNLL03 labels and achieves only 34.97 F1. In contrast, sequence labeling BERT does not suffer from this distance, because it cannot draw label correlations between the two domains; it achieves 53.98 and 40.13 F1 on "ACTOR" and "SONG", respectively.

Conclusion
We investigated a template-based method for few-shot NER using BART as the backbone model. In contrast to traditional sequence labeling methods, our method is more powerful for few-shot NER, since it can be fine-tuned directly for the target domain even when new entity categories exist. Experimental results show that our model achieves competitive results on a rich-resource NER benchmark, and significantly outperforms traditional sequence labeling methods and distance-based methods on cross-domain and few-shot NER benchmarks.