Template-free Prompt Tuning for Few-shot NER

Prompt-based methods have been successfully applied to sentence-level few-shot learning tasks, largely owing to the sophisticated design of templates and label words. However, when applied to token-level labeling tasks such as NER, enumerating template queries over all potential entity spans is time-consuming. In this work, we propose a more elegant method that reformulates NER as an LM problem without any templates. Specifically, we discard the template construction process while maintaining the word prediction paradigm of pre-trained models, and predict a class-related pivot word (or label word) at the entity position. We also explore principled ways to automatically search for appropriate label words that the pre-trained models can easily adapt to. While avoiding the complicated template-based process, the proposed LM objective also reduces the gap between the objectives used in pre-training and fine-tuning, which benefits few-shot performance. Experimental results demonstrate the effectiveness of the proposed method over BERT-tagger and a template-based method under few-shot settings. Moreover, the decoding speed of the proposed method is up to 1930.12 times faster than that of the template-based method.


Introduction
Pre-trained language models (LMs) have led to large improvements in NLP tasks (Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2020). A popular practice for downstream classification tasks is to replace the pre-trained model's output layer with a classifier head and fine-tune it using a task-specific objective function. Recently, a new paradigm, prompt-based learning, has achieved great success on few-shot classification tasks by reformulating classification tasks as cloze questions. Typically, for each input [X], a template is used to convert [X] into an unfilled text (e.g., "[X] It was __."), allowing the model to fill in the blank with its language modeling ability. For instance, when performing a sentiment classification task, the input "I love the milk." can be converted into "I love the milk. It was __.". Consequently, the LM may predict a label word "great", indicating that the input belongs to the positive class. Two main factors contribute to the success of prompt-based learning on few-shot classification. First, re-using the masked LM objective helps alleviate the gap between the training objectives used at pre-training and fine-tuning. Therefore, the LMs can adapt faster to downstream tasks even with a few training samples (Schick and Schütze, 2021a,b; Brown et al., 2020). Second, the sophisticated template and label word design helps LMs better fit the task-specific answer distributions, which also benefits few-shot performance. As shown in previous works, proper templates obtained by manual selection (Schick and Schütze, 2021a,b), gradient-based discrete search (Shin et al., 2020), LM generation (Gao et al., 2021) and continuous optimization (Liu et al., 2021) are able to induce the LMs to predict more appropriate answers for the corresponding tasks.
However, template-based prompt methods are intrinsically designed for sentence-level tasks, and they are difficult to adapt to token-level classification tasks such as named entity recognition (NER). First, searching for appropriate templates is harder because the search space grows when span-level querying is required in NER. What's worse, such searching with only a few annotated samples as guidance can easily lead to overfitting. Second, obtaining the label of each token requires enumerating all possible spans, which is time-consuming. As an example in Fig. 1, the input "Obama was born in America." can be converted into "Obama was born in America. [Z] is a __ entity.", where [Z] is filled by enumerating all the spans in [X] (e.g., "Obama", "Obama was") for querying. Fig. 1 shows that obtaining all entities in "Obama was born in America ." requires querying the LM 21 times, once for every span of the 6 tokens. Moreover, the number of queries grows quadratically with the sentence length, making the approach impractical for document-level corpora.
In this work, we propose a more elegant way to prompt NER without templates. Specifically, we reformulate NER as an LM task with an Entity-oriented LM (EntLM) objective. Without modifying the output head, the pre-trained LMs are fine-tuned to predict class-related pivot words (or label words) instead of the original words at the entity positions, while still predicting the original word at non-entity positions. Next, similar to template-based methods, we explore principled ways to automatically search for the most appropriate label words. Different approaches are investigated, including selecting discrete label words based on the word distribution in a lexicon-annotated corpus or on LM predictions, and obtaining prototypes as virtual label words. Our approach keeps the merits of prompt-based learning, as no new parameters are introduced during fine-tuning. Also, through the EntLM objective, the LM can perform the NER task with only a slight adjustment of the output distribution, thus benefiting few-shot learning. Moreover, well-selected label words accelerate the adaptation of the LM distribution towards the desired predictions, which also promotes few-shot performance. It is also worth noting that the proposed method requires only one-pass decoding to obtain all entity labels in the sentence, which is significantly more efficient than the time-consuming enumeration process of template-based methods. Our code is publicly available at https://github.com/rtmaww/EntLM/.
To summarize the contributions of this work:
• We propose a template-free approach to prompt NER under the few-shot setting.
• We explore several approaches for label word engineering, accompanied by intensive experiments.
• Experimental results verify the effectiveness of the proposed method under the few-shot setting. Meanwhile, the decoding speed of the proposed method is up to 1930.12 times faster than that of the template-based baseline.

Problem Setup
In this work, we focus on the few-shot NER task. Different from previous works that assume a rich-resource source domain and available support sets during testing, we follow the few-shot setting of Gao et al. (2021), which assumes that only a few samples of each class are available for training.

Approach
In this work, we propose a template-free prompt tuning method, Entity-oriented LM (EntLM) fine-tuning, for few-shot NER. We first give a description of template-based prompt tuning. Then we introduce the EntLM method along with the label word engineering process.

Template-based Prompt Tuning
The standard fine-tuning process for NER replaces the LM head with a token-level classification head and optimizes both the newly introduced parameters and the pre-trained LM. Different from standard fine-tuning, prompt-based tuning reformulates classification tasks as LM tasks, and fine-tunes the LM to predict a label word. Formally, a prompt consists of a template function $T_{prompt}(\cdot)$ that converts the input $x$ to a prompt input $x_{prompt} = T_{prompt}(x)$, and a set of label words $\mathcal{V}$ which are connected with the label space through a mapping function $\mathcal{M}: \mathcal{Y} \rightarrow \mathcal{V}$. The template is a textual string with two unfilled slots: an input slot [X] to fill the input $x$ and an answer slot [Z] that allows the LM to fill in label words. For instance, for a sentiment classification task, the template can take the form "[X] It was [Z].". The input is then mapped to "$x$ It was [Z].". Specifically, when using a masked language model (MLM) for prompt-based tuning, [Z] is filled with a mask token [MASK]. By feeding the prompt into the MLM, the probability distribution over the label set $\mathcal{Y}$ is modeled by:

$$P(y \mid x) = P([Z] = \mathcal{M}(y) \mid x_{prompt}) = \mathrm{Softmax}(W_{lm} \cdot h_{[Z]}) \quad (1)$$

where $W_{lm}$ are the parameters of the pre-trained LM head and $h_{[Z]}$ denotes the hidden state at the [Z] position. Unlike in standard fine-tuning, no new parameters are introduced in this approach, so the model can more easily fit the target task with few samples. Also, the LM objective reduces the gap between pre-training and fine-tuning, thus benefiting few-shot training (Gao et al., 2021).
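The label-word scoring described above can be illustrated with a small sketch. It uses toy logits in place of a real MLM, and the helper names (`softmax`, `label_distribution`) are hypothetical. Renormalizing over the label words only is an assumption made here for readability, not something the text prescribes.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_distribution(mask_logits, vocab, label_words):
    """Map MLM logits at the [Z]/[MASK] slot to a distribution over labels.

    mask_logits: one logit per vocabulary entry (toy numbers in this sketch).
    vocab:       vocabulary words, aligned with mask_logits.
    label_words: dict label -> label word, i.e. the mapping M: Y -> V.
    """
    word_prob = dict(zip(vocab, softmax(mask_logits)))
    raw = {label: word_prob[word] for label, word in label_words.items()}
    # Assumption: renormalize over the label words so the result sums to 1.
    norm = sum(raw.values())
    return {label: p / norm for label, p in raw.items()}


# Toy stand-in for the MLM output at the [MASK] position of
# "I love the milk. It was [MASK]."
vocab = ["great", "terrible", "milk", "the"]
mask_logits = [2.0, 0.5, -1.0, 0.0]
label_words = {"positive": "great", "negative": "terrible"}

dist = label_distribution(mask_logits, vocab, label_words)
prediction = max(dist, key=dist.get)
print(prediction, dist)
```

No classifier head appears anywhere in the sketch: the only task-specific object is the mapping from labels to existing vocabulary words.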

Problems of Prompt-based NER
However, when applied to NER, such a prompt-based approach becomes complicated. Given an input $X = \{x_1, \ldots, x_n\}$, we need to obtain the label sequence $Y = \{y_1, \ldots, y_n\}$, $y_i \in \mathcal{Y}$, corresponding to each token of $X$. Therefore, an additional slot [S] is added to the template to fill in a token $x_i$ or a contiguous span $s_i^j = \{x_i, \ldots, x_j\}$ that starts at $x_i$ and ends at $x_j$. For example, the template can take the form "[X] [S] is a [Z] entity.", where the LMs are fine-tuned to predict an entity label word at [Z] (e.g., person) corresponding to an entity label (e.g., PERSON).
During decoding, obtaining the labels $Y$ of the whole sentence requires enumeration over all the spans:

$$Y = \Big\{ \arg\max_{y \in \mathcal{Y}} P\big([Z] = \mathcal{M}(y) \mid T_{prompt}(X, s_i^j)\big) \;:\; 1 \le i \le j \le n \Big\} \quad (2)$$

Such decoding is time-consuming, and the decoding time increases as the sequence gets longer. Therefore, although effective in the few-shot setting, template-based prompt tuning is not well suited to the NER task.
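To make the enumeration cost concrete, the following sketch builds one template query per contiguous span. It assumes whitespace tokenization, and the helper names (`enumerate_spans`, `build_queries`) are hypothetical. The template string follows the "[X] [S] is a [Z] entity." form quoted above, with [Z] written as [MASK].

```python
def enumerate_spans(tokens):
    """Yield every contiguous span as inclusive (start, end) indices."""
    n = len(tokens)
    for i in range(n):
        for j in range(i, n):
            yield i, j


def build_queries(tokens, template="{sentence} {span} is a [MASK] entity."):
    """Build one prompt per span; each prompt costs one LM forward pass."""
    sentence = " ".join(tokens)
    return [
        template.format(sentence=sentence, span=" ".join(tokens[i:j + 1]))
        for i, j in enumerate_spans(tokens)
    ]


tokens = "Obama was born in America .".split()
queries = build_queries(tokens)
print(len(queries))   # n * (n + 1) / 2 queries for n tokens
print(queries[0])
```

For the six-token example sentence this produces the 21 queries mentioned in the introduction, and the count grows as n(n+1)/2 with the sentence length n.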

Entity-Oriented LM Fine-tuning
In this work, we propose a more elegant way to prompt NER without templates, while maintaining the advantages of prompt tuning. Specifically, we also reformulate NER as an LM task. However, instead of forming templates to re-use the LM objective, we propose a new objective, the Entity-oriented LM (EntLM) objective, for fine-tuning on NER. As shown in Fig. 2 (b), when fed with "Obama was born in America", the LM is trained to predict the label word "John" at the position of the entity "Obama" as an indication of the label "PER", while for the non-entity word "was", the LM still predicts the original word.
Formally, to fine-tune the LM with the EntLM objective, we first construct a label word set $\mathcal{V}_l$, which is also connected with the task label set through a mapping function $\mathcal{M}: \mathcal{Y} \rightarrow \mathcal{V}_l$.
Next, given the input sentence $X = \{x_1, \ldots, x_n\}$ and the corresponding label sequence $Y = \{y_1, \ldots, y_n\}$, we construct a target sentence $X^{Ent} = \{x_1, \ldots, \mathcal{M}(y_i), \ldots, x_n\}$ by replacing the token at each entity position $i$ (here we assume $y_i$ is an entity label) with the corresponding label word $\mathcal{M}(y_i)$, and keeping the original words at non-entity positions. Then, given the original input $X$, the LM is trained to maximize the probability $P(X^{Ent} \mid X)$ of the target sentence $X^{Ent}$:

$$\mathcal{L}_{EntLM} = -\sum_{i=1}^{n} \log P(x_i = x_i^{Ent} \mid X) \quad (3)$$

where $P(x_i = x_i^{Ent} \mid X) = \mathrm{Softmax}(W_{lm} \cdot h_i)$ and $h_i$ is the hidden state at position $i$. Note that $W_{lm}$ are again the parameters of the pre-trained LM head. By re-using the whole pre-trained model, no new parameters are introduced during this fine-tuning process. Meanwhile, the EntLM objective serves as an LM-based objective that reduces the gap between pre-training and fine-tuning. In this way, we avoid the complicated template construction for the NER task, and keep the good few-shot ability of prompt-based methods.
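The construction of the target sentence can be sketched in a few lines. This is an illustration under stated assumptions, not the released implementation: labels use the "IO" scheme with "O" for non-entities, the label word "John" for PER comes from the example above, and the label word "London" for LOC is a hypothetical choice. The toy probabilities stand in for the LM's softmax output.

```python
import math


def build_entlm_target(tokens, labels, label_words, outside="O"):
    """Replace each entity token by its label word; keep all other tokens."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [
        tok if lab == outside else label_words[lab]
        for tok, lab in zip(tokens, labels)
    ]


def entlm_loss(position_probs, target):
    """Negative log-likelihood of the target words, summed over positions.

    position_probs[i] is a dict word -> probability at position i,
    standing in for the LM output distribution given the original input.
    """
    return -sum(math.log(probs[word]) for probs, word in zip(position_probs, target))


tokens = ["Obama", "was", "born", "in", "America"]
labels = ["PER", "O", "O", "O", "LOC"]
label_words = {"PER": "John", "LOC": "London"}  # "London" is hypothetical

target = build_entlm_target(tokens, labels, label_words)
# Toy distributions: every target word gets probability 0.5.
position_probs = [{word: 0.5} for word in target]
loss = entlm_loss(position_probs, target)
print(target, round(loss, 4))
```

The input fed to the model stays the original sentence; only the prediction targets change at entity positions.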
During testing, we directly feed the test input $X$ into the model, and the probability of labeling the $i$-th token with class $y \in \mathcal{Y}$ is modeled by:

$$P(y_i = y \mid X) = P(x_i = \mathcal{M}(y) \mid X) \quad (4)$$

Note that only a one-pass decoding process is needed to obtain all labels for each sentence, which is far more efficient than template-based prompt querying.
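A minimal sketch of one-pass decoding follows. The decision rule is an assumption made for illustration: a token receives class y when the label word of y is the top prediction at that position, and "O" otherwise. The text above only specifies how class probabilities are read off the LM output, so the actual rule may differ. The distributions are toy values and `decode_one_pass` is a hypothetical name.

```python
def decode_one_pass(position_probs, label_words, outside="O"):
    """Assign a label to every token from a single forward pass.

    position_probs[i]: dict word -> probability at position i (toy values).
    label_words:       dict label -> label word.
    Assumed rule: take the top-1 word; if it is a label word, emit its class.
    """
    word_to_label = {word: label for label, word in label_words.items()}
    predicted = []
    for probs in position_probs:
        best_word = max(probs, key=probs.get)
        predicted.append(word_to_label.get(best_word, outside))
    return predicted


label_words = {"PER": "John", "LOC": "London"}  # "London" is hypothetical
position_probs = [
    {"John": 0.6, "Obama": 0.3, "London": 0.1},
    {"was": 0.9, "John": 0.1},
    {"born": 0.8, "raised": 0.2},
    {"in": 0.95, "London": 0.05},
    {"London": 0.5, "America": 0.4, "John": 0.1},
]
predicted = decode_one_pass(position_probs, label_words)
print(predicted)
```

All positions are labeled from the same set of output distributions, so the cost is one forward pass per sentence rather than one per span.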

Label Word Engineering
Previous template-based studies have verified the significant impact of template engineering on few-shot performance. Similarly, in this work, we explore approaches for automatically selecting proper label words. Since the EntLM objective leads all entities that belong to a class to predict the same label word, we believe that the purpose of label word searching is to find a pivot word that best represents the words in each class.

Low-resource Label word selection
When selecting label words with only a few annotated samples as guidance, the randomness of sampling largely affects the selection. To obtain a more consistent selection, we explore the use of unlabeled data and lexicon-based annotation as a resource for label word searching. This is a practical setting, since unlabeled data of a target domain or a general domain is usually available, and for NER, entity lexicons of the target classes are usually easy to access. To obtain annotation via an entity lexicon, we adopt the KB-matching approach proposed by Liang et al. (2020), which leverages an external KB, Wikidata, as the source of lexicon annotation. Such lexicon-based annotation is inevitably noisy. However, our approach does not suffer much from the noise, since we only regard the annotation as an indication of the data distribution and do not train the model directly on the noisy annotation.

Label word searching
With the help of lexicon-annotated data $D_{lexicon} = \{(X_i, Y_i^*)\}_{i=1}^{N}$, we explore three methods for label word searching.

Searching with data distribution (Data search)
The most intuitive method is to select the most frequent word of the given class in the corpus. Specifically, when searching for label words for class $C$, we calculate the frequency $\phi(x = w, y^* = C)$ of each word $w \in \mathcal{V}$ labeled as $C$ and select the most frequent words by ranking:

$$\mathcal{M}(C) = \arg\max_{w} \phi(x = w, y^* = C) \quad (5)$$

Searching with LM output distribution (LM search)
In this approach, we leverage the pre-trained language model for label word searching. Specifically, we feed each sample $(X, Y^*)$ into the LM and get the probability distribution $p(x_i = w \mid X)$ of predicting each word $w \in \mathcal{V}$ at each position $i$. Suppose $I_{topk}(x_i = w \mid X, Y^*) \rightarrow \{0, 1\}$ is the indicator function indicating whether $w$ belongs to the top-$k$ predictions of $x_i$ in sample $(X, Y^*)$. The label word of class $C$ can be obtained by:

$$\mathcal{M}(C) = \arg\max_{w} \sum_{(X, Y^*) \in D_{lexicon}} \sum_{i=1}^{|X|} \phi_{topk}(x_i = w, y_i^* = C) \quad (6)$$

where $\phi_{topk}(x_i = w, y_i^* = C) = I_{topk}(x_i = w \mid X, Y^*) \cdot I(y_i^* = C)$.

Searching with both data & LM output distribution (Data&LM search)
In this approach, we select label words by simultaneously considering the data distribution and the LM output distribution. Specifically, the label word of class $C$ can be obtained by:

$$\mathcal{M}(C) = \arg\max_{w} \Big\{ \phi(x = w, y^* = C) \cdot \sum_{(X, Y^*) \in D_{lexicon}} \sum_{i=1}^{|X|} \phi_{topk}(x_i = w, y_i^* = C) \Big\} \quad (7)$$

Virtual label word (Virtual)
Instead of using real words, in this approach we search for continuous vectors that can be regarded as virtual label words. One intuitive way is to follow the practice of Prototypical Networks (Snell et al., 2017), which use the mean vector of the embeddings of the words belonging to each class as a prototype. Since averaging the embeddings of all the words belonging to a class is expensive, here we simply use the mean vector of the top-$k$ high-frequency words selected by the previous approaches:

$$\mathcal{M}(C) = \frac{1}{|\mathcal{V}_C|} \sum_{w \in \mathcal{V}_C} f_\phi(w) \quad (8)$$

where $\mathcal{V}_C$ is the set of label words obtained by finding the top-$k$ words with Eq. 5, 6 or 7, and $f_\phi(\cdot)$ denotes the embedding function of the pre-trained model.
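The four search strategies can be sketched with plain counters. This is an illustrative reading under assumptions rather than the released code: the corpus and the top-k LM predictions are toy data, the combined score is taken to be the product of the two frequencies, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def data_search(corpus, outside="O"):
    """Frequency of each word under each class in the lexicon-annotated corpus."""
    freq = defaultdict(Counter)
    for tokens, labels in corpus:
        for tok, lab in zip(tokens, labels):
            if lab != outside:
                freq[lab][tok] += 1
    return freq


def lm_search(corpus, topk_predictions, outside="O"):
    """Count how often a word is in the LM's top-k at positions labeled as a class."""
    freq = defaultdict(Counter)
    for (_, labels), preds in zip(corpus, topk_predictions):
        for lab, top in zip(labels, preds):
            if lab != outside:
                freq[lab].update(set(top))
    return freq


def data_lm_search(data_freq, lm_freq):
    """Assumed combination: product of data frequency and LM top-k frequency."""
    combined = defaultdict(Counter)
    for cls, counts in data_freq.items():
        for word, f in counts.items():
            score = f * lm_freq[cls][word]
            if score:
                combined[cls][word] = score
    return combined


def prototype(words, embed):
    """Mean embedding of the selected words, used as a virtual label word."""
    vectors = [embed[w] for w in words]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


corpus = [
    (["Obama", "met", "John", "in", "Paris"], ["PER", "O", "PER", "O", "LOC"]),
    (["John", "visited", "London"], ["PER", "O", "LOC"]),
    (["John", "left"], ["PER", "O"]),
]
# Toy top-2 LM predictions per position (would come from a real MLM).
topk_predictions = [
    [["John", "he"], ["met"], ["John", "Mary"], ["in"], ["London", "Paris"]],
    [["John", "she"], ["visited"], ["London", "Paris"]],
    [["Mary", "John"], ["left"]],
]

data_freq = data_search(corpus)
lm_freq = lm_search(corpus, topk_predictions)
combined = data_lm_search(data_freq, lm_freq)
per_word = combined["PER"].most_common(1)[0][0]

embed = {"John": [1.0, 0.0], "Mary": [0.0, 1.0]}  # toy 2-d embeddings
virtual_per = prototype(["John", "Mary"], embed)
print(per_word, virtual_per)
```

The virtual label word is a vector rather than a vocabulary entry, so using it requires adding that vector to the model's output embedding table.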

Removing conflict label words
The selected high-frequency label words may be high-frequency words across all the classes. Using such label words results in conflicts when training for different classes. Therefore, after label word selection, we remove the conflicting label words of a class $C$ by:

$$w = \mathcal{M}(C), \quad \text{if } \frac{\phi(x = w, y^* = C)}{\sum_{k} \phi(x = w, y^* = k)} > Th \quad (9)$$

where $Th$ is a manually set threshold.
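The conflict filter can be sketched as follows. The ratio test used here, which keeps a word for a class only when that class accounts for more than a threshold share of the word's occurrences, is one natural reading of the thresholding step and is labeled as an assumption. The counts, the threshold value and the name `remove_conflicts` are hypothetical.

```python
def remove_conflicts(freq, candidates, threshold):
    """Drop candidate label words that are not specific enough to their class.

    freq:       dict class -> dict word -> count over the annotated corpus.
    candidates: dict class -> list of candidate label words.
    threshold:  minimum share of a word's occurrences that must fall in the class.
    """
    kept = {}
    for cls, words in candidates.items():
        kept[cls] = []
        for word in words:
            total = sum(counts.get(word, 0) for counts in freq.values())
            if total and freq[cls].get(word, 0) / total > threshold:
                kept[cls].append(word)
    return kept


freq = {
    "PER": {"John": 9, "Washington": 3},
    "LOC": {"Washington": 7, "Paris": 5, "John": 1},
}
candidates = {"PER": ["John", "Washington"], "LOC": ["Washington", "Paris"]}

kept = remove_conflicts(freq, candidates, threshold=0.6)
print(kept)
```

In this toy example "Washington" is removed from PER because only 3 of its 10 occurrences carry that label, while it is kept for LOC.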

Experiments
In this section, we conduct few-shot experiments to verify the effectiveness of the proposed method. We also conduct intensive analytical experiments on label word selection.

Experimental settings
As mentioned in Section 2, in this work we focus on the few-shot setting in which no source domain data is available and only K samples of each class are available for training on a new NER task. To better evaluate the models' few-shot ability, we conduct experiments with K ∈ {5, 10, 20, 50}. For each K-shot experiment, we sample 3 different training sets and repeat the experiment on each training set 4 times.
Few-shot data sampling. Different from sentence-level few-shot tasks, in NER, a sample refers to one entity span in a sentence. One sampled sentence might include multiple entity instances. In our experiments, we adopt an exact sampling strategy to ensure that we sample exactly K samples for each class. The details of the algorithm can be found in Appendix A.2.

Datasets and Implementation Details
We evaluate the proposed method on three benchmark NER datasets from different domains: the CoNLL2003 dataset (Sang and De Meulder, 2003) from the newswire domain, the OntoNotes 5.0 dataset (Weischedel et al., 2013) from the general domain, and the MIT-Movie dataset (Liu et al., 2013) from the review domain. As we focus on named entities, we omit the value/numerical/time/date entity types (e.g., "Cardinal", "Money", etc.) in OntoNotes 5.0. Details of the datasets are shown in Table 1.
Labeling multi-span entities. For entities with multiple spans (including multiple words or sub-tokens after tokenization), we let the model predict the same label word at each position. This labeling method is the same as the "IO" labeling schema, which is consistent with our baseline implementation.
To ensure a few-shot scenario, we do not use a development set for model selection. Instead, we use the model of the last epoch for prediction. For lexicon-based annotation, we use the KB-matching method of Liang et al. (2020) (https://github.com/cliang1453/BOND). For more implementation details (e.g., the learning rate), please refer to Appendix A.1 or our code.

Baselines and Proposed Models
In our experiments, we compare our method with competitive baselines, involving both metric-learning based and prompt-based approaches.
BERT-tagger (Devlin et al., 2019) The BERT-based baseline, which fine-tunes the BERT model with a label classifier.
NNShot and StructShot (Yang and Katiyar, 2020) Two metric-based few-shot learning approaches for NER. Different from Prototypical Networks, they leverage a nearest neighbor classifier for few-shot prediction. StructShot is an extension of NNShot that adds a Viterbi algorithm during decoding. We extend these two approaches to our few-shot setting. Note that the Viterbi algorithm in the original paper calculates the data distribution of a source domain, yet in our setting the source domain is unavailable. Therefore, we also use the lexicon-annotated data for this method.
TemplateNER (Cui et al., 2021) A template-based prompt method. By constructing a template for each class, it queries each span with each class separately. The score of each query is obtained by calculating the generation probability of the query sentence with a generative pre-trained LM, BART (Lewis et al., 2020).
EntLM The proposed method.
EntLM+Struct Based on the proposed method, we further leverage the Viterbi algorithm proposed in Yang and Katiyar (2020) to boost the performance. For more details, please refer to Yang and Katiyar (2020) or our code.
In Appendix A.5, we also compare with the roberta-base baselines from Huang et al. (2021).

Few-shot Results
Table 2 shows the results of the proposed method and baselines under the few-shot setting. From the table, we can observe that: (1) On all three datasets and for all few-shot settings, the proposed method performs consistently better than all the baseline methods, especially for 5-shot learning. Also, the performance of the proposed method is more stable (according to the deviation) than that of the compared baselines. (2) The BERT-tagger method shows poor few-shot learning ability, and the proposed method achieves up to 9.45%, 11.83% and 9.58% improvement over BERT-tagger on CoNLL03, OntoNotes 5.0 and MIT-Movie, respectively. These results show the advantages of the proposed method over standard fine-tuning: it introduces no new parameters and uses an LM-like objective to reduce the gap between pre-training and fine-tuning. (3) The proposed method consistently outperforms the template-based prompt method, TemplateNER, which shows the advantage of the proposed method over the standard template-based method. (4) When no rich-resource source domain is available, the metric-based methods (NNShot) do not show advantages over BERT-tagger, which shows the limitation of these methods under more practical few-shot scenarios. (5) Among all baselines, StructShot is a competitive baseline that also leverages the lexicon and unlabeled data for its structure-based decoder, yet our method can also benefit from the Viterbi decoder and outperform StructShot.

Efficiency Study
In this section, we perform an efficiency study on all three datasets. We measure the decoding time of each method on a Titan XP GPU with a batch size of 8. (The source code of TemplateNER does not allow us to change the batch size, so we keep the original batch size of 45, which is the number of spans enumerated from a 9-gram.) From Table 4, we can observe that: 1) EntLM achieves a speed comparable to BERT-tagger, as only one pass of token classification is required for decoding each batch.
2) The decoding of TemplateNER is extremely slow, while EntLM is up to 1930.12 times faster than TemplateNER. These results show the advantages of EntLM over template-based prompt tuning methods on the NER task.

Label Word Selection
In Sec. 3.3, we presented different ways for label word selection. In this section, we conduct experiments on these methods and report the results in Table 3. We can observe that: 1) The virtual word selection approach is always better than discrete word selection. Among all virtual selection methods, choosing high-frequency words with the combination of data and LM distribution shows advantages over the other methods. The reason might be that simultaneously considering both distributions provides not only the data prior of the target dataset, but also the contextualized information from the PLM, thus benefiting the performance. 2) Searching only with the LM distribution leads to poor results, especially under the 5-shot setting, showing that the general knowledge learned during pre-training might be less helpful than data-specific knowledge under few-shot settings.

Impact of Lexicon Quality on Label Word Selection
Note that we leverage unlabeled data and lexicon annotation for label word selection. In this experiment, we study how the quality of the lexicon impacts the performance on the OntoNotes* dataset. Specifically, we obtain lexicons of different sizes (5% to 80% of the original lexicon size) by sampling entity words from the original lexicon with weights given by entity frequency. This sampling method follows the real-world situation, since high-frequency entities are easier to obtain. Fig. 4 shows the results of EntLM and baseline methods against lexicon size. We can observe that: (1) EntLM with the Data&LM+Virtual selection method shows consistently high performance even with 5% of the lexicon. This means our method is not limited by the lexicon quality, and only a small lexicon is required to reach acceptable few-shot performance.
(2) Compared with the Data&LM+Virtual method, the Data&LM method is much more sensitive to the lexicon quality. However, it still performs better than the compared baselines.
We further conduct experiments on different sizes of the unlabeled dataset by uniformly sampling 5%-80% of the original data. As shown in Fig. 5, the proposed method also shows high robustness to the amount of unlabeled data.
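The frequency-weighted lexicon sampling described above can be sketched as follows. The text does not state the exact procedure, so this sketch assumes sampling without replacement and uses the Efraimidis-Spirakis key trick (each item draws a key u^(1/weight) and the largest keys are kept). The lexicon contents and the name `sample_lexicon` are hypothetical.

```python
import random


def sample_lexicon(entity_freq, fraction, seed=0):
    """Sample a fraction of the lexicon, favoring high-frequency entities.

    entity_freq: dict entity -> positive frequency, used as sampling weight.
    fraction:    share of the lexicon to keep (e.g. 0.05 to 0.8).
    Assumption: weighted sampling without replacement via random keys.
    """
    rng = random.Random(seed)
    size = max(1, round(len(entity_freq) * fraction))
    keyed = [
        (rng.random() ** (1.0 / freq), entity)
        for entity, freq in entity_freq.items()
    ]
    keyed.sort(reverse=True)  # larger key = more likely for larger weights
    return [entity for _, entity in keyed[:size]]


lexicon = {"Obama": 120, "London": 80, "Acme Corp": 5, "Tuvalu": 1}
small_lexicon = sample_lexicon(lexicon, fraction=0.5, seed=13)
print(small_lexicon)
```

Fixing the seed makes the reduced lexicon reproducible across runs, which matters when comparing methods at the same lexicon size.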

Effect of Further Pre-training
When predicting label words on task-specific data during fine-tuning, there is an intrinsic gap between the LM output distribution and the target data distribution. Therefore, it is natural to further pre-train on the target-domain unlabeled data to move the LM predictions towards the target distribution. In Table 5, we show the results of our method and BERT-tagger after further pre-training with the MLM objective on domain-specific unlabeled data. As seen, further pre-training largely boosts the few-shot learning ability of EntLM, while being less helpful for the classifier-based fine-tuning method. This might be because the LM objective used in EntLM benefits more from a task-specific LM output distribution, showing the superiority of EntLM in leveraging pre-trained models.

Related Works

Template-based prompt learning
Stemming from the GPT models (Radford et al., 2019; Brown et al., 2020), prompt-based learning has been widely discussed. These methods reformulate downstream tasks as cloze tasks with textual templates and a set of label words, and the design of templates has proved to be significant for prompt-based learning. Schick and Schütze (2021a,b) use manually defined templates for prompting text classification tasks. Jiang et al. (2020) propose a mining approach to automatically search for templates. Shin et al. (2020) search for optimal discrete templates with a gradient-based approach. Gao et al. (2021) generate templates with the T5 pre-trained model. Meanwhile, several approaches have explored continuous prompts for both text classification and generation tasks (Li and Liang, 2021; Liu et al., 2021; Han et al., 2021). Also, several approaches have been proposed to enhance the templates with illustrative cases (Madotto et al., 2020; Gao et al., 2021; Brown et al., 2020) or context (Petroni et al., 2020). Although template-based methods have proved useful in sentence-level tasks, for the NER task (Cui et al., 2021) such template-based methods can be expensive to decode. Therefore, in this work, we propose a new paradigm of prompt tuning for NER without templates.

Few-shot NER
Recently, many studies have focused on few-shot NER (Hofer et al., 2018; Fritzler et al., 2019; Li et al., 2020; Ding et al., 2021; Chen et al., 2021). Among these, Fritzler et al. (2019) leverage prototypical networks for few-shot NER. Yang and Katiyar (2020) propose to calculate the nearest neighbor of each queried sample instead of the nearest prototype. Huang et al. (2021) evaluate comprehensive baselines on different datasets. Tong et al. (2021) propose to mine the undefined classes for few-shot learning. Cui et al. (2021) leverage prompts for few-shot NER. However, most of these studies follow the manner of episode training or assume a rich-resource source domain. In this work, we follow the more practical few-shot setting of Gao et al. (2021), which assumes only a few samples of each class for training. We also adapt previous methods to this setting as competitive baselines.

Conclusion
In this work, we propose a template-free prompt tuning method, EntLM, for few-shot NER. Specifically, we reformulate the NER task as an Entity-oriented LM task, which induces the LM to predict label words at entity positions during fine-tuning. In this way, not only can the complicated template-based methods be discarded, but the few-shot performance can also be boosted, since the EntLM objective reduces the gap between pre-training and fine-tuning. Experimental results show that the proposed method achieves significant improvement on few-shot NER over BERT-tagger and the template-based method. Also, the decoding speed of EntLM is up to 1930.12 times faster than that of the template-based method.

experiments using the Data&LMSearch+Virtual method on the CoNLL 5-shot dataset. We can see that the performance of the proposed method is robust to the choice of k, since it consistently achieves good results when k >= 3. In our main experiments, we simply choose k = 6 for all datasets.

A.5 Comparison with Comprehensive few-shot NER benchmark
We also conduct experiments on the few-shot benchmark provided by Huang et al. (2021), in order to compare with the competitive baselines in that paper. These methods are implemented with the "Roberta-base" pre-trained model. Therefore, we also implement our method based on "Roberta-base" for a fair comparison. Since the sampled data of OntoNotes is not available, we only experiment on the CoNLL'03 and MIT-Movie datasets. The results are shown in Table 6.
The results show that our method outperforms all baselines. Notice that the NSP method leverages the 6.8GB WiFiNE dataset for pre-training, and that the ST method performs self-training on the unlabeled data. However, our method still shows better results, which illustrates the effectiveness of the proposed objective over standard fine-tuning. Also, the proposed method can be further boosted with NSP and ST. We leave this for future work.

A.6 Case Study
In Table 7, we show the label words selected with the Data&LM+Virtual method as examples.
Algorithm 1 Few-shot Sampling

Figure 2: Comparison of different fine-tuning methods for NER. (a) is the standard fine-tuning method, which replaces the LM head with a classifier head and performs label classification. (c) is the template-based prompt learning method, which induces the LM to predict label words by constructing a template. (b) is the proposed Entity-oriented LM fine-tuning method, which also re-uses the LM head and leads the LM to predict label words through an Entity-oriented LM objective. (For entities with multiple spans, the model predicts the same label word at each position, which is similar to the "IO" labeling scheme.)

Figure 3: Searching for two types of label words: the discrete label words and the continuous vectors as virtual label words. To search for the discrete label words, we select the high-frequency words in data or LM output distribution, or combine these two ways. To search for virtual label words, we calculate the mean vectors of the high-frequency words of each class as prototypes.
In Eq. 6, $\phi_{topk}(x_i = w, y^* = C) = I_{topk}(x_i = w \mid X, Y^*) \cdot I(y_i^* = C)$ denotes the frequency of $w$ occurring in the top-$k$ predictions of the positions labeled as class $C$.

Figure 4: Impact of different lexicon sizes.

Figure 7: Effect of the choice of top k number for virtual method.

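Table 2: Main results of EntLM on three datasets under different few-shot settings (K = 5, 10, 20, 50). We report mean (and deviation in brackets) performance over 3 different splits (4 repeated experiments for each split).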

Table 3: Comparison of our label word selection methods. We report mean (and standard deviation) performance.

Table 4: The decoding time (s) of different methods.

Table 5: Impact of further pre-training.

Table 6: Comparison with the methods presented in Huang et al. (2021). LC is the linear classifier fine-tuning method. P is prototype-based training using a nearest neighbor objective. NSP is noisy supervised pre-training and ST is self-training. Notice that our method shows better results even without NSP and ST, and can also be further boosted by these two methods.