Generalizing Few-Shot Named Entity Recognizers to Unseen Domains with Type-Related Features

Few-shot named entity recognition (NER) has shown remarkable progress in identifying entities in low-resource domains. However, few-shot NER methods still struggle with out-of-domain (OOD) examples due to their reliance on manual labeling for the target domain. To address this limitation, recent studies enable generalization to an unseen target domain with only a few labeled examples using data augmentation techniques. Two important challenges remain: First, augmentation is limited to the training data, resulting in minimal overlap between the generated data and OOD examples. Second, knowledge transfer is implicit and insufficient, severely hindering model generalizability and the integration of knowledge from the source domain. In this paper, we propose a framework, prompt learning with type-related features (PLTR), to address these challenges. To identify useful knowledge in the source domain and enhance knowledge transfer, PLTR automatically extracts entity type-related features (TRFs) based on mutual information criteria. To bridge the gap between training and OOD data, PLTR generates a unique prompt for each unseen example by selecting relevant TRFs. We show that PLTR achieves significant performance improvements on in-domain and cross-domain datasets. The use of PLTR facilitates model adaptation and increases representation similarities between the source and unseen domains.


Introduction
Named entity recognition (NER) aims to detect named entities, such as locations, organizations, and persons, in natural language text (Zhang et al., 2022; Sang and Meulder, 2003; Yang et al., 2017). This task has gained significant attention from both academia and industry due to its wide range of uses, such as question answering and document parsing, serving as a crucial component in natural language understanding (Nadeau and Sekine, 2007; Ma and Hovy, 2016; Cui and Zhang, 2019; Yamada et al., 2020). The availability of labeled data for NER is limited to specific domains, leading to challenges in generalizing models to new domains (Lee et al., 2022; Cui et al., 2021; Ma et al., 2022).
To overcome this issue, recent research focuses on enabling models to learn effectively from a few labeled examples in new target domains (Lee et al., 2022; Ma et al., 2022; Das et al., 2022; Chen et al., 2022a; Wang et al., 2022, 2023), or on exploring data augmentation techniques that leverage automatically generated labeled examples to enrich the training data (Zeng et al., 2020). However, these methods still require manual labeling for target domains, limiting their applicability in zero-shot scenarios with diverse domains.
Recently, Yang et al. (2022) have explored a new task, few-shot cross-domain NER, which aims to generalize an entity recognizer to unseen target domains using a small number of labeled in-domain examples. To accomplish this task, a data augmentation technique named FactMix has been devised. FactMix generates semi-fact examples by replacing the original entity or non-entity words in training instances, capturing the dependencies between entities and their surrounding context. Despite its success, FactMix faces two challenges. First, augmentation is limited to the training data. Since the target domain is not accessible during training, FactMix exclusively augments the training data from the source domain. As a result, there is minimal overlap between the generated examples and the test instances at both the entity and context levels. For instance, only 11.11% of the entity words appear simultaneously in both the generated data (by FactMix) and the AI dataset (target domain). At the context level, as demonstrated in Fig. 1(a), the average sentence similarity between the augmented instances and the test examples is remarkably low. These gaps pose severe challenges in extrapolating the model to OOD data. To address this problem, we incorporate natural language prompts to guide the model during both training and inference, mitigating the gap between the source and unseen domains.

Figure 1: (a) Average SBERT similarities (Reimers and Gurevych, 2019) between pairs of sentences that contain the same type of entities. The source domain dataset is CoNLL2003 (Sang and Meulder, 2003); the target domain datasets include AI, Music, and Science (Liu et al., 2021). In the "Cross-domain" setting, one sentence is from the source domain and the other is from the target domain. In the "FactMix" setting, one sentence is from data augmented by FactMix (Yang et al., 2022) and the other is from the target domain. In the "In-domain" setting, both sentences are from the target domain. (b) Examples of type-related features in the source domain.

Second, knowledge transfer is implicit and insufficient. Intuitively, better generalization to unseen domains can be accomplished by incorporating knowledge from the source domain (Ben-David et al., 2022). However, in FactMix, the transfer of knowledge from the source domain occurs implicitly, at the representation level of pre-trained language models. FactMix is unable to explicitly identify the type-related features (TRFs), i.e., tokens strongly associated with entity types, which play a crucial role in generalization. E.g., as illustrated in Fig. 1(b), the words "established" and "along with" exhibit a close relationship with organization and person entities, respectively, in both domains. This knowledge can greatly assist in recognizing organizations and persons in the target domain.
To tackle this limitation, we introduce mutual information criteria to extract informative TRFs from the source domain. Furthermore, we construct a unique prompt for each unseen instance by selecting relevant TRFs. Intuitively, these generated prompts serve as distinctive signatures, linking unfamiliar examples to the knowledge within the source domain.

Contributions. In this paper, we present a framework named prompt learning with type-related features (PLTR) for few-shot cross-domain NER, to effectively leverage knowledge from the source domain and bridge the gap between training and unseen data. As Fig. 2 shows, PLTR is composed of two main phases: (i) type-related feature extraction, and (ii) prompt generation and incorporation. To identify valuable knowledge in the source domain, PLTR uses mutual information criteria to extract entity type-related features (TRFs). PLTR implements a two-stage framework to mitigate the gap between training and OOD data. First, given a new example, PLTR constructs a unique sequence by selecting relevant TRFs from the source domain. Then, the constructed sequences serve as prompts for performing entity recognition on the unseen data. Finally, a multi-task training strategy is employed to enable parameter sharing between prompt generation and entity recognition. Like FactMix, PLTR is a fully automatic method that does not rely on external data or human interventions. PLTR is able to seamlessly integrate with different few-shot NER methods, including standard fine-tuning and prompt-tuning approaches.
In summary, our contributions are: (i) to the best of our knowledge, ours is the first work to study prompt learning for few-shot cross-domain NER; (ii) we develop a mutual information-based approach to identify important entity type-related features from the source domain; (iii) we design a two-stage scheme that generates and incorporates a prompt that is highly relevant to the source domain for each new example, effectively mitigating the gap between source and unseen domains; and (iv) experimental results show that our proposed PLTR achieves state-of-the-art performance on both in-domain and cross-domain datasets.

Related work

Consequently, due to their high running costs and underwhelming performance, we do not consider recent LLMs as the basic model of our proposed framework (refer to Sec. 3.2). As mentioned in Sec. 1, previous few-shot NER methods primarily focus on in-domain settings and require manual annotations for the target domain, which poses a challenge for generalizing to OOD examples. The field of few-shot cross-domain learning is inspired by the rapid learning capability of humans to recognize object categories from limited examples, known as rationale-based learning (Brown et al., 2020; Shen et al., 2021; Chen et al., 2022a; Baxter, 2000; Zhang et al., 2020). In the context of NER, Yang et al. (2022) introduce the few-shot cross-domain setting and propose a two-step rationale-centric data augmentation method, named FactMix, to enhance the model's generalization ability.
In this paper, we focus on few-shot cross-domain NER. The most closely related work is FactMix (Yang et al., 2022). FactMix faces two challenging problems: (i) augmentation is limited to the training data, and (ii) the transfer of knowledge from the source domain is implicit and insufficient.
In our proposed PLTR, to identify useful knowledge in the source domain, mutual information criteria are designed for automatic type-related feature (TRF) extraction. In addition, PLTR generates a unique prompt for each unseen example based on relevant TRFs, aiming to reduce the gap between the source and unseen domains.

Task settings
A NER system takes a sentence $x = x_1, \ldots, x_n$ as input, where $x$ is a sequence of $n$ words. It produces a sequence of NER labels $y = y_1, \ldots, y_n$, where each $y_i$ belongs to the label set $Y$, selected from the predefined tags $\{B_t, I_t, S_t, E_t, O\}$. The labels $B$, $I$, $E$, and $S$ indicate the beginning, middle, and ending of a multi-word entity, and a single-word entity, respectively. The entity type is denoted by $t \in T = \{\mathrm{PER}, \mathrm{LOC}, \mathrm{ORG}, \mathrm{MISC}, \ldots\}$, while $O$ denotes non-entity tokens. The source dataset and out-of-domain dataset are represented by $D_{in}$ and $D_{ood}$, respectively. Following Yang et al. (2022), we consider two settings in our task: the in-domain setting and the out-of-domain (OOD) setting. Specifically, we first train a model $M_{in}$ using a small set of labeled instances from $D_{in}$. Then, for the in-domain and OOD settings, we evaluate the performance of $M_{in}$ on $D_{in}$ and $D_{ood}$, respectively.
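To make the tagging scheme concrete, the following is a minimal illustrative example of a BIOES-style label sequence; the sentence and labels are constructed for illustration and are not drawn from any of the datasets used here:

```python
# Illustrative BIOES-style tagging: "B-"/"I-"/"E-" mark the beginning,
# middle, and end of a multi-word entity, "S-" a single-word entity,
# and "O" a non-entity token.
x = ["Henry", "M.", "Morris", "visited", "San", "Diego", "."]
y = ["B-PER", "I-PER", "E-PER", "O", "B-LOC", "E-LOC", "O"]
assert len(x) == len(y)  # NER is token-level sequence labeling
```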

Basic models
Since our proposed PLTR is designed to be model-agnostic, we choose two popular NER methods, standard fine-tuning and prompt-tuning, as our basic models. As mentioned in Sec. 2, due to their high costs and inferior performance on the NER task, we do not consider recent large language models (e.g., the GPT series) as our basic models.

Standard fine-tuning method. We employ pre-trained language models (PLMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) to generate contextualized word embeddings. These embeddings are then input into a linear classifier with a softmax function to predict the probability distribution of entity types. The process involves feeding the input sequence $x$ into the PLM feature encoder to obtain the corresponding contextualized word embeddings $h$:

$h = \mathrm{PLM}(x)$,   (1)

where $h$ represents the sequence of contextualized word embeddings derived from the pre-trained language model. To recognize entities, we optimize the cross-entropy loss $\mathcal{L}_{\mathrm{NER}}$ as:

$\mathcal{L}_{\mathrm{NER}} = -\sum_{o} \sum_{c=1}^{N} y_{o,c} \log p_{o,c}$,   (2)

where $N$ denotes the number of classes, $y_{o,c}$ is a binary indicator (0 or 1) of whether the gold label $c$ is the correct prediction for observation $o$, and $p_{o,c}$ is the predicted probability of $c$ for $o$.
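The following is a minimal sketch of this fine-tuning baseline, assuming a Hugging Face Transformers encoder; the checkpoint name and the number of labels are illustrative choices, not specifications from the paper:

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class FineTuneNER(nn.Module):
    """PLM encoder followed by a token-level linear classifier (Eqs. 1-2)."""

    def __init__(self, plm_name="bert-base-cased", num_labels=9):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(plm_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # h: contextualized word embeddings from the PLM (Eq. 1)
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        return self.classifier(h)  # per-token logits over entity labels

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = FineTuneNER()
batch = tokenizer(["Bolton's spokesperson told CBS News."], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
# L_NER (Eq. 2) is the token-level cross-entropy between logits and gold labels:
loss_fn = nn.CrossEntropyLoss()
```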
Prompt-tuning method. The prompt-tuning method for NER involves mask-and-infill techniques based on human-defined templates to generate label words. We adopt the recent EntLM model proposed by Ma et al. (2022) as our benchmark for this method. First, a label word set $V_l$ is constructed through label word engineering and connected to the label set using a mapping function $\mathcal{M}: Y \rightarrow V_l$. Next, tokens at entity positions are replaced with the corresponding label words $\mathcal{M}(y_i)$. The resulting modified input is denoted as $x^{\mathrm{Ent}}$. The language model is trained by maximizing the probability $P(x^{\mathrm{Ent}} \mid x)$. The loss function for generating the prompt and performing NER is formulated as:

$\mathcal{L}_{\mathrm{NER}} = -\sum_{i=1}^{n} \sum_{c=1}^{N} y_{i,c} \log P(x_i^{\mathrm{Ent}} = c \mid x)$,   (3)

where $N$ represents the number of classes. The initial parameters of the predictive model are obtained from PLMs.
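As a sketch of the EntLM-style target construction, the snippet below replaces entity tokens with label words to form $x^{\mathrm{Ent}}$; the label-word mapping here is a hypothetical stand-in, since label word engineering is dataset-specific:

```python
# Hypothetical label-word mapping M: Y -> V_l (actual label words come
# from label word engineering on the source dataset).
label_words = {"PER": "person", "ORG": "company", "LOC": "place"}

def build_x_ent(tokens, labels):
    """Replace each entity token with the label word of its type (x^Ent)."""
    x_ent = []
    for tok, lab in zip(tokens, labels):
        if lab == "O":
            x_ent.append(tok)  # non-entity tokens are kept unchanged
        else:
            x_ent.append(label_words[lab.split("-")[-1]])
    return x_ent

tokens = ["Bolton", "told", "CBS", "News", "."]
labels = ["S-PER", "O", "B-ORG", "E-ORG", "O"]
print(build_x_ent(tokens, labels))  # ['person', 'told', 'company', 'company', '.']
```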

Method
In this section, we present the two primary phases of the proposed PLTR method, as depicted in Fig. 2: (i) type-related feature extraction (see Sec. 4.1), and (ii) prompt generation and incorporation (see Sec. 4.2).

Type-related feature extraction
As mentioned in Sec. 1, type-related features (TRFs), i.e., tokens strongly associated with entity types, play a crucial role in the few-shot cross-domain NER task. To extract these features, we propose a mutual information based method for identifying TRFs from the source domain. Here, we define $S_i$ as the set of all sentences from the source domain in which entities of the $i$-th type appear, and $S \setminus S_i$ as the set of sentences without entities of the $i$-th type. In our method, we consider a binary variable that indicates examples (texts) from $S_i$ as 1, and examples from $S \setminus S_i$ as 0. To find tokens closely related to $S_i$, we first calculate the mutual information between all tokens and this binary variable, and then select the top $l$ tokens with the highest mutual information scores. However, the mutual information criteria may favor tokens that are highly associated with $S \setminus S_i$ rather than with $S_i$. Thus, we introduce a filtering condition as follows:

$\mathrm{count}(w_m, S_i) \geq \rho \cdot \mathrm{count}(w_m, S \setminus S_i)$,   (4)

where $\mathrm{count}(w_m, S_i)$ represents the count of the m-gram $w_m$ in $S_i$, $\mathrm{count}(w_m, S \setminus S_i)$ represents the count of this m-gram in all source-domain sentences except for $S_i$, and $\rho$ is an m-gram frequency ratio hyperparameter. By applying this criterion, we ensure that $w_m$ is considered part of the TRF set of $S_i$ only if its frequency in $S_i$ is significantly higher than its frequency for other entity types ($S \setminus S_i$). Since the number of examples in $S_i$ is much smaller than the number of examples in $S \setminus S_i$, we choose $\rho \geq 1$ but avoid setting it to a large value. This allows for the inclusion of features that are associated with $S_i$ while also being related to other entity types in the TRF set of $S_i$. In our experiments, we set $\rho = 3$ and only consider 1-grams for simplicity. Note that the type-related feature extraction module we designed is highly efficient, with a computational complexity of $O(|D_{in}| \cdot l_{avg} \cdot |T|)$, where $|D_{in}|$, $l_{avg}$, and $|T|$ denote the number of sentences in the training dataset, the average sentence length, and the size of the entity type set, respectively. This module is able to compute the criteria in Eq. 4 for all entity types in $T$ and each token by traversing the tokens in every training sentence just once.
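A minimal sketch of this extraction step is given below, assuming scikit-learn's mutual information estimator over binary token-presence features; the function name and the use of `CountVectorizer` are implementation choices of ours, not details from the paper:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

def extract_trfs(sentences, in_s_i, l=20, rho=3):
    """Sketch of TRF extraction for one entity type (1-grams only).

    sentences: source-domain sentences (strings).
    in_s_i:    binary labels, 1 if the sentence contains an entity of
               the i-th type (i.e., it belongs to S_i), else 0.
    """
    vec = CountVectorizer(binary=True)  # token presence per sentence
    X = vec.fit_transform(sentences)
    mi = mutual_info_classif(X, in_s_i, discrete_features=True)

    counts = X.toarray()
    mask = np.asarray(in_s_i) == 1
    count_in = counts[mask].sum(axis=0)    # count(w, S_i)
    count_out = counts[~mask].sum(axis=0)  # count(w, S \ S_i)

    # Frequency-ratio filter (Eq. 4): keep w only if its frequency in S_i
    # is at least rho times its frequency in the other types.
    keep = count_in >= rho * count_out
    tokens = vec.get_feature_names_out()
    order = np.argsort(-mi)  # highest mutual information first
    return [tokens[j] for j in order if keep[j]][:l]
```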

Prompt generation and incorporation
To connect unseen examples with the knowledge within the source domain, we generate and incorporate a unique prompt for each input instance. This process involves a two-stage mechanism: first, relevant TRFs are selected to form prompts, and then these prompts are input into the PLM-based basic model for entity label inference.

Automatic type-related feature selection. Given an input sentence $x$ and the extracted TRF set $R$, we formulate the selection of relevant TRFs as a cloze-style task for our PLM-based basic model $M_b$ (refer to Sec. 3.2). Specifically, we define the following prompt template function $f(\cdot)$ with $K$ [MASK] tokens:

$f(x)$ = "$x$ [SEP] type-related features: [MASK] ... [MASK]".   (5)

By inputting $f(x)$ into $M_b$, we compute the hidden vector $h_{\mathrm{[MASK]}}$ of each [MASK]. Given a token $r \in R$, we compute the probability that token $r$ can fill the masked position:

$P(\mathrm{[MASK]} = r \mid f(x)) = \dfrac{\exp(\mathbf{r} \cdot h_{\mathrm{[MASK]}})}{\sum_{r' \in R} \exp(\mathbf{r'} \cdot h_{\mathrm{[MASK]}})}$,   (6)

where $\mathbf{r}$ is the embedding of the token $r$ in the PLM $M_b$. For each [MASK], we select the token with the highest probability as a relevant TRF for $x$, while discarding any repeated TRFs. For example, as illustrated in Fig. 2, for the sentence "Bolton's spokesperson told CBS News.", the most relevant TRFs include "Spokesmen", "News", and "Corp".
To train $M_b$ for TRF selection, we define the loss function $\mathcal{L}_{gen}$ as follows:

$\mathcal{L}_{gen} = -\sum_{i=1}^{K} \log P(\mathrm{[MASK]}_i = \phi(x, i) \mid f(x))$,   (7)

where $\phi(x, i)$ denotes the label for the $i$-th [MASK] token in $x$. To obtain $\phi(x)$, we compute the Euclidean distance between the PLM-based embeddings of each $r \in R$ and each token in $x$, selecting the top-$K$ features. Note that our designed automatic selection process effectively filters out irrelevant TRFs for the given input sentence, substantially reducing human intervention in TRF extraction (refer to Sec. 7).

Prompt incorporation. To incorporate entity type information into prompts, we generate a unique prompt from the selected relevant TRFs $R'(x) \subseteq R$ for input $x$. This is achieved using the following prompt template function $f'(x)$:

$f'(x)$ = "$x$ [SEP] $t_1$: $R'(x, t_1)$ [SEP] $t_2$: $R'(x, t_2)$ ...",   (8)

where $t_i \in T$ is an entity type name (e.g., PER or ORG), and $R'(x, t_i) \subseteq R'(x)$ represents the selected TRFs related to entity type $t_i$ for sentence $x$. Note that if $R'(x, t_i) = \emptyset$, the entity type name and the relevant TRFs $R'(x, t_i)$ are excluded from $f'(x)$. For example, as depicted in Fig. 2, the unique prompt $f'(x)$ corresponding to $x$ = "Bolton's spokesperson told CBS News." can be represented as:

$f'(x)$ = "Bolton's spokesperson told CBS News. [SEP] PER: Spokesmen [SEP] ORG: News, Corp".   (9)

Then, we input $f'(x)$ into $M_b$ to recognize the entities in the given sentence $x$.
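A condensed sketch of this two-stage mechanism is shown below, using a Hugging Face masked language model. For simplicity it scores candidate TRFs with the MLM head's logits rather than the explicit dot product of Eq. 6, and assumes every candidate TRF is a single token in the vocabulary; these simplifications are ours:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

def select_trfs(x, trf_set, K=3):
    """Cloze-style TRF selection, approximating Eqs. 5 and 6."""
    # f(x) = "x [SEP] type-related features: [MASK] ... [MASK]"  (Eq. 5)
    f_x = x + " [SEP] type-related features: " + " ".join([tok.mask_token] * K)
    inputs = tok(f_x, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits[0]
    mask_pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero().squeeze(-1)

    cand_ids = tok.convert_tokens_to_ids(trf_set)
    selected = []
    for pos in mask_pos:
        scores = logits[pos][cand_ids]  # restrict fill-in to candidate TRFs
        selected.append(trf_set[int(scores.argmax())])
    return list(dict.fromkeys(selected))  # drop repeated TRFs, keep order

def build_prompt(x, trfs_by_type):
    """Type-aware prompt f'(x) (Eq. 8); empty types are skipped."""
    parts = [f"{t}: {', '.join(ws)}" for t, ws in trfs_by_type.items() if ws]
    return x + " [SEP] " + " [SEP] ".join(parts)

x = "Bolton's spokesperson told CBS News."
print(build_prompt(x, {"PER": ["Spokesmen"], "ORG": ["News", "Corp"]}))
# -> "Bolton's spokesperson told CBS News. [SEP] PER: Spokesmen [SEP] ORG: News, Corp"
```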

Joint training
To enable parameter sharing between prompt generation and incorporation, we train our model using a multi-task framework. The overall loss function is defined as follows:

$\mathcal{L} = \alpha \mathcal{L}'_{\mathrm{NER}} + (1 - \alpha) \mathcal{L}_{gen}$,   (10)

where $\mathcal{L}'_{\mathrm{NER}}$ denotes the normalized loss function for the NER task loss $\mathcal{L}_{\mathrm{NER}}$ (refer to Sec. 3.2), computed with prompts as inputs. $\alpha$ is the weight assigned to $\mathcal{L}'_{\mathrm{NER}}$, and the weight $1 - \alpha$ is assigned to the TRF selection loss $\mathcal{L}_{gen}$. In our experiments, we optimize the overall loss function using AdamW (Loshchilov and Hutter, 2019). Sec. A.1 gives the detailed training algorithm of PLTR.
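In code, the multi-task objective amounts to a weighted sum; a minimal sketch (with placeholder loss values) follows:

```python
import torch

def joint_loss(loss_ner, loss_gen, alpha=0.5):
    """Overall objective (Eq. 10): L = alpha * L'_NER + (1 - alpha) * L_gen."""
    return alpha * loss_ner + (1 - alpha) * loss_gen

# Toy usage with placeholder loss values; in training, both losses are
# computed on the same batch and the sum is minimized with AdamW.
loss = joint_loss(torch.tensor(0.8), torch.tensor(0.3), alpha=0.7)
```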

Experiments
We aim to answer the following research questions: (RQ1) Does PLTR outperform state-of-the-art fine-tuning methods on the few-shot cross-domain NER task? (Sec. 6.1) (RQ2) Can PLTR be applied to prompt-tuning NER methods? (Sec. 6.2) Micro F1 is adopted as the evaluation metric for all settings.
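The paper does not state which implementation it uses for Micro F1; a common entity-level choice is seqeval, sketched here with illustrative label sequences:

```python
from seqeval.metrics import f1_score

# Entity-level micro F1 over BIO-style label sequences (illustrative data).
y_true = [["B-PER", "I-PER", "O", "B-ORG"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]
print(f1_score(y_true, y_pred, average="micro"))  # one of two gold entities found
```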

Datasets
Detailed statistics of both the in-domain and out-of-domain datasets are shown in Table 1.

In-domain dataset. We conduct in-domain experiments on the CoNLL2003 dataset (Sang and Meulder, 2003). It consists of text in a style similar to Reuters News and encompasses entity types such as person, location, and organization. Additionally, to examine whether PLTR is extensible to different source domains and entity types, we evaluate PLTR using training data from OntoNotes (Weischedel et al., 2013) (refer to Sec. A.3). OntoNotes is an English dataset consisting of text from a wide range of domains and 18 types of named entities, such as Person, Event, and Date.

Out-of-domain datasets. We utilize the OOD datasets collected by Liu et al. (2021), which cover new domains such as AI, Literature, Music, Politics, and Science. The vocabulary overlaps between these domains are generally small, indicating the diversity of the out-of-domain datasets (Liu et al., 2021). Since the model trained on the source domain dataset (CoNLL2003) can only predict person, location, organization, and miscellaneous entities, we assign the label O to all unseen labels in the OOD datasets, as illustrated in the sketch below.
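A minimal sketch of this label remapping, with a hypothetical unseen label, follows:

```python
# Map any OOD entity type outside the source label set to O.
SOURCE_TYPES = {"PER", "LOC", "ORG", "MISC"}

def remap_label(label):
    if label == "O":
        return "O"
    _, etype = label.split("-", 1)
    return label if etype in SOURCE_TYPES else "O"

print(remap_label("B-field"))  # -> "O" (hypothetical unseen AI-domain type)
print(remap_label("S-PER"))    # -> "S-PER"
```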

Experimental settings and baselines
We compare PLTR with recent baselines in the following two experimental settings.

Fine-tuning. Following Yang et al. (2022), we employ the standard fine-tuning method (Ori) based on pre-trained models with different parameter sizes: BERT-base, BERT-large, RoBERTa-base, and RoBERTa-large. All backbone models are implemented using the Transformers package provided by Hugging Face. For fine-tuning the NER models in a few-shot setting, we randomly select 100 instances per label from the original dataset (CoNLL2003) to ensure model convergence. The reported performance of the models is an average across five training runs.

Prompt-tuning. Similar to Yang et al. (2022), we adopt the EntLM model proposed by Ma et al. (2022) as the benchmark for prompt-tuning. The EntLM model is built on the BERT-base or BERT-large architecture. We conduct prompt-based experiments using a 5-shot training strategy (Ma et al., 2022). Additionally, we select two representative datasets, TechNews and Science, for the OOD tests, based on the highest and lowest word overlap with the original training domain, respectively.
Additionally, we include a recent data augmentation method, CF (Zeng et al., 2020), and the state-of-the-art cross-domain few-shot NER framework, FactMix (Yang et al., 2022), as baselines in both of the above settings. Note that we report the results of FactMix's highest-performing variant for all settings and datasets.

Implementation details
Following Yang et al. (2022), we train all models for 10 epochs and employ an early stopping criterion based on the performance on the development dataset. The AdamW optimizer (Loshchilov and Hutter, 2019) is used to optimize the loss functions.
We use a batch size of 4, a warmup ratio of 0.1, and a learning rate of 2e-5. The maximum input and output lengths of all models are set to 256.
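A sketch of this optimization setup is shown below; the linear warmup schedule and the stand-in model are our assumptions, since the paper specifies only the warmup ratio:

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(4, 2)  # stand-in for the PLTR basic model
total_steps = 1000             # illustrative; depends on data size and epochs

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # warmup ratio of 0.1
    num_training_steps=total_steps,
)
```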

Experimental results
To answer RQ1 and RQ2, we assess the performance of PLTR on both in-domain and cross-domain few-shot NER tasks. This evaluation is conducted in two settings: a fine-tuning setting with 100 training instances per type, and a prompt-tuning setting with 5 training instances per type.

Results on few-shot fine-tuning (RQ1)
Tables 2 and 3 show the in-domain and cross-domain performance in the fine-tuning setting, respectively. Based on the results, we have the following observations: (i) PLTR achieves the highest Micro F1 scores for all datasets and settings, indicating its superior performance. For instance, when using RoBERTa-large as the backbone, PLTR achieves F1 scores of 88.03% and 75.14% on the CoNLL2003 and TechNews datasets, respectively. (ii) PLTR significantly outperforms the previous state-of-the-art baselines in both in-domain and cross-domain NER. For example, PLTR exhibits a 1.46% and 10.64% improvement over FactMix, on average, on in-domain and cross-domain datasets, respectively. (iii) Few-shot cross-domain NER is notably more challenging than the in-domain setting, as all methods obtain considerably lower F1 scores. The performance decay on TechNews is smaller than in other domains, due to its higher overlap with the training set. In summary, PLTR demonstrates its effectiveness in recognizing named entities in both in-domain and OOD examples. The use of type-related features (TRFs), along with the incorporation of prompts based on TRFs, is beneficial for in-domain and cross-domain few-shot NER.

Results on few-shot prompt-tuning (RQ2)
To explore the generalizability of PLTR, we report in-domain and OOD results for the prompt-tuning setting in Tables 4 and 5, respectively. We obtain the following insights: (i) Due to data sparsity, the overall performance in the prompt-tuning setting is considerably lower than the results of 100-shot fine-tuning. (ii) Even with only 5 training instances per entity type, PLTR achieves the highest performance and outperforms the state-of-the-art baselines by a significant margin, demonstrating the effectiveness and generalizability of PLTR. For example, on the in-domain and cross-domain datasets, PLTR achieves an average improvement of 11.58% and 18.24% over FactMix, respectively. In summary, the PLTR framework not only effectively generalizes fine-tuning-based NER methods to unseen domains, but also attains the highest F1 scores in the prompt-tuning setting.

Analysis
Now that we have answered our research questions, we take a closer look at PLTR to analyze its performance. We examine whether the prompts are designed appropriately. Besides, we study how the number of training samples and selected type-related features influences performance (Sec. A.2), how PLTR affects representation similarities between the source and target domains, and whether PLTR is extensible to different source domains and entity types (Sec. A.3). Furthermore, we provide insights into the factors that may limit further improvements.

Ablation studies. To investigate the appropriateness of our prompt design, we conduct ablation studies on few-shot cross-domain NER in both the fine-tuning and prompt-tuning settings. The results are presented in Table 6. In the "NP" variant, prompts are removed during test-time inference. In this case, the F1 scores across all datasets and settings suffer a significant drop compared to our proposed PLTR. This demonstrates the crucial role of incorporating prompts during both the training and inference processes. In the "RDW" and "REW" variants, prompts are constructed using randomly selected words from the source domain and from the given example, respectively. The performance of both the "RDW" and "REW" model variants consistently falls short of PLTR, indicating that PLTR effectively identifies important knowledge in the source domain and establishes connections between unseen examples and that knowledge.
Additionally, to explore the efficacy of type-related feature selection (refer to Sec. 4.2), we evaluate PLTR (BERT-base) using various frequency ratios $\rho$ (in Eq. 4). The results are presented in Table 7. As the value of $\rho$ increases, the TRFs extracted using Eq. 4 become less closely associated with the specified entity type but more prevalent in other types. When the value of $\rho$ is raised from 3 to 9, we observe only a slight decrease in the F1 scores of PLTR. When the value of $\rho$ is raised to 20, the F1 score of PLTR drops, but still surpasses the state-of-the-art baselines.

Analysis of sentence similarities. In our analysis of sentence similarities, we investigate the impact of PLTR on the representation similarities between the source and target domains. We compute the average SBERT similarities for sentence representations in PLTR (BERT-base) between the source and target domains; the results are presented in Fig. 4 (a sketch of this computation is given at the end of this section). With the prompts generated by PLTR, the representation similarities between the source and unseen domains noticeably increase. That is, PLTR facilitates a more aligned and connected representation space, mitigating the gap between the source and target domains.

Error analysis. Although our proposed PLTR outperforms state-of-the-art baselines, we would like to analyze the factors restricting further improvements. Specifically, we compare the performance of PLTR (BERT-base) on sentences of different lengths in the test sets of the CoNLL2003 (in-domain), AI, and Science datasets. The results in the standard fine-tuning setting are provided in Table 8. We observe that the F1 scores of PLTR on sentences with more than 35 words ("> 35") are substantially higher than the overall F1 scores. In contrast, the F1 scores on sentences with 25 to 35 words ("25-35") or fewer than 25 words ("< 25") consistently fall below the overall F1 scores. This suggests that it may be more challenging for PLTR to select TRFs and generate appropriate prompts with less context.
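For reference, here is a minimal sketch of the sentence-similarity computation used in the analysis above, assuming the sentence-transformers package; the checkpoint name is illustrative:

```python
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint

def avg_similarity(source_sents, target_sents):
    """Average pairwise SBERT cosine similarity between two sentence sets."""
    a = sbert.encode(source_sents, convert_to_tensor=True)
    b = sbert.encode(target_sents, convert_to_tensor=True)
    return util.cos_sim(a, b).mean().item()
```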

Conclusions
In this paper, we establish a new state-of-the-art framework, PLTR, for few-shot cross-domain NER.
To capture useful knowledge from the source domain, PLTR employs mutual information criteria to extract type-related features. PLTR automatically selects pertinent features and generates a unique prompt for each unseen example, bridging the gap between domains. Experimental results show that PLTR not only effectively generalizes standard fine-tuning methods to unseen domains, but also demonstrates promising performance when incorporated with prompt-tuning-based approaches. Additionally, PLTR substantially narrows the disparity between in-domain examples and OOD instances, enhancing the similarities of their sentence representations.

Limitations
While PLTR achieves new state-of-the-art performance, it has several limitations. First, the number of type-related features used for prompt construction needs to be manually preset. Second, PLTR relies on identifying TRFs, which are tokens strongly associated with entity types. Extracting and incorporating more complex features, such as phrases, represents a promising direction for future research.
In the future, we also plan to incorporate PLTR with different kinds of pre-trained language models, such as autoregressive language models.

A.2 Influence of the number of selected type-related features
We evaluate PLTR based on BERT-base in the fine-tuning setting, with the number of selected relevant type-related features $K$ varying from 10 to 60. The results are shown in Fig. 5. Our observations indicate that, as the number of type-related features increases, the performance (F1 score) of PLTR initially improves, because a model incorporating more features is able to encode more useful knowledge from the source domain. However, the performance drops when the number of type-related features is too large. In our experiments, we set the number of type-related features to 40 on all datasets.

Figure 3: Influence of training instances on TechNews and Science (BERT-base).

A.1 Training algorithm of PLTR

Algorithm 1 gives the detailed training algorithm of PLTR. To start, we establish a basic model $M_b$ based on pre-trained language models (PLMs) and initialize its parameters $\Theta$ (lines 1-2). To capture knowledge from the source domain, PLTR identifies type-related features using the mutual information criteria (line 3). Next, given an input sentence $x \in D_{in}$, PLTR automatically selects relevant TRFs $R'(x) \subseteq R$ by formulating the selection process as a cloze-style task for $M_b$ (line 7). Furthermore, to incorporate entity type information into prompts, PLTR constructs a unique prompt $f'(x)$ for each input $x$, and these prompts are then fed into $M_b$ for entity recognition (lines 8-9). Finally, we iteratively refine the parameters $\Theta$ by jointly optimizing two loss functions: the NER task loss function $\mathcal{L}'_{\mathrm{NER}}$ and the TRF selection loss function $\mathcal{L}_{gen}$ (line 11). Note that, during inference, PLTR generates a unique prompt for each sentence within the unseen target domain using the extracted TRFs $R$. In this way, knowledge from the source domain is explicitly integrated into both the training and inference phases.

Algorithm 1 Training Algorithm for PLTR.
Require: The source dataset $D_{in}$; the basic model $M_b$ with parameters $\Theta$; the frequency ratio $\rho$; the number of selected type-related features $K$; the loss weight $\alpha$; the number of epochs $epoch$.
Ensure: The extracted type-related features $R$ and the trained basic model $M'_b$.
1: Establish the basic model $M_b$;
2: Initialize model parameters $\Theta$;
3: Extract type-related features $R$ for all entity types from the source dataset $D_{in}$ (Eq. 4);
4: while $i \leq epoch$ do
5:   for each sampled batch $X \subseteq D_{in}$ do
6:     for all sentences $x \in X$ do
7:       Select relevant TRFs $R'(x)$ for input $x$ (Eq. 5 and 6);
8:       Transform $x$ into the prompt template $f'(x)$ (Eq. 8);
9:       Input $f'(x)$ into $M_b$ for prediction;
10:    end for
11:    Update $\Theta$ by optimizing $\mathcal{L}$ (Eq. 10);
12:  end for
13: end while

Table 1: Statistics of the datasets used.

Table 6: Ablation studies on TechNews and Science.

Table 8: Error analysis on sentence lengths in test sets (BERT-base, fine-tuning).