Automatic Creation of Named Entity Recognition Datasets by Querying Phrase Representations

Most weakly supervised named entity recognition (NER) models rely on domain-specific dictionaries provided by experts. This approach is infeasible in many domains where dictionaries do not exist. While a phrase retrieval model was used to construct pseudo-dictionaries with entities retrieved from Wikipedia automatically in a recent study, these dictionaries often have limited coverage because the retriever is likely to retrieve popular entities rather than rare ones. In this study, we present a novel framework, HighGEN, that generates NER datasets with high-coverage pseudo-dictionaries. Specifically, we create entity-rich dictionaries with a novel search method, called phrase embedding search, which encourages the retriever to search a space densely populated with various entities. In addition, we use a new verification process based on the embedding distance between candidate entity mentions and entity types to reduce the false-positive noise in weak labels generated by high-coverage dictionaries. We demonstrate that HighGEN outperforms the previous best model by an average F1 score of 4.7 across five NER benchmark datasets.


Introduction
Named entity recognition (NER) models often require a vast number of manual annotations for training, which limits their utility in practice. In several studies, external resources such as domain-specific dictionaries have been employed as weak supervision to reduce annotation costs (Shang et al., 2018; Liang et al., 2020; Meng et al., 2021). However, such dictionaries exist only for certain domains, and building a dictionary for a new domain requires a high level of expertise and effort.
Figure 1: The ten most frequent entities in the top 1,000 phrases retrieved from the 2018-12-20 version of Wikipedia for the three questions "Which politician?", "Which city?", and "Which band?". The x- and y-axes of the graph represent the rank and frequency of the entities, respectively. Due to a bias in entity popularity (Chen et al., 2021), a current phrase retrieval model, DensePhrases (Lee et al., 2021), primarily returns popular entities, limiting the coverage of dictionaries.

To address this problem, a recent study proposed a framework called GeNER, which generates NER datasets without hand-crafted dictionaries (Kim et al., 2022). In GeNER, user questions that reflect the needs for NER are received as inputs
(e.g., "Which city?"), and an open-domain question-answering (QA) system, DensePhrases (Lee et al., 2021), is used to retrieve relevant phrases (i.e., answers) and evidence sentences from Wikipedia. The retrieved phrases constitute a 'pseudo' dictionary, which serves as weak supervision in place of hand-crafted dictionaries. The evidence sentences are annotated based on string matching with the pseudo-dictionary, resulting in the final dataset. This approach allows NER models to adapt to new domains for which training data are scarce and domain-specific dictionaries are unavailable. However, because the entity popularity of Wikipedia is biased (Chen et al., 2021; Leszczynski et al., 2022), existing open-domain QA models tend to retrieve popular entities rather than rare ones, which limits the coverage of dictionaries generated by GeNER. Figure 1 shows examples of this popularity bias in entities retrieved from the open-domain QA model: "David Cameron," "Beijing," and "The Beatles" frequently appear in the top 1,000 retrieved phrases for each type of question. Low-coverage dictionaries created from these biased results can cause incomplete annotations (i.e., false-negative entities), which impedes the training of NER models. Unfortunately, increasing the number of retrieved phrases (i.e., a larger top-k) is not an appropriate solution because it is computationally inefficient and causes a high false-positive rate in the dictionary. Therefore, a new search method that can efficiently retrieve diverse entities with a reasonable top-k, and a new NER dataset generation framework based on this search method, are needed.
In this study, we present HighGEN, an advanced framework for generating NER datasets with automatically constructed 'high-coverage' dictionaries. Specifically, we first obtain phrases and sentences and construct an initial dictionary in a similar manner to GeNER. Subsequently, we expand the initial dictionary using a phrase embedding search, in which the embeddings of the retrieved phrases are averaged to re-formulate query vectors. These new queries specify contexts in which different entities of the same type appear, allowing our retriever to search over a vector space in which various entities are densely populated. The expanded dictionary is used to annotate the retrieved sentences. Because a larger dictionary can induce more false-positive annotations during rule-based string matching, we introduce a new verification process to ensure that the weak labels produced by string matching are correct. The verification process compares the embedding distance between a candidate entity and the target entity type.
We trained recent NER models (Lee et al., 2020; Liang et al., 2020; Meng et al., 2021) with the datasets generated by HighGEN and evaluated the models on five datasets. Our models outperformed the baseline models trained using the previous best framework, GeNER, by an average F1 score of 4.7 (Section 4). In addition, we show an additional advantage of HighGEN over GeNER: it can generate datasets using only a few hand-labeled examples, without input user questions. HighGEN outperformed few-shot NER models on two datasets (Section 5). Finally, we analyze the factors affecting retrieval diversity and NER performance (Section 6). We make the following contributions:

• We propose HighGEN, a framework that generates NER datasets with entity-rich dictionaries that are automatically constructed from an unlabeled Wikipedia corpus.

• We present two novel methods in HighGEN: (i) a phrase embedding search that overcomes the limitations of the current open-domain phrase retriever and successfully increases the entity recall rate, and (ii) a distance-based verification that effectively reduces the false-positive noise in weak labels.

• HighGEN outperformed the previous best weakly supervised model, GeNER, by an F1 score of 4.7 on five datasets. In few-shot NER, HighGEN created datasets using few-shot examples as queries and outperformed current few-shot NER models on two datasets.

Weakly Supervised NER
The aim of NER is to identify named entities in text and classify them into predefined entity types. Let X = {x_1, ..., x_N} be a set of N unlabeled sentences and Y = {y_1, ..., y_N} be the list of N corresponding token-level label sequences. While supervised learning relies on the human-annotated labels, Y, to train models, in weakly supervised NER, the weak labels Ŷ are generated using string matching between a domain-specific dictionary, V, built by experts and the unlabeled sentences, X (Yang et al., 2018; Shang et al., 2018; Peng et al., 2019; Cao et al., 2019; Yang and Katiyar, 2020; Liang et al., 2020; Meng et al., 2021). Hand-crafted labeling rules are utilized in another line of studies (Fries et al., 2017; Ratner et al., 2017; Safranchik et al., 2020; Zhao et al., 2021); however, these rules are difficult to apply to new entity types. Recently, Kim et al. (2022) proposed GeNER, in which weak labels are generated with a pseudo-dictionary, V̂, created using a phrase retrieval model. We follow their approach but present an advanced framework that addresses the low-coverage problem and yields more entity-rich dictionaries and NER datasets.

DensePhrases
DensePhrases (Lee et al., 2021) is a phrase retrieval model that finds relevant phrases for natural language inputs in a Wikipedia corpus.

Figure 2: Overview of HighGEN. The framework comprises three stages: (1) natural language search to obtain unlabeled sentences X̂_1 and an initial dictionary V̂_1, (2) phrase embedding search to further obtain V̂_2, and (3) dictionary matching and verification to annotate sentences based on the embedding distance between a candidate entity (e.g., "Rome") and an entity type (e.g., "city") that is the average of the phrase vectors in the dictionary.

Unlike the retriever-reader approach, which first retrieves evidence passages from Wikipedia and then finds the
answer (Chen et al., 2017), DensePhrases retrieves answers directly from the dense phrase vectors of the entire English Wikipedia as follows:

(s*, x*) = argmax_{(s, x) ∈ W} q^T s,   (1)

where s is a phrase, a sequence of words from evidence text x (i.e., a sentence, passage, etc.), and W is the set of all phrase-evidence pairs in Wikipedia. The input question q is converted into the query vector q by the question encoder, E_q. Relevant phrases are then retrieved based on the similarity scores between the query vector q and the phrase vector s, which is represented as the concatenation of the start and end vectors of the phrase produced by the phrase encoder, E_s. All phrase vectors are 'pre-indexed' before inference, which greatly improves run-time efficiency (Seo et al., 2019; Lee et al., 2021). In the context of weakly supervised NER, DensePhrases can be used as a database for obtaining candidate entities for specific NER needs, along with sentences for constructing the final NER corpus (Kim et al., 2022).
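The retrieval in Equation (1) amounts to a maximum inner product search over pre-indexed phrase vectors, which can be sketched as follows. This is a toy illustration, not the real DensePhrases API: the index format, the 2-d vectors, and the fixed query vector are all assumptions for the example.

```python
# Toy sketch of Equation (1): maximum inner product search over
# pre-indexed phrase vectors. Real systems use an ANN index over
# high-dimensional encoder outputs; here everything is hand-built.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(query_vec, phrase_index, k):
    """Return the k (phrase, evidence) pairs whose phrase vectors s
    score highest against the query vector q under q^T s."""
    ranked = sorted(phrase_index, key=lambda e: dot(query_vec, e["vec"]), reverse=True)
    return [(e["phrase"], e["evidence"]) for e in ranked[:k]]

# Each entry stands for one (phrase, evidence) pair in W with its vector.
index = [
    {"phrase": "Beijing", "vec": [0.9, 0.1], "evidence": "Beijing is the capital of China."},
    {"phrase": "The Beatles", "vec": [0.1, 0.9], "evidence": "The Beatles formed in Liverpool."},
    {"phrase": "Paris", "vec": [0.8, 0.2], "evidence": "Paris hosted the 1900 Olympics."},
]

q = [1.0, 0.0]  # stand-in for E_q("Which city?")
print(retrieve_top_k(q, index, 2))
```

Because the index is built once before inference, only the cheap inner products are computed at query time, which is what makes the 'pre-indexed' search fast.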

Method
HighGEN comprises three stages: natural language search, phrase embedding search, and dictionary matching and verification (Figure 2). We highlight that the natural language search is used similarly in GeNER, whereas the last two stages are novel contributions of this study.

Natural Language Search
Query formulation. Let T = {t_1, ..., t_L} be a set of L target entity types. The concrete needs for these entity types are translated into simple questions. The questions follow the template "Which [TYPE]?", where the [TYPE] token is replaced with each entity type of interest. For instance, the question is formulated as "Which city?" if the target entity type t is city.
Retrieval. The input questions are fed into the phrase retrieval model, DensePhrases, to retrieve the top-k phrases s* and sentences x* (see Section 2.2). For the L different questions, a total of k_1 + ... + k_L sentences are used as the unlabeled sentences, X̂_1. The retrieved phrases are used as the pseudo-dictionary, V̂_1, which comprises phrase s and corresponding type t pairs (e.g., Beijing-city).
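Assembling X̂_1 and V̂_1 from the per-question retrieval results can be sketched as below. The nested result format (type → list of phrase-sentence pairs) is an illustrative assumption, not the framework's actual data structures.

```python
# Sketch of building the unlabeled sentence pool X1 and the initial
# pseudo-dictionary V1 from per-question retrieval results.

retrieved = {
    "city": [("Beijing", "Beijing is the capital of China."),
             ("Paris", "Paris is on the Seine.")],
    "band": [("The Beatles", "The Beatles formed in Liverpool.")],
}

X1 = []       # unlabeled sentence pool
V1 = set()    # (phrase, type) pairs, e.g., ("Beijing", "city")
for entity_type, results in retrieved.items():
    for phrase, sentence in results:
        X1.append(sentence)
        V1.add((phrase, entity_type))

print(len(X1), sorted(V1))
```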

Phrase Embedding Search
Query re-formulation. As mentioned in Section 1, the coverage of the initial dictionary V̂_1 is often limited because of the entity popularity bias. Our solution for retrieving diverse entities is very simple: we re-formulate queries by averaging the phrase vectors as follows:

q̃ = (1/N) Σ_{n=1}^{N} s_n,   (2)

where s_n is the phrase vector of the n-th top phrase (with corresponding sentence x_n) from the natural language search. We used only the top 100 phrases for each question (i.e., N = 100) because a larger number of phrases did not improve retrieval quality in our initial experiments.
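The averaging in Equation (2) can be sketched as follows; the 2-d vectors are toy stand-ins for DensePhrases embeddings.

```python
# Sketch of Equation (2): a new query vector is the mean of the top-N
# phrase vectors retrieved for one entity type.

def average_query(phrase_vectors, n=100):
    top = phrase_vectors[:n]
    dim = len(top[0])
    return [sum(vec[d] for vec in top) / len(top) for d in range(dim)]

city_vectors = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]  # stand-ins for s_1..s_3
print(average_query(city_vectors))  # close to [0.8, 0.2]
```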
Retrieval. For the L new queries obtained by Equation (2), a total of k'_1 + ... + k'_L phrases are additionally retrieved by Equation (1) and constitute a new dictionary V̂_2. Subsequently, we merge V̂_1 and V̂_2 to obtain the final dictionary V̂. Note that we do not use the retrieved sentences X̂_2 because we found that using only X̂_1 as the final unlabeled sentences (i.e., X̂) resulted in better NER performance (a related analysis is included in Section 6.2).

Interpretation. Natural language search results in the retriever performing 'broad' searches over all the Wikipedia contexts relevant to the target entity class. In contrast, phrase embedding search, which averages phrase vectors of the same entity type, can be viewed as providing prompts that implicitly represent certain contextual patterns in which entities of the target class often appear. Having the retriever perform 'narrow' searches by focusing on specific contexts leads to a wide variety of entities with less bias towards popular ones. This is because (1) the same entities rarely appear repeatedly in a specific context, whereas (2) different entities of the same type frequently appear in a similar context, as they are generally interchangeable.
Our qualitative analysis supports our claim above. We retrieved 5k sentences using two questions, "Which actor?" and "Which athlete?", and manually analyzed 100 sentences sampled from them. Table 1 shows that sentences by the phrase embedding search exhibit clear patterns in their contexts, whereas those by the natural language search do not. Specifically, 91 and 94 of the 100 sentences for the actor and athlete types had similar patterns, respectively. Further analysis shows that this property of the phrase embedding search contributes significantly to improving entity diversity (Section 6.1) and NER performance (Section 6.2).

Dictionary Matching & Verification
Dictionary matching. After X̂ and V̂ are obtained, dictionary matching is performed to generate the weak labels, Ŷ. Specifically, if a string in an unlabeled sentence matches an entity name in the dictionary, the string is labeled with the corresponding entity type. However, this method cannot handle the label ambiguity inherent in entities because it relies only on lexical information without leveraging the contextual information of phrases. The false-positive noise due to label ambiguity is amplified as the dictionary size increases, making it difficult to effectively use our expanded dictionary V̂.
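A minimal sketch of the rule-based string matching, assuming whitespace tokenization and a longest-match-first policy (the exact matching rules are not specified in the text, so these are illustrative choices):

```python
# Sketch of dictionary matching: label any token span whose surface string
# exactly matches a dictionary entry, preferring longer matches and
# forbidding overlaps.

def dictionary_match(tokens, dictionary):
    """dictionary maps surface strings to entity types; returns a sorted
    list of (start, end, type) spans."""
    max_len = max(len(k.split()) for k in dictionary)
    spans, used = [], set()
    for length in range(max_len, 0, -1):          # longest match first
        for i in range(len(tokens) - length + 1):
            span = range(i, i + length)
            if any(j in used for j in span):       # no overlapping labels
                continue
            surface = " ".join(tokens[i:i + length])
            if surface in dictionary:
                spans.append((i, i + length, dictionary[surface]))
                used.update(span)
    return sorted(spans)

V = {"Alexander Downer": "person", "Barcelona": "sports team"}
tokens = "Alexander Downer praised Barcelona on Friday".split()
print(dictionary_match(tokens, V))
```

Note how the matcher labels "Barcelona" as a sports team regardless of context; this is exactly the lexical ambiguity the verification stage is designed to correct.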
Verification. Candidate annotations provided by dictionary matching are passed to the verification stage. Let e be a matched string in a sentence and T̃ be the set of matched entity types (a subset of T). The verification function L is defined as follows:

t* = argmin_{t_l ∈ T̃} d(e, t_l);   L(e) = t* if d(e, t*) ≤ τ, and e is left unlabeled otherwise,   (3)

where d is the Euclidean distance function; e is the phrase vector of the candidate string; t_l is the l-th type vector; and τ is the cut-off value. The string is labeled with the nearest type t*, or left unlabeled if the distance exceeds the cut-off value. The type vector is calculated by averaging all the retrieved phrase vectors of the entity type, based on the assumption that the mean vector of phrases is a good representative of the entity class. The cut-off value is also calculated from phrase vectors. Specifically, the function d computes the distance scores between the type vector t_l and all the phrase vectors of that type; the distribution of the distance scores is then standardized, and the score at 'z' standard deviations from the mean is used as the cut-off value (e.g., z = 3).
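The verification rule can be sketched as below. The toy 2-d vectors stand in for DensePhrases embeddings, and the cut-off is computed as described: mean plus z standard deviations of the distances from a type vector to its own members' phrase vectors.

```python
# Sketch of distance-based verification: keep a matched string only if its
# phrase vector lies within a z-score cut-off of the nearest matched type
# vector.

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def type_cutoff(type_vec, member_vecs, z=3.0):
    """Cut-off = mean + z * std of distances from the type vector to the
    phrase vectors of its own retrieved members."""
    dists = [euclidean(type_vec, m) for m in member_vecs]
    mean = sum(dists) / len(dists)
    std = (sum((d - mean) ** 2 for d in dists) / len(dists)) ** 0.5
    return mean + z * std

def verify(candidate_vec, matched_types, type_vecs, cutoffs):
    """Return the nearest matched type, or None (unlabeled) beyond cut-off."""
    best = min(matched_types, key=lambda t: euclidean(candidate_vec, type_vecs[t]))
    return best if euclidean(candidate_vec, type_vecs[best]) <= cutoffs[best] else None

type_vecs = {"city": [1.0, 0.0], "band": [0.0, 1.0]}
city_members = [[1.0, 0.1], [0.9, 0.0], [1.2, 0.0]]   # toy retrieved phrases
cutoffs = {"city": type_cutoff(type_vecs["city"], city_members)}

print(verify([1.0, 0.05], ["city"], type_vecs, cutoffs))  # kept as "city"
print(verify([5.0, 5.0], ["city"], type_vecs, cutoffs))   # rejected: None
```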

Experiments
In this experiment, it was assumed that human-annotated datasets did not exist; thus, our models were trained only on the synthetic data {X̂, Ŷ} generated by HighGEN. To avoid excessive hyperparameter search, we used the same sets of input questions and the same number of sentences for each question (i.e., k_1, ..., k_L) as those used in the previous study (Kim et al., 2022). A new hyperparameter introduced in HighGEN, the number of phrases retrieved by the phrase embedding search (i.e., k'_1, ..., k'_L), was set to 30k. Please refer to Appendix A for the full list of hyperparameters and implementation details. For metrics, the entity-level precision, recall, and F1 scores were used (Tjong Kim Sang and De Meulder, 2003).

Datasets
We used five datasets from four domains. Following Kim et al. (2022), we did not use the MISC and other classes because they are vague to represent with user questions. (i) CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003) consists of Reuters news articles with three coarse-grained entity types: person, location, and organization. (ii) Wikigold (Balasuriya et al., 2009) is a small dataset that consists of Wikipedia documents with the same entity types as CoNLL-2003. (iii) WNUT-16 (Strauss et al., 2016) consists of nine entity types annotated in tweets, such as TV show, movie, and musician. (iv) Two biomedical-domain datasets, NCBI-disease (Dogan et al., 2014) and BC5CDR (Li et al., 2016), are collections of PubMed abstracts with manually annotated disease (NCBI-disease) or disease and chemical entities (BC5CDR). The benchmark statistics are listed in Table B.2 (Appendix).

(Footnote on the verification cut-off: the distribution of the distance scores is generally balanced; thus, we used the usual method to compute the cut-off value without any additional tricks to balance the distribution.)

NER Models
We trained three types of NER models on our synthetic data. We provide descriptions of the models below but cannot cover every detail; interested readers should refer to Liang et al. (2020) and Meng et al. (2021). Note that, to avoid excessive parameter tuning, we did not use validation sets to select the best model parameters during training. The implementation details are provided in Appendix A.
Standard: This type of model consists of a pre-trained language model for encoding input sequences and a linear layer for token-level prediction. We used RoBERTa (Liu et al., 2019) as the language model for the news, Wikipedia, and Twitter domains and BioBERT (Lee et al., 2020) for the biomedical domain.
BOND (Liang et al., 2020): This model is based on self-training, a learning algorithm that corrects weak labels by leveraging large-scale language models. Specifically, a teacher model (similar to the standard model above) is initially trained on the weakly labeled corpus and used to re-annotate the corpus based on its predictions. This re-annotation process allows the model to remove noisy labels and to identify missing entities. A student model with the same structure as the teacher model is then trained on the re-annotated corpus. The teacher model is updated with the student model's parameters in the next round and performs the re-annotation process again. This process is repeated until the maximum training step is reached.

RoSTER (Meng et al., 2021): In RoSTER, the generalized cross-entropy (GCE) loss, which is designed to be more robust to noise than the standard cross-entropy loss, is applied to a standard model. During GCE training, weak labels to which the model assigns low confidence scores are removed at every update step. Using this algorithm, five randomly initialized models are trained, and a new model is trained to approximate the average predictions of the five models. Finally, the new model is further trained with language-model-augmented self-training, which jointly approximates the teacher model's predictions for (1) the original sequence and (2) an augmented sequence with some tokens replaced by a language model.
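For reference, the GCE loss used in RoSTER can be sketched as follows. The formula L_q(p) = (1 - p^q) / q follows Zhang and Sabuncu (2018); the confidence-based label removal below is a simplified stand-in for RoSTER's actual rule, with an illustrative threshold.

```python
# Sketch of the generalized cross-entropy (GCE) loss: for the model
# probability p of the (possibly noisy) label, L_q(p) = (1 - p^q) / q.
# It interpolates between cross-entropy (q -> 0) and MAE (q = 1), so
# low-confidence (likely noisy) labels contribute a bounded loss.

def gce_loss(p, q=0.7):
    return (1.0 - p ** q) / q

def keep_label(confidence, threshold=0.7):
    """Simplified noisy-label removal: drop a weak label when the model's
    confidence for it falls below a threshold."""
    return confidence >= threshold

print(gce_loss(0.9), gce_loss(0.5))  # lower loss for higher confidence
```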

In-domain Resources
Baseline models are classified into two categories based on the amount of in-domain resources required during training.
GeNER (Kim et al., 2022): GeNER is the only baseline model that uses the same amount of resources as HighGEN. GeNER retrieves phrases and unlabeled sentences using natural language search and performs string matching to create datasets.
Full dictionary: Full-dictionary models use large-scale dictionaries that comprise numerous entities hand-labeled by experts. For the CoNLL-2003, Wikigold, and WNUT-16 datasets, each dictionary was constructed using Wikidata and dozens of gazetteers compiled from multiple websites (Liang et al., 2020). For NCBI-disease and BC5CDR, the dictionary was constructed by combining the MeSH database and the Comparative Toxicogenomics Database (more than 300k disease and chemical entities) (Shang et al., 2018). These dictionaries were used to generate weak labels based on string matching with an in-domain corpus, which is an unlabeled version of the original training corpus.

Table 2 shows that HighGEN outperformed GeNER on the five datasets by average F1 scores of 4.2, 3.0, and 4.7 for the standard, BOND, and RoSTER models, respectively. Performance improvements were particularly evident in recall. When the verification method was not applied (i.e., w/o L), the performance dropped by an average F1 score of 5.4 (mostly in precision). A high NER performance can be expected with full dictionaries, but they cannot be built without tremendous expert effort. We emphasize that our method of automatically creating high-coverage pseudo-dictionaries and NER datasets is a promising way to achieve competitive performance with minimal effort.

Few-shot NER
We show an additional use case for HighGEN: creating NER datasets using only a few hand-labeled examples, without input questions. This eliminates the tuning/engineering effort that users might otherwise spend designing appropriate questions to express their NER needs, which is a distinct advantage of HighGEN over GeNER. Specifically, HighGEN takes sentences with annotated phrases as input and retrieves X̂_2 and V̂_2 using the phrase embedding search (Equations (1) and (2)), which are used as the unlabeled sentences and pseudo-dictionary to produce the final dataset. We tested two types of models. (1) The entity-level model uses every annotated phrase as a separate query; thus, the number of queries equals the number of human annotations. (2) The class-level model first averages the phrase vectors of the same entity type and uses the averages as queries; thus, the number of queries equals the number of entity types. The entity-level model has an advantage in terms of entity recall, whereas the class-level model can mitigate the noise that individual phrase vectors may contain.
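The two query schemes can be sketched as follows; the annotation format and 2-d vectors are illustrative stand-ins for DensePhrases embeddings.

```python
# Sketch of the two few-shot query schemes: entity-level uses each
# annotated phrase vector as its own query; class-level averages the
# vectors per entity type.

def entity_level_queries(annotations):
    """annotations: list of (entity_type, phrase_vector) pairs."""
    return [vec for _, vec in annotations]

def class_level_queries(annotations):
    by_type = {}
    for t, vec in annotations:
        by_type.setdefault(t, []).append(vec)
    queries = {}
    for t, vecs in by_type.items():
        dim = len(vecs[0])
        queries[t] = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
    return queries

ann = [("disease", [1.0, 0.0]), ("disease", [0.0, 1.0]), ("chemical", [1.0, 1.0])]
print(len(entity_level_queries(ann)))  # one query per annotation
print(class_level_queries(ann))        # one averaged query per type
```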
Setups. We sampled datasets from CoNLL-2003 and BC5CDR so that each dataset consists of five sentences per entity type, resulting in 20 and 10 examples for CoNLL-2003 and BC5CDR, respectively (unlike the experiments in Section 4, the MISC type was included for a fair comparison with baseline models). All experimental results were averaged over five sampled datasets. We used the models of Huang et al. (2021) and Jia et al. (2022) as baselines; among them, QUIP (Jia et al., 2022) is the previous best model in few-shot NER (details on the models are presented in Appendix C). Other few-shot NER models were excluded because they used a sufficient amount of 'source' data (Yang and Katiyar, 2020; Cui et al., 2021), which differs from our setup. For HighGEN, we retrieved the same number of sentences for each query, and the total number of sentences was 120k for CoNLL-2003 and 10k for BC5CDR. We initially trained RoSTER on our synthetic data and then fine-tuned the model on the few-shot examples.

Results. Table 3 shows that our entity- and class-level models outperformed QUIP by an average F1 score of 2.1 and 3.0 on the two datasets, respectively. For CoNLL-2003, the entity-level model was better than the class-level model because entities of the same entity type often belong to different sub-categories. For instance, "Volkswagen" and "University of Cambridge" belong to the same organization type in CoNLL-2003, but their sub-categories are "company" and "institution," respectively. Therefore, it is difficult to group them into a single vector, and it is important to widely cover various entities using separate queries for each sub-category. In contrast, entities in BC5CDR can be naturally grouped by disease or chemical type, which allows the class-level model to perform well.
Additionally, biomedical entity names often contain domain-specific terms, numbers, special characters, and abbreviations that are difficult to encode with a general-purpose phrase encoder, making their vector representations relatively error-prone. The class-level model can produce good representations by averaging phrase vectors.

Retrieval Performance
We compared the natural language search and the phrase embedding search in terms of their accuracy and diversity. With reference to Kim et al. (2022), we used 11 fine-grained questions within the following four coarse-grained entity types: (i) person (athlete, politician, actor), (ii) location (country, city, state in the USA), (iii) organization (sports team, company, institution), and (iv) biomedicine (disease, drug). We report the average scores for each coarse-grained entity type.
Metrics. (i) The precision at 100 (P@100) represents the accuracy of the top 100 retrieved phrases. Because there are no gold annotations for the retrieved phrases, we manually determined whether the phrases correspond to the correct entity types.
(ii) Diversity at 10k (Div@10k) calculates the percentage of unique phrases out of the top 10k phrases based on their lowercase strings.
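The two metrics can be sketched as below. Because P@100 relies on manual type judgments, the sketch takes the judged booleans as input; Div@10k is the fraction of case-insensitively unique strings among the top-k phrases.

```python
# Sketch of the retrieval metrics: precision at k over manually judged
# top-k phrases, and diversity at k as the percentage of unique
# lowercased phrase strings.

def precision_at_k(judgments, k=100):
    """judgments: booleans (manually judged correct/incorrect), top-k order."""
    top = judgments[:k]
    return 100.0 * sum(top) / len(top)

def diversity_at_k(phrases, k=10000):
    top = phrases[:k]
    return 100.0 * len({p.lower() for p in top}) / len(top)

phrases = ["Beijing", "beijing", "Paris", "Rome", "Rome"]
print(diversity_at_k(phrases, k=5))  # 3 unique of 5 -> 60.0
```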
Results. The phrase embedding search substantially outperformed the natural language search by a macro-average of 28.1 diversity points across the four types without loss of accuracy. The diversity scores for the location entity types did not improve significantly because there are only limited numbers of names for locations such as countries in the real world, but the diversity scores for the other types improved dramatically (+37.4 diversity). While both query types produced accurate top results (P@100), the accuracy tends to decrease as the top-k increases, which makes it difficult to increase the dictionary size by retrieving more phrases. Thus, retrieving diverse entities with a reasonable top-k is not only important for computational efficiency but also helps the retriever maintain accuracy. In this regard, the phrase embedding search has a large advantage over the natural language search. We discuss this further in Section 6.2. In addition, examples of the top phrases retrieved by both search methods are listed in Table D.3 (Appendix).

Data Size
Effect of dictionary size. Figure 3a shows the NER performance of RoSTER models according to the size of the additional dictionary added to the initial dictionary V̂_1. We expanded the dictionary using the natural language search (red line in the graph) or the phrase embedding search (blue). F1 scores were measured on the BC5CDR test set.
The performance of both models increased initially but decreased after a peak, indicating a trade-off between the size and accuracy of the dictionary. The optimal size of the additional dictionary for the phrase embedding search (i.e., 45k) was larger than that for the natural language search (i.e., 30k). As shown in Figure 3b, the natural language search required a much larger top-k (more than twice as large) than the phrase embedding search to reach the required dictionary size, which caused more false-positive entries to be included in the dictionary.
Effect of Additional Sentences. In addition to using the additional dictionary V̂_2 obtained by the phrase embedding search, we tried using the additional sentences X̂_2 along with X̂_1 (see the black line in Figure 3a).

Figure 3: Performance of RoSTER models with different sizes of the additional dictionary, and the top-k required to reach a certain dictionary size by the natural language search (red) and the phrase embedding search (blue). The black line represents the performance of the model trained with additional sentences from the phrase embedding search (i.e., X̂_1 + X̂_2). The x-axis indicates the size of the 'additional' dictionary (x = 0: the initial dictionary consisting of 12k entities). The y-axes of graphs (a) and (b) indicate F1 scores on the BC5CDR test set and the 'required' number of retrieved phrases for each entity type, respectively.

The performance was higher than that of the other models at low top-k (x = 15k), but it degraded rapidly as the dictionary size grew. As discussed in Section 3.2, the sentences from the phrase embedding search have similar patterns, and from this result, we conjecture that the limited contextual patterns hindered the model's generalizability. In conclusion, using only X̂_1 for the unlabeled corpus and both V̂_1 and V̂_2 for the dictionary would result in the best NER performance in most cases. However, as shown in Section 5, using X̂_2 and V̂_2 can be a good alternative if users want to avoid the effort required for query tuning.

Case Study

Table 5 shows several examples of how a large dictionary induced noisy annotations in dictionary matching and how these annotations were corrected by the verification method. We used nine fine-grained entity types belonging to the person, location, and organization types, which were used in the experiments in Section 6.1. We denote the initial dictionary (i.e., V̂_1) as a small dictionary and the expanded dictionary that consists of the initial and additional dictionaries (i.e., V̂_1 + V̂_2) as a large dictionary.

Table 5: Case study of dictionary sizes and dictionary matching methods. Small V: initial dictionary (i.e., V̂_1) consisting of 12k entities. Large V: expanded dictionary (i.e., V̂_1 + V̂_2) consisting of 134k entities. String: rule-based string matching. Verif.: the verification method. Red: incorrect annotations. Blue: correct annotations.

While the small dictionary could not match the entity "Alexander Downer" owing to its limited coverage, the entity was correctly annotated by the large dictionary. However, the large dictionary incorrectly annotated "Central" as a company, indicating a trade-off between the coverage and accuracy of a dictionary. Also, "Barcelona" appeared mainly as a sports team in the small dictionary, whereas in the large dictionary it frequently appeared as a city and was therefore incorrectly annotated by the latter. In contrast, our verification method had the advantages of both dictionaries: it preserved the high accuracy of the small dictionary while retaining the high coverage of the large dictionary, resulting in correct annotations.

Conclusion
In this study, we presented an advanced dataset generation framework, HighGEN, which combines (1) a phrase embedding search to efficiently retrieve diverse entities with an open-domain retriever and (2) a verification method to deal with the false positives induced by a large dictionary.
In the experiments, we demonstrated the superiority of HighGEN using five NER benchmarks and performed extensive ablation studies, comparison of retrieval performance, and analysis of potential uses of the phrase embedding search in few-shot NER scenarios. We hope that our study will provide practical help in several data-poor domains and valuable insights into entity retrieval and weakly supervised NER.

Limitations
Inappropriate initial user questions can negatively affect NER performance. If the questions are poorly formed, the QA model returns incorrect phrases, and the phrase embedding queries generated from them will also be erroneous. The absence of a component for controlling this error cascade in our framework should be addressed in future studies. In addition, our method depends on the phrase encoder of DensePhrases. Because the phrase encoder is a general-purpose model trained on Wikipedia-based datasets, its capability may be limited for domain-specific entities. In few-shot NER, the phrase encoder can be sensitive to the quality of the given example sentences. Future studies should thoroughly analyze the effect of the phrase encoder's performance on the resulting NER datasets and NER performance.

• Standard: … for CoNLL-2003 and the biomedical domain datasets, and 20 epochs for the other small datasets (Wikigold and WNUT-16).
• BOND: We initially trained the teacher model for one epoch and self-trained the model for one additional epoch. For the other hyperparameters, we used those suggested by the authors.
• RoSTER: We referred to the official repository to select hyperparameters. We used the default hyperparameters suggested by the authors, except for the noise-training and self-training epochs, which were set to 1. In addition, when training models on the biomedical domain datasets generated by HighGEN, we used a threshold value of 0.1 in the noisy-label removal step.

C Few-shot Models
Supervised: A standard model (described in Section 4.2) is trained directly on few-shot examples using a token-level cross-entropy loss.
Noisy supervised pre-training (NSP) (Huang et al., 2021): The model is initially trained on a large-scale weakly-labeled corpus, called WiNER (Ghaddar and Langlais, 2017), which consists of Wikipedia documents with weak labels generated using the anchor links and coreference resolution. Subsequently, the model is fine-tuned on few-shot examples.
Self-training (Huang et al., 2021): This model is trained using a recent semi-supervised learning method (Xie et al., 2020). Specifically, the model is initially trained on few-shot examples and then fine-tuned by self-training on unlabeled training sentences. Note that the detailed algorithm may differ from the self-training methods used in BOND and RoSTER; please refer to the papers for details.
QUIP (Jia et al., 2022): QUIP was used as the state-of-the-art few-shot model in our experiment. The model is pre-trained with approximately 80 million question-answer pairs that are automatically generated by the BART-large model (Lewis