Self-Adaptive Named Entity Recognition by Retrieving Unstructured Knowledge

Although named entity recognition (NER) helps us to extract domain-specific entities from text (e.g., artists in the music domain), it is costly to create a large amount of training data or a structured knowledge base to perform accurate NER in the target domain. Here, we propose self-adaptive NER, which retrieves external knowledge from unstructured text to learn the usage of entities that have not been learned well. To retrieve useful knowledge for NER, we design an effective two-stage model that retrieves unstructured knowledge using uncertain entities as queries. Our model predicts the entities in the input and then finds those whose predictions it is not confident about. It then retrieves knowledge using these uncertain entities as queries and concatenates the retrieved text to the original input to revise the prediction. Experiments on the CrossNER datasets demonstrate that our model outperforms strong baselines by 2.35 points in the F1 metric.


Introduction
Named entity recognition (NER) helps us to extract entities from text in various domains such as biomedicine, disease (Dogan et al., 2014), and COVID-19. However, accurate neural NER requires a massive amount of training data (Chiu and Nichols, 2016; Ma and Hovy, 2016; Yadav and Bethard, 2018). Moreover, annotating a domain-specific NER dataset is expensive because it requires the involvement of domain experts.
To compensate for the lack of training data in NER, researchers have utilized external knowledge. Traditional feature-based NER uses features based on gazetteers or name lists (Florian et al., 2003; Cohen and Sarawagi, 2004; Luo et al., 2015) as external knowledge. Although recent neural NER methods can even benefit from gazetteers and name lists (Seyler et al., 2018; Liu et al., 2019; Mengge et al., 2020), only a few domains with structured knowledge bases (gazetteers) have this merit. Thus, several studies have resorted to using raw text (unstructured knowledge) to perform weakly supervised learning on general-domain structured knowledge (Cao et al., 2019; Mengge et al., 2020; Liu et al., 2021a).

Figure 1: Concept of self-adaptive NER: the model predicts entity candidates to conduct entity-level retrieval from the unstructured KB; then it revises the prediction with reference to the retrieved knowledge.
In this paper, we explore the potential of utilizing unstructured knowledge in the NER task by referring to it at inference time. Our basic idea is inspired by recent retrieval-augmented language models (LMs) (Guu et al., 2020). These models are pre-trained with a retrieval-augmented masked language modeling (MLM) objective, so that they perform well in open-domain question answering (ODQA) by retrieving relevant unstructured knowledge using a question as the query. However, as we later confirm in the experiments, models designed for ODQA are not effective in the NER task, because NER requires an understanding of the many entities in the input text.
To deal with this problem, we propose a retrieval-augmented model capable of determining which entities to focus on in the input text for knowledge retrieval. The proposed model, self-adaptive NER with unstructured knowledge (SA-NER), searches an unstructured knowledge base (UKB) when it lacks confidence in its prediction. We create the UKB automatically by splitting a raw text corpus into pieces and assigning dense vectors as keys to each piece of unstructured knowledge. To help the model understand local semantics, we design a retrieval system tailored for NER: our model predicts the entities and then retrieves knowledge about those it is not confident in predicting.
To evaluate our method's capability of retrieving useful knowledge about entities, we conducted experiments on various NER datasets (Tjong Kim Sang and De Meulder, 2003;Salinas Alvarado et al., 2015;Liu et al., 2021b), some of which have domain-specific types.
Our contributions are summarized as follows:
• We are the first to integrate retrieval augmentation into NER. SA-NER retrieves entity-level knowledge dynamically for NER.
• In experiments, SA-NER outperformed strong baselines pre-trained in a supervised and self-supervised fashion by 1.22 to 2.35 points.
• We reveal why knowledge retrieval is useful for NER. We found that our model is effective on entities not included in the general-domain pre-training dataset.

Task Settings
We developed SA-NER to solve NER with unstructured knowledge. NER is a sequence tagging task in which the model takes as input a token sequence X ∈ V^L, where V is the vocabulary and L is the maximum sequence length, and outputs a BIO label sequence of the same length. Let C be the number of types; then the number of BIO labels is 2C + 1. SA-NER assumes a corpus as unstructured knowledge, which is split into token sequences of length L, following an existing retrieval-augmented language model (LM) (Borgeaud et al., 2021), in order to store a large corpus efficiently. We retrieve m pieces of knowledge and concatenate them with X, feeding the concatenated text X+ ∈ V^{(m+1)L} to the model.
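As a minimal sketch of these shapes (the tokens, padding convention, and function name are ours, used purely for illustration):

```python
L = 8          # maximum sequence length
C = 4          # number of entity types
num_labels = 2 * C + 1  # B-/I- label per type, plus O

def concat_with_knowledge(x, knowledge):
    """Concatenate m retrieved knowledge sequences (each of length L) to the input X."""
    assert all(len(k) == L for k in knowledge)
    return x + [tok for k in knowledge for tok in k]

x = ["Bob", "Weinstein", "visited", "Rome", "[PAD]", "[PAD]", "[PAD]", "[PAD]"]
m = 2
knowledge = [["k%d_%d" % (j, i) for i in range(L)] for j in range(m)]
x_plus = concat_with_knowledge(x, knowledge)
assert len(x_plus) == (m + 1) * L  # X+ ∈ V^{(m+1)L}
```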

Related Work
Here, we review NER that uses raw text (unstructured knowledge) without structured knowledge, with in-domain structured knowledge, with general-domain structured knowledge, and for the pre-training of billion-scale LMs. We also review retrieval-augmented LMs.

NER with unstructured knowledge
Researchers have utilized various clues to retrieve useful raw text for NER. Traditional NER models focus on surrounding contexts (Sutton and McCallum, 2004; Finkel et al., 2005; Krishnan and Manning, 2006) and linked documents (Plank et al., 2014) to capture non-local dependencies. More recent neural NER models benefit from neighboring sentences to obtain better contextualized word representations (Virtanen et al., 2019; Luoma and Pyysalo, 2020). Meanwhile, Banerjee et al. (2019), among others, encode knowledge contexts on entity types, such as questions, definitions, and examples taken from in-domain structured KBs (e.g., the UMLS Metathesaurus). In this study, we developed a generic method that retrieves useful raw text (unstructured knowledge) for NER.
Distant supervision (Mintz et al., 2009) uses structured knowledge to annotate raw text with pseudo-labels. Performing distantly supervised fine-tuning with in-domain structured knowledge after MLM pre-training is effective in domain-specific NER (Wang et al., 2021; Trieu et al., 2022). However, domain-specific distantly supervised learning depends on how well the structured knowledge covers the label set of the downstream task.
Weakly supervised learning with general-domain structured knowledge (Cao et al., 2019; Liang et al., 2020; Mengge et al., 2020; Liu et al., 2021a) can transfer general-domain knowledge to the target domain. These methods learn entity knowledge through weakly supervised learning, even when the target task has domain-specific entities and types (Liu et al., 2021a). We confirmed that our model achieves a performance gain by using raw text as unstructured knowledge at inference time, because world knowledge cannot be fully stored in a limited-size model.
Pre-trained LMs memorize factual knowledge in their parameters through pre-training on unstructured corpora (Petroni et al., 2019; Cao et al., 2021; Dhingra et al., 2022). Recently, billion-scale generative pre-trained LMs have been proposed (Raffel et al., 2020; Brown et al., 2020). Although generative models cannot be applied naively to structured prediction tasks such as NER, several studies have tackled NER with generative LMs (Paolini et al., 2021; Yan et al., 2021; Zhang et al., 2022; Chen et al., 2022). One advantage of retrieval-augmented LMs over billion-scale LMs is ease of maintenance; for instance, the models can use up-to-date Wikipedia as their UKBs.

Retrieval-Augmented Language Models
LMs using external knowledge have recently been proposed (Guu et al., 2020; Lewis et al., 2020; Izacard and Grave, 2021; Singh et al., 2021; Borgeaud et al., 2021). However, they focus on language modeling and ODQA, and successful retrieval-augmented LMs for NER have not been reported. They obtain queries for knowledge retrieval in such a way that each query represents the whole input or a fixed-length chunk split from the input. Therefore, they cannot retrieve knowledge that conveys the usage of entities, which is important for NER. In addition, because an input may include many entities, the model should focus on only those entities whose knowledge is not stored in the model. However, retrieval-augmented LMs have not incorporated such a mechanism to create and filter multiple queries.

Wang et al. (2022) and Shinzato et al. (2022) found that retrieving knowledge from the training data is also useful, as it provides knowledge not stored in the trained model. Therefore, we implemented SA-NER in such a way that it uses both labeled and unlabeled UKBs.

de Jong et al. (2022) used a virtual knowledge base whose values are vector representations. Focusing on entity knowledge, they extracted mentions from hyperlinks in Wikipedia to learn their representations. They reported that the virtual KB was less accurate but more efficient than FiD (Izacard and Grave, 2021), which reads the input and textual knowledge with attention.

Method
Here, we present SA-NER. We explain the construction of the unstructured knowledge base ( §4.1), the encoder architecture ( §4.2), the two-stage NER algorithm which revises the prediction using the unstructured knowledge ( §4.3), the training method ( §4.4), and the pre-training method ( §4.5).

Unstructured KB Construction
We create an unlabeled UKB from raw text and a labeled UKB from the training data. We assume in-domain text as a source of unlabeled unstructured knowledge and split it into token sequences of length L, which is equal to the maximum length of the SA-NER inputs. In addition, following Wang et al. (2022), we add the model's training data as labeled unstructured knowledge. We set L = 64 to avoid truncating most of the original inputs.
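The corpus-splitting step above can be sketched as follows (a minimal illustration; the padding token and function name are ours, and the paper operates on subword tokens rather than strings):

```python
def split_corpus(tokens, L=64):
    """Split a tokenized corpus into fixed-length pieces of unstructured knowledge,
    padding the final piece so every stored sequence has the same length L."""
    pieces = [tokens[i:i + L] for i in range(0, len(tokens), L)]
    if pieces and len(pieces[-1]) < L:
        pieces[-1] = pieces[-1] + ["[PAD]"] * (L - len(pieces[-1]))
    return pieces
```

With L = 64, a 130-token corpus yields three pieces, the last of which is padded.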
The unstructured knowledge is stored in the UKBs with associated keys. The keys of a sequence are its sentence embedding and its n-gram embeddings. Huang et al. (2021) showed that the average of the token embeddings is a more useful sentence embedding than the first [CLS] embedding, and that the embeddings in the lower layers are important as well as those in the last layer. Therefore, we define the sentence embedding and the n-gram embeddings as the average pooling of the token representations. The token representations are concatenations of the frozen BERT input and output, so that both the context-free and contextualized meanings are considered.
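A rough sketch of this key construction, with toy per-token vectors standing in for the real frozen BERT input and output embeddings (the function names are ours):

```python
def token_representations(input_embs, output_embs):
    """Concatenate the frozen BERT input (context-free) and output (contextualized)
    vectors per token."""
    return [list(ie) + list(oe) for ie, oe in zip(input_embs, output_embs)]

def average_pool(reps, span=None):
    """Average pooling over all tokens (sentence key) or over a token span (n-gram key)."""
    idx = list(range(len(reps))) if span is None else list(span)
    dim = len(reps[0])
    return [sum(reps[i][d] for i in idx) / len(idx) for d in range(dim)]
```

The sentence key is `average_pool(reps)`; an n-gram key is `average_pool(reps, span=...)` over the n-gram's positions.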
To select only entity-like n-grams as keys, we remove n-grams that contain stop words or no capital letters. In addition, we use string matching for filtering: the UKBs used at training time hold only the knowledge that includes n-grams appearing in the training data, and the UKBs used at inference on the development (test) data hold the knowledge that includes n-grams appearing in the training or development (test) data. Instead of string matching, we can use summarization-based filtering for the n-gram keys, as detailed in Appendix C: we formulate the extraction of a fixed number of representative n-grams from a sequence as extractive summarization with a submodular objective (Lin and Bilmes, 2011), so that the greedy algorithm has a (1 − 1/e) approximation guarantee.
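The entity-likeness test can be sketched as follows (the stop-word list here is an illustrative subset, not the one used in the paper):

```python
STOP_WORDS = {"the", "of", "and", "in", "a", "an"}  # illustrative subset only

def is_entity_like(ngram):
    """Keep an n-gram only if it contains no stop words and at least one
    capitalized token, following the filtering rule described above."""
    tokens = ngram.split()
    if any(t.lower() in STOP_WORDS for t in tokens):
        return False
    return any(t[:1].isupper() for t in tokens)
```

For example, "House Freedoms" passes, while "the House" (stop word) and "house freedoms" (no capital) are removed.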
Following Wang et al. (2022), we use the labeled UKB even in training to reduce the training-test discrepancy; in this case, the model does not retrieve the input itself from the labeled UKB.

Encoder
We use BERT (Devlin et al., 2019) and a linear classifier with a softmax activation as the encoder f. Figure 2 shows the encoder structure. To represent the label information from the labeled knowledge base in the model, we provide additional token-type embeddings. Though the token type is always zero in the conventional BERT model for NER, we use 2C + 3 token-type IDs: in the concatenated text X+, the token type is 0 for tokens of the input X, 1 for tokens of unlabeled knowledge, and l_i + 2 for the i-th token of labeled knowledge, where l_i ∈ {0, ..., 2C} is the label of the labeled knowledge.

Figure 2: Overview of our self-adaptive NER with knowledge retrieval from UKBs, which store text with n-gram and sentence embeddings as keys. The labeled UKB has text with labels encoded as token-type embeddings. The queries are embeddings of unconfident entities and the input. We use a sparse matrix in the self-attention modules in BERT.
In the self-attention module, we use a sparse attention technique to reduce the space and time complexity from O(m²L²) to O(mL²). As shown in Figure 2, we mask the inter-knowledge interaction. Let k be a function that returns the sentence id: 0 if the token belongs to the input X and 1, ..., m if it belongs to the corresponding piece of knowledge. Accordingly, the attention matrix before the softmax operation is

A_ij = Q_i K_j^T / √d_k   if k(i) = 0, k(j) = 0, or k(i) = k(j);   A_ij = −∞ otherwise,

where i, j are the token indices, d_k is the number of dimensions of the attention head, and Q, K ∈ R^{(m+1)L×d_k} are the query and key matrices.
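A minimal sketch of this masking rule, assuming the input occupies the first L positions and each knowledge piece the next L (the function names and the additive −∞ mask convention are ours):

```python
NEG_INF = float("-inf")

def sentence_id(pos, L):
    """The function k in the text: 0 for positions in the input X,
    1..m for positions in each retrieved knowledge piece."""
    return pos // L

def attention_mask(L, m):
    """(m+1)L x (m+1)L additive mask: 0.0 where attention is allowed,
    -inf where the inter-knowledge interaction is masked out."""
    n = (m + 1) * L
    mask = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            ki, kj = sentence_id(i, L), sentence_id(j, L)
            if not (ki == 0 or kj == 0 or ki == kj):
                mask[i][j] = NEG_INF  # knowledge piece ki may not attend to piece kj
    return mask
```

Only the input-to-knowledge and intra-sequence blocks remain unmasked, which is what makes the cost linear in m.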

Two-stage Tagging of Self-Adaptive NER
SA-NER performs two-stage tagging, i.e., the calculation of P = f(X) and the calculation of P+ = f(X+). The purpose of the first stage is to find the entities that require additional information and to obtain queries for knowledge retrieval. The second stage refines the labels with the retrieved knowledge. The motivation behind this design is to retrieve useful entity-wise knowledge to disambiguate individual tokens in NER. We predict the entity spans for entity-level retrieval and use only the unconfident entities as the entity-based queries in order to exclude unnecessary knowledge from the retrieved results. The pseudo-code of the model is listed in Algorithm 1. We obtain the classification probabilities of the given text, P = f(X) ∈ R^{L×(2C+1)}, or those of the concatenated text, P+ = f(X+), for which the vectors after position L are ignored. The model parameters are shared between the two stages.

Algorithm 1 Two-stage self-adaptive NER
Require: input X, KBs, hyperparameters m, λ_conf
1: Predict probability P = f(X)
2: Compute confidence score c_e = min_{i ∈ I_e} P_{i,ŷ_i} for each predicted entity e ∈ E with span I_e
3: Obtain unconfident entities U = {e | e ∈ E, c_e < λ_conf}
4: Add the sentence and unconfident-entity embeddings to the queries Q
5: Initialize the retrieval results R = ∅
6: for query q_i in the queries Q do
7:   Retrieve the m nearest-neighbor keys for q_i from the KBs
8:   Store their values with the distances in R
9: end for
10: Deduplicate R to obtain the top-m knowledge K_1^m
11: Output probabilities P and P+ = f(X+)
First Stage To collect the unconfident entities U in X, we feed X to the model and obtain the classification probability P ∈ R^{L×(2C+1)}. Then, we extract the entities E from X in accordance with the predicted labels ŷ = argmax_c P_{·c}. The confidence score of a predicted entity e is c_e = min_{i ∈ I_e} P_{i,ŷ_i}, where I_e is the span of e ∈ E.
If the type predictions are inconsistent within an entity (e.g., [B-LOC, I-PER]), we set c_e = 0. We collect the unconfident entities U ⊆ E whose confidence scores are less than a threshold λ_conf. Then, we obtain the queries, which are the sentence and entity embeddings. The sentence embedding is the average pooling over all token embeddings. Each unconfident entity u ∈ U has multiple entity embeddings: the average-pooled vectors of the n-grams that share at least one token with u. These n-grams are filtered in the same way as in the UKB construction (§4.1). E denotes the number of entity embeddings (i.e., embeddings of n-grams overlapping with some u ∈ U). Each token embedding is a concatenation of the BERT input and output.
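The confidence computation over predicted BIO labels can be sketched as follows (a simplified illustration; probs holds the probability of the predicted label at each token, and the function name and threshold value are ours):

```python
def unconfident_entities(labels, probs, threshold=0.9):
    """Return spans (start, end) of predicted entities whose confidence
    c_e = min_i P[i, y_i] is below the threshold. Spans with inconsistent
    type predictions (e.g. B-LOC followed by I-PER) get confidence 0."""
    spans, i, n = [], 0, len(labels)
    while i < n:
        if labels[i].startswith("B-"):
            typ, j, consistent = labels[i][2:], i + 1, True
            while j < n and labels[j].startswith("I-"):
                consistent = consistent and labels[j][2:] == typ
                j += 1
            c_e = min(probs[i:j]) if consistent else 0.0
            if c_e < threshold:
                spans.append((i, j))
            i = j
        else:
            i += 1
    return spans
```

For instance, a confident [B-LOC, I-LOC] span is kept out of the query set, while an inconsistent [B-PER, I-ORG] span gets confidence 0 and is retrieved for.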
Note that we only consider sentence-to-sentence and entity-to-n-gram matching. We retrieve the top-m nearest neighbors of the sentence embedding from the sentence embeddings in the UKBs and those of the entity embeddings from the n-gram embeddings. Then, we select the top-m nearest knowledge from the collected 2(E + 1)m pieces of knowledge, deduplicating knowledge sequences by keeping the one with the minimum distance.
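The deduplication step can be sketched as follows (a toy illustration with hypothetical knowledge ids and distances; in the paper the 2(E + 1)m candidates come from nearest-neighbor search over the UKB keys):

```python
def select_top_m(results, m):
    """From (knowledge_id, distance) pairs collected over all queries, keep the
    minimum distance per knowledge piece and return the m nearest ids."""
    best = {}
    for kid, dist in results:
        if kid not in best or dist < best[kid]:
            best[kid] = dist
    return [kid for kid, _ in sorted(best.items(), key=lambda kv: kv[1])[:m]]
```

A piece of knowledge retrieved by several queries is counted once, at its best distance.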
Second Stage We concatenate the knowledge K_1^m to the input X and obtain the classification probabilities P+ = f(X+). Finally, the model outputs BIO labels in accordance with P for the tokens in the confident entities and in accordance with P+ for the other tokens.

Training
To train our two-stage SA-NER, we utilize supervision on the training data to refine unconfident entities and design the loss function.

Unconfident Entity Collection
In the training phase, we add the misclassified entities, i.e., those whose predictions are incorrect, to the unconfident entities U described in §4.3.

Loss Function
We use two cross-entropy losses, L 1 for the model prediction without knowledge (the first step) and L 2 for the model prediction with knowledge (the second step). The total loss function is L 2 + λ 1 L 1 , where λ 1 is a hyperparameter.
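A minimal numeric sketch of this combined loss (the function names are ours, and the λ1 value in the usage is illustrative, not the paper's setting):

```python
import math

def cross_entropy(probs, gold):
    """Mean token-level cross-entropy; probs[i] is the class distribution at token i."""
    return -sum(math.log(p[g]) for p, g in zip(probs, gold)) / len(gold)

def total_loss(p_first, p_second, gold, lambda_1=0.5):
    """Total loss L = L2 + lambda_1 * L1: L2 on the with-knowledge prediction
    (second stage), L1 on the without-knowledge prediction (first stage)."""
    return cross_entropy(p_second, gold) + lambda_1 * cross_entropy(p_first, gold)
```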

Pre-training
As is done in retrieval-augmented language models for ODQA (Guu et al., 2020; Borgeaud et al., 2021), we add a retrieval-augmented pre-training stage before the fine-tuning. We propose two methods for NER-aware retrieval-augmented pre-training. The first method uses a general-domain NER dataset, CoNLL03 (Tjong Kim Sang and De Meulder, 2003). The model is pre-trained with the method described above (§4.1–§4.4).
The second method involves large-scale self-supervised pre-training following NERBERT (Liu et al., 2021a). Although the UKB in SA-NER and the pre-training data overlap in some cases, SA-NER can use the knowledge effectively by referring to it at inference time.
NERBERT The pre-training corpus is Wikipedia. If consecutive words in the corpus share a hyperlink, the words are labeled as an entity. We categorize such entities with the DBpedia Ontology (Mendes et al., 2012): if an entity exists in the ontology, we categorize it as its type; if it does not exist or belongs to multiple types, we categorize it as the special "ENTITY" type.
We split the corpus into fixed-length token sequences and extract the sequences with tokens labeled with the DBpedia types. We reduce the proportion of "ENTITY" labels by using filtering rules and downsampling. The resulting dataset has 33M examples, 939M tokens, and 404 types.
We add a final linear layer with a trainable parameter W pre ∈ R d×(2Cpre+1) to the top of BERT, where d is the hidden size of BERT and C pre is the number of types. Before fine-tuning, the final layer is replaced with a randomly initialized linear layer whose output dimension is determined by the downstream task. Refer to Appendix B and the original paper (Liu et al., 2021a) for details.

Knowledge Retrieval
We use the SA-NER model in the pre-training to reduce the pre-training and fine-tuning discrepancy. We use the pre-training data itself as the UKBs. We retrieve knowledge with its pseudo-labels from the data as labeled knowledge and randomly delete the pseudo-labels to make the knowledge unlabeled. We set the deletion probability to 0.95 to simulate downstream tasks where the unlabeled UKB is larger than the labeled UKB. For efficiency, we use Wikipedia hyperlinks as the keys and queries of the retrieval. Instead of the two-stage prediction, we sample m pieces of knowledge that include an entity in the original input.

Evaluation
We conducted experiments on three NER datasets to evaluate the effectiveness of our self-adaptive NER with unstructured knowledge. We used entity-level F1 as the metric, following the literature.

Dataset
CrossNER (Liu et al., 2021b) is a benchmark of domain-specific NER datasets. The label sets differ among the domains.
Finance (Salinas Alvarado et al., 2015) is a medium-scale NER dataset collected from U.S. SEC filings. We used the Wikipedia articles in the finance domain as the textual corpus D to construct the unlabeled UKB. The label set is person, organization, location, and miscellaneous.
CoNLL03 (Tjong Kim Sang and De Meulder, 2003) is a widely used large-scale NER dataset collected from Reuters news stories between August 1996 and August 1997. We used the Reuters-21578 text classification dataset (Lewis, 1997), which was collected from Reuters in 1987, as D. The label set is the same as that of Finance.

Compared Models
Our text encoder and tokenizer were the pre-trained BERT-base-cased model (Devlin et al., 2019) or DistilBERT-base-cased model (Sanh et al., 2019). All experiments used the hyperparameters determined on the development set of CrossNER-Politics; refer to Appendix A.
We pre-trained the compared models on the CoNLL03 or NERBERT (Liu et al., 2021a) datasets before fine-tuning. (We report the effect of overlapping entities between the pre-training data and the CrossNER dataset on performance in Appendix D.) In addition to the BERT model (i.e., BERT with CoNLL03 or NERBERT pre-training), we implemented the NER version of REALM (REALM-NER). For REALM-NER, we replaced the retrieval-augmented MLM of REALM with our retrieval-augmented pre-training methods tailored for NER to assess the effectiveness of our knowledge retrieval. Also, we set m = 1, removed the entity-level retrieval, and ignored the labeled UKB. We cited the results of the previous models: BERT, NERBERT, and DAPT (Gururangan et al., 2020), which is the domain-adapted BERT baseline. We compared our model with models consisting of BERT and a linear classifier because the classifier architecture is out of the scope of our study. The improvement is typically larger in a lower-resource domain with more types, because per-type supervision is limited in such cases.

Main Results
Does self-adaptive NER improve the performance of the NER-aware pre-training? SA-NER outperformed BERT with CoNLL03 and NERBERT pre-training. This indicates that self-adaptation using unstructured knowledge at inference time obtains additional knowledge that is not stored in the model, even though the model has seen the unstructured knowledge in pre-training. Moreover, because we can grow the unlabeled UKB after pre-training, the model can acquire new knowledge more efficiently than by conducting additional pre-training.
Does self-adaptive NER improve the performance of the retrieval-augmented LM baseline? SA-NER outperformed REALM-NER. SA-NER retrieves knowledge with entity-level retrieval from the labeled and unlabeled UKBs and encodes many pieces of knowledge thanks to the sparse attention. These techniques improved the usefulness of the knowledge for NER. The contributions of each component are discussed in the ablation studies. We also found that REALM-NER tends to underperform in the settings with more than 1,000 training examples. Because REALM-NER retrieves knowledge with only the sentence-level query, knowledge retrieval is not always useful in those settings.

Ablation Studies

Table 3 shows the results of the ablation studies. We used the best-performing SA-NER with NERBERT pre-training as the full model. We found that all components of SA-NER improved performance.

Does the entity-level retrieval improve performance? First, we confirmed the usefulness of self-adaptive knowledge retrieval: knowledge retrieval based on the model's entity prediction is more useful for NER than conventional sentence-level retrieval (∆1.12 vs. ∆0.79). Also, we found that both knowledge retrievals improve NER performance.
Does the distinction based on confidence improve performance? Second, we investigated the efficacy of distinguishing the predicted entities in terms of confidence. The model retrieves knowledge about the unconfident entities U = {e | e ∈ E, c_e < λ_conf} and then refines the prediction for only the unconfident entities with the retrieved knowledge. We set λ_conf > 1 to remove the distinction. We observed that ignoring confident entities in creating queries is slightly effective (∆0.42), because it restricts the retrieval results to knowledge informative for NER. We then used the second-stage prediction for all tokens and found that reusing the first-stage prediction for confident entities improved performance slightly (∆0.36). Using the first-stage prediction is important for confident entities because the retrieved knowledge is likely to be irrelevant to them. We consider that making the distinction is more useful in smaller-m settings where the amount of knowledge is limited.
Do the labeled and unlabeled UKBs improve the performance? Finally, we confirmed that both the labeled and unlabeled UKBs are important (∆1.10 and ∆0.51). The unlabeled UKB covers various contexts, and the labeled UKB has supervision. The two types of UKB have different roles in helping the model recognize entities.

Discussion
Does the performance of our model depend on the amount of knowledge? Figure 3 plots the F1 score versus the amount of knowledge m. We can see that more pieces of knowledge led to higher F1 scores. Because the time and space complexity of the sparse attention is linear in the number of pieces of knowledge, the sparse attention is suitable for large m. However, the dense attention did not improve performance in the case of large m. We consider that the sparse attention represents the intra- and inter-sequence interactions more effectively than the naive dense attention can.

What types of entity require external knowledge? Table 4 lists the results for when the target entities were restricted to each type, defined in terms of whether the supervision of an entity was included in the training and pre-training data. The proposed model outperformed NERBERT on all types. The improvement was 1.15 points for the "seen in training" type and 1.64 points for the "unseen in training" type. Therefore, self-adaptation has an effect regardless of whether or not the entity exists in the training data; we also observed this effect in the ablation studies.
Regarding the "unseen in pre-training" type, the proposed model improved performance by 3.28 points. The pre-training dataset collected from Wikipedia shares many entities with the CrossNER dataset, which was also created from Wikipedia, and thus whether the tokens are labeled as entities in the pre-training dataset (i.e., whether the tokens have Wikipedia hyperlinks) has a large effect on performance. We confirmed that the pre-training data is more valuable than one might think, similarly to the finding of Wang et al. (2022) that referring to the training data at inference time is worthwhile.
Is the self-adaptive NER sensitive to the unconfidence threshold? To investigate the sensitivity of SA-NER to this hyperparameter, we set λ_conf to various values at inference time after training the model with λ_conf = 0.9. Table 5 shows the results. The performance is comparable for λ_conf ∈ [0.8, 0.95]; therefore, SA-NER is not sensitive to λ_conf. We also confirmed that modifying the prediction of high-confidence entities is harmful (λ_conf = 1) and thus using λ_conf is useful. Moreover, we observed that modifying the prediction of certain entities (3.6% of the total number) is important. These are entities in which the token-level predictions were inconsistent, whose confidence c_e was set to 0.
Does the self-adaptive NER depend on the filtering method of the n-grams? We compared the two filtering methods for the n-gram embeddings in the UKB. The string-matching method used the information of the n-grams appearing in the training or development (test) split in the evaluation on the development (test) set. The summarization-based method simply set the maximum number of n-grams in each piece of knowledge. Table 6 shows the results. Both methods outperformed the no-knowledge baseline (NERBERT) and the ablated model without the entity-level knowledge retrieval. The summarization-based filtering requires fewer assumptions and is computationally efficient, although it is less accurate.

Table 7 shows example predictions of our model. The first example is a case in which the self-adaptation improved the model prediction. The original input itself does not provide evidence that the House of Freedoms is a political party, but the knowledge provides this evidence by mentioning it in the context of an election. The second example illustrates the most common fault in the political domain. Because of the imbalance between the training labels of person and politician, person entities tend to be misclassified as politician entities. Although both the input and the knowledge indicate that Bob Weinstein is not a politician, the model made the wrong prediction.

Conclusions
We proposed SA-NER, which is designed for NER to retrieve knowledge from the labeled and unlabeled UKBs by using unconfident entities and given inputs as queries. It encodes many pieces of knowledge efficiently with sparse attention. In experiments, SA-NER outperformed DistilBERT and BERT baselines pre-trained on the CoNLL03 and NERBERT datasets by 1.22 to 2.35 points. We found that the entity-level retrieval, the focus on the unconfident entities, the labeled and unlabeled UKBs, and the large m that is enabled by the sparse attention all contribute to SA-NER's performance.
We believe that SA-NER can help application providers to develop NER services in their target domain with domain-specific entity types they have defined, even if they do not have a sufficiently large annotated dataset.

Limitations
SA-NER would be of benefit to low-resource domains and languages. However, for languages that have no word segmentation, such as Chinese, the method of constructing the UKB based on n-grams and capitalization may not be suitable. For such languages, we can use a traditional word segmenter and POS tagger to extract entity-like n-grams. Although we did not conduct any such data preprocessing in our experiments, it may also be useful for English.

B NERBERT Pre-training Details

We removed the sequences without entities that were not labeled as "ENTITY." We reduced the proportion of "ENTITY" labels by using filtering rules and downsampling. We randomly filtered the sentences to reduce these labels: if all entities in a sentence were among the top-20 frequent labels, the sentence was randomly removed from the dataset with probability 30% if the number of "ENTITY" entities was three, 50% if the number was four, and 70% if the number was more than four. In the pre-training, we used weighted sampling. The sampling weight of a sentence was min_{0≤i<L} |E_{c_i}|^{−0.3}, where E_c is the number of entities of type c in the dataset and c_i is the type of the i-th token. As a result, the final dataset had 33M examples, 939M tokens, and 404 types. With the exception of the loss function, the initialization, and the use of the retrieval-augmented model, we followed the procedure of the NERBERT pre-training algorithm.
Loss Function In addition to the cross-entropy loss used in the original NERBERT, we incorporated a multi-task loss to efficiently learn the NER ability by ignoring the very frequent "ENTITY" type in entity typing. For entity extraction, we performed a three-class classification task: we summed the output probabilities of the final linear layer after the softmax activation to obtain the probabilities of "B-[type]", "I-[type]", and "O." For entity typing, we masked the output logits of the final linear layer corresponding to the "ENTITY" label and performed a (2C_pre − 1)-way classification task. The total loss was the sum of the two cross-entropy losses.
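The probability-summing step for entity extraction can be sketched as follows (a toy illustration; the label ordering [O, B-t1, I-t1, ..., B-tC, I-tC] is our assumption, not stated in the paper):

```python
def extraction_probs(label_probs):
    """Collapse the 2C+1 BIO-label probabilities into three classes (B, I, O)
    by summing, assuming label order [O, B-t1, I-t1, ..., B-tC, I-tC]."""
    return [sum(label_probs[1::2]),  # all B-* labels
            sum(label_probs[2::2]),  # all I-* labels
            label_probs[0]]          # O
```

Because the inputs are softmax outputs, the three summed probabilities again form a valid distribution.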
Initialization We had to initialize the weights of the final linear layer and the token-type embeddings because of the mismatch between the label sets of the downstream and pre-training tasks. Instead of a random initialization from N(0, σ_0), where σ_0 ∈ R is a fixed standard deviation, we used the learned distribution N(μ, σ), where μ, σ ∈ R^d are the bias and the standard deviation of the weights of the final linear layer and the token-type embeddings in the pre-trained model.

C Summarization-Based Filtering
To assign n-gram keys to each piece of knowledge, we removed n-grams that contained stop words or no capital letters, so as to collect entity-like n-grams. In addition, we used filtering methods based on string matching and extractive summarization. The summarization-based filtering enabled us to limit the number of n-grams in each piece of knowledge.
We formulated the extraction of a fixed number of representative n-grams from a sequence as an extractive summarization task, as follows. Let h_i be the n-gram embedding whose start position is i, regardless of whether the n-gram is filtered out or not, and let S ∈ R^{L×L} be the cosine similarity matrix of the h_i (0 ≤ i < L). We denote by {I_s} the set of maximal token spans that do not include stop words but include a capital letter; we should extract n-grams from different spans to increase the diversity of the n-grams.
We defined the optimization problem as follows. Z ⊆ {0, 1, ..., L − 1} denotes the set of selected n-gram start positions. We used a submodular function as the objective to be maximized under the constraint |Z| ≤ N_max (Lin and Bilmes, 2011):

L(Z) = L_cov(Z) + λ_div L_div(Z),
L_cov(Z) = Σ_{i=0}^{L−1} min( Σ_{j∈Z} S_ij, α Σ_{j=0}^{L−1} S_ij ),
L_div(Z) = Σ_s √|Z ∩ I_s|.
The hyperparameters are α = 0.1, λ_div = 10, and N_max = 3. We also required Z to meet the filtering condition (that is, the inclusion of a capital letter and no stop words). L_cov(Z) measures the coverage of the n-grams, and L_div(Z) measures their diversity. Because this objective function is submodular, the greedy algorithm has a (1 − 1/e) approximation guarantee. Therefore, we can use a lightweight computation to extract the most important n-grams.
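The greedy selection itself can be sketched generically as follows (the function names are ours; the usage example maximizes a simple set-coverage objective rather than the paper's L_cov + λ_div L_div):

```python
def greedy_submodular(candidates, objective, n_max):
    """Greedy maximization of a monotone submodular objective under |Z| <= n_max;
    Lin and Bilmes (2011) give this a (1 - 1/e) approximation guarantee."""
    selected = set()
    while len(selected) < n_max:
        best, best_gain = None, 0.0
        for c in candidates - selected:
            gain = objective(selected | {c}) - objective(selected)
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:  # no remaining candidate adds value
            break
        selected.add(best)
    return selected
```

Usage with a toy coverage objective over candidate sets:

```python
cands = {frozenset({1, 2, 3}), frozenset({3, 4, 5, 6}), frozenset({6})}
cover = lambda Z: len(set().union(*Z)) if Z else 0
selected = greedy_submodular(cands, cover, n_max=2)
```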

D Effect of Overlapping Entities
To confirm that the effectiveness of NERBERT is not due to the overlapping entities in the pre-training and fine-tuning dataset, we conducted experiments where we removed sequences including the entities that appeared in the CrossNER dataset from the NERBERT corpus. Table 10 shows the results. We confirmed that the NER ability learned from the NERBERT corpus itself improved performance and SA-NER outperformed NERBERT in both settings.
However, we also found that the performance of NERBERT is overestimated because of the entity overlap. Brown et al. (2020) and Dodge et al. (2021) also noted that leakage of benchmark datasets into the pre-training corpus affects the performance of GPT-3 (Brown et al., 2020) and T5 (Raffel et al., 2020). The community should address this problem in the future.