Exploring Partial Knowledge Base Inference in Biomedical Entity Linking

Biomedical entity linking (EL) consists of named entity recognition (NER) and named entity disambiguation (NED). EL models are trained on corpora labeled with a predefined KB. However, it is a common scenario that only entities within a subset of the KB are of interest to stakeholders. We name this scenario partial knowledge base inference: training an EL model with one KB and inferring on a part of it without further training. In this work, we give a detailed definition and evaluation procedures for this practically valuable but significantly understudied scenario, and evaluate methods from three representative EL paradigms. We construct partial KB inference benchmarks and witness a catastrophic degradation in EL performance due to a dramatic precision drop. Our findings reveal that these EL paradigms cannot correctly handle unlinkable mentions (NIL), so they are not robust to partial KB inference. We also propose two simple-and-effective redemption methods to combat the NIL issue with little computational overhead.


Introduction
Biomedical entity linking (EL) aims to identify entity mentions in biomedical free texts and link them to a pre-defined knowledge base (KB, e.g., UMLS (Bodenreider, 2004)), which is an essential step for various tasks in biomedical language understanding, including relation extraction (Li et al., 2016; Lin et al., 2020b; Hiai et al., 2021) and question answering (Jin et al., 2022).
EL naturally contains two subtasks: named entity recognition (NER) and named entity disambiguation (NED). NER is designed for mention detection, while NED aims to find the best-matching entities in the KB. One direct way to perform EL is executing NER and NED sequentially (Liu et al., 2020; Zhang et al., 2021a; Yuan et al., 2022b).
Neural NER and NED models are usually trained on corpora labeled with a KB. However, potential users of biomedical EL, including doctors, patients, and developers of knowledge graphs (KGs), may only be interested in entities inside a subset of the KB, such as SNOMED-CT (Donnelly et al., 2006), one semantic type of entities in UMLS, or a KB customized by a medical institution. Besides, doctors from different medical institutions use different terminology sets: some hospitals use ICD-10, while others still use ICD-9 or even custom terminology sets. Patients are only interested in specific diseases, symptoms, and drugs. Developers of KGs may need to build a KG for specific diseases like diabetes (Chang et al., 2021) and COVID-19 (Reese et al., 2021), or for particular relation types like drug-drug interaction (Lin et al., 2020a). All the scenarios above require EL inference with a partial KB. Off-the-shelf models trained on a comprehensive KB will extract mentions linked to entities outside the users' KB. Although retraining models on users' KBs can achieve satisfactory performance, it is not feasible in most scenarios because users can have significantly different KBs and may lack the computational resources to finetune large-scale models. Therefore, we propose a scenario focusing on inference with a partial KB. We name this scenario partial knowledge base inference: train an EL model with one KB and infer on part of this KB without further training. Fig. 1 provides a case of this scenario. This scenario is widely faced in the medical industry but remains understudied.
This work reviews and evaluates current state-of-the-art EL methods under the partial KB inference scenario. Specifically, we evaluate three paradigms: (1) NER-NED (Yuan et al., 2021, 2022c), (2) NED-NER (Zhang et al., 2022), and (3) simultaneous generation (Cao et al., 2021a). The first two paradigms are pipeline methods that differ in the order of NER and NED. The last paradigm is an end-to-end method that generates mentions and corresponding concepts with language models. We construct partial KB inference datasets based on two widely used biomedical EL datasets: BC5CDR (Li et al., 2016) and MedMentions (Mohan and Li, 2019). Our experimental findings reveal the different implicit mechanisms and performance bottlenecks within each paradigm, showing that partial KB inference is challenging.
We also propose two redemption methods based on our findings, post-pruning and thresholding, to help models improve partial KB inference performance effortlessly. Post-pruning infers with the large KB and removes entities that are in the large KB but not in the partial KB. Post-pruning is effective but memory-unfriendly, as it stores embeddings of all entities in the large KB. Thresholding removes entities with scores below a threshold. Both redemption methods are designed to reduce the impact of NIL entities and boost EL performance. To the best of our knowledge, this is the first work that studies partial KB inference in biomedical EL. Our main contributions are the following:
• We extensively investigate partial KB inference in biomedical EL. We give a detailed definition, evaluation procedures, and open-source curated datasets.
• Experimental results show that the NED-NER paradigm is more robust to partial KB inference, while the other paradigms suffer from sharp degradation caused by NIL.
• We propose two redemption techniques to address the NIL issue with little computational overhead for better partial KB inference.
Related Work

NER and NED In biomedical and general domains, NER and NED are two extensively studied sub-fields of NLP. As mentioned, EL can be decomposed into and approached by NER and NED. NER is often considered a sequence labeling task (Lample et al., 2016). Neural encoders like LSTMs (Gridach, 2017; Habibi et al., 2017; Cho and Lee, 2019) or pretrained language models (Weber et al., 2021) encode input text and assign BIO/BIOES tags to each word. Many biomedical pretrained language models have been proposed to enhance NER performance (Beltagy et al., 2019; Peng et al., 2019; Lee et al., 2020; Gu et al., 2021; Yuan et al., 2021). Concerning NED, most methods embed mentions and concepts into a common dense space with language models and disambiguate mentions by nearest neighbor search (Bhowmik et al., 2021; Ujiie et al., 2021a; Lai et al., 2021; Agarwal et al., 2021). In this work, we further explore partial KB inference by analyzing performance in these two steps and reveal how the design and order of NER and NED affect EL performance in partial KB inference.
Entity Linking Although EL can be handled by a direct pipeline of NER and NED, there is limited research treating the task as a whole in the biomedical domain. As EL may enjoy mutual benefits from the supervision of both subtasks, Zhao et al. (2019) deal with biomedical EL in a multi-task setting of NER and NED. MedLinker (Loureiro and Jorge, 2020) and Ujiie et al. (2021b) approach biomedical EL by sequentially dealing with NER and NED using a shared language model, and they devise a dictionary-matching mechanism to deal with concepts absent from the training annotations.
In the general domain, GENRE (Cao et al., 2021a,b) formulates EL as a seq2seq task: it detects and disambiguates mentions with constrained language generation in an end-to-end fashion. We categorize GENRE as simultaneous generation. EntQA (Zhang et al., 2022) provides a novel framework that first finds probable concepts in the text and then treats each retrieved concept as a query to detect corresponding mentions in a question-answering fashion, which is categorized as NED-NER in our framework. The simultaneous-generation and NED-NER paradigms have not been widely examined in biomedical EL, which motivates us to examine their performance for biomedical EL and partial KB inference.
Partial KB inference in EL In the biomedical domain, there is no prior work considering this setting to the best of our knowledge. NILINKER (Ruas and Couto, 2022) is the most related work; it focuses on linking NIL entities outside the training KB, while ours aims to infer EL on part of the training KB and discard NIL entities.

Problem Definition
Entity Linking Let E denote a target KB comprising a set of biomedical concepts. Given a text s with length n, an EL model aims to find the mentions m and corresponding concepts e ∈ E. Concretely, the model can be regarded as a mapping f : s → P_E, where P_E = {(i, j, e) | 0 ≤ i ≤ j ≤ n, e ∈ E} denotes the possible target mention-concept pairs, and i, j mark the start and end positions of the mention spans in s.

Partial KB inference
In the conventional EL scenario, the target KB is the same during training and inference. In this paper, we consider a partial KB inference scenario containing two different KBs, E_1 and E_2, with E_1 ⊋ E_2. The larger KB E_1 is the training KB, while the smaller KB E_2 is the partial inference KB. Models are required to map a text s to a different label set P_{E_2} during inference, rather than P_{E_1} as during training, and we have P_{E_1} ⊋ P_{E_2}. There thus exists a label distribution shift in this scenario. We investigate whether current entity linking models are robust to partial KB inference and how they perform under the shifted target distribution.
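Concretely, the shifted label set can be derived by filtering gold annotations to the partial KB. A minimal sketch (the function and variable names are ours, not from the paper; the concept ids are illustrative):

```python
# Build the partial-KB label set P_{E_2} from P_{E_1} by keeping only
# the mention-concept triples whose concept survives in the partial KB.
def restrict_labels(p_e1, e2):
    """p_e1: set of (start, end, concept_id) triples; e2: set of concept ids in E_2."""
    return {(i, j, e) for (i, j, e) in p_e1 if e in e2}

p_e1 = {(0, 3, "D003924"), (10, 15, "D012345"), (20, 24, "C537710")}
e2 = {"D003924", "C537710"}  # hypothetical partial KB, e.g. MEDIC-style ids
assert restrict_labels(p_e1, e2) == {(0, 3, "D003924"), (20, 24, "C537710")}
```

Mentions whose concepts are filtered out this way are exactly the NIL mentions the paper discusses.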

Experiments
In this section, we introduce our experimental setup, which includes implementation details of the EL methods we investigate (§4.1) and the datasets we create for investigating partial KB inference (§4.2).

Direct Partial KB Inference
There are three widely used paradigms for entity linking: (1) NER-NED; (2) NED-NER; (3) Simultaneous Generation. We introduce representative methods for each paradigm and how they are accommodated to partial KB inference with minimal change. Note that these paradigms are not aware of the full KB E_1 during partial KB inference. The top subgraph in Fig. 2 depicts an overview of the three paradigms. We also describe how to directly apply these methods to partial KB inference, which corresponds to the Direct inference method in Fig. 2. Hyper-parameters for experiments are reported in Appx. §A.

NER-NED
A straightforward solution for entity linking is a two-phase paradigm that first detects entity mentions with an NER model and then disambiguates the mentions to concepts in the KB with an NED model, shown in the left top subgraph of Fig. 2. We finetune a pre-trained biomedical language model for token classification as the NER model in this paradigm. Specifically, we use KeBioLM (Yuan et al., 2021) as our language model backbone. We use CODER (Yuan et al., 2022b) as our NED model, a self-supervised biomedical entity normalizer pre-trained on UMLS synonyms with contrastive learning. CODER disambiguates mentions by encoding each concept synonym and each recognized mention into dense vectors and then finding the nearest concept neighbors of each mention vector by maximum inner product search (MIPS).
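The NED step can be sketched as a nearest-neighbor search over normalized concept embeddings restricted to an allowed KB. This is a simplified illustration, with random vectors standing in for CODER embeddings and hypothetical concept ids:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for CODER concept embeddings: one L2-normalized row per concept.
concept_ids = ["D003924", "D012345", "C537710"]
concept_emb = rng.normal(size=(3, 64))
concept_emb /= np.linalg.norm(concept_emb, axis=1, keepdims=True)

def link(mention_emb, allowed):
    """MIPS over the (partial) KB: only rows whose id is in `allowed` compete."""
    sub_ids = [cid for cid in concept_ids if cid in allowed]
    mask = [cid in allowed for cid in concept_ids]
    scores = concept_emb[mask] @ mention_emb  # inner products
    return sub_ids[int(np.argmax(scores))]

# A mention vector close to the second concept.
m = concept_emb[1] + 0.01 * rng.normal(size=64)
assert link(m, set(concept_ids)) == "D012345"        # full KB: correct concept
assert link(m, {"D003924", "C537710"}) != "D012345"  # partial KB: forced to a wrong concept
```

The last line shows the failure mode discussed below: when the true concept is removed from the search space, MIPS silently returns some other concept instead of NIL.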
In partial KB inference, although the NER model is not aware of the change in KB, the NED model only searches for the nearest concept within the partial KB. A smaller inference KB is challenging for the NED model: for a mention m and its corresponding concept e ∈ E_1, if e ∉ E_2, the NED model will return an incorrect or less accurate concept from E_2. Since users are only interested in concepts within E_2, such mentions m should be labeled as unlinkable (NIL).

NED-NER
NED-NER methods are also formulated as a two-phase pipeline, shown in the middle top subgraph of Fig. 2. This paradigm first retrieves the concepts mentioned in the text, then identifies mentions based on the retrieved concepts. It was proposed along with the method EntQA (Zhang et al., 2022). In the concept retrieval phase of EntQA, a retriever finds the top-K related concepts for a text by embedding both into a common dense space with a bi-encoder, then searches nearest neighbors for the text by MIPS within the partial KB E_2. This phase retrieves concepts from raw texts directly, and we view it as the NED phase.

Figure 2: Overview of three different entity linking paradigms and settings of partial KB inference. The top sub-graph demonstrates the three EL paradigms we investigate in this work (§4.1). The middle sub-graph shows the relation of the large training KB and the partial KB used in inference (§3). The bottom sub-graph shows two EL models obtained from full and partial training and three partial KB inference settings. Direct partial KB inference is the naive setting described in §4.1. Thresholding and post-pruning are two simple redemption methods we propose and describe in §5.2.
Following the original setting, we initialize the retriever from BLINK (Wu et al., 2019) checkpoints and further fine-tune the bi-encoder on our datasets with its contrastive loss function. In the following phase, a reader is trained to identify mentions in a question-answering fashion, where mentions and concepts correspond to answers and queries respectively. This phase is viewed as NER. In partial KB inference, only concepts from the partial KB are encoded into dense vectors for MIPS.

Simultaneous Generation
In the generative paradigm for entity linking, NER and NED are performed simultaneously, as shown in the right top subgraph of Fig. 2. Entity linking is modeled as a sequence-to-sequence (seq2seq) task where the model inserts special tokens and concept names into texts with a constrained decoding technique via a Trie. We follow the detailed model design of GENRE. Given an input text s, the target sequence is built as s_m^e = {M_B, x_i, . . . , x_j, M_E, E_B, e, E_E}, where x_i . . . x_j are the mention tokens in s, e is the token sequence of the concept name, and M_B, M_E, E_B, E_E are special tokens marking the beginning and end of mentions and concepts. The model is trained in seq2seq fashion by maximizing the log-likelihood of each token. During inference, a token prefix trie is built to constrain the model to only output concepts within the given KB. For partial KB inference, only concept names from the partial KB are added when building the prefix Trie in GENRE. This ensures all entity linking results refer only to the partial KB.
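The constrained decoding can be sketched with a token-level prefix trie that, given a decoded prefix, returns the set of allowed next tokens. This is a simplified sketch over word tokens and made-up concept names; GENRE's actual implementation operates on subword ids:

```python
# Minimal prefix trie over token sequences of concept names.
class Trie:
    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})
            node["<eos>"] = {}  # marks the end of a valid concept name

    def allowed_next(self, prefix):
        """Tokens the decoder may emit after `prefix`; empty set if invalid."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return set()
            node = node[tok]
        return set(node)

# Only concept names from the partial KB are inserted into the trie,
# so decoding can never produce an out-of-KB concept.
partial_kb = [("heart", "failure"), ("heart", "attack"), ("hypotension",)]
trie = Trie(partial_kb)
assert trie.allowed_next(()) == {"heart", "hypotension"}
assert trie.allowed_next(("heart",)) == {"failure", "attack"}
assert trie.allowed_next(("renin",)) == set()
```

At each decoding step, the model's output distribution is masked to the tokens returned by `allowed_next`, which is how the partial KB constraint is enforced.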

Datasets
We conduct experiments on two widely used biomedical EL datasets and select several partial KBs for inference. Selection biases of partial KBs may be introduced into our setting because different partial KBs may result in different target distributions of mention-concept annotations, which may lead to different difficulties in EL due to different KB sizes, entity semantics, and entity occurrence frequencies in the training set.
To eliminate this effect as much as possible, we evaluate not only on the partial KBs mentioned above but also on their complements with respect to the training KBs.
We add ∁ to indicate the complements. The detailed statistics of the datasets are listed in Tab. 6 of Appx. §B. BC5CDR (Li et al., 2016) is a dataset that annotates 1,500 PubMed abstracts with 4,409 chemical entities, 5,818 disease entities, and 3,116 chemical-disease interactions. All annotated mentions are linked to concepts in the target knowledge base MeSH. We use MeSH as the training KB and consider a smaller KB, MEDIC (Davis et al., 2012), as the partial KB for inference. MEDIC is a manually curated KB composed of 9,700 selected disease concepts mainly from MeSH.
MedMentions (Mohan and Li, 2019) is a large-scale biomedical entity linking dataset curated from annotated PubMed abstracts. We use the st21pv subset, which comprises 4,392 PubMed abstracts and over 350,000 annotated mentions linked to concepts of 21 selected semantic types in UMLS (Bodenreider, 2004). We use UMLS as the training KB and select three representative partial KBs: concepts from semantic types T038 (Biologic Function) and T058 (Health Care Activity) in UMLS, and SNOMED.

Results
In this section, we present the main results of partial KB inference (§5.1). Then, we provide two redemption methods for enhancing model performance in partial KB inference (§5.2). In the end, we discuss the factors related to the difficulties hindering partial KB inference performance (§5.3).

Main Results
EL Tab. 1 shows entity linking results under different partial KB settings. First of all, we witness a significant and consistent precision drop among all methods on MedMentions. EntQA has the smallest precision drop (5.36%), while GENRE and KeBioLM+CODER show more obvious decreases of 16.68% and 14.35%, respectively. In contrast, recalls on partial KBs remain the same or even slightly increase. KeBioLM+CODER shows the largest average recall increase (6.71%), followed by EntQA (2.13%), while the average recall of GENRE remains nearly unchanged (it drops only 0.11%). Due to its stability in precision, the average F1 change of EntQA is marginal (-0.7%). However, the average F1 of GENRE and KeBioLM+CODER drops significantly on partial KBs, by 12.04% and 9.96%. The same pattern appears on BC5CDR. EntQA shows extraordinary robustness in direct partial KB inference, in contrast to the degradation of GENRE and KeBioLM+CODER. For individual partial KBs, a consistent pattern of precision and F1 drops is observed for GENRE and KeBioLM+CODER, while EntQA is more robust than the others. The F1 degradation caused by the precision decrease reflects that the models detect redundant mentions that are linked outside the partial KBs.

These results reveal that models learn the mapping between related mentions and concepts and are not biased by the out-of-KB annotations. The shrunken concept space of partial KBs makes the disambiguation task easier and leads to performance improvement.
Conclusion We can conclude that (1) the NER-NED and generative frameworks are not robust to direct partial KB inference, while the NED-NER framework is more stable; (2) the degradation of entity linking performance mainly results from drastically degenerated mention detection performance on partial KBs, while entity disambiguation abilities are stable; (3) EntQA potentially handles NILs by filtering out irrelevant entities before NER, while the other methods suffer from low precision due to mislinking NILs to existing entities.

Simple Redemptions
In the former subsections, we identify that performance drops in partial KB inference are mainly due to precision drops in mention detection. We introduce two simple-yet-effective methods to redeem the performance drops in partial KB inference: post-pruning and thresholding, shown in Fig. 2, with an illustrative example in Appx. §C.1. Both methods are motivated by removing NIL mentions to improve mention detection performance.
Post-Pruning asks the model to infer using E_1 and then removes mention-entity pairs whose entities fall in E_1 − E_2, i.e., outside the partial KB. This redemption method is naive but requires knowing E_1.
Thresholding uses E_2 for inference. After obtaining mention-entity pairs, it searches for a fixed threshold θ on the development set that maximizes F1, and removes results with scores under the threshold. This method is not aware of E_1. Specifically, for KeBioLM+CODER, we set a threshold on the cosine similarity between each detected mention and its most similar concept, score = cos(h_m, h_e), where m represents the mention extracted by the NER model, e its most similar concept, and h represents embeddings.
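The two redemption methods amount to simple filters over predicted mention-concept pairs. A minimal sketch (the names, scores, and concept ids are ours, for illustration; the score is whatever the paradigm provides, e.g. the cosine similarity above):

```python
def post_prune(preds, e2):
    """Infer with the large KB E_1, then drop pairs whose concept is outside E_2."""
    return [(i, j, e, s) for (i, j, e, s) in preds if e in e2]

def threshold(preds, theta):
    """Infer with E_2 directly, then drop low-scoring pairs (likely NIL)."""
    return [(i, j, e, s) for (i, j, e, s) in preds if s >= theta]

preds = [(0, 3, "D003924", 0.92), (10, 15, "D012345", 0.55)]
assert post_prune(preds, {"D003924"}) == [(0, 3, "D003924", 0.92)]
assert threshold(preds, 0.8) == [(0, 3, "D003924", 0.92)]
```

Note the trade-off the paper describes: `post_prune` needs the membership of E_1 − E_2 (and in practice its embeddings or trie), while `threshold` only needs a score and a development set to tune θ.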
For EntQA, we obtain K entities from the retriever and compute the score for the k-th entity with mention start and end indices i, j as score = P_re(e_k | e_1:K) · P_st(i | e_k, s) · P_ed(j | e_k, s), where P_re computes the probability of e_k among all retrieved entities, and P_st and P_ed compute the probabilities that i and j are the start and end positions of the mention corresponding to e_k in text s. The original implementation of EntQA integrates thresholding during inference, so its partial KB inference is equivalent to inference with thresholding.
For GENRE, we use the log-likelihoods of the generated mention spans and concept names in the output sequence s_m^e = {M_B, x_i, . . . , x_j, M_E, E_B, e, E_E} as scores: score = (1 / |s_m^e|) Σ_{x ∈ s_m^e} log P_ar(x), where P_ar denotes a token's probability autoregressively conditioned on its preceding tokens. We compare the two methods with direct partial KB inference. We also include a setting where models are trained on the partial KB E_2 for comparison. We dub this 'in-domain' setting In-KB train.
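The length-normalized score above can be sketched as follows; the per-token probabilities are hypothetical stand-ins for the autoregressive model's outputs:

```python
import math

def sequence_score(token_probs):
    """Average log-probability over the tokens of s_m^e."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities for the tokens {M_B, x_i, ..., x_j, M_E, E_B, e, E_E}.
probs = [0.9, 0.8, 0.95, 0.9, 0.7, 0.99]
assert sequence_score(probs) < 0          # log-probabilities are non-positive
assert sequence_score([1.0, 1.0]) == 0.0  # a certain sequence scores 0
```

A single θ is then tuned on the development set and pairs with `sequence_score` below θ are dropped, as in the thresholding method above.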
Redemption Performances Tab. 4 shows the results of partial KB inference on MEDIC and MEDIC∁ of BC5CDR. We also identify the same pattern on the other partial KBs (Appx. §C.2). The paradigms behave differently under these settings.
For KeBioLM+CODER, the best improvements are brought by thresholding. Mention-concept pairs with low similarities fall into two categories: concepts within E_1 − E_2 or incorrect mention spans. Both kinds of pairs are removed by thresholding, which results in the improvement of both NER and NED. Post-pruning also improves NED by removing concepts within E_1 − E_2, but it cannot deal with incorrect mention spans.
For EntQA, direct partial KB transfer achieves results similar to In-KB training. The strong performance of direct partial KB transfer is due to its integrated thresholding mechanism.
For GENRE, the best performance is achieved uniformly by post-pruning, which removes concepts within E_1 − E_2 to boost performance. Thresholding also brings significant improvement and performs better than In-KB training. The reason thresholding performs worse than post-pruning may be that the log-likelihood is not a direct estimate of the validity of a mention-entity pair.
Another observation is that the two redemption methods can outperform direct In-KB training, which suggests additional supervision from E_1 − E_2 can benefit partial KB inference on E_2.

Discussion
In this section, we further investigate what causes performance variance across different partial KBs. In the training data, annotations associated with different partial KBs take different proportions of the total annotations, and models may over-fit to the frequency of mention annotations in the training samples. We visualize the F1 drops of entity linking and mention detection against the proportion of partial KB annotations in the training data. As shown in Fig. 3(a)(b), the performance drop is negatively correlated with the annotation proportion for GENRE and KeBioLM+CODER, and the relation is more prominent for mention detection. For EntQA, performance barely changes in terms of entity linking and mention detection due to its robustness. This negative correlation suggests that the mention detection of GENRE and KeBioLM+CODER over-fits annotation frequency. EntQA detects mentions according to retrieved concepts; this explicit modeling makes it more robust, since it handles out-of-KB mentions by filtering out irrelevant concepts in the retrieval stage.
For NED, as shown in Fig. 3(c), there is no obvious trend between accuracy drops and annotation proportions. For GENRE and KeBioLM+CODER, the disambiguation performance improves when inferring on partial KBs. Improvements are also observed for EntQA on concept retrieval R@100. Concept spaces shrink for partial KBs, so the disambiguation problem becomes easier. Contrarily, the disambiguation accuracy of EntQA drops, probably because of a distribution shift, between training and inference, of the retrieved concepts that serve as inputs to the reader: for the same number of top retrieved concepts, many lower-ranked concepts may be unseen by the reader in partial KB inference. This illustrates that EntQA is still influenced by partial KB inference, although it is robust in detecting mentions.

Conclusion
In this work, we propose a practical scenario in biomedical EL, namely partial KB inference, and give a detailed definition and evaluation procedures for it. We review and categorize current state-of-the-art entity linking models into three paradigms. Through experiments, we show the NER-NED and simultaneous generation paradigms are vulnerable to partial KB inference, mainly because of mention detection precision drops. The NED-NER paradigm is more robust due to well-modeled mention-concept reliance. We also propose two methods to redeem the performance drop in partial KB inference and discover that out-of-KB annotations may enhance in-KB performance. Post-pruning and thresholding can both improve the performance of the NER-NED and simultaneous generation paradigms. Although post-pruning is easy to use, it needs to store the large KB E_1 (with its embeddings or trie), which incurs large memory consumption. Thresholding does not rely on the large KB E_1 and also performs better for the NER-NED paradigm. Our findings illustrate the importance of partial KB inference in EL and shed light on future research directions.

Limitations
We only investigate representative methods of three widely used EL paradigms. There are more EL methods and paradigms we do not cover, and we leave them as future work. Furthermore, more auxiliary information in the biomedical domain could be introduced to address the NIL issue we identify in this work. For example, a hierarchical structure exists for concepts in biomedical KBs; NILs might therefore be handled by linking them to hypernym concepts in the partial KBs (Ruas and Couto, 2022). We consider the hierarchical mapping between NILs and in-KB concepts a potential solution for the performance degradation in partial KB inference.
Users can obtain different entity linking results based on their own KBs, which carries the potential risk of missing important clinical information in the texts.

Ethics Statement
Datasets used for building partial KB inference do not contain any patient privacy information.

A Hyper-parameters
We list the hyper-parameters used in training the three EL models on MedMentions and BC5CDR in Tab. 5. All other training and inference hyper-parameters not mentioned in this table are the same as in the public code and scripts of GENRE 1, EntQA 2, KeBioLM 3, and CODER 4. Models are implemented on a single NVIDIA V100 GPU with 32GB memory.

B Datasets Statistics
Tab. 6 shows the detailed statistics of the data we use for partial KB inference. We use MeSH and MEDIC with the BC5CDR corpus 5. The BC5CDR dataset has been identified as being free of known restrictions under copyright law. We use UMLS, MeSH, and SNOMED from the 2017 AA release of UMLS. To meet the assumption that MEDIC forms a subset of MeSH, we discard the concepts in MEDIC that do not exist in MeSH. We use the st21pv version of MedMentions 6. The MedMentions dataset is under the CC0 license. We follow GenBioEL 7 for preprocessing the concepts and synonyms in the original KBs. To meet the assumption that the partial KBs do not contain concepts outside the training KB, we discard the concepts in partial KBs that do not exist in UMLS.
We use precision, recall, and F1 as metrics for entity linking and mention detection, and accuracy on correctly detected mentions for disambiguation performance. We also use top-100 recall (R@100) to illustrate the performance of the EntQA retriever.
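Micro precision, recall, and F1 over mention-concept triples can be computed as follows (a minimal sketch with made-up triples; names are ours):

```python
def micro_prf(gold, pred):
    """gold, pred: sets of (start, end, concept_id) triples; returns (P, R, F1)."""
    tp = len(gold & pred)  # a prediction counts only if span AND concept match
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {(0, 3, "A"), (5, 8, "B")}
pred = {(0, 3, "A"), (5, 8, "C"), (9, 12, "D")}
p, r, f1 = micro_prf(gold, pred)
assert (p, r) == (1 / 3, 1 / 2)
assert abs(f1 - 0.4) < 1e-9
```

Mention detection metrics follow the same scheme with the concept dropped from each triple, and disambiguation accuracy is computed only over correctly detected mentions.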

C.1 Illustrative Example
We show an entity linking result on an example from BC5CDR: Indomethacin induced hypotension in sodium and volume depleted rats. After a single oral dose of 4 mg/kg indomethacin (IDM) to sodium and volume depleted rats plasma renin activity (PRA) and systolic blood pressure fell significantly within four hours.
The entity linking results are shown in Table 7. In Post-Pruning, the final results (marked blue) are those linked to a concept in the partial KB MEDIC. In Thresholding, the final results (marked blue) are those with scores larger than a fixed threshold: 0.8 for KeBioLM+CODER, -0.15 for GENRE, and 0.043 for EntQA.

C.2 Additional Results

SNOMED and SNOMED∁. The results in these tables also support the conclusions we provide in §6. We find thresholding and post-pruning benefit EntQA in these additional results, whereas we witness a significant performance drop in Tab