Knowledge-Rich Self-Supervision for Biomedical Entity Linking

Entity linking faces significant challenges such as prolific variations and prevalent ambiguities, especially in high-value domains with myriad entities. Standard classification approaches suffer from the annotation bottleneck and cannot effectively handle unseen entities. Zero-shot entity linking has emerged as a promising direction for generalizing to new entities, but it still requires example gold entity mentions during training and canonical descriptions for all entities, both of which are rarely available outside of Wikipedia. In this paper, we explore Knowledge-RIch Self-Supervision ($\tt KRISS$) for biomedical entity linking, by leveraging readily available domain knowledge. In training, it generates self-supervised mention examples on unlabeled text using a domain ontology and trains a contextual encoder using contrastive learning. For inference, it samples self-supervised mentions as prototypes for each entity and conducts linking by mapping the test mention to the most similar prototype. Our approach can easily incorporate entity descriptions and gold mention labels if available. We conducted extensive experiments on seven standard datasets spanning biomedical literature and clinical notes. Without using any labeled information, our method produces $\tt KRISSBERT$, a universal entity linker for four million UMLS entities that attains new state of the art, outperforming prior self-supervised methods by as much as 20 absolute points in accuracy.


Introduction
Entity linking maps mentions to unique entities in a target knowledge base (Roth et al., 2014). It can be viewed as the extreme case of named entity recognition and entity typing, where the number of categories swells to tens of thousands or even millions. (* These authors contributed equally to this research. † Work done as an intern at Microsoft Research.) Entity linking is particularly challenging in
high-value domains such as biomedicine, where variations and ambiguities abound. For instance, depending on the context, "PDF" may refer to a gene (Peptide Deformylase, Mitochondrial) or a file type (Portable Document Format). Similarly, "ER" could refer to the emergency room, the organelle endoplasmic reticulum, or the estrogen receptor gene. Moreover, the number of entities in domains such as biomedicine can be very large. The Unified Medical Language System (UMLS), a representative ontology for biomedicine, contains over three million entities (Bodenreider, 2004). Standard classification approaches such as MedLinker (Loureiro and Jorge, 2020) require example gold mentions for each entity and cannot effectively handle new entities for which there are no labeled examples in training. Recently, zero-shot entity linking has emerged as a promising direction for generalizing to unseen entities (Logeswaran et al., 2019), by learning to encode contextual mentions for similarity comparison against reference entity descriptions. Existing methods, however, require example gold entity mentions during training, as well as canonical descriptions for all entities. While applicable to Wikipedia entities, these methods are hard to generalize to other domains, where such labeled information is rarely available at scale.

In this paper, we explore Knowledge-RIch Self-Supervision (KRISS) for entity linking by leveraging readily available domain knowledge to compensate for the lack of labeled information (Figure 1). For entity linking, the most relevant knowledge source is the domain ontology. The core of an ontology is the entity list, which specifies the unique identifier and canonical name for each entity and is the prerequisite for entity linking. Our method only requires the entity list and unlabeled text, which are readily available in any domain.
Figure 1: Illustration of knowledge-rich self-supervised entity linking. A contextual mention encoder is trained with a contrastive loss over a minibatch of contextual mention pairs mined from unlabeled text: mentions of the same entity (e.g., two contexts of "emergency room") are pushed together, while mentions of different entities (e.g., the estrogen receptor gene) are pushed apart.

In training, KRISS uses the entity list to generate self-supervised mention examples from unlabeled
text, and trains a contextual mention encoder using contrastive learning (Gao et al., 2014), by mapping mentions of the same entity closer. For inference, KRISS samples prototypes for each entity from the self-supervised mentions. Given a test mention, KRISS finds the most similar prototype and returns the entity it represents.

Prior methods that leverage domain ontologies for entity linking often resort to string matching (against entity names and aliases), making them vulnerable to both variations and ambiguities. Recently, a flurry of methods have been proposed for biomedical entity representation learning from synonyms in the ontology, such as BIOSYN (Sung et al., 2020), SapBERT (Liu et al., 2021), and others (Lai et al., 2021). These methods can resolve variations to some extent, but they completely ignore mention contexts and cannot resolve ambiguities. Given an entity mention, they only predict a surface form, rather than a unique entity as required by entity linking (e.g., see footnote 2 in SapBERT (Liu et al., 2021)). As we will show in §4.5, their predicted surface forms are often ambiguous and can't be mapped to a unique entity. Unfortunately, starting from BIOSYN, these papers all adopt an incorrect evaluation method that simply ignores the ambiguity and declares the predicted surface form correct. Consequently, their reported "entity linking" scores are often highly inflated and do not represent true linking performance. In §4.5, we provide a detailed analysis to illustrate this problem, which we hope will help rectify this significant evaluation error in future entity linking work.
We conduct our study on biomedicine, which serves as a representative high-value domain where prior methods are hard to apply. Among the three million biomedical entities in UMLS, less than 6% have any description available. Gold mention labels are available for only a tiny fraction of entities. E.g., MedMentions (Mohan and Li, 2019), the largest biomedical entity linking dataset, only covers 35 thousand entities.
We applied our method to train KRISSBERT, a universal entity linker for all three million biomedical entities in UMLS, using only the entity list in UMLS and unlabeled text in PubMed.
KRISSBERT can also incorporate additional domain knowledge in UMLS such as entity aliases and ISA hierarchy. We conducted extensive evaluation on seven standard biomedical entity linking datasets spanning biomedical literature and clinical notes. KRISSBERT demonstrated clear superiority, outperforming prior state of the art by 10 points in average accuracy and by over 20 points in MedMentions.
KRISSBERT can be directly applied to lazy learning ( §3.7) with no additional training, by simply using gold mention examples as prototypes during inference. This universal model already attains results comparable to dataset-specific state-of-the-art supervised methods, each tailored to an individual dataset by limiting entity candidates and using additional supervision sources and more complex methods (e.g., coreference rules and joint inference). We released KRISSBERT to facilitate research and applications in biomedical entity linking.

Related Work
Entity linking Many applications require mapping mentions to unique entities. E.g., knowing that some drug can treat some disease is not very useful, unless we know the specific drug and disease. Entity linking is inherently challenging given the large number of unique entities. Prior work often adopts a pipeline approach that first narrows entity candidates to a small set (candidate generation) and then learns to classify contexts of the mention and a candidate entity (candidate ranking) (Bunescu and Paşca, 2006; Cucerzan, 2007; Ratinov et al., 2011). Candidate generation often resorts to string matching or TF-IDF variants (e.g., BM25), which are vulnerable to variations. Ranking features are manually engineered or learned via various neural architectures (e.g., Ganea and Hofmann, 2017; Kolitsas et al., 2018). Additionally, entity relations (e.g., concept hierarchy) and joint inference have been explored for improving accuracy (Gupta et al., 2017; Murty et al., 2018; Cheng and Roth, 2013; Le and Titov, 2018). These methods are predominantly supervised, and suffer from the scarcity of annotated examples, especially given the large number of entities to cover. By contrast, KRISSBERT leverages self-supervision using readily available domain knowledge and unlabeled text, and can effectively resolve variations and ambiguities for millions of entities.
Knowledge-rich self-supervision Domain ontologies such as UMLS have been applied to self-supervise biomedical named entity recognition (Zhang and Elhadad, 2013; Almgren et al., 2016). Recently, Sung et al. (2020) and Liu et al. (2021) propose BIOSYN and SapBERT for mention normalization by conducting contrastive learning over synonyms from UMLS. However, SapBERT completely ignores mention contexts; it can resolve some variations but not ambiguity. By contrast, we apply contrastive learning on mention contexts, and leverage unlabeled text to generate self-supervised examples. SapBERT relies on synonyms to learn spelling variations. Our approach can learn with just the canonical name for each entity, as self-supervised mention examples naturally capture the contexts in which synonymous mentions may appear.
Knowledge-Rich Self-Supervision for Entity Linking

Entity linking grounds textual mentions to unique entities in a given database/dictionary. Formally, the goal of entity linking is to learn a function Link : (m, T) → e that maps mention m in context T to the unique entity e. Self-supervised entity linking assumes no access to any gold mention examples. The knowledge-rich self-supervision setting (KRISS) assumes that only a domain ontology O and an unlabeled text corpus T are available.
In particular, we require the availability of an entity list, which specifies for each entity a unique identifier and a canonical name. The entity list is the prerequisite for entity linking, as it provides the targets for linking. Our framework can also incorporate other knowledge in the ontology ( §3.5).

Generating Self-Supervision
To generate self-supervised mention examples, we first compile a list of entity names from preferred terms in UMLS. We then build a trie from these names (case preserved) to efficiently search them in plain text. When an exact match is found, a fixed-size window around the mention will be returned as context. Some preferred terms are shared by multiple entities. To reduce noise for training and inference, we skip the ambiguous terms. We conducted this process on PubMed abstracts and obtained over 1.6 billion mention examples, each of which is uniquely linked to an entity in UMLS.
The estimated linking accuracy based on random samples is 85%. Note that not all UMLS entities have self-supervised examples, as some have never been mentioned in PubMed. This is not an issue for training, as our goal is to learn a general encoder that maps mentions of the same entity closer ( §3.2). For inference, the ISA hierarchy in UMLS can be leveraged to compensate for the lack of self-supervised examples ( §3.5).
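The generation procedure above can be sketched as follows. This is a simplified illustration using a plain dictionary scan over whitespace tokens rather than the trie used in the paper; the function names and entity identifiers are illustrative:

```python
def build_name_index(entity_names):
    """Map each unambiguous name to its entity ID; names shared by
    multiple entities are skipped entirely to reduce noise (Section 3.1)."""
    index, ambiguous = {}, set()
    for entity_id, name in entity_names:
        if name in ambiguous:
            continue
        if name in index and index[name] != entity_id:
            del index[name]          # shared name: drop and remember it
            ambiguous.add(name)
        else:
            index[name] = entity_id
    return index

def generate_mentions(text, index, window=8):
    """Scan text for exact (case-preserved) name matches and return
    (entity_id, mention, context) examples with a fixed-size word window."""
    examples = []
    tokens = text.split()
    for name, entity_id in index.items():
        name_toks = name.split()
        for i in range(len(tokens) - len(name_toks) + 1):
            if tokens[i:i + len(name_toks)] == name_toks:
                lo, hi = max(0, i - window), i + len(name_toks) + window
                examples.append((entity_id, name, " ".join(tokens[lo:hi])))
    return examples
```

At PubMed scale the trie (or an Aho-Corasick automaton) makes this a single pass over the corpus instead of one scan per name.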

Contrastive Learning
Given the self-supervised mentions, we train a mention encoder using contrastive learning, mapping mentions of the same entity closer and mentions of different entities farther apart. Specifically, each mention m is encoded into a contextual vector c using a Transformer-based encoder (Vaswani et al., 2017), with the mention demarcated by special boundary markers in its context window (schematically):

[CLS] left context [Ms] mention [Me] right context [SEP]

For a minibatch of N positive pairs (2N mentions in total, where c_i and c_j encode two mentions of the same entity), the contrastive loss for a positive pair (i, j) is

ℓ(i, j) = − log [ exp(sim(c_i, c_j)/τ) / Σ_{k=1}^{2N} 1[k≠i] exp(sim(c_i, c_k)/τ) ],

where 1[k≠i] ∈ {0, 1} is an indicator function evaluating to 1 iff k ≠ i, sim(·, ·) is a similarity function, and τ denotes a temperature parameter. The final loss is computed across all positive pairs in a minibatch:

L = Σ_{(i,j)} ℓ(i, j).
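This mention-pair contrastive loss can be sketched in NumPy. The sketch assumes cosine similarity and a batch layout where rows 2i and 2i+1 encode two mentions of the same entity; it is a minimal illustration, not the paper's training code:

```python
import numpy as np

def nt_xent_loss(C, tau=0.07):
    """Contrastive loss over 2N mention vectors C (rows), where rows
    (2i, 2i+1) encode two mentions of the same entity.  For anchor i with
    positive j, the loss is -log exp(sim(i,j)/tau) /
    sum_{k != i} exp(sim(i,k)/tau), averaged over all positive pairs."""
    C = C / np.linalg.norm(C, axis=1, keepdims=True)   # cosine similarity
    S = C @ C.T / tau
    np.fill_diagonal(S, -np.inf)                       # exclude k == i
    log_denom = np.log(np.exp(S).sum(axis=1))
    losses = []
    for i in range(0, len(C), 2):
        j = i + 1
        losses.append(-(S[i, j] - log_denom[i]))       # anchor i, positive j
        losses.append(-(S[j, i] - log_denom[j]))       # anchor j, positive i
    return float(np.mean(losses))
```

In actual training the encodings come from the Transformer encoder and the loss is backpropagated; the algebra is the same.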

Mention Masking and Replacement
Skipping ambiguous names improves the quality of mention examples ( §3.1), but models trained with such self-supervision tend to over-index on surface matching, limiting generalizability. To overcome this, we propose two strategies to augment alternative views of the encoder input during training.

Mention Masking
With a probability p_mask, we mask the mention using [MASK], which regularizes the model against lexical memorization and encourages it to leverage cues from the surrounding context.

Mention Replacement
With a probability p_replace, the mention is replaced with one of its synonyms in UMLS while the context is kept unchanged. This yields a new mention of the same entity, encouraging the model to generalize across entity variations.
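The two augmentation strategies can be sketched as follows. The helper name and default probabilities are illustrative, not from the paper:

```python
import random

def augment_mention(context_left, mention, context_right, synonyms,
                    p_mask=0.15, p_replace=0.3, rng=random):
    """With probability p_mask, hide the mention behind [MASK] so the
    encoder must rely on context; otherwise, with probability p_replace,
    swap in a UMLS synonym of the same entity (context unchanged)."""
    r = rng.random()
    if r < p_mask:
        mention = "[MASK]"
    elif r < p_mask + p_replace and synonyms:
        mention = rng.choice(synonyms)
    return f"{context_left} {mention} {context_right}"
```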

Linking with Self-Supervised Prototypes
At test time, for each entity e in the entity list E compiled in §3.1, we sample a small set of self-supervised mentions as reference prototypes, denoted Proto(e). Given a test/query mention m_q, we return the entity e with the most similar prototype m_p based on the self-supervised encoding:

Link(m_q) = argmax_{e ∈ E} max_{m_p ∈ Proto(e)} sim(c_{m_q}, c_{m_p}).

For efficient linking, we pre-compute the contextual vectors of all reference prototypes and leverage a fast nearest-neighbor search tool that scales to millions of entities (Johnson et al., 2019).
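Prototype-based linking reduces to a nearest-neighbor lookup over pre-computed prototype vectors. The brute-force NumPy sketch below stands in for the approximate search used at scale; identifiers are illustrative:

```python
import numpy as np

def link(query_vec, proto_vecs, proto_entities):
    """Return the entity of the reference prototype most similar to the
    query mention encoding (cosine similarity); proto_entities[i] is the
    entity represented by prototype row i of proto_vecs."""
    q = query_vec / np.linalg.norm(query_vec)
    P = proto_vecs / np.linalg.norm(proto_vecs, axis=1, keepdims=True)
    return proto_entities[int(np.argmax(P @ q))]
```

Because the prototype vectors are fixed after training, they can be indexed once (e.g., with an inner-product index) and shared across all queries.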

Incorporating Additional Knowledge
Our self-supervised entity linking formulation can easily incorporate other knowledge available in an ontology, either by generating additional mention examples from unlabeled text, or by creating special entity-centric examples, which can be used both for learning and inference. This is especially important for entities without self-supervised mentions from PubMed ( §3.1).
Aliases Ontology often includes aliases for some entities. The alias lists are generally incomplete and aliases such as acronyms are highly ambiguous. So they can't be used as a definitive source for candidate generation. However, aliases can be used in KRISS to generate additional self-supervised mentions from unlabeled text, just like the preferred terms. To reduce noise, we similarly skip ambiguous aliases shared by multiple entities.
Semantic hierarchy Ontology often organizes entities in a hierarchy via ISA relationships among entities. For instance, in UMLS, the ER gene is assigned a Semantic Tree Number (A1.2.3.5), which specifies the ISA path from the root to its entity type (Gene or Genome). For each entity in UMLS, we concatenate its semantic tree number (stn), entity type, and aliases to generate an entity-centric reference of the form (schematically)

[CLS] name [SEP] stn [SEP] entity type [SEP] aliases [SEP]

We introduce a separate encoder to compute the vector representation r_e from the last-layer hidden state of [CLS] for entity e. For learning, besides the contextual vectors {c_1, ..., c_{2N}} for N entities, a minibatch includes N entity-centric references {r_{e_1}, ..., r_{e_N}}. Given a positive pair (c_i, r_{e_j}), we treat the other N − 1 entity-centric references as negatives and compute the InfoNCE loss

ℓ′(i, j) = − log [ exp(sim(c_i, r_{e_j})/π) / Σ_{k=1}^{N} exp(sim(c_i, r_{e_k})/π) ],

where π is a temperature parameter. The final loss L′ between mentions and entity-centric references is computed across all positive pairs in a minibatch. We jointly optimize the two contrastive losses αL + βL′, with weights α and β. For inference, we include the entity-centric references alongside the prototype encodings in Link(m_q):

Link(m_q) = argmax_{e ∈ E} max_{v ∈ {c_{m_p} : m_p ∈ Proto(e)} ∪ {r_e}} sim(c_{m_q}, v).

Entity description For a small fraction of common entities, manually written descriptions may be available. In UMLS, less than 6% of entities have a description, so descriptions can't serve as the main source for contrastive learning and linking. Still, the information may be useful and can be incorporated in KRISS by appending it to the entity-centric reference (separated by [SEP]).
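Assembling the entity-centric reference can be sketched as below. The exact field order and separators are our assumption based on the description above (name, then stn, entity type, aliases, with an optional description appended):

```python
def entity_reference(name, stn, entity_type, aliases, description=None):
    """Concatenate the canonical name, semantic tree number (stn), entity
    type, and aliases into a single entity-centric encoder input; append
    the description via [SEP] when one is available (Section 3.5)."""
    parts = [name, stn, entity_type, " ; ".join(aliases)]
    if description:
        parts.append(description)
    return "[CLS] " + " [SEP] ".join(parts) + " [SEP]"
```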

Cross-Attention Candidate Re-Ranking
Inspired by Logeswaran et al. (2019), we further improve linking accuracy by learning to re-rank the top K candidates via a cross-attention encoder. The input concatenates the mention and candidate representations (with the second [CLS] removed). A linear layer is applied to the top [CLS] encoding to compute the re-ranking score. The training data is generated by pairing self-supervised mentions with their top K candidates based on Link(m_t). We learn the encoder using a cross-entropy loss that maximizes the re-ranking score of the correct entity.
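The re-ranking objective is a standard softmax cross-entropy over the K candidate scores. A minimal NumPy sketch (not the paper's implementation):

```python
import numpy as np

def rerank_loss(scores, gold_idx):
    """Cross-entropy over the K candidate re-ranking scores for one
    mention: minimized by maximizing the score of the gold candidate."""
    z = scores - scores.max()                  # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())    # log-softmax over candidates
    return -float(log_probs[gold_idx])
```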

Lazy Learning
KRISS does not require labeled information in training or inference. However, if labeled examples are available, KRISS can directly use them by adding the gold mention examples as prototypes during inference, with no additional training.

Baseline Systems
We conduct head-to-head comparisons against five baseline systems, including popular tools and prior state-of-the-art methods: QuickUMLS (Soldaini and Goharian, 2016), BLINK (Wu et al., 2020), SapBERT (Liu et al., 2021), MedLinker (Loureiro and Jorge, 2020), and ScispaCy (Neumann et al., 2019). See subsection A.4 for details.

Main Results

Table 2 shows the main results. KRISSBERT results are averaged over three runs with different random seeds. As expected, QuickUMLS provides a reasonable dictionary-based baseline but can't effectively handle variations and ambiguities. BLINK attained promising results in the Wikipedia domain, but performed poorly in biomedical entity linking, due to the scarcity of available entity descriptions. SapBERT performed well on largely unambiguous entity types such as chemicals/drugs but faltered on more challenging datasets such as MedMentions. By contrast, KRISSBERT performed substantially better across the board, establishing a new state of the art in self-supervised biomedical entity linking, outperforming prior best systems by 10 points on average and by over 20 points on MedMentions. (The SapBERT results differ from Liu et al. (2021); we explain the difference in §4.5.) By leveraging knowledge-rich self-supervision, KRISSBERT even substantially outperformed supervised entity linkers such as MedLinker and ScispaCy, which used MedMentions training data, gaining 10-20 absolute points on average.
Self-supervised KRISSBERT also outperforms KRISSBERT (supervised only). This is particularly remarkable, as KRISSBERT (self-supervised) learns a single, unified model for over three million UMLS entities, whereas KRISSBERT (supervised only) learns separate supervised models tailored to individual datasets. This seemingly counter-intuitive result can be explained by the unreasonable effectiveness of data (Halevy et al., 2009). Knowledge-rich self-supervision produces a large dataset comprising diverse entity and mention examples. Despite the inherent noise, it confers a significant advantage over supervised learning with small training data. This manifests most prominently in small clinical datasets like ShARe and N2C2.

Why the Entity Linking Scores Reported in the SapBERT Paper Are Incorrect
The SapBERT paper (Liu et al., 2021) reported substantially higher scores than those in Table 2. Unfortunately, this stems from a significant error in their evaluation method, inherited from BIOSYN (Sung et al., 2020) and widely adopted in subsequent work (e.g., Lai et al., 2021). Here, we conduct a detailed analysis using SapBERT (Liu et al., 2021) as the representative example.

The problem can be immediately discerned from first principles. SapBERT completely ignores the context of an entity mention (e.g., see Footnote 3 and the formal definition in Section 2 of Liu et al., 2021). Given an ambiguous mention, there is no way such methods can resolve the ambiguity. Instead, these methods merely produce a surface form (Footnote 2 in Liu et al., 2021). If the surface form matches multiple entities in name or alias, these methods can't predict a unique entity as required by entity linking. Unfortunately, such an ambiguous prediction is considered correct by their evaluation, as long as the gold entity is one of the matching entities.

Table 3 shows examples of such ambiguous cases:

Mention: "... Hence, we aimed to find drug targets using the 2DE / MS proteomics study of a dexamethasone ..."
SapBERT prediction: surface form MS, which is shared by multiple entities, such as Master of Science (C1513009), Mass Spectrometry (C0037813), etc.
KRISSBERT prediction: Mass Spectrometry (C0037813)
KRISSBERT predicted prototype: "... mass spectrometry is a widely used technique for enrichment and sequencing of phosphopeptides ..."

Mention: "... every patient followed up accordingly within ten days of discharge ..."
SapBERT prediction: surface form DISCHARGE, which is shared by multiple entities, such as Discharge, Body Substance, Sample (C0600083), Patient Discharge (C0030685), etc.

E.g., given the mention "MS", without the context SapBERT has no way to resolve its ambiguity. Instead, it simply returns the verbatim surface form "MS", which can be mapped to many UMLS entities.
Following BIOSYN, the SapBERT evaluation simply considers this as correct, since one of the matching entities is the gold entity Mass Spectrometry (C0037813). However, this obviously does not reflect the true linking performance of SapBERT, as it can't distinguish the gold entity from other equally matching entities such as Master of Science (C1513009) and Montserrat Island (C0026514).
Even if we adopt this incorrect evaluation method, KRISSBERT still substantially outperforms SapBERT, especially on the largest and most challenging MedMentions dataset (see Table 4). The gain stems from cases where the gold entities have no official aliases matching the surface form predicted by SapBERT, whereas KRISSBERT can still match the gold entity based on context (e.g., see the last two examples in Table 3). We also evaluated the trivial baseline that returns the mention as is and found that SapBERT often does not outperform it by much, especially on the most representative MedMentions dataset. Interestingly, under this inflated evaluation, KRISSBERT appears to slightly underperform SapBERT on the relatively easy datasets NCBI and BC5CDR-d (both about diseases). We found that, on rare occasions, the context may lead KRISSBERT to predict a more fine-grained concept (see subsection A.5).

Table 7: Ablation study of KRISSBERT on the impact of knowledge components and domain-specific pretraining.
As shown in Table 5, ambiguous mentions abound, especially in more diverse and realistic datasets such as N2C2 and MedMentions. The SapBERT paper's evaluation thus reflects an oracle score (assuming that the right entity is always chosen out of multiple candidates), rather than true linking performance. For a more realistic assessment, if SapBERT returns multiple entities, a random one is chosen for evaluation, as in §4.4. Not surprisingly, KRISSBERT substantially outperforms SapBERT on the ambiguous cases, but still has much room for growth.
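The difference between the two evaluation protocols can be made concrete with a small sketch; entity identifiers are illustrative, and each prediction is represented as the set of entities sharing the predicted surface form:

```python
import random

def oracle_accuracy(predictions, golds):
    """BIOSYN-style (inflated) evaluation: a prediction counts as
    correct whenever the gold entity is merely among the entities
    matching the predicted surface form."""
    return sum(g in p for p, g in zip(predictions, golds)) / len(golds)

def realistic_accuracy(predictions, golds, rng=random):
    """Realistic evaluation: an ambiguous prediction must commit to a
    single entity, chosen uniformly at random among the matches."""
    hits = sum(rng.choice(sorted(p)) == g for p, g in zip(predictions, golds))
    return hits / len(golds)
```

On datasets with many ambiguous surface forms, the gap between the two numbers is exactly the inflation discussed above.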

Lazy Supervised Entity Linking
KRISSBERT can make good use of labeled data when available. Even lazy learning ( §3.7) yields results comparable to the supervised state of the art, as shown in Table 6. Note that KRISSBERT (lazy supervised) is based on a single task-agnostic model (KRISSBERT (self-supervised)), and simply uses the corresponding training-set examples as prototypes for linking in a zero-shot fashion. By contrast, prior supervised state-of-the-art results were attained using separate models tailored to individual datasets. They may use additional supervision such as coreference and joint inference (Angell et al., 2021), which can be incorporated into KRISSBERT.

Ablation Studies
In Table 7, we conduct a series of ablation studies to understand the impact of domain knowledge and model choices. Deep cross-attention between query mentions and candidates produces consistent gains. The mention-pair contrastive loss L ( §3.2) is fundamental for self-supervised learning, whereas additional domain knowledge such as entity descriptions and the semantic hierarchy offers incremental gains. Domain-specific pretraining (PubMedBERT; Gu et al., 2021) offers a substantial advantage for biomedical entity linking, gaining 6.5 points on average over BERT initialization.

Discussion
Aside from BC5CDR-c where KRISSBERT already performs very well, there is a large gap (10-15 points) between top-1 and top-5 accuracy, in both self-supervised and lazy supervised settings (Figure 2). This suggests that there is much room for KRISSBERT to gain by further improving ranking.
KRISSBERT also facilitates efficient few-shot learning, with a single example per entity yielding an over-10-point gain on N2C2. Table 9 in subsection A.5 shows examples of common errors by KRISSBERT. They are subtle and challenging; e.g., the gold concept is expression, while KRISSBERT predicts the procedure of expression.

Conclusion
We propose knowledge-rich self-supervised entity linking by conducting contrastive learning on mention examples generated from unlabeled text using available domain knowledge. Experiments on seven standard biomedical entity linking datasets show that our proposed KRISSBERT outperforms prior state of the art by as much as 20 points in accuracy. Future directions include: further improving self-supervision quality; incorporating additional knowledge; applications to other domains.

Limitations
KRISS is mainly tested on a language with limited morphology, i.e., English. Relatively large GPU resources (4 NVIDIA V100 GPUs) are required to train the KRISSBERT model; therefore, we did not do an exhaustive search for hyperparameters. Our experiments report results on seven standard biomedical datasets, which may not reflect KRISSBERT's performance in real-world applications.

A.4 Baseline Systems

BLINK Zero-shot entity linking by reading entity descriptions (Logeswaran et al., 2019) learns to encode contextual mentions against entity descriptions, and attains state-of-the-art zero-shot entity linking results in the Wikipedia domain. Prior work uses gold mention examples in supervised learning; we adapt it to self-supervised learning using the self-supervised mention examples and the entity descriptions available in UMLS. Prior work initializes the encoder with general-domain BERT models; to ensure a head-to-head comparison, we instead follow KRISSBERT in using PubMedBERT (Gu et al., 2021), which yielded better results.

SapBERT (Liu et al., 2021) learns to resolve variations in entity surface forms using synonyms in UMLS, initialized from PubMedBERT (Gu et al., 2021). It ignores the mention context and returns all entities with a matching surface form. To use SapBERT for linking, we randomly select an entity when SapBERT returns multiple ones.

MedLinker (Loureiro and Jorge, 2020) is a strong supervised entity linking baseline that trains a BERT model on MedMentions. During testing, it augments the BERT-based prediction with approximate dictionary matching for entities unseen in training.

ScispaCy (Neumann et al., 2019) provides another strong entity linking baseline that leverages labeled data in MedMentions to tune an elaborate biomedical linking system that uses TF-IDF-based approximate matching and sophisticated abbreviation expansion.

A.5 Error Analysis
In Table 8, KRISSBERT considers "t cell prolymphocytic leukemia" and "families with" in the contexts of the two mentions, and predicts more specific entities than the gold ones.
Mention: "By analysing tumor DNA from patients with sporadic t cell prolymphocytic leukemia, a rare clonal malignancy with similarities to a mature t cell leukemia seen in ataxia telangiectasia ..."
Gold entity: T-Cell Leukemia (C0023492)
KRISSBERT prediction: T-Cell Prolymphocytic Leukemia (C2363142)

Mention: "The majority (81%) of the breast ovarian cancer families were due to BRCA1, with most others (14%) due to BRCA2. Conversely, the majority of families with female breast cancer were due to BRCA2 (76%)."
Gold entity: Breast cancer (C0006142)
KRISSBERT prediction: Familial cancer of breast (C0346153)