Biomedical Concept Normalization by Leveraging Hypernyms

Biomedical Concept Normalization (BCN) is widely used as a fundamental module in biomedical text processing. Owing to the numerous surface variants of biomedical concepts, BCN remains challenging and unsolved. In this paper, we exploit biomedical concept hypernyms to facilitate BCN. We propose the Biomedical Concept Normalizer with Hypernyms (BCNH), a novel framework that adopts list-wise training to make use of both hypernyms and synonyms, and also employs a norm constraint on the representations of hypernym-hyponym entity pairs. The experimental results show that BCNH outperforms the previous state-of-the-art model on the NCBI dataset.


Introduction
Biomedical Concept Normalization (BCN) plays an important and prerequisite role in biomedical text processing. The goal of BCN is to link an entity mention in context to its normalized Concept Unique Identifier (CUI) in biomedical dictionaries such as UMLS (Bodenreider, 2004), SNOMED-CT (Spackman et al., 1997), and MedDRA (Brown et al., 1999). Figure 1 shows an example of BCN from the NCBI dataset (Dogan et al., 2014): the mention B-cell non-Hodgkins lymphomas should be linked to D016393 Lymphoma, B-Cell in the MEDIC (Davis et al., 2012) dictionary.
Recent works on BCN usually adopt encoders such as CNN (Li et al., 2017), LSTM (Phan et al., 2019), ELMo (Peters et al., 2018; Schumacher et al., 2020), or BioBERT (Fakhraei et al., 2019; Ji et al., 2020) to embed both the mention and the concept's name entities, and then feed the representations to a classifier or ranking network that determines the corresponding concept in the biomedical dictionary. However, biomedical dictionaries are generally sparse in nature: a concept typically provides only a CUI, a preferred name (the recommended concept name string), synonyms (acceptable name variants), and related concepts (mainly hypernyms). Therefore, effectively using the limited information in the biomedical dictionary from which the candidate entities come is paramount for the BCN task.
For a concept's synonym entities, recent models such as BNE (Phan et al., 2019) and BIOSYN (Sung et al., 2020) make full use of them through synonym marginalization to enhance biomedical entity representations, achieving consistent performance improvements. Unfortunately, previous works generally ignore the concept hypernym hierarchy, which is exactly the original motivation of biomedical dictionaries: organizing thousands of concepts under a unified, multi-level hierarchical classification schema.
We believe that leveraging hypernym information in the biomedical dictionary can improve BCN performance based on two intuitions. First, hard negative sampling (Fakhraei et al., 2019; Phan et al., 2019) is vital for the BCN model's discriminating ability, and a hypernym is naturally a hard negative example for its hyponym. Second, injecting hypernym hierarchy information during training benefits the encoder, since currently used encoders such as BioBERT encode only the contextual semantics of biomedical corpora, not the structural information of biomedical concepts.
To this end, we propose the Biomedical Concept Normalizer with Hypernyms (BCNH), a novel framework combining a list-wise cross entropy loss with a norm constraint on hypernym-hyponym entity pairs. Concretely, we reformulate the candidate target list as a three-level relevance list that considers both synonyms and hypernyms, and apply the list-wise cross entropy loss. On the one hand, synonyms help encode surface name variants; on the other hand, hypernyms help encode hierarchical structural information. We also apply a norm constraint on the embeddings of hypernym-hyponym entity pairs to further preserve the principal hypernym relation. Specifically, for a hypernym-hyponym entity pair $(e_{\mathrm{hyper}}, e_{\mathrm{hypo}})$, we constrain the norm of the hypernym entity $e_{\mathrm{hyper}}$ to be larger than that of the hyponym entity $e_{\mathrm{hypo}}$ in a multi-task manner. We conduct experiments on the NCBI dataset and outperform the previous state-of-the-art model.
To sum up, the contributions of this paper are as follows. First, for the first time, we reformulate the candidate target list as a three-level relevance list and apply a list-wise loss to attend to all candidate entities. Second, we innovatively use a norm constraint to model the hypernym-hyponym relation, preserving the hierarchy structure information inside the entity representations. The proposed BCNH outperforms the previous state-of-the-art model on the NCBI dataset, leading to an improvement of 0.73% on top-1 accuracy.

Methodology
The architecture of our framework is illustrated in Figure 2. Our model is composed of three parts: a candidate generator that generates candidate entities from the dictionary, a list-wise ranker that trains the encoder, and a hypernym normalizer that applies the hypernym-hyponym norm constraint.

Iterative Candidate Generator
We reuse the iterative candidate generator module from BIOSYN (Sung et al., 2020). Each mention $m$ and each entity $e_i$ in the dictionary $D = \{e_1, e_2, \cdots\}$ is first represented with both a sparse and a dense representation. The sparse representations of $m$ and $e_i$, denoted $(v^s_m, v^s_{e_i})$, are computed from character-level n-gram statistics over all entities in $D$. The dense representations, denoted $(v^d_m, v^d_{e_i})$, are obtained from pre-trained BioBERT.
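The sparse side can be sketched as a character-level n-gram bag-of-features vector; the toy dictionary, the raw-count weighting, and the bigram size below are illustrative assumptions (BIOSYN itself uses TF-IDF-weighted character n-grams):

```python
import numpy as np

def char_ngrams(text, n=2):
    # Character-level n-grams of an entity name (bigrams here, for illustration).
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def sparse_vector(text, vocab):
    # Bag-of-n-grams count vector over a vocabulary built from all dictionary entities.
    v = np.zeros(len(vocab))
    for g in char_ngrams(text):
        if g in vocab:
            v[vocab[g]] += 1.0
    return v

# Toy dictionary D; the n-gram vocabulary is computed over all its entities.
dictionary = ["lymphoma", "b-cell lymphoma"]
vocab = {g: i for i, g in enumerate(sorted({g for e in dictionary for g in char_ngrams(e)}))}

v_mention = sparse_vector("b-cell lymphomas", vocab)  # v^s_m
v_entity = sparse_vector("b-cell lymphoma", vocab)    # v^s_{e_i}
```

Surface variants that share many character n-grams, as in this pair, end up with a large sparse inner product.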
The candidate generator then computes the similarity score between mention $m$ and each entity $e_i$ by combining the sparse similarity score $S_{\mathrm{sparse}}(m, e_i)$ with the dense similarity score $S_{\mathrm{dense}}(m, e_i)$:
$$S(m, e_i) = S_{\mathrm{dense}}(m, e_i) + \lambda\, S_{\mathrm{sparse}}(m, e_i),$$
with $S_{\mathrm{dense}}(m, e_i) = f(v^d_m, v^d_{e_i})$ and $S_{\mathrm{sparse}}(m, e_i) = f(v^s_m, v^s_{e_i})$, where the function $f$ is the inner product and $\lambda$ is a trainable scalar weight on the sparse score. In the end, the top $k_1$ entities with the highest similarity scores are selected into the candidate list $[e_1, e_2, \cdots, e_{k_1}]$, and their similarity scores are denoted $z = [z_1, z_2, \cdots, z_{k_1}]$. The candidate list is pre-computed and iteratively updated at the beginning of every training step.
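The scoring and top-$k_1$ selection can be sketched as follows; the toy vectors and dimensionality are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

def candidate_scores(vs_m, vd_m, VS_e, VD_e, lam):
    # S(m, e_i) = S_dense(m, e_i) + lambda * S_sparse(m, e_i),
    # with f taken as the inner product, as in the generator described above.
    return VD_e @ vd_m + lam * (VS_e @ vs_m)

def top_k_candidates(scores, k):
    # Indices of the k highest-scoring dictionary entities (the candidate list).
    return np.argsort(-scores)[:k]

# Toy example: 4 dictionary entities with 3-dim sparse and dense vectors.
VS_e = np.array([[1.0, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
VD_e = np.array([[1.0, 0, 0], [0, 1, 0], [0.5, 0.5, 0], [0, 0, 1]])
vs_m = np.array([1.0, 1.0, 0.0])
vd_m = np.array([1.0, 0.2, 0.0])

scores = candidate_scores(vs_m, vd_m, VS_e, VD_e, lam=0.5)
cands = top_k_candidates(scores, k=2)
```

In training, this selection is simply re-run at the start of each step with the current (fine-tuned) dense encoder, which is what makes the generator iterative.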
At inference time, the entity $e^* \in D$ with the highest similarity score is retrieved, and the CUI of $e^*$ is returned as the predicted CUI.

List-wise Ranker
For mention $m$ and its top-$k_1$ candidate list $[e_1, e_2, \cdots, e_{k_1}]$, we reformulate the targets of the candidate list as a three-level relevance score list. The relevance score is defined as the degree of relevance between mention $m$ and candidate entity $e_i$. Specifically, for a candidate entity $e_i$, the relevance score $y_i$ is set to 2 if $e_i$ is a synonym of $m$, 1 if $e_i$ is a hypernym of $m$, and 0 if it is neither. We thus have a pseudo relevance score target $y = [y_1, y_2, \cdots, y_{k_1}]$ alongside the candidate similarity score list $z = [z_1, z_2, \cdots, z_{k_1}]$.
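The three-level target assignment is straightforward to sketch; the concept names below are taken from the paper's running example, while the synonym/hypernym sets are illustrative:

```python
def relevance_scores(candidates, synonyms, hypernyms):
    # Three-level pseudo relevance targets: 2 for a synonym of the mention,
    # 1 for a hypernym, 0 for any other candidate entity.
    return [2 if e in synonyms else 1 if e in hypernyms else 0 for e in candidates]

y = relevance_scores(
    candidates=["Lymphoma, B-Cell", "Lymphoma", "Leukemia"],
    synonyms={"Lymphoma, B-Cell"},
    hypernyms={"Lymphoma"},
)
```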
The list-wise cross entropy loss (LCE) (Cao et al., 2007) is then applied to the relevance scores $y$ and the candidate similarity scores $z$. The objective of learning candidate similarity is formalized as minimizing the total LCE loss over all examples:
$$\mathcal{L}_{\mathrm{LCE}} = -\sum_{j=1}^{M} \sum_{i=1}^{k_1} P_{y^{(j)}}(i)\, \log P_{z^{(j)}}(i),$$
where $P_{y}(\cdot)$ and $P_{z}(\cdot)$ are the softmax distributions over the relevance scores and the similarity scores, respectively, and $M$ is the number of mentions in the training dataset.
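A minimal sketch of the ListNet-style loss for a single mention, assuming the standard softmax form of Cao et al. (2007); the toy score lists are illustrative:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

def list_wise_ce(y, z):
    # ListNet-style cross entropy between the softmax of the relevance
    # targets y and the softmax of the predicted similarity scores z.
    p_y = softmax(y)
    p_z = softmax(z)
    return float(-(p_y * np.log(p_z + 1e-12)).sum())

y = [2, 1, 0]                                      # synonym > hypernym > other
loss_aligned = list_wise_ce(y, [5.0, 2.0, 0.0])    # predicted ranking agrees
loss_reversed = list_wise_ce(y, [0.0, 2.0, 5.0])   # predicted ranking reversed
```

The loss is lowest when the predicted similarity ordering matches the three-level relevance ordering, which is exactly how the hypernym level injects graded supervision between synonyms and plain negatives.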
Leveraging hypernyms for the list-wise learning targets can be interpreted as a hard negative sampling technique (Kalantidis et al., 2020), which is crucial under the contrastive learning framework.

Hypernym Normalizer
Though we take hypernyms into account through list-wise training, the hypernym hierarchy information in the dictionary is still absent from the concept entity representations. Vulić and Mrkšić (2018) have shown that an asymmetric norm distance is an effective way to encode the hierarchical ordering between hypernym and hyponym entities.
During training, we prepare a hypernym list $(e_{h_1}, e_{h_2}, \cdots, e_{h_{k_2}})$ of length $k_2$ for mention $m$. We denote the norm distance between mention $m$ and all of its hypernyms as the NormLoss:
$$\mathcal{L}_{\mathrm{norm}} = \sum_{i=1}^{k_2} \max\left(0,\ \|v^d_m\| - \|v^d_{h_i}\|\right).$$
By minimizing $\mathcal{L}_{\mathrm{norm}}$, we constrain the norm of each hypernym embedding vector $v^d_{h_i}$ to be larger than that of the mention embedding vector $v^d_m$, under the intuition that the norm constraint fine-tunes norm values in the Euclidean embedding space so as to reflect the hierarchical organization of biomedical concept entities.
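The constraint can be sketched as a hinge penalty on L2 norms; the hinge form without a margin is an assumption here, and the toy embeddings are illustrative:

```python
import numpy as np

def norm_loss(v_m, V_h):
    # Hinge penalty that is zero when every hypernym embedding has a larger
    # L2 norm than the mention embedding, and positive otherwise.
    # (A sketch: the exact formulation and any margin term are assumptions.)
    n_m = np.linalg.norm(v_m)
    n_h = np.linalg.norm(V_h, axis=1)
    return float(np.maximum(0.0, n_m - n_h).sum())

v_m = np.array([1.0, 0.0])                      # mention embedding v^d_m
V_ok = np.array([[2.0, 0.0], [0.0, 3.0]])       # hypernyms with larger norms
V_violating = np.array([[0.5, 0.0]])            # hypernym with a smaller norm
```

Because the penalty only fires when a hypernym's norm falls below the mention's, gradient descent pushes hypernym embeddings outward relative to their hyponyms, encoding the asymmetry of the relation.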
In the end, BCNH jointly optimizes the list-wise cross entropy loss and the norm loss: $\mathcal{L} = \mathcal{L}_{\mathrm{LCE}} + \mathcal{L}_{\mathrm{norm}}$.

Experiments

Experimental setup
Dataset We train and evaluate our model on the NCBI Disease corpus, a collection of 793 PubMed abstracts with disease mentions and their concepts corresponding to the MEDIC dictionary. In this work, we use the February 1, 2021 version of MEDIC, which contains 13,103 CUIs, 74,215 synonyms, and 21,999 hypernyms.
Preprocessing We follow the same dataset preprocessing as previous works (Leaman and Lu, 2016; Wright, 2019; Phan et al., 2019; Sung et al., 2020), including lower-casing, punctuation removal, abbreviation expansion, and composite-mention splitting. We use top-k accuracy as the evaluation metric.
Hyper-parameters We set all parameters in the candidate generator exactly the same as in BIOSYN for a fair comparison. Our model introduces only one new hyper-parameter, $k_2 = 10$. When mention $m$ has more than $k_2$ hypernyms in the dictionary, we truncate the list to $k_2$; when it has fewer, we pad with null entities. The Adam optimizer (Kingma and Ba, 2014) is used to minimize the final loss.
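The truncate-or-pad step can be sketched as follows; using `None` as the null entity is an illustrative placeholder choice:

```python
def pad_or_truncate(hypernyms, k2=10, null_entity=None):
    # Fix the hypernym list of a mention to length k2: truncate when longer,
    # pad with a null entity when shorter. (null_entity is a placeholder
    # choice here; the paper only specifies that a null entity is padded.)
    if len(hypernyms) >= k2:
        return hypernyms[:k2]
    return hypernyms + [null_entity] * (k2 - len(hypernyms))
```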

Results
The main results are shown in Table 1. Our proposed BCNH outperforms the previous state-of-the-art model BIOSYN (Sung et al., 2020) on Acc@1 and Acc@5, with improvements of 0.73% and 1.18%, respectively. Our model also obtains a smaller confidence interval.

Ablation study
We conduct an ablation study to figure out the contributions of the two proposed components. The results are presented in Table 3. The first experiment reports the results of BIOSYN, and the second reports BIOSYN with the hypernym norm constraint jointly applied. The third experiment reports BCNH with list-wise training only, and the last reports BCNH with both list-wise training and the norm constraint. The results demonstrate that the norm constraint indeed endows the concept entity representations with the hypernym-hyponym hierarchy structure. They also verify that hypernyms provide beneficial harder negative samples, and that attending to all candidate entities, including hypernyms, in a list-wise manner is more appropriate than marginalizing solely over synonyms.

Conclusion
In this paper, we propose BCNH to leverage hypernyms in the biomedical concept normalization task. We adopt both list-wise training and a norm constraint with the help of hypernym information. The experimental results on the NCBI dataset show that BCNH outperforms previous state-of-the-art models.