SMedBERT: A Knowledge-Enhanced Pre-trained Language Model with Structured Semantics for Medical Text Mining

Recently, the performance of Pre-trained Language Models (PLMs) has been significantly improved by injecting knowledge facts to enhance their language understanding abilities. In the medical domain, background knowledge sources are especially useful because the large number of medical terms and their complicated relations are difficult to understand from text alone. In this work, we introduce SMedBERT, a medical PLM trained on large-scale medical corpora, incorporating deep structured semantic knowledge from the neighbors of linked-entities. In SMedBERT, mention-neighbor hybrid attention is proposed to learn heterogeneous-entity information, which infuses the semantic representations of entity types into the homogeneous neighboring entity structure. Apart from integrating knowledge as external features, we propose to employ the neighbors of linked-entities in the knowledge graph as additional global contexts of text mentions, allowing them to communicate via shared neighbors and thus enriching their semantic representations. Experiments demonstrate that SMedBERT significantly outperforms strong baselines in various knowledge-intensive Chinese medical tasks. It also improves the performance of other tasks such as question answering, question matching and natural language inference.

In the literature, a majority of knowledge-enhanced PLMs (KEPLMs) (Zhang et al., 2020a; Hayashi et al., 2020) inject information of entities corresponding to mention-spans from Knowledge Graphs (KGs) into contextual representations. However, those KEPLMs only utilize the linked-entity in the KGs as auxiliary information and pay little attention to the neighboring structured semantic information of the entity linked with text mentions. In the medical context, there exists complicated domain knowledge such as relations and medical facts among medical terms (Rotmensch et al., 2017; Li et al., 2020), which is difficult to model using previous approaches. To address this issue, we consider leveraging structured semantic knowledge in medical KGs from two aspects.
(1) Rich semantic information from the neighboring structures of linked-entities, such as entity types and relations, is highly useful for medical text understanding. As in Figure 1, "新型冠状病毒" (novel coronavirus) can be the cause of many diseases, such as "肺炎" (pneumonia) and "呼吸综合征" (respiratory syndrome).
(2) Additionally, we leverage the neighbors of the linked-entity as global "contexts" to complement the plain-text contexts used in (Mikolov et al., 2013a; Pennington et al., 2014). The structured knowledge contained in neighboring entities can act as a "knowledge bridge" between mention-spans, facilitating the interaction of different mention representations. Hence, PLMs can learn better representations for rare medical terms.
In this paper, we introduce SMedBERT, a KEPLM pre-trained over large-scale medical corpora and medical KGs. To the best of our knowledge, SMedBERT is the first PLM with structured semantic knowledge injected in the medical domain. Specifically, the contributions of SMedBERT mainly include two modules:
Mention-neighbor Hybrid Attention: We fuse the node and type embeddings of linked-entity neighbors into contextual target mention representations. The type-level and node-level attentions help to learn the importance of entity types and of the neighbors of the linked-entity, respectively, in order to reduce the knowledge noise injected into the model. The type-level attention transforms the homogeneous node-level attention into a heterogeneous learning process over neighboring entities.
Mention-neighbor Context Modeling: We propose two novel self-supervised learning tasks for promoting interaction between a mention-span and its corresponding global context, namely masked neighbor modeling and masked mention modeling. The former enriches the representations of the "context" neighboring entities based on the well-trained "target word" mention-span, while the latter gathers that information back from the neighboring entities to the masked target, such as a low-frequency mention-span that is poorly represented (Turian et al., 2010).
In the experiments, we compare SMedBERT against various strong baselines, including mainstream KEPLMs pre-trained over our medical resources. The underlying medical NLP tasks include: named entity recognition, relation extraction, question answering, question matching and natural language inference. The results show that SMedBERT consistently outperforms all the baselines on these tasks.

Related Work
PLMs in the Open Domain. PLMs have gained much attention recently, proving successful for boosting the performance of various NLP tasks. Early works on PLMs focus on feature-based approaches to transform words into distributed representations (Collobert and Weston, 2008; Mikolov et al., 2013b; Pennington et al., 2014; Peters et al., 2018). BERT (Devlin et al., 2019), as well as its robustly optimized version RoBERTa, employs bidirectional transformer encoders (Vaswani et al., 2017) and self-supervised tasks to generate context-aware token representations. Further performance improvements are mostly based on three types of techniques: self-supervised tasks (Joshi et al., 2020), transformer encoder architectures and multi-task learning (Liu et al., 2019a).
Knowledge-Enhanced PLMs. As existing BERT-like models only learn knowledge from plain corpora, various works have investigated how to incorporate knowledge facts to enhance the language understanding abilities of PLMs. KEPLMs are mainly divided into the following three types.
(1) Knowledge-enhanced by Entity Embedding: ERNIE-THU (Zhang et al., 2019) and KnowBERT (Peters et al., 2019) inject linked-entities as heterogeneous features learned by KG embedding algorithms such as TransE (Bordes et al., 2013). (2) Knowledge-enhanced by Entity Description: E-BERT (Zhang et al., 2020a) and KEPLER (Wang et al., 2019b) add extra description text of entities to enhance semantic representation. (3) Knowledge-enhanced by Triplet Sentence: K-BERT (Liu et al., 2020b) and CoLAKE convert triplets into sentences and insert them into the training corpora without pre-trained embeddings. Previous studies on KG embedding (Nguyen et al., 2016; Schlichtkrull et al., 2018) have shown that utilizing the surrounding facts of an entity can yield more informative embeddings, which is the focus of our work.
PLMs in the Medical Domain. PLMs in the medical domain can be generally divided into three categories. (1) BioBERT (Lee et al., 2020), BlueBERT (Peng et al., 2019), SCIBERT (Beltagy et al., 2019) and ClinicalBert (Huang et al., 2019) apply continual learning on medical domain texts, such as PubMed abstracts, PMC full-text articles and MIMIC-III clinical notes. (2) PubMedBERT (Gu et al., 2020) learns weights from scratch using PubMed data to obtain an in-domain vocabulary, alleviating the out-of-vocabulary (OOV) problem. This training paradigm needs the support of large-scale domain data and resources. (3) Some other PLMs use domain self-supervised tasks for pre-training. For example, MC-BERT (Zhang et al., 2020b) masks Chinese medical entities and phrases to learn complex structures and concepts. DiseaseBERT (He et al., 2020) leverages medical terms and their categories as labels to pre-train the model. In this paper, we utilize both domain corpora and neighboring entity triplets of mentions to enhance the learning of medical language representations.

Notations and Model Overview
In the PLM, we denote the hidden features of the tokens $\{w_1, \ldots, w_N\}$ as $\{h_1, h_2, \ldots, h_N\}$, where $N$ is the maximum input sequence length, and the total number of pre-training samples as $M$. Let $E$ be the set of mention-spans $e_m$ in the training corpora. Furthermore, the medical KG consists of the entity set $\mathcal{E}$ and the relation set $\mathcal{R}$. The triplet set is $S = \{(h, r, t) \mid h \in \mathcal{E}, r \in \mathcal{R}, t \in \mathcal{E}\}$, where $h$ is the head entity with relation $r$ to the tail entity $t$. The embeddings of entities and relations trained on the KG by TransR (Lin et al., 2015) are represented as $\Gamma_{ent}$ and $\Gamma_{rel}$, respectively. The neighboring entity set recalled from the KG by $e_m$ is denoted as $\mathcal{N}_{e_m} = \{e_m^1, e_m^2, \ldots, e_m^K\}$, where $K$ is the threshold of our PEPR algorithm. We denote the number of entities in the KG as $Z$. The dimensions of the hidden representation in the PLM and of the KG embeddings are $d_1$ and $d_2$, respectively.
The main architecture of our model is shown in Figure 2. SMedBERT mainly includes three components: (1) Top-K entity sorting determines which $K$ neighboring entities to use for each mention.
(2) Mention-neighbor hybrid attention aims to infuse the structured semantic knowledge into the encoder layers, and includes type attention, node attention and a gated position infusion module.
(3) Mention-neighbor context modeling includes masked neighbor modeling and masked mention modeling, and aims to encourage mentions to leverage and interact with neighboring entities.

Top-K Entity Sorting
Previous research shows that simple neighboring entity expansion may introduce knowledge noise during PLM training (Wang et al., 2019a). In order to recall the most important neighboring entity set from the KG for each mention, we extend the Personalized PageRank (PPR) (Page et al., 1999) algorithm to filter out trivial entities. Recall that the iterative process in PPR is $V_i = (1-\alpha) A \cdot V_{i-1} + \alpha P$, where $A$ is the normalized adjacency matrix, $\alpha$ is the damping factor, $P$ is the uniformly distributed jump probability vector, and $V$ is the iterative score vector over the entities.
PEPR specifically focuses on learning the weight for the target mention-span in each iteration. It assigns the span $e_m$ a higher jump probability of $1$ in $P$, with the remaining entries set to $1/Z$. It also uses the entity frequency to initialize the score vector $V$, where $T$ is the sum of the frequencies of all entities and $t_{e_m}$ is the frequency of $e_m$ in the corpora. After sorting, we select the top-$K$ entity set $\mathcal{N}_{e_m}$.
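The ranking step can be illustrated with a minimal, pure-Python sketch of the PPR iteration quoted above. It is not the authors' implementation: the function names, the toy graph, the frequency counts, the column-normalized adjacency and the choice of relative frequency as the initial score are all assumptions made for illustration.

```python
def personalized_pagerank(adj, target, init_scores, alpha=0.15, iters=50):
    """Iterate V_i = (1 - alpha) * A . V_{i-1} + alpha * P on a small graph.

    adj: dict mapping each node to its neighbour list (assumed undirected).
    target: linked entity of the mention-span; it gets jump weight 1 while
            every other node gets 1/Z, as described in the text.
    init_scores: dict node -> initial score (here: relative frequency, assumed).
    """
    nodes = list(adj)
    z = len(nodes)
    jump = {n: (1.0 if n == target else 1.0 / z) for n in nodes}
    scores = {n: init_scores.get(n, 1.0 / z) for n in nodes}
    for _ in range(iters):
        # Each neighbour m spreads its score evenly over its own neighbours
        # (column-normalised adjacency, an assumption of this sketch).
        scores = {
            n: (1 - alpha) * sum(scores[m] / len(adj[m]) for m in adj[n])
            + alpha * jump[n]
            for n in nodes
        }
    return scores


def top_k_neighbours(adj, target, init_scores, k):
    """Rank all entities except the target by score and keep the best k."""
    scores = personalized_pagerank(adj, target, init_scores)
    candidates = [n for n in adj if n != target]
    return sorted(candidates, key=lambda n: scores[n], reverse=True)[:k]


# Hypothetical toy KG and corpus frequencies.
graph = {
    "coronavirus": ["pneumonia", "respiratory syndrome", "fever"],
    "pneumonia": ["coronavirus", "cough"],
    "respiratory syndrome": ["coronavirus"],
    "fever": ["coronavirus"],
    "cough": ["pneumonia"],
}
freq = {"coronavirus": 50, "pneumonia": 30, "respiratory syndrome": 5,
        "fever": 10, "cough": 5}
total = sum(freq.values())
init = {entity: count / total for entity, count in freq.items()}
top = top_k_neighbours(graph, "coronavirus", init, k=2)
```

Because the iteration is a contraction for any damping factor between 0 and 1, the final ranking in this sketch does not depend on the initial scores; the frequency-based initialization mainly affects how quickly the scores settle.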

Mention-neighbor Hybrid Attention
Besides the embeddings of neighboring entities, SMedBERT integrates the type information of medical entities to further enhance semantic representations of mention-span.

Neighboring Entity Type Attention
Different types of neighboring entities may have different impacts. Given a specific mention-span $e_m$, we compute the neighboring entity type attention. Concretely, we calculate the hidden representation $h_\tau$ of each entity type $\tau$ and a transformed mention-span representation $h'_{e_m}$. Here $f_{sp}$ is the self-attentive pooling (Lin et al., 2017) that generates the mention-span representation $h_{e_m} \in \mathbb{R}^{d_1}$, and $(h_i, h_{i+1}, \ldots, h_j)$ are the hidden representations of the tokens $(w_i, w_{i+1}, \ldots, w_j)$ in mention-span $e_m$ trained by PLMs. $h'_{e_m} \in \mathbb{R}^{d_2}$ is obtained from $h_{e_m}$ with the non-linear activation function $\sigma(\cdot)$, GELU (Hendrycks and Gimpel, 2016), and the learnable projection matrix $W_{be} \in \mathbb{R}^{d_1 \times d_2}$. LN is the LayerNorm function (Ba et al., 2016). Then, we calculate the attention score $\alpha'_\tau$ of each type using the type representation $h_\tau \in \mathbb{R}^{d_2}$ and the transformed mention-span representation $h'_{e_m}$. Finally, the neighboring entity type attention weights $\alpha_\tau$ are obtained by normalizing the attention scores $\alpha'_\tau$ among all entity types $\mathcal{T}$.

Neighboring Entity Node Attention
Apart from entity type information, different neighboring entities also have different influences. Specifically, we devise the neighboring entity node attention to capture the different semantic influences of neighboring entities on the target mention-span and to reduce the effect of noise. We calculate the entity node attention using the mention-span representation $h'_{e_m}$ and the representation $h_{e_m^i}$ of each neighboring entity with entity type $\tau$, where $W_q \in \mathbb{R}^{d_2 \times d_2}$ and $W_k \in \mathbb{R}^{d_2 \times d_2}$ are the attention weight matrices.
The representations of all neighboring entities in $\mathcal{N}_{e_m}$ are aggregated into $\bar{h}_{e_m} \in \mathbb{R}^{d_2}$, the mention-neighbor representation produced by the hybrid attention module.
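To make the two attention levels concrete, the following pure-Python sketch shows one plausible way to combine a type-level weight with a node-level weight. It is an illustration under explicit assumptions rather than the model's actual equations: the type vector is taken as the mean of the neighbor embeddings of that type, scores are plain dot products normalized with softmax, and the projections $W_q$ and $W_k$ are replaced by the identity. The function name `hybrid_attention` is hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hybrid_attention(mention, neighbours):
    """Aggregate neighbour embeddings into one mention-neighbour vector.

    mention: list[float] of size d2 (mention-span vector, already projected).
    neighbours: list of (entity_type, embedding) pairs, embeddings of size d2.
    """
    # Type-level attention. Assumption: a type is represented by the mean
    # embedding of the neighbours that carry it.
    by_type = {}
    for etype, emb in neighbours:
        by_type.setdefault(etype, []).append(emb)
    types = sorted(by_type)
    type_vecs = [
        [sum(col) / len(by_type[t]) for col in zip(*by_type[t])] for t in types
    ]
    type_weight = dict(zip(types, softmax([dot(mention, v) for v in type_vecs])))

    # Node-level attention. Assumption: each node's scaled dot-product score
    # is multiplied by the weight of its type before normalisation, with the
    # query/key projections W_q and W_k taken as the identity.
    scale = math.sqrt(len(mention))
    node_scores = [
        type_weight[t] * dot(mention, emb) / scale for t, emb in neighbours
    ]
    node_weight = softmax(node_scores)

    # Weighted sum of the neighbour embeddings.
    return [
        sum(w * emb[i] for w, (_, emb) in zip(node_weight, neighbours))
        for i in range(len(mention))
    ]


# Hypothetical two-dimensional example.
aggregated = hybrid_attention(
    [1.0, 0.0], [("disease", [1.0, 0.0]), ("drug", [0.0, 1.0])]
)
```

In this toy case the mention vector is aligned with the "disease" neighbor, so that neighbor receives the larger weight at both levels, which is the kind of noise filtering the type-level and node-level attentions are meant to provide.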

Gated Position Infusion
Knowledge-injected representations may divert texts from their original meaning. We further reduce knowledge noise via gated position infusion, which generates the output token representation $h_{if}$, where $W_{ug}, W_{ex} \in \mathbb{R}^{2d_1 \times d_1}$, $b_{ug}, b_{ex} \in \mathbb{R}^{d_1}$, and "$*$" means element-wise multiplication.

Mention-neighbor Context Modeling
To fully exploit the structured semantic knowledge in the KG, we further introduce two novel self-supervised pre-training tasks, namely Masked Neighbor Modeling (MNeM) and Masked Mention Modeling (MMeM).

Masked Neighbor Modeling
Formally, let $r$ be the relation between the mention-span $e_m$ and a neighboring entity $e_m^i$. $h_{mf}$ denotes the mention-span hidden features based on the token hidden representations $h_{if}, h_{(i+1)f}, \ldots, h_{jf}$, $h_r = \Gamma_{rel}(r) \in \mathbb{R}^{d_2}$ is the representation of relation $r$, and $W_{sa} \in \mathbb{R}^{d_1 \times d_2}$ is a learnable projection matrix. The goal of MNeM is to leverage the structured semantics in surrounding entities while preserving the knowledge of relations between entities.
We consider the objective function $\mathcal{L}_S$ of skip-gram with negative sampling (SGNS) (Mikolov et al., 2013a) and the score function $f_{tr}$ of TransR (Lin et al., 2015). In $\mathcal{L}_S$, $w$ is the target word of context $c$, and $f_s$ is the compatibility function measuring how well the target word fits into the context. Inspired by SGNS, and following the general energy-based framework (LeCun et al., 2006), we treat mention-spans in the corpora as "target words" and the neighbors of the corresponding entities in the KG as "contexts" that provide additional global contexts. We employ the Sampled-Softmax (Jean et al., 2015) as the criterion $\mathcal{L}_{MNeM}$ for the mention-span $e_m$, where $\theta$ denotes the triplet $(e_m, r, e_m^i)$ with $e_m^i \in \mathcal{N}_{e_m}$, $\theta'$ denotes the negative triplets $(e_m, r, e_n)$, and $e_n$ is a negative entity sampled with $Q(e_m^i)$, detailed in Appendix B.
To keep the knowledge of relations between entities, we define the compatibility function $f_s$ with a scale factor $\mu$. Assuming the norms of both $h_{mf} M_r + h_r$ and $h_{e_m^i} M_r$ are 1, we have $f_s(e_m, r, e_m^i) = \mu \iff f_{tr}(h_{mf}, h_r, h_{e_m^i}) = 0$, which indicates that the proposed $f_s$ is equivalent to $f_{tr}$. Because $\lVert h_{e_n} M_r \rVert$ needs to be calculated for each $e_n$, computing the score function $f_s$ is costly. Hence, we transform part of the formula for $f_s$ so that the transformation of each $h_{e_n}$ no longer has to be computed. Finally, to compensate for the offset introduced by the negative sampling function $Q(e_m^i)$ (Jean et al., 2015), we adjust $f_s(e_m, r, e_m^i)$ accordingly.
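The relationship between a scaled compatibility score and a translation distance, and the role of the proposal correction in a sampled softmax, can be checked numerically with a small sketch. Everything below is an assumption made for illustration: the compatibility is taken to be $\mu$ times the cosine between the translated head and the tail, the relation projection $M_r$ is treated as already applied to the input vectors, and the correction subtracts $\log Q$ from each logit as is common for sampled softmax. For unit vectors $a$ and $b$, $\lVert a - b \rVert^2 = 2 - 2\,a \cdot b$, so the score reaches $\mu$ exactly when the distance is zero.

```python
import math


def normalise(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def compatibility(head, rel, tail, mu=10.0):
    """mu * cosine(head + rel, tail); inputs are assumed to be already
    projected into the relation space (i.e. M_r has been applied)."""
    a = normalise([h + r for h, r in zip(head, rel)])
    b = normalise(tail)
    return mu * sum(x * y for x, y in zip(a, b))


def translation_distance(head, rel, tail):
    """Euclidean distance between the normalised (head + rel) and tail."""
    a = normalise([h + r for h, r in zip(head, rel)])
    b = normalise(tail)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sampled_softmax_loss(head, rel, positive, negatives, q, mu=10.0):
    """Negative log-likelihood of the positive tail among sampled tails.

    q: proposal probabilities, positive first, then one per negative.
    Each logit is corrected by -log q (assumed form of the correction).
    """
    tails = [positive] + negatives
    logits = [
        compatibility(head, rel, t, mu) - math.log(p) for t, p in zip(tails, q)
    ]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]


# Hypothetical vectors: head + rel points exactly at the matching tail.
head, rel = [1.0, 0.0], [0.0, 1.0]
matching, mismatched = [1.0, 1.0], [1.0, -1.0]
loss_good = sampled_softmax_loss(head, rel, matching, [mismatched], [0.5, 0.5])
loss_bad = sampled_softmax_loss(head, rel, mismatched, [matching], [0.5, 0.5])
```

With these toy vectors the matching tail attains the maximum score and zero distance, and the loss is much smaller when the matching tail is treated as the positive than when the mismatched one is.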

Masked Mention Modeling
In contrast to MNeM, MMeM transfers the semantic information in neighboring entities back to the masked mention $e_m$. Here $Y_m$ is the ground-truth representation of $e_m$ and $h_{ip} = \Gamma_p(w_i) \in \mathbb{R}^{d_2}$, where $\Gamma_p$ is the pre-trained embedding of BERT in our medical corpora. The mention-span representation obtained by our model is $h_{mf}$. For a sample $s$, the MMeM loss $\mathcal{L}_{MMeM}$ is calculated via Mean-Squared Error over $\mathcal{M}_s$, the set of mentions of sample $s$.

Training Objective
In SMedBERT, the training objectives mainly consist of three parts, including the self-supervised loss proposed in previous works and the mention-neighbor context modeling losses proposed in our work. Our model can be applied to medical text pre-training directly in different languages as long as high-quality medical KGs can be obtained. The total loss is $\mathcal{L}_{total} = \mathcal{L}_{EX} + \lambda_1 \mathcal{L}_{MNeM} + \lambda_2 \mathcal{L}_{MMeM}$, where $\mathcal{L}_{EX}$ is the sum of the sentence-order prediction (SOP) (Lan et al., 2020) and masked language modeling losses, and $\lambda_1$ and $\lambda_2$ are hyperparameters.
MC-BERT (Zhang et al., 2020b) is pre-trained over a Chinese medical corpus via masking tokens of different granularities. We also pre-train BERT using our corpora, denoted as BioBERT-zh. KEPLMs: We employ two SOTA KEPLMs continually pre-trained on our medical corpora as our baseline models, including ERNIE-THU (Zhang et al., 2019) and KnowBERT (Peters et al., 2019). For a fair comparison, KEPLMs that use additional resources other than the KG embedding are excluded (see Section 2), and all the baseline KEPLMs are injected with the same KG embedding. The detailed parameter settings and training procedure are in Appendix B.

Intrinsic Evaluation
To evaluate the semantic representation ability of SMedBERT, we design an unsupervised semantic similarity task. Specifically, we extract all entity pairs with equivalence relations in the KGs as positive pairs. For each positive pair, we use one of the entities as the query entity and the other as the positive candidate, which is used to sample other entities as negative candidates. We denote this dataset as D1. Besides, the entities in the same positive pair often have many neighbors in common. We select positive pairs with large proportions of common neighbors as D2. Additionally, to verify the ability of SMedBERT to enhance low-frequency mention representations, we extract all positive pairs with at least one low-frequency mention as D3. In total, there are 359,358, 272,320 and 41,583 samples for D1, D2 and D3, respectively. We describe the details of collecting data and embedding words in Appendix C. In this experiment, we compare SMedBERT with three types of models: classical word embedding methods (SGNS (Mikolov et al., 2013a), GLOVE (Pennington et al., 2014)), PLMs and KEPLMs. We compute the similarity between the representation of each query entity and all the other entities, retrieving the most similar one. The evaluation metric is top-1 accuracy (Acc@1). Experiment results are shown in Table 1. From the results, we observe that: (1) SMedBERT greatly outperforms all baselines, especially on dataset D2 (+1.36%), where most positive pairs have many shared neighbors, demonstrating the ability of SMedBERT to utilize semantic information from the global context. (2) On dataset D3, SMedBERT improves the performance significantly (+1.01%), indicating that our model is effective at enhancing the representation of low-frequency mentions.
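The retrieval protocol described above can be sketched in a few lines. The similarity measure is not stated in this section, so cosine similarity is an assumption here, as are the function names and the toy embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def acc_at_1(embeddings, positive_pairs):
    """Fraction of queries whose most similar other entity is the positive.

    embeddings: dict entity name -> vector.
    positive_pairs: list of (query, positive candidate) name pairs.
    """
    hits = 0
    for query, positive in positive_pairs:
        # Retrieve the single most similar entity, excluding the query itself.
        best = max(
            (name for name in embeddings if name != query),
            key=lambda name: cosine(embeddings[query], embeddings[name]),
        )
        hits += int(best == positive)
    return hits / len(positive_pairs)


# Hypothetical embeddings: two pairs of near-synonymous entities.
toy = {
    "a": [1.0, 0.0],
    "a_alias": [0.9, 0.1],
    "b": [0.0, 1.0],
    "b_alias": [0.1, 0.9],
}
score = acc_at_1(toy, [("a", "a_alias"), ("b", "b_alias")])
```

Ranking against every other entity, rather than against a small sampled candidate list, makes the metric stricter; a variant restricted to the sampled negative candidates would only change the iterable passed to `max`.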

Results of Downstream Tasks
We first evaluate our model on NER and RE tasks, which are closely related to the entities in the input texts. Table 2 shows the performance on medical NER and RE tasks. From the results, we can observe that: (1) Compared with PLMs trained on open-domain corpora, KEPLMs with medical corpora and knowledge facts achieve better results. (2) SMedBERT improves substantially over the strongest baseline on the two NER datasets (+0.88%, +2.07%) and on the RE tasks (+0.68%, +0.92%). We also evaluate SMedBERT on QA, QM and NLI tasks; the performance is shown in Table 3. We can observe that SMedBERT improves the performance consistently on these datasets (+0.90% on QA, +0.89% on QM and +0.63% on NLI). In general, Table 2 and Table 3 show that injecting domain knowledge, especially structured semantic knowledge, improves the results.

Influence of Entity Hit Ratio
In this experiment, we explore the model performance on NER and RE tasks with different entity hit ratios, which control the proportion of knowledge-enhanced mention-spans in the samples. The average number of mention-spans per sample is about 40. Figure 3 illustrates the performance of SMedBERT and ERNIE-med (Zhang et al., 2019). From the results, we can observe that: (1) The performance improves significantly at the beginning and then remains stable as the hit ratio increases. This shows that the heterogeneous knowledge is beneficial for language understanding, and indicates that too many knowledge facts do not further improve model performance due to knowledge noise (Liu et al., 2020b). (2) Compared with previous approaches, our SMedBERT model improves performance more strongly and more stably.

Influence of Neighboring Entity Number
We further evaluate the model performance under different values of $K$ on the test sets of DXY-NER and DXY-RE. Figure 4 shows the model results with $K \in \{5, 10, 20, 30\}$. In our settings, SMedBERT achieves its best performance on the different tasks around $K = 10$. The results show that the model performance first increases and then decreases as $K$ grows. This phenomenon also points to the knowledge noise problem: injecting too much knowledge from neighboring entities may hurt performance.

Ablation Study
In Table 4, we choose three important model components for our ablation study and report the test set performance on four datasets of NER and RE tasks that are closely related to entities. Specifically, the three model components are the neighboring entity type attention, the whole hybrid attention module, and mention-neighbor context modeling, which includes the two masked modeling losses $\mathcal{L}_{MNeM}$ and $\mathcal{L}_{MMeM}$. From the results, we can observe that: (1) Without any one of the three mechanisms, our model still performs competitively with the strong baseline ERNIE-med (Zhang et al., 2019).
(2) After removing the hybrid attention module, the performance of our model declines the most, which indicates that injecting rich heterogeneous knowledge of neighboring entities is effective.
Table 4: Ablation study of SMedBERT on four datasets (testing set). Due to the space limitation, we use the abbreviations "D5", "D6", "D7", and "D8" to represent the cMedQANER, DXY-NER, CHIP-RE, and DXY-RE datasets respectively.

Conclusion
In this work, we address medical text mining tasks with SMedBERT, our proposed KEPLM with structured semantics. We inject the entity type semantic information of neighboring entities into the node attention mechanism via a heterogeneous feature learning process. Moreover, we treat the neighboring entity structures as additional global contexts to predict the masked candidate entities based on mention-spans and vice versa. The experimental results show significant improvements of our model on various medical NLP tasks and on the intrinsic evaluation. There are two research directions that can be further explored: (1) injecting deeper knowledge by using "farther neighboring" entities as contexts; (2) further enhancing the semantic representation of Chinese medical long-tail entities.

A Data Source
A.1 Pre-training Data

A.1.1 Training Corpora
The pre-training corpus is crawled from DXY BBS (Bulletin Board System), a very popular Chinese social network for doctors, medical institutions, life scientists, and medical practitioners. The BBS has more than 30 channels, which contain 18 forums and 130 fine-grained groups, covering most medical domains. For our pre-training purposes, we crawl texts from channels about clinical medicine, pharmacology, public health and consulting. For text pre-processing, we mainly follow previously published methods. Additionally, (1) we remove all URLs, HTML tags, e-mail addresses, and all tokens except characters, digits, and punctuation; (2) all documents shorter than 256 are discarded, while documents longer than 512 are cut into shorter text segments.
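The two additional pre-processing rules can be sketched as follows. This is an illustrative approximation, not the pipeline actually used: the regular expressions, the use of Unicode categories to decide what counts as a character, digit or punctuation mark, the interpretation of the 256 and 512 limits as character counts, and the decision to keep a short trailing segment are all assumptions.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")


def clean(text):
    """Strip HTML tags, URLs and e-mail addresses, then keep only letters
    (including CJK), digits, punctuation and whitespace."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch)[0] in "LNP"
    )
    # Collapse the runs of whitespace left behind by the removals.
    return " ".join(text.split())


def segment(doc, min_len=256, max_len=512):
    """Drop documents shorter than min_len characters and cut longer ones
    into chunks of at most max_len characters (lengths assumed to be
    measured in characters; a short final chunk is kept)."""
    if len(doc) < min_len:
        return []
    return [doc[i:i + max_len] for i in range(0, len(doc), max_len)]


# Hypothetical forum snippet containing a tag, an e-mail address and a URL.
sample = "<p>联系 a@b.com 或访问 http://x.y/z 。</p>"
cleaned = clean(sample)
```

Tags are removed before URLs and e-mail addresses so that an address wrapped directly in markup is still matched as a whole token.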

A.1.2 Knowledge Graph
The DXY knowledge graph is constructed by extracting structured text from the DXY website, which includes information on diseases, drugs and hospitals edited by certified medical experts, so the quality of the KG is guaranteed. The KG is mainly disease-centered and includes a total of 3,764,711 triples, 152,508 unique entities, and 44 types of relations. The details of Symptom-In-Chinese from OpenKG are available online. After fusing the two KGs, we finally obtain 26 types of entities, 274,163 unique entities, 56 types of relations, and 4,390,726 triples.

A.2 Task Data
We choose the four large-scale datasets in the ChineseBLUE tasks (Zhang et al., 2020b), namely cMedQANER, cMedQQ, cMedQNLI and cMedQA, while the others are ignored due to their limited dataset size. WebMedQA (He et al., 2019) is a real-world Chinese medical question answering dataset, and the CHIP-RE dataset is collected from online health consultancy websites. Note that since both the WebMedQA and cMedQA datasets are very large and we have many baselines to compare, we randomly sample the official training, development and test sets to form corresponding smaller versions for our experiments. DXY-NER and DXY-RE are datasets from real medical application scenarios provided by a prestigious Chinese medical company. DXY-NER contains 22 unique entity types, and DXY-RE contains 56 relation types. These two datasets are collected from the medical forum of DXY and from books in the medical domain. Annotators are selected from junior and senior students with a clinical medical background. For quality control, the two datasets are annotated twice by different groups of annotators. An expert with a medical background manually performs an additional quality check when the annotated results are inconsistent, and performs a sampling quality check when the results are consistent. Table 5 shows the dataset sizes of our experiments.
Model Details. We align all mention-spans to entities in the KG by exact match, for the purpose of comparison with ERNIE-THU (Zhang et al., 2019). The negative sampling function is defined as $Q(e_m^i) = t_{e_m^i} / C_{e_m^i}$, where $C_{e_m^i}$ is the sum of the frequencies of all mentions with the same type as $e_m^i$. The Mention-neighbor Hybrid Attention module is inserted after the tenth transformer encoder layer to compare with KnowBERT (Peters et al., 2019), while we perform the Mention-neighbor Context Modeling based on the output of the BERT encoder. We use the base version of all PLMs in the experiments. The size of SMedBERT is 474MB, of which 393MB are components of BERT and the added 81MB is mostly the KG embedding. Results are averaged over 5 runs with different random seeds and the same hyperparameters.
Training Procedure. We strictly follow the original pre-training process and parameter settings of the other KEPLMs. We only adapt their publicly available code from English to Chinese and use the knowledge embedding trained on our medical KG. For a fair comparison, the pre-training process of SMedBERT is mostly set based on ERNIE-THU (Zhang et al., 2019), without the layer-specific learning rates used in KnowBERT (Peters et al., 2019). We only pre-train SMedBERT on the collected medical data for 1 epoch. In pre-training
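As a small worked example of the negative sampling function, the sketch below computes $Q(e) = t_e / C_e$ from frequency counts, where $C_e$ sums the frequencies of all entities sharing the type of $e$. The function name, the entity names and the counts are hypothetical.

```python
from collections import defaultdict


def negative_sampling_probs(freq, entity_type):
    """Return Q(e) = t_e / C_e for every entity.

    freq: dict entity -> corpus frequency t_e.
    entity_type: dict entity -> type label.
    C_e is the total frequency of all entities with the same type as e.
    """
    type_total = defaultdict(int)
    for entity, count in freq.items():
        type_total[entity_type[entity]] += count
    return {
        entity: count / type_total[entity_type[entity]]
        for entity, count in freq.items()
    }


# Hypothetical counts: two diseases and one drug.
freq = {"pneumonia": 30, "influenza": 10, "aspirin": 5}
types = {"pneumonia": "disease", "influenza": "disease", "aspirin": "drug"}
q = negative_sampling_probs(freq, types)
```

Under this definition the probabilities sum to one within each entity type, so negatives are effectively drawn from entities of the same type in proportion to their frequency.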