Incorporating medical knowledge in BERT for clinical relation extraction

In recent years, pre-trained language models (PLMs) such as BERT have proven to be very effective in diverse NLP tasks such as Information Extraction, Sentiment Analysis and Question Answering. Trained on massive amounts of general-domain text, these models capture rich syntactic, semantic and discourse information. However, due to the differences between general-domain and domain-specific text (e.g., Wikipedia versus clinical notes), these models may not be ideal for domain-specific tasks (e.g., extracting clinical relations). Furthermore, properly understanding clinical text may require additional medical knowledge. To address these issues, we conduct a comprehensive examination of different techniques for adding medical knowledge to a pre-trained BERT model for clinical relation extraction. Our best model outperforms the state-of-the-art systems on the benchmark i2b2/VA 2010 clinical relation extraction dataset.


Introduction
In recent years, pre-trained language models (PLMs) such as ELMo (Peters et al., 2017), BERT (Devlin et al., 2019), XLNet and GPT (Radford et al., 2018) have become very popular, as they can effectively boost the performance of diverse NLP tasks such as Information Extraction (Shi and Lin, 2019; Jia et al., 2020), Sentiment Analysis (Gao et al., 2019), Question Answering (Lv et al., 2020) and language entailment (Devlin et al., 2019). These models are trained on large text corpora using self-supervised tasks such as masked language modeling (MLM) and next sentence prediction. As they learn meaningful context-sensitive text embeddings, they are frequently used to encode input text in many downstream text analysis tasks. However, PLMs trained on general-domain text (e.g., books, Wikipedia and web data) may not be ideal for domain-specific NLP applications (e.g., biomedical NLP). In this research, we explore how medical knowledge can be added to PLMs to facilitate clinical relation extraction.
Previously, significant effort has been made to add domain knowledge to PLMs. Based on the type of knowledge added, we can group this work into two categories: integrating domain text in PLMs (Lee et al., 2020; Peng et al., 2019a; Gu et al., 2020) and integrating domain-specific knowledge graphs into PLMs (Wang et al., 2021; Peters et al., 2019). In this study, we conduct a comprehensive investigation of these methods and test their effectiveness in integrating knowledge from the Unified Medical Language System (UMLS) into BERT for clinical relation extraction. We focus on UMLS because it is one of the most widely used biomedical knowledge sources for clinical NLP. Among all the PLMs, we focus on BERT since it often achieves state-of-the-art performance on diverse NLP tasks. The main contributions of our work include:
• We conducted a comprehensive empirical analysis of the effectiveness of diverse knowledge integration techniques that combine medical knowledge encoded in UMLS with embeddings from pre-trained BERT models for clinical relation extraction.
• We proposed several knowledge fusion methods, such as ClinicalBERT-EE-RI-CT/ST/SG, ClinicalBERT-EE-ED-CT and ClinicalBERT-EE-KB-MLM, for clinical relation extraction.
• Our proposed method ClinicalBERT-EE-RI-ST achieved state-of-the-art performance on a benchmark clinical relation extraction dataset.

Related Work
In this section, we survey representative work on the topics most relevant to this research: (a) incorporating domain text in BERT during model training and (b) combining knowledge graph information with BERT. We also summarize the state-of-the-art techniques for extracting clinical relations from text.
Incorporating domain text in BERT: Quite a few BERT models have been trained (or fine-tuned) with biomedical text: BioBERT (Lee et al., 2020) is pre-trained on PubMed abstracts and PMC full-text articles; ClinicalBERT (Alsentzer et al., 2019) is pre-trained on the clinical notes in the MIMIC-III database (Johnson et al., 2016); BlueBERT (Peng et al., 2019a) is pre-trained on PubMed abstracts and the clinical notes in MIMIC-III; PubMedBERT (Gu et al., 2020) is pre-trained using abstracts from PubMed and full-text articles from PMC. Among them, BioBERT and BlueBERT are initialized with weights from a general-domain BERT model, ClinicalBERT is initialized from BioBERT, and PubMedBERT is trained from scratch.
Combining knowledge graph information with BERT: The simplest method of combining knowledge graph information with BERT is concatenation. For example, (Jeong et al., 2020) combines knowledge graph embeddings trained using Graph Convolutional Networks (GCNs) with BERT embeddings for citation recommendation. Efforts to inject knowledge graph information directly into BERT fall into three categories. (a) Joint optimization with knowledge graph objectives: Clinical KB-BERT (Hao et al., 2020) pre-trained BERT with a knowledge graph objective. In addition to predicting masked words, a triplet classification objective is added: given a triplet of two concepts and a relation in UMLS, the model aims to correctly predict whether the relationship holds between the two concepts. (b) Fusing entity embeddings from knowledge graphs with BERT: (Peters et al., 2019) first retrieves pre-trained entity embeddings from a knowledge graph, then uses them to update BERT word embeddings via word-to-entity attention. (Weinzierl et al., 2020) incorporates entity embeddings learned from a UMLS knowledge graph into BERT using adversarial learning. (c) Augmenting BERT input with knowledge graph information: (Liu et al., 2020) presents K-BERT, in which triples from knowledge graphs are added to input sentences before they are sent to BERT. In (Mitra et al., 2019), relevant knowledge statements are assigned to each training instance and BERT is fine-tuned on the modified training data to facilitate question answering tasks.
Clinical relation extraction: Early work on clinical relation extraction employed supervised machine learning with a wide range of hand-crafted features, including lexical, syntactic and semantic features extracted from external knowledge resources such as UMLS, cTAKES and Medline. Among them, (de Bruijn et al., 2010; Minard et al., 2011) derived concept mappings and concept types from UMLS. Later, pre-trained word embeddings (e.g., Word2Vec) became the most popular input features for relation extraction, with a neural network-based classifier (e.g., CNN, LSTM, GCN) often used to predict clinical relations from the word embeddings (Sahu et al., 2016; Luo et al., 2018; Li et al., 2019; Ningthoujam et al., 2019). Word embedding features were frequently combined with additional features such as word types, POS tags, IOB encodings of semantic concepts, relative distance, and dependency relations to further improve performance (Hasan et al., 2020).
Recently, BERT-based text embeddings have gained dominance due to their superior performance (Peng et al., 2019b). Early work in this line combined BERT embeddings with traditional IOB tags. (Hasan et al., 2020) combines part-of-speech, IOB encoding, relative distance and dependency tree information with BERT. (He et al., 2020; Weinzierl et al., 2020) utilized pre-trained UMLS knowledge graph embeddings to enhance BERT. Despite a substantial body of research on clinical relation extraction, it remains an open question which method best integrates biomedical knowledge graphs (e.g., UMLS) into BERT for clinical relation extraction.

Methodology
The main steps in this research include: (a) generating text embeddings using BERT, (b) aligning the entities in text with the concepts in UMLS, (c) generating UMLS knowledge graph embeddings and (d) integrating UMLS knowledge with BERT.

Generating Text Embeddings Using BERT

(Wu and He, 2019) shows that incorporating information about the target entities along with a BERT sentence representation greatly benefits relation classification. To implement this, given a sentence S, we insert four markers e11, e12, e21 and e22 at the beginning and end of the two target entities (e1, e2) in a relation. In the i2b2 relation extraction dataset, the ground-truth entity locations are provided as input to the relation extraction model. After inserting these special tokens, the sentence "The patient was given ibuprofen for high fever." with target entities "ibuprofen" and "fever" becomes: "[CLS] The patient was given e11 ibuprofen e12 for high e21 fever e22 . [SEP]". Based on the positions of the two target entities in the BERT output, entity embeddings (EE) can be calculated. The sentence embedding derived from the [CLS] token embedding and the entity embeddings are then concatenated and passed through a fully connected layer to generate a representation that contains both sentence and entity information (BERT+EE).
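The marker-insertion step above can be sketched as follows. This is a minimal illustration with word-level spans; the function name `insert_entity_markers` is our own, and in practice the markers would be registered as special tokens in the BERT tokenizer vocabulary:

```python
def insert_entity_markers(tokens, e1_span, e2_span):
    """Insert e11/e12 around entity 1 and e21/e22 around entity 2.

    tokens: list of word tokens; spans are inclusive (start, end) word
    indices, assumed non-overlapping, with entity 1 before entity 2.
    """
    s1, t1 = e1_span
    s2, t2 = e2_span
    out = (tokens[:s1] + ["e11"] + tokens[s1:t1 + 1] + ["e12"]
           + tokens[t1 + 1:s2] + ["e21"] + tokens[s2:t2 + 1] + ["e22"]
           + tokens[t2 + 1:])
    return ["[CLS]"] + out + ["[SEP]"]

sent = "The patient was given ibuprofen for high fever .".split()
# "ibuprofen" is word 4, "fever" is word 7
marked = insert_entity_markers(sent, (4, 4), (7, 7))
```

The entity embeddings are then pooled from the marked positions in the BERT output and concatenated with the [CLS] embedding before the fully connected layer.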

Text and UMLS Concept Alignment
To incorporate external knowledge from UMLS into language models, we need to identify UMLS concepts in clinical notes. We use Apache cTAKES (Savova et al., 2010), a clinical Text Analysis and Knowledge Extraction System that extracts clinical information from unstructured text, to extract named entity mentions in clinical notes and align them with concepts in UMLS. cTAKES processes clinical notes, identifies clinical named entities such as drugs, diseases/disorders, signs/symptoms, anatomical sites and procedures, and maps them to UMLS concepts. Using cTAKES, we can map 46,305 of the 58,688 entities in our dataset to UMLS concepts.

Generating UMLS Knowledge Graph Embeddings
The Unified Medical Language System (UMLS) (Bodenreider, 2004) is a repository of biomedical vocabularies developed by the US National Library of Medicine. It has three knowledge sources. (a) The Metathesaurus integrates millions of concepts from over 200 vocabularies. It is organized by concept; each concept is characterized by a unique concept identifier (CUI), a definition, attributes and relationships with other concepts. For example, the concept "Headache" (CUI C0018681) has the definition "The symptom of pain in the cranial region" and is related to the concept "Acetanilide" (CUI C0000973) through the relation "may treat". (b) The Semantic Network provides a consistent categorization of all concepts represented in the Metathesaurus. Each concept in the Metathesaurus is assigned one or more semantic types, which are linked with each other through semantic relationships. Each semantic type has an identifier, a definition, a few examples, and a few relationships. Semantic groups are coarser-grained groupings of semantic types. For example, the semantic type of "Headache" is "Sign or Symptom" and its semantic group is "Disorder". (c) The SPECIALIST Lexicon and Lexical Tools include a large syntactic lexicon and tools for normalizing strings, generating lexical variants and creating indexes. Both the Metathesaurus and the Semantic Network can be viewed as multi-relational knowledge graphs with nodes representing concepts or semantic types and edges representing relations; each relationship is represented as a triplet (h, r, t), indicating a relation r between two nodes h and t. We use a subset of the Metathesaurus and the complete Semantic Network to create our knowledge graph (KG). Specifically, we start from all the CUIs extracted from our dataset using cTAKES, then select a subset of the Metathesaurus by collecting all CUIs and relations that are one hop away from this initial set of CUIs.
We connect this graph with the Semantic Network by including the CUIs and their semantic type relationships. The resulting knowledge graph contains 312,474 nodes and 1,613,019 relations; an example can be seen in Figure 1.
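The one-hop selection over Metathesaurus triples can be sketched as follows. This is a minimal illustration: `one_hop_subgraph`, the `has_semantic_type` relation label and the toy triples are our own (only the "Headache" and "Acetanilide" CUIs come from the example quoted earlier; the real extraction runs over the full UMLS relation tables):

```python
def one_hop_subgraph(triples, seed_cuis):
    """Keep every (h, r, t) triple with at least one endpoint in the
    seed CUI set, and return the node set of the kept triples."""
    seeds = set(seed_cuis)
    kept = [(h, r, t) for (h, r, t) in triples if h in seeds or t in seeds]
    nodes = {n for (h, _, t) in kept for n in (h, t)}
    return kept, nodes

# Toy triples; C0018681 (Headache) and C0000973 (Acetanilide) are the
# CUIs from the running example, the rest are illustrative placeholders.
triples = [
    ("C0000973", "may_treat", "C0018681"),
    ("C0018681", "has_semantic_type", "T184"),  # Sign or Symptom
    ("CUI_X", "may_treat", "C0018681"),
    ("CUI_Y", "may_treat", "CUI_Z"),            # outside the neighborhood
]
kept, nodes = one_hop_subgraph(triples, {"C0018681"})
```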
Once the knowledge graph is created, we tested the effectiveness of several popular Knowledge Graph Embedding (KGE) models: the translation-based model TransE (Bordes et al., 2013), two semantic matching models, DistMult (Yang et al., 2014) and ComplEx (Trouillon et al., 2016), and two convolution-based models, ConvE (Dettmers et al., 2018) and ConvKB (Nguyen et al., 2017), to create UMLS knowledge graph embeddings. We evaluate these methods on a link prediction task, which predicts an entity that has a specific relation with a given entity, i.e., predicting h given (r, t) or t given (h, r). Among these KGE methods, ComplEx performed the best on the link prediction task; as a result, we only use knowledge graph embeddings from ComplEx in our experiments. From the KGE, we can extract concept embeddings, semantic type embeddings, semantic group embeddings and relation embeddings.
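As an illustration of the scoring function behind the chosen KGE model, ComplEx scores a triple (h, r, t) as Re(<h, r, conj(t)>) over complex-valued embeddings. The sketch below shows only the scoring step on toy random embeddings, not the actual training or link-prediction evaluation:

```python
import numpy as np

def complex_score(h, r, t):
    """ComplEx triple score Re(<h, r, conj(t)>); a higher score means
    the triple (h, r, t) is judged more plausible."""
    return float(np.real(np.sum(h * r * np.conj(t))))

rng = np.random.default_rng(0)
dim = 4  # toy dimension for illustration only
h = rng.normal(size=dim) + 1j * rng.normal(size=dim)
r = rng.normal(size=dim) + 1j * rng.normal(size=dim)
t = rng.normal(size=dim) + 1j * rng.normal(size=dim)

# Because r is complex-valued, the score is asymmetric in general:
# complex_score(h, r, t) need not equal complex_score(t, r, h).
s = complex_score(h, r, t)
```

This asymmetry is what lets ComplEx model directed relations such as "may treat", which DistMult (whose score is symmetric in h and t) cannot distinguish.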

Integrating UMLS knowledge with BERT
In our experiments, we primarily use ClinicalBERT, trained on clinical text corpora. We systematically investigate different techniques to infuse knowledge from UMLS into pre-trained ClinicalBERT. We have examined the following methods.

Figure 1: A snippet of a knowledge graph created from UMLS
ClinicalBERT-EE-KGE: The first technique we tried is to combine knowledge graph embeddings with the text embeddings from ClinicalBERT and feed them to the relation classifier. For the two entities in an input sentence, we retrieve their respective concept embeddings (CT), semantic type embeddings (ST) and semantic group embeddings (SG) from the KGE. In addition, for a pair of concepts mapped from the two entities in a sentence, we use the KGE to predict the UMLS relation between them and retrieve the corresponding UMLS relation embedding. Finally, we concatenate all the KGE embeddings with the sentence and entity embeddings from ClinicalBERT for relation classification. Note that in this approach, text embeddings and knowledge graph embeddings live in two separate embedding spaces.

ClinicalBERT-EE-MLP: Effectively merging knowledge graph embeddings with BERT can be tricky: pre-trained language models such as BERT are typically fine-tuned for only 2 to 5 epochs with a small learning rate, whereas graph embedding features extracted from a KGE need to be trained much longer with a higher learning rate. If we directly concatenate the BERT output with the KGE features, the relation classifier might not benefit much from the KGE features. To solve this issue, we first train a multi-layer perceptron (MLP) on the knowledge graph embeddings for relation classification. The output of the MLP hidden layer is then combined with the BERT text embeddings for relation classification. Using a trained MLP ensures that the model does not underfit when trained in an ensemble with pre-trained BERT models for a small number of epochs.

ClinicalBERT with Relation Indicator: In this method, relevant knowledge from the knowledge graph is injected into each input sentence, transforming the original sentence into a knowledge-enriched text input: we add knowledge from UMLS as the second sentence in the BERT input.
Then, we feed both the original input sentence and the synthesized second sentence to pre-trained ClinicalBERT, which uses these knowledge-enriched sentences to predict relation labels. With this method, we inject UMLS knowledge directly into the BERT embedding space. To construct the second sentence, we first find the corresponding CUIs for the two entities in a sentence using cTAKES. We then use the pre-trained KGE to predict the UMLS relation between them and construct the second input sentence in the form "concept1 relation concept2". For the input sentence "The patient was given ibuprofen for high fever", we first map "ibuprofen" and "fever" to their UMLS CUIs. The pre-trained KGE predicts that the UMLS relation between them is "may_treat", so we construct the second sentence as "ibuprofen may treat fever". This KGE-predicted UMLS relation can potentially act as a relation indicator that helps differentiate relation class labels. To pass this relation indicator information to BERT, we place special tokens before and after the relation indicator phrase; these tokens are used to extract the relation indicator embedding from BERT. Finally, the combined sentence embedding, entity embedding and relation indicator embedding are used for relation classification. For example, the final input would be "[CLS] The patient was given e11 ibuprofen e12 for high e21 fever e22 . [SEP] ibuprofen r31 may treat r31 fever [SEP]".

We also tried a variety of templates to generate the second sentence in which each entity is replaced by its semantic type or semantic group. In the same example, the second sentence would be "pharmacologic substance r31 may treat r31 sign or symptom" or "drug and chemical r31 may treat r31 disease and disorder", where "pharmacologic substance" and "sign or symptom" are the semantic types of "ibuprofen" and "fever", and "drug and chemical" and "disease and disorder" are their semantic groups. We call these models ClinicalBERT-EE-RI-CT, ClinicalBERT-EE-RI-ST and ClinicalBERT-EE-RI-SG, where "RI" stands for "relation indicator", "CT" for "concept", "ST" for "semantic type" and "SG" for "semantic group".

ClinicalBERT with Entity Definition: In this method, we fine-tune BERT not only with input sentences but also with text descriptions of the two entities. For the entities in an input sentence, we extract their corresponding concept definitions from UMLS and feed them to a separate BERT model to obtain concept embeddings (ClinicalBERT-EE-ED-CT). We can also generate semantic type embeddings from semantic type definitions (ClinicalBERT-EE-ED-ST). Text representations are extracted based on the special entity markers we inserted; we use the [CLS] token embeddings of the concept and semantic type definitions as the concept and semantic type embeddings. These embeddings are concatenated with the text embeddings of the input sentence for relation classification. There are no definitions for semantic groups in UMLS.

ClinicalBERT-EE-KB: UMLS knowledge is infused into BERT by jointly optimizing a knowledge graph objective and a masked language model objective. Jointly optimizing the two objectives can implicitly integrate knowledge from external knowledge graphs into language models. Here we adopt the pre-trained Clinical KB-BERT (Hao et al., 2020) in our analysis.
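The second-sentence templates for the RI-CT/ST/SG variants can be sketched as follows. The `second_sentence` helper and the lookup dictionaries are our own, hard-coded for the running example (in the actual system the mappings come from cTAKES and UMLS), and we assume here that the CT variant wraps the relation indicator in the same r31 special tokens as the ST/SG variants:

```python
# Hypothetical lookups; in the real pipeline these come from cTAKES/UMLS.
SEMANTIC_TYPE = {"ibuprofen": "pharmacologic substance",
                 "fever": "sign or symptom"}
SEMANTIC_GROUP = {"ibuprofen": "drug and chemical",
                  "fever": "disease and disorder"}

def second_sentence(e1, e2, relation, level="CT"):
    """Build the knowledge-enriched second sentence, wrapping the
    KGE-predicted relation indicator in r31 marker tokens."""
    if level == "ST":
        e1, e2 = SEMANTIC_TYPE[e1], SEMANTIC_TYPE[e2]
    elif level == "SG":
        e1, e2 = SEMANTIC_GROUP[e1], SEMANTIC_GROUP[e2]
    return f"{e1} r31 {relation} r31 {e2}"

st_variant = second_sentence("ibuprofen", "fever", "may treat", level="ST")
```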
ClinicalBERT-EE-KB-MLM: In this method, we pre-train BERT with UMLS information using only the masked language model (MLM) objective. We use the relation abbreviations provided by UMLS to map each triple to a natural language sentence (e.g., generating the sentence "fever may be treated by ibuprofen" from the triple (fever, may_be_treated_by, ibuprofen)). In this way, we obtain a set of sentences based on the triples in UMLS; in total, we created 1,613,019 UMLS sentences. We then fine-tune ClinicalBERT on these UMLS sentences using only the MLM objective. By transferring the knowledge graph into natural language text, we fuse UMLS knowledge with BERT in the same representation space. We call this model ClinicalBERT-EE-KB-MLM. Table 1 summarizes the methods we proposed to add domain knowledge to BERT, characterized along multiple dimensions: infusion stage, type of domain knowledge added, form of domain knowledge added and fusion method.
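The triple-to-sentence conversion used to build the KB-MLM pre-training corpus can be sketched as follows; `triple_to_sentence` is our own name, and we assume the verbalization simply expands the underscores in the UMLS relation abbreviation:

```python
def triple_to_sentence(head, relation, tail):
    """Verbalize a UMLS triple as a natural-language sentence for MLM
    pre-training by expanding the relation's underscore abbreviation."""
    return f"{head} {relation.replace('_', ' ')} {tail}"

sentence = triple_to_sentence("fever", "may_be_treated_by", "ibuprofen")
```

Applied over all triples in the knowledge graph, this yields the 1,613,019-sentence MLM corpus described above.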

Dataset Description
For this study, we use the clinical relation extraction dataset from the 2010 i2b2/VA challenge on Natural Language Processing for Clinical Records. This dataset contains discharge summaries and progress reports from different healthcare providers. The relation extraction task is to identify nine target relations between three types of medical concepts: treatments, problems and tests. The dataset used during the 2010 i2b2/VA challenge included a total of 394 training reports, 477 test reports, and 877 un-annotated reports. After the challenge, however, only part of the data was publicly released; the dataset we downloaded from the i2b2 website 1 includes only 170 training and 256 test documents. Descriptions and statistics of the target relations can be found in Table 2.

Experiment Details
In all the BERT-based classifiers, we use both the BERT sentence embedding and the entity embeddings (EE). We train all classifiers for 5 epochs with a learning rate of 0.00002 and a batch size of 8, with a softmax classification layer.

For ClinicalBERT-EE-KGE, we concatenate the 700-dimensional KGE with the 768-dimensional BERT text representation; this 1468-dimensional vector is used as the input to the softmax layer. To implement ClinicalBERT-EE-MLP, we first train an MLP with KGE as the input, using a single hidden layer of 128 units, a tanh activation function and a final linear layer with a softmax function to make predictions. We train this model for 50 epochs with a learning rate of 0.001 and a batch size of 32. After the MLP is trained, we combine the output of the hidden layer (a 128-dimensional vector) with the BERT text embeddings (a 768-dimensional vector); this combined vector is connected to a linear layer and a softmax function to make predictions.

In ClinicalBERT-EE-RI-CT/ST/SG, we employ different variations of the second input sentence. The text embedding is created by combining the sentence embedding, entity embeddings and relation indicator embedding, and the combined embeddings are connected to a fully connected layer. In ClinicalBERT-EE-ED-CT/ST, we combine the sentence embedding (a 768-dimensional vector) with two concept embeddings (each a 768-dimensional vector) to create a 2304-dimensional input vector. ClinicalBERT-EE-KB uses only the 768-dimensional text embedding. We pre-train ClinicalBERT-EE-KB-MLM on 4.4 million tokens of text created from the UMLS knowledge graph for 425K steps with a learning rate of 0.00002, initializing the model with weights from ClinicalBERT.

1 https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
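The ClinicalBERT-EE-MLP fusion step can be sketched with the dimensions above; this is a minimal numpy forward pass in which untrained random weights stand in for the 700-to-128 tanh layer that is first trained on KGE features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights; in the actual model this 700 -> 128 tanh layer
# is first trained alone on KGE features (50 epochs, lr 0.001).
W1 = rng.normal(scale=0.01, size=(700, 128))
b1 = np.zeros(128)

def fuse(kge_vec, bert_vec):
    """Pass the 700-d KGE feature through the trained hidden layer,
    then concatenate with the 768-d BERT text embedding to form the
    896-d input to the final linear + softmax classifier."""
    hidden = np.tanh(kge_vec @ W1 + b1)        # (128,)
    return np.concatenate([bert_vec, hidden])  # (896,)

x = fuse(rng.normal(size=700), rng.normal(size=768))
```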

Baseline Methods
To compare with the current state of the art, we consider systems that employ the same number of training instances and define the classification task at the same granularity level as ours. So far, we have found only two existing systems (Li et al., 2019; Hasan et al., 2020) meeting these criteria. In addition, (He et al., 2020) and (Weinzierl et al., 2020), which enhance BERT with pre-trained UMLS knowledge graph embeddings, are included for comparison, as are baselines built on traditional embeddings such as Word2Vec (Mikolov et al., 2013) and Doc2Vec (Le and Mikolov, 2014). Finally, we consider a baseline that uses only sentence representations from BERT (BERT), to show the advantage of incorporating entity embeddings.

Performance Evaluation
In this section, we evaluate the effectiveness of the different methods for incorporating additional knowledge into BERT. We use an 80%-20% train/test split and report the average result over multiple runs for each model. We calculate per-class (9 classes) and weighted F1 scores; the results of all the models can be found in the results table.

However, ClinicalBERT-EE-MLP (F1=0.9124) did not perform as well. We hypothesize that important information is lost when the 128-dimensional hidden layer vectors are used (versus the 700-dimensional knowledge graph embedding vectors). Next, we inject knowledge graph information directly into the BERT input. Of the three variations, ClinicalBERT-EE-RI-ST (with semantic type information and a relation indicator in the second input) performed the best. This is our overall best-performing model, achieving an F1-score of 0.9181; the relation indicator predicted by the KGE may play an important role in boosting performance. Adding domain knowledge provides a 0.59% improvement over ClinicalBERT-EE. From the results of ClinicalBERT-EE-ED-CT and ClinicalBERT-EE-ED-ST, we can see that concept definitions and semantic type definitions did not help much. We hypothesize that the entity embeddings learned by BERT and the concept embeddings from the KGE may be more precise than embeddings learned from text definitions. Next, we move on to pre-training BERT with knowledge graph information. Here, ClinicalBERT-EE-KB is our second-best performing model, with an F1-score of 0.9177. Finally, when we pre-train BERT on knowledge graph sentences using only the masked language model objective, the result (F1=0.9101) drops slightly compared to ClinicalBERT-EE (F1=0.9127). This indicates that incorporating knowledge into BERT using a knowledge graph objective may be more effective than injecting UMLS sentences with a language model objective.