Joint Biomedical Entity and Relation Extraction with Knowledge-Enhanced Collective Inference

Compared to the general news domain, information extraction (IE) from biomedical text requires much broader domain knowledge. However, many previous IE methods do not utilize any external knowledge during inference. Due to the exponential growth of biomedical publications, models that do not go beyond their fixed set of parameters will likely fall behind. Inspired by how humans look up relevant information to comprehend a scientific text, we present KECI (Knowledge-Enhanced Collective Inference), a novel framework that utilizes external knowledge for joint entity and relation extraction. Given an input text, KECI first constructs an initial span graph representing its initial understanding of the text. It then uses an entity linker to form a knowledge graph containing relevant background knowledge for the entity mentions in the text. To make the final predictions, KECI fuses the initial span graph and the knowledge graph into a more refined graph using an attention mechanism. KECI takes a collective approach to linking mention spans to entities, integrating global relational information into local representations using graph convolutional networks. Our experimental results show that the framework is highly effective, achieving new state-of-the-art results on two benchmark datasets: BioRelEx (binding interaction detection) and ADE (adverse drug event extraction). For example, KECI achieves absolute improvements of 4.59% and 4.91% in F1 scores over the state-of-the-art on the BioRelEx entity and relation extraction tasks.


Introduction
With the accelerating growth of biomedical publications, it has become increasingly challenging to manually keep up with all the latest articles. As a result, developing methods for the automatic extraction of biomedical entities and their relations has attracted much research attention recently (Li et al., 2017; Fei et al., 2020; Luo et al., 2020). Many related tasks and datasets have been introduced, ranging from binding interaction detection (BioRelEx) (Khachatrian et al., 2019) to adverse drug event extraction (ADE) (Gurulingappa et al., 2012).
Many recent joint models for entity and relation extraction rely mainly on distributional representations and do not utilize any external knowledge source (Eberts and Ulges, 2020; Zhao et al., 2020). However, different from the general news domain, information extraction for the biomedical domain typically requires much broader domain-specific knowledge. Biomedical documents, whether formal (e.g., scientific papers) or informal (e.g., clinical notes), are written for domain experts. As such, they contain many highly specialized terms, acronyms, and abbreviations. In the BioRelEx dataset, we find that about 65% of the annotated entity mentions are abbreviations of biological entities; an example is shown in Figure 1. These unique characteristics pose great challenges to general-domain systems and even to existing scientific language models that do not use any external knowledge base during inference (Beltagy et al., 2019; Lee et al., 2019). For example, even though SciBERT (Beltagy et al., 2019) was pretrained on 1.14M scientific papers, our baseline SciBERT model still incorrectly predicts the type of the term UIM in Figure 1 to be "DNA" when it should be "Protein Motif". Since the biomedical literature is expanding at an exponential rate, models that do not go beyond their fixed set of parameters will likely fall behind.

Figure 2: KECI operates in three main steps: (1) initial span graph construction; (2) background knowledge graph construction; (3) fusion of these two graphs into a final span graph. KECI takes a collective approach, linking multiple mentions simultaneously to entities by incorporating global relational information using GCNs.
In this paper, we introduce KECI (Knowledge-Enhanced Collective Inference), a novel end-to-end framework that utilizes external domain knowledge for joint entity and relation extraction. Inspired by how humans comprehend a complex piece of scientific text, the framework operates in three main steps (Figure 2). KECI first reads the input text and constructs an initial span graph representing its initial understanding of the text. In a span graph, each node represents a (predicted) entity mention, and each edge represents a (predicted) relation between two entity mentions. KECI then uses an entity linker to form a background knowledge graph containing all potentially relevant biomedical entities from an external knowledge base (KB). For each entity, we extract its semantic types, its definition sentence, and its relational information from the external KB. Finally, KECI uses an attention mechanism to fuse the initial span graph and the background knowledge graph into a more refined graph representing the final output. Different from previous methods that link mentions to entities based solely on local contexts (Li et al., 2020b), our framework takes a more collective approach, linking multiple semantically related mentions simultaneously by leveraging global topical coherence. Our hypothesis is that if multiple mentions co-occur in the same discourse and are likely to be semantically related, then their referent entities should also be connected in the external KB. KECI integrates global relational information into mention and entity representations using graph convolutional networks (GCNs) before linking.
The benefit of collective inference can be illustrated by the example shown in Figure 2. The entity linker proposes two candidate entities for the mention FKBP12: one of semantic type "AA, Peptide, or Protein" and the other of semantic type "Gene or Genome". Selecting the correct candidate can be tricky, as FKBP12 is already tagged with the wrong type in the initial span graph (i.e., it is predicted to be a "Chemical" instead of a "Protein"). However, because of the structural resemblance between the mention pair ⟨FK506, FKBP12⟩ and the entity pair ⟨"Organic Chemical", "AA, Peptide, or Protein"⟩, KECI will link FKBP12 to the entity of semantic type "AA, Peptide, or Protein". As a result, the predicted type of FKBP12 will also be corrected to "Protein" in the final span graph.
Our extensive experimental results show that the proposed framework is highly effective, achieving new state-of-the-art biomedical entity and relation extraction performance on two benchmark datasets: BioRelEx (Khachatrian et al., 2019) and ADE (Gurulingappa et al., 2012). For example, KECI achieves absolute improvements of 4.59% and 4.91% in F1 scores over the state-of-the-art on the BioRelEx entity and relation extraction tasks. Our analysis also shows that KECI can automatically learn to select relevant candidate entities without any explicit entity linking supervision during training. Furthermore, because KECI considers text spans as the basic units for prediction, it can extract nested entity mentions.

Overview
KECI considers text spans as the basic units for feature extraction and prediction. This design choice allows us to handle nested entity mentions (Sohrab and Miwa, 2018). Moreover, joint entity and relation extraction can then be naturally formulated as the task of extracting a span graph from an input document. In a span graph, each node represents a (predicted) entity mention, and each edge represents a (predicted) relation between two entity mentions.
Given an input document D, KECI first enumerates all the spans (up to a certain length) and embeds them into feature vectors (Sec. 2.2). With these feature vectors, KECI predicts an initial span graph and applies a GCN to integrate initial relational information into each span representation (Sec. 2.3). KECI then uses an entity linker to build a background knowledge graph and applies another GCN to encode each node of the graph (Sec. 2.4). Finally, KECI aligns the nodes of the initial span graph and the background knowledge graph to make the final predictions (Sec. 2.5). We train KECI in an end-to-end manner without using any additional entity linking supervision (Sec. 2.6).
Overall, the design of KECI is partly inspired by previous research in educational psychology. Students' background knowledge plays a vital role in guiding their understanding and comprehension of scientific texts (Alvermann et al., 1985;Braasch and Goldman, 2010). "Activating" relevant and accurate prior knowledge will aid students' reading comprehension.

Span Encoder
Our model first constructs a contextualized representation for each input token using SciBERT (Beltagy et al., 2019). Let X = (x_1, ..., x_n) be the output of the token-level encoder, where n denotes the number of tokens in D. Then, for each span s_i whose length is not more than L, we compute its span representation s_i ∈ R^d as:

s_i = FFNN_g([x_START(i); x_END(i); x̂_i; φ(s_i)])   (1)

where START(i) and END(i) denote the start and end indices of s_i, respectively; x_START(i) and x_END(i) are the boundary token representations; x̂_i is an attention-weighted sum of the token representations in the span (Lee et al., 2017); φ(s_i) is a feature vector encoding the span length; and FFNN_g is a feedforward network with ReLU activations.
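As a concrete illustration, the span encoder can be sketched in NumPy. A single ReLU layer stands in for FFNN_g, and the span-width feature is reduced to a scalar; these simplifications, and the parameter names, are our assumptions rather than the paper's exact setup.

```python
import numpy as np

def span_representation(X, start, end, w_attn, W_g, b_g, max_width=20):
    """Sketch of Eq. 1: concatenate boundary token vectors, an
    attention-weighted sum over the span's tokens, and a width feature,
    then apply a ReLU layer standing in for FFNN_g."""
    span = X[start:end + 1]                   # token vectors inside the span
    scores = span @ w_attn                    # unnormalized attention scores
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    x_hat = attn @ span                       # attention-weighted sum (Lee et al., 2017)
    phi = np.array([(end - start + 1) / max_width])  # simplified width feature
    g = np.concatenate([X[start], X[end], x_hat, phi])
    return np.maximum(0.0, W_g @ g + b_g)     # FFNN_g with ReLU activation
```

With token dimension d, the concatenated input has dimension 3d + 1, which fixes the shape of the weight matrix W_g.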

Initial Span Graph Construction
With the extracted span representations, we predict the type of each span and the relation between each span pair jointly. Let E denote the set of entity types (including non-entity) and R the set of relation types (including non-relation). We first classify each span s_i:

e_i = softmax(FFNN_e(s_i))   (2)

where FFNN_e is a feedforward network mapping from R^d → R^|E|. We then employ another network to classify the relation of each span pair ⟨s_i, s_j⟩:

r_ij = σ(FFNN_r([s_i; s_j; s_i ∘ s_j]))   (3)

where ∘ denotes element-wise multiplication and FFNN_r is a mapping from R^{3×d} → R^|R|. We will use the notation r_ij[k] to refer to the predicted probability of s_i and s_j having the relation k.
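A minimal sketch of the two classification heads, with single linear layers standing in for the feedforward networks; the sigmoid over relations matches the binary cross-entropy relation loss used in training, and the parameter names are ours.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify_span(s_i, W_e):
    """Entity-type distribution for one span: e_i = softmax over |E| scores."""
    return softmax(W_e @ s_i)

def classify_relation(s_i, s_j, W_r):
    """Per-relation scores for a span pair, built from the concatenation
    [s_i; s_j; s_i * s_j] in R^{3d}; each relation type is scored
    independently with a sigmoid."""
    feats = np.concatenate([s_i, s_j, s_i * s_j])
    return 1.0 / (1.0 + np.exp(-(W_r @ feats)))
```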
At this point, one can already obtain a valid output for the task from the predicted entity and relation scores. However, these predictions are based solely on the local document context, which can be difficult to understand without any external domain knowledge. Therefore, our framework uses these predictions only to construct an initial span graph that will be refined later based on information extracted from an external knowledge source.
To maintain computational efficiency, we first prune out spans that are unlikely to be entity mentions: we keep only up to λn spans with the lowest predicted probability of being a non-entity. The value of λ is selected empirically and set to 0.5. Spans that pass the filter become nodes in the initial span graph. For every span pair ⟨s_i, s_j⟩, we create |R| directed edges from the node representing s_i to the node representing s_j; each edge represents one relation type and is weighted by the corresponding probability score in r_ij.
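The pruning step amounts to ranking spans by their predicted non-entity probability and keeping the λn best-scoring ones; a minimal sketch (function name and tie-breaking are ours):

```python
def prune_spans(non_entity_probs, n_tokens, lam=0.5):
    """Keep at most lam * n_tokens spans that are least likely to be
    non-entities; the survivors become nodes of the initial span graph."""
    k = max(1, int(lam * n_tokens))
    ranked = sorted(range(len(non_entity_probs)),
                    key=lambda i: non_entity_probs[i])
    return sorted(ranked[:k])  # indices of surviving spans, in original order
```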
Let G_s = {V_s, E_s} denote the initial span graph. We use a bidirectional GCN (Marcheggiani and Titov, 2017; Fu et al., 2019) to recursively update each span representation; the message-passing update relies on a feedforward network whose output dimension is the same as that of h^l_i. After multiple iterations of message passing, each span representation contains the global relational information of G_s. Let h_i denote the feature vector at the final layer of the GCN. Note that the dimension of h_i is the same as the dimension of s_i (i.e., h_i ∈ R^d).
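Since the exact bidirectional update is not reproduced here, the following is only one plausible form of a message-passing layer over the weighted span graph: each span aggregates messages along its outgoing and incoming edges, weighted by the predicted relation probabilities r_ij[k]. The separate per-relation forward and backward weight matrices are our assumption.

```python
import numpy as np

def span_gcn_layer(H, R, W_self, W_fwd, W_bwd):
    """One sketched message-passing step over the initial span graph.

    H: (m, d) span representations.
    R: (K, m, m) with R[k, i, j] = predicted probability r_ij[k].
    W_self: (d, d); W_fwd, W_bwd: (K, d, d) per-relation weights.
    """
    fwd = np.zeros_like(H)
    bwd = np.zeros_like(H)
    for k in range(R.shape[0]):
        fwd += R[k] @ (H @ W_fwd[k].T)    # messages along i -> j edges
        bwd += R[k].T @ (H @ W_bwd[k].T)  # messages along reversed edges
    return np.maximum(0.0, H @ W_self.T + fwd + bwd)  # ReLU update
```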

Background Knowledge Graph Construction
In this work, we utilize external knowledge from the Unified Medical Language System (UMLS) (Bodenreider, 2004). UMLS consists of three main components: the Metathesaurus, the Semantic Network, and the SPECIALIST Lexicon and Lexical Tools. The Metathesaurus provides information about millions of fine-grained biomedical concepts and the relations between them. To be consistent with the existing literature on knowledge graphs, we refer to UMLS concepts as entities. Each entity is annotated with one or more higher-level semantic types, such as Anatomical Structure, Cell, or Virus. In addition to relations between entities, there are also semantic relations between semantic types; for example, there is an affects relation from Acquired Abnormality to Physiologic Function. This information is provided by the Semantic Network. We first extract UMLS biomedical entities from the input document D using MetaMap, an entity mapping tool for UMLS (Aronson and Lang, 2010). We then construct a background knowledge graph (KG) from the extracted information. More specifically, we first create a node for every extracted biomedical entity. The semantic types of each entity are also modeled as type nodes that are linked to the associated entity nodes. Finally, we create an edge for every relevant relation found in the Metathesaurus and the Semantic Network. An example KG is shown in the grey shaded region of Figure 2: circles represent entity nodes, and rectangles represent nodes that correspond to semantic types.
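To make the construction concrete, here is a toy sketch of building the heterogeneous KG from linker output. The input dictionaries stand in for MetaMap/UMLS results, and every identifier below is hypothetical.

```python
def build_knowledge_graph(mentions, kb):
    """Create entity nodes for every candidate, type nodes for their
    semantic types, and edges for KB relations whose endpoints both
    appear among the candidates."""
    nodes, edges = set(), set()
    for span, entities in mentions.items():
        for ent in entities:
            nodes.add(('entity', ent))
            for sem_type in kb['types'].get(ent, []):
                nodes.add(('type', sem_type))
                edges.add((('entity', ent), 'has_type', ('type', sem_type)))
    # keep only KB relations whose endpoints are both in the graph
    present = {name for kind, name in nodes if kind == 'entity'}
    for head, rel, tail in kb['relations']:
        if head in present and tail in present:
            edges.add((('entity', head), rel, ('entity', tail)))
    return nodes, edges
```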
Note that we simply run MetaMap with its default options and do not tune it. In our experiments, we found that MetaMap typically returns many candidate entities unrelated to the input text. However, as discussed in Section 3.4, KECI can learn to ignore the irrelevant entities.
Let G_k = {V_k, E_k} denote the constructed background KG, where V_k and E_k are the node and edge sets, respectively. We use a set of UMLS embeddings pretrained by Maldonado et al. (2019) to initialize the representation of each node in V_k. We also use SciBERT to encode the UMLS definition sentence of each node into a vector and concatenate it to the initial representation. After that, since G_k is a heterogeneous relational graph, we use a relational GCN (Schlichtkrull et al., 2018) to update the representation of each node v_i:

v_i^{l+1} = ReLU( W_0^l v_i^l + Σ_{k∈R} Σ_{j∈N_i^k} (1/c_{i,k}) W_k^l v_j^l )

where N_i^k is the set of neighbors of v_i under relation k ∈ R, and c_{i,k} is a normalization constant set to |N_i^k|. After multiple iterations of message passing, the global relational information of the KG is integrated into each node's representation. Let v_i denote the feature vector at the final layer of the relational GCN. We further project each vector v_i to another vector n_i using a simple feedforward network, so that n_i has the same dimension as the span representations (i.e., n_i ∈ R^d).
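The relational-GCN update of Schlichtkrull et al. (2018) can be sketched as follows; the edge-list encoding and the direction convention (messages flow from source to destination) are our implementation choices.

```python
import numpy as np

def rgcn_layer(V, edges, W_rel, W_self):
    """One relational-GCN step: self-transform plus, for each relation k,
    the mean of transformed neighbor vectors (normalizer c_{i,k} = |N_i^k|).

    V: (m, d) node vectors; edges: dict relation -> list of (src, dst) pairs.
    """
    out = V @ W_self.T
    for k, pairs in edges.items():
        counts = {}
        for _, dst in pairs:                 # c_{i,k}: neighbor count per node
            counts[dst] = counts.get(dst, 0) + 1
        for src, dst in pairs:
            out[dst] += (W_rel[k] @ V[src]) / counts[dst]
    return np.maximum(0.0, out)              # ReLU activation
```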

Final Span Graph Prediction
At this point, we have two graphs: the initial span graph G s = {V s , E s } (Sec. 2.3) and the background knowledge graph G k = {V k , E k } (Sec. 2.4). We have also obtained a structure-aware representation for each node in each graph (i.e., h i for each span s i ∈ V s and n j for each entity v j ∈ V k ).
The next step is to soft-align the mentions and the candidate entities using an attention mechanism (Figure 3). Let C(s_i) denote the set of candidate entities for a span s_i ∈ V_s. For example, in Figure 2, the mention FKBP12 has two candidate entities, while FK506 has only one. For each candidate entity v_j ∈ C(s_i), we calculate a scalar score α_ij indicating how relevant v_j is to s_i:

α_ij = FFNN_c([h_i; n_j])

where FFNN_c is a feedforward network mapping from R^{2×d} → R. We then compute an additional sentinel vector c_i = FFNN_s(h_i) (Yang and Mitchell, 2017; He et al., 2020), where FFNN_s is another feedforward network mapping from R^d → R^d, along with a corresponding score α_i. Intuitively, c_i records the information of the local context of s_i, and α_i measures the importance of that information. After that, we compute the final knowledge-aware representation f_i for each span s_i as the attention-weighted combination of the candidate entity vectors and the sentinel vector, with the weights obtained by normalizing the scores {α_ij} ∪ {α_i}. The attention mechanism is illustrated in Figure 3.
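Under the description above, the fusion step can be sketched as follows. Scoring the sentinel with the same scorer as the candidates, and returning the attention-weighted combination directly as f_i, are our assumptions rather than the paper's exact formulation.

```python
import numpy as np

def fuse_knowledge(h_i, candidates, w_c, W_s):
    """Soft-align one span with its candidate entities.

    h_i: (d,) span vector; candidates: list of (d,) entity vectors n_j;
    w_c: (2d,) scoring vector standing in for FFNN_c;
    W_s: (d, d) weight standing in for FFNN_s.
    """
    c_i = np.maximum(0.0, W_s @ h_i)          # sentinel vector for local context
    vecs = candidates + [c_i]
    scores = np.array([w_c @ np.concatenate([h_i, v]) for v in vecs])
    scores = scores - scores.max()
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax: candidates + sentinel
    return sum(a * v for a, v in zip(alpha, vecs))  # knowledge-aware f_i in R^d
```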
With the extracted knowledge-aware span representations, we predict the final span graph in a way similar to Eq. 2 and Eq. 3:

e_i = softmax(FFNN_e(f_i)),   r_ij = σ(FFNN_r([f_i; f_j; f_i ∘ f_j]))   (9)

where FFNN_e is a mapping from R^d → R^|E|, and FFNN_r is a mapping from R^{3×d} → R^|R|. e_i is the final predicted probability distribution over possible entity types for span s_i, and r_ij is the final predicted probability distribution over possible relation types for the span pair ⟨s_i, s_j⟩.

Training
The total loss is computed as:

L = L^e_1 + L^r_1 + β (L^e_2 + L^r_2)

where L^e_* denotes the cross-entropy loss of span classification and L^r_* denotes the binary cross-entropy loss of relation classification. L^e_1 and L^r_1 are the loss terms for the initial span graph prediction (Eq. 2 and Eq. 3 of Section 2.3), while L^e_2 and L^r_2 are the loss terms for the final span graph prediction (Eq. 9 of Section 2.5). We apply a larger weight (β > 1) to the loss terms L^e_2 and L^r_2. We train the framework using only ground-truth labels of the entity and relation extraction tasks; we do not use any entity linking supervision in this work.
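A minimal sketch of the combined objective; the paper states only that the final-graph terms receive a larger weight, so the factor `beta` and its default value below are our illustration.

```python
def total_loss(l_e1, l_r1, l_e2, l_r2, beta=2.0):
    """Sum the initial-graph and final-graph losses, up-weighting the
    final-graph entity and relation terms (beta > 1)."""
    return l_e1 + l_r1 + beta * (l_e2 + l_r2)
```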

Data and Experiments Setup
Datasets and evaluation metrics We evaluate KECI on two benchmark datasets: BioRelEx and ADE. The BioRelEx dataset (Khachatrian et al., 2019) consists of 2,010 sentences from the biomedical literature that capture binding interactions between proteins and/or biomolecules. BioRelEx has annotations for 33 types of entities and 3 types of relations for binding interactions. The training, development, and test splits contain 1,405, 201, and 404 sentences, respectively. The training and development sets are publicly available; the test set is unreleased and can only be evaluated against using CodaLab. For BioRelEx, we report Micro-F1 scores. The ADE dataset (Gurulingappa et al., 2012) consists of 4,272 sentences extracted from medical reports that describe drug-related adverse effects. Two entity types (Adverse-Effect and Drug) and a single relation type (Adverse-Effect) are pre-defined. Similar to previous work (Eberts and Ulges, 2020), we conduct 10-fold cross-validation on ADE.

Baselines One baseline, KnowBertAttention, injects external knowledge into the SciBERT language model. The baseline first uses SciBERT to construct initial token-level representations. It then uses the KAR mechanism (Peters et al., 2019) to inject external knowledge from UMLS into the token-level vectors. Finally, it embeds text spans into feature vectors (Eq. 1) and uses the span representations to extract entities and relations in one pass (similar to Eq. 9).
For a fair comparison, all the baselines use SciBERT as the Transformer encoder.
A major difference between KECI and KnowBertAttention (Peters et al., 2019) is that KECI explicitly builds and extracts information from a multi-relational graph structure over the candidate entity mentions before the knowledge fusion process. In contrast, KnowBertAttention only uses SciBERT to extract features from the candidate entity mentions, and thus takes advantage only of entity-entity co-occurrence information. KECI, on the other hand, integrates more fine-grained global relational information (e.g., the binding interactions shown in Figure 2) into the mention representations. This difference leads to better overall performance for KECI, as discussed next.

Table 1 and Table 2 show the overall results on the development and test sets of BioRelEx, respectively. Compared to SentContextOnly, KECI achieves much higher performance, demonstrating the importance of incorporating external knowledge for biomedical information extraction. KECI also outperforms the baseline FlatAttention by a large margin, which shows the benefit of collective inference. In addition, our model performs better than the baseline KnowBertAttention. Finally, at the time of writing, KECI holds the first position on the BioRelEx leaderboard. Table 3 shows the overall results on ADE. KECI again outperforms all the baselines and state-of-the-art models such as SpERT (Eberts and Ulges, 2020) and SPAN Multi-Head. This further confirms the effectiveness of our framework.

Overall Results
Overall, the two datasets used in this work focus on two very different subareas of the biomedical domain, and KECI was able to push the state-of-the-art results on both. This indicates that our proposed approach is highly generalizable. In our ablation study, all the partial variants perform worse than our full model, showing that each component of KECI plays an important role.

Attention Pattern Analysis
There is no gold-standard set of correspondences between the entity mentions in the datasets and the UMLS entities; therefore, we cannot directly evaluate the entity linking performance of KECI. However, for each UMLS semantic type, we compute the average attention weight assigned to entities of that type (Table 5). Overall, we see that KECI typically pays the most attention to the relevant, informative entities while ignoring the irrelevant ones. Table 6 shows some examples from the ADE dataset that illustrate how incorporating external knowledge can improve joint biomedical entity and relation extraction.

Qualitative Analysis
In the first example, initially, there is no edge between the node "bleeding symptoms" and the node "warfarin", probably because of the distance between their corresponding spans in the original input sentence. However, KECI can link the term "warfarin" to a UMLS entity (CUI: C0043031), and the definition in UMLS says that warfarin is a type of anticoagulant that prevents the formation of blood clots. As the initial feature vector of each entity contains the representation of its definition (Sec. 2.4), KECI can recover the missing edge.
In the second example, the initial span graph is predicted to have three entities of type Adverse-Effect, which correspond to three different overlapping text spans. Among these three, only "retroperitoneal fibrosis" can be linked to a UMLS entity. It is also evident from the input sentence that one of these spans is related to "methysergide". As a result, KECI successfully removes the other two unlinked span nodes to create the final span graph.
In the third example, probably because of the phrase "due to", the node "endometriosis" is initially predicted to be of type Drug, and the node "acute abdomen" is predicted to be its Adverse-Effect. However, KECI can link the term "endometriosis" to a UMLS entity of semantic type Disease or Syndrome. As a result, the system can correct the term's type and also predict the right edges for the final span graph.
Finally, we also examined the errors made by KECI. One major issue is that MetaMap sometimes fails to return any candidate entity from UMLS for an entity mention. We leave the extension of this work to using multiple KBs as future work.

Related Work
Traditional pipelined methods typically treat entity extraction and relation extraction as two separate tasks (Zelenko et al., 2002; Zhou et al., 2005; Chan and Roth, 2011). Such approaches ignore the close interaction between named entities and their relation information and typically suffer from the error propagation problem. To overcome these limitations, many studies have proposed joint models that perform entity extraction and relation extraction simultaneously (Roth and Yih, 2007; Li and Ji, 2014; Li et al., 2017; Zheng et al., 2017; Bekoulis et al., 2018a,b; Fu et al., 2019; Zhao et al., 2020; Wang and Lu, 2020; Li et al., 2020b; Lin et al., 2020). In particular, span-based joint extraction methods have gained much popularity lately because of their ability to detect overlapping entities. For example, Eberts and Ulges (2020) propose SpERT, a simple but effective span-based model that utilizes BERT as its core. Recent follow-up work also closely follows the overall architecture of SpERT but differs in span-specific and contextual semantic representations. Despite their impressive performance, these methods are not designed specifically for the biomedical domain, and they do not utilize any external knowledge base. To the best of our knowledge, our work is the first span-based framework that utilizes external knowledge for joint entity and relation extraction from biomedical text.

Biomedical event extraction is a closely related task that has also received much attention from the research community (Poon and Vanderwende, 2010; Kim et al., 2013; V S S Patchigolla et al., 2017; Rao et al., 2017; Espinosa et al., 2019; Li et al., 2019; Ramponi et al., 2020; Yadav et al., 2020). Several studies have proposed to incorporate external knowledge from domain-specific KBs into neural models for biomedical event extraction. For example, Li et al. (2019) incorporate entity information from Gene Ontology into tree-LSTM models. However, their approach does not explicitly use any external relational information. A more recent framework uses a novel Graph Edge conditioned Attention Network (GEANet) to incorporate domain knowledge from UMLS.
In that framework, a global KG for the entire corpus is first constructed, and then a sentence-level KG is created for each individual sentence in the corpus. Our method of KG construction is more flexible, as we directly create a KG for each input text. Furthermore, that framework only deals with event extraction and assumes that gold-standard entity mentions are provided at inference time.
Some previous work has focused on integrating external knowledge into neural architectures for other tasks, such as reading comprehension (Mihaylov and Frank, 2018), question answering (Pan et al., 2019), natural language inference (Sharma et al., 2019), and conversational modeling (Parthasarathi and Pineau, 2018). Different from these studies, our work explicitly emphasizes the benefit of collective inference using global relational information.
Many previous studies have also used GNNs for various IE tasks (Nguyen and Grishman, 2018; Liu et al., 2018; Subburathinam et al., 2019; Zeng et al., 2021). Many of these methods use a dependency parser or a semantic parser to construct a graph capturing global interactions between tokens or spans. However, parsers for specialized biomedical domains are expensive to build; KECI does not rely on such expensive resources.

Conclusions and Future Work
In this work, we propose a novel span-based framework named KECI that utilizes external domain knowledge for joint entity and relation extraction from biomedical text. Experimental results show that KECI is highly effective, achieving new state-of-the-art results on two datasets: BioRelEx and ADE. Theoretically, KECI can take an entire document as input; however, the tested datasets are sentence-level only. In the future, we plan to evaluate our framework on document-level datasets. We also plan to explore a broader range of properties and information that can be extracted from external KBs to facilitate biomedical IE tasks. Finally, we plan to apply KECI to other information extraction tasks (Li et al., 2020a; Wen et al., 2021).

Average Runtime

Table 7 shows the estimated average runtime of our full model.

Number of Model Parameters
A full model trained on BioRelEx has about 121.0M parameters; a full model trained on ADE has about 119.9M parameters.

Hyperparameters of Best-Performing Models
The span length limit L is set to 20 tokens. Note that the choice of L has a noticeable effect only on the training time of KECI during the first epoch: KECI with randomly initialized parameters may include many non-relevant spans in the initial span graph, but after a few training iterations it can typically filter out most of them. The pruning parameter λ is set to 0.5. All of our models use SciBERT as the Transformer encoder (Beltagy et al., 2019). We use two different learning rates: one for the lower pretrained Transformer encoder and one for the upper layers. Table 8 summarizes the hyperparameter configurations of the best-performing models.

Expected Validation Performance
The main paper reports results on the development set of BioRelEx. For ADE, as in previous work, we conduct 10-fold cross-validation.