ERLKG: Entity Representation Learning and Knowledge Graph based association analysis of COVID-19 through mining of unstructured biomedical corpora

We introduce a generic, human-out-of-the-loop pipeline, ERLKG, to perform rapid association analysis of any biomedical entity with other existing entities from a corpus of the same domain. Our pipeline consists of a Knowledge Graph (KG) created from the open-source CORD-19 dataset by fully automating the procedure of information extraction using SciBERT. The best latent entity representations are then found by benchmarking different KG embedding techniques on the task of link prediction using a Graph Convolution Network Auto Encoder (GCN-AE). We demonstrate the utility of ERLKG with respect to COVID-19 through multiple qualitative evaluations. Due to the lack of a gold standard, we propose a relatively large intrinsic evaluation dataset for COVID-19 and use it for validating the top two performing KG embedding techniques. We find TransD to be the best performing KG embedding technique, with Pearson and Spearman correlation scores of 0.4348 and 0.4570, respectively. We demonstrate that a considerable number of ERLKG's top protein, chemical and disease predictions are currently in consideration for COVID-19 related research.


Introduction
COVID-19 is a global pandemic with a considerable fatality rate and a high transmission rate, affecting millions of people worldwide since its outbreak. The search for treatments and possible cures for the novel coronavirus (Wang et al., 2020b) has led to an exponential increase in scientific publications, but the challenge lies in effectively processing, integrating and leveraging related sources of information.
Rapid and effective utilization of literature during a pandemic such as COVID-19 is of utmost importance in combating the disease. In this paper, we introduce a fully automated generic pipeline consisting of an Information Extraction (IE) system followed by Knowledge Graph construction. The IE module uses SciBERT (Beltagy et al., 2019) for performing Named Entity Recognition (NER) and Relationship Extraction (RE). The entire entity extraction procedure is fully automated and no human expertise is used. The major goal is to ensure rapid access to relevant data through a structured representation of free-text articles. Following this, we focus on the task of association analysis of essential biomedical entities, namely, proteins, diseases and chemicals. Such entities are well explored in existing literature, and an analysis of their relatedness to COVID-19 is provided by leveraging the CORD-19 Open Research Dataset (Wang et al., 2020a). This can assist physicians in accelerating knowledge discovery and provide support for clinical decision making. The dataset and related resources of this paper are made public.
Due to a lack of gold standard information, we perform extensive qualitative evaluations in order to show that our system does not suffer from redundancy or bias. These evaluations include performance on a link prediction task and an intrinsic evaluation. For the former, KG embeddings along with the graph adjacency matrix are fed to a GCN-AE (Kipf and Welling, 2016) model to perform link prediction. Average Precision (AP) and ROC scores were used to benchmark different KG embeddings on the generated knowledge graph. For the intrinsic evaluation, we propose a new dataset that has been developed with the help of three physicians and benchmark our embeddings against it. Finally, based on cosine similarity scores, the best representation was used to predict the top chemicals, proteins and diseases related to COVID-19. The contributions of our approach are as follows:
1. We propose a fully automated, human-out-of-the-loop, end-to-end generic pipeline for rapidly determining the association of any biomedical entity of interest with other existing well explored entities.
2. We benchmark multiple KG embedding techniques on the task of link prediction and demonstrate that simple embedding methods provide comparable performance on straightforward structured KGs.
3. We introduce two human gold-standard entity lists, COV19 25 and COV19 729. The former consists of expert ratings for 25 entities predicted by ERLKG while the latter consists of expert ratings for 729 entities sampled from the CORD-19 dataset. The ratings are based on every entity's relatedness with respect to COVID-19.

Related Work
We mostly focus on recent works centered around the CORD-19 dataset, discussing the techniques used for IE and KG generation.

Entity and Relation Extraction
Most of the recent NLP systems use language models pretrained on unannotated text, like ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and XLNet (Yang et al., 2019). In the biomedical and clinical domains, BERT-based architectures pretrained on domain-specific unlabelled text have been used for IE (Alsentzer et al., 2019). The CORD-19 dataset, curated for the COVID-19 pandemic, integrates related scientific articles for various information retrieval tasks (Roberts et al., 2020). Multiple NLP applications have been developed around CORD-19, like Question Answering (Das et al., 2020), Summarization (Park, 2020), NER (Wang et al., 2020c), etc.

Knowledge Graph
KGs have been used extensively in fields such as the life sciences (Chen et al., 2009; Richardson et al., 2020). However, the scope of the networks built in the latter works is limited owing to the smaller dataset size. To learn node representations and leverage the structural information of the graph, various Knowledge Graph embedding techniques are used. Rossi et al. (2020) conduct an extensive survey of 16 KG embedding techniques to perform a comparative analysis. They form a taxonomy of embedding methods, grouping them into tensor decomposition models like DistMult (Yang et al., 2015); geometric models like TransE (Bordes et al., 2013), TransD (Ji et al., 2015), ComplEx (Trouillon et al., 2016) and RotatE (Sun et al., 2019); and deep learning models like ConvE (Dettmers et al., 2018) and CapsE (Nguyen et al., 2019). Shifting from textual sources for KG construction, Ray et al. (2020) use biological interaction networks, such as drug-protein and protein-protein networks, to predict repurposable drugs for SARS-CoV-2 through link prediction, employing Variational Graph AutoEncoders with features from Node2Vec (Grover and Leskovec, 2016) for entity representation.

CORD-19
The CORD-19 corpus (Wang et al., 2020a) was published by the Allen Institute for AI in association with the White House and other organizations. It was made publicly available on the Kaggle platform as part of an open research challenge. The data, consisting of scholarly articles, is collected from sources like PubMed Central (PMC), PubMed, the World Health Organization's COVID-19 Database, and preprint servers like bioRxiv, medRxiv and arXiv. The CORD-19 corpus (2020-05-12 release) contains a pool of 138,000 scholarly articles, including 69,000 full-text articles, related to COVID-19, SARS-CoV-2, etc. Each paper is associated with bibliographic metadata, such as title and authors, as well as unique identifiers, such as a DOI and PubMed Central ID. Various sub-tasks have been identified for effective information retrieval; however, the corpus lacks task-oriented ground-truth data. We merge all the metadata with the corresponding full-text papers and retain the title, abstract and full text from the corpus.

Datasets for Fine-tuning SciBERT
For NER, we consider three datasets: the JNLPBA corpus (Collier and Kim, 2004), which consists of 5 distinct tags (Protein, DNA, RNA, Cell line and Cell type); the CHEMDNER corpus (Krallinger et al., 2015), which consists of the tags Abbreviation, Family, Formula, Identifier, Multiple, Systematic and Trivial; and the NCBI Disease Corpus (Dogan et al., 2014), which is used to identify disease mentions only.
For RE, the following datasets are used: CHEMPROT (Kringelum et al., 2016), which consists of 13 relationship types based on identified positive associations (Inhibitor, Substrate, Indirect-Down regulator, Indirect-Up regulator, Activator, Antagonist, Product-Of, Agonist, Down regulator, Up regulator, Agonist-Activator, Agonist-Inhibitor and Substrate-Product-Of), and BC5CDR (Li et al., 2016), which captures binary relations predicting a positive or negative interaction for chemical-induced-disease pairs.

ERLKG
In this section we discuss the entire pipeline and its various components. Figure 1 depicts the pipeline, which consists of the following modules: Preprocessing, Named Entity Recognition (NER), Relation Extraction (RE) and Knowledge Graph (KG) construction. The remainder of Figure 1 depicts the evaluation strategies adopted for a reliable association analysis of the various chemical, protein and disease entities from the CORD-19 corpus with respect to COVID-19.

Preprocessing
Each abstract or full text was split into sentences using the NLTK (Loper and Bird, 2002) sentence tokenizer, and the sentences, in turn, were tokenized using the spaCy (v2.0.10) tokenizer. Following this, we removed all non-functional tokens and attached POS tags to the remaining tokens.
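A minimal sketch of this preprocessing step. The actual pipeline uses the NLTK and spaCy tokenizers; the regex helpers below are simplified stand-ins for illustration only:

```python
import re

def split_sentences(text):
    # Stand-in for the NLTK sentence tokenizer: split on
    # sentence-final punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Stand-in for the spaCy tokenizer: keep hyphenated terms
    # (e.g. "SARS-CoV-2") whole, split off punctuation.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

def is_functional(token):
    # Treat bare punctuation as a non-functional token to drop.
    return not re.fullmatch(r"[^\w]+", token)

text = "SARS-CoV-2 binds ACE2. Remdesivir was tested in trials!"
sentences = split_sentences(text)
tokens = [[t for t in tokenize(s) if is_functional(t)] for s in sentences]
```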

Named Entity Recognition
Named Entity Recognition (NER) is the task of identifying domain-specific proper nouns in a sentence. In order to gain meaningful insights about the major classes of biomedical entities present in the dataset, it was necessary to tag the entities using an NER module fine-tuned on various biomedical datasets. Since the CORD-19 dataset is a collection of scientific articles, we use SciBERT for NER. SciBERT is a variant of the BERT (Devlin et al., 2019) model and is pretrained on a scientific corpus of 1.14M articles, 82% of which are from the biomedical domain and the rest from various computer science domains. In order to extract chemical, protein and disease entities, SciBERT is fine-tuned on task-specific datasets one by one, namely, JNLPBA (Collier and Kim, 2004), CHEMDNER (Krallinger et al., 2015) and the NCBI Disease Corpus (Dogan et al., 2014), to obtain protein, chemical and disease annotations respectively.
We use the SciBERT-scivocab-uncased model for NER. The input to the SciBERT model is the pre-processed dataset modified according to the BERT tokenization. The output of the model consists of the input sentence along with labels in the BIO scheme, where "B" stands for the Beginning of an entity, "I" stands for Inside an entity and "O" means Outside any entity, as can be seen in the NER module of Figure 1.
Due to the lack of a human gold-standard dataset for NER on the CORD-19 data, we do not retain the fine-grained entity annotations. Following the NER tagging, we therefore tag the Protein, DNA and RNA entities extracted upon fine-tuning on JNLPBA simply as PROTEIN, entities from CHEMDNER as CHEMICAL and entities from the NCBI Disease Corpus as DISEASE. We drop all entities with the tags Cell line and Cell type as they could not be merged into any existing category.
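The BIO decoding and coarse relabelling described above can be sketched as follows; the helper, tag names and mapping are our own illustration of the scheme, not the pipeline's code:

```python
def decode_bio(tokens, tags):
    """Collect (entity text, label) spans from BIO-tagged tokens."""
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                entities.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # entity continues
            current.append(tok)
        else:                              # "O", or a stray "I-" tag
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

# Coarse relabelling: JNLPBA Protein/DNA/RNA -> PROTEIN, etc.;
# Cell line / Cell type are absent from the map and hence dropped.
COARSE = {"Protein": "PROTEIN", "DNA": "PROTEIN", "RNA": "PROTEIN",
          "Chemical": "CHEMICAL", "Disease": "DISEASE"}

tokens = ["ACE2", "binds", "spike", "protein"]
tags = ["B-Protein", "O", "B-Protein", "I-Protein"]
spans = [(t, COARSE[l]) for t, l in decode_bio(tokens, tags) if l in COARSE]
```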

Relation Extraction
From the NER module we obtain an annotated dataset. To further exploit the underlying information in the running sentences, we perform intra-sentence Relation Extraction (RE), which is the task of identifying relationships between any two named entities present within a sentence. Using this RE module we identify the relationships that different pairs of entities have at the sentence level. The output from the NER module was further processed in order to discover sentences containing more than one entity. A sentence with a given set of entities, E, is split into C(|E|, 2) instances, one per entity pair. So a single instance is represented as X = {e1, e2, w1, ..., wn}, where e1 and e2 are the two tagged entities and wj is the j-th word in the sentence.
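The pair-generation step above can be sketched as follows; the instance format and the helper name are illustrative, not the paper's code:

```python
from itertools import combinations

def make_re_instances(sentence_tokens, entities):
    # One RE instance per unordered entity pair: X = {e1, e2, w1..wn}.
    return [{"e1": e1, "e2": e2, "tokens": sentence_tokens}
            for e1, e2 in combinations(entities, 2)]

tokens = ["remdesivir", "inhibits", "RdRp", "in", "COVID-19", "patients"]
ents = [("remdesivir", "CHEMICAL"), ("RdRp", "PROTEIN"), ("COVID-19", "DISEASE")]
insts = make_re_instances(tokens, ents)  # 3 entities -> C(3, 2) = 3 pairs
```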
An approach similar to the NER module is performed, employing SciBERT for identifying relations from sentences through contextual evidence. We fine tune SciBERT on two datasets, CHEMPROT (Kringelum et al., 2016) and BC5CDR (Li et al., 2016), to capture relations between chemical-protein and chemical-disease pairs.
Following the RE task on the CORD-19 data, we combine the 13 different types of associations obtained upon fine-tuning on CHEMPROT into a single relation type called CHEMICAL-PROTEIN. Similarly, only the positive associations obtained upon fine-tuning on BC5CDR were retained, as a single CHEMICAL-DISEASE relation type.

Knowledge Graph Construction
Statistics of the consolidated set of entity mentions and relation pairs obtained as a result of NER and RE on the CORD-19 dataset can be seen in Table 1. To obtain an overview of the different entities and their associations with each other, we generate a KG, a structured association representation of the entire unstructured CORD-19 dataset. The KG is defined as KG = (E, R, G), where:
• E: a set of nodes representing disease/protein/chemical entities
• R: a set of labels representing the chemical-protein or chemical-disease relation
• G ⊆ E × R × E: a set of edges that represent facts connecting entity pairs.
Each fact is a triple (h, r, t), where h is the head, r is the relation, and t is the tail of the fact.
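Under these definitions, the KG can be sketched with plain dictionaries; the facts below are illustrative examples, not actual extracted triples:

```python
from collections import defaultdict

# Illustrative (h, r, t) facts of the two relation types in the KG.
facts = [
    ("remdesivir", "CHEMICAL-PROTEIN", "RdRp"),
    ("remdesivir", "CHEMICAL-DISEASE", "COVID-19"),
    ("chloroquine", "CHEMICAL-DISEASE", "COVID-19"),
]

entities = set()               # E: the node set
adjacency = defaultdict(list)  # G: head -> [(relation, tail), ...]
in_degree = defaultdict(int)   # used later for noise filtering
for h, r, t in facts:
    entities.update([h, t])
    adjacency[h].append((r, t))
    in_degree[t] += 1
```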

COV19 729
After generating the KG, a list of all entities was supplied to a physician, who grouped the terms into 3 categories based on their relatedness to COVID-19: NOT RELATED, PARTIALLY RELATED and HIGHLY RELATED. The number of entities in the HIGHLY RELATED group was found to be much smaller than in the other two categories. Thus, in order to reduce bias, the physician sampled a nearly equal number of entities from each group, resulting in a final dataset comprising 729 entities, named COV19 729. This dataset was then shuffled and passed on to two independent physicians, who rated each sample on how related the entity is to COVID-19 on a scale of 0 (NOT RELATED) to 5 (HIGHLY RELATED). The inter-rater agreement (kappa score) is found to be 0.5116, which lies in the moderate agreement range. We therefore average the ratings and propose a relatively large intrinsic evaluation dataset, COV19 729, for benchmarking COVID-19 related embedding techniques. Table 4 shows a snapshot of COV19 729.
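The agreement statistic can be reproduced with a small unweighted Cohen's kappa implementation; the paper does not state whether a weighted variant was used, so this is a hedged sketch on toy ratings from two hypothetical raters:

```python
from collections import Counter

def cohen_kappa(r1, r2):
    # Unweighted Cohen's kappa: (observed - expected) / (1 - expected).
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy 0-5 relatedness ratings (illustrative, not the COV19 729 data).
rater_a = [0, 0, 5, 5]
rater_b = [0, 0, 5, 3]
kappa = cohen_kappa(rater_a, rater_b)
consensus = [(a + b) / 2 for a, b in zip(rater_a, rater_b)]  # averaged ratings
```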

Implementation Details
To generate the intrinsic evaluation dataset, the full list of 78K entities present in our KG is reduced to 5K by removing all entities with an in-degree of less than 5. This is done in order to reduce noise; after experimenting with multiple values, a threshold of 5 provided the highest signal-to-noise ratio. For fine-tuning SciBERT, all hyper-parameters are left at their default values except the truncate-long-sequences parameter, which is set to false. For training the KG embeddings in OpenKE (Han et al., 2018), the embedding dimension is set to 400 and the rest of the parameters are kept at their defaults. For the GCN-AE (Kipf and Welling, 2016) used in the link prediction task, the learning rate is set to 0.01, the number of epochs to 200, and the hidden units in the first and second layers to 32 and 16, respectively.
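The in-degree filtering step can be sketched as follows, on toy triples (the real pipeline applies this to the 78K-entity KG; the helper name is our own):

```python
from collections import defaultdict

def filter_by_indegree(triples, threshold=5):
    # Keep only entities whose in-degree meets the threshold,
    # mirroring the noise-reduction step described above.
    in_deg = defaultdict(int)
    for h, r, t in triples:
        in_deg[t] += 1
    return {e for e, d in in_deg.items() if d >= threshold}

# Toy triples: "keep" has in-degree 5, "drop" only 2.
triples = [(f"h{i}", "r", "keep") for i in range(5)] + \
          [(f"h{i}", "r", "drop") for i in range(2)]
kept = filter_by_indegree(triples, threshold=5)
```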

Link Prediction
Latent entity representation learning on the constructed KG is crucial so that one can effectively analyze the association of any given biomedical entity with respect to COVID-19. Rather than randomly choosing a method, we first evaluate popular KG embedding techniques on the downstream task of link prediction. We consider Node2Vec (Grover and Leskovec, 2016), tensor decomposition models like DistMult (Yang et al., 2015), and geometric models, namely, TransE (Bordes et al., 2013), TransD (Ji et al., 2015), ComplEx (Trouillon et al., 2016) and RotatE (Sun et al., 2019).
The test and validation sets are created from edges removed from the graph, with the addition of an equal number of randomly sampled false links (node pairs that have no connection in the graph). The test and validation sets contain 10% and 5% of the true links, respectively. We use OpenKE (Han et al., 2018), an open-source framework for knowledge embedding techniques, and report results based on each model's performance on the test set. The embeddings resulting from these methods are treated as features and, along with the graph adjacency matrix, are fed to a GCN-AE (Kipf and Welling, 2016). The Average Precision (AP) and ROC score of each setting are used to benchmark these embedding types, as can be seen in Table 2.
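The false-link sampling and a rank-based ROC-AUC computation can be sketched as follows; `sample_negative_edges` and `roc_auc` are illustrative helpers, not the paper's code (the paper relies on OpenKE and a GCN-AE for these steps):

```python
import random

def sample_negative_edges(nodes, true_edges, k, seed=0):
    # Sample k node pairs with no edge in either direction (false links).
    rng = random.Random(seed)
    existing = set(true_edges) | {(t, h) for h, t in true_edges}
    nodes, negatives = list(nodes), set()
    while len(negatives) < k:
        u, v = rng.sample(nodes, 2)
        if (u, v) not in existing:
            negatives.add((u, v))
    return sorted(negatives)

def roc_auc(pos_scores, neg_scores):
    # AUC as P(positive outranks negative); ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

negs = sample_negative_edges(range(10), [(0, 1), (1, 2)], k=5)
auc = roc_auc([0.9, 0.8], [0.1, 0.2])  # perfect separation -> 1.0
```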

Method                               ROC     AP
RotatE (Sun et al., 2019)            0.858   0.887
TransD (Ji et al., 2015)             0.860   0.883
TransE (Bordes et al., 2013)         0.853   0.877
DistMult (Yang et al., 2015)         0.855   0.883
ComplEx (Trouillon et al., 2016)     0.852   0.881
Node2Vec (Grover and Leskovec, 2016) 0.821   0.849

As can be seen from Table 2, in terms of Average Precision RotatE performs the best among all KG embeddings, while in terms of ROC score TransD outperforms the rest. Models like TransE capture inversion and composition patterns well, whereas models like DistMult capture symmetric relationships; RotatE captures all of symmetry, anti-symmetry, inversion and composition. TransD performs comparably to RotatE because, in our setting, every relationship pair has head and tail entities of different types (either chemical-protein or chemical-disease), and the inherent property of TransD of separating the head and tail entity spaces was useful for modelling this graph structure.
Node2Vec performs relatively poorly since it relies on grouping nodes with identical connection patterns, which may be infrequent in our KG, as it is not derived from an interaction network but is instead constructed from entities and relations obtained from free text.

Intrinsic Evaluation
We conduct an intrinsic evaluation, where Table 3 shows the performance of the TransD and RotatE embedding methods in terms of Pearson and Spearman correlation scores between the ratings and the cosine similarity scores of entities on the COV19 729 dataset. The cosine similarity scores for each entity were generated with respect to the COVID-19 embedding vector obtained from our proposed pipeline. However, most of the top entities generated by our two best methods, TransD and RotatE (selected on the basis of the link prediction task), were not present in COV19 729, since that dataset was randomly sampled. In our view, these entities require immediate attention; hence, we conduct another round of scoring to evaluate them and, in the process, propose COV19 25.
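The intrinsic evaluation amounts to cosine similarity against the COVID-19 vector followed by a correlation with physician ratings. A toy sketch, with 3-d vectors and made-up entity names standing in for the 400-d embeddings and the COV19 729 ratings:

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy 3-d vectors standing in for the learned TransD embeddings.
covid = [1.0, 0.2, 0.0]
vecs = {"ace2": [0.9, 0.3, 0.1], "aspirin": [0.0, 0.1, 1.0]}
ratings = {"ace2": 5.0, "aspirin": 1.0}  # hypothetical physician scores

sims = [cosine(covid, vecs[e]) for e in ratings]
gold = [ratings[e] for e in ratings]
```

Spearman correlation is obtained the same way by applying `pearson` to the ranks of `sims` and `gold` instead of the raw values.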


COV19 25
The top 100 predicted entities from TransD and RotatE were selected and their intersection was taken, which was then passed on to a physician. The physician recommended a list of 25 relevant entities out of the provided set. This list was then sent to another physician, who rated the entities based on their relatedness to COVID-19. The resulting dataset is named COV19 25. It is evident from Table 3 that TransD has the highest Pearson and Spearman scores on both the COV19 729 and COV19 25 datasets. Hence, we use TransD as the final embedding generation method for ERLKG.

Discussion
We exploit the contextual evidence from the CORD-19 corpus to find entities and relations. This is followed by KG construction for determining the relatedness of any biomedical entity with respect to COVID-19. A simple co-occurrence matrix based method is not sufficient to capture the different relationship association types; we therefore use the state-of-the-art SciBERT for entity and relationship extraction. We construct a KG from entity pairs and the relationships among them. Our aim was to utilize this KG for effective association analysis, for which identifying the best entity representation was necessary. We therefore conduct a link prediction task and evaluate popular KG embedding techniques. Since our KG has a simple bare-bones structure, deep learning based KG embedding methods like ConvE were not explored in this work; such methods increase the number of hyperparameters while providing little to no explainability.
We face the challenge of an absence of ground-truth data for the CORD-19 corpus. Thus, we conduct extensive qualitative evaluations and, in the process, introduce two gold-standard, annotated entity lists, COV19 25 and COV19 729. COV19 25 consists of 25 entities predicted by the top two embedding techniques, TransD and RotatE, while COV19 729 consists of 729 entities sampled from the processed CORD-19 dataset. The ratings were based on an entity's relatedness to COVID-19. From the correlation scores (Table 3) of our intrinsic evaluation, we observe that our model can provide considerable insight in predicting important associations with respect to COVID-19.

Conclusion and Future Work
We propose ERLKG, a generic pipeline for association analysis with respect to a given entity from an unstructured dataset. The part of the pipeline integrating IE and KG construction keeps the human out of the loop. In order to learn the latent representation of the constructed KG, we first benchmark various KG embedding techniques on the task of link prediction. According to our experiments, TransD and RotatE produce comparable performance.
In this work, our approach is evaluated only on the CORD-19 dataset; no additional resources have been employed. Due to the lack of gold-standard data, we introduce COV19 729, a list of named entities extracted by our pipeline, selected randomly and given to physicians for assigning association scores with respect to COVID-19. Owing to the random selection, most of the entities given high association scores by TransD and RotatE were found to be missing from COV19 729; hence another set, drawn from the top entities, was given to the physicians, which we call COV19 25. Finally, TransD is used as our best KG embedding technique to predict the top entities closely associated with COVID-19 in the CORD-19 corpus. As future work, we plan to implement a normalization and abbreviation-expansion module after entity detection. The study of these top predicted entities by domain experts can help them understand the different types of associations and relationships the entities exhibit with respect to COVID-19.