COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation

To combat COVID-19, both clinicians and scientists need to digest the vast amount of relevant biomedical knowledge in the literature to understand the disease mechanism and the related biological functions. We have developed a novel and comprehensive knowledge discovery framework, COVID-KG, to extract fine-grained multimedia knowledge elements (entities, relations and events) from scientific literature. We then exploit the constructed multimedia knowledge graphs (KGs) for question answering and report generation, using drug repurposing as a case study. Our framework also provides detailed contextual sentences, subfigures, and knowledge subgraphs as evidence. All of the data, KGs, and reports are publicly available.


Introduction
Practical progress in combating COVID-19 depends heavily on effective search, discovery, assessment and extension of scientific research results. However, clinicians and scientists face two unique barriers in digesting these research papers.
The first challenge is quantity. The bottleneck in knowledge access is exacerbated during a pandemic, when increased investment in relevant research leads to even faster growth of literature than usual. For example, as of April 28, 2020, PubMed listed 19,443 papers related to coronavirus; as of June 13, 2020, there were 140K+ related papers, with nearly 2.7K new papers per day (see Figure 1). The resulting knowledge bottleneck contributes to significant delays in the development of vaccines and drugs for COVID-19. More intelligent knowledge discovery technologies need to be developed to enable researchers to access and digest relevant knowledge from the literature more quickly and accurately.
The second challenge is quality. Many research results about coronavirus from different research labs and sources are redundant, complementary, or even conflicting with each other, while some false information has been promoted in both formal publication venues and social media platforms such as Twitter. As a result, some public policy responses to the virus, and public perception of it, have been based on misleading, and at times erroneous, claims. The relative isolation of these knowledge resources makes it hard, if not impossible, for researchers to connect the dots that exist in separate resources to gain new insights.
Let us consider drug repurposing as a case study. Besides the long process of clinical trials and biomedical experiments, another major cause of the lengthy discovery phase is the complexity of the problem involved and the difficulty of drug discovery in general. The current clinical trials for drug repurposing rely mainly on reported symptoms in considering drugs that can treat diseases with similar symptoms. However, there are too many drug candidates and too much misinformation published in multiple sources. Clinicians and scientists thus urgently need assistance in obtaining a reliable ranked list of drugs with detailed evidence, and also in gaining new insights into the underlying molecular cellular mechanisms of COVID-19 and the pre-existing conditions that may affect the mortality and severity of this disease.
To tackle these two challenges, we propose a new framework, COVID-KG, to accelerate scientific discovery and build a bridge between the research scientists making use of our framework and the clinicians who will ultimately conduct the tests, as illustrated in Figure 2. COVID-KG starts by reading existing papers to build multimedia knowledge graphs (KGs), in which nodes are entities/concepts and edges represent relations and events involving these entities, as extracted from both text and images. Given the KGs enriched with path ranking and evidence mining, COVID-KG answers natural language questions effectively. With drug repurposing as a case study, we focus on 11 typical questions that human experts pose and integrate our techniques to generate a comprehensive report for each candidate drug.

Coarse-grained Text Knowledge Extraction
Our coarse-grained Information Extraction (IE) system consists of three components: (1) coarse-grained entity extraction (Wang et al., 2019a) (Li et al., 2019): we extract 13 Event types and the roles of entities involved in these events as defined in (Nédellec et al., 2013), including Gene expression,
However, questions from experts often involve fine-grained knowledge elements, such as "Which amino acids in glycoprotein are most related to Glycan (CHEMICAL)?". To answer these questions, we apply our fine-grained entity extraction system CORD-NER (Wang et al., 2020c) to extract 75 types of entities to enrich the KG, including many COVID-19 specific new entity types (e.g., coronaviruses, viral proteins, evolution, materials, substrates and immune responses). CORD-NER relies on distantly- and weakly-supervised methods (Wang et al., 2019b; Shang et al., 2018), with no need for expensive human annotation. Its entity annotation quality surpasses that of SciSpacy, a fully supervised BioNER tool (up to 93.95% F-score, over 10% higher F1 score on a sample set of documents). See Figure 4 for results on part of a COVID-19 paper (Zhang et al., 2020).
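The actual CORD-NER methods are described in the cited papers. As a rough illustration of the distant-supervision idea only, the following minimal sketch labels entity mentions by matching text against a small dictionary instead of using human annotation. The dictionary entries, the type names and the function name distant_label are hypothetical and are not taken from CORD-NER.

```python
import re

# Hypothetical mini-dictionary standing in for a large external knowledge base.
TYPE_DICTIONARY = {
    "glycoprotein": "PROTEIN",
    "glycan": "CHEMICAL",
    "sars-cov-2": "CORONAVIRUS",
}


def distant_label(sentence, dictionary=TYPE_DICTIONARY):
    """Return (start, end, surface form, type) for every dictionary hit."""
    annotations = []
    for term, entity_type in dictionary.items():
        # Case-insensitive match that must not sit inside a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            annotations.append(
                (match.start(), match.end(), match.group(0), entity_type)
            )
    # Sort by character offset so mentions come out in reading order.
    return sorted(annotations)


labels = distant_label(
    "Which amino acids in glycoprotein are most related to Glycan?"
)
for start, end, surface, entity_type in labels:
    print(start, end, surface, entity_type)
```

Labels produced this way are noisy (ambiguous terms, missing synonyms), which is why such labels are usually treated as weak supervision for a learned tagger rather than as final output.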

Image Processing and Cross-media Entity Grounding
Figures in biomedical papers may contain different types of visual information, for example, molecular structures, microscopic images, dosage response curves, relational diagrams, and other uniquely visual content. We have developed a visual IE subsystem to extract the visual information from figures to enrich the KG. We start by designing a pipeline and automatic tools, shown in Figure 5, to extract figures from papers in the CORD-19 dataset and segment figures into nearly half a million isolated subfigures. Finally, we perform cross-modal entity grounding, i.e., associating visual objects identified in these subfigures with entities mentioned in their captions or referring text. To start, since most figures are embedded as part of PDF files, we run Deepfigures (Siegel et al., 2018) to automatically detect and extract figures from each PDF document. Then each figure is associated with the text in its caption or referring context (main body text referring to the figure). In this way, a figure can be attached, at a coarse level, to a KG entity if that entity is mentioned in the associated text.
To further delineate semantic and visual information contained within each subfigure, we have developed a pipeline to segment individual subfigures and then align each subfigure with its corresponding subcaption. We run Figure-separator (Tsutsui and Crandall, 2017) to detect and separate all non-overlapping image regions. On occasion, subfigures within a figure may also be marked with alphabetical letters (e.g., A, B, C). We use deep neural networks (Zhou et al., 2017) to detect text within figures and then apply OCR tools (Smith, 2007) to automatically recognize text content within each figure. To identify subfigure marker text and text labels for analyzing figure content, we rely on the distance between text labels and subfigures to locate subfigure text markers. Location information of such text markers can also be used to merge multiple image regions into a single subfigure. In the end, each subfigure is segmented and associated with its corresponding subcaption and referring context. The segmented subfigures and associated text labels provide rich information that can expand the KG constructed from text captions. For example, as shown in Figure 6, we apply a classifier to detect subfigures containing molecular structures. Then, by linking the specific drug names extracted from within-figure text to corresponding drug entities in the coarse KG constructed from the caption text, an expanded cross-modal KG can be constructed that links images with specific molecular structures to their drug entities in the KG.
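The coarse-level attachment described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout (id, caption, referring_text), the example captions and the plain substring test are all hypothetical simplifications, not the system's actual matching procedure.

```python
def attach_figures(figures, entity_names):
    """Map each entity name to the ids of figures whose caption or
    referring text mentions it (case-insensitive substring match)."""
    links = {name: [] for name in entity_names}
    for fig in figures:
        text = (fig["caption"] + " " + " ".join(fig["referring_text"])).lower()
        for name in entity_names:
            if name.lower() in text:
                links[name].append(fig["id"])
    return links


# Hypothetical figure records; the captions are invented for illustration.
figures = [
    {
        "id": "fig1",
        "caption": "Molecular structure of Losartan.",
        "referring_text": ["Figure 1 shows the structure."],
    },
    {
        "id": "fig2",
        "caption": "Dose response curves.",
        "referring_text": ["Amodiaquine responses are plotted in Figure 2."],
    },
]

links = attach_figures(figures, ["Losartan", "Amodiaquine", "Benazepril"])
print(links)
```

A substring test is the simplest possible choice here; matching against linked entity mentions from the text IE output would avoid spurious hits on short names.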
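As an illustration of the distance-based marker assignment described above, the sketch below assigns each detected marker label to the nearest subfigure region. The text does not specify the distance measure; the use of Euclidean distance between bounding-box centers, the (x0, y0, x1, y1) box format and the function names are assumptions made for this example.

```python
import math


def center(box):
    """Center point of an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def assign_markers(subfigure_boxes, marker_boxes):
    """Assign each marker label to the index of the closest subfigure,
    measured by distance between box centers (an assumed metric)."""
    assignment = {}
    for label, marker_box in marker_boxes.items():
        mx, my = center(marker_box)
        distances = [
            math.hypot(mx - cx, my - cy)
            for cx, cy in map(center, subfigure_boxes)
        ]
        assignment[label] = distances.index(min(distances))
    return assignment


# Two side-by-side subfigures and two OCR-detected marker boxes (made-up data).
subfigure_boxes = [(0, 0, 100, 100), (120, 0, 220, 100)]
marker_boxes = {"A": (2, 2, 12, 12), "B": (122, 2, 132, 12)}

assignment = assign_markers(subfigure_boxes, marker_boxes)
print(assignment)
```

Image regions that end up without a marker of their own could then be merged with a neighboring region that has one, which corresponds to the merging step mentioned in the text.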

Knowledge Graph Semantic Visualization
In order to enhance the exploration and discovery of the information mined from the COVID-19 literature through the algorithms discussed in previous sections, we create semantic visualizations over large complex networks of biomedical relations using the techniques proposed by Tu et al. (2020). Semantic visualization allows user-defined subsets of these relations to be visualized interactively through semantically typed tag clouds and heat maps. This allows researchers to get a global view of selected relation subtypes drawn from hundreds or thousands of papers at a single glance. This in turn allows for the ready identification of novel relations that would typically be missed by directed keyword searches or simple unigram word cloud or heatmap displays.
We first build a data index from the knowledge elements in the constructed KGs, and then create a Kibana dashboard out of the generated data indices. Each Kibana dashboard has a collection of visualizations that are designed to interact with each other. Dashboards are implemented as web applications. The navigation of a dashboard is mainly through clicking and searching. For example, by clicking the protein keyword EIF2AK2 in the tag cloud named "Enzyme proteins participating Modification relations", a constraint on the type of proteins in modifications is added. Correspondingly, all the other visualizations will be changed.
One unique feature of the semantic visualization is the creation of dense tag clouds and dense heatmaps, through a process of parameter reduction over relations, allowing for the visualization of relation sets as tag clouds and multiple chained relations as heatmaps. Figure 7 illustrates such a dense heatmap that contains relations between proteins and implicated diseases (e.g., "those proteins that are down-regulators of TNF which are implicated in obesity"), along with their type information.
In contrast to most current question-answering (QA) methods, which target single documents, we have developed a QA component based on a combination of KG matching and distributional semantic matching across documents. We build KG indexing and searching functions to facilitate effective and efficient search when users pose their questions.
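Kibana dashboards read from Elasticsearch indices, so the indexing step can be pictured as turning each extracted relation into one index document. The sketch below only serializes relations into the newline-delimited JSON body accepted by the Elasticsearch bulk API; the index name covid_kg_relations, the field names and the example records (including the placeholder PROTEIN_X) are hypothetical, and the actual index schema used for the dashboards is not given in the text.

```python
import json


def to_bulk_ndjson(relations, index_name="covid_kg_relations"):
    """Serialize relation records as an Elasticsearch bulk request body:
    one action line followed by one document line per record."""
    lines = []
    for relation in relations:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(relation, sort_keys=True))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical relation records; field names and values are illustrative.
relations = [
    {"head": "EIF2AK2", "head_type": "Enzyme", "relation": "Modification",
     "tail": "PROTEIN_X", "paper_id": "paper-001"},
    {"head": "EIF2AK2", "head_type": "Enzyme", "relation": "Modification",
     "tail": "PROTEIN_X", "paper_id": "paper-002"},
]

payload = to_bulk_ndjson(relations)
print(payload)
```

Keeping typed fields such as head_type and relation in every document is what would let a dashboard filter all visualizations at once when a user clicks a tag.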
We also support extended semantic matching from the constructed KGs and related texts by accepting multi-hop queries.
A common category of queries is about the connections between two entities. Given two entities in a query, we generate a subgraph covering salient paths between them to show how they are connected through other entities. Figure 3 is an example subgraph summarizing the connections between Losartan and cathepsin L pseudogene 2. The paths are generated by traversing the constructed KG, and are ranked by the number of papers covering the knowledge elements in each path in the KG. Each edge is assigned a salience score by aggregating the scores of paths passing through it. In addition to knowledge elements, we also present related sentences and source information as evidence. We use BioBERT (Lee et al., 2020), a pre-trained language model, to represent each sentence along with its left and right neighboring sentences as local contexts. Applying the same architecture to all respective sentences and the user query, we aggregate the sequence embedding layer (the last hidden layer in the BERT architecture) with average pooling (Reimers and Gurevych, 2019). We use the similarity between the embedding representations of each sentence and each query to identify and extract the most relevant sentences as evidence.
Another common category of queries includes entity types, rather than entity instances, and requires extracting evidence sentences based on type or pattern matching. We have developed EVIDENCEMINER (Wang et al., 2020a,b), a web-based system that accepts a user's query as a natural language statement or an inquiry about a relationship at the meta-symbol level (e.g., CHEMICAL, PROTEIN) and then automatically retrieves textual evidence from a background corpus of COVID-19 literature.
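The path ranking and edge salience steps can be sketched as follows. The text does not give the exact formulas, so two assumptions are made here: a path's score is the number of distinct papers supporting any of its edges, and an edge's salience is the sum of the scores of the paths that pass through it. The intermediate nodes gene_X and gene_Y and the paper ids are hypothetical.

```python
from collections import defaultdict


def rank_paths(paths, edge_papers):
    """Score each path by the number of distinct supporting papers
    (assumed scoring) and return (score, path) pairs, best first."""
    scored = []
    for path in paths:
        papers = set()
        for edge in zip(path, path[1:]):
            papers |= edge_papers.get(edge, set())
        scored.append((len(papers), path))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


def edge_salience(scored_paths):
    """Aggregate path scores onto edges by summation (assumed aggregation)."""
    salience = defaultdict(int)
    for score, path in scored_paths:
        for edge in zip(path, path[1:]):
            salience[edge] += score
    return dict(salience)


target = "cathepsin L pseudogene 2"
paths = [
    ["Losartan", "gene_Y", target],
    ["Losartan", "gene_X", target],
]
# Hypothetical mapping from KG edges to the papers that report them.
edge_papers = {
    ("Losartan", "gene_X"): {"p1", "p2"},
    ("gene_X", target): {"p2", "p3"},
    ("Losartan", "gene_Y"): {"p4"},
    ("gene_Y", target): {"p4"},
}

ranked = rank_paths(paths, edge_papers)
salience = edge_salience(ranked)
print(ranked[0])
```

With this toy input, the path through gene_X is supported by three distinct papers and ranks above the path through gene_Y, which is supported by one.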
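To make type-level querying concrete, the following minimal sketch filters sentences by the entity types they contain plus optional keywords. It is not the EVIDENCEMINER implementation; the record layout, the function name retrieve_evidence and the placeholder sentences (Drug_A, Protein_B) are hypothetical.

```python
def retrieve_evidence(query_types, keywords, annotated_sentences):
    """Return sentences that contain at least one entity of every queried
    type and mention every keyword (case-insensitive)."""
    hits = []
    for sentence in annotated_sentences:
        types_present = {entity_type for _, entity_type in sentence["entities"]}
        text = sentence["text"].lower()
        has_types = set(query_types) <= types_present
        has_keywords = all(keyword.lower() in text for keyword in keywords)
        if has_types and has_keywords:
            hits.append(sentence["text"])
    return hits


# Placeholder corpus; sentences and annotations are invented for illustration.
corpus = [
    {"text": "Drug_A binds Protein_B in vitro.",
     "entities": [("Drug_A", "CHEMICAL"), ("Protein_B", "PROTEIN")]},
    {"text": "Protein_B is expressed in lung tissue.",
     "entities": [("Protein_B", "PROTEIN")]},
]

hits = retrieve_evidence(["CHEMICAL", "PROTEIN"], ["binds"], corpus)
print(hits)
```

A query such as "CHEMICAL binds PROTEIN" would map to query_types ["CHEMICAL", "PROTEIN"] and keywords ["binds"] in this simplified scheme.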

A Case Study on Drug Repurposing Report Generation

Task and Data
A human-written report about drug repurposing usually answers the following typical questions. The answers to questions #5 and #11 are extracted based on the meta-data sections of research papers in scientific literature, including the author affiliation and acknowledgement sections. The answers to the other questions are all extracted based on the knowledge graphs constructed and the knowledge-driven question-answering method described above.
In our case studies, DARPA biologists inquired about three drugs, Benazepril, Losartan, and Amodiaquine, and their links to COVID-19 related chemicals/genes, as shown in Figure 8. Our KG results for many other drugs are visualized at our website.
We download new COVID-19 papers from three Application Programming Interfaces (APIs): the NCBI PMC API, the NCBI Pubtator API and the CORD-19 archive. We provide incremental updates, including new papers, removed papers and updated papers, and their metadata information at our website.
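One simple way to compute such incremental updates is to compare two snapshots of the collection. The sketch below assumes each snapshot maps a paper identifier to a content hash or version string; this representation and the function name diff_snapshots are illustrative assumptions, not a description of the actual update pipeline.

```python
def diff_snapshots(old, new):
    """Compare two {paper_id: content_hash} snapshots and list which
    papers are new, removed, or updated."""
    old_ids, new_ids = set(old), set(new)
    return {
        "new": sorted(new_ids - old_ids),
        "removed": sorted(old_ids - new_ids),
        "updated": sorted(
            paper_id for paper_id in old_ids & new_ids
            if old[paper_id] != new[paper_id]
        ),
    }


# Made-up snapshots for illustration.
previous = {"paper-1": "a1", "paper-2": "b1", "paper-3": "c1"}
current = {"paper-2": "b2", "paper-3": "c1", "paper-4": "d1"}

update = diff_snapshots(previous, current)
print(update)
```

Only the papers listed under "new" and "updated" would need to be re-processed by the extraction pipeline, and entries under "removed" would be dropped from the KG evidence.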

Results
As of June 14, 2020, we had collected 140K papers. We selected 25,534 peer-reviewed papers and constructed a KG that includes 7,230 Diseases, 9,123 Chemicals and 50,864 Genes, with 1,725,518 Chemical-Gene links, 5,556,670 Chemical-Disease links, and 77,844,574 Gene-Disease links. The KG has received more than 1,000 downloads. Our final generated reports are shared publicly. For each question, our framework provides answers along with detailed evidence, knowledge subgraphs, and image segmentation and analysis results. Table 1 shows some example answers.
Several clinicians and medical school students in our team have manually reviewed the drug repurposing reports for three drugs, and also the KGs connecting 41 drugs and COVID-19 related chemicals/genes. In checking the evidence sentences and reading the original articles, they reported that most of our output is informative and valid. For instance, after the coronavirus enters the cell in the lungs, it can cause a severe disease called Acute Respiratory Distress Syndrome. This condition causes the release of inflammatory molecules in the body named cytokines, such as Interleukin-2, Interleukin-6, Tumor Necrosis Factor, and Interleukin-10. We see all of these connections in our results, such as the examples shown in Figure 3 and Figure 9. With further checks on these results, the scientists also indicated that many results were worth further investigation. For example, in Figure 3 we can see that Losartan is connected to tumor protein p53, which is related to lung cancer.

Related Work
Prior work has extracted entities (Habibi et al., 2017; Crichton et al., 2017; Wang et al., 2018; Beltagy et al., 2019; Alsentzer et al., 2019; Wei et al., 2019; Wang et al., 2020c), relations (Uzuner et al., 2011; Krallinger et al., 2011; Manandhar and Yuret, 2013; Bui et al., 2014; Peng et al., 2016; Wei et al., 2015; Peng et al., 2017; Luo et al., 2017; Wei et al., 2019; Li and Ji, 2019; Peng et al., 2019, 2020), and events (Ananiadou et al., 2010; Van Landeghem et al., 2013; Nédellec et al., 2013; Deléger et al., 2016; Wei et al., 2019; Li et al., 2019; ShafieiBavani et al., 2020) from biomedical literature, with the most recent work focused on COVID-19 literature (Hope et al., 2020; Ilievski et al., 2020; Wolinski, 2020; Ahamed and Samad, 2020).
Most of the recent biomedical QA work (Yang et al., 2015, 2016; Chandu et al., 2017; Kraus et al., 2017) is driven by the BioASQ initiative (Tsatsaronis et al., 2015), and many live QA systems, including COVIDASK and AUEB, and search engines (Kricka et al., 2020; Esteva et al., 2020; Hope et al., 2020; Taub Tabib et al., 2020) have been developed. Our work is an application and extension of our recently developed multimedia knowledge extraction system for the news domain (Li et al., 2020a,b). As in the news domain, the knowledge elements extracted from text and images in the literature are complementary. Our framework advances the state of the art by extending the knowledge elements to more fine-grained types and by incorporating image analysis, cross-media knowledge grounding, and KG matching into QA.

Conclusions and Future Work
We have developed a novel framework, COVID-KG, that automatically transforms a massive scientific literature corpus into organized, structured, and actionable KGs, and uses them to answer questions in drug repurposing reporting. With COVID-KG, researchers and clinicians are able to obtain informative answers from scientific literature, and thus focus on more important hypothesis testing and prioritize the analysis efforts for candidate exploration directions. In our ongoing work, we have created a new ontology that includes 77 entity subtypes and 58 event subtypes, and we are building a neural IE system following this new ontology. In the future, we plan to extend COVID-KG to automate the creation of new hypotheses by predicting new links. We will also create a multimedia common semantic space (Li et al., 2020a,b) for literature and apply it to improve cross-media knowledge grounding and inference.

Ethical Considerations
Required Workflow for Using Our System
Human review required. Our knowledge discovery tool provides investigative leads for pre-clinical research, not final results for clinical use. Currently, biomedical researchers scour the literature to identify candidate drugs, then follow a standard research methodology to investigate their actual utility (involving literature reviews, computer simulations of drug mechanisms and effectiveness, in vitro studies, cellular in vivo studies, etc., before moving to clinical studies). Our tool COVID-KG (like all knowledge discovery tools for biomedical applications) is not meant to be used for direct clinical applications on any human subjects. Rather, our tool aims to highlight unseen relations and patterns in large amounts of scientific textual data that would be too time-consuming to find through manual human effort. Accordingly, the tool would be useful for stakeholders (e.g., biomedical scientists) to identify specific drug candidates and molecular targets that are relevant to their biomedical and clinical research aims. Use of our knowledge discovery tool allows the researcher to rapidly narrow down the set of candidate drugs to investigate, but then proceed with the usual sequence of steps before kicking off expensive and time-consuming clinical tests. Failure to follow this sequence of events, and use of the system without the required human review, could lead to misguided experimental design, wasting time and resources.
Check evidence and source before using our system results. In addition, our tool provides source and rich evidence sentences for each node and link in the KG. To curtail potential harms caused by extraction errors, users of the knowledge graphs should double-check the source information and verify the accuracy of the discovered leads before launching expensive experimental studies. We spell out here the positive values, as well as the limitations and possible solutions to address these issues for future improvement. Moreover, any planned investigations involving human subjects should first be approved by the stakeholder's IRB (Institutional Review Board), which will oversee the safety of the proposed studies and the role of COVID-KG before any experimental studies are conducted. COVID-KG is a tool to enhance biomedical and clinical research; it is not a tool for direct clinical application with human subjects.

Limitations of System Performance and Data Collection
System errors. Our system can effectively convert a large amount of scientific papers into knowledge graphs, and can scale as literature volume increases. However, none of our extraction components is perfect; they produce about 6%-22% false alarms and misses, as reported in Section 2. But as we described in the workflow, all of the connections and answers will be validated by domain experts by checking their corresponding sources before they are included in the drug repurposing report. COVID-KG is developed for pre-clinical research to help biomedical scientists narrow down drugs of interest. Therefore, no human subjects or specific populations are directly subjected to COVID-KG unless approved by the stakeholder's IRB, which oversees the safety and ethical aspects of the clinical studies in accordance with the Belmont report (https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html). Accordingly, COVID-KG will not impose direct harm on vulnerable human cohorts or populations, unless misused by the stakeholders without IRB approval. With regard to potential harm in pre-clinical studies, users of COVID-KG are advised to verify the accuracy of the discovered leads in the source information before conducting expensive experimental studies.
Bias in training data. Proper use of the technology requires that input documents are legally and ethically obtained. Regulations and standards (e.g., GDPR, the General Data Protection Regulation of the European Union, https://gdpr.eu/what-is-gdpr/) provide a legal framework for ensuring that such data is properly used and that any individual whose data is used has the right to request its removal. In the absence of such regulation, society relies on those who apply technology to ensure that data is used in an ethical way. The input data to our system is peer-reviewed, publicly available scientific articles. An additional potential harm could come from the output of the system being used in ways that magnify the system errors or bias in its training data. The various components in our system rely on weak distant supervision based on large-scale external knowledge bases and ontologies that cover a wide range of topics in the biomedical domain. Nevertheless, our system output is intended for human interpretation. We do not endorse incorporating the system's output into an automatic decision-making system without human validation; this fails to meet our recommendations and could yield harmful results. In the cited technical reports for each component in our framework, we have reported detailed error rates for each type of knowledge element from system evaluations and provide detailed qualitative analysis and explanations.
Bias in development data. We also note that the performance of our system components as reported is based on specific benchmark datasets, which could be affected by such data biases. Thus, questions concerning generalizability and fairness should be carefully considered. Within the research community, addressing data bias requires a combination of new data sources, research that mitigates the impact of bias, and, as done in (Mitchell et al., 2019), auditing data and models. Sections 2 and ?? cite data sources used for training to support future auditing. A general approach to properly using our system should incorporate ethics considerations as first-order principles in every step of the system design, maintain a high degree of transparency and interpretability of data, algorithms, models, and functionality throughout the system, make software available as open source for public verification and auditing, and explore countermeasures to protect vulnerable groups. In our ongoing and future work, we keep increasing the annotated dataset size, adding more rounds of user correction and validation, and iteratively incorporating feedback from domain experts who have used the tool, to create new benchmarks for retraining models and conducting more systematic evaluations. We recommend caution in using our system output until a more complete expert evaluation has occurred.
Bias in source. Furthermore, our system output may include some biases from the sources, by way of biases in the peer reviewing process. In our previous work (Yu et al., 2014; Ma et al., 2015; Zhi et al., 2015; Zhang et al., 2019), we have aggregated source profiles, knowledge graphs and evidence for fact-checking across sources. We plan to extend our framework to include fact-checking to enable practitioners and researchers to access up-to-the-minute information.
Bias in test queries. Finally, the queries (i.e., the lists of candidate drugs and proteins/genes) are provided by the users, who might have biases in their selection. Addressing the user's own biases falls outside the scope of our project, but as we have stated in the previous subsection, we direct users to carefully examine source information (author, publication date, etc.) and detailed evidence (contextual sentences and documents) associated with the extracted connections.

Figure 2: COVID-KG Overview: From Data to Semantics to Knowledge

Figure 3: Constructed KG Connecting Losartan (candidate drug in COVID-19) and cathepsin L pseudogene 2 (gene related to coronavirus), where red nodes represent chemicals, grey nodes represent genes, and edges represent gene-chemical relations.

Figure 4: Example of Fine-grained Entity Extraction

Figure 5: System Pipeline for Automatic Figure Extraction and Subfigure Segmentation. The figure image shown here is from (Kizziah et al., 2020).

Figure 6: Expanding KG through Subfigure Segmentation and Cross-modal Entity Grounding. The figure image shown here is from (Ekins and Coffee, 2015).