Extracting a Knowledge Base of Mechanisms from COVID-19 Papers

The COVID-19 pandemic has spawned a diverse body of scientific literature that is challenging to navigate, stimulating interest in automated tools to help find useful knowledge. We pursue the construction of a knowledge base (KB) of mechanisms—a fundamental concept across the sciences, which encompasses activities, functions and causal relations, ranging from cellular processes to economic impacts. We extract this information from the natural language of scientific papers by developing a broad, unified schema that strikes a balance between relevance and breadth. We annotate a dataset of mechanisms with our schema and train a model to extract mechanism relations from papers. Our experiments demonstrate the utility of our KB in supporting interdisciplinary scientific search over COVID-19 literature, outperforming the prominent PubMed search in a study with clinical experts. Our search engine, dataset and code are publicly available.


Introduction
The global effort to understand the SARS-CoV-2 virus and to mitigate the COVID-19 pandemic is an interdisciplinary endeavor with an intensity the world has rarely seen (Apuzzo and Kirkpatrick 2020). Scientists from many areas, ranging from microbiology to AI, are working to understand the disease, adding to a rapidly expanding body of literature which encompasses both past work on viruses and findings on the novel coronavirus (Wang et al. 2020b). As a recent example, a diverse group of scientists called attention to the airborne transmissibility of the virus based on research spanning virology, aerosol physics, flow dynamics, epidemiology, medicine and building engineering, stating, "expertise in many science and engineering areas enables us to understand the mechanisms behind generation of respiratory microdroplets, how viruses survive in microdroplets, and how airflow patterns carry microdroplets in buildings" (Queensland University of Technology 2020). In this paper, our overarching goal is to build a knowledge base (KB) that scientists can use to search and explore diverse interdisciplinary mechanisms in literature related to COVID-19.

[Figure 1 examples: (heparan sulfate proteoglycans, binds to target cells) — "The virus binds to target cells by use of heparan sulfate proteoglycans."; (image analysis, face mask detection) — "There are few studies about face mask detection based on image analysis."; (general anesthesia in infants, perioperative complications) — "Severe perioperative complications may occur during general anesthesia in infants."]
Figure 1 shows examples of the types of mechanisms we focus on. These include mentions of mechanistic activities (e.g., viral binding), of functions that natural or artificial entities serve (e.g., a protein used for binding, or image analysis used in public health), and also more indirect influences and associations (such as possible complications associated with a medical procedure). These relationships cover a wide range of domain-specific concepts in scientific papers, providing a unified language which can be used for domain-agnostic extraction and scientific search (with results as seen in Figure 1). More broadly, a KB of mechanisms across science could enable the transfer of ideas across disparate areas (Hope et al. 2017; Kittur et al. 2019), and assist in literature-based discovery (Swanson and Smalheiser 1996; Spangler et al. 2014; Nordon et al. 2019) by finding cross-document causal links (Swanson and Smalheiser 1996).
In biomedicine, information extraction (IE) has been used to extract mentions of pinpointed entities such as proteins or chemicals and their relations, including recently from coronavirus-related papers (Ilievski et al. 2020; Ahamed and Samad 2020; Hope et al. 2020). Some of these relations correspond to mechanisms (e.g., chemical-protein regulation, or drug-drug interactions), but capture only a fraction of the full breadth and depth of mechanisms in the literature. It is challenging to formulate comprehensive fine-grained schemas for diverse domains; on the other extreme, Open IE approaches (Etzioni et al. 2008; Stanovsky et al. 2018; Zhan and Zhao 2020) focus on general-purpose, schema-free extraction of relations, but many of the relations are generic and uninformative for scientific applications. In this work, we use open, free-form entities with a broad class of relations centered around mechanisms, to strike a balance between expressivity and breadth. Our unified view of mechanisms is designed to help generalize and scale the study of these important relations in the context of the COVID-19 emergency and more broadly. We lay the foundations for the framework, which we hope will open new avenues for boosting knowledge discovery across the sciences.
Our main contributions include:

Background: Mechanisms in Science
The concept of mechanisms, also referred to as functional relations, is fundamental in biomedical ontologies (Burek et al. 2006; Röhl 2012; Keeling et al. 2019), engineering (Hirtz et al. 2002), and across science. Mechanisms can be natural (e.g., the mechanism by which amylase in saliva breaks down starch into sugar), artificial (electronic devices), non-physical constructs (algorithms, economic policies), and very often a blend (a pacemaker regulating the beating of a heart through electricity and AI algorithms).
In our work we aim to achieve broad coverage of mechanism relations, extending to a wide range of entities and topics observed in COVID-19 papers. For example, in addition to areas such as medicine, microbiology, genetics, proteomics, zoology and virology, topics we cover in our mechanism annotations include computer science, public policies, flow dynamics, building engineering, macroeconomic impacts and international relations. In Homo Deus: A Brief History of Tomorrow (Harari 2016), the author writes: "While some experts are familiar with one field, such as AI, nanotechnology, big data or genetics [...] no one is capable of connecting all the dots and seeing [...] how breakthroughs in AI might impact nanotechnology, or vice versa." By building a KB with diverse, domain-agnostic mechanisms, we aim to make progress toward connecting those dots.
Exact definitions of mechanisms are subject to debate in the philosophy of science (Röhl 2012; Keeling et al. 2019). A dictionary definition of mechanisms refers to a natural or established process by which something takes place or is brought about. More intricate definitions discuss "complex systems producing a behavior", "entities and activities productive of regular changes", "a structure performing a function in virtue of its parts and operations", or the distinction between "correlative property changes" and "activity determining how a correlative change is achieved" (Röhl 2012). The schema we propose in this work (see Section 3, Figures 1, 2) draws inspiration from these existing definitions. We extract activities and functions, and also more general influences and associations. Our work is also related to a large body of literature on extracting information from biomedical papers. This information often corresponds to very specific types of mechanisms such as chemical-protein regulation and drug-drug interactions (Li et al. 2016; Segura Bedmar, Martínez, and Herrero Zazo 2013). In the CHEMPROT dataset (Li et al. 2016), for example, texts are annotated for relations capturing interactions between chemicals and proteins (e.g., up/down regulation). The annotation guidelines for CHEMPROT distinguish between direct and indirect interactions, between relations explicitly and implicitly referred to in the text, and between texts where "mechanistic information is available" and those where the nature of an interaction is more vague. A semantic predication schema akin to Semantic Role Labeling, with predicates such as X treats Y or X induces Y, has also been proposed (Kilicoglu et al. 2011). Concepts and relations in that work were also limited to a relatively narrow set of biomedical sub-domains and entities aligned with the UMLS biomedical ontology (Bodenreider 2004), such as names of drugs and diseases (see Section 5.1 for more details). Recent work has applied such tools to extract information from the CORD-19 corpus, for constructing KBs (Wise et al. 2020; Wang et al. 2020a) and visualizations (Hope et al. 2020). Unfortunately, biomedical ontologies suffer from cultural differences between disciplines that lead to a lack of a unified language (Wang et al. 2018) and many fragmented classes (Salvadores et al. 2013), with only a small fraction at the focus of mainstream biomedical IE. In the next section, we present our schema for unified extraction of mechanisms, in biomedicine and beyond.
3 Task and Data

Relation Schema
Our goal is to extract information expressing the important notion of mechanisms. As discussed in Section 2, this seemingly intuitive concept is subject to debate, and an absolute definition is elusive. We opt for a practical approach that is simple enough for annotators and models, inspired by the definitions and schema discussed in Section 2.
Within the concept of mechanisms, we include activities (e.g., binding) or explicit mentions of functions (e.g., a use for treating), and also influences or associations of a more indirect nature (such as describing observed effects, without describing the process involved). We further break the concept of mechanisms into relations of the form (subject, object, class) with two coarse-grained classes. The first, which we call direct mechanisms, includes mechanistic activities and references to specific functions. The second, indirect mechanisms, includes influences or associations without explicit mechanistic information or mention of a function, and relations that are expressed more implicitly in the text.
Indirect mechanisms correspond to texts indicating "input-output correlations" (Röhl 2012), such as indicating that COVID-19 may lead to certain symptoms but not how, or mentioning a general association between two proteins. Direct mechanisms describe "inner workings", revealing more of the intermediate states that lead from initial conditions (COVID-19) to final states (symptoms) (Röhl 2012), or describing explicitly the function served by an entity (whether natural or human-made). This distinction is inspired by the direct and indirect types of relations in the CHEMPROT chemical-protein regulation schema, but covers a much broader set of concepts and domains.
Figure 1 shows some examples, such as SARS-CoV-2 binding to target cells (direct mechanism), the use of image analysis for face mask detection (direct mechanism), and complications generally associated with a medical procedure (indirect mechanism). Our annotation guideline, available in the supplement, shows many more instantiations of these relations.
Finally, to be able to more directly interpret mechanism relations beyond the coarse-grained categorization, we also experimented with granular relations of the form subject-predicate-object, where predicates represent a specific type of mechanism relation explicitly mentioned in the text (e.g., binds, causes, reduces; see Figure 2). While more granular, these relations are also less general, as the natural language of scientific papers describing mechanisms often does not conform to this more rigid structure (in Table 1, there are 400 coarse relations that could not be converted to granular form by an annotator). In our experiments, we also train a model that infers the predicates (a list of frequent predicates is available in the supplement).
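The two relation forms described above, coarse (subject, object, class) tuples and granular subject-predicate-object triples, can be sketched as simple records. This is a minimal Python illustration; the class and field names are our own and are not part of any released code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MechanismRelation:
    """One mechanism relation extracted from a sentence.

    `label` is one of the two coarse classes; `predicate` optionally
    holds a single granular trigger (e.g. "binds") when the sentence
    expresses the relation explicitly enough to support one.
    """
    subject: str
    obj: str
    label: str                       # "DIRECT" or "INDIRECT"
    predicate: Optional[str] = None  # granular form, when available

# Examples mirroring Figure 1: a direct mechanism with a granular
# predicate, and an indirect association with no explicit predicate.
r1 = MechanismRelation("heparan sulfate proteoglycans",
                       "binds to target cells", "DIRECT", "binds")
r2 = MechanismRelation("general anesthesia in infants",
                       "perioperative complications", "INDIRECT")
```

The optional `predicate` field reflects the observation above that not every coarse relation can be converted to granular form.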

Dataset
We construct a dataset (called COFIE) of mechanism relations annotated in CORD-19 abstracts.

Annotation process. To obtain high-quality annotations of CORD-19 abstracts, we follow a three-stage process of (1) annotating entities and relations using biomedical experts, (2) unifying span boundaries with an NLP expert, and (3) verifying annotations with a bio-NLP expert.
In the first stage, 5 annotators with biomedical background annotate all relations reflecting mechanisms as defined in Section 3.1 (the full annotation guidelines can be found in the supplement). Annotators were given examples and had a one-hour training session using Prodigy, a platform with a GUI for rapid annotations (Montani and Honnibal 2018). Entities are annotated only when involved in a relation with another entity. Following Luan et al. (2018), annotators perform a greedy annotation, preferring longer spans whenever ambiguity occurs as to span boundaries.
We initially observed large variation between annotators and low agreement as measured with strict, exact matching criteria between relations. A deeper look revealed much of the disagreement was due to variations in annotation style rather than meaning. In particular, the largest reason for disagreement was differences in span boundaries, likely stemming from the challenging nature of our task with abstract, soft definitions of relations between free-form spans.
In the second stage of annotation, an NLP expert annotator carried out a round of style unification between annotators to enhance dataset quality. The NLP expert unified entity annotations by adjusting span boundaries while preserving the original meaning. In the last stage, a bio-NLP expert with experience in annotating scientific papers verified the annotations and corrected them as needed. We observe that the bio-NLP expert accepted 81% of the annotations from the second stage without modification, confirming the high quality of the annotated data.

Task Definition
Given an input document D represented as a sequence of input tokens {w_1, ..., w_n}, the task is to identify all mentions of mechanism relations in D, including the entities participating in those relations.
Entities. In COFIE, we only annotate entity mentions that participate in one of our two relation categories (direct/indirect). These mentions all share a single common entity type. A mention e = (e_start, e_end) is represented by the indices of its start and end tokens in D.
Relations. A coarse relation is represented as a tuple r_c = (s, o, y), where s and o are the subject and object entities.
The relation label is given by y ∈ {DIRECT, INDIRECT}.
A granular relation is represented as a tuple r_g = (s, p, o).
The s and o slots are the same as in coarse relations. The predicate p represents a specific type of mechanism relation (which may be direct or indirect). For simplicity, we constrain p to consist of a single token (usually a verb); p is therefore represented by its token index in D.
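The span-and-index representation above can be sketched as follows. This is an illustrative Python sketch with type names of our own choosing, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    start: int  # index of the first token in the document
    end: int    # index of the last token (inclusive)

    def tokens(self, doc):
        """Recover the token span from the tokenized document."""
        return doc[self.start : self.end + 1]

@dataclass(frozen=True)
class CoarseRelation:
    subject: Mention
    obj: Mention
    label: str  # y in {"DIRECT", "INDIRECT"}

@dataclass(frozen=True)
class GranularRelation:
    subject: Mention
    predicate: int  # token index of the single-token predicate p
    obj: Mention

# Toy document and one granular relation over it.
doc = "The virus binds to target cells".split()
rel = GranularRelation(Mention(0, 1), 2, Mention(3, 5))
```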

Evaluation metrics
We evaluate our performance in the tasks of entity identification and relation extraction, defined as follows.
Entity identification. Given a boolean span matching function m(s_1, s_2) = 1(s_1 matches s_2), a predicted entity mention ê is correctly identified if there exists some gold mention e* in D such that m(ê, e*) = 1. We experiment with three different span matching functions. The most conservative is m_exact, which is true if two spans have the same start and end tokens. Given the heterogeneous nature of the spans present in our dataset, this metric is overly stringent. Therefore, following common practice in work on Open IE (Stanovsky et al. 2018), we also report results using two more lenient matching functions. The similarity function m_rouge(s_1, s_2) is true if Rouge-L(s_1, s_2) > 0.5 (Lin 2004), and the function m_subset(s_1, s_2) is true if s_1 is contained in s_2 or vice versa.

[...] dataset, with 100 sentence pairs from biomedical papers annotated for similarity. This approach allows us to capture related concepts (such as cardiac injury and cardiovascular morbidity), as well as simpler surface matches.
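The three span matching functions can be sketched directly from their definitions. This is a minimal Python sketch; `m_subset` checks contiguous token containment, which is our reading of "contained in":

```python
def _lcs_len(a, b):
    # Classic dynamic-programming longest common subsequence length.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def rouge_l(s1, s2):
    """Rouge-L F1 between two token sequences."""
    lcs = _lcs_len(s1, s2)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(s1), lcs / len(s2)
    return 2 * p * r / (p + r)

def m_exact(s1, s2):
    return s1 == s2

def m_rouge(s1, s2, threshold=0.5):
    return rouge_l(s1, s2) > threshold

def m_subset(s1, s2):
    # True if one token span is a contiguous subspan of the other.
    def contains(outer, inner):
        return any(outer[i : i + len(inner)] == inner
                   for i in range(len(outer) - len(inner) + 1))
    return contains(s1, s2) or contains(s2, s1)
```

For example, the spans "heparan sulfate proteoglycans" and "sulfate proteoglycans" fail `m_exact` but pass both lenient matchers (their Rouge-L F1 is 0.8).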

Relation identification / classification
3. Approximate nearest neighbors search. Finally, to perform search over this KB in an efficient manner, we employ a recent system (Johnson, Douze, and Jégou 2017) designed for fast similarity-based search over vectors (such as our text embeddings). We create an index of embeddings corresponding to the 900K unique surface forms. In our setting, queries consist of terms representing the subject and/or object of a relation. Queries are embedded using the same language model. Relations are retrieved within 2 seconds on a standard CPU-only laptop.
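As a rough sketch of this retrieval step, the brute-force index below illustrates the idea behind similarity search over normalized embeddings: after L2-normalizing vectors once, cosine similarity reduces to a dot product. The actual system uses the FAISS library of Johnson, Douze, and Jégou; this class is a simplified stand-in, and the vectors are toy values rather than real language-model embeddings:

```python
import math

def _normalize(v):
    """L2-normalize a vector so dot products equal cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

class CosineIndex:
    """Brute-force stand-in for an approximate-NN index such as FAISS."""

    def __init__(self):
        self.keys, self.vecs = [], []

    def add(self, key, vec):
        self.keys.append(key)
        self.vecs.append(_normalize(vec))

    def search(self, query, k=5):
        """Return the k entries most similar to the query vector."""
        q = _normalize(query)
        scored = [(sum(a * b for a, b in zip(q, v)), key)
                  for key, v in zip(self.keys, self.vecs)]
        scored.sort(reverse=True)
        return scored[:k]

# Toy usage: index two surface forms and query near the first.
index = CosineIndex()
index.add("binds ACE2 receptor", [0.9, 0.1, 0.0])
index.add("face mask detection", [0.0, 0.2, 0.9])
top = index.search([1.0, 0.0, 0.0], k=1)
```

A real index over 900K surface forms would use an approximate structure (e.g. inverted lists or HNSW) rather than exhaustive scoring, which is what makes sub-2-second retrieval feasible.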

Evaluating Extracted Relations
We evaluate the extracted entities and relations on the three sub-tasks introduced in Section 3.4: relation classification, relation identification, and entity identification.

Setup
Implementation Details. We use the DYGIE library, with SciBERT (Beltagy, Lo, and Cohan 2019) token embeddings finetuned on our task. We employ minimal random hyperparameter search to select the best-performing model on the development set. Full details and code are in the supplement.

Baselines. We compare our method with the baselines below. Some of these baselines involve training a model on an existing dataset. In these cases, we preprocess the dataset by "mapping" all relation types to the direct mechanism or indirect mechanism relations in COFIE. This mapping was performed by a bio-NLP expert annotator.
SemRep. We train DYGIE on the SemRep dataset (Kilicoglu et al. 2011), consisting of 500 sentences from MEDLINE abstracts annotated for semantic predication. Concepts and relations in this dataset relate to clinical medicine, substance interactions, genetic etiology of disease and pharmacogenomics. Concepts are tied to the UMLS biomedical ontology (Bodenreider 2004) and focused on pinpointed entities as in most biomedical IE resources. Some of the relations correspond to mechanisms (such as X TREATS Y or X CAUSES Y); other relations are even broader, such as PART-OF or IS-A. We do not attempt to capture these categories as they often do not reflect a functional relation.

SciERC
We train DYGIE on the SciERC dataset (Luan et al. 2018), consisting of 500 abstracts from computer science papers annotated for a set of relations, including USED-FOR relations between methods and tasks. We naturally map this relation to our MECHANISM label and discard other relation types.
SRL. Our task consists of functional relations between open, flexible spans. A natural baseline to try is thus Semantic Role Labeling (SRL). Using a pre-trained SRL model (Shi and Lin 2019), we select relations of the form (Arg0, verb, Arg1), and evaluate using our partial metrics applied to Arg0 and Arg1 respectively.

SRL-Bio predicates. We adapt the SRL baseline by filtering predicates down to a list of 80 biomedical verbs that are publicly available from a biomedical proposition bank named BioProp (Chou et al. 2006).

SRL-Mechanism
Continuing the above baseline, we task a biomedical annotator with mapping each predicate verb to either DIRECT MECHANISM or INDIRECT MECHANISM, using this mapping as SRL's predictions.

OpenIE. Finally, we also experiment with the supervised Open Information Extraction model of Stanovsky et al. (2018), similar in nature and in motivation to SRL.

Automatic evaluations
Relation Prediction. Table 2 reports relation extraction F1 scores; our model outperforms all baselines, primarily showing the inability of existing frameworks to capture an important class of relations. Table 3 shows a comparison of precision, precision@K and recall of the DYGIE model trained on our data and on SciERC. The model trained on our data achieves 90% P@50, indicating that correct predictions are assigned higher confidence scores. Figure 3 (left) shows precision scores for the top K predicted relations, sorted by prediction confidence. The DYGIE model trained on COFIE maintains a high precision score (≥ 70%) within the top 20% of predictions. As discussed in Section 4, we construct a KB by filtering for high-confidence relations, thus having high P@K is important.
In Figure 3 (right) we show the relation identification F1 score for different thresholds of the Rouge-L matching metric. Our default threshold is 0.5. We conduct this analysis primarily to make sure results are reasonably robust in a local neighborhood around 0.5. As expected, we observe a steady decline in F1 as the threshold increases; however, the curve declines moderately, and even with the most conservative threshold of 1.0 (i.e., an exact match) F1 is substantially higher than our best performing baseline (SciERC).

Granular relation prediction. In addition to coarse-grained relation prediction, we also train a model on COFIE-G and measure the prediction quality. Our evaluation shows that the model trained to predict (s, predicate, o) triples achieves F1 scores of 43.2 and 24.4 using the substring and exact match metrics, respectively. When predicting relations without trigger labels (i.e., (s, o)), the model achieves F1 scores of 51.7 and 27.6 on the same two metrics.
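The P@K measure used above can be sketched as follows; this is an illustrative Python snippet in which `predictions` pairs a model confidence score with a correctness judgment against the gold annotations:

```python
def precision_at_k(predictions, k):
    """Precision over the k most confident predictions.

    predictions: list of (confidence, is_correct) pairs, where
    is_correct is a boolean judgment against gold annotations.
    """
    top = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    return sum(1 for _, correct in top if correct) / len(top)

# Toy predictions: high-confidence ones happen to be correct,
# which is exactly the property P@K is designed to reward.
preds = [(0.95, True), (0.90, True), (0.60, False), (0.40, True)]
```

High P@K with lower overall precision, as observed for our model, means correct predictions concentrate at the top of the confidence ranking, which is what matters when the KB keeps only high-confidence relations.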

Human evaluation of predicted relations
To complement the automatic evaluation metrics above, we conduct an additional human evaluation to measure the quality of the predictions in our KB: do automated metrics capture the true quality of predictions, or are they an under/overestimation as measured by human judgments?
We employ two annotators with biomedical and computer science backgrounds, and show them predicted relations for sentences selected randomly from our test data so that we can compare to our automated metrics over ground truth annotations. In particular, we show each annotator 200 relations and the sentences from which they were extracted: 100 predicted by our approach (DYGIE trained on COFIE), and 100 using the pre-trained semantic role labeling (SRL) baseline described above. Annotators are asked to evaluate relations in a similar fashion to our annotation guidelines, tagging relations as positive if they reflect a mechanism described in the text, and if they consist of coherent argument spans that capture essential information for the relation to hold, but not redundant or irrelevant information. Table 4 shows a major increase in relation accuracy, as compared to results obtained with automated metrics. In particular, we reach an average positive rating of over 91%, a high figure in absolute terms, and more than double the rating of SRL (41.7%). Inter-rater agreement was high at 71 by Cohen's Kappa and 73 by the Matthews Correlation Coefficient (MCC). Interestingly, we observe that while in absolute numbers the gap between our human-evaluated accuracy and the partial metrics is high, there is strong correlation between them in the overall trend. In particular, accuracy as measured using the Rouge-L score to match relations increases more than two-fold for our model in comparison to SRL (from 22.4 to 55.2). A similar trend is seen for the substring-inclusion measure (from 28.7 to 68.6).
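For reference, the two agreement statistics can be computed as follows. This is a self-contained Python sketch for two raters with binary labels; the rater labels shown are toy values, not our actual annotations:

```python
import math

def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pe = sum((a.count(c) / n) * (b.count(c) / n)  # chance agreement
             for c in set(a) | set(b))
    return (po - pe) / (1 - pe)

def mcc(a, b):
    """Matthews Correlation Coefficient, treating rater a as reference."""
    tp = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    tn = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    fp = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    fn = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Two hypothetical raters labeling 8 relations as positive (1) / negative (0).
r1 = [1, 1, 1, 0, 0, 1, 0, 1]
r2 = [1, 1, 0, 0, 0, 1, 0, 1]
```

On this toy pair the raters agree on 7 of 8 items, giving a kappa of 0.75; MCC is computed from the 2x2 confusion counts between the two raters.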
We conclude via human judgments that our predicted relations are of overall sufficiently high quality, and that our automated metrics correlate with human judgments.

Knowledge Base evaluations
We show how our system can be used to search for mechanism relations across a KB of 2M functional relations, and evaluate its utility in two search applications: studying the viral mechanisms of SARS-CoV-2, and discovering medical applications of AI in the literature.

Targeted biomedical search. This task involves searching for SARS-CoV-2 mechanism relations focused on a specific well-known statement or question regarding the virus (e.g., SARS-CoV-2 binds the ACE2 receptor to gain entry into cells).

Table 4: Human evaluation stats for our predictions vs. the baseline SRL. We note that human evaluation scores are considerably higher than captured by our automated metrics with respect to ground truth annotations, yet still correlated with them, indicating their usefulness in this challenging setting.
In this scenario, we issue queries which specify both the subject and the object of a mechanism relation (e.g., for a given relation (s_1 = SARS-CoV-2, s_2 = binds ACE2 receptor), retrieve relations where s*_1 is relevant/similar to "SARS-CoV-2" and s*_2 relevant to "binds ACE2 receptor"). This task is designed to test our framework's ability to support researchers looking to quickly generate a list of relations pertaining to a specific hypothesis.

Open-ended cross-domain search. This task is focused on discovering diverse ways in which AI research areas or methods are applied in the CORD-19 corpus. Unlike the previous scenario, here evaluators are given queries where only the subject of the relation, s_1, is specified, with queries consisting of popular, leading subfields and methods within AI (e.g., deep reinforcement learning or text analysis). The aim of this task is to evaluate whether we can support exploratory search over relations, potentially surfacing inspirations for new applications of AI against COVID-19, or helping biomedical researchers and practitioners discover where AI methods are being used.

Task Setup and Evaluation. In both tasks, our goal is to see if we can retrieve relevant relations that expert annotators consider useful and correct. To evaluate these tasks, we re[...]

Table 5 shows examples from the two tasks. In the search focusing on viral mechanisms, 10 claims written by a medical student regarding COVID-19 were taken from a collection of statements prepared in recent automated scientific claim-verification work (Wadden et al. 2020). For example, in Table 5, one such statement regards an association (indirect mechanism) between cardiac injury and COVID-19. We formulate a query for indirect mechanism relations, shown in the second column of the table. In the second task, focusing on exploring AI applications, we select a representative list of top methods and areas within AI. Task descriptions, queries and instructions are available in the supplement.
For both tasks, given a query we retrieve the top 1000 most similar relations from our KB, requiring the cosine similarity between the embeddings of each of s_1 (subject) and s_2 (object) and the query to be at least 0.75. For queries consisting of both s_1 and s_2 terms, we compute the average similarities to their respective query terms, and then select the top and bottom 10 relations (20 per query, 200 per task, and 400 relations in total), shuffle their order, and present them to annotators together with the original sentence from which each relation was extracted. We ask evaluators to rate whether the retrieved relations are relevant, as judged by the system's ability to identify (s_1, s_2) relations such that (1) they are relevant to the query, and (2) the sentence in which (s_1, s_2) are mentioned expresses a mechanism relation between the two terms, rather than incidentally mentioning them together. This is a challenging task, evaluating both retrieval of relations that are semantically similar to the query, and also accurate extraction of relations from sentences. In total, we collect 1700 relevance labels across both tasks.
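The query procedure above, thresholding each specified slot and ranking by average similarity, can be sketched as follows. This is an illustrative Python sketch: the paper compares language-model embeddings, whereas `token_jaccard` below is a toy stand-in for that similarity function, and the relations shown are examples from Figure 1:

```python
def token_jaccard(a, b):
    """Toy phrase similarity in [0, 1]; stands in for embedding cosine."""
    A, B = set(a.lower().split()), set(b.lower().split())
    return len(A & B) / len(A | B)

def retrieve(relations, query_subj, query_obj=None, sim=token_jaccard,
             threshold=0.75, top_k=1000):
    """relations: list of (subject, object) string pairs.

    A relation qualifies only if every specified query slot meets the
    similarity threshold; qualifying relations are ranked by the
    average similarity across the specified slots.
    """
    scored = []
    for s, o in relations:
        sims = [sim(s, query_subj)]
        if query_obj is not None:
            sims.append(sim(o, query_obj))
        if all(x >= threshold for x in sims):
            scored.append((sum(sims) / len(sims), (s, o)))
    scored.sort(reverse=True)
    return scored[:top_k]

relations = [("SARS-CoV-2", "binds ACE2 receptor"),
             ("image analysis", "face mask detection")]
hits = retrieve(relations, "SARS-CoV-2", "binds ACE2 receptor")
```

Leaving `query_obj` unset corresponds to the open-ended cross-domain task, where only the subject slot of the query is specified.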
Results. Figure 4 (left) shows our results for both tasks. We rank results by their similarity to the query as described, and measure precision and recall. We measure average pairwise annotator agreement with several metrics: accuracy (proportion of matching labels), F1 (taking into account precision and recall symmetrically), balanced accuracy (downweighting the positive ratings to counter their higher proportion), and the Matthews Correlation Coefficient (MCC).
In the viral mechanisms task, we achieve a high precision of 90% that remains stable for recall values as high as 70%. This reflects our experiment in which annotators viewed relations constrained to be highly or moderately similar to the query, and our ability to retrieve relevant relations. Agreement is relatively high by all metrics. In the AI applications task, our model achieves a precision of 85% at a recall of 40%, but precision drops more quickly. This is likely due to the more exploratory nature of the task, and the use of concepts from biomedicine and computer science with jargon subtleties not all annotators could precisely understand (e.g., network models vs. neural networks). Despite this challenge, overall agreement was high or moderate in the AI task too.

Conclusion
In this paper we extract a knowledge base (KB) of mechanism and effect relations from papers relating to COVID-19. Our KB can help scientists search and explore relations spanning viral mechanisms of action, diagnostic algorithms, disease symptoms and many more. Our search engine, dataset and code are publicly available.

Figure 1 :
Figure 1: Our knowledge base of mechanism relations spans a wide range of activities, functions, and influences extracted from CORD-19, a corpus of papers related to COVID-19.
Figure 3: (Left) Precision@K of our model compared to the pre-trained SciERC baseline. P@K for our model is high in absolute numbers. (Right) F1 as a function of the Rouge-L span matching threshold. Our default threshold is 0.5.

Figure 4 :
Figure 4: (Left) Precision vs. recall for the search tasks (viral mechanisms, AI methods). Retrieved relations are ranked by embedding-based similarity to a query and compared to gold labels for evaluation. (Right) Average pairwise annotator agreement by several metrics. In the AI task human labels were more diverse but with overall high precision/recall.

Table 2 :
F1 scores of partial and exact matching metrics. Relations from SRL and OpenIE do not map directly to the DIRECT MECHANISM and INDIRECT MECHANISM classes, and do not have relation classification scores. We also explore mapping SRL predicates to these two classes.

Table 5 :
Queries and example results retrieved (sentences cut to fit). The subject/object (s_1/s_2) of extracted relations appear in bold.