CovRelex: A COVID-19 Retrieval System with Relation Extraction

This paper presents CovRelex, a scientific paper retrieval system that targets entities and relations via relation extraction on COVID-19 scientific papers. The system aims to help users efficiently acquire knowledge from the large and rapidly growing body of COVID-19 scientific papers. Our system can be accessed via https://www.jaist.ac.jp/is/labs/nguyen-lab/systems/covrelex/.


Introduction
This work aims to facilitate knowledge acquisition from the huge number of COVID-19 scientific papers. Due to the COVID-19 outbreak, researchers have focused on studying the virus and have rapidly published a huge number of papers. According to the estimate of Silva et al. (2020), 23,634 unique documents were published in just six months, between January 1st and June 30th, 2020.
In the records of the COVID-19 Open Research Dataset (CORD-19) Challenge (https://www.kaggle.com/allen-institute-for-ai/), the number of collected papers about COVID-19, SARS-CoV-2, and related coronaviruses exceeded 400K by January 9th, 2021. The rapid pace of publication and the sheer number of related papers make it challenging for specialists to connect findings across papers efficiently and in a timely manner.
When focusing on knowledge acquisition about biomedical entities, several questions can be asked regarding the entities and their relations:
• Which papers mention entity E_1?
• Which papers mention the relation R between entity E_1 and entity E_2?
• Which papers mention the relation R_1 between entity E_1 and entity E_2, and the relation R_2 between entity E_2 and entity E_3?
• What relations R_x exist between entity E_1 and entity E_2, and in which papers?
• What entity E_x has relation R with entity E_1, and in which papers?
Such questions can be answered by our system.

Related Work
FACTA+ (Tsuruoka et al., 2008, 2011) was presented as a text search engine that helps users discover and visualize indirect associations between biomedical concepts from MEDLINE abstracts. Liu et al. (2015) introduced an online text-mining system ( [...] variants and other associated entities such as diseases and chemicals/drugs. [...] presented a web service, PubTator Central (PTC), that provides automated bioconcept annotations in full-text biomedical articles, in which bioconcepts are extracted by state-of-the-art text mining systems.
Due to the COVID-19 outbreak, it is essential to extract valuable knowledge from the huge number of COVID-19-related papers in order to deal with the pandemic effectively. Sohrab et al. (2020) introduced the BENNERD system, which detects named entities in biomedical text and links them to the Unified Medical Language System (UMLS) to facilitate COVID-19 research. Hope et al. (2020) created a dataset annotated for mechanism relations and trained an information extraction model on this data. They then used the model to extract a Knowledge Base (KB) of mechanism and effect relations from papers relating to COVID-19. Zhang et al. (2020) built Covidex, a search infrastructure that provides information access to the COVID-19 Open Research Dataset, such as answering questions. Esteva et al. (2020) presented Co-Search, a retriever-ranker semantic search engine designed to handle complex queries over the COVID-19 literature. Wang et al. (2020) created the web-based EvidenceMiner system. Given a query as a natural language statement, EvidenceMiner automatically retrieves sentence-level textual evidence from the CORD-19 corpus.
Previous works have made a great effort to acquire useful knowledge from the COVID-19 literature, such as recognizing biomedical entities (Sohrab et al., 2020), extracting mechanism relations between entities (Hope et al., 2020), or retrieving relevant text segments based on the user query (Zhang et al., 2020; Wang et al., 2020). However, there is still no system that can automatically detect both entities of various types and their diverse relations across papers, which matters especially when COVID-19 papers are published rapidly. This motivates us to build the CovRelex system, which aims to exploit such information.

Overview
The core of our system is built by extracting an enormous number of relations from COVID-19-related scientific papers (in the CORD-19 corpus) with several open-domain relation extraction methods. The extracted relations are represented not only in the original form produced by the extraction methods but also by the biomedical entities they contain. Furthermore, the relations are clustered and scored for their informativeness over the corpus (Fig. 1).
A relation is a triplet of the form (arg_1, rel, arg_2), where arg_1 and arg_2 are noun phrases which may contain biomedical entities, and rel is an expression describing the directed relation from arg_1 to arg_2 (shown in Fig. 2).

Relation Extraction
With the objective of extracting as many relations as possible, we employ several relation extraction methods. Each method has its own characteristics and thus may extract different kinds of relations. By combining several methods, we obtain higher extraction coverage. The methods are briefly described as follows.
• ReVerb (Fader et al., 2011) tackles the problems of incoherent and uninformative extractions by introducing constraints on binary, verb-based relation phrases.
• OLLIE (Mausam et al., 2012) addresses the problem that Open IE systems such as ReVerb only extract relations that are mediated by verbs. OLLIE also extracts relations mediated by nouns, adjectives, and more.
• ClausIE (Del Corro and Gemulla, 2013) is a clause-based approach to open information extraction. It separates the detection of clauses and clause types from the actual generation of propositions.
• Relink (Tran and Nguyen, 2020) is a method partly inherited from ReVerb. It extracts relations from connected phrases rather than identifying clause types as ClausIE does.
• OpenIE (Angeli et al., 2015) extracts relations by breaking a long sentence into short, coherent clauses, and then finds the maximally simple relations.
The extracted relations are also tagged with biomedical entities recognized by using entity recognition models presented in the next subsection.

Entity Recognition
We use biomedical entity recognition models that specialize in predicting entity types and are provided by SciSpacy (Neumann et al., 2019) (Table 1). Each model is trained on a different annotated corpus and thus covers a different set of biomedical entities. By using multiple models, we obtain varied, specialized entity information: chemicals and diseases with BC5CDR (Li et al., 2016); cell types, chemicals, proteins, and genes with CRAFT (Bada et al., 2012); cell lines, cell types, DNAs, RNAs, and proteins with JNLPBA (Collier and Kim, 2004); and cancer genetics with BioNLP13CG (Pyysalo et al., 2015).
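As a rough illustration of running several recognizers over the same text and pooling their output, the sketch below assumes the standard SciSpacy model package names for the four corpora. The helper names `load_scispacy_models` and `tag_entities`, and the pooling strategy (a set keyed by span, label, and model), are hypothetical and not taken from the system itself.

```python
from typing import Callable, Dict, List, Tuple

# SciSpacy NER model packages, one per annotated corpus.
# Each package must be installed separately before it can be loaded.
MODEL_NAMES = [
    "en_ner_bc5cdr_md",
    "en_ner_craft_md",
    "en_ner_jnlpba_md",
    "en_ner_bionlp13cg_md",
]

# (start_char, end_char, text, label, model_name)
Entity = Tuple[int, int, str, str, str]


def load_scispacy_models(names: List[str] = MODEL_NAMES) -> Dict[str, Callable]:
    """Load each model with spaCy (requires spaCy and the model packages)."""
    import spacy  # imported lazily so the pooling logic runs without spaCy

    return {name: spacy.load(name) for name in names}


def tag_entities(text: str, pipelines: Dict[str, Callable]) -> List[Entity]:
    """Run every pipeline on the text and pool the entities, sorted by position."""
    pooled = set()
    for model_name, nlp in pipelines.items():
        doc = nlp(text)
        for ent in doc.ents:
            pooled.add((ent.start_char, ent.end_char, ent.text, ent.label_, model_name))
    return sorted(pooled)
```

Keeping the model name on each entity makes it possible to see which corpus-specific model produced a given type when two models tag the same span differently.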

Relation Clustering
We build a cluster hierarchy on a subset of the extracted relations (the subset contains all relations in which both arg_1 and arg_2 are biomedical entities), so that users can quickly find relation expressions of interest or choose clusters that may contain them.
We utilize FINCH (Sarfraz et al., 2019), a hierarchical clustering method, and BERT (Devlin et al., 2019) for this task. First, the BERT-Base model is used to encode each relation, written as a simple sentence "arg_1 rel arg_2", into a 768-dimensional vector. Then, FINCH is used to build the cluster hierarchy. For each cluster, representative expressions are selected from the rels of its most informative relations, as scored by the formula presented in the next subsection. The resulting cluster hierarchy is illustrated in Fig. 3.
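The first-neighbor idea behind FINCH can be sketched in a few lines. The code below is a simplified illustration of a single partition level only: each vector is linked to its nearest neighbor by cosine similarity, and connected groups become clusters. The full method repeats this step on cluster means to obtain the hierarchy. The function name `finch_first_partition` is hypothetical, and the input rows stand in for the 768-dimensional BERT vectors.

```python
import numpy as np


def finch_first_partition(vectors: np.ndarray) -> np.ndarray:
    """Return one cluster label per row using the first-neighbor linking rule."""
    # Cosine similarity between all pairs of (non-zero) row vectors.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a point is not its own neighbor
    first_nn = sim.argmax(axis=1)

    # Union-find: linking every point to its first neighbor also merges
    # points that share the same first neighbor.
    parent = list(range(len(unit)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in enumerate(first_nn):
        parent[find(i)] = find(int(j))

    roots = [find(i) for i in range(len(unit))]
    _, labels = np.unique(roots, return_inverse=True)
    return labels
```

Because the number of clusters falls out of the neighbor links, no cluster count or distance threshold has to be chosen in advance, which is convenient when the set of relations keeps growing.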

Relation Scoring
Relations are scored for informativeness based on Pointwise Mutual Information (PMI) (Church and Hanks, 1990), the association ratio for measuring word association norms, which is based on the information-theoretic concept of mutual information. The informativeness of a relation (arg_1, rel, arg_2) can be regarded as the PMI (Eq. 1) of two points, the arg-pair args = (arg_1, arg_2) and its relation expression rel, through their occurrence p(.).
\[ \mathrm{PMI}(args, rel) = \log_2 \frac{p(args, rel)}{p(args)\, p(rel)} \quad (1) \]
It is difficult to apply Eq. 1, which computes the occurrence by exact matching, for our system because of the variation and noise in the contents of the extracted relations. To mitigate the difficulty of using exact match, we propose to use cosine similarity with Tf-idf vectorization (Sparck Jones, 1988). While exact match counting of occurrence indicates the presence of an instance (args or rel) in the relation set, our use of cosine similarity indicates the presence of the contents of the instance in the relation set, thus can adapt to the variation and noise in the contents of the relations.
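For reference, Eq. 1 with exact-match counting can be sketched as follows. The function name `pmi_table` and the toy relations in the usage note are illustrative only. The probabilities are estimated as simple relative frequencies over the relation set, which is an assumption about how p(.) is computed.

```python
import math
from collections import Counter


def pmi_table(relations):
    """relations: list of (arg1, rel, arg2) triplets.

    Returns {((arg1, arg2), rel): PMI}, with probabilities estimated as
    relative frequencies under exact string matching.
    """
    n = len(relations)
    pair_rel = Counter(((a1, a2), r) for a1, r, a2 in relations)
    pairs = Counter((a1, a2) for a1, _, a2 in relations)
    rels = Counter(r for _, r, _ in relations)
    return {
        (args, r): math.log2((count / n) / ((pairs[args] / n) * (rels[r] / n)))
        for (args, r), count in pair_rel.items()
    }
```

Under exact matching, expressions such as "causes" and "cause" are counted as unrelated strings, which is the kind of variation that motivates the similarity-based score.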
\[ \mathrm{InfoScore}(args, rel) = \log_2 \frac{S(args, rel)}{S(args)\, S(rel)} \quad (2) \]
where (args', rel') are all relations other than (args, rel), args' are arg-pairs in all relations other than (args, rel), rel' are expressions in all relations other than (args, rel), and v(t_1, t_2, ..., t_n) is the vectorization function which concatenates the input texts t_1, t_2, ..., t_n and converts the concatenated text into a single Tf-idf vector.
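A minimal sketch of Eq. 2 is given below under explicit assumptions: S(x) is taken to be the sum of cosine similarities between the Tf-idf vector of x and the vectors of the corresponding items in all other relations, the Tf-idf weighting uses a common smoothed idf, and each of the three views (args with rel, args, rel) is vectorized separately. None of these choices is confirmed by the text above, and all function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """One sparse Tf-idf vector (dict) per text, with idf = log((1+N)/(1+df)) + 1."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    return [
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in d.items()}
        for d in docs
    ]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def summed_similarity(vectors, i):
    """Assumed S: total cosine similarity of item i to every other item."""
    return sum(cosine(vectors[i], v) for j, v in enumerate(vectors) if j != i)


def info_scores(relations):
    """relations: list of (arg1, rel, arg2). Returns one score per relation,
    or None where a sum is zero and the logarithm is undefined."""
    full = tfidf_vectors([f"{a1} {a2} {r}" for a1, r, a2 in relations])  # v(args, rel)
    args = tfidf_vectors([f"{a1} {a2}" for a1, _, a2 in relations])      # v(args)
    rels = tfidf_vectors([r for _, r, _ in relations])                   # v(rel)
    scores = []
    for i in range(len(relations)):
        s_full, s_args, s_rel = (summed_similarity(v, i) for v in (full, args, rels))
        defined = s_full > 0 and s_args > 0 and s_rel > 0
        scores.append(math.log2(s_full / (s_args * s_rel)) if defined else None)
    return scores
```

Unlike exact-match counts, partial token overlap (for example between "is caused by" and "can be caused by") contributes to the sums, so near-duplicate expressions support each other instead of being treated as distinct.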

Retrieval System
The retrieval system provides two kinds of queries: Single-Relation Query and Graph Query. While Single-Relation Query provides a simple way to search for specific relations, Graph Query provides a more sophisticated way to search for papers containing entities connected in a complex relation graph.

Single-Relation Query
A query consists of partial information about a relation, which can contain keywords about arg_1, arg_2, and rel, the types of entities possibly included in arg_1 or arg_2, or the clusters to which the relation belongs. The retrieved results are relevant relations with their corresponding papers. An example of Single-Relation Query is illustrated in Fig. 4. The query relation is (mers-cov, any-relation, DISEASE). The results are the best-matching relations, for instance, (MERS-CoV, include, "fever, chills/rigors, headache, non-productive cough"). The candidate relations are retrieved based on the keyword matching score computed by BM25 (Schütze et al., 2008) and on InfoScore (Eq. 2), and are then filtered by entity type and cluster. The keyword matching score and InfoScore can be weighted depending on whether the search should favor candidates with high lexical overlap with the query or candidates that are highly informative.
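The weighting of the two scores followed by filtering can be illustrated with the sketch below. The linear combination with a single weight `alpha`, the min-max normalization of each score, and all names (`Candidate`, `rank_candidates`) are assumptions made for illustration. The keyword score is taken as precomputed, for example by a BM25 implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Candidate:
    relation: Tuple[str, str, str]  # (arg1, rel, arg2)
    entity_types: Set[str]          # entity types found in the arguments
    keyword_score: float            # e.g. a BM25 score against the query keywords
    info_score: float               # InfoScore of the relation


def _min_max(values: List[float]) -> List[float]:
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def rank_candidates(cands: List[Candidate], alpha: float = 0.5,
                    required_type: Optional[str] = None) -> List[Candidate]:
    """Rank by alpha * keyword + (1 - alpha) * informativeness, then filter by type."""
    if not cands:
        return []
    kw = _min_max([c.keyword_score for c in cands])
    info = _min_max([c.info_score for c in cands])
    combined = [alpha * k + (1 - alpha) * s for k, s in zip(kw, info)]
    order = sorted(range(len(cands)), key=lambda i: combined[i], reverse=True)
    ranked = [cands[i] for i in order]
    if required_type is not None:
        ranked = [c for c in ranked if required_type in c.entity_types]
    return ranked
```

With `alpha` close to 1 the ranking favors lexical overlap with the query, and with `alpha` close to 0 it favors informative relations.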

Graph Query
This extends Single-Relation Query by enabling a more sophisticated paper search over a complex graph describing relations among entities. An example of Graph Query is illustrated in Fig. 5, with a query consisting of 4 relations: (mers-cov, cause, DISEASE), (CHEMICAL, any-relation, mers-cov), (CHEMICAL, any-relation, DISEASE), and (PROTEIN, any-relation, DISEASE). The result graph is built by linking the entities and relations obtained from each paper that matches the query graph. Entity linking is done through lexical matching and type matching. This approach is challenged by entities with synonyms and by the performance of entity recognition.
One special feature of Graph Query is Multi-Paper Graph Query, which supports searching for relations across multiple papers. The important use case is when the relations of interest are not described in one single paper, i.e., one entity is mentioned in different papers and thus engaged in different relations. For example, if users want to "find some CHEMICAL that can treat some DISEASE caused by COVID-19", they will look for two relations: (COVID-19, cause, DISEASE) and (CHEMICAL, treat, DISEASE). In that case, the two relations may be retrieved from two different papers. Therefore, aggregating information scattered across multiple papers is necessary for building a more comprehensive understanding. This is done through relation grouping, which allows users to divide the query graph into several segments, each belonging to a different paper. With the above example, users can define a query graph (the left-hand side of Fig. 6), and our system could find, among other results, that "pneumonia" is a DISEASE caused by COVID-19 and is treated with "Current [piperacillin-tazobactam] CHEMICAL regimens" (the right-hand side of Fig. 6) from two separate papers.
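The two-relation example above amounts to a join on the shared DISEASE entity between relations drawn from different papers. The sketch below illustrates that idea with a simplified record type. Entity linking is reduced to case-insensitive exact matching plus a type check, and the names `Relation` and `join_cause_and_treat`, as well as the paper identifiers in the usage example, are made up.

```python
from typing import List, NamedTuple, Tuple


class Relation(NamedTuple):
    paper_id: str
    arg1: str
    rel: str
    arg2: str
    arg1_type: str
    arg2_type: str


def join_cause_and_treat(relations: List[Relation],
                         source: str = "covid-19") -> List[Tuple[Relation, Relation]]:
    """Pair (source, cause, DISEASE) with (CHEMICAL, treat, DISEASE) relations
    that share the same disease but come from different papers."""
    causes = [r for r in relations
              if r.rel == "cause" and r.arg1.lower() == source and r.arg2_type == "DISEASE"]
    treats = [r for r in relations
              if r.rel == "treat" and r.arg1_type == "CHEMICAL" and r.arg2_type == "DISEASE"]
    return [(c, t) for c in causes for t in treats
            if c.arg2.lower() == t.arg2.lower() and c.paper_id != t.paper_id]
```

A usage example with toy records:

```python
toy = [
    Relation("paper-A", "COVID-19", "cause", "pneumonia", "DISEASE", "DISEASE"),
    Relation("paper-B", "piperacillin-tazobactam", "treat", "Pneumonia", "CHEMICAL", "DISEASE"),
]
for cause, treat in join_cause_and_treat(toy):
    print(cause.paper_id, treat.paper_id, treat.arg1, "->", cause.arg2)
```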

Corpus
We performed relation extraction and entity recognition on the CORD-19 corpus provided in the COVID-19 Open Research Dataset Challenge, as updated on January 3rd, 2021. The corpus contains ≈400K entries for COVID-19-related papers. Relation extraction and entity recognition were performed on the abstracts of the papers.

Relation Extraction
As shown in Table 3, we extracted 40.5 million relations including 29.8 million unique relations. Among the relation extraction methods, OpenIE outputs the largest number. The other three relation extraction methods tend to output long and composite relations while OpenIE tends to break down and output shorter and simpler relations. However, OpenIE also outputs small variations of similar relations.
To assess the quality of relation extraction, we conducted an evaluation on a small data sample consisting of 100 papers selected from the corpus. The evaluation was conducted by two human evaluators, who judged whether each relation can be entailed from its sentence.
The results (Table 2) show that the evaluation is a difficult task. The agreement between the two evaluators is 0.41 in terms of Cohen's kappa coefficient (McHugh, 2012), which is considered fair agreement (Fleiss et al., 2003). Among the relation extraction methods, OLLIE yields the best kappa coefficient of 0.60 (good agreement), OpenIE yields the worst coefficient of 0.30 (poor agreement), and the others yield coefficients of 0.47 to 0.58 (fair to good agreement). One possible reason is the complexity of biomedical texts: sentences in the evaluated sample have 31 tokens on average and up to 167 tokens, and they commonly use conjunctions and nested clauses.
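For readers unfamiliar with the agreement measure, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. The sketch below is a generic implementation with a hypothetical function name, and the labels in the usage example are toy values, not the evaluation data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A usage example with toy judgments (1 = entailed, 0 = not entailed):

```python
annotator_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
annotator_2 = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
print(round(cohens_kappa(annotator_1, annotator_2), 2))
```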

Entity Recognition
As shown in Table 4, a total of 6.4M entities were recognized from the corpus with the four entity recognition models. For each abstract of a COVID-19-related paper, an average of 22 entities were recognized. Among the four models, en_ner_jnlpba_md outputs the largest number of entities, about 1.7 to 2.2 times more than the other models. This model's specialized entity types are cell lines, cell types, DNAs, RNAs, and proteins.

Conclusion
We have presented our COVID-19 scientific paper retrieval system, which focuses on analysing entities and their relations. The system is powered by several relation extraction and entity recognition methods. It supports users in efficiently acquiring knowledge from the huge and rapidly growing number of COVID-19 scientific papers. However, extremely challenging problems remain to be tackled to make the system more practical: dealing with newly created and unseen data, closing the performance gap of the present methods, and doing both in time to help fight pandemics.