BENNERD: A Neural Named Entity Linking System for COVID-19

We present BENNERD, a biomedical entity linking (EL) system that detects named entities in text and links them to entries in the unified medical language system (UMLS) knowledge base (KB) to facilitate coronavirus disease 2019 (COVID-19) research. BENNERD mainly covers the biomedical domain, especially new entity types (e.g., coronavirus, viral proteins, immune responses), by addressing the CORD-NER dataset. It includes several NLP tools for processing biomedical texts, including tokenization, flat and nested entity recognition, and candidate generation and ranking for EL, which have been pre-trained using the CORD-NER corpus. To the best of our knowledge, this is the first attempt to address NER and EL for COVID-19-related entities, such as the COVID-19 virus, potential vaccines, and spreading mechanisms, and it may benefit research on COVID-19. We release an online system that enables real-time entity annotation with linking for end users. We also release the manually annotated test set and the CORD-NERD dataset for the EL task. The BENNERD system is available at https://aistairc.github.io/BENNERD/.


Introduction
In response to the coronavirus disease 2019 (COVID-19) pandemic, the COVID-19 Open Research Dataset (CORD-19) was released for the global research community to apply recent advances in natural language processing (NLP). It is an emerging research challenge with a resource of over 181,000 scholarly articles related to the infectious disease COVID-19 caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). To facilitate COVID-19 studies, and since NER is considered a fundamental step in text mining systems, Xuan et al. (2020b) created the CORD-NER dataset with comprehensive NE annotations. The annotations are based on distant or weak supervision, and the dataset includes 29,500 documents from the CORD-19 corpus. CORD-NER sheds light on NER, but it does not address the linking task, which is important for COVID-19 research. For example, in the sentence in Figure 1, the mention SARS-CoV-2 needs to be disambiguated. Since the term SARS-CoV-2 in this sentence refers to a virus, it should be linked to a virus entry in the knowledge base, not to an entry for 'SARS-CoV-2 vaccination', which corresponds to a therapeutic or preventive procedure to prevent a disease.
We present a BERT-based Exhaustive Neural Named Entity Recognition and Disambiguation (BENNERD) system. The system is composed of four models: an NER model (Sohrab and Miwa, 2018) that enumerates all possible spans as potential entity mentions and classifies them into entity types; the masked language model BERT (Devlin et al., 2019); a candidate generation model that finds a list of candidate entities in the unified medical language system (UMLS) knowledge base (KB) for entity linking (EL); and a candidate ranking model that disambiguates the entity for concept indexing.
The BENNERD system provides a web interface to facilitate the process of text annotation and its disambiguation without any training for end users. In addition, we introduce the CORD-NERD (COVID-19 Open Research Dataset for Named Entity Recognition and Disambiguation) dataset, an extended version of CORD-NER that supports the EL task.

System Description
The main objective of this work is to support research on the recent COVID-19 pandemic. To facilitate COVID-19 studies, we introduce the BENNERD system, which finds nested named entities and links them to a UMLS knowledge base (KB). BENNERD mainly comprises two platforms: a web interface and a back-end server. The overall workflow of the BENNERD system is illustrated in Figure 1.

BENNERD Web Interface
The user interface of the BENNERD system is a web application with an input panel and load-a-sample, annotation, gear box, and .TXT and .ANN tabs. Figure 2 shows the input interface of BENNERD. For a given text from a user, or a sample text loaded from the sample list, the annotation tab shows the annotations over the text based on the best trained NER and EL models. Figure 3 shows an example of text annotation based on BENNERD's NER model. Different colors represent different entity types and, when the cursor hovers over a colored box representing an entity above the text, the corresponding concept unique identifier (CUI) in the UMLS is shown. Figure 3 also shows an example where the entity mention SARS-CoV-2 is linked to its corresponding CUI. Users can save the machine-readable text in .txt format and the annotations in .ann format, which provides standoff annotation output in the brat (Stenetorp et al., 2012) format.
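For reference, each line of the exported .ann file follows the brat standoff convention for text-bound annotations: an annotation ID, the entity type with character offsets, and the covered text, separated by tabs. A minimal sketch of parsing such a line (the entity type "Virus" and the offsets here are illustrative, not taken from the actual BENNERD output):

```python
def parse_ann_line(line):
    """Parse one brat standoff text-bound annotation line.

    Expected shape: "T<id>\t<type> <start> <end>\t<text>"
    """
    ann_id, type_span, text = line.rstrip("\n").split("\t")
    etype, start, end = type_span.split(" ")
    return {"id": ann_id, "type": etype,
            "start": int(start), "end": int(end), "text": text}

# Illustrative annotation: the mention "SARS-CoV-2" at offsets 0-10.
ann = parse_ann_line("T1\tVirus 0 10\tSARS-CoV-2")
```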

Data Flow of Web Interface
We provide a quick look inside our BENNERD web interface (BWI). The data flow of the BWI is as follows.

Server-side initialization: (a) the BWI configuration, concept embeddings, and NER and EL models are loaded; (b) GENIA sentence splitter and BERT basic tokenizer instances are initialized.

When a text is submitted: (a) the text is split into sentences and tokens; (b) token and sentence standoffs are identified; (c) the NER model is run on the tokenized sentences; (d) the EL model is run on the result; (e) the identified token spans are translated into text standoffs; (f) the identified concepts' names are looked up in the UMLS database; (g) a brat document is created; (h) the brat document is translated into JSON and sent to the client side; (i) the brat visualizer renders the document.
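The submission steps above can be sketched as a short pipeline. Every function and argument below is a hypothetical stand-in for the corresponding BWI component (sentence splitter, tokenizer, NER and EL models, UMLS name table), not the actual BENNERD API:

```python
def split_sentences(text):
    # Stand-in for the GENIA sentence splitter.
    return [s.strip() for s in text.split(".") if s.strip()]

def tokenize(sentence):
    # Stand-in for BERT's basic (whitespace/punctuation) tokenizer.
    return sentence.split()

def handle_request(text, ner_model, el_model, umls_names):
    """Sketch of the request flow: split, tokenize, NER, EL, name lookup."""
    sentences = [tokenize(s) for s in split_sentences(text)]
    mentions = ner_model(sentences)   # -> [{"span": ..., "type": ...}, ...]
    linked = el_model(mentions)       # adds a "cui" key to each mention
    for m in linked:
        # Look up the concept name; unmatched CUIs fall back to "CUI-less".
        m["name"] = umls_names.get(m["cui"], "CUI-less")
    return linked                     # serialized to JSON for the brat visualizer
```

In the real interface the returned document is rendered by the brat visualizer; here it is just a list of dictionaries.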

BENNERD Back-end
The BENNERD back-end implements a pipeline of tools (e.g., NER, EL), following the data flow described in Section 2.1.1. This section provides implementation details of our back-end modules for NER and EL.

Neural Named Entity Recognition
We build the mention detection, a.k.a. NER, based on the BERT model (Devlin et al., 2019). The layer receives subword sequences and assigns contextual representations to the subwords via BERT. We denote a sentence by S = (x_1, ..., x_n), where x_i is the i-th word and x_i consists of s_i subwords. This layer assigns a vector v_{i,j} to the j-th subword of the i-th word. We then generate the embedding v_i for each word x_i by computing the unweighted average of its subword embeddings v_{i,j}. We generate mention candidates based on the same idea as the span-based model (Lee et al., 2017; Sohrab and Miwa, 2018; Sohrab et al., 2019a,b), in which all continuous word sequences are generated up to a maximal span width L_x. The representation x_{b,e} \in R^{d_x} for the span from the b-th word to the e-th word in a sentence is calculated from the embeddings of the first word, the last word, and the weighted average of all words in the span as follows:

x_{b,e} = [v_b ; v_e ; \sum_{i=b}^{e} \alpha_{b,e,i} v_i],   (1)

where \alpha_{b,e,i} denotes the attention value of the i-th word in the span from the b-th word to the e-th word, and [ ; ; ] denotes concatenation.
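The exhaustive span enumeration and the span representation above can be sketched in NumPy as follows. The attention weights here are derived from a single learned vector w, which is one common parameterization of α (and an assumption for illustration, not necessarily the exact BENNERD implementation):

```python
import numpy as np

def enumerate_spans(n_words, max_width):
    """All contiguous spans (b, e) with width at most max_width."""
    return [(b, e) for b in range(n_words)
                   for e in range(b, min(b + max_width, n_words))]

def span_representation(V, b, e, w):
    """[v_b ; v_e ; attention-weighted average of v_b..v_e].

    V: (n_words, d) word embeddings; w: (d,) attention scoring vector.
    """
    span = V[b:e + 1]                       # (width, d)
    logits = span @ w                       # one attention score per word
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                    # softmax over the span
    return np.concatenate([V[b], V[e], alpha @ span])
```

With w = 0 the attention reduces to a plain average, which makes the behavior easy to check by hand.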

Entity Linking
In our EL component, for every mention span x_{b,e} of a concept in a document, we are supposed to identify its ID in the target KB. Let us call this ID a concept unique identifier (CUI). The input is all predicted mention spans M = {m_1, m_2, ..., m_n}, where m_i denotes the i-th mention and n denotes the total number of predicted mentions. The list of entity mentions {m_i}_{i=1,...,n} needs to be mapped to a list of corresponding CUIs {c_i}_{i=1,...,n}. We decompose EL into two subtasks: candidate generation and candidate ranking.
Candidate Generation To find a list of candidate entities in the KB to link with a given mention, we build a candidate generation layer adapting a dual-encoder model (Gillick et al., 2019). Instead of normalizing entity definitions to disambiguate entities, we simply normalize the semantic types on both the mention and entity sides from the UMLS. The representation a_m of a mention m in a document with semantic type t_m can be denoted as:

a_m = [x_{b,e} ; t_m],   (2)

where t_m \in R^{d_{t_m}} is the mention type embedding. For the entity (concept) side with semantic type information, the representation a_e is computed from the entity embedding v_e and its entity type embedding t_e \in R^{d_{t_e}} as:

a_e = [v_e ; t_e].   (3)

We use cosine similarity to compute the similarity score between a mention m and an entity e and feed it into a linear layer (LL) to transform the score into an unbounded logit: score(m, e) = LL(sim(m, e)).
We employ the in-batch random negatives technique described in previous work (Gillick et al., 2019). To evaluate the model during training, we use the in-batch recall@1 metric (Gillick et al., 2019) on the development set to track and save the best model. We calculate the embedding of each mention detected by the mention detection layer and of every entity in the KB, and then use an approximate nearest neighbor search algorithm in Faiss (Johnson et al., 2019) to retrieve the top k entities as candidates for the ranking layer.
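Candidate retrieval by cosine similarity can be sketched as below. The system uses Faiss for approximate nearest neighbor search over all UMLS entities; for illustration we compute the exact equivalent in NumPy (L2-normalize both sides, then take the top-k inner products):

```python
import numpy as np

def top_k_candidates(mention_emb, entity_embs, k):
    """Indices of the k entities most cosine-similar to the mention."""
    m = mention_emb / np.linalg.norm(mention_emb)
    E = entity_embs / np.linalg.norm(entity_embs, axis=1, keepdims=True)
    sims = E @ m                       # cosine similarity to every entity
    return np.argsort(-sims)[:k]      # indices of the top-k entities
```

Faiss performs the same normalized inner-product search, but with index structures that scale to the millions of UMLS concepts.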
Candidate Ranking The cosine similarity score from candidate generation is insufficient to disambiguate entities, because the correct entity must receive the highest score among the k comparable candidate entities. We employ a fully connected neural network to rank the candidate list and select the best entity to link to the mention. Given a mention m and a set of candidate entities {e_1, e_2, ..., e_k}, we concatenate the embedding of m in Equation (2) with the embedding of each entity e_i in Equation (3) to form a vector v_{m,e_i}. The vector v_{m,e_i} is then fed into an LL to compute the ranking score: score(m, e_i) = LL(v_{m,e_i}).
The model is then trained using a softmax loss to maximize the score of the correct entity compared with other incorrect entities retrieved from the trained candidate generation model.
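A minimal sketch of the ranking score and its softmax training loss, where the linear layer LL is reduced to a single weight vector W for brevity (an assumption for illustration; the gold entity's index is supplied explicitly):

```python
import numpy as np

def ranking_scores(mention_emb, candidate_embs, W):
    """score(m, e_i) = LL([m ; e_i]) for each candidate e_i."""
    k = len(candidate_embs)
    feats = np.concatenate(
        [np.tile(mention_emb, (k, 1)), candidate_embs], axis=1)
    return feats @ W                   # one logit per candidate

def softmax_loss(scores, gold_index):
    """Cross-entropy pushing the gold candidate's score above the rest."""
    z = scores - scores.max()          # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[gold_index]
```

Minimizing this loss over mention/candidate batches is exactly the "maximize the score of the correct entity against the retrieved negatives" objective described above.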

Experimental Settings
In this section, we evaluate our toolkit on the CORD-NER and CORD-NERD datasets.

CORD-NER Dataset
We carry out our experiments on CORD-NER, a large-scale dataset built with distant or weak supervision that includes 29,500 documents, 2,533,485 sentences, and 10,388,642 mentions. In our experiments, CORD-NER covers 63 fine-grained entity types. CORD-NER mainly draws on four sources: 18 biomedical entity types, 18 general entity types, knowledge base entity types, and nine seed-guided new entity types. We split the CORD-NER dataset into three subsets: train, development, and test, which contain 20,000, 4,500, and 5,000 documents, respectively.

CORD-NERD Dataset
The CORD-NER dataset supports only the NER task. To address the EL task, we extend this dataset by assigning a CUI to each mention in CORD-NER; we call the result the CORD-NERD dataset. We use the most recent UMLS version, the 2020AA release, which includes coronavirus-related concepts. To create the CORD-NERD dataset, we use a dictionary-matching approach based on exact match against the UMLS KB. CORD-NERD includes 10,470,248 mentions, among which 6,794,126 are present in the UMLS and 3,676,122 are absent; the entity coverage ratio of CORD-NERD over the UMLS is therefore 64.89%. We annotate the entity mentions that are not found in the UMLS as CUI-less. To evaluate EL performance on CORD-NERD, 302,166 mentions are assigned to the 5,000-document test set; we call this the UMLS-based test set. We simply call the train and development sets of the CORD-NERD dataset the UMLS-based train and UMLS-based dev sets, respectively. In addition, we asked a biologist to annotate 1,000 random sentences covering chemical, disease, and gene types to create a manually annotated test set. This test set includes 311 disease mentions for the NER task and 946 mentions with their corresponding CUIs for the EL task.
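The exact-match annotation used to build CORD-NERD can be sketched as a dictionary lookup. The two-entry name-to-CUI table below is a toy stand-in for the UMLS 2020AA dictionary, and the CUI value is hypothetical:

```python
def assign_cuis(mentions, umls_name_to_cui):
    """Exact string match against the UMLS; unmatched mentions get CUI-less."""
    return [umls_name_to_cui.get(m, "CUI-less") for m in mentions]

def coverage_ratio(cuis):
    """Fraction of mentions whose surface form was found in the UMLS."""
    matched = sum(1 for c in cuis if c != "CUI-less")
    return matched / len(cuis)
```

Applied to the full mention list, `coverage_ratio` is the statistic reported above: 6,794,126 of 10,470,248 mentions matched, i.e., about 64.89%.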

Data Preprocessing
Each text and its corresponding annotation file are processed by BERT's basic tokenizer. After tokenization, they are passed directly to the deep neural approach for mention detection and classification.

NER Performance
Table 1 shows the performance of SciSpacy on the CORD-NER dataset; the results are based on 1,000 randomly picked, manually annotated sentences as the test set. Table 2 shows the performance comparison of our BENNERD with different pre-trained BERT models on our test set. Since the manually annotated CORD-NER test set is not publicly available, we cannot directly compare our system's performance with it. Instead, in Table 3, we show the performance on gene, chemical, and disease entities based on our UMLS-based test set. In addition, in Table 4, we compare the NER performance of BENNERD with the BC5CDR-corpus-based SciSpacy model on the manually annotated disease entities.

Candidate Ranking Performance
As we are the first to perform the EL task on the CORD-19 dataset, we present different scenarios to evaluate our candidate ranking performance. The EL results are shown in Table 5, where we evaluate candidate ranking under two experimental settings. In setting 1, we train the CUIs on the manually annotated MedMentions (Murty et al., 2018) dataset. In setting 2, the BENNERD model is trained on the automatically annotated CORD-NERD dataset. Table 5 shows that our BENNERD model with setting 2 outperforms setting 1 in every case in terms of accuracy@k (k = 1, 10, 20, 30, 40, 50). Table 6 shows the EL performance on the manually annotated test set; here, too, our system with setting 2 outperforms setting 1. In addition, we evaluate the manually annotated test set with a simple string-matching approach, for which the results for the top 10, 20, 30, 40, or 50 predictions of a gold candidate are unchanged.

Performances on COVID-19 Entity Types
Finally, in Table 7, we show the performance on the nine new entity types discussed in Section 3.1 related to COVID-19 studies, which may benefit research on the COVID-19 virus, its spreading mechanism, and potential vaccines.

Related Work
To facilitate biomedical text mining research on COVID-19, a few recent works have addressed related text mining tasks. Xuan et al. (2020b) created the CORD-NER dataset with distant or weak supervision and reported the first NER performances of different NER models on it. Motivated by this work, we present the first web-based toolkit that addresses both NER and EL, and we extend the CORD-NER dataset to support the EL task. Xuan et al. (2020a) created the EvidenceMiner system, which retrieves sentence-level textual evidence from the CORD-NER dataset. Tonia et al. (2020) developed an NLP pipeline to extract drug and vaccine information about SARS-CoV-2 and other viruses to help biomedical experts easily track the latest scientific publications. To the best of our knowledge, this work is the first effort to solve both NER and EL in a pipeline manner for COVID-19.

Conclusion
We presented the BENNERD system for entity linking, hoping to bring insights to COVID-19 studies and support scientific discovery. To the best of our knowledge, BENNERD represents the first web-based NER and EL workflow for NLP research on the CORD-19 dataset, which led us to create the CORD-NERD dataset to facilitate COVID-19 work. The online system is available to end users for real-time extraction. The BENNERD system is continually evolving; we will keep improving it and implement new functions, such as relation extraction, to further facilitate COVID-19 research. We invite readers to visit https://aistairc.github.io/BENNERD/ to learn more about BENNERD and CORD-NERD.