A Comprehensive Evaluation of Biomedical Entity-centric Search

Biomedical information retrieval has often been studied as the task of detecting entity spans and linking these entities to concepts from a given terminology. Most academic research has focused on the evaluation of named entity recognition (NER) and entity linking (EL) models, which are key components for recognizing diseases and genes in PubMed abstracts. In this work, we perform a fine-grained evaluation intended to understand the effectiveness of a state-of-the-art BERT-based information extraction (IE) architecture as a biomedical search engine. We present a novel manually annotated dataset of abstracts for disease and gene search. The dataset contains 23K query-abstract pairs, where 152 queries are selected from the logs of our target discovery platform and PubMed abstracts are annotated with relevance judgments. The query list also includes a subset of concepts with at least one ambiguous concept name. As a baseline, we use off-the-shelf Elasticsearch with BM25. Our experiments on NER, EL, and retrieval in a zero-shot setup show that the neural IE architecture achieves superior performance for both disease and gene concept queries.


Introduction
The amount of text data being produced is overwhelming, especially in biomedicine; PubMed covers over 33 million articles from biomedical and life sciences journals and other texts, with about 1.5 million added each year. Meanwhile, many of these articles are about specific entities (e.g., proteins, diseases, chemicals), i.e., entity-centric. In general, entities are central to many search queries; e.g., Guo et al. (2009) demonstrated that 71% of search queries contained named entities, while Xiong et al. (2017) found that more than half of the traffic in the Allen Institute's scholar search engine is about research concepts. The use of automatic natural language processing (NLP) methods is imperative for information retrieval (IR) and information extraction (IE) from a large volume of biomedical texts. Several efforts have been made in past years on entity extraction from scientific publications (Kim et al., 2013; Lee et al., 2016; Allot et al., 2018; Mohan et al., 2018, 2021; Wang and Lo, 2021). For example, the Biomedical Entity Search Tool (BEST) uses a dictionary-based indexing strategy to extract ten types of biomedical entities including genes, diseases, drugs, and chemical compounds (Lee et al., 2016), while Kim et al. (2013) and Mohan et al. (2021) adopt machine learning for disease and gene extraction and linking. However, recent works on Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) showed that the generalization ability of BERT-based named entity recognition (NER) and entity linking (EL) models is influenced by domain shift, i.e., whether the test entity/term has been seen in the training set (Miftahutdinov et al., 2020; Tutubalina et al., 2020; Kim and Kang, 2022). Recently, Soni and Roberts (2021) compared two commercial search engines with academic prototypes evaluated in the TREC-COVID challenge (Roberts et al., 2020; Voorhees et al., 2021). Their evaluation showed that commercial search engines from Amazon and Google (COVID-19 Research Explorer) fail to outperform decades-old IR approaches. In particular, the best run (from sabir) was achieved by a SMART system (Buckley, 1985) and used no machine learning or biomedical knowledge. A similar observation has been made for general-domain information retrieval (Thakur et al., 2021), where more recent approaches, e.g., those based on dense or sparse embeddings, can substantially underperform traditional lexical models like BM25 (Robertson and Zaragoza, 2009).
In this paper, we describe the design and evaluation of a BERT-based IE system as an entity-centric search engine for the target discovery platform PandaOmics (https://pandaomics.com/). In particular, we seek to answer the following research question: given near-excellent performance on NER and EL (Miftahutdinov et al., 2021; Lee et al., 2019), are there models capable of finding relevant publications for disease and gene queries from diverse biomedical subdomains in real-world applications? To help answer this question, we develop a novel search collection of PubMed abstracts for disease and gene queries with corresponding relevance judgments. We evaluate the IE pipeline, with two trained BERT-based models for NER and EL, against the standard document retrieval model BM25 using off-the-shelf Elasticsearch software. We perform an error analysis of the models' predictions to shed light on future work directions.

Dataset
This section describes our dataset, including queries, and the process of collecting relevance assessments. Table 1 shows statistics of our dataset.

Queries
In our target discovery platform PandaOmics, a user can enter a gene name or gene symbol like 'PSEN1' (ENSG00000080815) and retrieve all relevant publications and the associated diseases, including Alzheimer's disease (EFO:0000249). An autocomplete feature displays suggestions from disease or gene dictionaries as the user types search terms. Conversely, the user can enter the disease name 'Alzheimer's disease' to retrieve publications for this concept and the associated targets. These associations rely on Omics datasets and on a collection of AI-based scores built on molecular data and previously published text-based data (see (Ozerov et al., 2016) for more details). As a disease terminology source, we use an internal knowledge base that contains 15,051 concept unique identifiers (CUIs) based on the Experimental Factor Ontology (EFO) (Malone et al., 2010). As a gene terminology source, we use an internal knowledge base with 28,227 CUIs from Ensembl (Hubbard et al., 2002). We recall that each concept consists of atoms (concept names); all of the atoms within a concept are synonymous (NLM, 2016). As test queries for our dataset, we use the most frequent queries from the platform's logs. These queries are disease CUIs and gene CUIs. In addition, our annotators selected a list of concepts with at least one ambiguous concept name (see Table 2 for examples).

Pooling
Following the standard practice of IR collection building, we employ a pooling approach (Lipani, 2016; Hasibi et al., 2017; Thakur et al., 2021) and combine retrieval results from two main sources:
1. we obtained retrieval results from Elasticsearch (see Sect. 3.2 for a description of this system); results are pooled from these runs up to depth 100;
2. we obtained retrieval results from PubMed; results are pooled from these runs up to depth 100, excluding abstracts returned by the first system.
The final assessment pool contains 23,099 query-abstract pairs (152 abstracts per query on average).

Collecting Relevance Judgments
For each query-abstract pair, we collected relevance judgments from 2 annotators with biomedical degrees using an in-house annotation tool (Fig. 2). An expert annotator with a Ph.D. in biology created a list of queries from the logs of our target discovery platform PandaOmics. All annotators are paid biologists in the company. The expert annotator wrote annotation guidelines and trained the annotators. Each annotator selected a disease or gene query from the list of selected identifiers and was shown an abstract with information about the publication year and journal. Abstracts were presented in random order. Annotators were then asked to: (i) judge relevance on a 3-point scale: "relevant", "nonrelevant", or "doubtful", and (ii) categorize the reason for relevance/nonrelevance.
We note that annotators were asked to consider the EFO hierarchy during relevance annotation for disease queries. According to the annotation guidelines, only the synonyms belonging to the required level of the hierarchy are relevant. Terms that are higher in the hierarchy are "wider terms", and those that are lower represent a "narrower case". E.g., while annotating a text for the "prostate adenocarcinoma" query, "prostate cancers" is wider than the term of interest; for the "prostate cancer" query, "prostate adenocarcinoma" is narrower than the term of interest. Below, we provide a summary of the guidelines illustrated with examples.
Relevance The relevance of a publication to a gene/disease is determined as true when the gene/disease of interest (its main name or any synonym) is present with the same meaning in an abstract. The term in the abstract should belong to a disease/gene ontology (and not to any other category, e.g., the name of a clinical trial, institution, foundation, etc.). In particular, there are six reasons for relevance:
1. synonym in text - one of the synonyms is precisely present in an abstract;
2. new synonym - a new synonym for the term of interest, absent from our synonym list, was found;
3. term by fragments - an entity is annotated from several fragments of text if: (i) the term is from the disease or gene ontology; (ii) both fragments are in the same sentence; (iii) the parts of the term are logically connected (according to the author's logic). E.g., the text "...secondary diabetic complications, such as retinopathy, neuropathy, and nephropathy" (pmid 33109031) should be annotated as TRUE for "diabetic retinopathy";
4. enumeration - an entity is annotated from fragments that are separated only by punctuation marks or conjunctions. E.g., the text "asthma-wheezing" (pmid 33276583) should be annotated as TRUE for both "asthma" and "wheezing", while "AKT1-mTORC1 Axis" (pmid 32404972) should be annotated as TRUE for "AKT1";
5. suffix/prefix - an entity is annotated as part of a word with a suffix/prefix. E.g., we annotate "obesity-induced NAFLD" as a match for "obesity" and treat "-induced" as a suffix (we note that there is no "obesity-induced NAFLD" term in the ontology);
6. complicated case - a term is encountered in the abstract as fragments separated across different sentences, with a logical link between them.
The detailed distribution of relevance reasons is given in Fig. 3.
Nonrelevance Nonrelevance of a gene/disease is determined as either no link between the gene/disease and a publication abstract or a wrongly identified relation. The first means the gene/disease is not mentioned in the abstract. The second means that the gene/disease is incorrectly linked to an abstract for one of the following six reasons:
1. no results - no results for the term of interest were found in a publication;
2. refers to another - the gene/disease name (or its abbreviation) is a synonym of some other term, or has other meanings outside the ontology (e.g., the abbreviation COAD for colon adenocarcinoma refers to another term, "anaerobic co-digestion (co-AD)"; the abbreviation for non-alcoholic steatohepatitis refers to another term, "Nash equilibria");
3. the gene/disease name (or its abbreviation) refers to another term within the ontology (gives collisions) because of: (i) same synonyms (e.g., the abbreviation COAD for chronic obstructive pulmonary disease refers to another disease, "colon adenocarcinoma"); (ii) refers to a wider term - the publication abstract was found via a wider disease term, which refers not only to the disease of interest and may give additional non-relevant results (e.g., colon cancer is a wider term for colon adenocarcinoma); (iii) narrower case - the publication abstract was found via a more specific term (e.g., Alzheimer's disease is a narrower case of neurodegenerative disease); (iv) preprocessing issue - either an ignored punctuation mark ("background: retinopathy", "ER-breast cancer") or a match that is part of a longer term ("Non-small cell lung carcinoma", "Traf2-and Nck-interacting kinase").
The detailed distribution of nonrelevance reasons is given in Fig. 4. We note that our definition of nonrelevance differs from PubMed search primarily because of the consideration of the concept hierarchy. PubMed search uses Best Match (Fiorini et al., 2018), trained on user-click information from PubMed search logs. We believe that distinguishing narrower concepts from broader ones is crucial for target discovery objectives.

Doubtful This category includes publications that mention the disease/gene of interest only in keywords/MeSH terms, without a match in the abstract. PubMed articles are manually annotated with author keywords and MeSH (Medical Subject Headings) (Lipscomb, 2000) as standardized keywords. The reasons for this label are the same as for the relevance label, with synonym in MeSH/keywords added and the "complicated case" category excluded. In 97.65% and 1.6% of cases, the annotator associated texts with the synonym in MeSH/keywords and new synonym reasons, respectively.
In 91% and 80% of pairs, the two annotators agreed on a relevance label and decision reasons, respectively. When annotators disagreed, the expert annotator was asked to decide whether the relevance labels, along with the reasons selected by one of the annotators, were in fact correct. After this procedure, we obtained a dataset for entity search with 73 disease queries, 79 gene queries, and 23,099 annotated query-abstract pairs.

Models
The goal of our work is to evaluate retrieval models in a zero-shot setup, with no training data available for the IR system.

BERT-based IE pipeline
In our work, we have focused on the extraction of two entity types: disease and gene. However, we designed our IE system with the simplicity of scaling to new entity types in mind. The system consists of pipelines, one for each entity type.
Each pipeline incorporates two sub-modules: (i) a NER sub-module and (ii) an EL sub-module. These sub-modules are applied successively: the first extracts entities of interest, and the second links the extracted entities to concepts from the given knowledge bases. Taken together, this means that the processing of different entity types is independent, so each pipeline can be trained and applied separately. As a pretrained transformer model, we use BioBERT base v1.1 (Lee et al., 2019).
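The successive application of the two sub-modules can be sketched as follows. This is a minimal illustration of the contract between the stages; `ner_model`, `el_model`, and the toy stand-ins are hypothetical placeholders, not the actual interfaces of the trained BioBERT NER and DILBERT EL models.

```python
def run_pipeline(abstract, ner_model, el_model):
    """Run one entity-type pipeline: NER first, then EL on each mention."""
    mentions = ner_model(abstract)          # e.g., [(text, start, end), ...]
    results = []
    for text, start, end in mentions:
        cui = el_model(text)                # nearest concept in the terminology
        results.append({"mention": text, "span": (start, end), "cui": cui})
    return results

# Toy stand-ins: a dictionary-lookup "NER" and "EL" for illustration only.
toy_ner = lambda doc: [(m, doc.index(m), doc.index(m) + len(m))
                       for m in ["Alzheimer's disease"] if m in doc]
toy_el = lambda mention: {"Alzheimer's disease": "EFO:0000249"}.get(mention)
```

Because each entity type gets its own pipeline instance, adding a new type amounts to training a new NER/EL pair and registering one more pipeline.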
Named Entity Recognition In this paper, for reproducibility reasons, we decided to analyze models trained on publicly available academic datasets. Specifically, we train BioBERT on a combination of the NCBI and CDR Disease datasets (Dogan et al., 2014; Li et al., 2016) for disease entities and on the DrugProt dataset (Miranda et al., 2021) for gene entities. To join the NCBI and CDR Disease datasets, we utilized the predefined train/test subsets and combined the datasets within these splits. Thus, the train part of NCBI was combined with the CDR Disease train set; a similar procedure was carried out to obtain the test part of the combined dataset. We adopted model training hyper-parameters from (Lee et al., 2019). Our model achieves 88.43% and 90.39% F-measure on the official test sets for disease and gene entities, respectively.
Entity Linking For linking extracted entities to corresponding concepts from dictionaries, we employ the state-of-the-art Drug and disease Interpretation Learning with Biomedical Entity Representation Transformer (DILBERT) (Miftahutdinov et al., 2021). This model is based on metric learning with negative sampling, specifically, triplet constraints. Given an entity mention m, a positive concept name c_g, and a negative concept name c_n, the triplet loss tunes the network such that the distance between m and c_g is smaller than the distance between m and c_n. Details on the overall architecture, configuration, hyperparameter search, and evaluation strategies are presented in (Miftahutdinov et al., 2021). The code is publicly available at https://github.com/insilicomedicine/DILBERT. We note that an advantage of the DILBERT architecture is the ability to search for the closest concept in a different terminology without retraining the model (cross-terminology use).
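The triplet constraint described above can be written in the standard form below; this is our notation for the generic triplet objective (with $d(\cdot,\cdot)$ a distance between embeddings and $\alpha$ a margin hyperparameter), not necessarily the exact formulation used in DILBERT:

$$\mathcal{L}(m, c_g, c_n) = \max\bigl(0,\; d(m, c_g) - d(m, c_n) + \alpha\bigr)$$

Minimizing this loss pushes the mention embedding at least $\alpha$ closer to the positive concept name than to the negative one.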
Similar to NER, we train models on publicly available academic datasets: CDR Diseases (Li et al., 2016) and BC2GN Genes (Morgan et al., 2008). The models are evaluated on refined test sets without entity overlap between train/test sets from (Tutubalina et al., 2020). These sets are publicly available at https://github.com/insilicomedicine/Fair-Evaluation-BERT. Our model achieves 75.8% and 82.4% accuracy on the refined test sets of diseases and genes, respectively.
Details on models' configurations, speed performance and system deployment are presented in Appendices A and B.

Elasticsearch BM25
We utilized a popular search engine framework, Amazon Elasticsearch/OpenSearch Service, which uses OpenSearch v1.0. OpenSearch is a fork of open-source Elasticsearch 7.10. OpenSearch uses BM25 (Robertson and Zaragoza, 2009) to calculate relevance scores. BM25 is a commonly used bag-of-words retrieval function based on token matching between two high-dimensional sparse vectors with TF-IDF token weights. We note that (Thakur et al., 2021) recently showed that many approaches with sparse, dense, or late-interaction architectures outperform BM25 in in-domain evaluation, yet perform poorly in a zero-shot setup.
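As an illustration of the scoring function, a minimal BM25 sketch is shown below. This is one common variant of the formula from Robertson and Zaragoza (2009); the k1 and b defaults here are conventional textbook values and are not claimed to match OpenSearch's exact configuration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term across the collection.
    df = {t: sum(1 for d in docs if t in d) for t in set(query_tokens)}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_tokens:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores
```

The b parameter controls length normalization and k1 controls term-frequency saturation: repeated occurrences of a query term keep raising the score, but with diminishing returns.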

Evaluation
For evaluation, we use precision, recall, and F-measure. We calculate precision as the fraction of relevant documents among all retrieved documents. Likewise, recall is calculated as the fraction of relevant documents retrieved out of all relevant documents in the dataset. For the experiments, we use query-document pairs with relevant and nonrelevant labels, excluding the doubtful category.
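The metrics above can be computed per query as set-based quantities; a minimal sketch:

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F-measure for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)          # relevant documents retrieved
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Documents labeled doubtful are simply excluded from both sets before the computation.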
Tables 3 and 4 present the performance of the BERT-based pipeline compared to BM25 on the full set of queries and on the subset of concepts with ambiguous names, respectively. Several observations can be made based on Tables 3 and 4. First, the BERT-based system outperformed BM25 on both subsets of the dataset and for both types of entities. Second, as expected, the performance difference between the two models is larger on the subset with ambiguous concept names. Third, for the BERT-based pipeline, precision is higher than recall. In addition, we investigate search precision further by developing a dataset for out-of-domain abstract detection. Approximately 30,000 journals are included in the PubMed journal list. These journals publish papers not only about biological entities, but also on cultural topics, economics and econometrics, artificial intelligence, law, linguistics and language, and so on (out-of-domain categories for us). Our expert annotator manually selected out-of-domain journals on which we expect the IE system to return zero results. We randomly selected 58,790 abstracts from these journals, where each abstract includes at least one gene or disease concept retrieved by Elasticsearch. In 90% of these abstracts, the BERT-based system did not find any entities.

Error Analysis For the error analysis of the BERT-based IE system, we reviewed a sample of 152 false positive (FP) documents and 168 false negative (FN) results. For FNs, 60% of errors (100 abstracts) fell into the synonym in text category. These documents were additionally analyzed to detect which model (NER or EL) predicted incorrectly (see Table 6). As shown in Table 6, in 23% of cases, the NER model predicts a shorter entity, which is also known as the boundary problem. E.g., in the text "external validation of the Nonalcoholic [Steatohepatitis] predicted Scoring System in patients" (pmid 33248101), Nonalcoholic Steatohepatitis was mapped to just Steatohepatitis due to the NER prediction. Mapping errors are often related to the presence of numbers in gene names or abbreviations. E.g., in the text "orphan nuclear receptor [Nr4a1] mediates perinatal neuroinflammation" (pmid 32606386), the entity Nr4a1 was mapped to the Nr4a2 gene instead. For FPs, we additionally analyzed 22% of errors (34 abstracts) from the refers to another category. The NER and EL models cause errors in 16 and 11 documents, respectively.

Conclusion and Future Work
In this work, we present a comprehensive evaluation of a biomedical entity-centric search engine based on BERT models for disease and gene extraction and linking. This engine is part of a target discovery platform, where users can retrieve a list of relevant publications given a disease or gene concept query. We evaluate BERT models on two information extraction tasks, entity-centric information retrieval, and out-of-domain abstract detection. Moreover, we present an error analysis for both the retrieval and extraction tasks.
This work suggests several interesting directions for future research. We plan to conduct similar studies on other text sources, such as full publication texts and patents. Moreover, we plan to expand the list of entity types with pathways and biological processes. To extract explicit associations between drug targets and diseases, we plan to add relation extraction/event detection models and to study knowledge graph completion with novel disease-gene edges.

Ethics Statement
We outline potential ethical issues with our work below. First, our work focuses on a comprehensive evaluation of an information extraction pipeline for the retrieval of relevant scientific texts given queries of disease and gene concepts. Consequently, the developed BERT-based models could reflect many domain-specific biases exhibited by language models. For example, (Sung et al., 2021) showed that predictions on factual triples tend to be highly biased towards a few objects (e.g., "headache", "pain", or "ESR1"). Since pretrained language models are used for initialization, biased patterns may be reflected in open-world applications. Second, our NLP engine is part of the target discovery platform PandaOmics, which aims to identify targets (genes/proteins) through deep feature selection, causality inference, and de novo pathway reconstruction (Ozerov et al., 2016). We use the NLP engine to assess targets' novelty and disease association via the analysis of research publications. The incompleteness of the extracted information can be especially pronounced in the small number of publications in the search results for rare diseases, making subsequent analysis difficult. Third, we use EFO and Ensembl as primary resources for the disease hierarchy and concepts' synonyms. For example, (Miftahutdinov et al., 2021) demonstrated that the degradation in accuracy from the full disease dictionary to 30% of the dictionary is significant for disease linking in clinical trials. Moreover, the consistent description of these entities is spread across numerous differing standards, and timely incorporation of new human disease terms and targets is still necessary.

The models are trained with the same optimizer and learning rate for 5 epochs and a batch size of 32. At inference and training time, we restrict the sequence length to up to 128 sub-tokens for entity recognition and up to 28 sub-tokens for linking.
For the NER sub-module, we use the Hugging Face library (https://huggingface.co); for EL, we apply the sentence-transformers library (https://www.sbert.net). At inference time, the EL model uses the FAISS library (Johnson et al., 2019) with GPU support for fast nearest neighbor search, comparing vectors with Euclidean distance. The embeddings of all terminologies' concepts are indexed.
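Conceptually, the lookup is an exhaustive Euclidean nearest-neighbor search over the precomputed concept embeddings, which is what a flat L2 index performs (FAISS batches and accelerates this on GPU). The pure-Python sketch below illustrates the computation only and is not the FAISS API:

```python
def nearest_concept(query_vec, concept_vecs):
    """Exhaustive L2 nearest-neighbor search over indexed concept embeddings.

    Returns the index of the closest concept and its squared L2 distance.
    """
    def sq_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    dists = [sq_l2(query_vec, c) for c in concept_vecs]
    best = min(range(len(dists)), key=dists.__getitem__)
    return best, dists[best]
```

Since squared L2 distance preserves the ordering of L2 distances, the square root can be skipped when only the nearest neighbor is needed.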
We note that deployed models are trained on in-house datasets with similar parameters and evaluation metrics that are not publicly available due to company policy.
We profiled retrieval speed on a server with an Intel Xeon CPU E5-2660 2.00GHz and 256GB of memory. First, we precomputed embeddings for all concepts (500 thousand). On a single Nvidia TITAN X GPU, it takes about 7 minutes to compute all embeddings. With all embeddings indexed on the Nvidia TITAN X GPU using the IndexFlatL2 index type, processing 5 thousand documents takes 390 seconds, i.e., 0.08 seconds per document. Most of this time, specifically 359 seconds, is taken by the NER sub-module.

Figure 2 :
Figure 2: Task design in our in-house annotation tool with search by disease concept identifier. An annotator selects an abstract and chooses one of three labels: relevant/true (green), nonrelevant/false (red), or doubtful (yellow).

Table 1 :
Summary statistics of the proposed dataset.

Table 2 :
A sample of concepts with at least one ambiguous concept name.

Table 3 :
IR metrics on the full set of queries.

Table 4 :
IR metrics on the subset of queries with ambiguous concepts.

Table 5 :
Error analysis of IR results on the false positive sample (152 texts). Table 5 provides a summary of error categories for FPs. As shown in Table 5, the most frequent category of errors (58%) is related to the ontology hierarchy. Wider cases can also be attributed to a gene when the gene family is mentioned (e.g., Akt (there are Akt1/2/3), ERK (there are ERK1/2)).

Table 6 :
Error analysis of NER and EL predictions on the false negative (FN) sample (100 texts).