Biomedical Interpretable Entity Representations

Pre-trained language models induce dense entity representations that offer strong performance on entity-centric NLP tasks, but such representations are not immediately interpretable. This can be a barrier to model uptake in important domains such as biomedicine. There has been recent work on general interpretable representation learning (Onoe and Durrett, 2020), but these domain-agnostic representations do not readily transfer to the important domain of biomedicine. In this paper, we create a new entity type system and training set from a large corpus of biomedical texts by mapping entities to concepts in a medical ontology, and from these to Wikipedia pages whose categories are our types. From this mapping we derive Biomedical Interpretable Entity Representations (BIERs), in which dimensions correspond to fine-grained entity types and values are predicted probabilities that a given entity is of the corresponding type. We propose a novel method that exploits BIERs' final sparse and intermediate dense representations to facilitate model and entity type debugging. We show that BIERs achieve strong performance on biomedical tasks including named entity disambiguation and entity label classification, and we provide error analysis to highlight the utility of their interpretability, particularly in low-supervision settings. Finally, we release our induced 68k biomedical type system, the corresponding 37 million triples of derived data used to train BIER models, and our best-performing model.


Introduction
In modern NLP systems, entities are embedded in the same dense vector space as words using vectors from pre-trained (masked) language models (Devlin et al., 2019) that yield contextualized embeddings of tokens. These representations are used as inputs for downstream models built for particular tasks. One issue with such learned representations is that we do not actually know what information they encode. Recent work has shown that deep pre-trained models implicitly learn factual knowledge about entities (Petroni et al., 2019; Roberts et al., 2020), but the embeddings they provide do not explicitly maintain representations of this knowledge (i.e., the dimensions in learned representations have no a priori semantics); consequently, they are not directly interpretable. This has motivated the design of knowledge probing tasks to measure the factual knowledge implicit in embeddings (Petroni et al., 2019; Poerner et al., 2019).
Recent work (Onoe and Durrett, 2020) has proposed learning interpretable entity representations using an entity typing model, paired with a fine-grained type system, that accepts an entity mention and its context. The output is a high-dimensional sparse embedding whose values correspond to the model's (independently) predicted probabilities that the entity possesses the respective properties defined by the fine-grained type system. This past work proposed general-domain pre-trained Transformer-based (Vaswani et al., 2017) entity typing models trained on Wikipedia or the ultra-fine entity typing system (Choi et al., 2018), yielding 60k- and 10k-dimensional embeddings, respectively, which can then be used directly in downstream tasks. Such representations can achieve strong results without learning task-specific representations. Thus, in addition to providing interpretability, such representations may be particularly useful for tasks with limited supervision.
Such interpretable entity representations for text can be valuable in domains such as biomedicine, because they afford model transparency which may help with model debugging, or simply to instill confidence in model outputs. For example, if one defines a linear layer on top of entity-type representations, learned coefficients are interpretable as weights assigned to specific entity types. One could debug an incorrect prediction by inspecting the induced representation for potentially erroneous types assigned to it. This sort of insight is particularly important in biomedical NLP, given the potential sensitivity of the tasks in the domain, and the high-level expertise of the 'end-users'.
Motivated by these observations, we extend (Onoe and Durrett, 2020) to learn sparse Biomedical Interpretable Entity Representations (BIERs) in which values encode predicted probabilities of an entity belonging to a type from a fine-grained entity type system. Starting from a corpus of PubMed 1 articles on cancer and drugs as our training data, we induce an entity type system by mapping entities in the articles to their associated UMLS concepts, and then mapping the concepts to Wikipedia pages whose categories we use as our types.
We show that learning a typing model on top of such a system realizes strong performance on a variety of biomedical tasks, including named entity disambiguation (NED) and entity label classification, using simple cosine similarity or Euclidean distance based methods, and we provide an analysis of the results from an interpretability perspective. In addition, we propose a simple technique that facilitates debugging and provides a mechanism by which to improve model performance by exploiting both the proposed sparse interpretable type representations and their internal underlying dense counterparts. Finally, we introduce and release a new medical-centric Wikipedia dataset based on (Rosenthal et al., 2019) for use in the task of biomedical NED.
Our specific contributions 2 are as follows:
• We create (and will release) a biomedical entity typing system comprising Wikipedia categories from pages mapped to UMLS concepts linked to PubMed article entities, and we learn a model that produces sparse entity representations in which dimensions are imbued with known semantics. We show that these achieve strong performance on biomedical NED and entity label classification tasks.
• We propose a debugging technique that uses the proposed representations' performance on downstream tasks to gain insights into the entity typing model and type system.
• We release a medical-literature-centric Wikipedia dataset for use in the task of biomedical NED.

Figure 1: Model architecture from (Onoe and Durrett, 2019) using our 68k biomedical entity type system. A BERT-based encoder embeds a mention and context, and the output entity embedding contains the predicted probability for each type.

Background: Interpretable Entity Model
We first review the interpretable entity model architecture we extend from (Onoe and Durrett, 2020). Let s = (w_1, ..., w_N) denote a sequence of input context words, m = (w_i, ..., w_j) denote an entity mention span in s, and t ∈ [0, 1]^|T| denote a vector whose values are predicted probabilities corresponding to fine-grained entity types T from a predefined type system, with higher values identifying the types most pertinent to m and s. Given a labeled dataset D = {(m, s, t*)^(1), ..., (m, s, t*)^(k)}, the objective is to learn parameters θ of a function f_θ that maps the mention m and its context s to a vector t that captures salient features of the entity mention within its context. The basic idea is that the resultant entity embeddings t (wherein individual dimensions have explicit semantics) can be used as embeddings in downstream tasks, for example via basic similarity measures such as dot products or cosine similarities. 3

The simple model f_θ that produces these embeddings is shown in Figure 1, where the mention m and context s are segmented into WordPiece tokens (Wu et al., 2016). The hidden vector output corresponding to the [CLS] token is treated as the intermediate dense mention-and-context representation:

h_[CLS] = BERT(m, s)

A type embedding layer then projects this intermediate representation to a vector whose dimensions correspond to the entity types T, using a single linear layer whose parameters may be viewed as a matrix of type embeddings E ∈ R^(|T|×d), where d is the dimension of h_[CLS]. We obtain the output probabilities t that form our entity representation (top of Figure 1) by multiplying E by h_[CLS] and applying an element-wise sigmoid function:

t = σ(E h_[CLS])

Following Choi et al. (2018), the training loss we minimize is a sum of binary cross-entropy losses over all entity types T and all training examples D. That is, we treat each type prediction for each example as an independent binary decision, with shared parameters in the BERT encoder. Our loss is:

L = − Σ_i Σ_j [ t*_ij log t_ij + (1 − t*_ij) log(1 − t_ij) ]

where i ranges over data indices, j over types, t_ij is the jth component of t_i, and t*_ij is the jth component of t*_i, which takes the value 1 if the jth type applies to the current entity mention. We fine-tune all parameters in BERT and the type embedding matrix.
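As a rough sketch (not the authors' released implementation), the type projection and the summed binary cross-entropy loss above can be written as:

```python
import numpy as np

def type_probs(h_cls, E):
    """Predicted type probabilities t = sigmoid(E h_[CLS])."""
    return 1.0 / (1.0 + np.exp(-(E @ h_cls)))

def typing_loss(t_pred, t_gold, eps=1e-12):
    """Sum of independent binary cross-entropies over all types."""
    t_pred = np.clip(t_pred, eps, 1 - eps)
    return -np.sum(t_gold * np.log(t_pred) + (1 - t_gold) * np.log(1 - t_pred))
```

In practice the full loss also sums over all training examples, and gradients flow back through E and the BERT encoder.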
3 Fine-tuning the representations would destroy their interpretability.

Biomedical Interpretable Entity Representations
Biomedical Entity Typing To train an interpretable entity embedding model tailored specifically for biomedical tasks, we must first construct a suitable biomedical entity type system and dataset. PubMed indexes over 30 million biomedical citations across a wide range of topics. To curate a topically focused set of literature, we first used the PubTator tool (Wei et al., 2019) to query PubMed for articles related to drugs used as treatment for cancer; this yielded 461,404 unique citations (titles and abstracts). 4 We used an off-the-shelf NER tagger available in SciSpacy (Neumann et al., 2019) to identify entity spans within abstracts, and used its Entity Linker component to link those entities to concept unique IDs (CUIDs) within the Unified Medical Language System (UMLS) ontology. 5
Next, we had to decide on the specific entity type system to use, i.e., the set of labels to attach to entities, and we chose Wikipedia as our knowledge base. We used this general knowledge base instead of a specialized ontology (for example, MeSH or SNOMED CT) primarily because it yielded (many) more diverse entity types per mention.
To connect UMLS concepts to Wikipedia pages we use the mapping from Cuzzola et al. (2018), which is accurate but incomplete: it provides exact wikipage matches for 221,690 concepts and "close matches" for 26,276 of them, out of roughly 3 million concepts in UMLS. For concepts with no exact or close match, we used SLING (Ringgaard et al., 2017), a framework for frame semantic parsing that allows for querying and resolving wikipages given a search string (in our case, mention surface forms). For high-confidence exact or close matches, we return the set of categories found for their combined results. While these results can be slightly noisy, they mostly lead to satisfactory performance. We filter the entity mentions that compose our final set as follows: if multiple concept CUIDs are found for a given entity, we include the highest-scoring matches within two points of each other, provided they all exceed a minimum score threshold of 0.8. 6 Additionally, we only include results that are linked to at least one concept CUID and where an associated Wiki link was mapped to directly via Cuzzola et al. (2018) or via SLING. A schematic of this process is shown in Figure 2, and a worked example of the entity filtering process appears in Appendix A. In the end, about 12.5% of the mappings from PubMed mentions to Wikipedia categories come via SLING.
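The score-based filtering rule above can be sketched as follows; the function name and the (cuid, score) input format are assumptions, and the score scale follows the thresholds stated in the text:

```python
def select_cuid_matches(matches, min_score=0.8, margin=2.0):
    """Keep the highest-scoring CUID matches that all exceed `min_score`
    and lie within `margin` points of the best match.
    `matches` is a list of (cuid, score) pairs (hypothetical format)."""
    kept = [(cuid, score) for cuid, score in matches if score >= min_score]
    if not kept:
        return []
    best = max(score for _, score in kept)
    return [cuid for cuid, score in kept if best - score <= margin]
```

A mention whose candidate concepts all fall below the threshold is simply dropped from the dataset.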
After processing, linking and filtering the corpus of PubMed abstracts, we were able to extract 37,357,141 triples of the form (mention, context, [list of categories]). This list of triples contains 68,304 unique categories which we use as the entity type system for training BIERs. Appendix 8 contains a list of the top 100 entity types that appear over these articles and Appendix 5 shows a histogram of entity types per mention. As one contribution, we will release this set of derived triples.
To assess the quality of this dataset, we chose 500 triples at random and asked 4 experts (researchers in biomedicine and ML) to score them on a Likert scale from 1 (low) to 5 (high) for accuracy. Experts assessed how well a PubMed mention from a context sentence maps onto a Wikipedia URL. Average expert scores for the triples were [4.01, 4.13, 4.18, 4.20] (overall mean of 4.13) out of 5. The Fleiss kappa score, which measures inter-annotator agreement, was a strong 0.69. Additionally, 77% of scores are >= 4, and for 93% of the examples at least 3 of the 4 experts agree (73% have unanimous agreement).

BIER entity typing model training and test results We split our derived dataset of biomedical triples into train, validation, and test sets of sizes 31,340,000, 376,071, and 5,641,070, respectively. For comparison, the total data size used by Onoe and Durrett (2020) is 6.1 million and is based on the most popular categories of Wikipedia, whereas ours only uses categories on pages linked to UMLS.
We trained different BIER models using variants of BERT as the encoder for mentions and contexts. Specifically, we considered BioBERT (Lee et al., 2019), SciBERT, and BLURB (Gu et al., 2020) (we will refer to the latter as PubMedBERT), which constitute the current state of the art for many biomedical tasks. We compute entity typing macro F1 on development examples to check model convergence and use the hyperparameters from Onoe and Durrett (2020).
Debugging BIERs by combining dense and sparse embeddings We propose a technique for debugging using BIER representations that is in part inspired by prior work that used intermediate layer representations of training examples as additional features (Papernot and McDaniel, 2018). Specifically, we propose to debug BIER performance on downstream tasks by examining instances where the dense and sparse representations yield different outputs. For each example, BIER models produce an intermediate dense embedding h_[CLS] and an interpretable sparse output embedding t (red and purple, respectively, in Figure 1). We will refer to the two separate models that use these dense and sparse BIER embeddings for downstream tasks as BIERDense and BIERSparse, respectively.
After initially performing inference, we gather all test examples where BIERDense makes a correct prediction but BIERSparse does not, and we place their mention values into a set Z. Additionally, as a diagnostic measure, we consider an 'oracle' approach in which we use the BIERDense prediction for all instances in Z, and the BIERSparse output otherwise. The intuition is that Z contains examples for which the intermediate dense embeddings better represent a mention-context than the more interpretable sparse output embeddings from the BIER model.
Because the sparse embeddings are interpretable, this analysis affords a fine-grained view of which entity types are implicated when the two representations disagree.
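The oracle diagnostic can be sketched as follows (hypothetical helper, not the authors' code):

```python
def oracle_combine(dense_preds, sparse_preds, gold):
    """Build the set Z of examples the dense model gets right but the
    sparse model gets wrong, then score an oracle that uses the dense
    prediction on Z and the sparse prediction elsewhere."""
    Z = {i for i, (d, s, g) in enumerate(zip(dense_preds, sparse_preds, gold))
         if d == g and s != g}
    combined = [d if i in Z else s
                for i, (d, s) in enumerate(zip(dense_preds, sparse_preds))]
    accuracy = sum(c == g for c, g in zip(combined, gold)) / len(gold)
    return Z, accuracy
```

The resulting accuracy is an upper bound, since at test time the model would not know which instances belong to Z.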

Experimental Setup
To evaluate the utility of the proposed biomedical entity representations, we use them for the tasks of biomedical entity label classification (ELC) and named entity disambiguation (NED). We highlight that these models perform well even without fine-tuning, which is critical in low- or zero-supervision scenarios.

NED on Biomedical Wikipedia articles
The NED task connects entity mentions in text with real world entities in a knowledge base by disambiguating the true entity from a list of candidates. We consider the local resolution setting in which each instance features a single entity mention span in the input text and several possible candidates with corresponding descriptions (e.g., the first paragraph of their Wikipedia article).
NED dataset construction While there exist multiple biomedical named entity recognition and linking datasets (Mohan and Li, 2019; Basaldella et al., 2020), we did not find much in the way of publicly available biomedical NED corpora, and we therefore constructed a new dataset, which we will release for use by other researchers. The dataset is based on the set of Wikipages used by Rosenthal et al. (2019) as relevant medical literature, which consists of 34,692 medically relevant articles under the 'Clinical Medicine' category 7 . We used SLING 8 to process these articles and were able to retrieve around 1.5 million training examples (mention, context, [categories]) from them. After obtaining these examples, for each entity mention we used the CrossWikis dictionary (Spitkovsky and Chang, 2012) to try to gather between 3 and 5 challenging candidate entities for the example. This range for the number of candidates was selected because we wanted to include salient biomedical terms that are difficult to disambiguate; setting a higher number of potential candidates with CrossWikis largely yields general and short "popular" candidates (i.e., those that appear often in Wikipedia). This behavior makes sense, since many biomedical terms are quite specific and usually have only a few high-quality alternative candidates to select from. Additionally, we filter out redirect pages and pages that no longer match the wiki version used to create CrossWikis.
This candidate generation and data content acquisition step considerably reduces the number of available examples. We additionally subsample the dataset to reduce the instances where the "popular" candidate is the correct entity, so as to make the task more difficult and to allow more rare entities to appear in our set. After all the filtering, our ClinicalWikiNED dataset consists of a train/dev/test split of sizes 5,332, 3,730, and 800, respectively. Table 1 compares the two datasets introduced in this paper with one of the largest publicly available linked biomedical datasets (Murty et al., 2018).
Using BIERs for NED Using the BIER architecture, we first train a separate WikiDescription model that takes as input a wikipage title as its mention and the page's first paragraph as the context, and outputs a sparse embedding that predicts the page's categories. As training data, we use any Wikipedia page that contains categories in our biomedical entity type system. We use 2.5 million such (title, description, [categories]) triples as our training data, and we check for model convergence on a small development set. This model is used so that candidate embedding dimensions will align with our BIER mention-context embeddings.
For each mention m and context s in the test set, we use a BIER model to induce a sparse representation t. We then go through each candidate c_i for the current test example and use the WikiDescription model to retrieve the candidate's sparse output embedding t_{c_i}. Finally, we compute both the cosine similarity and the dot product of t with each candidate embedding t_{c_i} and, for each metric, predict the highest-scoring candidate as the true one.

Table 2: BIER zero-shot test results vs. logistic regression baselines trained on task data for the NED task.
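The candidate scoring step can be sketched as follows (hypothetical helper, assuming the sparse embeddings are NumPy vectors):

```python
import numpy as np

def rank_candidates(t, candidate_embs, metric="cosine"):
    """Score each candidate's sparse embedding against the mention
    embedding t and return the index of the best candidate."""
    C = np.asarray(candidate_embs)
    if metric == "cosine":
        scores = (C @ t) / (np.linalg.norm(C, axis=1) * np.linalg.norm(t))
    else:  # raw dot product
        scores = C @ t
    return int(np.argmax(scores))
```

Note that cosine similarity and the dot product can disagree when candidate embeddings differ in norm, which is why both metrics are reported.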
Baseline model for NED We use the EntEval (Chen et al., 2019) framework for our experiments and train a logistic regression classifier on a feature vector built from the mention-context embedding x_1 and the current candidate's wiki-description embedding x_2 (from the candidate set C_m), namely the concatenation of x_1, x_2, their element-wise product, and their absolute difference: [x_1, x_2, x_1 ⊙ x_2, |x_1 − x_2|]. Both x_1 and x_2 are obtained via BERT-based models. Training minimizes binary log loss using all negative examples. At test time, inference combines the classifier score with a prior probability reflecting how frequently candidates occur in Wikipedia, predicting argmax_{c ∈ C_m} [p_prior(c) + p_classifier(c)] as the final candidate. Directly using the most likely prior as the prediction yields an accuracy of 73.9%. We emphasize that these baselines are fine-tuned on the task data, while the BIER models only perform inference on the test set.
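A sketch of the baseline's feature construction and combined scoring rule; the dictionary-based interface for priors and classifier scores is an assumption for illustration:

```python
import numpy as np

def pair_features(x1, x2):
    """EntEval-style feature vector for a (mention, candidate) pair:
    [x1, x2, element-wise product, absolute difference]."""
    return np.concatenate([x1, x2, x1 * x2, np.abs(x1 - x2)])

def predict_entity(candidates, p_prior, p_classifier):
    """Final prediction: argmax over candidates of prior + classifier score."""
    return max(candidates, key=lambda c: p_prior[c] + p_classifier[c])
```

The additive combination lets the popularity prior break ties when the classifier is uncertain, while still allowing a confident classifier to override a popular candidate.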
Results Table 2 shows the results of the NED experiments. The biomedical BIER model affords improvements over the prior general domain interpretable model (Onoe and Durrett, 2020), showing that the biomedical type system and training is beneficial for this type of task. In addition, the BIER models outperform the baselines without fine-tuning on the training data.

ELC on Cancer Genetics data
For our entity label classification task we use the Cancer Genetics dataset (Pyysalo et al., 2013) which consists of 10,935 training, 3,634 dev, and 6,955 test examples from 300, 100, and 200 unique PubMed articles, respectively. 9 Given an article title and abstract, mention, and the corresponding entity label, the objective is to predict this label from 16 available coarse labels (see Table 7 in the Appendix for label distribution information).
To assess how well the learned BIER representations fare against comparable baselines, we perform a simple nearest neighbor classification technique using the proposed BIER model variants, the general domain model from Onoe and Durrett (2020), and non-BIER fine-tuned pre-trained language models as standalone encoders.
We iterate over the training examples and save the dense representations h_[CLS] produced by each encoder; for the BIER and Onoe and Durrett (2020) models we also save the final sparse entity embedding t.
We iterate over the test examples and similarly induce dense representations h_[CLS] for these and (if applicable) sparse representations t. We find each test example's nearest neighbor (under either ℓ2 distance or dot-product similarity) among the saved training-set embeddings and use its label as the prediction. We use the FAISS semantic indexer (Johnson et al., 2017) for storing embeddings and finding nearest neighbors quickly. We are interested in evaluating the off-the-shelf utility of learned representations and, as such, we do not train or fine-tune the models in any of these cases; rather, training examples are used only for nearest neighbor retrieval.
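A brute-force NumPy version of this 1-nearest-neighbor prediction (standing in for FAISS, purely as a sketch) might look like:

```python
import numpy as np

def nn_classify(train_embs, train_labels, test_embs, metric="l2"):
    """Predict each test example's label from its single nearest
    training embedding, under L2 distance or dot-product similarity."""
    train = np.asarray(train_embs)
    preds = []
    for q in np.asarray(test_embs):
        if metric == "l2":
            idx = int(np.argmin(np.linalg.norm(train - q, axis=1)))
        else:  # dot-product similarity
            idx = int(np.argmax(train @ q))
        preds.append(train_labels[idx])
    return preds
```

FAISS replaces the inner loop with an approximate index so the same retrieval scales to millions of stored embeddings.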
That said, for completeness we also performed additional experiments in which we do fine-tune models on the task data with varying amounts of supervision; we are especially interested in low-supervision settings. For the fine-tuning experiment, we add a linear layer on top of the best performing BIER and baseline models, use cross-entropy loss as our objective, and fine-tune them for 4 epochs on the training data before performing inference. For the low-supervision experiment, we show how the best nearest neighbor and fine-tuned models perform as the number of training examples per class is varied over [5, 10, 25, 50, 75, 100, 200].

9 In our experiments we combine the train and dev sets into a single training set.

Results Table 3 shows the results for our first experiment, in which we use untuned representations. We observe that the baseline language model encodings all perform worse than the proposed BIER sparse and dense models, with the exception of SciBERT, which fares better than the sparse BIER model based on SciBERT. Additionally, we see that BERT and Onoe and Durrett (2020) (which is based on BERT) both perform poorly on this biomedical task compared to the other baselines. Importantly, the sparse interpretable embeddings for our top performing models (BIER-PubMedBERT and BIER-BioBERT) perform near the level of their dense, non-interpretable counterparts. In the next section we examine illustrative test cases, along with a simple technique that leverages both the dense and sparse embeddings a BIER model provides to improve task performance and to identify where the entity typing model and type system may be underperforming. Table 4 shows the results of our fine-tuning experiment. Freezing the model and allowing only the classification layer to learn weights does not provide enough capacity in either case, while full fine-tuning improves performance for both models.
However, because the fine-tuned BIER model's output layer is no longer tied to the type system, the interpretability of our representations is lost in this setting; addressing this is left for future work. Figure 3 shows that BIER-PubMedBERT performs better than the fine-tuned, non-interpretable PubMedBERT model when there are fewer than 100 examples per class (which is the case for 6 out of the 16 test classes in the dataset, as seen in Table 7 in the appendix).

Debugging with BIERs
One of the claimed advantages of BIERs is their ability to facilitate model debugging. In this section we provide illustrative examples where the interpretability of the underlying representations offers insights into model behavior and suggests avenues for improvements.

Entity Type and Mention Analysis
We illustrate the proposed debugging strategy in the context of entity label classification. Recall that this entails inspecting test examples for which the dense model yields a correct prediction while the sparse variant does not (implying that the former somehow better represents the instance). We can inspect these cases to understand what entity types are leading to such behavior. Appendix Tables 11 and 12 enumerate such mentions and their most probable types. We note the inclusion of many people's names (e.g., "Anthony Campbell", "Tony Walsh") which have been assigned at least some incorrect types in their sparse representations. This highlights a general failure mode of the model: it assigns incorrect types to person names, which may cause downstream prediction errors. This is actionable information, as we could remedy the issue via rules, additional targeted supervision, or by down-weighting the probabilities given to common erroneous types for these mentions.
To better characterize entity type errors, we gather the set of the 20 most probable entity types for all mentions incorrectly predicted by the sparse BIER model and sort the types by frequency. We do the same for mentions predicted correctly. The resulting two lists share many of the same popular types, but comparing relative rankings and displaying only those types whose ranks are comparatively far apart 10 reveals some interesting results. Tables 9 and 10 report entity types correlated with correct and incorrect predictions, respectively. We emphasize that this type of analysis is only possible due to the interpretable nature of the proposed BIER embeddings.
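This ranking comparison can be sketched as follows (hypothetical helper; the rank-gap cutoff is an assumption):

```python
from collections import Counter

def divergent_types(correct_types, incorrect_types, min_gap=5):
    """Rank types by frequency within the correct and incorrect
    prediction sets, and return shared types whose ranks differ
    by at least `min_gap` positions (positive gap: more prominent
    among incorrect predictions)."""
    rank_c = {t: r for r, (t, _) in enumerate(Counter(correct_types).most_common())}
    rank_i = {t: r for r, (t, _) in enumerate(Counter(incorrect_types).most_common())}
    shared = set(rank_c) & set(rank_i)
    gaps = [(t, rank_c[t] - rank_i[t]) for t in shared]
    return sorted((g for g in gaps if abs(g[1]) >= min_gap),
                  key=lambda x: abs(x[1]), reverse=True)
```

Types with large positive gaps are candidates for down-weighting or targeted re-annotation, since they are disproportionately prominent among the model's errors.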
As a final illustrative debugging example, we consider the test mention "thyroid carcinomas" with label "Cancer", along with the predictions made by the sparse model ("thyroid", with the incorrect label "Organ") and the dense model ("esophageal carcinoma", with the correct label "Cancer"). We also retrieve the first correct prediction from the nearest neighbors of the sparse model embedding, "medullary thyroid carcinoma", which we refer to as the counterfactual sparse prediction. 11 We take the dot product of the mention-context embedding with these three predictions' embeddings and inspect the top types that lead to their selection in Figure 6 in Appendix C. Both the incorrect sparse and correct counterfactual sparse predictions are, at the surface level, quite similar to the test mention, but they have lower scores for the entity type 'thyroid cancer' compared with the dense prediction, which gives the correct label but is semantically less similar to the test mention than the counterfactual sparse prediction. Additionally, the noisy type "rtt" erroneously plays more of a role in the sparse model predictions.
Diagnosing task results In analyzing errors made by the highest performing BIER dense and sparse nearest neighbor models for the entity label classification task, we noticed that while there was high concurrence on correct predictions (i.e., of the 88% true predictions made by the dense model overall, the sparse model agreed with the prediction 95% of the time), the cases where the model predictions disagreed, but where one of them still predicted the true label, were quite varied. In other words, the sparse model gave correct results on many test cases where the dense model was incorrect, and vice versa. Applying the diagnostic technique from Section 3, we see the classifier's overall performance could have improved from 88.2 to 91.9 had the model known when to utilize its intermediate dense representation over its sparse output.
Similarly, we applied the diagnostic technique to the NED task; details appear in Appendix B. Incorporating mentions that the dense dot-product BIER model handles better than the cosine-similarity-based sparse model would have improved accuracy from 84.0 to 91.7. Table 5 shows the possible improvement in task accuracy for both tasks.

Related Work
In this work we have introduced a predefined fine-grained biomedical type system comprising 68k types, explicitly tied to PubMed. Instead of using a fixed type system, Raiman and Raiman (2018) seek to dynamically learn a 100-dimensional type system from a much larger general-domain type system in order to optimally disambiguate entities.
Aside from work on biomedical NLP and entities specifically, there exists a line of work on interpretable word embeddings (Subramanian et al., 2017; Faruqui et al., 2015). A common approach here is to identify the groups of words most associated with vector components globally, somewhat akin to topic models. This differs from our approach, which is based on an external type system and provides immediate, instance-level interpretable probabilities for each entity type. Hu et al. (2020) propose transforming dense representations to sparse ones, independent of entity typing.
Another related line of work tests a model's ability to induce syntactic or type information by measuring the accuracy of a probe (Peters et al., 2018; Hewitt and Manning, 2019; Hewitt and Liang, 2019). There is significant uncertainty about how to calibrate such post-hoc probing results (Voita and Titov, 2020), whereas our model's representations are directly interpretable.
While many interesting biomedical entity representation and linking task oriented works (Murty et al., 2018;Vashishth et al., 2020;Mondal et al., 2019;Sung et al., 2020;Liu et al., 2020) leverage PubMed or UMLS for semantic type, entity synonym, or self alignment purposes, our work is the first to incorporate interpretable embeddings that are linked to a biomedical entity type system.

Conclusions
We have introduced a new biomedical entity typing system and training set from a large corpus of biomedical texts. We will release this dataset, which comprises 37 million derived triples. Exploiting this data, we proposed Biomedical Interpretable Entity Representations (BIERs), in which dimensions correspond to fine-grained entity types, and values are predicted probabilities that a given entity is of the corresponding type.
Using two downstream biomedical tasks, we showed that BIER representations yield predictive performance that is competitive with dense (uninterpretable) representations, and that such representations are particularly beneficial in zero-shot or low-supervision settings. We also demonstrated that BIER representations can facilitate meaningful model debugging both at the mention and entity type level.

A BIER system level specifics
To better illustrate which mentions are retained during our filtering process, Table 6 shows the 6 concepts associated with an example mention of "phase II clinical trial" found in a PubMed article. All 6 concepts score higher than our minimum threshold, and we use the two highest-scoring matches that are within 2 points of each other: CUIDs C0282460 and C1096779. The former, C0282460, has a Wikipedia data item Q7180990 that corresponds to the page "wiki/Phases of clinical research", whose associated categories are "Clinical research", "Design of experiments", "Life sciences", and "Industry". The second result, C1096779, has no direct Wikipedia match, and the results we get from SLING include "Clinical trial", "Scientific control", "Medicine", "Topical medication", "Observational study", and "Literature". Hence, for this mention and context from a PubMed abstract, we are able to extract a triple of the form ("phase II clinical trial", context, ["Clinical research", "Design of experiments", "Life sciences", "Industry", "Clinical trial", "Scientific control", "Medicine", ...]).

Table 6: Using an NER tagger, we find 6 associated concepts in UMLS for the mention "phase II clinical trial" in the context sentence "Unraveling the molecular mechanism of BNC105, a phase II clinical trial vascular disrupting agent, provides insights into drug design."

B NED diagnostic details
For the NED task we used a BIER model's sparse embeddings of test mentions in their contexts and took the cosine similarity with a separate BIER model's sparse embeddings of candidate wiki descriptions to make our predictions. To apply the diagnostic technique, we first obtain task predictions using the dense embeddings from the BIER models, which yield test accuracies of 81 and 79.25 percent using the dot product and cosine similarity, respectively. Although the sparse cosine-similarity BIER model gave a higher test accuracy of 84.0 percent, applying the diagnostic technique by incorporating mentions that the dense dot-product BIER model handles better would have improved accuracy from 84.0 to 91.65 percent.