Probing Pre-Trained Language Models for Disease Knowledge

Pre-trained language models such as ClinicalBERT have achieved impressive results on tasks such as medical Natural Language Inference. At first glance, this may suggest that these models are able to perform medical reasoning tasks, such as mapping symptoms to diseases. However, we find that standard benchmarks such as MedNLI contain relatively few examples that require such forms of reasoning. To better understand the medical reasoning capabilities of existing language models, in this paper we introduce DisKnE, a new benchmark for Disease Knowledge Evaluation. To construct this benchmark, we annotated each positive MedNLI example with the types of medical reasoning that are needed. We then created negative examples by corrupting these positive examples in an adversarial way. Furthermore, we define training-test splits per disease, ensuring that no knowledge about test diseases can be learned from the training data, and we canonicalize the formulation of the hypotheses to avoid the presence of artefacts. This leads to a number of binary classification problems, one for each type of reasoning and each disease. When analysing pre-trained models for the clinical/biomedical domain on the proposed benchmark, we find that their performance drops considerably.


Introduction
Pre-trained language models (LMs) such as BERT (Devlin et al., 2019) are currently the de facto architecture for solving most NLP tasks, and their prevalence in general language understanding tasks is today indisputable (Wang et al., 2018, 2019). Beyond generic benchmarks, it has been shown that LMs are also extremely powerful in domain-specific NLP tasks, e.g., in the biomedical domain (Lewis et al., 2020). While there are several reasons why they are preferred over standard neural architectures, one important (and perhaps less obvious) reason is that LMs capture a substantial amount of world knowledge. For instance, several authors have found that LMs are able to answer questions without having access to external resources (Petroni et al., 2019; Roberts et al., 2020), or that they exhibit commonsense knowledge (Forbes et al., 2019; Davison et al., 2019). To analyze the capabilities of LMs in a more systematic way, there is a growing interest in designing probing tasks, which are now common across the NLP landscape, e.g., for word and sentence-level semantics (Paperno et al., 2016; Conneau et al., 2018). In this paper we focus on (generic and specialized) LMs in the biomedical domain, and ask the following question: what kinds of medical knowledge do pre-trained LMs capture? More specifically, we focus on disease knowledge, which encompasses, for instance, the ability to link symptoms to diseases, or treatments to diseases.
Among the several biomedical LMs (i.e. LMs that have been pre-trained on biomedical text corpora) that exist today, some of the most prominent are SciBERT (Beltagy et al., 2019), BioBERT (Lee et al., 2020) and ClinicalBERT (Alsentzer et al., 2019). Rather than architectural features, these models differ from each other mostly in their pre-training corpora: SciBERT was trained from scratch on scientific papers; BioBERT is an adapted version of BERT (Devlin et al., 2019), which was fine-tuned on PubMed abstracts as well as some full-text biomedical articles; and ClinicalBERT was initialized from BioBERT and further fine-tuned on MIMIC-III notes (Johnson et al., 2016), which are clinical notes describing patients admitted to critical care units. These LMs have enabled impressive results on various reading comprehension benchmarks for the medical domain, such as MedNLI (Romanov and Shivade, 2018) and MEDIQA-NLI (Abacha et al., 2019) for Natural Language Inference (NLI), and PubMedQA (Jin et al., 2019b) for QA. As an example, accuracies as high as 98% have been reported on MEDIQA-NLI, which might suggest that medical NLI is essentially a solved problem. This would be exciting, as medical NLI intuitively requires a wealth of medical knowledge, much of which is not available in structured form.
However, a closer inspection of MedNLI, the most well-known medical NLI benchmark, reveals three important limitations, namely: (1) only a few test instances actually require medical disease knowledge, with instances that (only) require terminological and lexical knowledge (e.g. understanding acronyms or paraphrases) being more prevalent; (2) training and test examples often cover the same diseases, and thus it cannot be determined whether good performance comes from the capabilities of the pre-trained LM itself, or from the fact that the model can exploit similarities between training and test examples; and (3) hypothesis-only baselines perform rather well on MedNLI, which shows that this benchmark has artefacts that can be exploited, similarly to general-purpose NLI benchmarks (Poliak et al., 2018).
We therefore propose DisKnE (Disease Knowledge Evaluation), a new benchmark for evaluating biomedical LMs. This dataset explicitly addresses the three limitations listed above and thus constitutes a more reliable testbed for evaluating the disease knowledge captured by biomedical LMs. DisKnE is derived from MedNLI and is organized into two top-level categories, which cover instances requiring medical and terminological knowledge respectively. The medical category is furthermore divided into four sub-categories, depending on the type of medical knowledge that is required.
We empirically analyse the performance of existing biomedical LMs, as well as the standard BERT model, on the proposed benchmark. Our results show that all the considered LMs struggle with NLI examples that require medical knowledge. We also find that the relative performance of the pre-trained models differs across medical categories, where the best performance is obtained by ClinicalBERT, BioBERT, SciBERT or BERT depending on the category and experimental setting. Conversely, for examples that are based on terminological knowledge, overall performance is much higher, with relatively little difference between different pre-trained models. The contributions of this paper are as follows:
• We introduce a new benchmark to assess the disease-centred knowledge captured by pre-trained LMs, organised into categories that reflect the type of reasoning that is needed, and with training-test splits that avoid leakage of disease knowledge.
• We analyze the performance of several clinical/biomedical BERT variants on each of the considered categories. We find that all considered models struggle with examples that require medical disease knowledge.
• We find that without canonicalizing the hypotheses, hypothesis-only baselines achieve the best results in some categories. This shows that the original MedNLI dataset suffers from annotation artefacts, even within the set of entailment examples.

Related Work & Background
Knowledge Encoded in LMs There is a rapidly growing body of work that is focused on analyzing what knowledge is captured by pre-trained LMs. A recurring challenge in such analyses is to separate the knowledge that is already captured by a pre-trained model from the knowledge that it may acquire during a task-specific fine-tuning step. A common solution is to focus on zero-shot performance, i.e. to focus on tasks that require no fine-tuning, such as filling in a blank (Davison et al., 2019; Talmor et al., 2020). As an alternative strategy, Talmor et al. (2020) propose to analyse the performance of models that were fine-tuned on a small training set. Other work has focused on extracting structured knowledge from pre-trained LMs. Early approaches involved manually designing suitable prompts for extracting particular types of relations (Petroni et al., 2019). Recently, however, several authors have proposed strategies that automatically construct such prompts (Bouraoui et al., 2020; Jiang et al., 2020; Shin et al., 2020). Finally, Bosselut et al. (2019) proposed to fine-tune LMs on knowledge graph triples, with the aim of then using the model to generate new triples.
LMs for Biomedical Text As already mentioned in the introduction, a number of pre-trained LMs have been released for the biomedical domain. Several authors have analyzed the performance of these models, and the impact of including different types of biomedical corpora in particular. For instance, Peng et al. (2019) proposed an evaluation framework for biomedical language understanding (BLUE). They obtained the best results with a BERT model that was pre-trained on PubMed abstracts and MIMIC-III clinical notes. Another large-scale evaluation of biomedical LMs has been carried out by Lewis et al. (2020). To evaluate the biomedical knowledge that is captured in pre-trained LMs, as opposed to acquired during training, Jin et al. (2019a) freeze the transformer layers during training. They find that when biomedical LMs are thus used as fixed feature extractors, BioELMo outperforms BioBERT. Most closely related to our work, He et al. (2020) recently also highlighted the limited extent to which biomedical LMs capture disease knowledge. To address this, they proposed a pre-training objective which relies on a weak supervision signal, derived from the structure of Wikipedia articles about diseases. Other authors have suggested to include structured knowledge, e.g. from UMLS, during the pre-training stage of BERT-based models (Michalopoulos et al., 2020; Hao et al., 2020). Another strategy is to inject external knowledge into task-specific models (rather than at the pre-training stage), for instance in the form of definitions or UMLS concepts (Sharma et al., 2019). Kearns et al. (2019) presented an approach related to our work, in which they categorize each sentence pair according to the tense and focus (e.g. medication, diseases, procedures, location) of the hypothesis, with the aim of providing a detailed examination of MEDIQA-NLI. Based on this categorization, they compare the performance of the Enhanced Sequential Inference Model (ESIM) using ClinicalBERT, Embeddings of Semantic Predications (ESP), and cui2vec. However, their analysis was limited to the MEDIQA-NLI test set, whereas we include entailment examples from the entire MedNLI and MEDIQA-NLI datasets. Moreover, we focus specifically on the ability of LMs to distinguish between closely related diseases, and we move away from the NLI setting to avoid training-test leakage and artefacts.
Adversarial NLI Several Natural Language Inference (NLI) benchmarks have been found to contain artefacts that can be exploited by NLP systems to perform well without actually solving the intended task (Poliak et al., 2018; Gururangan et al., 2018). In particular, it has been found that strong results can often be achieved by only looking at the hypothesis of a (premise, hypothesis) pair. In response to this finding, several strategies for creating harder NLI benchmarks have been proposed. One established approach is to create adversarial stress tests (Naik et al., 2018; Glockner et al., 2018), in which synthetically generated examples are created to specifically test for phenomena that are known to confuse NLI models. This may, for instance, involve the use of WordNet to obtain nearly identical premise and hypothesis sentences, in which one word is replaced by an antonym or co-hyponym. In this paper, we rely on a somewhat similar strategy, using UMLS to replace diseases in hypotheses. As another strategy to obtain hard NLI datasets, Nie et al. (2020) used human annotators to iteratively construct examples that are incorrectly labelled by a strong baseline model. While the aforementioned works are concerned with open-domain NLI, some work on creating adversarial datasets for the biomedical domain has also been carried out. In particular, the robustness of systems for biomedical named entity recognition and semantic text similarity has been studied by introducing misspellings and swapping disease names for synonyms. To the best of our knowledge, no adversarial NLI datasets for the biomedical domain have yet been proposed.

Dataset Construction
In this section, we describe the process we followed for constructing DisKnE. As we explain in more detail in Section 3.1, this process involved filtering the entailment instances from the MedNLI and MEDIQA-NLI datasets, to select those in which the hypothesis expresses that the patient has (or is likely to have) a particular target disease. These instances were then manually categorized based on the type of knowledge that is needed for recognizing the validity of the entailment. Section 3.2 explains how negative examples were generated from these positive examples, and Section 3.3 describes how the per-disease training-test splits were constructed. As a final step, we canonicalize the hypotheses of all examples, as explained in Section 3.4. Note that the benchmark we propose consists of binary classification problems (i.e. predicting entailment or not), rather than the standard ternary NLI setting (i.e. predicting entailment, neutral, or contradiction), which is motivated by the fact that natural contradiction examples are hard to find when focusing on disease knowledge.

Selecting Entailment Pairs
We started from the set of all entailment pairs (i.e. premise-hypothesis pairs labelled with the entailment category) from the full MedNLI and MEDIQA-NLI datasets. We used MetaMap to find those pairs whose hypothesis mentions the name of a disease, and to retrieve the UMLS CUI (Concept Unique Identifier) code corresponding to that disease. We then manually identified those pairs, among the ones whose hypothesis mentions a disease, in which the hypothesis specifically expresses that the patient has that disease. For instance, in this step, a number of instances were removed in which the hypothesis expresses that the patient does not have the disease. The remaining cases were manually assigned to categories that reflect the type of disease knowledge that is needed to identify that the hypothesis is entailed by the premise. The considered categories are described in Table 1, which also shows the number of (positive) examples we obtained, along with illustrative examples. The primary distinction we make is between examples that need medical knowledge and those that need terminological knowledge. The former category is divided into four sub-categories, depending on the type of inference that is needed. First, we have the symptoms-to-disease category, containing examples where the premise describes the signs or symptoms exhibited by the patient, and the hypothesis mentions the corresponding diagnosis. Second, we have the treatments-to-disease category, where the premise instead describes medications (or other treatments followed by the patient). The third category, tests-to-disease, involves instances where the premise describes lab tests and diagnostic tools such as X-rays, CT scans and MRI. Finally, the procedures-to-disease category has instances where the premise describes surgeries and therapeutic procedures that the patient underwent.
In the terminological category, the disease is mentioned in both the premise and hypothesis, either as an abbreviation, a synonym or within a rephrased sentence.
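To make the filtering step above concrete, the following is a minimal sketch of how candidate pairs could be selected; it is not the exact pipeline used for DisKnE. In particular, extract_disease_concepts stands in for a call to MetaMap, and the restriction to the UMLS semantic type T047 ("Disease or Syndrome") is an assumption.

from typing import List, Tuple

# Assumed UMLS semantic type for diseases; the actual filtering criteria may differ.
DISORDER_SEMTYPES = {"T047"}

def extract_disease_concepts(text: str) -> List[Tuple[str, str]]:
    """Hypothetical wrapper around MetaMap: returns (CUI, semantic type) pairs."""
    raise NotImplementedError

def select_candidate_pairs(entailment_pairs):
    """entailment_pairs: iterable of (premise, hypothesis) strings labelled as entailment."""
    candidates = []
    for premise, hypothesis in entailment_pairs:
        disease_cuis = [cui for cui, semtype in extract_disease_concepts(hypothesis)
                        if semtype in DISORDER_SEMTYPES]
        if disease_cuis:
            # Manual annotation then decides whether the hypothesis actually asserts
            # that the patient has the disease, and assigns the knowledge category.
            candidates.append({"premise": premise,
                               "hypothesis": hypothesis,
                               "disease_cui": disease_cuis[0]})
    return candidates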

Generating Examples
The process outlined in Section 3.1 provides us with a set of positive examples. To obtain negative examples, we replace the target disease X from a given positive example by other diseases Y_1, ..., Y_n that are similar to X, but not ancestors or descendants of X in SNOMED CT (Donnelly et al., 2006).
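As an illustration, a possible implementation of this corruption step is sketched below, under the assumption that we have access to a similarity ranking over diseases and to the SNOMED CT hierarchy; both helpers are hypothetical placeholders rather than the exact procedure used for DisKnE.

def most_similar_diseases(x_cui: str, k: int) -> list:
    """Hypothetical: return the k disease CUIs judged most similar to X."""
    raise NotImplementedError

def is_ancestor_or_descendant(cui_a: str, cui_b: str) -> bool:
    """Hypothetical: check whether one concept subsumes the other in SNOMED CT."""
    raise NotImplementedError

def negative_diseases(x_cui: str, n: int) -> list:
    """Select n diseases similar to X that are not hierarchically related to X."""
    candidates = most_similar_diseases(x_cui, k=10 * n)   # over-generate, then filter
    selected = [y for y in candidates if not is_ancestor_or_descendant(x_cui, y)]
    return selected[:n]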

Training-Test Splits
Because our focus is on evaluating the knowledge captured by pre-trained language models, we want to avoid overlap between the sets of diseases in the training and test splits. In other words, if the model is able to correctly identify positive examples for a target disease X, this should be a reflection of the knowledge about X in the pre-trained model, rather than knowledge that it acquired during training. However, any single split into training and test diseases would leave us with a relatively small dataset. For this reason, we consider each disease X in isolation. Let E be the set of all positive examples, obtained using the process from Section 3.1. Furthermore, we write E_X for the set of those examples from E in which the target disease in the hypothesis is X. Finally, we write neg(X) for the set {Y_1, ..., Y_n} of associated diseases that was selected to construct negative examples, following the process from Section 3.2. For each target disease X, we define a corresponding test set Test_X and training set Train_X as follows. Test_X contains all the positive examples from E_X. Moreover, for each e ∈ E_X and each Y ∈ neg(X), we add a negative example e_{X→Y} to Test_X, which is obtained by replacing the occurrence of X by Y. If the word before the occurrence of X is "a" or "an", we modify it depending on whether Y starts with a vowel or consonant. The positive examples in Train_X consist of all examples from E in which neither X nor any of the diseases from neg(X) is mentioned. Note that we also remove examples in which these diseases are only mentioned in the premise. Furthermore, we check for occurrences of all the synonyms of these diseases that are listed in UMLS. The process of creating the training and test set for a given target disease X is illustrated in Figure 1.
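The sketch below illustrates how such a per-disease split could be constructed; the helper mentions is a hypothetical stand-in for the UMLS-synonym check described above, and the data structures are only assumptions, not the exact code used for DisKnE.

import re

def replace_disease(hypothesis: str, x_name: str, y_name: str) -> str:
    """Swap disease X for Y in a hypothesis, adjusting a preceding a/an to match Y."""
    article = "an" if y_name[0].lower() in "aeiou" else "a"
    hypothesis = re.sub(r"\b[Aa]n?\s+" + re.escape(x_name), article + " " + y_name, hypothesis)
    return hypothesis.replace(x_name, y_name)   # also cover mentions without an article

def build_split(x, positives, neg_diseases, mentions):
    """Construct Train_X and Test_X for a target disease x.

    positives    : dict mapping each disease name to its positive examples E_X
    neg_diseases : the similar diseases neg(X) used to corrupt hypotheses
    mentions     : hypothetical helper -- True if an example mentions the given
                   disease (or one of its UMLS synonyms) in premise or hypothesis
    """
    test = []
    for e in positives[x]:
        test.append((e, 1))                                      # positive example
        for y in neg_diseases:
            corrupted = dict(e, hypothesis=replace_disease(e["hypothesis"], x, y))
            test.append((corrupted, 0))                          # adversarial negative
    train_pos = [e for d, exs in positives.items() if d != x
                 for e in exs
                 if not mentions(e, x) and not any(mentions(e, y) for y in neg_diseases)]
    return train_pos, test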

Canonicalization
We noticed that the way in which a given hypothesis expresses that "the patient has disease X" is correlated with the type of the disease. For this reason, as a final step, we canonicalize the hypotheses in the dataset. Specifically, we replace each hypothesis by the name of the corresponding disease X. Several hypotheses in the dataset already have this form. By converting the other hypotheses into this format, we eliminate any artefacts that are present in their specific formulation.
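In code, this step amounts to little more than the following; the field names are assumptions carried over from the earlier sketches.

def canonicalize(example: dict) -> dict:
    """Replace the full hypothesis by the canonical name of the target disease,
    so that models only see the premise and the disease name."""
    example = dict(example)                        # copy, leave the original untouched
    example["hypothesis"] = example["disease_name"]
    return example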

Experiments
We experimentally compare a number of pretrained biomedical LMs on our proposed DisKnE benchmark. In Section 4.1, we first describe the considered LMs and the experimental setup. The main results are subsequently presented in Section 4.2. This is followed by a discussion in Section 4.3.

Experimental Setup
Pre-trained LMs.
To understand to what extent the pre-training data of an LM affects its performance on our fine-grained evaluation of disease knowledge, we used the following BERT variants:
BERT. We use the BERT-base-cased model (Devlin et al., 2019).
BioBERT. Lee et al. (2020) proposed a model based on BERT-base-cased, which they further trained on biomedical corpora. We use the version where PubMed and PMC were utilized for this further pre-training.
SciBERT. Beltagy et al. (2019) trained a model from scratch on papers from Semantic Scholar, 82% of which were biomedical articles. The full text of the papers was used for training. We use the cased version.
ClinicalBERT. Alsentzer et al. (2019) initialized their model from BioBERT and further trained it on MIMIC-III clinical notes (Johnson et al., 2016).
Training Details. For fine-tuning, the same hyperparameters were used across all BERT variants, including the random seeds, batch size and learning rate. Specifically, we fix the learning rate at 2e-5 and the batch size at 8, and we set the maximum number of epochs to 8, using early stopping. We used 10% of the training set as a validation split.
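As an illustration of this setup, a fine-tuning run could be configured as in the sketch below, which uses the Hugging Face transformers library; the checkpoint name, maximum sequence length and early-stopping patience are assumptions, not details reported in the paper.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

def fine_tune(train_examples, model_name="emilyalsentzer/Bio_ClinicalBERT", seed=42):
    """train_examples: list of dicts with 'premise', 'hypothesis' and 'label' (0/1)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    dataset = Dataset.from_list(train_examples)
    dataset = dataset.map(lambda ex: tokenizer(ex["premise"], ex["hypothesis"],
                                               truncation=True, max_length=128,
                                               padding="max_length"))
    split = dataset.train_test_split(test_size=0.1, seed=seed)    # 10% validation split

    args = TrainingArguments(
        output_dir="diskne_model",
        learning_rate=2e-5,                  # hyperparameters reported above
        per_device_train_batch_size=8,
        num_train_epochs=8,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,         # required for early stopping
        seed=seed,
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=split["train"], eval_dataset=split["test"],
                      callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
    trainer.train()
    return trainer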
Evaluation Protocol. We analyze the results per disease and per category in terms of F1 score for the positive class, reporting results for all diseases that have at least two positive examples for the considered category. To this end, for each disease X, we start from its corresponding training-test split, which was constructed as explained in Section 3.3. To show the results for a particular category, we remove from the test set all the examples that do not belong to that category.
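A sketch of this evaluation protocol is given below; the field names and the alignment between test examples and model predictions are assumptions.

from collections import defaultdict
from sklearn.metrics import f1_score

def per_category_f1(test_examples, predictions, category):
    """test_examples: dicts with 'disease', 'category' and 'label'; predictions: aligned 0/1 labels."""
    by_disease = defaultdict(lambda: ([], []))
    for ex, pred in zip(test_examples, predictions):
        if ex["category"] != category:
            continue                          # keep only the category of interest
        gold, hyp = by_disease[ex["disease"]]
        gold.append(ex["label"])
        hyp.append(pred)
    scores = {}
    for disease, (gold, hyp) in by_disease.items():
        if sum(gold) >= 2:                    # at least two positive examples required
            scores[disease] = f1_score(gold, hyp, pos_label=1)
    return scores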

Results
The main results are shown in Tables 2-6. A number of clear observations can be made. First, the results for the terminological category are substantially higher than the results for the other categories, which suggests that the masked language modelling objective, which is used as the main pre-training task in all the considered LMs, may not be ideally suited for learning medical knowledge. Second, recall that the main difference between the considered biomedical LMs comes from the corpora that were used for pre-training them. As the results for the terminological category (Table 6) reveal, the inclusion of domain-specific corpora does not seem to benefit their ability to model biomedical terminology, as similar results for this category are obtained with the standard BERT model, which was pre-trained on Wikipedia and a corpus of books and movie scripts.

For the Symptoms → Disease category, we see that ClinicalBERT outperforms the other biomedical LMs, although the standard BERT model actually achieves the best performance overall. The results suggest that ClinicalBERT is better at distinguishing between relatively rare diseases, but that the focus on encyclopedic text benefits BERT for more common diseases. Intuitively, we can indeed expect that the encyclopedic style of Wikipedia focuses more on symptoms of diseases than scientific articles, which might focus more on treatments, procedures and diagnostic tests. This is also in accordance with the findings from He et al. (2020).

Finally, comparing the models with the hypothesis-only baselines, we find that without canonicalization, the hypothesis-only baseline performs similarly to the full model, even outperforming it in a few cases, with the exception of the Terminological category, where a clear drop in performance for the hypothesis-only baseline can be seen. In contrast, for the canonicalized version of the dataset, the hypothesis-only baseline, which only gets access to the name of the disease in this case, under-performs consistently and substantially. Note that the hypothesis-only baseline still achieves a non-trivial performance in most cases, given that an uninformed classifier that always predicts true would achieve an F1 score of 0.167. However, this simply shows that the model has learned to prefer frequent diseases over rare ones.

Adversarial Examples.
A key design choice has been to select negative examples from the diseases that are most similar to the target disease. To analyse the impact of this choice, we carried out an experiment in which negative examples were instead randomly selected. As before, we only consider diseases that are present in the dataset, and we ensure that negative examples are not ancestors or descendants of the target disease in SNOMED CT. The results are presented in Table 8. As expected, the results are overall higher than those from the main experiment. More surprisingly, this easier setting benefits some models more than others. The relative performance of ClinicalBERT in particular is now clearly better, with this model achieving the best results for Symptoms → Disease. Furthermore, the standard BERT model now clearly underperforms the biomedical LMs, except for Procedures → Disease, where it outperforms ClinicalBERT and BioBERT.

Conclusion
We have proposed DisKnE, a new benchmark for analysing the extent to which biomedical language models capture knowledge about diseases. Positive examples were obtained from MedNLI and MEDIQA-NLI, by manually identifying and categorizing hypotheses that express that the patient has some disease. Negative examples were selected to be similar to the target disease. To prevent shortcut learning, the hypotheses were canonicalized, such that models only get access to the name of the disease that is inferred. Our empirical analysis shows that existing biomedical language models particularly struggle with cases that require medical knowledge. The relative performance on the different categories suggests that different (biomedical) LMs have complementary strengths.