Can Language Models be Biomedical Knowledge Bases?

Pre-trained language models (LMs) have become ubiquitous in solving various natural language processing (NLP) tasks. There has been increasing interest in what knowledge these LMs contain and how we can extract it, treating LMs as knowledge bases (KBs). While there has been much work on probing LMs in the general domain, little attention has been paid to whether these powerful LMs can be used as domain-specific KBs. To this end, we create the BioLAMA benchmark, which comprises 49K biomedical factual knowledge triples for probing biomedical LMs. We find that biomedical LMs with recently proposed probing methods can achieve up to 18.51% Acc@5 on retrieving biomedical knowledge. Although this seems promising given the task difficulty, our detailed analyses reveal that most predictions are highly correlated with the prompt templates alone (i.e., without any subjects), hence producing similar results for each relation and hindering their use as domain-specific KBs. We hope that BioLAMA can serve as a challenging benchmark for biomedical factual probing.


Introduction
Recent success in natural language processing can be largely attributed to powerful pre-trained language models (LMs) that learn contextualized representations of words from large amounts of unstructured corpora (Peters et al., 2018; Devlin et al., 2019). There have been recent works on probing how much knowledge these LMs contain in their parameters (Petroni et al., 2019) and how to effectively extract such knowledge (Shin et al., 2020; Jiang et al., 2020b; Zhong et al., 2021). While factual probing of LMs has attracted much attention from researchers, a more practical application would be to leverage the power of domain-specific LMs (Beltagy et al., 2019; Lee et al., 2020) as domain knowledge bases (KBs). Unlike recent works that probe general domain knowledge, we ask whether it is also possible to retrieve expert knowledge from LMs. Specifically, we turn our focus to factual knowledge probing for the biomedical domain, as shown in Figure 1.
To inspect the potential utility of LMs as biomedical KBs, we create and release the Biomedical LAnguage Model Analysis (BIOLAMA) probe. BIOLAMA consists of 49K biomedical factual triples whose relations have been manually curated from three different knowledge sources: the Comparative Toxicogenomics Database (CTD), the Unified Medical Language System (UMLS), and Wikidata. While our biomedical factual triples are inherently more difficult to probe (see Table 1 for examples), BIOLAMA also poses technical challenges such as multi-token object decoding.
Initial probing results on BIOLAMA show that the best performing LM achieves up to 7.28% Acc@1 and 18.51% Acc@5, outperforming an information extraction (IE) baseline (Lee et al., 2016). Although this result seems promising, we find that the LMs' output distributions are largely biased towards a small number of entities in each relation. Along this line, we use two metrics, prompt bias (Cao et al., 2021) and synonym variance, to investigate the behavior of LMs as KBs. Our analysis shows that while LMs seem to be more aware of synonyms than the IE baseline, they output highly biased predictions given the prompt template of each relation. Our results call for better LMs and probing methods that can retrieve diverse yet accurate biomedical entities.

BIOLAMA
In this section, we detail the construction of BIOLAMA, including the data curation process and pre-processing steps. Statistics and examples of BIOLAMA are shown in Table 1 along with those from LAMA (Petroni et al., 2019).

Knowledge Sources
CTD The CTD is a public biomedical database of relationships and interactions between biomedical entities such as diseases, chemicals, and genes (Davis et al., 2020). It provides both manually curated and automatically inferred triples in English; we use only the manually curated triples to ensure the quality of our dataset. We use the April 1st, 2021 version of the CTD.

Data Pre-processing
From our initial factual triples from the knowledge sources above, we apply several pre-processing steps to further improve the quality of BIOLAMA. First, considering the trade-off between the coverage and difficulty of probing, we restrict the lengths of entities to be ≤10 subwords, which covers 90% of the entities. Note that LAMA contains only single-token objects, which makes the task easier but less practical. Following Poerner et al. (2020), we also discard easy triples whose objects are substrings of the paired subjects (e.g., "iron deficiency"-"iron"), which prevents trivial solutions based on the surface forms of the subjects. For each relation, we split samples into training, development, and test sets with a 40:10:50 ratio. The training set is provided for learning or finding good prompts for each relation. More details on the pre-processing steps are available in Appendix A.
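As a concrete illustration, below is a minimal sketch of the filtering and splitting steps described above, assuming triples are (subject, relation, object) string tuples and that `tokenizer` is any subword tokenizer (e.g., from HuggingFace); the helper names are our own, not the released pre-processing code.

```python
# A minimal sketch of the pre-processing filters, under the assumptions
# stated above; not the authors' released code.
import random

MAX_SUBWORDS = 10  # covers ~90% of the entities

def keep_triple(subj, obj, tokenizer):
    # Restrict entity lengths to at most 10 subwords.
    if len(tokenizer.tokenize(subj)) > MAX_SUBWORDS:
        return False
    if len(tokenizer.tokenize(obj)) > MAX_SUBWORDS:
        return False
    # Discard easy triples whose object is a substring of the subject
    # (e.g., "iron deficiency" -> "iron").
    if obj.lower() in subj.lower():
        return False
    return True

def split_relation(triples, seed=0):
    # 40:10:50 train/dev/test split within each relation.
    random.Random(seed).shuffle(triples)
    n = len(triples)
    return (triples[: int(0.4 * n)],              # train
            triples[int(0.4 * n): int(0.5 * n)],  # dev
            triples[int(0.5 * n):])               # test
```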
After the pre-processing, we obtain 22K triples with 15 relations from the CTD, 21.2K triples with 16 relations from the UMLS, and 5.8K triples with 5 relations from Wikidata (see Appendix B for the detailed statistics). In Table 2, we compare various probing benchmarks with BIOLAMA. By design, BIOLAMA has no objects that are substrings of their subjects, and its object entities are much longer on average, which makes our benchmark challenging but much more practical.
Evaluation Metric We use top-k accuracy (Acc@k), which is 1 if any of the top-k predicted object entities is included in the annotated object list, and 0 otherwise. We use both Acc@1 and Acc@5 since most biomedical entities are related to multiple other biomedical entities (i.e., N-to-M relations).
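For concreteness, Acc@k over a relation's test set can be computed as in the following sketch; the helper names and data layout are ours, not the released evaluation code.

```python
# Acc@k for one triple: 1 if any top-k prediction is an annotated object.
def acc_at_k(ranked_predictions, gold_objects, k):
    return int(any(p in gold_objects for p in ranked_predictions[:k]))

# Macro average over all test triples of a relation, in percent.
def mean_acc_at_k(all_predictions, all_gold_objects, k):
    scores = [acc_at_k(preds, golds, k)
              for preds, golds in zip(all_predictions, all_gold_objects)]
    return 100.0 * sum(scores) / len(scores)
```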

Models
Information Extraction Many biomedical NLP tools rely on automated IE systems that can provide relevant entities or articles given a query. In this work, we use the Biomedical Entity Search Tool (BEST) (Lee et al., 2016) as an IE system and compare it with LM-based probing methods. BEST incorporates biomedical entities when building its search index over PubMed, a large-scale biomedical corpus, and returns biomedical entities given a keyword-based query. To make full use of BEST, we create AND queries consisting of a subject entity and a lemmatized relation name (e.g., "(meclozine) AND (medical condition treat)"), and use the retrieved entities as its predictions.
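A minimal sketch of the query construction is given below; the paper does not specify the lemmatizer, so the WordNet-based heuristic here is our assumption rather than BEST's actual interface.

```python
# Building keyword AND queries for BEST; the lemmatization heuristic is an
# assumption (requires: nltk.download("wordnet")).
from nltk.stem import WordNetLemmatizer

_lemmatizer = WordNetLemmatizer()

def lemmatize_word(word):
    # Try verb and noun lemmatization and keep the shortest form; a crude
    # heuristic, since the exact lemmatizer is not specified in the paper.
    forms = sorted({_lemmatizer.lemmatize(word, pos=p) for p in ("v", "n")})
    return min(forms, key=len)

def build_best_query(subject, relation_name):
    relation = " ".join(lemmatize_word(w) for w in relation_name.lower().split())
    return f"({subject}) AND ({relation})"

# e.g., build_best_query("meclozine", "medical conditions treated")
# -> "(meclozine) AND (medical condition treat)"
```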

Probing Methods
Prompts We use a fill-in-the-blank cloze statement (i.e., a "prompt") for probing and choose two different methods of prompt generation: manual prompts (Petroni et al., 2019) and OptiPrompt (Zhong et al., 2021). For each relation, we first create manual prompts with domain experts (Appendix C). OptiPrompt, on the other hand, automatically learns continuous embeddings that better extract factual knowledge for each relation; these embeddings are trained on our training examples. Following Zhong et al. (2021), we initialize the continuous embeddings with the embeddings of the manual prompts, which worked consistently better than random initialization in our experiments.
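To illustrate manual-prompt probing, the sketch below fills a cloze template with a masked LM. The exact BioBERT checkpoint and the template are illustrative assumptions, and only the single-mask case is shown; multi-token decoding is described next.

```python
# A minimal single-mask probing sketch with a HuggingFace masked LM;
# the checkpoint name and the template are illustrative assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

template = "[X] has symptoms such as [Y]."  # illustrative manual prompt
prompt = template.replace("[X]", "COVID-19").replace("[Y]", tokenizer.mask_token)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
top5 = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))  # top-5 single-token candidates
```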
Multi-token Object Decoding Since the majority of entities in BIOLAMA consist of multiple tokens, we implement a multi-token decoding strategy following Jiang et al. (2020a). Among their decoding methods, we use the confidence-based method, which produced the best results. The confidence-based method greedily decodes output tokens in the order of the maximum logit at each token position. Note that we do not restrict the output space to any pre-defined set of biomedical entities since we are more interested in how accurately the LMs can produce biomedical knowledge in an unconstrained setting. See Appendix D for the implementation details of our decoding method.
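The sketch below shows one way to implement the confidence-based strategy, reusing `model` and `tokenizer` from the previous snippet; it is our reading of Jiang et al. (2020a), simplified to omit the beam search (effectively beam size 1 rather than the 5 used in our experiments) and the search over object lengths.

```python
# Greedy confidence-based decoding: start from several [MASK] tokens and,
# at each step, commit the still-masked position whose best token has the
# highest logit. A simplified sketch (no beam search, no length search).
import torch

def confidence_decode(model, tokenizer, prompt_with_masks):
    ids = tokenizer(prompt_with_masks, return_tensors="pt")["input_ids"][0]
    mask_positions = (ids == tokenizer.mask_token_id).nonzero().flatten().tolist()
    remaining = set(mask_positions)
    while remaining:
        with torch.no_grad():
            logits = model(input_ids=ids.unsqueeze(0)).logits[0]
        best_pos = max(remaining, key=lambda p: logits[p].max().item())
        ids[best_pos] = logits[best_pos].argmax()
        remaining.remove(best_pos)
    # Decode only the (contiguous) object span.
    return tokenizer.decode(ids[mask_positions[0]: mask_positions[-1] + 1].tolist())

# usage: replace [Y] in the template above with e.g. 10 mask tokens
# prompt = template.replace("[X]", "COVID-19").replace(
#     "[Y]", " ".join([tokenizer.mask_token] * 10))
# print(confidence_decode(model, tokenizer, prompt))
```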

Main Results
Experimental results on BIOLAMA are summarized in Table 3. First, BioBERT and Bio-LM both retrieve factual information better than BERT, which demonstrates the effectiveness of domain-specific pre-training. Bio-LM also shows consistently better performance than BioBERT (BERT < BioBERT < Bio-LM). We believe this may be attributed to the custom vocabulary of Bio-LM, which is learned from a biomedical corpus. Using OptiPrompt also shows consistent improvement over manual prompts for all LMs. Notably, the IE system achieves the best performance on the CTD relations, but performs worse than BioBERT and Bio-LM on the UMLS and Wikidata relations. While we achieve 18.51% Acc@5 with Bio-LM (w/ OptiPrompt) on average, note that the average Acc@1 on the CTD and UMLS relations is lower than majority voting (e.g., 9.67% for majority voting vs. 8.25% Acc@1 for Bio-LM on UMLS), which shows the difficulty of accurately extracting biomedical facts from these models.

Table 3: Main experimental results on BIOLAMA. We report Acc@1/Acc@5 of each model, including the macro average across the three knowledge sources. We also report the ratio of the majority object in each knowledge source (averaged over its relations) in parentheses. The highest and second-highest scores are boldfaced and underlined, respectively. Manual: manual prompts. Opti.: OptiPrompt. The results of OptiPrompt are the mean of 5 runs with different seeds. See Appendix E for the performance on each relation.
We further provide analyses that qualitatively and quantitatively characterize the behavior of each model. Our analyses suggest that we might need stronger biomedical LMs and probing methods to make use of these LMs as domain-specific knowledge bases.

Predictions
In Table 4, we present two correct and two incorrect predictions for three different relations where Bio-LM (w/ OptiPrompt) achieves high accuracy. One aspect that stands out is that predictions tend to be highly biased towards a few objects (e.g., "headache", "pain", or "ESR1"). Motivated by this observation, we further measure two metrics that can characterize the behavior of each model in detail: prompt bias and synonym variance.

How Biomedical LMs Predict
Prompt Bias To serve as accurate KBs, LMs must make appropriate object entity predictions given the input subject entity. Cao et al. (2021) quantified prompt bias by measuring how insensitive LMs are to input subjects. For each relation, we first obtain the probability histogram of each unique object entity being a top-1 prediction when the subject is given. For example, if one relation has 100 test samples and "pain" appears 20 times as its top-1 prediction, the probability mass of "pain" becomes 20%. At the same time, we calculate the probability distribution over unique object entities when the subject is masked out (see Figure 2). For instance, a model might assign 30% to "pain" even when the subject is masked out from the prompt. Prompt bias is the Pearson correlation coefficient between these two distributions, which indicates how biased the model is to a prompt. A lower prompt bias means that a model is giving less biased predictions for each relation (i.e., prompt).

Figure 2: Examples of inputs for measuring prompt bias and synonym variance. We use a [MASK] token for the subject when measuring prompt bias, and replace each subject with one of its synonyms when measuring synonym variance.
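Concretely, prompt bias can be computed as in the following sketch; the count-based histograms and the function name are our own reading of the description above, not Cao et al.'s released code.

```python
# Prompt bias: Pearson correlation between the top-1 object distribution
# with the subject given and with the subject masked out. A sketch based
# on the description above.
from collections import Counter
from scipy.stats import pearsonr

def prompt_bias(top1_with_subject, top1_subject_masked):
    """Each argument: list of top-1 predicted objects over a relation's
    test set, with the subject present vs. replaced by [MASK]."""
    objects = sorted(set(top1_with_subject) | set(top1_subject_masked))
    with_counts = Counter(top1_with_subject)
    masked_counts = Counter(top1_subject_masked)
    p = [with_counts[o] / len(top1_with_subject) for o in objects]
    q = [masked_counts[o] / len(top1_subject_masked) for o in objects]
    corr, _ = pearsonr(p, q)
    return corr  # higher => predictions depend less on the input subject
```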
Synonym Variance Biomedical entities often have a number of synonyms, which are often leveraged for modeling biomedical entity representations (Sung et al., 2020). Hence, it is important that predictions over our factual triples do not change when the input subject is replaced by its synonyms. To assess this aspect, we propose a metric called synonym variance, which measures how much each prediction changes when the subject is replaced with its synonyms (see Figure 2). We create 10 copies of our datasets by replacing the subjects with one of their synonyms chosen randomly. Synonym variance is the standard deviation of Acc@5 calculated from these new test sets. Lower synonym variance means that a model is giving more consistent predictions even with different synonyms.
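A minimal sketch of synonym variance follows; `probe_acc_at_5` stands for any function returning Acc@5 on a given test set, and all names are hypothetical.

```python
# Synonym variance: standard deviation of Acc@5 over synonym-substituted
# copies of the test set. `probe_acc_at_5` is a hypothetical callback.
import random
import statistics

def synonym_variance(test_set, synonyms, probe_acc_at_5, n_copies=10, seed=0):
    """test_set: list of (subject, relation, object) triples.
    synonyms: dict mapping each subject to a list of its synonyms."""
    rng = random.Random(seed)
    accs = []
    for _ in range(n_copies):
        substituted = [(rng.choice(synonyms[subj]), rel, obj)
                       for subj, rel, obj in test_set]
        accs.append(probe_acc_at_5(substituted))
    return statistics.stdev(accs)  # lower => more consistent under synonymy
```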
Results Figure 3 shows the prompt biases of four different models. Compared to the IE system, the LMs have relatively high correlations (over 0.6), meaning that their predictions are more biased towards the prompts. On the other hand, as shown in Figure 4, the LMs show relatively lower standard deviations over synonym variations than the IE system does. While this can be interpreted as the LMs being more robust to synonym variations, it might also result from the LMs' strong biases towards their prompts. For example, while BERT has the smallest synonym variance, it has the largest prompt bias, meaning that it is not a synonym-aware model, but simply a highly biased one.

Conclusion
In this work, we explore the possibility of using LMs as biomedical KBs. To this end, we release BIOLAMA as a probing benchmark to measure how much biomedical knowledge can be extracted from LMs. While biomedical LMs can extract useful facts to some extent, our analysis shows that this is largely due to their predictions being biased towards certain prompts. In future work, we plan to overcome the underlying challenges in BIOLAMA and improve the probing accuracy of LMs.

Ethical Considerations
The aim of factual probing is to verify how much knowledge can be retrieved from language models pre-trained on large amounts of corpora. Due to a lack of data for factual probing in the biomedical domain, we collected data from widely used knowledge sources: the CTD, the UMLS, and Wikidata. Although these data have undergone inspection by domain experts, biomedical knowledge is continuously growing, and therefore we cannot guarantee that this biomedical knowledge is absolute. Furthermore, without careful inspection, the outputs of these LMs should not be considered a means of drug recommendation or any other medical activity. We caution future researchers using BIOLAMA to keep this caveat in mind.

A Pre-processing of BIOLAMA

After applying the basic pre-processing steps in §2, we aggregate samples with the same subject and relation, which makes each sample contain multiple object answers (e.g., subj="COVID-19", relation="symptoms of", obj={headache, cough, fever, . . . }). We also set the maximum number of triples in each relation to 2,000 while removing relations with fewer than 500 triples, which are mostly less useful to extract (e.g., "affects methylation" in CTD) or too complicated (e.g., "positive therapeutic predictor" in Wikidata) according to our manual inspection with domain experts. For the UMLS, out of 974 relations, we select 16 relations that are considered the most important by domain experts.

To mitigate the class imbalance problem in object entities, we also undersample highly frequent object entities to be as frequent as the fifth most frequent object entity in each relation.
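A minimal sketch of this undersampling step follows; the triple layout and helper name are ours, not the released pre-processing code.

```python
# Cap each object's frequency at that of the fifth most frequent object
# in the relation; a sketch under the assumptions stated above.
import random
from collections import Counter, defaultdict

def undersample_objects(triples, seed=0):
    counts = Counter(obj for _, _, obj in triples)
    if len(counts) < 5:
        return triples
    cap = counts.most_common(5)[-1][1]  # frequency of the 5th most frequent object
    grouped = defaultdict(list)
    for triple in triples:
        grouped[triple[2]].append(triple)  # group by object entity
    rng = random.Random(seed)
    kept = []
    for obj_triples in grouped.values():
        rng.shuffle(obj_triples)
        kept.extend(obj_triples[:cap])
    return kept
```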

B Statistics of BIOLAMA
The CTD split has a total of 22,017 samples, the UMLS split has 21,164 samples, and the Wikidata split has 5,855 samples, which sums to a total of 49,036 samples. Table 5 displays the number of samples in each train/dev/test split of each relation.

C Manual Prompts
We create multiple manual prompts with the help of domain experts' insight into each relation in BIOLAMA and select the best performing prompts on the development set. The selected prompts for the relations are listed in Table 6.

D Implementation Details
For confidence-based decoding (Jiang et al., 2020a), we use the open-source code provided by the authors (https://github.com/jzbjyb/X-FACTR) with slight changes for BIOLAMA. We set the beam size to 5 to obtain the top 5 predictions and the number of masks to 10. We also set the iteration method to "None" as additional iterations did not improve performance. For OptiPrompt (Zhong et al., 2021), we modify the open-source code provided by the authors (https://github.com/princeton-nlp/OptiPrompt) to allow training over multi-token objects. We set the learning rate to 3e-3 and the mini-batch size to 16. We train OptiPrompt for 10 epochs and select the best checkpoint based on Acc@1 on the development set. It takes 3 hours to test all samples with manual prompts and 8 hours to train and test with OptiPrompt using one Titan X (12GB) GPU.

E Result on Each Relation
In addition to the averaged performances presented in Table 3, we present Acc@1 and Acc@5 on each relation in Table 7.

F More Prediction Examples
We provide more examples on 8 relations where Bio-LM (w/ OptiPrompt) achieves decent top-1 accuracy in Table 8.