Dataset Debt in Biomedical Language Modeling

Large-scale language modeling and natural language prompting have demonstrated exciting capabilities for few- and zero-shot learning in NLP. However, translating these successes to specialized domains such as biomedicine remains challenging, due in part to biomedical NLP's significant dataset debt – the technical costs associated with data that are not consistently documented or easily incorporated into popular machine learning frameworks at scale. To assess this debt, we crowdsourced curation of datasheets for 167 biomedical datasets. We find that only 13% of datasets are available via programmatic access and 30% lack any documentation on licensing and permitted reuse. Our dataset catalog is available at: https://tinyurl.com/bigbio22.


Introduction
Natural language prompting has recently demonstrated significant benefits for language model pretraining, including unifying task inputs for large-scale multi-task supervision (Raffel et al., 2019) and improving zero-shot classification via explicit, multi-task prompted training data (Wei et al., 2022; Sanh et al., 2022). With performance gains reported when scaling to thousands of prompted training tasks (Xu et al., 2022), tools that enable large-scale integration of expert-labeled datasets hold great promise for improving zero-shot learning.
However, translating these successes to specialized domains such as biomedicine faces strong headwinds, due in part to the current state of dataset accessibility in biomedical NLP. Recently, data cascades was proposed as a term of art for the costs of undervaluing data in machine learning (Sambasivan et al., 2021). We propose a similar term, dataset debt, to capture the technical costs (Sculley et al., 2015) of using datasets that are largely open and findable, but inconsistently documented, structured, and otherwise inaccessible via a consistent, programmatic interface. This type of debt creates significant practical challenges when integrating complex domain-specific corpora into popular machine learning frameworks.
We claim that biomedical NLP suffers from significant dataset debt. For example, while HuggingFace's popular Datasets library (Lhoest et al., 2021) contains over 3,000 datasets, biomedical data are underrepresented and favor tasks with general domain appeal such as question answering or semantic similarity (PubMedQA, SciTail, BIOSSES). To assess the state of biomedical dataset debt, we built, to our knowledge, the largest catalog of metadata for publicly available biomedical datasets. We document provenance, licensing, and other key attributes per Gebru et al. (2021) to help guide future efforts for improving dataset access and machine learning reproducibility.
Our effort found low overall support for programmatic access, with only 13% (22/167) of our datasets present in the Datasets hub. Despite a proliferation of schemas designed to standardize dataset loading and harmonize task semantics, there remains no consistent API for easily incorporating biomedical data into language model training at scale.
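To make the access gap concrete, the sketch below shows the single-call loading pattern the Datasets library provides for the minority of biomedical datasets already on the hub. PubMedQA is used as the example; the dataset and config names reflect the hub listing at the time of writing and may change.
```python
# A minimal sketch of programmatic dataset access via HuggingFace Datasets.
# PubMedQA is one of the few biomedical datasets loadable this way; the
# "pqa_labeled" config name is taken from the hub listing.
from datasets import load_dataset

dataset = load_dataset("pubmed_qa", "pqa_labeled", split="train")

example = dataset[0]
print(example["question"])        # the question text
print(example["final_decision"])  # yes / no / maybe
```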

Data-Centric Machine Learning
Deep learning models are increasingly converging on commodified architectures. Data-centric machine learning (vs. model-centric) is inspired by the observation that the performance gains provided by novel architectures are often smaller than the gains obtained using better training data. We outline some key challenges and opportunities in data-centric language modeling. These are broadly applicable to NLP, but have strong relevance to biomedicine and the current state of dataset debt.

Curating and Cleaning Training Data
Popular language models such as GPT-3 (Brown et al., 2020) do not incorporate scientific or medical corpora in their training mixture, contributing to their lower performance on biomedical domains and few-shot tasks (Moradi et al., 2021). Additionally, simply training a language model on in-domain data can introduce non-trivial risks by recapitulating biases from the training corpora (Zhang et al., 2020; Gururangan et al., 2022).
In scientific literature, discounting source provenance could manifest as language models parroting conflicting or inaccurate scientific findings. Zhao et al. (2022) curated scientific corpora to identify patient-specific information (e.g., mining PubMed Central to identify case reports whose licenses permit reuse and redistribution). With sufficient metadata and dataset provenance, this level of curation could be extended to the entire training corpus of a biomedical language model.
Data cleaning has a large impact on language model performance. Deduplicating data leads to more accurate, more generalizable models that require fewer training steps (Cohen et al., 2013; Lee et al., 2021). Improving the consistency of answer response strings was reported to improve biomedical question answering (Yoon et al., 2021). Duplication contamination is a serious risk in biomedical datasets, which often iteratively build on or extend prior annotations, introducing the risk of test leakage during evaluation (Elangovan et al., 2021).
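As a concrete illustration of the deduplication step, below is a minimal sketch of exact-match duplicate detection between train and test splits. Real pipelines (e.g., Lee et al., 2021) also detect near-duplicates with methods such as MinHash, which this hash-based check cannot catch.
```python
# A minimal sketch of exact-duplicate detection across splits, a simple
# guard against the test-leakage risk described above. Only verbatim
# overlap (after light normalization) is detected.
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and case so trivial formatting differences
    # do not hide duplicates.
    return " ".join(text.lower().split())

def fingerprints(texts):
    return {hashlib.md5(normalize(t).encode("utf-8")).hexdigest() for t in texts}

def leaked(train_texts, test_texts) -> int:
    """Return the number of test examples that also appear in train."""
    train_fp = fingerprints(train_texts)
    return sum(
        hashlib.md5(normalize(t).encode("utf-8")).hexdigest() in train_fp
        for t in test_texts
    )
```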

Programmatic Labeling
Biomedical domains require specialized knowledge, making expert-labeled datasets time-consuming and expensive to generate. In limited-data settings, distant and weakly supervised methods (Craven and Kumlien, 1999) are often used to combine curated, structured resources (e.g., knowledge bases, ontologies) with expert rules to programmatically label data. These approaches have demonstrated success across NER, relation extraction, and other biomedical applications (Kuleshov et al., 2019; Fries et al., 2021). However, these approaches are typically applied to real, albeit unlabeled, data, creating challenges when modeling rare classes. A recent trend is transforming structured resources directly into realistic-looking, but synthetic, training examples. KELM (Agarwal et al., 2021) converts Wikidata knowledge graph triplets into synthesized natural language text for language model pretraining.
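Below is a minimal sketch of the programmatic labeling pattern described above: a rule that consults a structured resource to label unlabeled text. The toy ontology and label name are illustrative assumptions, not drawn from any cited system.
```python
# A minimal sketch of an ontology-backed labeling function, in the style
# of distant/weak supervision. The tiny in-memory "ontology" stands in
# for a real resource such as UMLS.
import re

DISEASE_TERMS = {"diabetes mellitus", "hypertension", "asthma"}  # toy ontology

def lf_disease_mention(sentence: str):
    """Label a sentence that contains a known disease term; abstain (None) otherwise."""
    for term in DISEASE_TERMS:
        if re.search(r"\b" + re.escape(term) + r"\b", sentence.lower()):
            return "DISEASE"
    return None  # abstain

labels = [lf_disease_mention(s) for s in [
    "The patient has a history of diabetes mellitus.",
    "No abnormalities were observed.",
]]
print(labels)  # ['DISEASE', None]
```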
Natural language prompting has emerged as a powerful technique for zero- and few-shot learning, where task guidance from prompts reduces sample complexity (Le Scao and Rush, 2021). Cross-lingual prompting (English prompts, non-English examples) has demonstrated competitive classification performance (Lin et al., 2021). Training language models directly on prompted data has produced large gains in zero-shot performance over GPT-3 while using models with far fewer parameters (Sanh et al., 2022; Wei et al., 2022).
PromptSource (Bach et al., 2022) is a recent software platform for creating prompts and applying them to existing labeled datasets to build training data. These developments highlight a promising trend toward defining programmatic transformations on top of existing datasets, enabling them to be reconfigured into new tasks. However, leveraging large-scale prompting remains challenging in biomedicine due to the lack of programmatic access to a large, diverse collection of biomedical datasets and tasks.
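Below is a minimal sketch of the transformation such prompting tools enable: rendering a labeled example into an (input, target) pair of strings. The template text and example fields are illustrative assumptions, using plain Python formatting rather than PromptSource's actual Jinja templates.
```python
# A minimal sketch of applying a prompt template to a labeled QA example
# to produce one prompted training instance. Template and fields are
# illustrative, not taken from PromptSource.
TEMPLATE = (
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer yes, no, or maybe:"
)

def apply_prompt(example: dict) -> tuple[str, str]:
    """Render one (input, target) pair from a labeled example."""
    source = TEMPLATE.format(context=example["context"], question=example["question"])
    target = example["final_decision"]
    return source, target

source, target = apply_prompt({
    "context": "Metformin lowered HbA1c in the treatment arm.",
    "question": "Does metformin improve glycemic control?",
    "final_decision": "yes",
})
print(source, "->", target)
```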

Diverse Evaluation and Benchmarking
Inspired by standardized benchmarks in general domain NLP research (Wang et al., 2018, 2019), BioNLP has taken similar initiatives, establishing a benchmark of 10 datasets spanning 5 tasks (BLUE; Peng et al., 2019), an improved benchmark of 13 datasets in 6 tasks (BLURB; Gu et al., 2022), and a benchmark of 9 tasks for Chinese biomedical NLP (CBLUE; Zhang et al., 2021). While these benchmarks provide tools for consistent evaluation, only BLURB supports a leaderboard and none directly provide dataset access. Evaluation frameworks that provide programmatic access are often restricted to single, well-established tasks and impose pre-processing choices that can make performance comparisons inconsistent (Crichton et al., 2017; Weber et al., 2021).
To the best of our knowledge, there are currently no zero-shot evaluation frameworks for biomedical data similar to BIG-Bench, which itself contains few, if any, biomedical tasks.
Evaluation frameworks must also allow probing trained language models' intrinsic properties, rather than only measuring downstream classification performance. Following Petroni et al. (2019) in the general NLP domain, Sung et al. (2021) introduce BioLAMA, a benchmark of 49K biomedical knowledge triplets for probing the relational knowledge present in pre-trained language models.
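Below is a minimal sketch of this style of relational-knowledge probing: a knowledge triplet is rendered as a cloze prompt and a masked biomedical language model fills the blank. The prompt wording and choice of checkpoint are illustrative assumptions; BioLAMA's own templates and evaluation protocol differ.
```python
# A minimal sketch of cloze-style knowledge probing with a masked LM.
# The checkpoint is a public biomedical model used here for illustration;
# any fill-mask-capable model works.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="dmis-lab/biobert-base-cased-v1.2")

# Triplet (subject="aspirin", relation="treats", object=?)
prompt = f"Aspirin is a drug used to treat {fill_mask.tokenizer.mask_token}."
for candidate in fill_mask(prompt, top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
```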

Metadata/Datasheet Curation
Our inclusion criteria targeted expert-annotated datasets designated as public, reusable research benchmarks for one or more NLP tasks. We excluded: (1) multimodal datasets where removing the non-text modality undermines the task, e.g., visual question answering, audio transcription, image-to-text generation; (2) general resource datasets, e.g., the PMC Open Access Subset or MIMIC-III (Johnson et al., 2016); (3) derived resources, e.g., knowledge bases constructed via text mining; and (4) modeling artifacts, e.g., static embeddings or pretrained language models.
We recruited 8 volunteers to identify datasets and crowdsource their metadata curation for an open, community dataset catalog. Participants reviewed the dataset publications and websites describing each curation process, and then completed the metadata schema outlined in Table 1. This schema loosely assesses compliance with FAIR data principles (Wilkinson et al., 2016).
Our initial effort identified 101 datasets. We combined this list with a contemporaneously curated catalog of biomedical datasets identified via systematic literature review (Blagec et al., 2022). Since the catalog described in Blagec et al. (2022) was generated using broader inclusion criteria (e.g., non-public data, imaging and video datasets), we identified 104/475 of its entries that met our criteria. After merging, we conducted a second round of crowdsourcing to annotate metadata, resulting in our current catalog of 167 biomedical datasets. We did not conduct a formal assessment of inter-annotator agreement.
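For illustration, a single catalog entry might be represented as follows. The field names paraphrase the metadata categories discussed above (provenance, licensing, access, format, language, task); the exact schema in Table 1 may differ, and the example values are placeholders.
```python
# A minimal sketch of one catalog entry as a dataclass. Field names and
# example values are illustrative, not the literal Table 1 schema.
from dataclasses import dataclass, field

@dataclass
class DatasetCard:
    name: str
    tasks: list[str]
    languages: list[str]
    license: str | None        # None when licensing is undocumented
    homepage: str
    programmatic_access: bool  # e.g., loadable via the Datasets API
    file_formats: list[str] = field(default_factory=list)

card = DatasetCard(
    name="BIOSSES",
    tasks=["semantic similarity"],
    languages=["English"],
    license=None,  # placeholder; consult the dataset's actual terms
    homepage="https://example.org/biosses",  # placeholder URL
    programmatic_access=True,
)
```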

Dataset Access
Only 22/167 (13%) of biomedical datasets are available via the Datasets API, despite 123/167 (74%) being openly hosted on public websites. The remaining datasets require authentication to access. Table 2 outlines the diversity of commonly used biomedical file formats. Most datasets are provided in semi-structured form (51%), followed by structured formats (22%) and non-standard plain text files (17%). Several structured formats propose a data model for parsing and standardizing task semantics (e.g., BRAT (Stenetorp et al., 2012), BioC (Comeau et al., 2013)). However, among the information extraction tasks that could use these formats, only 31/86 (36%) actually do.
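As an example of the parsing burden these formats impose, below is a minimal sketch of reading entity annotations from BRAT standoff format. It handles only simple contiguous text-bound ("T") annotations and skips discontinuous spans, relations, and events.
```python
# A minimal sketch of a BRAT standoff (.ann) entity parser. Each "T" line
# has the form: "T1<TAB>Label start end<TAB>surface text".
def parse_brat_entities(ann_text: str):
    """Yield (id, label, start, end, surface) tuples from .ann content."""
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations (R), events (E), notes (#), etc.
        ann_id, span, surface = line.split("\t")
        if ";" in span:
            continue  # skip discontinuous spans for this sketch
        label, start, end = span.split(" ")
        yield ann_id, label, int(start), int(end), surface

example = "T1\tDisease 10 18\tdiabetes"
print(list(parse_brat_entities(example)))
# [('T1', 'Disease', 10, 18, 'diabetes')]
```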

Task and Language Diversity
Across all tasks, 14 languages are covered, and five languages account for 95% of all datasets. English is the majority (80%), followed by Spanish (7.5%), German (2.4%), French (2.4%), and Chinese (2.4%). Table 4 contains counts of task categories binned into English and non-English datasets. Question answering and semantic similarity have zero non-English datasets.
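Once the catalog is machine-readable, such counts are straightforward to derive. Below is a minimal sketch using illustrative stand-in records rather than actual catalog entries.
```python
# A minimal sketch of deriving language counts from catalog records.
# The records are illustrative stand-ins, not real catalog entries.
from collections import Counter

catalog = [
    {"name": "BIOSSES", "language": "English", "task": "semantic similarity"},
    {"name": "PharmaCoNER", "language": "Spanish", "task": "NER"},
    {"name": "CHIP-CDN", "language": "Chinese", "task": "normalization"},
]

by_language = Counter(entry["language"] for entry in catalog)
total = sum(by_language.values())
for language, count in by_language.most_common():
    print(f"{language}: {count} ({100 * count / total:.1f}%)")
```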

Conclusion
In this work, we outlined several challenges in training biomedical language models. With increasingly large biomedical language models (Yang et al., 2022), limitations in the quality and properties of training data grow more stark. We argue that biomedical NLP suffers from significant dataset debt, with only 13% of datasets accessible via programmatic APIs and readily usable in state-of-the-art NLP tools. Current biomedical datasets are homogeneous, largely focused on NER and relation extraction tasks, and predominantly in English. These limitations highlight opportunities presented by recent data-centric machine learning methods such as prompting, which enables experts to inject task guidance into training and more easily reconfigure existing datasets into new training tasks.

Table 1: Metadata collected for all biomedical datasets. See Appendix A for more details on each category.

Table 2: Distribution of file formats for biomedical datasets.

Figure 3: Cumulative distribution of available tasks in the scientific/biomedical domain (e.g., PubMed abstracts), ordered by year of dataset release.

Table 4: Tasks by language.