ClinicalT5: A Generative Language Model for Clinical Text



Introduction
In the past few years, large pre-trained language models (PLMs), such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), GPT-3 (Brown et al., 2020), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020), have achieved great success on a variety of downstream tasks in natural language processing (NLP). These PLMs mainly rely on self-supervised pre-training on large amounts of general-domain text, e.g., Wikipedia, news articles, and web crawl corpora, and are widely adopted in downstream applications. Despite the superior performance of these PLMs on general-domain text, their performance on domain-specific text is relatively poor (Ma et al., 2019). To bridge this gap, researchers have proposed building domain-specific PLMs through fine-tuning or pre-training from scratch on domain corpora. For example, in the biomedical and clinical domains, various domain-specific PLMs have been explored and released, including BioBERT (Lee et al., 2020), SciBERT (Beltagy et al., 2019), BlueBERT (Peng et al., 2019), ClinicalBERT (Huang et al., 2019), BioClinicalBERT (Alsentzer et al., 2019), umlsBERT (Michalopoulos et al., 2020), diseaseBERT (He et al., 2020), SciFive (Phan et al., 2021), and BioBART (Yuan et al., 2022). Domain-specific language models have been extensively explored in a wide range of NLP downstream applications, from entity linking (Bhowmik et al., 2021) to document classification (Allada et al., 2021). A typical and popular use of the aforementioned PLMs is to encode domain text and feed the learned representations into task-specific structures for label prediction. Taking a complicated real-world task as an example, Huang et al. (2019) predict patients' risk of readmission within 30 days after discharge using clinical notes from Electronic Health Records (EHRs). Essentially, they encode patients' discharge summaries with ClinicalBERT and pass the learned embedding of the [CLS] token to a linear layer on top for prediction, leading to better performance than traditional models. Moreover, Lu et al. (2021c) construct a document-level multi-view graph from each clinical note and predict patients' 30-day readmission risk with a graph-based model, using BioClinicalBERT (Alsentzer et al., 2019) as the encoder within the graph model.
Recently, generative language models, e.g., BART (Lewis et al., 2020) and T5 (Raffel et al., 2020), have attracted attention since they are naturally effective for natural language generation tasks, such as document summarization (Chen and Yang, 2021), question answering (Zhu et al., 2021; Sachan et al., 2021), and data augmentation (Lu et al., 2021b). Meanwhile, a novel paradigm of leveraging generative language models has gained popularity, where researchers cast non-generation tasks as generative problems, e.g., by directly generating textual labels to incorporate their semantics, and report promising results (De Cao et al., 2021; De Cao et al., 2022). However, such approaches are still underexplored in certain domains due to the lack of domain-specific generative language models, i.e., most of the aforementioned domain-specific PLMs are domain-adapted BERT-style models. In the biomedical domain, two generative language models, SciFive (Phan et al., 2021) and BioBART (Yuan et al., 2022), have been released; in the clinical domain, however, no such generative models exist to our knowledge. Though the two domains are relatively close, clinical text poses unique challenges compared to general and non-clinical biomedical text due to its specific linguistic characteristics (Alsentzer et al., 2019). Previous studies list some of the linguistic features of clinical text, e.g., heavy use of professional technical terminology, abbreviations and acronyms, passive verbs, and omission of subjects and verbs, and these features make clinical text divergent from standard language (Smith et al., 2014).
Aiming to fill this gap, we adapt T5 (Raffel et al., 2020) to the clinical domain by training a domain-specific variant on clinical text, i.e., ClinicalT5. We demonstrate the capabilities of the model by conducting both intrinsic and extrinsic evaluations. For intrinsic evaluation, we assess its ability to capture the similarity and relatedness of Unified Medical Language System (UMLS) concept pairs, measuring the correlation coefficient between the similarity scores computed from the encoded representations of the concept pairs and the scores judged by human experts. For extrinsic evaluation, we compare the proposed model with baselines on a diverse set of benchmark datasets, ranging from document classification (DC) and named entity recognition (NER) to natural language inference (NLI). Furthermore, we also evaluate on three more complicated real-world tasks of clinical importance, i.e., patients' 30-day readmission risk and 30-day and 1-year mortality risk. We show that ClinicalT5 dramatically outperforms T5 and compares favorably with its close baselines across all of these tasks.

Biomedical Domain-Adapted Models
The biomedical domain has been an active area of research in the NLP community for the past few years. Many relevant studies have been presented, covering domain-specific language models, external knowledge infusion, and various downstream applications (Peng et al., 2019; Beltagy et al., 2019; Lee et al., 2020; He et al., 2020; Michalopoulos et al., 2020; Lu et al., 2021a). Most of the biomedical language models are BERT (Devlin et al., 2019) variants adapted to biomedical text, e.g., BioBERT is trained on PubMed abstracts and PMC full-text articles (Lee et al., 2020), and SciBERT is trained on the full text of biomedical and computer science papers from the Semantic Scholar corpus (Beltagy et al., 2019). In addition, researchers inject external domain knowledge into adapted biomedical language models due to the knowledge-intensive nature of this domain, e.g., umlsBERT is directly trained using UMLS text (Michalopoulos et al., 2020), He et al. (2020) infuse disease information from the corresponding Wikipedia passages into language models, and Lu et al. (2021a) inject biomedical knowledge from multiple sources into language models via adapters. For generative language models, SciFive is an adapted T5 model pre-trained on PubMed abstracts and PMC articles (Phan et al., 2021), and BioBART is an adapted BART model pre-trained on PubMed abstracts (Yuan et al., 2022).

Clinical Domain-Adapted Models
In the clinical domain, there are mainly two popular BERT models, i.e., ClinicalBERT (Huang et al., 2019) and BioClinicalBERT (Alsentzer et al., 2019), which are both trained on the clinical notes in the MIMIC-III database (Johnson et al., 2016). For generative language models, however, the topic is not well explored, and this situation motivates our work.
In particular, we initialize the weights from the SciFive-PubMed-PMC models (base and large) (Phan et al., 2021) and further pre-train with the span-mask denoising objective (Raffel et al., 2020) on the pre-processed MIMIC-III notes. The base and large models have ∼220M parameters with 12 layers and ∼770M parameters with 24 layers, respectively. For each of the two versions, we further pre-train ClinicalT5 on the unlabeled text for an extra 10k steps, with a max sequence length of 512, a batch size of 8, and a learning rate of 1e−4. The pre-training is performed on 3 Nvidia Tesla V100-32GB GPUs. We provide a reproducibility checklist in Appendix A, and we refer the readers to Raffel et al. (2020) for a more detailed treatment of the architecture and training objectives of T5.
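To make the setup concrete, the minimal sketch below illustrates one continued pre-training step with span-mask denoising, assuming the Hugging Face transformers API; the SciFive checkpoint identifier and the deliberately simplified span-corruption routine are illustrative assumptions, not the exact pipeline used for ClinicalT5.

```python
# Minimal sketch of one continued pre-training step with span-mask denoising.
# The checkpoint id and the simplified span-corruption routine are assumptions
# for illustration, not the exact ClinicalT5 pipeline.
import random
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

checkpoint = "razent/SciFive-base-Pubmed_PMC"  # assumed SciFive checkpoint id
tokenizer = T5TokenizerFast.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def span_corrupt(words, corrupt_ratio=0.15, span_len=3):
    """Replace non-overlapping word spans with T5 sentinel tokens; return (source, target)."""
    n_spans = max(1, int(len(words) * corrupt_ratio / span_len))
    starts = sorted(random.sample(range(0, len(words) - span_len, span_len), n_spans))
    source, target, prev = [], [], 0
    for i, s in enumerate(starts):
        sentinel = f"<extra_id_{i}>"
        source += words[prev:s] + [sentinel]
        target += [sentinel] + words[s:s + span_len]
        prev = s + span_len
    source += words[prev:]
    return " ".join(source), " ".join(target)

note = "patient admitted with acute shortness of breath and chest pain ..."  # placeholder MIMIC-III note
src, tgt = span_corrupt(note.split())
inputs = tokenizer(src, truncation=True, max_length=512, return_tensors="pt")
labels = tokenizer(tgt, truncation=True, max_length=512, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # span-mask denoising loss
loss.backward()
optimizer.step()
```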

Intrinsic Evaluation
We conduct intrinsic evaluation on the UMNSRS-Sim and UMNSRS-Rel datasets (Pakhomov et al., 2010), which consist of 566 and 587 UMLS term pairs, respectively. Each pair comes with a similarity score and a relatedness score manually assigned by human experts. Similar to previous work (Zhang et al., 2019), we encode the terms with ClinicalT5 and the baselines. Essentially, we use the mean-pooled vectors of the last hidden states of the encoders as the term embeddings and calculate a cosine similarity score for each pair. Then we compute Pearson's correlation coefficient and Spearman's correlation coefficient between the computed scores and the expert-assigned scores. As shown in Table 1, ClinicalT5 demonstrates a better ability to capture the similarity of UMLS terms than T5 and SciFive, indicating the effectiveness of the training.
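A sketch of this procedure is shown below; the local ClinicalT5 checkpoint path and the handful of term pairs are placeholders standing in for the released model and the full UMNSRS data.

```python
# Sketch of the intrinsic evaluation: mean-pooled encoder states as term embeddings,
# cosine similarity per pair, then correlation with expert scores.
# "clinical-t5-base" is a hypothetical local checkpoint path; the pairs are placeholders.
import torch
from scipy.stats import pearsonr, spearmanr
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("clinical-t5-base")
encoder = T5EncoderModel.from_pretrained("clinical-t5-base")

def embed(term):
    """Mean-pool the last hidden states of the encoder as the term embedding."""
    inputs = tokenizer(term, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0)

# Placeholder rows standing in for the 566 / 587 UMNSRS term pairs.
pairs = [("renal failure", "kidney failure", 1500.0),
         ("renal failure", "headache", 250.0)]

model_scores, expert_scores = [], []
for term1, term2, gold in pairs:
    sim = torch.cosine_similarity(embed(term1), embed(term2), dim=0).item()
    model_scores.append(sim)
    expert_scores.append(gold)

print("Pearson:", pearsonr(model_scores, expert_scores)[0])
print("Spearman:", spearmanr(model_scores, expert_scores)[0])
```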

Extrinsic Evaluation
For extrinsic evaluation, we consider three different tasks, i.e., document classification (DC), named entity recognition (NER), and natural language inference (NLI). To validate the models' capability on clinical text, we select datasets that are closely relevant to clinical targets rather than biomedical- or chemical-related data such as BC5CDR-chemical (Li et al., 2016). We fine-tune the evaluated models on 4 corresponding datasets across these tasks in a single-task text-to-text manner. For all the experiments, we use a batch size of 16 and a learning rate of 1e−4. Due to the different targets, we set the max source text length to 256 and the max target text lengths to 52, 256, 256, and 15 for the HOC, NCBI, BC5CDR, and MEDNLI datasets, respectively.
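A minimal sketch of this single-task text-to-text fine-tuning loop is given below; the checkpoint path and the toy training pair are placeholders, and batching, epochs, and the learning-rate schedule are omitted for brevity.

```python
# Hedged sketch of single-task text-to-text fine-tuning; "clinical-t5-base" and the
# toy (source, target) pair are placeholders, and batching (size 16 in our setup)
# as well as epoch/scheduler handling are omitted.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("clinical-t5-base")
model = T5ForConditionalGeneration.from_pretrained("clinical-t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# (source, target) pairs; e.g., for HOC the target is the class label text.
train_pairs = [("some clinical abstract ...", "evading growth suppressors")]
max_source_len, max_target_len = 256, 52  # per-dataset lengths as described above

for src, tgt in train_pairs:
    inputs = tokenizer(src, truncation=True, max_length=max_source_len, return_tensors="pt")
    labels = tokenizer(tgt, truncation=True, max_length=max_target_len, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```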

Document Classification
We conduct document classification on the HOC dataset (Baker et al., 2016), which consists of 9,972 samples for training and 4,947 samples for testing. Essentially, we fine-tune the evaluated models to categorize the texts into 10 categories by directly generating the class labels, e.g., "empty", "evading growth suppressors", "genomic instability and mutation", etc.
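At inference time, the label text is generated directly, as in the sketch below; the fine-tuned checkpoint name and the input text are hypothetical.

```python
# Illustrative HOC inference step: the fine-tuned model generates the label text directly.
# "clinical-t5-hoc" is a hypothetical fine-tuned checkpoint; the input text is a placeholder.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("clinical-t5-hoc")
model = T5ForConditionalGeneration.from_pretrained("clinical-t5-hoc")

text = "The tumour cells showed sustained proliferative signalling ..."  # placeholder abstract
input_ids = tokenizer(text, truncation=True, max_length=256, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=52)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# expected output is a label string such as "evading growth suppressors" (or "empty")
```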

Named Entity Recognition
We conduct named entity recognition on two popular datasets, i.e., NCBI-disease (Dogan et al., 2014) and BC5CDR-disease (Li et al., 2016). The input text sequence may contain a disease term, and the term should be identified and marked in the target text, e.g., for the input text "Genotype and phenotype in patients with dihydropyrimidine dehydrogenase deficiency", the target is "Genotype and phenotype in patients with disease* dihydropyrimidine dehydrogenase deficiency *disease".
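A small helper that builds this marked target sequence could look as follows; the token-level mention spans are assumed to come from the original gold annotations.

```python
# Helper constructing the NER target sequence with disease* ... *disease markers.
# Gold mention spans (token start, token end exclusive) are assumed to come from
# the original dataset annotations.
def to_ner_target(tokens, spans):
    out = []
    for i, tok in enumerate(tokens):
        if any(i == start for start, _ in spans):
            out.append("disease*")
        out.append(tok)
        if any(i == end - 1 for _, end in spans):
            out.append("*disease")
    return " ".join(out)

tokens = ("Genotype and phenotype in patients with "
          "dihydropyrimidine dehydrogenase deficiency").split()
print(to_ner_target(tokens, [(6, 9)]))
# -> "Genotype and phenotype in patients with disease* dihydropyrimidine
#     dehydrogenase deficiency *disease"
```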

Natural Language Inference
We conduct natural language inference evaluation on the MEDNLI dataset (Romanov and Shivade, 2018), which consists of 11,232 training samples and 1,422 testing samples. Essentially, we convert each premise-hypothesis pair to a sequence and prepend a task-specific prefix to it, e.g., "mednli: premise: [...]. hypothesis: [...]." We take the converted sequence as the input text and fine-tune the evaluated models to generate the target labels, i.e., "contradiction", "neutral", or "entailment".
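A minimal formatting helper for this conversion is sketched below; the example premise and hypothesis are invented placeholders.

```python
# Convert a MEDNLI premise/hypothesis pair into the prefixed text-to-text input.
# The example premise and hypothesis are invented placeholders.
def to_mednli_input(premise, hypothesis):
    return f"mednli: premise: {premise} hypothesis: {hypothesis}"

source = to_mednli_input("The patient denies chest pain.", "The patient has chest pain.")
print(source)
# A fine-tuned model is then expected to generate one of the label strings
# "contradiction", "neutral", or "entailment" for this input.
```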

Results
The results are shown in Table 2. Generally, ClinicalT5 outperforms T5 and SciFive across most of these metrics, and this advantage indicates the success of the training on clinical text. However, ClinicalT5-large is on par with T5-large and has a slightly lower recall than SciFive-large on the HOC dataset. We conjecture that the large versions of BART and T5 already have enough capacity for the task, which makes domain-specific training less impressive, as reflected by the fact that BioBART-large is only marginally better than BART-large.

Real-world Evaluation
We also evaluate the models on more complicated real-world applications of clinical importance, i.e., 30-day unplanned ICU patient readmission risk and 30-day and 1-year patient mortality risk. The experiment is conducted on the MIMIC-III dataset (Johnson et al., 2016). Following previous work (Zhang et al., 2020; Lu et al., 2021c), we extract the discharge summaries from EHRs and generate 48,393 documents. Essentially, we use the evaluated models to encode the last 512 tokens of each note, and the last hidden states are fed into a linear layer on top for prediction. As shown in Table 3, ClinicalT5 shows the best results across almost all the metrics, demonstrating its potential for real-world applications in the clinical domain.
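A hedged sketch of this prediction head is shown below; the checkpoint path, the mean pooling over the final hidden states, and the binary loss are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Sketch of the outcome-prediction setup: the ClinicalT5 encoder embeds the last 512
# tokens of a discharge summary and a linear layer on top yields the risk logit.
# Checkpoint path, pooling choice, and label handling are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

class OutcomePredictor(nn.Module):
    def __init__(self, checkpoint="clinical-t5-base"):  # hypothetical checkpoint path
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(checkpoint)
        self.classifier = nn.Linear(self.encoder.config.d_model, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = hidden.mean(dim=1)        # mean pooling over tokens (assumed)
        return self.classifier(pooled)     # logit for readmission / mortality risk

tokenizer = AutoTokenizer.from_pretrained("clinical-t5-base")
note = "discharge summary text ..."        # placeholder MIMIC-III discharge summary
enc = tokenizer(note, return_tensors="pt")
input_ids = enc.input_ids[:, -512:]        # keep only the last 512 tokens
attention_mask = enc.attention_mask[:, -512:]

model = OutcomePredictor()
risk_logit = model(input_ids, attention_mask)
loss = nn.BCEWithLogitsLoss()(risk_logit.squeeze(-1), torch.tensor([1.0]))  # 1 = positive outcome
```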

Conclusion
In this study, we propose ClinicalT5, a T5-based text-to-text transformer model for clinical text. We evaluate the proposed model both intrinsically and extrinsically, and the results show that ClinicalT5 compares favorably with its close baselines. We also test it on more complicated patient outcome prediction tasks, the results of which indicate its potential for real-world downstream tasks in the clinical domain.

Limitations
In this work we present a generative language model for clinical text based on T5. Although our experiments demonstrate the effectiveness of our method, there are still some limitations that can be addressed in future work. First, our evaluation does not include question answering and other related tasks for clinical text. These are important tasks (Phan et al., 2021) and can be further explored in future work. Second, our pre-training method for ClinicalT5 mainly inherits the objectives from T5 and operates directly on unlabeled text. As such, much important domain-specific knowledge for the clinical domain (e.g., knowledge bases, concept definitions) has not been explored to improve our generative model, which serves as a promising direction for future research.

Ethics Statement
All datasets used in this research are publicly available and are obtained according to each dataset's respective data usage policy. We avoid showing any direct excerpts of the data in the paper. We do not attempt to identify or deanonymize users in the data in any way during our research, thus preventing any bias in our methods toward any specific users.
More specifically, the proposed models are trained on the clinical notes of the public MIMIC-III database, which are already deidentified in accordance with Health Insurance Portability and Accountability Act (HIPAA) standards using structured data cleansing and date shifting. As such, all identifying data elements in HIPAA, including patient name, telephone number, address, and dates, are already removed (Johnson et al., 2016) from our training data to hinder attempts to retrieve personal information from our models. Similar to existing pre-trained and publicly available models for the clinical domain, i.e., ClinicalBERT (Huang et al., 2019) and BioClinicalBERT (Alsentzer et al., 2019), the proposed models serve as a resource to facilitate future research on clinical text. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

Acknowledgement
This research has been supported by the Army Research Office (ARO) grant W911NF-21-1-0112 and the NSF grant CNS-1747798 to the IUCRC Center for Big Learning. This research is also based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No. 2019-19051600006 under the Better Extraction from Text Towards Enhanced Retrieval (BETTER) Program. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ARO, ODNI, IARPA, the Department of Defense, or the U.S. Government.

Table 2 :
Performance comparison over document classification, named entity recognition, and medical natural language inference.

Table 3 :
Performance on patients' outcomes prediction.