Clinical NLP research faces a scarcity of publicly available datasets due to privacy concerns. MIMIC-III marked a significant milestone, enabling substantial progress, and now, with MIMIC-IV, the dataset has expanded significantly, offering a broader scope. In this paper, we focus on the task of predicting clinical outcomes from clinical text. This is crucial in modern healthcare, aiding in preventive care, differential diagnosis, and capacity planning. We introduce a novel clinical outcome prediction dataset derived from MIMIC-IV. Furthermore, we provide initial insights into the performance of models trained on MIMIC-III when applied to our new dataset, with specific attention to potential data drift. We investigate challenges tied to evolving documentation standards and changing codes in the International Classification of Diseases (ICD) taxonomy, such as the transition from ICD-9 to ICD-10. We also explore variations in clinical text across different hospital wards. Our study aims to probe the robustness and generalization of clinical outcome prediction models, contributing to the ongoing advancement of clinical NLP in healthcare.
Clinical phenotyping enables the automatic extraction of clinical conditions from patient records, which can be beneficial to doctors and clinics worldwide. However, current state-of-the-art models are mostly applicable to clinical notes written in English. We therefore investigate cross-lingual knowledge transfer strategies to execute this task for clinics that do not use the English language and have a small amount of in-domain data available. Our results reveal two strategies that outperform the state-of-the-art: Translation-based methods in combination with domain-specific encoders and cross-lingual encoders plus adapters. We find that these strategies perform especially well for classifying rare phenotypes and we advise on which method to prefer in which situation. Our results show that using multilingual data overall improves clinical phenotyping models and can compensate for data sparseness.
Diagnosis prediction on admission notes is a core clinical task. However, these notes may incompletely describe the patient. Also, clinical language models may suffer from idiosyncratic language or imbalanced vocabulary for describing diseases or symptoms. We tackle the task of diagnosis prediction, which consists of predicting future patient diagnoses from clinical texts at the time of admission. We improve the performance on this task by introducing an additional signal from support sets of diagnostic codes from prior admissions or as they emerge during differential diagnosis. To enhance the robustness of diagnosis prediction methods, we propose to augment clinical text with potentially complementary set data from diagnosis codes from previous patient visits or from codes that emerge from the current admission as they become available through diagnostics. We discuss novel attention network architectures and augmentation strategies to solve this problem. Our experiments reveal that support sets improve the performance drastically to predict less common diagnosis codes. Our approach clearly outperforms the previous state-of-the-art PubMedBERT baseline by up 3% points. Furthermore, we find that support sets drastically improve the performance for pregnancy- and gynecology-related diagnoses up to 32.9% points compared to the baseline.
Universal embeddings, such as BERT or ELMo, are useful for a broad set of natural language processing tasks like text classification or sentiment analysis. Moreover, specialized embeddings also exist for tasks like topic modeling or named entity disambiguation. We study if we can complement these universal embeddings with specialized embeddings. We conduct an in-depth evaluation of nine well known natural language understanding tasks with SentEval. Also, we extend SentEval with two additional tasks to the medical domain. We present PubMedSection, a novel topic classification dataset focussed on the biomedical domain. Our comprehensive analysis covers 11 tasks and combinations of six embeddings. We report that combined embeddings outperform state of the art universal embeddings without any embedding fine-tuning. We observe that adding topic model based embeddings helps for most tasks and that differing pre-training tasks encode complementary features. Moreover, we present new state of the art results on the MPQA and SUBJ tasks in SentEval.