PULSAR at ProbSum 2023: Pre-training with Extracted Healthcare Terms for Summarising Patients' Problems and Data Augmentation with Black-box Large Language Models

Medical progress notes play a crucial role in documenting a patient's hospital journey, including their condition, treatment plan, and any updates for healthcare providers. Automatic summarisation of a patient's problems in the form of a "problem list" can aid stakeholders in understanding the patient's condition, reducing workload and cognitive bias. BioNLP 2023 Shared Task 1A focuses on generating a list of diagnoses and problems from the provider's progress notes during hospitalisation. In this paper, we introduce our proposed approach to this task, which integrates two complementary components. One component employs large language models (LLMs) for data augmentation; the other is an abstractive summarisation LLM with a novel pre-training objective for generating the patients' problems summarised as a list. Our approach was ranked second among all submissions to the shared task. The performance of our model on the development and test datasets shows that our approach is more robust on unseen data, with an improvement of up to 3.1 points over a model of the same size.


Introduction
Medical progress notes are used to document a patient's course in a hospital, including their current condition, treatment plan, and any updates to the plan. Automated identification of treated problems from the assessment sections of a progress note, in the form of a "problem list", can help healthcare stakeholders gain an accurate understanding of the patient's condition, reducing workload and cognitive bias (Gao et al., 2022a). This problem list is then used to outline and pursue a detailed treatment plan.
The majority of studies on clinical summarisation have focused on clinical notes: radiology reports (Zhang et al., 2018; MacAvaney et al., 2019; Gharebagh et al., 2020; Kondadadi et al., 2021; Dai et al., 2021) and progress notes (Moen et al., 2016; Liang et al., 2019; Adams et al., 2021; Gao et al., 2022a). In contrast, some studies have focused on dialogues (Yim and Yetisgen-Yildiz, 2021; Manas et al., 2021; Zhang et al., 2021). Recently, Gao et al. (2022b) proposed the task of "progress note understanding", where the goal is to generate problem lists given the assessment sections of a progress note. They further explored the performance of T5 (Raffel et al., 2020) and BART (Lewis et al., 2020) based on pre-training tasks with masked healthcare concepts (Gao et al., 2022a). To draw further attention to this task, the BioNLP 2023 Shared Task 1A (Gao et al., 2023) invited external participants to develop approaches to advance the state-of-the-art on the proposed task.
The main contribution of this work is a novel framework for data augmentation and summarisation of diagnoses/problems. In our approach, first, we optimise a domain-specific Language Model (LM) using a combination of different pre-training objectives, depicted in Figure 1; this model significantly outperforms the state-of-the-art, even when optimised on a limited number of manually annotated progress notes. Second, we instruct Large Language Models (LLMs) to generate synthetic data, in order to reduce the reliance on large, high-quality annotated datasets. Finally, we use the generated data to fine-tune the domain-specific LM on the task of problem list generation, given appropriate progress note sections. Our approach ranked second among all submissions to the shared task without additional annotated data. The results of our evaluation suggest that our pre-training objectives are aligned with the downstream task of summarisation and can significantly improve performance.

Figure 1 shows the two components of our framework: first, we pre-train an encoder-decoder model on MIMIC-III progress notes (Johnson et al., 2016) using three different concept masking pre-training objectives. Then we employ data augmentation when fine-tuning our model for the summarisation task.

Pre-training Model
The items on the problem list are not necessarily directly extracted from the original progress notes, and hence we cast the problem as abstractive summarisation. Drawing inspiration from PEGASUS (Zhang et al., 2020a), we used a pre-training objective that closely resembles abstractive summarisation, to obtain better and faster fine-tuning performance.
Following the success obtained through masking words and contiguous spans (Joshi et al., 2020; Raffel et al., 2020), we propose to select and mask text spans or whole sentences from input documents. We concatenate these "gap text spans (sentences)" into a pseudo-summary. Gap text spans were selected by the QuickUMLS entity linker (Soldaini and Goharian, 2016) and an NER model trained on the i2b2-2010 challenge (Uzuner et al., 2011). Similar to the T5 pre-training procedure (Raffel et al., 2020), these text spans were replaced by "sentinel" mask tokens <extra_id_i> to inform the model that the input was masked. Here, i indicates the number of the mask (from left to right). The output sequence thus consists of the dropped-out text spans, delimited by sentinel tokens between terms, with a final <extra_id_i> marking the end of the output. Figure 2 illustrates our pre-training objective.

Figure 2: Our pre-training objective. Input: "Infant remains on prong CPAP of 5. Occaisional brief O2 sat drifts noted." Target: "<extra_id_0> CPAP <extra_id_1> sat drifts <extra_id_2>". The terms "CPAP" and "sat drifts" are identified by the NER models and each replaced by a unique sentinel token; the objective is to predict these masked-out spans.
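This T5-style sentinel masking can be illustrated with a minimal sketch; the span indices below are hand-picked stand-ins for what the entity linkers would return:

```python
def mask_spans(tokens, spans):
    """Replace each (start, end) span with a unique sentinel token and
    build the target sequence of dropped-out spans, T5-style."""
    source, target = [], []
    prev = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        source.extend(tokens[prev:start])
        source.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[start:end])
        prev = end
    source.extend(tokens[prev:])
    # A final sentinel marks the end of the output sequence
    target.append(f"<extra_id_{len(spans)}>")
    return " ".join(source), " ".join(target)

tokens = "Infant remains on prong CPAP of 5 . Occaisional brief O2 sat drifts noted .".split()
# Spans covering "CPAP" and "sat drifts", as flagged by the entity extractors
src, tgt = mask_spans(tokens, [(4, 5), (11, 13)])
```

Here `tgt` reproduces the target sequence shown in Figure 2, while `src` keeps the surrounding context with the masked spans replaced by sentinels.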
Specifically, we considered three masking policies in our pre-training objective, applied per sentence. When both tools identified entities, we selected UMLS terms with probability 0.7 and i2b2 terms with probability 0.3. When only one tool identified entities, those entities were selected. Finally, when no entities were identified, the entire sentence was masked with probability 0.15. These policies are intended to provide the model with the necessary medical knowledge and reduce domain barriers (Pandey et al., 2022).
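The three per-sentence masking policies can be sketched as follows (the probabilities are those stated above; the entity lists stand in for the outputs of QuickUMLS and the i2b2 NER model):

```python
import random

P_UMLS, P_I2B2, P_SENT = 0.7, 0.3, 0.15

def choose_mask(umls_entities, i2b2_entities, rng=random):
    """Decide what to mask in a sentence under the three policies."""
    if umls_entities and i2b2_entities:
        # Both tools found entities: UMLS terms w.p. 0.7, i2b2 terms w.p. 0.3
        return umls_entities if rng.random() < P_UMLS else i2b2_entities
    if umls_entities or i2b2_entities:
        # Only one tool found entities: mask those
        return umls_entities or i2b2_entities
    # No entities found: mask the whole sentence with probability 0.15
    return ["<WHOLE_SENTENCE>"] if rng.random() < P_SENT else []
```

The returned entity spans would then be fed to the sentinel-masking step to produce source/target pairs.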

Data Augmentation (DA)
The lack of high-quality annotated data is a bottleneck that inhibits supervised learning methods in the healthcare field. For example, BioNLP Task 1A (Gao et al., 2023) has only 764 annotated training examples. Therefore, we rely on data augmentation techniques to obtain more training samples. Specifically, we propose a novel healthcare data generation (DG) framework based on DINO (Li et al., 2023), which exploits the generative abilities of LLMs by relying on instruction following rather than model training. Our instructions to the LLMs include task-specific descriptions (e.g., "Write two sentences that mean the same thing but keep these two healthcare terms [...]"). We adapt the decoding algorithm to achieve this objective: when predicting the next token, not only the probability of the corresponding label is considered, but also that of the counter-label. We then use BERTScore (Zhang et al., 2020b) and BLEURT (Sellam et al., 2020) to assess the similarity between each generated sample and the source, removing the 85% lowest-scoring generated sentences. The backbone of the framework can be any generative LLM, such as GPT-3.5,§ GPT-3 (Brown et al., 2020), or the GPT-2 series of models (Radford et al., 2019). Limited by the data use agreement, we used BioMedLM (Bolton et al., 2022), an open-source GPT-style model pre-trained on biomedical abstracts and papers.¶

Data Augmentation: We employ BioMedLM (Bolton et al., 2022) as the data augmentation model with default settings, setting the maximum output length to 40. Finally, the generated data are matched with the corresponding summaries and the subjective and objective sections to create a training set of 1k instances. The DA model is run on a single NVIDIA Tesla V100 32G GPU, with each run taking up to twelve hours. Example templates and the full dataset description can be seen in Appendix A.

§ chat.openai.com
¶ The official test set result for PULSAR-11B was fine-tuned after 0.33 of a pre-training epoch.
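The filtering step — scoring each generated sentence against its source and discarding the lowest-scoring 85% — can be sketched as follows; the token-overlap scorer here is only a toy stand-in for BERTScore/BLEURT:

```python
def filter_generated(pairs, score_fn, keep_ratio=0.15):
    """Keep only the top `keep_ratio` fraction of (source, generated)
    pairs, ranked by similarity score (highest first)."""
    scored = sorted(pairs, key=lambda p: score_fn(*p), reverse=True)
    n_keep = max(1, int(len(scored) * keep_ratio))
    return scored[:n_keep]

def overlap(src, gen):
    """Toy similarity: Jaccard overlap of token sets (BERTScore stand-in)."""
    a, b = set(src.split()), set(gen.split())
    return len(a & b) / max(len(a | b), 1)
```

In the actual pipeline, `score_fn` would call BERTScore or BLEURT on each (source, generated) pair instead of this lexical heuristic.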

Experimental Setup
Baselines: We adapted T5-base as one of our baselines, similar to the approach taken by Gao et al. (2022a). Additionally, we incorporated various state-of-the-art models such as FlanT5 (Chung et al., 2022), ClinicalT5 (Lehman and Johnson, 2023) and PEGASUS (Zhang et al., 2020a). FlanT5 is an enhanced version of T5 that has been fine-tuned on a mixture of tasks (Chung et al., 2022).

Evaluation metrics: We calculate ROUGE (Lin, 2004) scores on the test set by comparing the generated summaries to their corresponding references, averaging over all generated samples. For all experiments, the dataset was divided into a "train" and a "dev" set with a ratio of 8:2 for training and evaluation, respectively. The results are presented in Table 1, left column, and Table 2. Table 1, right column, shows the performance of the models on the official withheld test set; in this case, both the train and dev splits were used for training.
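For intuition, a minimal unigram-overlap (ROUGE-1 F1) computation is sketched below; the official evaluation uses the standard ROUGE implementation, so this is illustrative only:

```python
from collections import Counter

def rouge1_f(reference, candidate):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Scores for each generated problem list are computed against its reference and then averaged over the evaluation set.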

Results and Analysis
Pre-training helps: Both Table 1 and Table 2 demonstrate that the pre-training objective improves task performance (compare the 3B and 11B PULSAR models to the corresponding FlanT5 models). PULSAR's best performance was 3.1 points higher than FlanT5-11B on the development set and 11.2 points higher than ClinicalT5-large on the official test set. The small difference in performance between PULSAR-11B and PULSAR-3B is primarily because the former completed only 1/3 of the first pre-training epoch, potentially resulting in a lack of relevant medical knowledge and familiarity with downstream task patterns.
Data augmentation is effective when the data distribution is consistent, and is significantly more helpful for smaller models when the distribution shifts: Table 1 shows that data augmentation improves performance (3 points on average, compared to not using DA). This shows that the proposed DA approach can effectively alleviate the lack of annotated healthcare data when the distributions of the training and test sets are consistent. From Table 1, it becomes evident that smaller models (ClinicalT5-large) can improve by up to 6 points with the help of data augmentation, but the effect diminishes as model size increases, to at most 2.5 points for the largest models. A potential reason is that the test set for the shared task differs significantly from the training set, particularly in the length of the summaries.
The model is capable of discriminating irrelevant information, but longer input lengths may result in decreased performance: We conducted ablation experiments on PULSAR-3B to verify the impact of the input text type. In contrast to Gao et al. (2022b)'s findings on smaller models, the results (PULSAR-3B vs. PULSAR-3B-A) in Table 1 show that if the input is ASSESSMENT + SUBJECTIVE + OBJECTIVE, the model performs better (by 2.9 points on the official test set and by 7 points on the development set) compared with using only ASSESSMENT as input. This indicates that while most of the relevant information can be inferred from the ASSESSMENT section alone, additional input can be beneficial. However, increasing the input length does not appear to help: Table 2 shows that models trained with longer inputs (1024 tokens) do not improve over models trained on 512-token inputs.
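The input-type ablation above amounts to concatenating section texts before truncating to the model's input budget; a minimal sketch (the section-label separator and whitespace tokenisation are simplifying assumptions, not the exact preprocessing):

```python
def build_input(assessment, subjective=None, objective=None, max_tokens=512):
    """Concatenate progress-note sections, then truncate to the input
    budget (512 tokens here; whitespace tokenisation for brevity)."""
    parts = [("ASSESSMENT", assessment),
             ("SUBJECTIVE", subjective),
             ("OBJECTIVE", objective)]
    text = " ".join(f"{name}: {body}" for name, body in parts if body)
    return " ".join(text.split()[:max_tokens])
```

The ASSESSMENT-only ablation (PULSAR-3B-A) corresponds to calling this with the optional sections left unset.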

Conclusion
This paper contributed to the task of summarising patients' problems. Firstly, we proposed a novel task-specific pre-training objective for LLMs; with it, we ranked 2nd at the official shared task without using additional manually annotated training samples. Secondly, we proposed a new data augmentation framework and demonstrated its effectiveness in the healthcare domain. In the future, we will explore the applicability of our approach to other domain-specific generative tasks and conduct a deeper analysis of the factors that contribute to overall model performance.

Limitations
The proposed model is computationally demanding. Recent work on parameter-efficient fine-tuning methods, such as LoRA (Hu et al., 2022), suggests that they can significantly reduce the number of trainable parameters at a minimal performance cost, which may help further democratise the development of domain- and task-specific models. In addition, since we obtained the PULSAR models by continued pre-training, their tokenizer was inherited from the corresponding FlanT5 model. It therefore does not contain domain-specific terminology, which may be a limitation in terms of representation density (i.e., frequent clinical terms may be split into multiple rare sub-tokens).

Ethics Statement
For the present work, we used an existing anonymised dataset from BioNLP 2023 Shared Task 1A, which raises no data protection issues. In addition, data augmentation only uses an open-source, offline model, so no data is shared with a third party in violation of the data use agreement.