WangLab at MEDIQA-Chat 2023: Clinical Note Generation from Doctor-Patient Conversations using Large Language Models

This paper describes our submission to the MEDIQA-Chat 2023 shared task for automatic clinical note generation from doctor-patient conversations. We report results for two approaches: the first fine-tunes a pre-trained language model (PLM) on the shared task data, and the second uses few-shot in-context learning (ICL) with a large language model (LLM). Both achieve high performance as measured by automatic metrics (e.g. ROUGE, BERTScore) and ranked second and first, respectively, of all submissions to the shared task. Expert human scrutiny indicates that notes generated via the ICL-based approach with GPT-4 are preferred about as often as human-written notes, making it a promising path toward automated note generation from doctor-patient conversations.


Introduction
The growing burden of clinical documentation has emerged as a critical issue in healthcare, increasing job dissatisfaction and burnout rates among clinicians and negatively impacting patient experiences (Friedberg et al., 2013; Babbott et al., 2014; Arndt et al., 2017). On the other hand, timely and accurate documentation of patient encounters is critical for safe, effective care and communication between specialists. Therefore, interest in assisting clinicians by automatically generating consultation notes is mounting (Finley et al., 2018; Enarvi et al., 2020; Molenaar et al., 2020; Knoll et al., 2022).
To further encourage research on automatic clinical note generation from doctor-patient conversations, the MEDIQA-Chat Dialogue2Note shared task was proposed (Ben Abacha et al., 2023). Here, we describe our submission to subtask B: the generation of full clinical notes from doctor-patient dialogues. We explored two approaches; the first fine-tunes a pre-trained language model (PLM, §3.1), while the second uses few-shot in-context learning (ICL, §3.2). Both achieve high performance as measured by automatic natural language generation metrics (§4) and ranked second and first, respectively, of all submissions to the shared task. In a human evaluation with three expert physicians, notes generated via the ICL-based approach with GPT-4 were preferred about as often as human-written notes (§4.3).

Figure 1: (A) Fine-tuning a pre-trained language model (PLM), Longformer-Encoder-Decoder (LED; Beltagy et al. 2020). (B) In-context learning (ICL) with large language models (LLMs). We rank train examples based on their similarity to the test dialogue using Instructor (Su et al., 2022a). Notes of the top-k most similar examples are then used as in-context examples to form a prompt alongside natural language instructions and fed to GPT-4 (OpenAI, 2023) to generate the clinical note.

Example Doctor-Patient Conversation
[doctor] hi, ms. thompson. i'm dr. moore. how are you?
[patient] i'm doing okay except for my knee.
[doctor] all right, hey, dragon, ms. thompson is a 43 year old female here for right knee pain. so tell me what happened with your knee?
[patient] well, i was, um, trying to change a light bulb, and i was up on a ladder and i kinda had a little bit of a stumble and kinda twisted my knee as i was trying to catch my fall.
[doctor] okay. and did you injure yourself any place else?
[patient] no, no. it just seems to be the knee.
[doctor] all right. and when did this happen?
[patient] it was yesterday.
[doctor] all right. and, uh, where does it hurt mostly?
[patient] it hurts like in, in, in the inside of my knee.
[doctor] all right. and anything make it better or worse?
[patient] i have been putting ice on it, uh, and i've been taking ibuprofen, but it doesn't seem to help much.
[doctor] okay. so it sounds like you fell a couple days ago, and you've hurt something inside of your right knee.
[doctor] and you've been taking a little bit of ice, uh, putting some ice on it, and hasn't really helped and some ibuprofen. is that right?

--------- TRUNCATED ---------

[doctor] so in summary after my exam, uh, looking at your knee, uh, on the x-ray and your exam, you have some tenderness over the medial meniscus, so i think you have probably an acute medial meniscus sprain right now or strain. uh, at this point, my recommendation would be to put you in a knee brace, uh, and we'll go ahead and have you use some crutches temporarily for the next couple days. we'll have you come back in about a week and see how you're doing, and if it's not better, we'll get an mri at that time.
[doctor] i'm going to recommend we give you some motrin, 800 milligrams. uh, you can take it about every six hours, uh, with food. uh, and we'll give you about a two week supply.
[doctor] okay. uh, do you have any questions?
[patient] no, i think i'm good.
[doctor] all right. hey, dragon, order the medications and procedures discussed, and finalize the report. okay, come with me and we'll get you checked out.

RESULTS
X-rays of the right knee show no obvious signs of acute fracture or dislocation. Mild effusion is noted.

The shared task comprises two tasks:

1. Dialogue2Note Summarization: Given a doctor-patient conversation, the task is to produce a clinical note summarizing the conversation with one or more note sections (e.g. Assessment, Past Medical History).
2. Note2Dialogue Generation: Given a clinical note, the task is to generate a synthetic doctor-patient conversation related to the information described in the note.
We focused on Dialogue2Note, which is divided into two subtasks. In subtask 'A' (Ben Abacha et al., 2023), the goal is to generate specific sections of a note given partial doctor-patient dialogues. In subtask 'B' (Yim et al., 2023), the goal is full note generation from complete dialogues. The remainder of the paper focuses on subtask B; see Appendix A for our approach to subtask A, which also ranks first of all submissions to the shared task.

Task definition
The data consists of k examples, each comprising a doctor-patient dialogue and a corresponding clinical note: dialogues D = d_1, . . ., d_k and notes N = n_1, . . ., n_k. The aim is to automatically generate a note n_i given a dialogue d_i.
Each note comprises one or more sections, such as "Chief Complaint" and "Family History". During evaluation, sections are grouped under one of four categories: "Subjective", "Objective Exam", "Objective Results", and "Assessment and Plan". See Figure 2 for an example doctor-patient conversation and clinical note pair.
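The grouping of section headers into the four evaluation categories can be sketched as a simple lookup. The mapping entries below are illustrative assumptions (the official header-to-category mapping is defined by the shared task organizers):

```python
# Sketch: grouping note section headers into the four evaluation
# categories. The entries below are illustrative; the official mapping
# is provided by the shared task.
SECTION_TO_CATEGORY = {
    "CHIEF COMPLAINT": "Subjective",
    "HISTORY OF PRESENT ILLNESS": "Subjective",
    "FAMILY HISTORY": "Subjective",
    "PHYSICAL EXAM": "Objective Exam",
    "RESULTS": "Objective Results",
    "ASSESSMENT AND PLAN": "Assessment and Plan",
}

def categorize(section_header: str) -> str:
    """Map a note section header to its evaluation category."""
    return SECTION_TO_CATEGORY.get(section_header.strip().upper(), "UNKNOWN")
```

Headers are normalized to upper case before lookup, since clinical notes vary in header casing.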

Dataset
The dataset comprises 67 train and 20 validation examples, featuring transcribed dialogues from doctor-patient encounters and the resulting clinician-written notes. Each example is labelled with the 'dataset source', indicating the dialogue transcription system used to produce the note.
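Since the 'dataset source' label is later used to restrict in-context example selection (§3.2), it is convenient to group examples by source up front. A minimal sketch, with illustrative field names and source labels (the actual schema may differ):

```python
from collections import defaultdict

# Sketch: group shared-task examples by their 'dataset source' label.
# Field names and source labels here are illustrative assumptions.
def group_by_source(examples):
    groups = defaultdict(list)
    for ex in examples:
        groups[ex["dataset_source"]].append(ex)
    return groups

examples = [
    {"id": 1, "dataset_source": "source_1", "dialogue": "...", "note": "..."},
    {"id": 2, "dataset_source": "source_2", "dialogue": "...", "note": "..."},
    {"id": 3, "dataset_source": "source_1", "dialogue": "...", "note": "..."},
]
groups = group_by_source(examples)
```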

Approach
We take two high-performing approaches to the shared task. In the first, we fine-tune a pre-trained language model (PLM) on the provided training set (§3.1). In the second, we use in-context learning (ICL) with a large language model (LLM, §3.2).

Fine-tuning pre-trained language models
As a first approach, we fine-tune a PLM on the training set following a canonical, sequence-to-sequence training process (Figure 1, A; see Appendix C for details). Given the length of input dialogues (Figure 3), we elected to use Longformer-Encoder-Decoder (LED; Beltagy et al. 2020), which has a maximum input size of 16,384 tokens. We begin fine-tuning from a LED LARGE checkpoint tuned on the PubMed summarization dataset (Cohan et al., 2018), which performed best in preliminary experiments. The model was fine-tuned using HuggingFace Transformers (Wolf et al., 2020) on a single NVIDIA A100-40GB GPU. Hyperparameters were lightly tuned on the validation set (see Appendix B.1).

Natural language instructions (from the prompt template in Figure 4): "Write a clinical note reflecting this doctor-patient dialogue. Use the example notes below to decide the structure of the clinical note. Do not make up information."

In-context learning with LLMs
As a second approach, we attempt subtask B with ICL. We chose GPT-4 (OpenAI, 2023) as the LLM and designed a simple prompt, which included natural language instructions and in-context examples (Figure 4). We limited the prompt size to 6,192 tokens (allowing for 2,000 output tokens, as the model's maximum context size is 8,192 tokens) and used as many in-context examples as would fit within this token limit, up to a maximum of 3. We set the temperature parameter to 0.2 and left all other hyperparameters of the OpenAI API at their defaults.
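The token-budgeting step above can be sketched as a greedy loop: add example notes (up to 3) while the assembled prompt stays under the 6,192-token input budget. In this sketch, a whitespace tokenizer stands in for the real GPT-4 tokenizer (tiktoken), and the prompt layout is an illustrative simplification of our template:

```python
# Sketch: greedily pack in-context examples into the prompt while
# staying under the input-token budget. A whitespace split stands in
# for the real GPT-4 tokenizer (tiktoken); the layout is illustrative.
MAX_PROMPT_TOKENS = 8192 - 2000  # reserve 2,000 tokens for the output
MAX_EXAMPLES = 3

def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for tiktoken

def build_prompt(instructions: str, candidate_notes: list, dialogue: str) -> str:
    included = []
    for note in candidate_notes[:MAX_EXAMPLES]:
        candidate = instructions + "\n\n" + "\n\n".join(included + [note]) + "\n\n" + dialogue
        if count_tokens(candidate) > MAX_PROMPT_TOKENS:
            break  # adding this example would exceed the budget
        included.append(note)
    return instructions + "\n\n" + "\n\n".join(included) + "\n\n" + dialogue
```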
Natural language instructions During preliminary experiments, we found that GPT-4 was not overly sensitive to the exact phrasing of the natural language instructions in the prompt. We therefore elected to use short, simple instructions (Figure 4).

In-context examples Each in-context example is a note from the train set, selected based on the similarity of its dialogue to the input dialogue (see Figure 1, B). Dialogues were embedded using Instructor (Su et al., 2022a), a text encoder that supports natural language instructions. Lastly, we restricted in-context examples to be of the same 'dataset source' (see §2.2) as the input dialogue, hypothesizing that this may improve performance.
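The selection step reduces to ranking train dialogues by cosine similarity of their embeddings to the input dialogue's embedding and taking the notes of the top-k. A minimal sketch over plain vectors (in our setup the embeddings come from Instructor; here they are given):

```python
import math

# Sketch: select in-context examples by ranking train dialogues on
# cosine similarity to the input dialogue. Embeddings are assumed to
# be precomputed (e.g. by Instructor).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_notes(input_emb, train_examples, k=3):
    """train_examples: list of (dialogue_embedding, note) pairs."""
    ranked = sorted(train_examples, key=lambda ex: cosine(input_emb, ex[0]), reverse=True)
    return [note for _, note in ranked[:k]]
```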

Evaluation
Models are evaluated with the official evaluation script on the validation set (as test notes are not provided). Generated notes are evaluated against the provided ground truth notes with ROUGE (Lin, 2004), BERTScore (Zhang et al., 2020) and BLEURT (Sellam et al., 2020). We report performance as the arithmetic mean of ROUGE-1 F1, BERTScore F1 and BLEURT-20 (Pu et al., 2021).
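The aggregate score is simply the unweighted mean of the three metrics, each assumed pre-computed by its respective library:

```python
# Sketch: the aggregate metric reported throughout the paper is the
# arithmetic mean of ROUGE-1 F1, BERTScore F1 and BLEURT-20, each
# assumed to be pre-computed on the same scale.
def average_score(rouge1_f1: float, bertscore_f1: float, bleurt20: float) -> float:
    return (rouge1_f1 + bertscore_f1 + bleurt20) / 3
```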

Fine-tuning pre-trained language models
We present the results of fine-tuning LED in Table 1. Due to the non-determinism of the LED implementation, we report the mean results of three training runs. Unsurprisingly, we find that scaling the model size from LED BASE (12 layers, ∼162M parameters) to LED LARGE (24 layers, ∼460M parameters) leads to sizable gains in performance.
Performance further improves by initializing the model with a checkpoint fine-tuned on the PubMed summarization dataset (LED LARGE-PubMed). This is likely because (1) Dialogue2Note resembles a summarization task, and (2) text from PubMed is more similar to clinical text than is the general-domain text used to pre-train LED. Our submission to the shared task using this approach ranked second overall, outperforming the next-best submission by 2.7 average score (Table 1).

In-context learning with LLMs
We present the results of ICL with GPT-4 in Table 2, and note several interesting trends, discussed alongside Table 2 in order of magnitude of impact. The best strategy achieves first place of all submissions to the shared task, out-performing the runner-up by more than 9 average score. We conclude that (1) few-shot ICL with GPT-4, using as little as one example, is a performant approach for note generation from doctor-patient conversations, and (2) using the notes of semantically similar dialogue-note pairs is a strong strategy for selecting the in-context examples.

Human evaluation
Automatic evaluation metrics like ROUGE, BERTScore and BLEURT are imperfect and may not correlate with aspects of human judgment. Therefore, we conducted an expert human evaluation to validate our results. To make annotation feasible, we conducted it on the validation set (20 examples) using the best-performing fine-tuned model, LED LARGE-PubMed (Table 1), and the best-performing ICL-based approach: 3-shot, similar, note-only examples filtered by dataset type (Table 2). Three senior resident physicians (a subset of the authors, who did not interact with the model or model outputs before annotation) were shown a ground truth note, a note generated by the fine-tuned model, and a note generated by the ICL-based approach for each example (presented in random order as clinical note 'A', 'B' and 'C') and asked to select which note(s) they preferred, given a dialogue and some simple instructions:

Instructions: Please assess the clinical notes A, B and C relative to the provided doctor-patient dialogue. For each set of notes, you should select which note you prefer ('A', 'B', or 'C'). If you have approximately equal preference for two notes, select ('A/B', 'B/C', or 'C/A'). If you have no preference, select 'A/B/C'. A 'good' note should contain all critical, most non-critical and very little irrelevant information mentioned in a dialogue:

• Critical: Items medico-legally required to document the diagnosis and treatment decisions, whose absence or incorrectness may lead to wrong diagnosis and treatment later on, e.g. the symptom "cough" in a suspected chest infection consultation. This is the key information a note needs to capture correctly in order to not mislead clinicians.
• Non-critical: Items that should be documented in a complete note but whose absence will not affect future treatment or diagnosis, e.g. "who the patient lives with" in a consultation about chest infection.
• Irrelevant: Medically irrelevant information covered in the consultation, e.g. the pet of a patient with a suspected chest infection just died.
The definitions of critical, non-critical and irrelevant information are taken from previous work on human evaluation of generated clinical notes (Moramarco et al., 2022; Savkov et al., 2022).
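A win rate of the kind reported in Table 3 (the percentage of cases in which a note was the single preferred choice, excluding ties) can be sketched as follows; this tallying is an illustrative reconstruction, not the official analysis script:

```python
from collections import Counter

# Sketch: compute per-note win rates from annotator selections,
# excluding ties (multi-letter choices such as "A/B" or "A/B/C").
# Illustrative reconstruction of the Table 3 tallying.
def win_rates(selections):
    """selections: list of annotator choices, e.g. ["A", "A/B", "C"]."""
    outright = [s for s in selections if "/" not in s]
    if not outright:
        return {}
    counts = Counter(outright)
    return {note: 100 * counts[note] / len(outright) for note in counts}
```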
In short, notes generated by ICL are strongly preferred over notes generated by the fine-tuned model and, on average, slightly preferred over the human-written notes (Table 3), validating the high performance reported by the automatic metrics. We note, however, that inter-annotator agreement is low and speculate why this might be in §6.
Related Work

Others have focused on curating data for training and benchmarking (Papadopoulos Korfiatis et al., 2022), including the use of LLMs to produce synthetic data (Chintagunta et al., 2021). Lastly, there have been efforts to improve the evaluation of generated clinical notes, both with automatic metrics (Moramarco et al., 2022) and human evaluation (Savkov et al., 2022). While recent literature has commented on the potential of ICL for note generation (Lee et al., 2023), our work is among the first to evaluate this approach rigorously.

Conclusion
We present our submission to the MEDIQA-Chat shared task for clinical note generation from doctor-patient dialogues. We evaluated a fine-tuning-based approach with LED and an ICL-based approach with GPT-4, ranking second and first, respectively, among all submissions. Human evaluation with three physicians revealed that notes produced by GPT-4 via ICL were strongly preferred over notes produced by LED and, on average, slightly preferred over human-written notes. We conclude that ICL is a promising path toward clinical note generation from doctor-patient conversations.

Limitations
Evaluation of generated text is difficult Evaluating automatically generated text, including clinical notes, is generally hard due to the inherently subjective nature of many aspects of output quality. Automatic evaluation metrics such as ROUGE and BERTScore are imperfect (Deutsch et al., 2022) and may not correlate with aspects of expert judgment. However, they are frequently used to evaluate model-generated clinical notes and do correlate with certain aspects of quality (Moramarco et al., 2022). To further validate our findings, we also conducted a human evaluation with three expert physicians (§4.3). As noted previously (Savkov et al., 2022), even human evaluation of clinical notes is far from perfect; inter-annotator agreement is generally low, likely because physicians have differing opinions on the importance of each patient statement and whether it should be included in a consultation note. We also found low inter-annotator agreement in our human evaluation and speculate this is partially due to differences in specialties among the physicians. Physicians 1 and 3, both from family medicine, had high agreement with each other but low agreement with physician 2 (cardiac surgery; see Table 3). Investigating better automatic metrics and best practices for evaluating clinical notes (and generated text more broadly) is an active field of research. We hope to integrate novel and performant metrics in future work.
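One simple way to quantify the pairwise agreement discussed above is raw percent agreement over aligned annotations; this is an illustrative choice (the paper does not specify an agreement statistic, and chance-corrected measures such as Cohen's kappa are also common):

```python
# Sketch: pairwise inter-annotator agreement as raw percent agreement
# over aligned per-example annotations. Illustrative only; a
# chance-corrected statistic (e.g. Cohen's kappa) may be preferable.
def percent_agreement(annotator_a, annotator_b):
    assert len(annotator_a) == len(annotator_b)
    matches = sum(1 for a, b in zip(annotator_a, annotator_b) if a == b)
    return 100 * matches / len(annotator_a)
```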
Data privacy While our GPT-4-based solution achieves the best performance, it is not compliant with data protection regulations such as HIPAA, although Azure does advertise a HIPAA-compliant option. From a privacy perspective, locally deploying a model such as LED may be preferred; however, our results suggest that more work is needed for this approach to reach acceptable performance (see Table 3). In either case, when implementing automated clinical note-generation systems, healthcare providers and developers should ensure that the whole system, including speech-to-text, data transmission & storage, and model inference, adheres to privacy and security requirements to maintain trust and prevent privacy violations in the clinical setting.

Ethics Statement
Developing an automated system for clinical note generation from doctor-patient conversations raises several ethical considerations. First, informed consent is crucial: patients must be made aware of their recording, and data ownership must be prioritized. Equitable access is also important; the system must be usable for patients from diverse backgrounds, including those with disabilities, limited technical literacy, or language barriers. Addressing issues of data bias and fairness is necessary to avoid unfair treatment or misdiagnosis for certain patient groups. The system must implement robust security measures to protect patient data from unauthorized access or breaches. Establishing clear lines of accountability for errors or harms arising from using an automated system for note generation is paramount. Disclosure of known limitations or potential risks associated with using the system is essential to maintain trust in the patient-physician relationship. Finally, ongoing evaluations are necessary to ensure that system performance does not degrade and negatively impact the quality of care.

A Subtask A
In subtask A of the Dialogue2Note Summarization shared task, given a partial doctor-patient dialogue, the goals are to: (1) predict the appropriate section header, e.g. "PASTMEDICALHX", and (2) generate that specific section of a note. We approached this task by fine-tuning a PLM on the provided training set, following a canonical, sequence-to-sequence training process (see Appendix C for details). In preliminary experiments, we found that the instruction-tuned FLAN-T5 (Chung et al., 2022) performed particularly well at this task.
We hypothesized that jointly learning to predict the section header and generate the section text would improve overall performance. To do this, we preprocessed the training set so the targets were of the form: "Section header: {section_header} Section text: {section_text}". After decoding, the section header and text were parsed using regular expressions and evaluated separately (Figure 5). Section header prediction was evaluated as the fraction of predicted headers that match the ground truth (accuracy), and section text was evaluated similarly to subtask B (see §3.3). In cases where the model output an invalid section header, we replaced it with "GENHX" (general history), which tends to summarize the contents of the other sections. The model was fine-tuned on a single NVIDIA A100-40GB GPU. Hyperparameters were lightly tuned on the validation set (Table 4).
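The parsing step can be sketched with a regular expression over the joint target format; the regex and the small set of valid headers below are illustrative reconstructions, not the shared task's official evaluation code:

```python
import re

# Sketch: parse "Section header: ... Section text: ..." outputs and
# fall back to GENHX for invalid headers, as described above. The
# regex and header set are illustrative reconstructions.
TARGET_PATTERN = re.compile(
    r"Section header:\s*(?P<header>.*?)\s*Section text:\s*(?P<text>.*)",
    re.DOTALL,
)

VALID_HEADERS = {"GENHX", "PASTMEDICALHX", "CC", "FAM/SOCHX"}  # illustrative subset

def parse_generation(output: str):
    match = TARGET_PATTERN.search(output)
    if match is None:
        return "GENHX", output  # unparseable output: treat as general history
    header = match.group("header").strip()
    if header not in VALID_HEADERS:
        header = "GENHX"  # replace invalid headers
    return header, match.group("text").strip()
```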
We present the results of our approach on the validation set in Table 5. (In practice, we found that the fine-tuned model rarely, if ever, generates invalid section headers.) Similar to subtask B (see §4.1), we find, perhaps unsurprisingly, that scaling the model size from FLAN-T5 BASE (24 layers, ∼250M parameters) to FLAN-T5 LARGE (48 layers, ∼780M parameters) leads to large improvements in performance. Performance is further improved by jointly learning to predict section headers and generate note sections. Our submission to the shared task based on this approach tied for first on section header prediction (78% accuracy), and ranked first for note section generation (average ROUGE-1, BERTScore and BLEURT F1-score of 57.9).

B Subtask B B.1 Hyperparameter tuning of LED
We lightly tuned the hyperparameters of LED LARGE-PubMed on the subtask B validation set against the average of ROUGE-1 F1, BERTScore F1 and BLEURT-20 scores. The best hyperparameters obtained are given in Table 6. We used the same hyperparameters when fine-tuning LED BASE and LED LARGE in §4.1.

B.2 Post-processing LED's outputs

In practice, we found that the fine-tuned LED model sometimes produces invalid section headers; notably, this problem did not occur with the ICL-based approach using GPT-4. Therefore, we lightly post-processed LED's outputs using a simple script that identifies section headers produced by the model that are not in the ground truth set and uses fuzzy string matching (we used https://github.com/seatgeek/thefuzz) to replace them with the closest valid header. For example, in one run, this process converted the (incorrect) predicted section header "HISTORY OF PRESENT" to the nearest valid header "HISTORY OF PRESENT ILLNESS".
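The same repair can be sketched with Python's standard library difflib in place of the thefuzz library we actually used; both perform fuzzy string matching against the set of valid headers (the header list below is an illustrative subset):

```python
import difflib

# Sketch: replace an invalid predicted section header with the closest
# valid header via fuzzy matching. difflib stands in for thefuzz here;
# the header list is an illustrative subset.
VALID_HEADERS = [
    "HISTORY OF PRESENT ILLNESS",
    "PHYSICAL EXAM",
    "ASSESSMENT AND PLAN",
]

def fix_header(predicted: str) -> str:
    matches = difflib.get_close_matches(predicted, VALID_HEADERS, n=1, cutoff=0.0)
    return matches[0] if matches else predicted
```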

C Fine-tuning Seq2Seq Models
When training the sequence-to-sequence (seq2seq) models for both subtask A (Appendix A) and B (§3.1), we followed a canonical supervised fine-tuning (SFT) process. We start with a pre-trained, encoder-decoder transformer-based language model (Vaswani et al., 2017). First, the encoder maps each token in the input to a contextual embedding. Then, the autoregressive decoder generates an output, token-by-token, attending to the outputs of the encoder at each timestep. Decoding proceeds until a special "end-of-sequence" token (e.g. </s>) is generated, or a maximum number of tokens have been generated. Formally, X is the input sequence, which in our case is a doctor-patient dialogue, and Y is the corresponding output sequence of length T, in our case a clinical note. We model the conditional probability:

p(Y | X; θ) = ∏_{t=1}^{T} p(y_t | X, y_{<t}; θ)    (1)

During training, we optimize over the model parameters θ the sequence cross-entropy loss:

ℓ(θ) = − ∑_{t=1}^{T} log p(y_t | X, y_{<t}; θ)    (2)

maximizing the log-likelihood of the training data. As is common, we use teacher forcing during training, feeding previous ground truth inputs to the decoder when predicting the next token in the sequence. During inference, we generate the output using beam search (Graves, 2012). Beams are ranked by mean token log probability after applying a length penalty. Models are fine-tuned using the HuggingFace Transformers library.

[...] is a 43-year-old female who presents today for an evaluation of right knee pain. She states she was trying to change a lightbulb on a ladder [...]
Right knee acute medial meniscus sprain.

PLAN
At this point, I discussed the diagnosis and treatment options with the patient. I have recommended a knee brace. She will take Motrin 800 mg, every 6 hours with food, for two weeks. She will use crutches for the next couple of days. She will follow up with me in 1 week [...]

Objective Exam
EXAM
Examination of the right knee shows pain with flexion. Tenderness over the medial joint line. No pain in the calf. Pain with valgus stress. Sensation is intact.

Figure 2: Example of a paired doctor-patient conversation and clinical note from the subtask B validation set. Dialogue has been lightly cleaned for legibility (e.g. removing trailing white space). Parts of the dialogue and note have been truncated. During evaluation, sections are grouped under one of four categories: "Subjective", "Objective Exam", "Objective Results", and "Assessment and Plan" (see §2.1 for details).

Figure 3: Histogram of token lengths for subtask B train and validation sets. Dialogues and notes were tokenized with tiktoken using the "gpt-4" encoding.

Prompt template, in-context examples (up to 3): EXAMPLE NOTE: HISTORY OF PRESENT ILLNESS Mr. Fisher is a 59-year-old male who presents for routine follow up of his chronic problems. [...]

Figure 4: Prompt template for our in-context learning (ICL) based approach. Each prompt includes natural language instructions, up to 3 in-context examples, and an unseen doctor-patient dialogue as input.
Each in-context example is a note from the train set. To select the notes, we first embed the dialogues of each training example and the input dialogue. Train dialogues are then ranked based on cosine similarity to the input dialogue; notes of the resulting top-k training examples are selected as the in-context examples (see Figure 1, B).

Figure 5: Parsing section headers and text from model generations. Target: "Section header: {section_header} Section text: {section_text}". Output: Section header: GENHX Section text: The patient is a 26 YO female, referred to Physical Therapy for low back pain [...]

Figure 6: Histogram of token lengths for subtask A train and validation sets. Dialogues and notes were tokenized with HuggingFace Tokenizers using "google/flan-t5-large". Lengths greater than the 99th percentile are omitted to make the plot legible.

Table 1: Fine-tuning LED. Mean and standard deviation (SD) of three training runs is shown. Scaling model size and pre-training on a related task improve performance. Bold: best scores.
The best strategy out-performs LED by almost 3 average score (60.8 vs. 57.9; see Tables 1 and 2). First, selecting in-context examples based on the similarity of dialogues has a strong positive impact, typically improving average score by 4 or more. Using only notes as in-context examples, as opposed to dialogue-note pairs, also has a positive impact, typically improving average score by ∼1. Surprisingly, increasing the number of in-context examples had a marginal effect on performance. Together, these results suggest that the in-context examples' primary benefit is providing guidance with regard to the expected note structure, style and length. Finally, filtering in-context examples to be of the same 'dataset source' as the input dialogue has a negligible impact on performance.

Table 2: ICL with GPT-4. Mean of ROUGE-1 F1, BERTScore F1 and BLEURT for three runs is shown. Selecting in-context examples based on similarity to the input dialogue improves performance. Dialogue-note pairs as in-context examples (omitting 3-shot results due to token length limits) underperform notes only. Filtering in-context examples to be of the same 'dataset source' as the input dialogue has little effect. Bold: best scores. SD < 0.1 in all cases.

Table 3: Human evaluation. Three physicians selected their preference from human-written ground-truth notes (GT), notes produced by the fine-tuned model (FT) and notes produced by in-context learning (ICL). Win rate is the % of cases where a note was preferred, excluding ties.

Table 4: Hyperparameters used with FLAN-T5 on Dialogue2Note subtask A.

Table 5: Fine-tuning FLAN-T5. Accuracy of predicted section headers and score of generated note sections is shown. Jointly learning to predict section headers and generate notes improves performance. Bold: best scores.