Generating SOAP Notes from Doctor-Patient Conversations Using Modular Summarization Techniques

Following each patient visit, physicians draft long semi-structured clinical summaries called SOAP notes. While invaluable to clinicians and researchers, creating digital SOAP notes is burdensome, contributing to physician burnout. In this paper, we introduce the first complete pipelines that leverage deep summarization models to generate these notes from transcripts of conversations between physicians and patients. After exploring methods across the extractive-abstractive spectrum, we propose Cluster2Sent, an algorithm that (i) extracts important utterances relevant to each summary section; (ii) clusters together related utterances; and then (iii) generates one summary sentence per cluster. Cluster2Sent outperforms its purely abstractive counterpart by 8 ROUGE-1 points, and produces significantly more factual and coherent sentences as assessed by expert human evaluators. For reproducibility, we demonstrate similar benefits on the publicly available AMI dataset. Our results speak to the benefits of structuring summaries into sections and annotating supporting evidence when constructing summarization corpora.


Introduction
Electronic health records (EHR) play a crucial role in patient care. However, populating them can take as much time as attending to patients (Sinsky et al., 2016) and constitutes a major cause of physician burnout (Kumar and Mezoff, 2020). In particular, doctors document patient encounters with SOAP notes, semi-structured written accounts containing four sections: (S)ubjective information reported by the patient; (O)bjective observations, e.g., lab results; (A)ssessments made by the doctor (typically, the diagnosis); and a (P)lan for future care, including diagnostic tests, medications, and treatments. Sections can be subdivided into 15 subsections.
In a parallel development, patients increasingly record their doctor's visits, either in lieu of taking notes or to share with a family member. A budding line of research has sought to leverage transcripts of these clinical conversations both to provide insights to patients and to extract structured data to be entered into EHRs (Liu et al., 2019b; Schloss and Konam, 2020; Krishna et al., 2021).
In this paper, we introduce the first end-to-end methods for generating whole SOAP notes based on clinical conversations. Our work builds on a unique corpus, developed in collaboration with Abridge AI, Inc., that consists of thousands of transcripts of recorded clinical conversations together with associated SOAP notes drafted by a work force trained in the official style of SOAP note documentation. On one hand, this task is much harder than traditional summarization benchmarks, in part, because SOAP notes are longer (320 words on average) than summaries in popular datasets like CNN/Dailymail (Nallapati et al., 2016), Newsroom (Grusky et al., 2018), and SamSum (Gliwa et al., 2019) (55, 27, and 24 words on average). On the other hand, our dataset offers useful structure: (i) segmentation of each SOAP note into subsections; and (ii) a set of supporting utterances that provide evidence for each sentence in the SOAP note. Exploiting this structure, our methods outperform appropriate baselines.
Our first methodological contribution is to propose a spectrum of methods for decomposing summarization tasks into extractive and abstractive subtasks. Starting from a straightforward sequence-to-sequence model, our methods shift progressively more work from the abstractive to the extractive component: (i) CONV2NOTE: the extractive module does nothing, placing the full burden of summarization on an end-to-end abstractive module; (ii) EXT2NOTE: the extractive module selects all utterances that are noteworthy (i.e., likely to be marked as supporting utterances for at least one SOAP note sentence), and the decoder is conditioned only on these utterances; (iii) EXT2SEC: the extractive module extracts per-subsection noteworthy utterances and the decoder generates each subsection, conditioned only on the corresponding utterances; (iv) CLUSTER2SENT: the extractive module not only extracts per-subsection noteworthy utterances but clusters together those likely to support the same SOAP sentence; here, the decoder produces a single sentence at a time, each conditioned upon a single cluster of utterances and a token indicating the SOAP subsection. We see consistent benefits as we move from approach (i) through (iv).

[Figure 1: Workflow of our best performing approach involving extraction and clustering of noteworthy conversation utterances followed by abstractive summarization of each cluster (fictitious data).]

Both to demonstrate the generality of our methods and to provide a reproducible benchmark, we conduct parallel experiments on the (publicly available) AMI corpus (Carletta, 2007). Like our medical conversations dataset, the AMI corpus exhibits section-structured summaries and contains annotations that link summary sentences to corresponding supporting utterances. Our experiments with AMI data show the same trends, favoring pipelines that demand more from the extractive component.
These results speak to the wider usefulness of our proposed approaches, EXT2SEC and CLUSTER2SENT, whenever section-structured summaries and annotated evidence utterances are available.
Our best performing model, CLUSTER2SENT (Figure 1), demands the most of the extractive module, requiring that it both select and group each subsection's noteworthy utterances. Interestingly, we observe that given oracle (per-subsection) noteworthy utterances, a simple proximity-based clustering heuristic leads to similar performance on SOAP note generation as we obtain when using ground-truth clusters, even though the ground truth noteworthy utterances are not always localized. (Our code and trained models for the AMI dataset are available at https://github.com/acmi-lab/modular-summarization.) Applied with predicted noteworthy utterances and clusters, this approach achieves the highest ROUGE scores and produces the most useful (factual, coherent, and non-repetitive) sentences as rated by human experts. As an additional benefit of this approach, because the input and output sequences involved are shorter, we can feasibly train large transformer-based abstractive summarization models (e.g., T5), whose memory requirements grow quadratically with sequence length. Additionally, our approach localizes the precise utterances upon which each SOAP note sentence depends, enabling physicians to verify the correctness of each sentence and potentially to improve the draft by highlighting the correct noteworthy utterances (versus revising the text directly).
In summary, we contribute the following:
• The first pipeline for drafting entire SOAP notes from doctor-patient conversations.
• A new collection of extractive-abstractive approaches for generating long section-segmented summaries of conversations, including new methods that leverage annotations attributing summary sentences to conversation utterances.
• A rigorous quantitative evaluation of our proposed models and appropriate baselines for both the extractive and abstractive components, including sensitivity of the pipeline to simulated ASR errors.
• A detailed human study to evaluate the factuality and quality of generated SOAP notes, and qualitative error analysis.
Related Work
In the space of two-step extractive-abstractive summarization approaches, Subramanian et al. (2019) summarize scientific papers by first extracting sentences and then abstractively summarizing them. Chen and Bansal (2018) extract important sentences from the input and then paraphrase each of them to generate the abstractive summary. While they assume that each summary sentence is supported by exactly one source sentence, in our medical conversations, many summary sentences synthesize content spread across multiple dialogue turns (e.g., a series of questions and answers).
Past work on abstractive summarization of medical conversations has focused on summarizing patient-nurse conversations with goals including capturing symptoms of interest (Liu et al., 2019c) and past medical history (Joshi et al., 2020). These tasks are respectively similar to generating the review of systems and past medical history subsections of a SOAP note. In contrast, we aim to generate a full-length SOAP note containing up to 15 subsections, and propose methods to address this challenge by extracting supporting context for smaller parts and generating them independently.

Dataset
We use two different datasets in this work. The primary medical dataset, developed through a collaboration with Abridge AI, consists of doctor-patient conversations with annotated SOAP notes. Additionally, we evaluate our summarization methods on the AMI dataset (Carletta, 2007), comprised of business meeting transcripts and their summaries.

Medical dataset
Our work builds on a unique resource: a corpus consisting of thousands of recorded English-language clinical conversations, with associated SOAP notes created by a work force trained in SOAP note documentation standards. Our dataset consists of transcripts from real-life patient-physician visits from which sensitive information, such as names, has been de-identified. The full medical dataset comprises 6862 visits: 2732 cardiologist visits, 2731 family medicine visits, 989 interventional cardiologist visits, and 410 internist visits. Owing to the sensitive nature of the data, we cannot share it publicly (an occupational hazard of research on machine learning for healthcare).
For each visit, our dataset contains a human-generated transcript of the conversation. The transcript is segmented into utterances, each annotated with a timestamp and speaker ID. The average conversation lasts 9.43 minutes and consists of around 1.5k words (Appendix Figure A1). Associated with each conversation, we have a human-drafted SOAP note created by trained, professional annotators. The annotators who created the SOAP notes worked in clinical transcription, billing, or related documentation departments, but were not necessarily professional medical scribes. The dataset is divided into train, validation, and test splits of size 5770, 500, and 592, respectively.
Our annotated SOAP notes contain (up to) 15 subsections, each of which may contain multiple sentences. The subsections vary in length. The Allergies subsection is most often empty, while the Assessment subsection contains 5.16 sentences on average (Table 1). The average SOAP note contains 27.47 sentences. The subsections also differ in style of writing. The Medications subsection usually consists of bulleted names of medicines and their dosages, while the Assessment subsection typically contains full sentences. On average, the fractions of novel (i.e., not present in the conversation) unigrams, bigrams, and trigrams in each SOAP note are 24.09%, 67.79%, and 85.22%, respectively.
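The novel n-gram statistics above can be computed with a short set-based routine. The sketch below counts unique n-grams, which is one common convention; the paper's exact counting method may differ:

```python
def novel_ngram_fraction(summary_tokens, source_tokens, n):
    """Fraction of the summary's unique n-grams that never appear in the source."""
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    summary_ngrams = ngrams(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - ngrams(source_tokens, n)
    return len(novel) / len(summary_ngrams)
```

For example, a summary sentence that swaps one word of a source sentence has a low novel-unigram fraction but a much higher novel-bigram fraction, matching the pattern in the statistics above.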
Each SOAP note sentence is also annotated with utterances from the conversation which provide evidence for that sentence. A SOAP note sentence can have one or more supporting utterances. On average, each SOAP sentence has 3.84 supporting utterances, but the mode is 1 (Appendix Figure A1). We refer to these utterances as noteworthy utterances throughout this paper. Throughout this work, we deal with the 15 more granular subsections rather than the 4 coarse sections of SOAP notes, and thus for convenience, all further mentions of section technically denote a SOAP subsection.

AMI dataset
The AMI dataset is a collection of 138 business meetings, each with 4 participants with various roles (e.g., marketing expert, product manager, etc.). Each meeting transcript comes with an associated abstractive summary that is divided into four sections-abstract, decisions, actions, and problems. Each conversation also has an associated extractive summary, and there are additional annotations linking the utterances in the extractive summary to sentences in the abstractive summary. For any given sentence in the abstractive summary, we refer to the linked set of utterances in the extractive summary as its noteworthy utterances. We note that 7.9% of the abstractive summary sentences have no annotated noteworthy utterances. To simplify the analysis, we remove these sentences from summaries in the training, validation, and test splits.

Methods
We investigate the following four decompositions of the summarization problem into extractive and abstractive phases, ordered from abstraction-heavy to extraction-heavy: CONV2NOTE takes an end-to-end approach, generating the entire SOAP note from the entire conversation in one shot. EXT2NOTE first predicts all of the noteworthy utterances in the conversation (without regard to the associated section) and then generates the entire SOAP note in one shot from only those utterances. EXT2SEC extracts noteworthy utterances, while also predicting the section(s) for which they are relevant, and then generates each SOAP section separately using only that section's predicted noteworthy utterances. CLUSTER2SENT attempts to group together the set of noteworthy utterances associated with each summary sentence. Here, we cluster separately within each set of section-specific noteworthy utterances and then generate each section one sentence at a time, conditioning each sentence on the associated cluster of utterances.
Each of these pipelines leaves open many choices of specific models for each subtask. For the abstractive modules of CONV2NOTE and EXT2NOTE, we use a pointer-generator network. The abstractive modules of EXT2SEC and CLUSTER2SENT, which require conditioning on the section, are modeled using section-conditioned pointer-generator networks (described in Section 5) and fine-tuned T5 models, which condition on the section being generated by prepending it to the input. T5 models could not be used in the CONV2NOTE and EXT2NOTE settings because their high memory requirement for long inputs could not be accommodated even with 48GB of GPU memory.
For noteworthy utterance extraction, we primarily use a hierarchical LSTM model and a BERT-LSTM model as described in the next section. All models are configured to have a scalar output for binary classification in EXT2NOTE, whereas for EXT2SEC and CLUSTER2SENT, they have multilabel output separately predicting noteworthiness for each section. Note that the same utterance can be noteworthy with respect to multiple sections. We use the same trained utterance extraction models for both EXT2SEC and CLUSTER2SENT.
For the clustering module in CLUSTER2SENT, we propose a heuristic that groups together any two supporting utterances that are close, meaning they have at most τ utterances separating them, where τ is a hyperparameter. This process is iterated, with the clusters growing by merging with other singletons or clusters, until every pair of close utterances has the same cluster membership. The value of τ is tuned on the validation set. Since each cluster necessarily produces one sentence in the SOAP note, having too many or too few clusters can make the SOAP note too long or too short, respectively. Therefore, for any given value of the hyperparameter τ and any given section, the prediction thresholds of the extractor are tuned on the validation set to produce approximately the same number of clusters over the entire validation set as are present in the ground truth for that section. Among ground truth clusters containing multiple noteworthy utterances, 82% are contiguous. In an experiment where the heuristic is used to cluster the oracle noteworthy utterances for each section, and summaries are subsequently generated via the abstractive modules from CLUSTER2SENT, ROUGE-1 and ROUGE-2 deteriorate by less than 1 point compared to oracle clusterings (Appendix Table A3), demonstrating the heuristic's effectiveness.
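Because closeness along the utterance sequence is transitive (any utterance lying between two close utterances is itself close to both), the iterative merging described above reduces to a single greedy pass over the sorted utterance indices. A minimal sketch, with implementation details that may differ from the paper's:

```python
def cluster_noteworthy(indices, tau):
    """Group utterance indices so that any two utterances with at most
    tau utterances between them end up in the same cluster."""
    clusters = []
    for idx in sorted(indices):
        # idx is "close" to the previous utterance if the gap between
        # their positions leaves at most tau utterances in between
        if clusters and idx - clusters[-1][-1] - 1 <= tau:
            clusters[-1].append(idx)
        else:
            clusters.append([idx])
    return clusters
```

With τ = 2, utterances at positions 2, 3, and 7 split into two clusters, since positions 3 and 7 have three intervening utterances.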

Pointer-Generator Network
We use the pointer-generator network introduced by See et al. (2017) for CONV2NOTE and EXT2NOTE. The model is a bidirectional LSTM-based encoder-decoder model with attention. It employs a pointer mechanism to copy tokens directly from the input in addition to generating them by predicting generation probabilities for the entire vocabulary. The model also computes the weights that govern copying versus generating at each decoding timestep.

Section-conditioned Pointer-Generator Network
We modify the pointer-generator network for algorithms EXT2SEC and CLUSTER2SENT, to condition on the (sub)section of the summary to be generated. The network uses a new lookup table to embed the section z into an embedding e z . The section embedding is concatenated to each input word embedding fed into the encoder. The section embedding is also appended to the inputs of the decoder LSTM in the same fashion.
T5
We use the recently released T5 model (Raffel et al., 2020) as an abstractive module. It is an encoder-decoder model, in which both the encoder and decoder consist of a stack of transformer layers. The T5 model is pre-trained on 5 tasks, including summarization and translation. We use the pre-trained T5 model parameters and fine-tune them on our task dataset. To introduce section-conditioning in EXT2SEC and CLUSTER2SENT, we simply prepend the name of the section being generated to the input.
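The section-conditioned input construction amounts to simple string serialization. The sketch below is illustrative only: the speaker tags and separators are hypothetical, and the paper does not specify the exact token layout:

```python
def build_section_input(section, cluster_utterances):
    """Serialize a cluster of (speaker, text) utterances and prepend the
    target section name, as a way of conditioning T5 on the section.
    The "[SPEAKER] text" layout here is an assumed format."""
    turns = " ".join(f"[{speaker}] {text}" for speaker, text in cluster_utterances)
    return f"{section}: {turns}"
```

The same cluster can then be serialized under different section names, letting one model produce section-appropriate outputs from identical evidence.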

Hierarchical LSTM classifier (H-LSTM)
In this model, we first encode each utterance u_i independently by passing its tokens through a bidirectional LSTM and mean-pooling their encoded representations to get the utterance representation h_i. We pass the sequence of utterance representations {h_1, h_2, ..., h_n} through another bidirectional LSTM to get new utterance representations that incorporate neighboring context. These are then passed through a sigmoid-activated linear layer to predict each utterance's probability of noteworthiness with respect to each section.

BERT-LSTM classifier (B-LSTM)
In this model, tokens in the utterance u_i are passed through a BERT encoder to obtain their contextualized representations, which are mean-pooled to get the utterance representation h_i. The subsequent architecture exactly mirrors the hierarchical LSTM: utterance representations are passed through a bidirectional LSTM and a linear layer to get output probabilities. BERT-LSTM is fine-tuned end-to-end.

Experiments
We first establish two baselines. RANDOMNOTE randomly and uniformly samples a SOAP note from the training set and outputs it as the summary for any input conversation. ORACLEEXT presents all the ground truth noteworthy utterances (evidence) from the conversation as the SOAP note, without any abstractive summarization. Thus, the ORACLEEXT baseline has the advantage of containing all the desired information (e.g., names of medicines) from the conversation, but the disadvantage of not being expressed in the linguistic style of a SOAP note, which leads to lower n-gram overlap. The opposite is true for the RANDOMNOTE baseline. Both baselines give similar performance and are outperformed by the simple CONV2NOTE approach (Table 2).
We train the abstractive modules for the 4 approaches described in Section 4 with the ground truth noteworthy utterances as inputs. To estimate an upper bound on the performance we can reasonably hope to achieve by improving our noteworthy utterance extractors, we test our models with oracle noteworthy utterances in the test set. All algorithms relying on oracle noteworthy utterances outperform CONV2NOTE, and exhibit a monotonic and significant rise in ROUGE scores as we move towards the extraction-heavy end of the spectrum (Table 3; the character '-' in the table denotes GPU memory overflow). For predicting noteworthy utterances, we use two baselines: (i) logistic regression on TF-IDF utterance representations; and (ii) a model with a bidirectional LSTM to compute token-averaged utterance representations, followed by a linear classification layer. These two models make predictions for each utterance independently of the others.
In contrast, we also use models that incorporate context from neighboring utterances: (a) a hierarchical LSTM; and (b) a BERT-LSTM model as described in Section 5. The latter two methods perform much better (Table 5), demonstrating the benefit of incorporating neighboring context, with BERT-LSTM performing best (see Appendix Table A6 for section-wise performance).
Using predicted noteworthy utterances and clusters instead of oracle ones leads to a drop in ROUGE scores, but the performance of EXT2SEC and CLUSTER2SENT is still better than CONV2NOTE (Table 2). For the medical dataset, using a BERT-LSTM extractor leads to the best performance, with CLUSTER2SENT outperforming CONV2NOTE by about 8 points in ROUGE-1 (see Appendix Table A5 for section-wise performance). Interestingly, the T5-Small variant achieves similar performance to T5-Base, despite being only about a quarter of the latter's size.

Performance on AMI dataset
We see a similar trend in the ROUGE scores when applying these methods to the AMI dataset. One exception is the poor performance of the pointer-generator based EXT2NOTE, which excessively repeated sentences despite using a high coverage loss coefficient. There is a larger gap between the performance of the T5-Small and T5-Base abstractive models on this dataset. As an extractor, BERT-LSTM again performs better than H-LSTM (Table 5), but when used in tandem with the abstractive module, the ROUGE scores achieved by the overall pipeline do not always follow the same order. We also observe that the clustering heuristic does not work as well on this dataset. Specifically, tuning the thresholds of the extractive model while fixing the clustering threshold τ gave worse results; tuning the thresholds independent of the clusters performed better. However, the best method still outperforms CONV2NOTE by about 11 ROUGE-1 points (Table 2).
Performance with ASR errors
In the absence of human-generated transcripts of conversations, Automatic Speech Recognition (ASR) techniques can be used to transcribe the conversations for use by our models. To account for ASR errors, we artificially introduced errors into transcripts of the medical dataset by randomly selecting some percentage of the words and replacing them with phonetically similar words using RefinedSoundEx (Commons) (details in the Appendix). Models trained on the clean dataset perform worse on a 10% corrupted test set (Table 4). Since ASR errors replace a correct word with one of only a small set of phonetically similar words, some information indicating the original word remains available to the models. When we train our models on data corrupted at the 10% ASR error rate, they recover much of the performance drop (Table 4). Notably, when simulated ASR errors are dialed up to a 30% error rate (both at train and test time), we see a smaller performance drop for CLUSTER2SENT than for CONV2NOTE.
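The corruption procedure can be sketched as follows. Here the table of phonetically similar substitutes is supplied as a plain dictionary, a hypothetical stand-in for the RefinedSoundEx-derived candidates described above:

```python
import random

def corrupt_transcript(words, error_rate, confusions, rng=None):
    """Replace roughly `error_rate` of the words with a phonetically
    similar substitute. `confusions` maps a word to its candidate
    substitutes; words with no candidates are left unchanged."""
    rng = rng or random.Random(0)
    out = []
    for w in words:
        if w in confusions and rng.random() < error_rate:
            out.append(rng.choice(confusions[w]))
        else:
            out.append(w)
    return out
```

Training on transcripts corrupted this way exposes the model to the same substitution noise it will see at test time, which is what lets it recover much of the performance drop.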

Qualitative Analysis
The conditioned pointer-generator and T5 models used in CLUSTER2SENT learn to place information regarding different topics in the appropriate sections. Hence, given a cluster of supporting utterances, the models can generate different summaries for multiple sections (Figure 2). For example, given the same supporting utterances discussing the patient's usage of lisinopril for low blood pressure, a model generates "low blood pressure" in the review of systems section and "lisinopril" in the medications section. We direct the reader to the appendix for examples of full-length generated SOAP notes.
Interestingly, when the abstractive model is given a cluster of utterances that are not relevant to the section being generated, the model sometimes outputs fabricated information relevant to that section, such as saying the patient is a non-smoker in social history, or that the patient has taken a flu shot in immunizations. Hence, the quality of produced summaries heavily depends on the ability of the extractive step to assign the extracted utterances to the correct section. Another cause of false information is the use of pronouns in clusters without a mention of the referred entity. In such situations, T5 models frequently replace the pronoun with some arbitrary entity (e.g., "she" with "daughter", compounds with "haemoglobin", and medicines with "lisinopril").
Occasionally, the abstractive module produces new inferred information that is not mentioned explicitly in the conversation. In one instance, the model generated that the patient has a history of heart disease conditioned on a cluster that mentioned he/she takes digoxin, a popular medicine for heart disease. Similarly, the model can infer a past medical history of "high cholesterol" upon seeing pravastatin usage. Such inferences can also lead to incorrect summaries, e.g., when a doctor explained that a patient has leaky heart valves, a model added a sentence to the diagnostics and appointments section saying "check valves".

[Table 4: Performance of models trained and tested on data with different simulated ASR error rates. BLS: BERT-LSTM.]

Because CLUSTER2SENT generates each sentence from a different part of the conversation independently, it may produce contradictions in the SOAP note. In one visit, the patient was asked about chest pain twice: once in the beginning to get to know his/her current state, and once as a question about how he/she felt just before experiencing a fall in the past. This led to the model generating both that the patient denied chest pain and that the patient confirmed chest pain, without clarifying that one statement was about the present and the other about the past.

Human evaluation
We asked trained human annotators to evaluate generated SOAP notes for 45 conversations. Every sentence in each SOAP note was labeled along various quality dimensions, such as whether it was factually correct, incoherent, irrelevant, redundant, or placed under an inappropriate section. The detailed statistics of annotations received for each quality dimension are provided in the Appendix. We also collected aggregate annotations for the comprehensiveness of each SOAP note and the extent to which it verbatim copied the transcript, on a 5-point Likert scale.
Human raters were presented with a web interface showing the conversation, along with a search feature to help them look up desired information.

[Table 5: Performance on multilabel classification of noteworthy utterances with logistic regression (LR), LSTM (LS), Hierarchical-LSTM (HLS), and BERT-LSTM (BLS). Ma: macro-averaged. Mi: micro-averaged.]
The summaries generated by three methods (CONV2NOTE (pointer-generator), CLUSTER2SENT (pointer-generator), and CLUSTER2SENT (T5-base)) were presented in random order to hide their identities. For each sentence, we asked for: (i) the factual correctness of the sentence; (ii) whether the statement simply repeats what has already been mentioned; (iii) whether the statement is clinically irrelevant; (iv) whether the statement is incoherent (not understandable due to grammatical or semantic errors); and (v) whether the statement's topic does not match the section in which it is placed. In addition, we asked two separate questions rating the overall summary on a scale of 1-5 for (i) its comprehensiveness and (ii) the extent of verbatim copying from the conversation. The human evaluation of the SOAP notes was done by workers who had also participated in the creation of the dataset of SOAP notes. Hence, they had already been extensively trained in the task of SOAP note creation, which gave them the appropriate knowledge to judge the SOAP notes.
To quantify the performance of the different methods, we consider a scenario where each generated SOAP note must be post-edited by discarding undesirable sentences. For a generated SOAP note, we define its yield as the fraction of its total sentences that are not discarded. The sentences that are retained are those that are both factually correct and not labeled as either repetitive or incoherent. The human annotations show that both CLUSTER2SENT-based methods tested produced a higher yield than the CONV2NOTE baseline (p < 0.02). T5-base performs better than the conditioned pointer-generator as the abstractive module in the CLUSTER2SENT setting, producing a significantly higher yield (Table 6). T5 also produces fewer incoherent sentences (Appendix Table A4), likely due to its exposure to a large number of well-formed, coherent sentences during pretraining.

[Table 6: Averages of different metrics for CONV2NOTE (C2N), CLUSTER2SENT with pointer-generator (C2S-P), and T5-base (C2S-T). Comp: comprehensiveness, Copy: amount of copying, Length: number of sentences generated.]
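The yield metric reduces to a simple filter over per-sentence annotations. A sketch, assuming boolean labels shaped as below (the field names are hypothetical, not the paper's annotation schema):

```python
def soap_yield(sentence_labels):
    """Fraction of generated sentences retained after discarding any
    sentence judged non-factual, repetitive, or incoherent."""
    if not sentence_labels:
        return 0.0
    kept = [s for s in sentence_labels
            if s["factual"] and not s["repetitive"] and not s["incoherent"]]
    return len(kept) / len(sentence_labels)
```

A note of four sentences, one non-factual and one repetitive, thus has a yield of 0.5.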
We conducted an analogous human evaluation of summaries generated for all 20 conversations in the test set of the AMI corpus, and saw a similar trend in the expected yield for different methods. Notably, for the AMI corpus, CONV2NOTE produced a very high proportion of redundant sentences (> 0.5) despite using the coverage loss, while the pointer-generator based CLUSTER2SENT produced a high proportion of incoherent sentences (Appendix Table A4).

Conclusion
This paper represents the first attempt at generating full-length SOAP notes by summarizing transcripts of doctor-patient conversations. We proposed a spectrum of extractive-abstractive summarization methods that leverage: (i) the section-structured form of SOAP notes and (ii) the conversation utterances linked to every SOAP note sentence. The proposed methods perform better than a fully abstractive approach and standard extractive-abstractive approaches that do not take advantage of these annotations. We demonstrate the wider applicability of the proposed approaches by showing similar results on the public AMI corpus, which has similar annotations and structure. Our work demonstrates the benefits of creating section-structured summaries (when feasible) and collecting evidence for each summary sentence when creating any new summarization dataset.

Ethics Statement
The methods proposed in this work to generate SOAP notes involve neural models that sometimes generate factually incorrect text (Maynez et al., 2020). The detection and correction of such factual errors in automatically generated summaries is an active area of research (Cao et al., 2018; Zhang et al., 2020; Dong et al., 2020). We emphasize that the methods are intended to be used with supervision from a medical practitioner who can check for factual errors and edit the generated SOAP note if needed. We have estimated the frequency of such factual errors (Appendix Table A4) and characterized multiple types of errors seen in generated SOAP notes in Section 7, for which medical practitioners should remain vigilant. For example, there is a bias to incorrectly generate information that occurs frequently in specific sections (e.g., "patient took flu shot"), and to replace pronouns with frequently seen entities (such as "lisinopril" for references to medicine). All data used in this study was manually de-identified before we accessed it. Deploying the proposed methods does not require long-term storage of conversations. After the corresponding SOAP notes are generated, conversations can be discarded. Hence, we do not anticipate any additional privacy risks from using the proposed methods.

Decoder Results with Oracle extracts
We present additional quantitative results (Table A3), including (i) the ROUGE scores on the test set when using oracle noteworthy utterances with both oracle and predicted clusters (for CLUSTER2SENT models); and (ii) two ablations on EXT2SEC: ALLEXT2SEC uses binary classification to extract all noteworthy utterances (not per-section) and an abstractive decoder that conditions on the section, while EXT2SECNOCOND uses a multilabel classification based extractor but does not use section-conditioning in the abstractive module. Both methods mostly perform worse than EXT2SEC, demonstrating the benefit of using both section-specific extraction and section-conditioning in the abstractive decoder.

Impact of copy mechanism
When we do not use the copy mechanism in the pointer-generator model, we observe a drop in performance in the CLUSTER2SENT setting with oracle noteworthy utterances and clusters (Table A1). Hence, we use the copy mechanism in all pointer-generator models trained in this work.
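The copy mechanism referenced above combines the decoder's vocabulary distribution with attention over the source, as in standard pointer-generator networks. The following is a minimal illustrative sketch (function and variable names are ours, not from the paper's implementation), computing the final word distribution P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on source positions where w occurs:

```python
def final_distribution(p_gen, p_vocab, attention, source_tokens):
    """Combine the generation and copy distributions of a pointer-generator.

    p_gen: scalar probability of generating from the vocabulary.
    p_vocab: dict mapping word -> vocabulary probability.
    attention: list of attention weights over source positions.
    source_tokens: list of source words aligned with `attention`.
    """
    # Scale the vocabulary distribution by the generation probability.
    combined = {w: p_gen * p for w, p in p_vocab.items()}
    # Add copy probability mass: attention on each source occurrence of w.
    for a_i, tok in zip(attention, source_tokens):
        combined[tok] = combined.get(tok, 0.0) + (1.0 - p_gen) * a_i
    return combined
```

This lets the model emit out-of-vocabulary source words (e.g., drug names) by copying, which is plausibly why removing it hurts in this domain.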

Impact of pretraining
When training a randomly initialized T5-Base model on the medical dataset, even in the CLUSTER2SENT setting with oracle clusters, it only achieves a ROUGE-1 of around 40 (Table A2). This is over 16 points lower than what we get by starting from off-the-shelf pretrained T5 parameters, and is even worse than CONV2NOTE, highlighting the importance of pretraining.

Table A1 (columns): Copy mechanism, R-1, R-2, R-L.

Table A2: Impact of pretraining on performance of T5-Base model on medical dataset with CLUSTER2SENT using oracle noteworthy utterance clusters.

Sample generated SOAP notes
Due to privacy concerns, we cannot publish conversations from our dataset. Here, we present an obfuscated conversation from our test set, modified by changing sensitive content such as medicines, diseases, and dosages (Figure A2). We also present the SOAP note generated by our best method, as well as the ground truth.

Model implementation details
For the hierarchical LSTM classifier, we use a word embedding size of 128, and both bidirectional LSTMs have a hidden size of 256. For BERT-LSTM, the BERT embeddings are initialized from bert-base-uncased (768 dimensions), the LSTMs in either direction have a hidden size of 512, and the entire model is optimized end-to-end with a learning rate of 0.001. For BERT-LSTM, an input conversation is divided into chunks of 128 utterances; due to GPU memory constraints, these chunks are processed one at a time. The pointer-generator models have a word embedding size of 128, and a hidden size of 256 for both the encoder and the decoder. The section embeddings used in the section-conditioned pointer-generator network have 32 dimensions. During training, all pointer-generator models are first trained without coverage loss (Tu et al., 2016) to convergence, and then trained further with coverage loss added. We tried coverage loss coefficients varying from 0.5 to 8.
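The chunking of long conversations for BERT-LSTM can be sketched as follows (a minimal sketch; the function name is illustrative, and in the actual model each chunk would be passed through BERT before the LSTM):

```python
def chunk_utterances(utterances, chunk_size=128):
    """Split a conversation into fixed-size chunks of utterances so that
    each chunk fits in GPU memory and can be processed independently."""
    return [utterances[i:i + chunk_size]
            for i in range(0, len(utterances), chunk_size)]
```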
The pointer-generator models were trained using the Adam optimizer before coverage and SGD after adding coverage. We tried learning rates between 10^-4 and 10^-3 with Adam. Next-word prediction accuracy was used as the validation criterion for early stopping while training abstractive modules, with the exception of coverage-augmented models, which used a combination of cross-entropy and coverage loss. Micro-averaged AUC was used as the validation criterion for training extractive modules. We employ beam search with beam size 4 to decode outputs from both models. For the vanilla pointer-generator model used in CONV2NOTE and EXT2NOTE, we modified the beam search procedure to ensure that all SOAP note sections are generated in the proper order. We start beam search by feeding the header of the first section (chief complaint). Whenever the model predicts a section header as the next word and it shows up in a beam, we check whether it is the next section to be generated; if not, we replace it with the correct next section's header. Any end-of-summary tokens generated before all sections have been produced are replaced similarly. Note that producing all sections simply means that the headers for each section have to be generated; a section can be left empty by starting the next section immediately after generating the previous header. The decoding length for beam search is constrained to lie between the 5th and 95th percentiles of the target sequence length distribution, computed on the training set.

Figure A1: Histogram of the number of words in a conversation and the number of evidence utterances per summary sentence for the medical dataset.
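The section-ordering constraint on beam search can be sketched as follows. This is a minimal illustration with hypothetical header tokens; the actual implementation operates on beam states and token ids rather than strings:

```python
# Hypothetical section-header tokens, listed in SOAP note order.
SECTION_HEADERS = ["<chief_complaint>", "<assessment>", "<plan>"]

def constrain_token(predicted, num_sections_done, eos="</s>"):
    """Enforce in-order section generation during beam search.

    If the model predicts a section header (or end-of-summary) out of
    turn, substitute the correct next section's header instead.
    """
    next_header = (SECTION_HEADERS[num_sections_done]
                   if num_sections_done < len(SECTION_HEADERS) else None)
    if predicted in SECTION_HEADERS or predicted == eos:
        if next_header is not None and predicted != next_header:
            return next_header
    # Ordinary words, the correct next header, and a legitimate
    # end-of-summary token pass through unchanged.
    return predicted
```

A section is left empty simply when the next header is generated immediately after the previous one, matching the behavior described above.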

Simulating ASR Errors
We simulate ASR errors at any given percentage rate by randomly selecting that percentage of the words in the conversation and replacing them with phonetically similar words. To reduce the search space of possible candidates for each word, we use the suggest() function from the PyEnchant library, which provides auto-correct suggestions for the input word. Each suggestion is then passed through the Refined Soundex algorithm to find the phonetic distance between the original and the suggested word; we use the pyphonetics package for a Python implementation of this algorithm. For the final candidate list, we keep words at a phonetic distance of 1 from the original word. Finally, a candidate is chosen at random from this list to replace the original word.
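The procedure above can be sketched as follows. To keep the sketch self-contained, the candidate table here is a hypothetical stand-in; in the paper's pipeline it would be built from PyEnchant's suggest() output filtered to Refined Soundex distance 1 via pyphonetics:

```python
import random

# Stand-in phonetic-candidate table (illustrative entries only). In
# practice: candidates[w] = [s for s in enchant_dict.suggest(w)
#                            if refined_soundex.distance(w, s) == 1]
PHONETIC_CANDIDATES = {
    "flu": ["flew"],
    "shot": ["shop"],
}

def simulate_asr_errors(words, rate, candidates, seed=0):
    """Replace `rate` fraction of words with a randomly chosen
    phonetically similar candidate, when one is available."""
    rng = random.Random(seed)
    n_replace = round(rate * len(words))
    # Pick which word positions to corrupt.
    idxs = rng.sample(range(len(words)), n_replace)
    out = list(words)
    for i in idxs:
        options = candidates.get(out[i].lower())
        if options:
            out[i] = rng.choice(options)
    return out
```

Words with no candidate at distance 1 are left unchanged, so the realized error rate is at most the requested rate.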

More Experimental Details
We trained models on multiple Nvidia Quadro RTX 8000, RTX 2080Ti, and V100 GPUs. The extractive modules were evaluated using standard classification metrics from scikit-learn, and the quality of summaries was evaluated using ROUGE scores calculated with the pyrouge Python package, a wrapper around the ROUGE-1.5.5 Perl script.

Conversation:
DR: Right, and I think that's in the, we can all take a little note for but one of the things that really got me worried because your last hemoglobin was really low -
PT: Uh-huh. (LIR) (A)
DR: It was below, it was below 10, and we've had this consistent pattern and you've really, I mean, you really have given it an effort and I have to give it up to you that you've been trying and, um, so we're down to like just a couple of options and so I want to just kind of put them before you.
DR: Um, I do have one other option, um, but I want to counsel you that, that Metformin, even if, if we did, we do go to it, it is not a punishment. (A)
DR: It is something to kind of get your baseline down to a regular, regular situation and you only have to do it once a day. (A)
DR: Um, and I know that one of the things that we have for anemics is their eating habits. (A) (PT)
DR: And, so, I am proposing as instead of using Metformin this time, um, that we use something called Lipitor for the, for the eating at nighttime. (A)
DR: Um, it's supposed to reduce the incidence of having those nighttime cravings so that you can work, you can do your things, you can plan a little bit better. (A)
DR: It's, it's originally for ADHD so some people actually feel a little bit more focused, um, and controlled but it also affects appetite centers and so it's supposed to do it for the longer term as opposed to using like a fen phen, um, so, which is short term.
DR: So, um, I'm really hoping with your interest in it and with the coverage hopefully, I know, with your particular plan it should be covered and we can get a discount.
DR: And, so what we do is we say, you know, it should be, we usually will do three months but then eight weeks we should see some difference from today.
DR: We should see some kind of improvement and then we can sort of celebrate that in and of itself, if that's okay with you.
PT: That sounds great. (DA)
DR: Cool, all right well we will plan to meet again in eight weeks.
PT: Okay.
DR: And, uh, and we'll go from there.
PT: Okay.
DR: Cool, all right, cool.

Generated SOAP note:
Miscellaneous: patient has snacking and stress eating. poor meal planning.
Laboratory and Imaging Results: last hemoglobin was low at 10.
Assessment: discussed about being a metformin candidate. discussed about hemoglobin and the things that keep patient from managing anemia well. discussed that patient's last hemoglobin was really low, it was really low, it was really low, it was really low, it was really low, it was really low, and we have had this consistent pattern and you really have given it effort and we have had this. followup in 8 weeks.
Diagnostics and Appointments: followup in 8 weeks.
Prescriptions and Therapeutics: the patient will be a metformin candidate. metformin once a day. coumadin twice a day with other medications, which are actually pretty minor.

Ground truth
Chief Complaint: follow-up. anemia.
Past Medical History: anemia.
Medications: metformin
Miscellaneous: patient is not following a correct diet plan (snacking and stress eating).
Laboratory and Imaging Results: hemoglobin was really low below 10.
Assessment: anemia. night time eating. discussed with the patient the importance of bringing up the hemoglobin to a considerable level and also discussed couple of other options. discussed the new medication called lipitor which will help the patient bringing up the hemoglobin and can take it once a day with other medications. discussed that the lipitor will reduce the nighttime cravings so that the patient can plan better (originally for ADHD to better focus). discussed with the patient that with the current insurance coverage, the patient may get a discount with lipitor.
Diagnostics and Appointments: advised to follow up in 8 weeks.
Prescriptions and Therapeutics: metformin. lipitor.

Figure A2: Sample conversation (obfuscated) with SOAP note generated by the best method and the ground truth