MedicalSum: A Guided Clinical Abstractive Summarization Model for Generating Medical Reports from Patient-Doctor Conversations

Introduction
The volume of data created in healthcare has grown considerably as a result of record-keeping and regulatory requirements (Kudyba, 2010). The documentation requirements for electronic health records (EHR) have been shown to be a significant factor contributing to physician burnout (van Buchem et al., 2021; Tran et al., 2020). As a result, the automatic creation of medical documentation has been proposed as one way to address this issue.
To date, there have been several attempts at automatically generating summaries of clinical encounters. Enarvi et al. (2020) employed a transformer model to summarize doctor-patient conversations. Joshi et al. (2020) developed models to summarize dialogue snippets between two and ten physician-patient turns long. Finally, Jeblee et al. (2019) and Lacson et al. (2006) used extractive methods to identify the most important utterances, which are combined to form the final summary.
The summaries generated by current summarization models are not straightforwardly controllable (Li et al., 2018). Dialogue summarization is also challenging because casual conversation can include interruptions, repetitions, and sudden topic transitions (Khalifa et al., 2021), and generally does not follow the structure of a written document (Zhu and Penn, 2006). These challenges can lead to problems such as the omission of key information or the hallucination of unsupported information. Summaries for medical documentation must also use the correct medical terminology expected by physicians (Knoll et al., 2022). To help address these problems, we propose a novel knowledge-augmented transformer model that uses medical knowledge to guide the summarization process in several ways, increasing the likelihood that relevant medical facts are included in the summarized output (an example of such output is shown in Figure 1). Key contributions include: (i) to the best of our knowledge, we are the first to propose using medical knowledge from a clinical Metathesaurus (UMLS (Bodenreider, 2004)) in the summarization process of a transformer-based model in order to generate 'medically focused' clinical note summaries; (ii) we answer the question of how to incorporate structured medical knowledge in medical documentation generation by designing three specific signals over medical entities; (iii) by leveraging these methods, the MedicalSum model achieves ROUGE-1 and ROUGE-L improvements between 0.8% and 2.1% in all medical note summarization experiments.

Related Work
There are two main approaches to summarization: extractive methods (Kupiec et al., 1995), where the summary is created from passages copied from the source text, and abstractive methods (Chopra et al., 2016), where phrases and words not in the source text can be used to create the summary.
Neural Abstractive Summarization: For the task of abstractive summarization, sequence-to-sequence (seq-to-seq) models have achieved state-of-the-art results (Sutskever et al., 2014). Different architectures have been proposed to improve the performance of a seq-to-seq model. Enarvi et al. (2020) used a transformer-based (Vaswani et al., 2017) encoder-decoder architecture to produce highly accurate summaries. In addition, See et al. (2017) used a pointing mechanism to copy words from the source document.
Guided Summarization: Several studies have focused on including guidance signals in the standard seq-to-seq architecture. Zhu et al. (2020) proposed the usage of relational triples (subject, relation, object). Narayan et al. (2021) used entity chains as an intermediate planning signal for generation. Joshi et al. (2020) used a variation of the pointer-generator model that leveraged shared medical terminology between source and target to distinguish important words from unimportant ones.

Dataset
To train the MedicalSum model, we need a dataset large enough for the medical signals to meaningfully affect the performance of the model. However, there are no publicly available large-scale datasets for medical summarization, so we use a proprietary one: English data consisting of recently recorded Family Medicine patient-doctor visits. The speaker-diarized conversation transcripts corresponding to the audio files were obtained using an automatic speech recognition system, and medical professionals created the associated clinical notes.
The reports for family medicine are organized under three sections that correspond to three broad areas of a medical note: (i) History of Present Illness (HPI), which captures the reason for the visit; (ii) Physical Examination (PE), which captures findings from a physical examination; and (iii) Assessment and Plan (AP), which captures the assessment by the doctor and the treatment plan. Table 1 shows detailed statistics of our dataset. Because this dataset contains patients' private medical information, it cannot be made publicly available; for this reason, we also experiment with a public dataset to allow for a more open comparison. We tackle the third task of the MEDIQA 2021 challenge (Ben Abacha et al., 2019) on summarization of radiology reports (RAD) (Johnson et al., 2019).
From Table 1, it can be observed that the input documents in the MEDIQA dataset are much shorter than those of the family medicine dataset. However, we include it in order to evaluate the models and the baseline on an external dataset. Our experiments are consistent with the datasets' intended use, as they were created for research purposes, and we did not notice any indication of offensive content in the datasets.

MedicalSum: Medical Guided Transformer Pointer Generator Model
We adopt the transformer self-attention model of Vaswani et al. (2017) in both the encoder and the decoder to create context-dependent representations of the inputs. Both encoder and decoder consist of six layers of self-attention with 8 attention heads, and each decoder layer attends to the top of the encoder stack after its self-attention. We use the base model size of 8 attention heads with a model dimension of 512 and a 2048-dimensional feed-forward network. Each encoder and decoder layer contains a position-wise feed-forward network that consists of two linear transformations with a ReLU activation in between. A simplified illustration of the MedicalSum model can be found in Figure 2. The details of each added component are discussed in the following sections.
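For concreteness, the configuration above corresponds to the following minimal sketch. Our actual implementation is built on Fairseq, so this class is illustrative rather than the code we trained:

```python
import torch
import torch.nn as nn

# Base configuration described above: 6 encoder and 6 decoder layers,
# 8 attention heads, 512-dimensional hidden states, and 2048-dimensional
# position-wise feed-forward networks with ReLU activations.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    activation="relu",
)

src = torch.randn(100, 2, 512)  # (source length, batch, d_model)
tgt = torch.randn(40, 2, 512)   # (target length, batch, d_model)
out = model(src, tgt)           # (40, 2, 512)
```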

Pointer-Generator
We implement the pointer-generator mechanism as described in Enarvi et al. (2020) and See et al. (2017).
We use a single attention head to attend to the tokens that are good candidates for copying. Garg et al. (2019) observed that the penultimate layer naturally tends to learn alignments, so we use the first attention head of that layer for pointing.
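The final next-token distribution of a pointer-generator mixes the decoder's vocabulary distribution with the pointing head's attention over source tokens. The sketch below follows the standard formulation of See et al. (2017); the tensor names and the externally computed p_gen are assumptions, not our exact implementation:

```python
import torch

def pointer_generator_dist(vocab_logits, copy_attn, src_token_ids, p_gen):
    """Mix the decoder's vocabulary distribution with a copy distribution.

    vocab_logits:  (batch, tgt_len, vocab_size) decoder output logits
    copy_attn:     (batch, tgt_len, src_len) attention weights of the pointing
                   head (first head of the penultimate decoder layer)
    src_token_ids: (batch, src_len) source token ids available for copying
    p_gen:         (batch, tgt_len, 1) probability of generating vs. copying
    """
    vocab_dist = torch.softmax(vocab_logits, dim=-1)
    # Scatter the copy-attention mass onto the vocabulary ids of the source.
    copy_dist = torch.zeros_like(vocab_dist)
    index = src_token_ids.unsqueeze(1).expand(-1, copy_attn.size(1), -1)
    copy_dist.scatter_add_(-1, index, copy_attn)
    return p_gen * vocab_dist + (1.0 - p_gen) * copy_dist
```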

Medical Guidance Signal
We include a medical guidance signal in the summarization process, consisting of all the medical terms in the input sequence that can be identified in UMLS using the MedCAT toolkit (Kraljevic et al., 2021). Following Dou et al. (2021), we introduce two encoders (that share weights) that encode the input text and the guidance signal, respectively. Each encoder layer consists of a self-attention block and a feed-forward block. Each decoder layer consists of a self-attention block; a cross-attention block with the medical guidance signal, which informs the decoder which sections of the source document are important; a cross-attention block with the encoded input, where the decoder attends to the whole source document based on the guidance-aware representations; and a feed-forward block.
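A minimal sketch of such a decoder layer is shown below, assuming a standard post-norm transformer layout; the real Fairseq-based implementation may differ in normalization placement and masking details:

```python
import torch.nn as nn

class GuidedDecoderLayer(nn.Module):
    """Sketch of one MedicalSum-style decoder layer: self-attention, then
    cross-attention over the encoded guidance signal, then cross-attention
    over the encoded source document, then a feed-forward block."""

    def __init__(self, d_model=512, nhead=8, dim_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.guide_attn = nn.MultiheadAttention(d_model, nhead)
        self.source_attn = nn.MultiheadAttention(d_model, nhead)
        self.ff = nn.Sequential(
            nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, tgt, guidance_mem, source_mem):
        # All tensors are (seq_len, batch, d_model).
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt)[0])
        # First attend to the guidance signal to locate important regions...
        x = self.norms[1](x + self.guide_attn(x, guidance_mem, guidance_mem)[0])
        # ...then attend to the full source with guidance-aware queries.
        x = self.norms[2](x + self.source_attn(x, source_mem, source_mem)[0])
        return self.norms[3](x + self.ff(x))
```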
As MedicalSum focuses on summarizing medical data, we create a medical guidance signal from all the words with a medical meaning (as they are written in the input text). We expect this signal to benefit the model because a guidance signal formed as a set of individual keywords $\{w_1, \dots, w_n\}$ can help the model focus on specific desired aspects of the input (Dou et al., 2021). We chose to identify medical entities with UMLS because it is a compendium of many biomedical vocabularies (e.g., MeSH (Dhammi and Kumar, 2014), ICD-10 (WHO, 2004)) and thus contains all the major standardized clinical terminologies.
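For illustration, the extraction step could look like the following sketch, assuming MedCAT's model-pack API and output schema (the model pack path and example transcript are placeholders):

```python
from medcat.cat import CAT

# Load a UMLS-enabled MedCAT model pack (path is a placeholder).
cat = CAT.load_model_pack("umls_model_pack.zip")

def guidance_keywords(transcript: str) -> list[str]:
    """Return the medical terms, as written in the input text, that MedCAT
    links to UMLS concepts, in their original order of appearance."""
    entities = cat.get_entities(transcript)["entities"]
    spans = sorted(entities.values(), key=lambda e: e["start"])
    return [e["source_value"] for e in spans]

print(guidance_keywords("Patient reports chest pain and shortness of breath."))
# e.g. ['chest pain', 'shortness of breath']
```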

Semantic Type Embeddings
We introduce a new embedding matrix $S \in \mathbb{R}^{D_s \times d}$ into the input layer, where $d$ is the transformer hidden dimension and $D_s = 50$ is the number of UMLS semantic types used by our model. Each row of $S$ represents a unique UMLS semantic type with which a word can be identified.
To incorporate the $S$ embedding matrix into the input embedding layer, all the words with a clinical meaning defined in UMLS are identified and their corresponding semantic type is extracted. With the semantic type embedding, the input vector for each word $w_j$ is updated to:

$$E w_j + S^{\top} s_{w_j} + p^{(j)}$$

where $s_{w_j} \in \mathbb{R}^{D_s}$ is a 1-hot vector corresponding to the semantic type of the medical word $w_j$, and $p^{(j)} \in \mathbb{R}^{d}$ is the position embedding of the $j$-th token in the sentence. Finally, $E \in \mathbb{R}^{d \times D}$ is the token embedding matrix, where $D$ is the size of the model's vocabulary and $w_j \in \mathbb{R}^{D}$ is a 1-hot vector corresponding to the $j$-th input token. The semantic type vector is set to a zero-filled vector for words that do not have a clinical meaning.
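This update can be sketched as a small PyTorch module; the names are illustrative, with semantic type index 0 reserved for non-medical words so that their semantic contribution is the zero vector:

```python
import torch
import torch.nn as nn

class MedicalInputEmbedding(nn.Module):
    """Sketch of the augmented input layer: token embedding + semantic type
    embedding + position embedding. Index 0 of the semantic type table is
    pinned to zeros (padding_idx=0) for non-medical words; the D_s UMLS
    semantic types occupy indices 1..D_s."""

    def __init__(self, vocab_size, d_model=512, num_sem_types=50, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)                    # E
        self.sem = nn.Embedding(num_sem_types + 1, d_model, padding_idx=0)  # S
        self.pos = nn.Embedding(max_len, d_model)                       # p

    def forward(self, token_ids, sem_type_ids):
        # token_ids, sem_type_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.sem(sem_type_ids) + self.pos(positions)
```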

Medical Weighted Loss Function
We update the loss function of the summarization task to provide a stronger incentive to correctly predict medical words. Our summarization model uses the cross-entropy loss of the Fairseq library (Ott et al., 2019) for the target word $x_t$ at each timestep $t$; we modify it to a weighted loss in which medical words receive a higher weight. Specifically, the summarization loss is updated to:

$$\mathcal{L} = -\sum_{t} w_t \log p(x_t \mid x_{<t})$$

where $w_t = 1$ for all non-medical words and $w_t = 1 + \alpha$ for all medical words, with $\alpha$ an additional weight value for these words.
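A minimal sketch of this weighted objective follows. The boolean mask marking medical target tokens is assumed to be precomputed (e.g., from the UMLS entity matches), and the weighted-mean reduction is an assumption, as Fairseq's internals differ:

```python
import torch
import torch.nn.functional as F

def medical_weighted_loss(logits, targets, is_medical, alpha=0.01):
    """Weighted cross-entropy: each target token's loss is scaled by
    w_t = 1 + alpha if the token has a medical meaning, w_t = 1 otherwise.

    logits:     (batch, seq_len, vocab_size) decoder output logits
    targets:    (batch, seq_len) gold token ids
    is_medical: (batch, seq_len) boolean mask over the target tokens
    """
    token_loss = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"
    )  # (batch, seq_len)
    weights = 1.0 + alpha * is_medical.float()
    return (weights * token_loss).sum() / weights.sum()
```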

Discussion
Previous work (UmlsBERT (Michalopoulos et al., 2021)) introduced a semantic type embedding for medical words that could be tokenized into a single token. Our semantic type signal extends this to all medical words (i.e., multi-token words). Our medical guidance signal is also the first attempt to 'guide' a summarization model by combining the dual-encoder architecture with structured medical information. Finally, our loss function, which assigns a different weight to medical terms, has not been used in prior work.

Results
We report the results of comparing our proposed MedicalSum model with the baseline pointer-generator model (Enarvi-PG) (Enarvi et al., 2020).
We also experiment with a model that contains only the guidance signal (MedicalSum-guidance), a model that only includes the semantic type embedding (MedicalSum-semantic), and a model with only the medical weighted loss function (MedicalSum-loss). These models are trained for a maximum of 20k steps using the Fairseq library (Ott et al., 2019) on PyTorch 1.5.0, on a V100 GPU with 32 GB of system RAM, on Ubuntu 18.04.3 LTS.

Hyperparameter tuning
We provide the search strategy and the bounds for each hyperparameter: the batch size is set to either 4 or 8, and the α parameter of the medical weighted loss is tested with the values 0.01, 0.1, and 0.2. The best values are chosen based on validation set micro ROUGE-1 F1, using the scoring code (with the same settings) provided with the family medicine dataset. For Enarvi-PG, MedicalSum, and the models with each individual medical signal, the batch size is set to 4 and the medical weighted loss parameter to 0.01. We run our models with three different random seeds and report the average scores and standard deviations. We compare the models on the ROUGE-1 F1 score (unigram overlap) and ROUGE-L F1 score (longest common subsequence) between the reference summary and the output of the model.
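For reference, a ROUGE-1/ROUGE-L F1 comparison of this kind can be computed with a standard scorer such as the rouge-score package; our experiments use the scoring code shipped with the family medicine dataset, which is not public, so this is only an illustrative substitute:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

# Placeholder reference and model output for illustration.
reference = "patient presents with hypertension and was advised lifestyle changes"
generated = "patient presents with hypertension advised lifestyle modification"

scores = scorer.score(reference, generated)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)
```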

Summarization model comparison
The mean and standard deviation of ROUGE-1 F1 and ROUGE-L F1 for all the competing models on the test set of each dataset are reported in Table 2 (we also provide the results on the validation set in Appendix A.2). MedicalSum outperforms the Enarvi-PG baseline on all the datasets. It achieves an improvement between 0.8% (on the publicly available radiology dataset) and 2.1% (on the PE section, where the ROUGE-1 improvement from 66.11 to 68.22 is a 6.2% reduction in error). These results indicate that the combination of all three previously mentioned medical signals can indeed boost the performance of a medical summarization model. We also provide a qualitative review of summaries produced by each model variant in Appendix A.1, where we observe that MedicalSum can generate clinical notes with desirable medical terms that are missing from the output of the baseline Enarvi-PG model. MedicalSum-semantic, MedicalSum-loss, and the Enarvi-PG baseline have similar running times (117K seconds for the family medicine dataset and 64K seconds for the radiology dataset); MedicalSum and MedicalSum-guidance are slower (by 4%) due to the second 'guidance' encoder. We chose to compare our model with the Enarvi-PG model (Enarvi et al., 2020) because it has achieved state-of-the-art results on a similar medical summarization dataset. In addition, in their experimental setup, the authors compared their model with other summarization models, such as that of See et al. (2017), and showed that it outperformed them in the task of medical summarization.
We did not repeat the experiments with different splits in order to remain consistent with the literature: for both datasets, the splits were provided by the teams that created them, and creating new splits would not allow a fair comparison with other (current and future) research models tested on these datasets. However, we ran each model multiple times (with different random seeds) and we provide the average scores and standard deviations for the test and validation sets, to ensure that the improvement was not due to the random seed.

Ablation Study
To understand the effect that each medical signal has on model performance, we conduct an ablation study comparing three variations of the MedicalSum model, each of which has access to only one of the medical signals. The results of this comparison are listed in Table 2.
We observe that for every dataset, MedicalSum achieves its best performance when all the medical signals are available, and every model that has access to any of the medical signals outperforms the baseline model. The guidance signal (MedicalSum-guidance) has the most positive effect, as it can guide the model to the most important sections of each input. Enriching the input embedding with semantic information (MedicalSum-semantic) also boosts the performance of the model, as it forces the embeddings of words associated with the same semantic type to become more similar in the embedding space. The medical weighted loss (MedicalSum-loss) yields the smallest improvement, but it still outperforms the baseline.

Conclusion and Future Work
In this paper, we presented MedicalSum, a novel approach for medical summarization. MedicalSum provides external medical guidance that helps key information pass through the model's decision process and appear in the summary. Furthermore, its novel weighted loss function provides a stronger incentive for the model to correctly predict words with a medical meaning. MedicalSum also creates more meaningful input embeddings by forcing the embeddings of words associated with the same semantic type to become more similar. Our analysis shows that these features allow MedicalSum to produce more accurate AI-generated medical documentation. Future work includes examining additional guidance signals (e.g., relational triples) and exploring UMLS hierarchical associations.
This work is the first to show how external medical domain knowledge (UMLS) can effectively improve the performance of a medical note-generation model. Leveraging external knowledge may become an important component of scaling and improving future medical AI systems that automatically generate medical documentation to combat physician burnout and improve patient care.

Limitations
In this paper, we presented MedicalSum, a novel medical conversation summarization model that achieves state-of-the-art ROUGE score improvements by integrating structured medical knowledge into the summarization process of a contextual word embedding model. However, one obstacle to adopting such a model in any system lies in the computing cost of training. For example, our MedicalSum model was trained on a V100 GPU with 32 GB of system RAM on Ubuntu 18.04.3 LTS, and we acknowledge that investing in these types of computational resources is not a viable option for many research groups, let alone regular healthcare providers. Another limitation of our work is that it relies on the existence of an external medical metathesaurus (UMLS), and thus our model may not be easily adapted to other languages for which a detailed medical database (such as the UMLS for the English language) does not exist.

Ethical Consideration
Medical note generation by abstractive summarization is crucial for reducing physician burnout caused by the vast documentation requirements of electronic health records (EHR). Traditionally, clinical professionals review clinical documents and manually create the appropriate summaries by following specific guidelines. Models such as MedicalSum could help reduce physician burnout and enable physicians to devote more quality time and attention to their patients.
However, we need to be aware of the risks of over-relying on any automatic abstractive summarization model. No matter how accurate a summarization model is, it can still omit key information or hallucinate unsupported information. This is especially concerning in the medical domain, as inaccuracies could have a significant adverse effect on future patient health outcomes. We therefore believe that any automatic summarization model should only be used to assist, not replace, trained clinical professionals.

Figure 1: Distinct outputs from the baseline model and the MedicalSum model, with formatting tokens removed. MedicalSum generates a clinical summary that contains relevant medical facts.

Figure 2: Illustration of MedicalSum, a transformer sequence-to-sequence model with a pointer-generator and a guidance mechanism.