Wen-wai Yim


pdf bib
An Investigation of Evaluation Methods in Automatic Medical Note Generation
Asma Ben Abacha | Wen-wai Yim | George Michalopoulos | Thomas Lin
Findings of the Association for Computational Linguistics: ACL 2023

Recent studies on automatic note generation have shown that doctors can save significant amounts of time when using automatic clinical note generation (Knoll et al., 2022). Summarization models have been used for this task to generate clinical notes as summaries of doctor-patient conversations (Krishna et al., 2021; Cai et al., 2022). However, assessing which model would best serve clinicians in their daily practice is still a challenging task due to the large set of possible correct summaries, and the potential limitations of automatic evaluation metrics. In this paper we study evaluation methods and metrics for the automatic generation of clinical notes from medical conversation. In particular, we propose new task-specific metrics and we compare them to SOTA evaluation metrics in text summarization and generation, including: (i) knowledge-graph embedding-based metrics, (ii) customized model-based metrics with domain-specific weights, (iii) domain-adapted/fine-tuned metrics, and (iv) ensemble metrics. To study the correlation between the automatic metrics and manual judgments, we evaluate automatic notes/summaries by comparing the system and reference facts and computing the factual correctness, and the hallucination and omission rates for critical medical facts. This study relied on seven datasets manually annotated by domain experts. Our experiments show that automatic evaluation metrics can have substantially different behaviors on different types of clinical notes datasets. However, the results highlight one stable subset of metrics as the most correlated with human judgments with a relevant aggregation of different evaluation criteria.

pdf bib
An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters
Asma Ben Abacha | Wen-wai Yim | Yadan Fan | Thomas Lin
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Medical doctors spend on average 52 to 102 minutes per day writing clinical notes from their patient encounters (Hripcsak et al., 2011). Reducing this workload calls for relevant and efficient summarization methods. In this paper, we introduce new resources and empirical investigations for the automatic summarization of doctor-patient conversations in a clinical setting. In particular, we introduce the MTS-Dialog dataset; a new collection of 1,700 doctor-patient dialogues and corresponding clinical notes. We use this new dataset to investigate the feasibility of this task and the relevance of existing language models, data augmentation, and guided summarization techniques. We compare standard evaluation metrics based on n-gram matching, contextual embeddings, and Fact Extraction to assess the accuracy and the factual consistency of the generated summaries. To ground these results, we perform an expert-based evaluation using relevant natural language generation criteria and task-specific criteria such as critical omissions, and study the correlation between the automatic metrics and expert judgments. To the best of our knowledge, this study is the first attempt to introduce an open dataset of doctor-patient conversations and clinical notes, with detailed automated and manual evaluations of clinical note generation.

pdf bib
Overview of the MEDIQA-Chat 2023 Shared Tasks on the Summarization & Generation of Doctor-Patient Conversations
Asma Ben Abacha | Wen-wai Yim | Griffin Adams | Neal Snider | Meliha Yetisgen
Proceedings of the 5th Clinical Natural Language Processing Workshop

Automatic generation of clinical notes from doctor-patient conversations can play a key role in reducing daily doctors’ workload and improving their interactions with the patients. MEDIQA-Chat 2023 aims to advance and promote research on effective solutions through shared tasks on the automatic summarization of doctor-patient conversations and on the generation of synthetic dialogues from clinical notes for data augmentation. Seventeen teams participated in the challenge and experimented with a broad range of approaches and models. In this paper, we describe the three MEDIQA-Chat 2023 tasks, the datasets, and the participants’ results and methods. We hope that these shared tasks will lead to additional research efforts and insights on the automatic generation and evaluation of clinical notes.


pdf bib
Towards Automating Medical Scribing : Clinic Visit Dialogue2Note Sentence Alignment and Snippet Summarization
Wen-wai Yim | Meliha Yetisgen
Proceedings of the Second Workshop on Natural Language Processing for Medical Conversations

Medical conversations from patient visits are routinely summarized into clinical notes for documentation of clinical care. The automatic creation of clinical note is particularly challenging given that it requires summarization over spoken language and multiple speaker turns; as well, clinical notes include highly technical semi-structured text. In this paper, we describe our corpus creation method and baseline systems for two NLP tasks, clinical dialogue2note sentence alignment and clinical dialogue2note snippet summarization. These two systems, as well as other models created from such a corpus, may be incorporated as parts of an overall end-to-end clinical note generation system.


pdf bib
Alignment Annotation for Clinic Visit Dialogue to Clinical Note Sentence Language Generation
Wen-wai Yim | Meliha Yetisgen | Jenny Huang | Micah Grossman
Proceedings of the Twelfth Language Resources and Evaluation Conference

For every patient’s visit to a clinician, a clinical note is generated documenting their medical conversation, including complaints discussed, treatments, and medical plans. Despite advances in natural language processing, automating clinical note generation from a clinic visit conversation is a largely unexplored area of research. Due to the idiosyncrasies of the task, traditional methods of corpus creation are not effective enough approaches for this problem. In this paper, we present an annotation methodology that is content- and technique- agnostic while associating note sentences to sets of dialogue sentences. The sets can further be grouped with higher order tags to mark sets with related information. This direct linkage from input to output decouples the annotation from specific language understanding or generation strategies. Here we provide data statistics and qualitative analysis describing the unique annotation challenges. Given enough annotated data, such a resource would support multiple modeling methods including information extraction with template language generation, information retrieval type language generation, or sequence to sequence modeling.


pdf bib
Automatic rubric-based content grading for clinical notes
Wen-wai Yim | Ashley Mills | Harold Chun | Teresa Hashiguchi | Justin Yew | Bryan Lu
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)

Clinical notes provide important documentation critical to medical care, as well as billing and legal needs. Too little information degrades quality of care; too much information impedes care. Training for clinical note documentation is highly variable, depending on institutions and programs. In this work, we introduce the problem of automatic evaluation of note creation through rubric-based content grading, which has the potential for accelerating and regularizing clinical note documentation training. To this end, we describe our corpus creation methods as well as provide simple feature-based and neural network baseline systems. We further provide tagset and scaling experiments to inform readers of plausible expected performances. Our baselines show promising results with content point accuracy and kappa values at 0.86 and 0.71 on the test set.


pdf bib
Annotation of pain and anesthesia events for surgery-related processes and outcomes extraction
Wen-wai Yim | Dario Tedesco | Catherine Curtin | Tina Hernandez-Boussard
BioNLP 2017

Pain and anesthesia information are crucial elements to identifying surgery-related processes and outcomes. However pain is not consistently recorded in the electronic medical record. Even when recorded, the rich complex granularity of the pain experience may be lost. Similarly, anesthesia information is recorded using local electronic collection systems; though the accuracy and completeness of the information is unknown. We propose an annotation schema to capture pain, pain management, and anesthesia event information.


pdf bib
In-depth annotation for patient level liver cancer staging
Wen-wai Yim | Sharon Kwan | Meliha Yetisgen
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis