2023
An Investigation of Evaluation Methods in Automatic Medical Note Generation
Asma Ben Abacha | Wen-wai Yim | George Michalopoulos | Thomas Lin
Findings of the Association for Computational Linguistics: ACL 2023
Recent studies on automatic note generation have shown that doctors can save significant amounts of time when using automatic clinical note generation (Knoll et al., 2022). Summarization models have been used for this task to generate clinical notes as summaries of doctor-patient conversations (Krishna et al., 2021; Cai et al., 2022). However, assessing which model would best serve clinicians in their daily practice is still a challenging task due to the large set of possible correct summaries, and the potential limitations of automatic evaluation metrics. In this paper we study evaluation methods and metrics for the automatic generation of clinical notes from medical conversation. In particular, we propose new task-specific metrics and we compare them to SOTA evaluation metrics in text summarization and generation, including: (i) knowledge-graph embedding-based metrics, (ii) customized model-based metrics with domain-specific weights, (iii) domain-adapted/fine-tuned metrics, and (iv) ensemble metrics. To study the correlation between the automatic metrics and manual judgments, we evaluate automatic notes/summaries by comparing the system and reference facts and computing the factual correctness, and the hallucination and omission rates for critical medical facts. This study relied on seven datasets manually annotated by domain experts. Our experiments show that automatic evaluation metrics can have substantially different behaviors on different types of clinical notes datasets. However, the results highlight one stable subset of metrics as the most correlated with human judgments with a relevant aggregation of different evaluation criteria.
An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters
Asma Ben Abacha | Wen-wai Yim | Yadan Fan | Thomas Lin
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Medical doctors spend on average 52 to 102 minutes per day writing clinical notes from their patient encounters (Hripcsak et al., 2011). Reducing this workload calls for relevant and efficient summarization methods. In this paper, we introduce new resources and empirical investigations for the automatic summarization of doctor-patient conversations in a clinical setting. In particular, we introduce the MTS-Dialog dataset, a new collection of 1,700 doctor-patient dialogues and corresponding clinical notes. We use this new dataset to investigate the feasibility of this task and the relevance of existing language models, data augmentation, and guided summarization techniques. We compare standard evaluation metrics based on n-gram matching, contextual embeddings, and Fact Extraction to assess the accuracy and the factual consistency of the generated summaries. To ground these results, we perform an expert-based evaluation using relevant natural language generation criteria and task-specific criteria such as critical omissions, and study the correlation between the automatic metrics and expert judgments. To the best of our knowledge, this study is the first attempt to introduce an open dataset of doctor-patient conversations and clinical notes, with detailed automated and manual evaluations of clinical note generation.
2022
MedicalSum: A Guided Clinical Abstractive Summarization Model for Generating Medical Reports from Patient-Doctor Conversations
George Michalopoulos | Kyle Williams | Gagandeep Singh | Thomas Lin
Findings of the Association for Computational Linguistics: EMNLP 2022
We introduce MedicalSum, a transformer-based sequence-to-sequence architecture for summarizing medical conversations by integrating medical domain knowledge from the Unified Medical Language System (UMLS). The novel knowledge augmentation is performed in three ways: (i) introducing a guidance signal that consists of the medical words in the input sequence, (ii) leveraging semantic type knowledge in UMLS to create clinically meaningful input embeddings, and (iii) making use of a novel weighted loss function that provides a stronger incentive for the model to correctly predict words with a medical meaning. By applying these three strategies, MedicalSum takes clinical knowledge into consideration during the summarization process and achieves state-of-the-art ROUGE score improvements of 0.8-2.1 points (including 6.2% ROUGE-1 error reduction in the PE section) when producing medical summaries of patient-doctor conversations.
2020
Generating Medical Reports from Patient-Doctor Conversations Using Sequence-to-Sequence Models
Seppo Enarvi | Marilisa Amoia | Miguel Del-Agua Teba | Brian Delaney | Frank Diehl | Stefan Hahn | Kristina Harris | Liam McGrath | Yue Pan | Joel Pinto | Luca Rubini | Miguel Ruiz | Gagandeep Singh | Fabian Stemmer | Weiyi Sun | Paul Vozila | Thomas Lin | Ranjani Ramamurthy
Proceedings of the First Workshop on Natural Language Processing for Medical Conversations
We discuss automatic creation of medical reports from ASR-generated patient-doctor conversational transcripts using an end-to-end neural summarization approach. We explore both recurrent neural network (RNN) and Transformer-based sequence-to-sequence architectures for summarizing medical conversations. We have incorporated enhancements to these architectures, such as the pointer-generator network that facilitates copying parts of the conversations to the reports, and a hierarchical RNN encoder that makes RNN training three times faster with long inputs. A comparison of the relative improvements from the different model architectures over an oracle extractive baseline is provided on a dataset of 800k orthopedic encounters. Consistent with observations in literature for machine translation and related tasks, we find the Transformer models outperform RNN in accuracy, while taking less than half the time to train. Significantly large wins over a strong oracle baseline indicate that sequence-to-sequence modeling is a promising approach for automatic generation of medical reports, in the presence of data at scale.
2012
Mining Entity Types from Query Logs via User Intent Modeling
Patrick Pantel | Thomas Lin | Michael Gamon
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Entity Linking at Web Scale
Thomas Lin | Mausam | Oren Etzioni
Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX)
No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities
Thomas Lin | Mausam | Oren Etzioni
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
2010
Identifying Functional Relations in Web Text
Thomas Lin | Mausam | Oren Etzioni
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Machine Reading at the University of Washington
Hoifung Poon | Janara Christensen | Pedro Domingos | Oren Etzioni | Raphael Hoffmann | Chloe Kiddon | Thomas Lin | Xiao Ling | Mausam | Alan Ritter | Stefan Schoenmackers | Stephen Soderland | Dan Weld | Fei Wu | Congle Zhang
Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading