From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes

Karen Zhou; John Michael Giorgi; Pranav Mani; Peng Xu; Davis Liang; Chenhao Tan

doi:10.18653/v1/2025.emnlp-industry.104

Abstract

AI-generated clinical notes are increasingly used in healthcare, but evaluating their quality remains a challenge due to high subjectivity and limited scalability of expert review. Existing automated metrics often fail to align with real-world physician preferences. To address this, we propose a pipeline that systematically distills real user feedback into structured checklists for note evaluation. These checklists are designed to be interpretable, grounded in human feedback, and enforceable by LLM-based evaluators. Using deidentified data from over 21,000 clinical encounters (prepared in accordance with the HIPAA safe harbor standard) from a deployed AI medical scribe system, we show that our feedback-derived checklist outperforms a baseline approach in our offline evaluations in coverage, diversity, and predictive power for human ratings. Extensive experiments confirm the checklist’s robustness to quality-degrading perturbations, significant alignment with clinician preferences, and practical value as an evaluation methodology. In offline research settings, our checklist offers a practical tool for flagging notes that may fall short of our defined quality standards.

Anthology ID: 2025.emnlp-industry.104
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month: November
Year: 2025
Address: Suzhou (China)
Editors: Saloni Potdar, Lina Rojas-Barahona, Sebastien Montella
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 1485–1499
URL: https://aclanthology.org/2025.emnlp-industry.104/
DOI: 10.18653/v1/2025.emnlp-industry.104
Cite (ACL): Karen Zhou, John Michael Giorgi, Pranav Mani, Peng Xu, Davis Liang, and Chenhao Tan. 2025. From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1485–1499, Suzhou (China). Association for Computational Linguistics.
Cite (Informal): From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes (Zhou et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-industry.104.pdf
