Med-CoDE: Medical Critique based Disagreement Evaluation Framework

Mohit Gupta, Akiko Aizawa, Rajiv Ratn Shah


Abstract
The emergence of large language models (LLMs) has significantly influenced numerous fields, including healthcare, by enhancing the ability of automated systems to process and generate human-like text. Despite these advances, the reliability and accuracy of LLMs in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to assess LLM performance comprehensively, which creates potential risks in clinical settings. In this work, we propose Med-CoDE, an evaluation framework designed specifically for medical LLMs to address these challenges. The framework uses a critique-based approach to quantitatively measure the degree of disagreement between model-generated responses and established medical ground truths, capturing both accuracy and reliability in medical settings. It aims to fill the existing gap in LLM assessment by offering a systematic method for evaluating the quality and trustworthiness of medical LLMs. Through extensive experiments and case studies, we illustrate the practicality of our framework in providing a comprehensive and reliable evaluation of medical LLMs.
Anthology ID:
2025.naacl-srw.11
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Month:
April
Year:
2025
Address:
Albuquerque, USA
Editors:
Abteen Ebrahimi, Samar Haider, Emmy Liu, Sammar Haider, Maria Leonor Pacheco, Shira Wein
Venues:
NAACL | WS
Publisher:
Association for Computational Linguistics
Pages:
112–119
URL:
https://aclanthology.org/2025.naacl-srw.11/
Cite (ACL):
Mohit Gupta, Akiko Aizawa, and Rajiv Ratn Shah. 2025. Med-CoDE: Medical Critique based Disagreement Evaluation Framework. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 112–119, Albuquerque, USA. Association for Computational Linguistics.
Cite (Informal):
Med-CoDE: Medical Critique based Disagreement Evaluation Framework (Gupta et al., NAACL 2025)
PDF:
https://aclanthology.org/2025.naacl-srw.11.pdf