Beyond Lexical Similarity: Evaluating Faithfulness in LLM-Based Medical Question Reformulation

Md Rabiul Hasan; Aleka Melese Ayalew; Mourad Oussalah

Abstract

Medical query rewriting transforms verbose consumer health questions into concise clinical queries, a critical step in health information retrieval. Large language models (LLMs) perform well on this task by standard metrics, yet a high ROUGE or BERTScore does not guarantee that clinical content is preserved. To address this issue, we introduce MedFaith-F1, a category-level faithfulness metric over four clinically salient categories: diagnoses, medications, procedures, and follow-up intent. We further propose EKG-RAG (Evidence and Knowledge-Grounded Retrieval-Augmented Generation), a framework that combines hybrid retrieval over PubMed and MedlinePlus resources with ontology grounding aligned to the Unified Medical Language System (UMLS). Evaluating LLaMA-3 and Qwen2.5 across zero-shot, few-shot, and QLoRA settings on the MeQSum and medical question-pair (MQP) datasets, we find that base models exhibit category-level hallucination rates exceeding 40%, which standard metrics do not reveal, while EKG-RAG with QLoRA reduces this rate to 26.75% and achieves a MedFaith-F1 of 0.73. Our findings call for faithfulness-aware evaluation in clinical query rewriting, and MedFaith-F1 provides a reproducible step in that direction.
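
The abstract does not spell out how a category-level faithfulness score is computed, so the following is only an illustrative sketch and not the authors' definition of MedFaith-F1. It assumes that clinical items have already been extracted from the source question and from the rewrite, grouped by the four categories named above, and it scores each category with a set-overlap F1 before macro-averaging. The function names, the category keys, the empty-category convention, and the example items are all hypothetical.

```python
from typing import Dict, Set

# Category keys are an assumption; the abstract names the categories only in prose.
CATEGORIES = ("diagnoses", "medications", "procedures", "follow_up_intent")


def category_f1(source: Set[str], rewrite: Set[str]) -> float:
    """Set-overlap F1 for one category.

    Assumption: two empty sets count as perfect agreement (1.0).
    """
    if not source and not rewrite:
        return 1.0
    overlap = len(source & rewrite)
    if overlap == 0:
        return 0.0
    precision = overlap / len(rewrite)
    recall = overlap / len(source)
    return 2 * precision * recall / (precision + recall)


def category_faithfulness_f1(source: Dict[str, Set[str]],
                             rewrite: Dict[str, Set[str]]) -> float:
    """Macro-average of the per-category F1 scores (an assumed aggregation)."""
    scores = [category_f1(source.get(c, set()), rewrite.get(c, set()))
              for c in CATEGORIES]
    return sum(scores) / len(scores)


def hallucination_rate(source: Dict[str, Set[str]],
                       rewrite: Dict[str, Set[str]]) -> float:
    """Share of items in the rewrite that have no counterpart in the source."""
    total = 0
    unsupported = 0
    for c in CATEGORIES:
        items = rewrite.get(c, set())
        total += len(items)
        unsupported += len(items - source.get(c, set()))
    return unsupported / total if total else 0.0


# Hypothetical example: the rewrite swaps the medication but keeps everything else.
source_items = {
    "diagnoses": {"migraine"},
    "medications": {"sumatriptan"},
    "procedures": set(),
    "follow_up_intent": {"dosage"},
}
rewrite_items = {
    "diagnoses": {"migraine"},
    "medications": {"ibuprofen"},
    "procedures": set(),
    "follow_up_intent": {"dosage"},
}

print(category_faithfulness_f1(source_items, rewrite_items))
print(hallucination_rate(source_items, rewrite_items))
```

In this toy case a rewrite can share most of its wording with the source while still replacing the one medication that matters, which is the kind of error a category-level score is meant to surface and a lexical metric can miss.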

Anthology ID: 2026.smm4h-1.16
Volume: Proceedings of the 11th Social Media Mining for Health Research and Applications (SMM4H-HeaRD 2026) Workshop and Shared Tasks
Month: July
Year: 2026
Address: San Diego, United States
Editors: Guillermo Lopez-Garcia, Graciela Gonzalez-Hernandez
Venues: SMM4H | WS
Publisher: Association for Computational Linguistics
Pages: 93–102
URL: https://aclanthology.org/2026.smm4h-1.16/
Cite (ACL): Md Rabiul Hasan, Aleka Melese Ayalew, and Mourad Oussalah. 2026. Beyond Lexical Similarity: Evaluating Faithfulness in LLM-Based Medical Question Reformulation. In Proceedings of the 11th Social Media Mining for Health Research and Applications (SMM4H-HeaRD 2026) Workshop and Shared Tasks, pages 93–102, San Diego, United States. Association for Computational Linguistics.
Cite (Informal): Beyond Lexical Similarity: Evaluating Faithfulness in LLM-Based Medical Question Reformulation (Hasan et al., SMM4H 2026)
PDF: https://aclanthology.org/2026.smm4h-1.16.pdf
