Yujan Ting
2025
LEAF: Learning and Evaluation Augmented by Fact-Checking to Improve Factualness in Large Language Models
Hieu Tran | Junda Wang | Yujan Ting | Hong Yu | Weijing Huang | Terrence Chen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) often struggle with factual accuracy in knowledge-intensive domains like healthcare. We introduce LEAF (Learning and Evaluation Augmented by Fact-Checking), a framework for improving LLM factuality in medical question answering. LEAF comprises three components: (1) RAFE, a robust fact-checking system using open-source LLMs and domain-specific retrieval to evaluate response accuracy; (2) Fact-Check-then-RAG, which leverages fact-checking results to guide retrieval without parameter updates; and (3) Learning from Fact Check, enabling self-training through supervised fine-tuning or preference-based learning using fact-checking as pseudo-labels. Experimental results show that RAFE outperforms Factcheck-GPT in detecting inaccuracies, Fact-Check-then-RAG effectively corrects errors, and Learning from Fact Check improves performance without labeled data. In a real-world healthcare deployment with proprietary medical documents, LEAF achieved an 83% improvement in factuality scores, demonstrating practical applicability for adapting general-purpose LLMs to organization-specific knowledge. Our framework provides a scalable solution for industrial applications requiring high factual accuracy.
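The abstract's Fact-Check-then-RAG component can be pictured as a check-then-retrieve loop. The sketch below is only an illustration of that idea under assumed interfaces (the `llm`, `retriever`, and `fact_checker` objects and their methods are hypothetical names, not the paper's actual API).

```python
# Illustrative sketch of a Fact-Check-then-RAG loop. Object names and
# interfaces are assumptions for exposition, not the LEAF implementation.

def fact_check_then_rag(question, llm, retriever, fact_checker, max_rounds=2):
    """Generate an answer, fact-check it, and regenerate with retrieved
    evidence for any claims flagged as unsupported (no parameter updates)."""
    answer = llm.generate(question)
    for _ in range(max_rounds):
        # The fact-checker labels each claim in the answer as supported or
        # unsupported against a domain-specific corpus.
        verdicts = fact_checker.check(question, answer)
        flagged = [claim for claim, supported in verdicts if not supported]
        if not flagged:
            break  # all claims supported; no further retrieval needed
        # Use the unsupported claims as retrieval queries, then regenerate
        # the answer conditioned on the retrieved passages.
        evidence = [p for claim in flagged for p in retriever.search(claim, k=3)]
        answer = llm.generate(question, context=evidence)
    return answer
```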
Training Medical QA Models Based on Mixed Rewards from Multiple-Choice and Open-Ended Questions
Yue Qiu | Yujan Ting | Pei Dong | Terrence Chen | Weijing Huang
Findings of the Association for Computational Linguistics: EMNLP 2025
Reinforcement learning (RL) for large language models (LLMs) typically requires clear reward signals, which are often unavailable for open-ended (OE) questions, where answer evaluation is ambiguous without scalable expert labeling. We investigate whether LLMs benefit from training on mixed data with varying reward clarity. Our approach combines multiple-choice questions (MCQs), which offer clear binary rewards, with OE questions, for which we use simpler, potentially noisy rewards such as Jaccard similarity or LLM-based evaluators. We hypothesize that MCQs can stabilize training when mixed with OE questions. Our experiments show this mixed-data approach consistently improves medical question-answering performance across model scales.
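As a concrete illustration of the mixed-reward idea, one minimal reward function could score MCQs with a binary exact-match reward and OE answers with token-level Jaccard similarity. The snippet below is a sketch under that assumption, not the paper's exact reward design.

```python
# Sketch of a mixed reward: binary exact-match for MCQs, token-level
# Jaccard similarity for open-ended (OE) answers. Illustrative only.

def mixed_reward(question_type, prediction, reference):
    if question_type == "mcq":
        # Clear binary reward: 1 if the chosen option matches the answer key.
        return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0
    # Noisier partial-credit reward for OE answers: Jaccard overlap of token sets.
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    if not pred_tokens and not ref_tokens:
        return 1.0
    return len(pred_tokens & ref_tokens) / len(pred_tokens | ref_tokens)


# Example: an MCQ gets 0 or 1, while an OE answer gets a graded score.
print(mixed_reward("mcq", "B", "B"))                    # 1.0
print(mixed_reward("oe", "metformin lowers glucose",
                   "metformin reduces blood glucose"))  # 0.4
```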