Training Medical QA Models Based on Mixed Rewards from Multiple-Choice and Open-Ended Questions

Yue Qiu, Yujan Ting, Pei Dong, Terrence Chen, Weijing Huang


Abstract
Reinforcement learning (RL) for large language models (LLMs) typically requires clear reward signals, which are often unavailable for open-ended (OE) questions, where answer evaluation is ambiguous without scalable expert labeling. We investigate whether LLMs benefit from training on mixed data with varying reward clarity. Our approach combines multiple-choice questions (MCQs), which offer clear binary rewards, with OE questions, for which we use simpler, potentially noisy rewards such as Jaccard similarity or LLM-based evaluators. We hypothesize that MCQs can stabilize training when mixed with OE questions. Our experiments show that this mixed-data approach consistently improves medical question-answering performance across model scales.
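
The reward design described in the abstract can be illustrated with a minimal sketch: a binary exact-match reward for MCQs and a token-level Jaccard similarity reward for OE answers. The function names, signatures, and whitespace tokenization below are illustrative assumptions, not the authors' implementation; the LLM-based evaluator mentioned in the abstract as an alternative OE reward is not shown here.

```python
def jaccard_similarity(prediction: str, reference: str) -> float:
    """Token-level Jaccard similarity between a predicted and a reference answer."""
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    if not pred_tokens and not ref_tokens:
        return 1.0
    return len(pred_tokens & ref_tokens) / len(pred_tokens | ref_tokens)


def mixed_reward(question_type: str, prediction: str, reference: str) -> float:
    """Scalar reward for an RL update, depending on question type.

    MCQ: clear binary reward (exact match on the selected option).
    OE:  simpler, potentially noisy reward (here, Jaccard similarity).
    """
    if question_type == "mcq":
        return 1.0 if prediction.strip().upper() == reference.strip().upper() else 0.0
    if question_type == "open_ended":
        return jaccard_similarity(prediction, reference)
    raise ValueError(f"Unknown question type: {question_type}")
```
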
Anthology ID:
2025.findings-emnlp.463
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
8721–8729
URL:
https://aclanthology.org/2025.findings-emnlp.463/
Cite (ACL):
Yue Qiu, Yujan Ting, Pei Dong, Terrence Chen, and Weijing Huang. 2025. Training Medical QA Models Based on Mixed Rewards from Multiple-Choice and Open-Ended Questions. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 8721–8729, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Training Medical QA Models Based on Mixed Rewards from Multiple-Choice and Open-Ended Questions (Qiu et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.463.pdf
Checklist:
2025.findings-emnlp.463.checklist.pdf