Self-Training Large Language and Vision Assistant for Medical Question Answering

Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, Zhiqiang Tao


Abstract
Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, progress in medical image understanding and reasoning depends critically on high-quality visual instruction data, which is costly and labor-intensive to obtain, particularly in the medical domain. To mitigate this data scarcity, we introduce Self-Training Large Language and Vision Assistant for Medical (STLLaVA-Med). The proposed method trains a policy model (an LVLM) that auto-generates medical visual instruction data to improve data efficiency, guided by Direct Preference Optimization (DPO). Specifically, a larger and more powerful LVLM (e.g., GPT-4o) acts as a biomedical expert that oversees the DPO fine-tuning process on the auto-generated data, encouraging the policy model to align efficiently with human preferences. We validate the efficacy and data efficiency of STLLaVA-Med on three major medical Visual Question Answering (VQA) benchmarks, demonstrating competitive zero-shot performance while using only 9% of the medical data.
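For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-example DPO loss, not the authors' implementation. It assumes the summed log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") answer are already available from both the policy model and a frozen reference model; the function name, argument names, and the default beta of 0.1 are illustrative assumptions, and the abstract does not state which DPO variant or hyperparameters the paper uses.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    All arguments except beta are sequence log-probabilities (floats).
    beta scales how strongly the policy is pushed away from the reference.
    """
    # How much more (or less) likely each answer is under the policy
    # than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # The loss is -log(sigmoid(logits)); it shrinks as the policy favors
    # the chosen answer over the rejected one relative to the reference.
    logits = beta * (chosen_margin - rejected_margin)

    # Numerically stable evaluation of -log(sigmoid(x)) = log(1 + exp(-x)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and reference agree on both answers, the loss equals log 2; it falls below that value once the policy assigns relatively more probability to the chosen answer. In the setting the abstract describes, the chosen and rejected labels would come from the expert LVLM's judgment of the auto-generated data.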
Anthology ID:
2024.emnlp-main.1119
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
20052–20060
URL:
https://aclanthology.org/2024.emnlp-main.1119
Cite (ACL):
Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, and Zhiqiang Tao. 2024. Self-Training Large Language and Vision Assistant for Medical Question Answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20052–20060, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Self-Training Large Language and Vision Assistant for Medical Question Answering (Sun et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.1119.pdf