Ruyi Ouyang
2024
Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
Junying Chen | Chi Gui | Ruyi Ouyang | Anningzhe Gao | Shunian Chen | Guiming Hardy Chen | Xidong Wang | Zhenyang Cai | Ke Ji | Xiang Wan | Benyou Wang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed’s large-scale, de-identified medical image-text pairs to address these limitations, they often fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an ‘unblinded’ capacity to denoise and reformat the data, resulting in the creation of the **PubMedVision** dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM **HuatuoGPT-Vision**, which shows superior performance in medical multimodal scenarios among open-source MLLMs. Our code and data are available at https://github.com/FreedomIntelligence/HuatuoGPT-Vision.
Co-authors
- Junying Chen 1
- Chi Gui 1
- Anningzhe Gao 1
- Shunian Chen 1
- Guiming Hardy Chen 1