Multi-modal Concept Alignment Pre-training for Generative Medical Visual Question Answering

Quan Yan, Junwen Duan, Jianxin Wang


Abstract
Medical Visual Question Answering (Med-VQA) seeks to answer queries about medical images accurately, a task that is particularly challenging for open-ended questions. This study introduces the Multi-modal Concept Alignment Pre-training (MMCAP) approach for generative Med-VQA, leveraging a knowledge graph built from medical image-caption datasets and the Unified Medical Language System. MMCAP fuses visual and textual medical knowledge via a graph attention network and a transformer decoder. It additionally incorporates a Type Conditional Prompt in the fine-tuning phase, markedly improving the accuracy and relevance of answers to open-ended questions. Experiments on benchmark datasets show that MMCAP outperforms existing methods, remains effective in data-limited settings, and aligns medical knowledge with images well.
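The sketch below illustrates the kind of pipeline the abstract describes: concept embeddings from a knowledge graph are encoded with graph attention, concatenated with image features as decoder memory, and a type-conditional prompt is prepended before a transformer decoder generates the answer. It is a minimal PyTorch sketch under assumed dimensions and module names (GraphAttentionLayer, MMCAPSketch, type_prompts are hypothetical), not the authors' released implementation.

```python
# Minimal sketch of an MMCAP-style fusion, assuming hypothetical names/shapes;
# not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionLayer(nn.Module):
    """Single-head, GAT-style attention over concept-node embeddings."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, nodes, adj):
        # nodes: (N, dim) concept embeddings; adj: (N, N) adjacency (with self-loops)
        h = self.proj(nodes)
        n = h.size(0)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)], dim=-1
        )
        scores = F.leaky_relu(self.attn(pairs)).squeeze(-1)
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=-1)   # attention over graph neighbors
        return F.elu(alpha @ h)


class MMCAPSketch(nn.Module):
    """Fuses image features with graph-encoded concepts, then decodes an answer."""

    def __init__(self, dim=256, vocab=30522, n_types=4, prompt_len=4):
        super().__init__()
        self.gat = GraphAttentionLayer(dim)
        # One learned prompt per question type (e.g. modality, organ, abnormality).
        self.type_prompts = nn.Parameter(torch.randn(n_types, prompt_len, dim))
        self.token_emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, img_feats, concepts, adj, question_ids, q_type):
        # img_feats: (B, P, dim) image patch features; question_ids: (B, L); q_type: (B,)
        graph = self.gat(concepts, adj).unsqueeze(0).expand(img_feats.size(0), -1, -1)
        memory = torch.cat([img_feats, graph], dim=1)      # knowledge-image fusion
        prompt = self.type_prompts[q_type]                 # (B, prompt_len, dim)
        tgt = torch.cat([prompt, self.token_emb(question_ids)], dim=1)
        return self.lm_head(self.decoder(tgt, memory))     # logits over the vocabulary
```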
Anthology ID:
2024.findings-acl.319
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5378–5389
URL:
https://aclanthology.org/2024.findings-acl.319
Cite (ACL):
Quan Yan, Junwen Duan, and Jianxin Wang. 2024. Multi-modal Concept Alignment Pre-training for Generative Medical Visual Question Answering. In Findings of the Association for Computational Linguistics ACL 2024, pages 5378–5389, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Multi-modal Concept Alignment Pre-training for Generative Medical Visual Question Answering (Yan et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.319.pdf