MedINST: Meta Dataset of Biomedical Instructions

Wenhan Han, Meng Fang, Zihan Zhang, Yu Yin, Zirui Song, Ling Chen, Mykola Pechenizkiy, Qingyu Chen


Abstract
The integration of large language model (LLM) techniques in the field of medical analysis has brought about significant advancements, yet the scarcity of large, diverse, and well-annotated datasets remains a major challenge. Medical data and tasks, which vary in format, size, and other parameters, require extensive preprocessing and standardization for effective use in training LLMs. To address these challenges, we introduce MedINST, the Meta Dataset of Biomedical Instructions, a novel multi-domain, multi-task instructional meta-dataset. MedINST comprises 133 biomedical NLP tasks and over 7 million training samples, making it the most comprehensive biomedical instruction dataset to date. Using MedINST as the meta dataset, we curate MedINST32, a challenging benchmark with different task difficulties aiming to evaluate LLMs’ generalization ability. We fine-tune several LLMs on MedINST and evaluate on MedINST32, showcasing enhanced cross-task generalization.
Anthology ID:
2024.findings-emnlp.482
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
8221–8240
Language:
URL:
https://aclanthology.org/2024.findings-emnlp.482
DOI:
Bibkey:
Cite (ACL):
Wenhan Han, Meng Fang, Zihan Zhang, Yu Yin, Zirui Song, Ling Chen, Mykola Pechenizkiy, and Qingyu Chen. 2024. MedINST: Meta Dataset of Biomedical Instructions. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8221–8240, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
MedINST: Meta Dataset of Biomedical Instructions (Han et al., Findings 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.findings-emnlp.482.pdf