Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models
Haonan He | Yuchen Ren | Yining Tang | Ziyang Xu | Junxian Li | Minghao Yang | Di Zhang | Yuan Dong | Tao Chen | Shufei Zhang | Yuqiang Li | Nanqing Dong | Wanli Ouyang | Dongzhan Zhou | Peng Ye
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) have shown remarkable capabilities in general domains, but their application to multi-omics biology remains underexplored. To address this gap, we introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences, covering DNA, RNA, proteins, and multi-molecules. This dataset bridges LLMs and complex biological sequence-related tasks, enhancing their versatility and reasoning while maintaining conversational fluency. We also highlight significant limitations of current state-of-the-art LLMs on multi-omics tasks without specialized training. To overcome this, we propose ChatMultiOmics, a strong baseline with a novel three-stage training pipeline, demonstrating superior biological understanding through Biology-Instructions. Both resources are publicly available, paving the way for better integration of LLMs in multi-omics analysis. Biology-Instructions is available at: https://github.com/hhnqqq/Biology-Instructions.