Geneverse: A Collection of Open-source Multimodal Large Language Models for Genomic and Proteomic Research

Tianyu Liu, Yijia Xiao, Xiao Luo, Hua Xu, Wenjin Zheng, Hongyu Zhao


Abstract
Large language models (LLMs) are promising for biomedical and healthcare research. Although open-source LLMs trained on a wide range of biomedical data are available, research on applying LLMs to genomics and proteomics remains limited. To fill this gap, we propose Geneverse, a collection of finetuned LLMs and multimodal LLMs (MLLMs) for three novel tasks in genomic and proteomic research. The models in Geneverse are trained and evaluated on domain-specific datasets, and we use advanced parameter-efficient finetuning techniques to adapt the models to tasks including generating descriptions of gene functions, inferring protein function from structure, and selecting marker genes from spatial transcriptomic data. We demonstrate that the adapted LLMs and MLLMs perform well on these tasks and may outperform closed-source large-scale models in our evaluations, which focus on both truthfulness and structural correctness. All of the training strategies and base models we used are freely accessible. Our code is available at https://github.com/HelloWorldLTY/Geneverse.
Anthology ID:
2024.findings-emnlp.277
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4819–4836
URL:
https://aclanthology.org/2024.findings-emnlp.277
Cite (ACL):
Tianyu Liu, Yijia Xiao, Xiao Luo, Hua Xu, Wenjin Zheng, and Hongyu Zhao. 2024. Geneverse: A Collection of Open-source Multimodal Large Language Models for Genomic and Proteomic Research. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4819–4836, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Geneverse: A Collection of Open-source Multimodal Large Language Models for Genomic and Proteomic Research (Liu et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.277.pdf
Data:
 2024.findings-emnlp.277.data.zip