LiGen: Active Lipid Generation via a Molecular Language Model

Ying Zhan; Xiuqi Tang; Yan Zhang; Xiao Tan; Dian Shen; Zhou Yu; Beilun Wang

Abstract

Lipid nanoparticles (LNPs) can deliver cargo to both tumor and immune cells, playing a crucial role in biomedicine. Traditional approaches rely on experimental screening and expert knowledge, which can be costly and time-consuming. Recent methods based on language models have accelerated this process using deep learning. Although these methods can retrieve molecules for fusion or rank candidates from existing libraries, they are still limited by the scope of known formulations. In this work, we propose LiGen, a method that generates lipid molecules efficiently and actively, facilitating the discovery of high-performing LNP formulations. We first train a lipid-specific molecular language model, LiCore, to learn hidden representations of lipid molecules. We then explore the learned latent space to generate improved candidate formulations. This exploration is guided by a trained predictor, which evaluates delivery efficiency and provides directional signals. In reconstruction tasks, LiCore achieves near-perfect reconstruction with a low invalid ratio on both the LNP-Virtual900k and LNP-Exp12k datasets. The predictor consistently improves ranking-oriented metrics across multiple cell lines, outperforming the best baselines by an average of 4.1%, 10.8%, and 8.1% in Top-50, Top-10, and Top-5 identification accuracy, respectively. Guided by the predictor, LiGen generates novel lipid candidates that achieve a 30.7% improvement over baseline methods on average, with some samples exceeding 50% improvement.
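
The idea of predictor-guided exploration of a latent space can be illustrated with a minimal sketch. This is not the authors' implementation: the quadratic `predictor`, its weights `w`, the optimum `center`, and the function `guided_search` are all hypothetical stand-ins, chosen only so that the gradient (the "directional signal") can be written analytically. In a real system the score would presumably come from a trained model, and a decoder would map each latent vector back to a molecule.

```python
# Toy illustration only: a hypothetical quadratic "predictor" stands in for a
# trained delivery-efficiency model. Nothing here is LiGen's actual code.

def predictor(z, w, center):
    """Hypothetical score: highest (zero) when z equals `center`."""
    return -sum(wi * (zi - ci) ** 2 for zi, wi, ci in zip(z, w, center))


def predictor_grad(z, w, center):
    """Analytic gradient of the toy score with respect to z."""
    return [-2.0 * wi * (zi - ci) for zi, wi, ci in zip(z, w, center)]


def guided_search(z0, w, center, step=0.1, n_steps=50):
    """Gradient ascent in latent space; returns a list of (z, score) pairs."""
    z = list(z0)
    trajectory = [(list(z), predictor(z, w, center))]
    for _ in range(n_steps):
        grad = predictor_grad(z, w, center)
        # Move the latent vector in the direction that raises the predicted score.
        z = [zi + step * gi for zi, gi in zip(z, grad)]
        trajectory.append((list(z), predictor(z, w, center)))
    return trajectory


# Example: start from an arbitrary latent point and follow the predictor.
start = [1.0, -2.0, 0.5]
weights = [1.0, 0.5, 2.0]    # assumed per-dimension sensitivities
optimum = [0.0, 1.0, -1.0]   # assumed best latent point under the toy score
path = guided_search(start, weights, optimum)
```

With these assumed values the step size is small enough that the score rises at every step, so the last entry of `path` has a higher predicted score than the starting point.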

Anthology ID: 2026.acl-long.392
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 8674–8686
URL: https://aclanthology.org/2026.acl-long.392/
Cite (ACL): Ying Zhan, Xiuqi Tang, Yan Zhang, Xiao Tan, Dian Shen, Zhou Yu, and Beilun Wang. 2026. LiGen: Active Lipid Generation via a Molecular Language Model. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8674–8686, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): LiGen: Active Lipid Generation via a Molecular Language Model (Zhan et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.392.pdf
Checklist: 2026.acl-long.392.checklist.pdf
