Enhancing Cross Text-Molecule Learning by Self-Augmentation

Yinuo Jiang, Xiang Zhuang, Keyan Ding, Qiang Zhang, Huajun Chen


Abstract
The development of Large Language Models (LLMs) has greatly advanced drug discovery, driven by the belief that natural language can enhance human control over molecule design. However, the scarcity of high-quality labeled data remains a challenge for cross text-molecule learning: existing datasets are limited because precise molecule-description pairs are difficult to collect. Although recent efforts have used LLM-generated pseudo data for augmentation, the LLMs’ lack of specialized chemistry knowledge and the absence of an effective selector for high-quality data may introduce noise into the annotations, compromising the models’ robustness. To address these challenges, this paper introduces a novel framework that interweaves model fine-tuning and data augmentation. The approach is iterative: the model plays a dual role, annotating unlabeled data and sampling a subset of high-quality data, until convergence is achieved, which enhances the model’s understanding and adaptability. Additionally, a new dataset called SAPubChem-41 is presented, comprising meticulously curated, high-quality parallel molecule-description pairs designed specifically for fine-tuning. This research contributes to the field by addressing the need for high-quality datasets and presenting an effective framework for cross text-molecule learning.
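The iterative procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`self_augment`, `toy_fine_tune`), the score threshold `min_score`, the rule of removing selected molecules from the pool, and the stopping rule (stop when no new pair passes the threshold, used here as a stand-in for convergence) are all assumptions made for illustration. The toy model only mimics the dual role of annotating unlabeled molecules and scoring its own annotations.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                         # (molecule, description)
Annotator = Callable[[str], Tuple[str, float]]  # molecule -> (description, score)


def self_augment(
    labeled: List[Pair],
    unlabeled: List[str],
    fine_tune: Callable[[List[Pair]], Annotator],
    min_score: float = 0.6,   # hypothetical quality threshold
    max_rounds: int = 5,      # hypothetical safety cap on iterations
) -> List[Pair]:
    """Alternate fine-tuning and self-annotation until no new pair is selected."""
    train = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        if not pool:
            break
        annotate = fine_tune(train)                         # role 1: (re)train the model
        scored = [(mol, *annotate(mol)) for mol in pool]    # annotate unlabeled data
        selected = [(mol, desc) for mol, desc, score in scored
                    if score >= min_score]                  # role 2: keep high-scoring pairs
        if not selected:                                    # assumed convergence criterion
            break
        train.extend(selected)
        chosen = {mol for mol, _ in selected}
        pool = [mol for mol in pool if mol not in chosen]
    return train


def toy_fine_tune(train: List[Pair]) -> Annotator:
    """Toy stand-in for a model: scores a molecule by character overlap with training molecules."""
    seen = set("".join(mol for mol, _ in train))

    def annotate(mol: str) -> Tuple[str, float]:
        score = sum(ch in seen for ch in mol) / len(mol)
        return f"pseudo-description of {mol}", score

    return annotate


augmented = self_augment(
    labeled=[("CCO", "ethanol")],
    unlabeled=["CCC", "CCN", "c1ccccc1"],
    fine_tune=toy_fine_tune,
)
```

In this toy run the two molecules that resemble the labeled example are added with pseudo-descriptions, while the dissimilar one stays unlabeled because its score never reaches the threshold. A real system would replace `toy_fine_tune` with actual model fine-tuning and a learned quality score.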
Anthology ID:
2024.findings-acl.569
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
9551–9565
URL:
https://aclanthology.org/2024.findings-acl.569
Cite (ACL):
Yinuo Jiang, Xiang Zhuang, Keyan Ding, Qiang Zhang, and Huajun Chen. 2024. Enhancing Cross Text-Molecule Learning by Self-Augmentation. In Findings of the Association for Computational Linguistics ACL 2024, pages 9551–9565, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Enhancing Cross Text-Molecule Learning by Self-Augmentation (Jiang et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.569.pdf