Knowlab’s Submission to L+M Shared Task: All you need is continued pretraining of chemistry texts even for molecule captioning

Yunsoo Kim, Honghan Wu


Abstract
This paper presents our submission to the L+M-24 shared task, which focuses on translating molecular structures into natural language descriptions, known as the molecule captioning task. We selected a small language model (SLM), Phi-3-mini-4k, to evaluate the impact of continued pretraining and instruction tuning on domain-specific chemical knowledge. The Phi-3 model underwent continued pretraining on 90M chemistry textbooks and abstracts, followed by instruction tuning on 150K question-answering sets covering SMILES and general chemistry knowledge. Although the continued pretraining phase included no direct exposure to SMILES representations, it significantly improved the Phi-3 model's performance on the molecule captioning task, yielding a 300% increase in BLEU score. The code and model are released at https://github.com/bluesky333/Phi3KnowChem to facilitate research in chemical small language modeling.
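The abstract describes a two-stage recipe: continued pretraining on raw chemistry text, then instruction tuning on question-answering sets. The sketch below illustrates only the first stage with the Hugging Face transformers Trainer, assuming the public base checkpoint microsoft/Phi-3-mini-4k-instruct and a hypothetical local corpus file chemistry_corpus.txt; the hyperparameters are illustrative and are not taken from the paper or the authors' released code.

```python
# Minimal continued-pretraining sketch (causal LM objective on raw chemistry text).
# Assumptions: corpus file path, batch size, learning rate, and epochs are illustrative only.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "microsoft/Phi-3-mini-4k-instruct"  # base model family used in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Hypothetical plain-text corpus of chemistry textbooks and abstracts.
corpus = load_dataset("text", data_files={"train": "chemistry_corpus.txt"})

def tokenize(batch):
    # Truncate to the model's 4k context window.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False gives the standard next-token (causal LM) objective used for continued pretraining.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="phi3-chem-cpt",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    learning_rate=1e-5,
    bf16=True,
    logging_steps=100,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```

The same Trainer setup can then be reused for the instruction-tuning stage by swapping the raw-text corpus for prompt-response pairs formatted with the model's chat template.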
Anthology ID: 2024.langmol-1.11
Volume: Proceedings of the 1st Workshop on Language + Molecules (L+M 2024)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Carl Edwards, Qingyun Wang, Manling Li, Lawrence Zhao, Tom Hope, Heng Ji
Venues: LangMol | WS
Publisher: Association for Computational Linguistics
Pages: 91–96
URL: https://aclanthology.org/2024.langmol-1.11
Cite (ACL): Yunsoo Kim and Honghan Wu. 2024. Knowlab’s Submission to L+M Shared Task: All you need is continued pretraining of chemistry texts even for molecule captioning. In Proceedings of the 1st Workshop on Language + Molecules (L+M 2024), pages 91–96, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Knowlab’s Submission to L+M Shared Task: All you need is continued pretraining of chemistry texts even for molecule captioning (Kim & Wu, LangMol-WS 2024)
PDF: https://aclanthology.org/2024.langmol-1.11.pdf