Enhanced BioT5+ for Molecule-Text Translation: A Three-Stage Approach with Data Distillation, Diverse Training, and Voting Ensemble

Qizhi Pei, Lijun Wu, Kaiyuan Gao, Jinhua Zhu, Rui Yan


Abstract
This paper presents our enhanced BioT5+ method for the Language + Molecules shared task at the ACL 2024 Workshop. The task involves “translating” between molecules and natural language, including molecule captioning and text-based molecule generation using the L+M-24 dataset. Our method consists of three stages. In the first stage, we distill data from various models. In the second stage, combined with extra version of the provided dataset, we train diverse models for subsequent voting ensemble.We also adopt Transductive Ensemble Learning (TEL) to enhance these base models. Lastly, all models are integrated using a voting ensemble method. Experimental results demonstrate that BioT5+ achieves superior performance on L+M-24 dataset. On the final leaderboard, our method (team name: qizhipei) ranks first in the text-based molecule generation task and second in the molecule captioning task, highlighting its efficacy and robustness in translating between molecules and natural language. The pre-trained BioT5+ models are available at https://github.com/QizhiPei/BioT5.
Anthology ID:
2024.langmol-1.6
Volume:
Proceedings of the 1st Workshop on Language + Molecules (L+M 2024)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Carl Edwards, Qingyun Wang, Manling Li, Lawrence Zhao, Tom Hope, Heng Ji
Venues:
LangMol | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
48–54
Language:
URL:
https://aclanthology.org/2024.langmol-1.6
DOI:
10.18653/v1/2024.langmol-1.6
Bibkey:
Cite (ACL):
Qizhi Pei, Lijun Wu, Kaiyuan Gao, Jinhua Zhu, and Rui Yan. 2024. Enhanced BioT5+ for Molecule-Text Translation: A Three-Stage Approach with Data Distillation, Diverse Training, and Voting Ensemble. In Proceedings of the 1st Workshop on Language + Molecules (L+M 2024), pages 48–54, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Enhanced BioT5+ for Molecule-Text Translation: A Three-Stage Approach with Data Distillation, Diverse Training, and Voting Ensemble (Pei et al., LangMol-WS 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.langmol-1.6.pdf