MolXPT: Wrapping Molecules with Text for Generative Pre-training

Generative pre-trained Transformer (GPT) models have demonstrated great success in natural language processing, and related techniques have been adapted to molecular modeling. Considering that text is the most important record for scientific discovery, in this paper, we propose MolXPT, a unified language model of text and molecules pre-trained on SMILES (a sequence representation of molecules) wrapped by text. Briefly, we detect the molecule names in each sequence and replace them with the corresponding SMILES. In this way, the SMILES can leverage the information from the surrounding text, and vice versa. The wrapped sequences, text sequences from PubMed and SMILES sequences from PubChem are all fed into a language model for pre-training. Experimental results demonstrate that MolXPT outperforms strong baselines of molecular property prediction on MoleculeNet, performs comparably to the best model in text-molecule translation while using less than half of its parameters, and enables zero-shot molecular generation without finetuning.


Introduction
Generative pre-trained Transformers (GPT), like GPT-3 (Brown et al., 2020) and ChatGPT (OpenAI, 2022), have achieved great success in natural language processing. They usually have billions of parameters and are trained on large corpora (Taylor et al., 2022; Singhal et al., 2022). Witnessing their power, people have started transferring language models to the chemical (Bagal et al., 2022) and biological (Ferruz et al., 2022) domains. For example, a small molecule (e.g., an oral drug) can be represented by the simplified molecular-input line-entry system (SMILES) (Weininger, 1988), which is a sequence obtained by traversing the molecular graph with depth-first search plus several rules for branching, aromaticity, etc. After serializing molecules, people pre-train language models on SMILES (Bagal et al., 2022; Tong et al., 2021; Frey et al., 2022) and obtain promising results for molecular generation.
Text is the most important record for molecular science and, more generally, scientific discovery (Beltagy et al., 2019). It describes detailed properties of molecules, such as how to synthesize a molecule (Feng et al., 2016) and whether it is toxic (Juurlink et al., 2003). BioGPT (Luo et al., 2022) and PubMedGPT (Bolton et al., 2022) are two language models trained on biomedical literature. Recently, a new trend is to jointly model SMILES and scientific text so as to obtain shared representations across the two modalities. MolT5 (Edwards et al., 2022) is a T5-like (Raffel et al., 2020) model, where several spans of the text/SMILES are masked in the encoder and must be reconstructed by the decoder. Galactica (Taylor et al., 2022) is a GPT-like (Brown et al., 2020) model pre-trained on various types of inputs, such as text, SMILES and protein sequences. Although those models demonstrate progress on prediction and generation tasks, they do not explicitly leverage the relation between molecules and text. The intuition is that when a molecule name appears in a sentence of scientific literature, the surrounding context is likely a description of that molecule. This is useful information for joint training but is ignored by those models.
To leverage such relations, in this work, we propose a novel molecule-text language model (MolXPT), which is trained on "wrapped" sequences: given a sentence, we detect the molecule names with named entity recognition tools and, if any, replace them with the corresponding SMILES, obtaining a "wrapped" sequence of SMILES and text.

Figure 1: Framework of MolXPT. MolXPT is pre-trained on text from PubMed, SMILES from PubChem and wrapped sequences of SMILES and text. The wrapped sequences are obtained by applying NER and entity linking to text and then replacing matched molecular mentions with their SMILES. MolXPT can be finetuned for various text and molecular downstream tasks, such as molecular property prediction and molecule-text translation.
We pre-train a 24-layer MolXPT (with 350M parameters) on 8M wrapped sequences, as well as 30M SMILES from PubChem and 30M titles and abstracts from PubMed (a popular biomedical literature search engine).
After pre-training, we finetune MolXPT on MoleculeNet (a benchmark for molecular property prediction) (Wu et al., 2018) and molecule-text translation (Edwards et al., 2022) using prompt-based finetuning. On MoleculeNet, MolXPT outperforms strong baselines with sophisticated designs such as GEM (Fang et al., 2022). On text-molecule translation, MolXPT performs comparably with the state-of-the-art model, MolT5-large (Edwards et al., 2022), which has 800M parameters, while MolXPT uses only 44% of that. We also verify that MolXPT has zero-shot text-to-molecule generation ability.

Our Method
MolXPT is a language model pre-trained on heterogeneous data including scientific text, SMILES sequences, and "wrapped" sequences of SMILES and text. Owing to this flexible input format, it can be finetuned for various text and molecular tasks. The framework of MolXPT is shown in Figure 1.

Pre-training corpus
For scientific text, we use the titles and abstracts of 30M papers from PubMed (https://ftp.ncbi.nlm.nih.gov/pubmed/). For molecular SMILES, we randomly choose 30M molecules from PubChem (https://pubchem.ncbi.nlm.nih.gov/).
The wrapped sequences are constructed via a "detect and replace" pipeline. We first use BERN2 (Sung et al., 2022), a widely used named entity recognition (NER) tool for the biomedical domain, to detect all mentions of molecules and link them to entities in public knowledge bases such as ChEBI (Hastings et al., 2016). We then retrieve the SMILES of the matched entities. Finally, we replace the molecular mentions with their corresponding SMILES. An example is shown in the left panel of Figure 1. Each wrapped sequence must contain at least one molecular SMILES. We eventually obtain 8M wrapped sequences in total.
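To make the pipeline concrete, the following is a minimal, illustrative sketch of the "detect and replace" step. It is not the authors' code: `detect_molecule_mentions` and `lookup_smiles` are hypothetical stand-ins for BERN2-based NER/entity linking and a ChEBI/PubChem SMILES lookup, and the `<som>`/`<eom>` marker strings are assumptions.

```python
from typing import Callable, List, Optional, Tuple

def wrap_sentence(sentence: str,
                  detect_molecule_mentions: Callable[[str], List[Tuple[int, int, str]]],
                  lookup_smiles: Callable[[str], Optional[str]],
                  som: str = "<som>",
                  eom: str = "<eom>") -> Optional[str]:
    """Replace molecule mentions in a sentence with SMILES wrapped by som/eom tokens.

    Returns the wrapped sentence, or None if no mention could be linked
    (wrapped sequences must contain at least one SMILES).
    """
    mentions = detect_molecule_mentions(sentence)  # [(start, end, entity_id), ...]
    pieces, last, replaced = [], 0, 0
    for start, end, entity_id in sorted(mentions):
        smiles = lookup_smiles(entity_id)
        if smiles is None:            # mention could not be linked to a SMILES
            continue
        pieces.append(sentence[last:start])
        pieces.append(f"{som} {smiles} {eom}")
        last = end
        replaced += 1
    pieces.append(sentence[last:])
    return "".join(pieces) if replaced > 0 else None
```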
Text and SMILES are tokenized separately. For text, we use byte-pair encoding (BPE) (Sennrich et al., 2016) to split words into subwords; the number of BPE merge operations is 40k. For SMILES sequences (including those in wrapped sequences), we tokenize them with the regular expression from (Schwaller et al., 2018). For each SMILES sequence S, we add a start-of-molecule token som at the beginning of S and append an end-of-molecule token eom at the end of S.
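The sketch below illustrates this SMILES tokenization step. The regular expression is the one commonly attributed to Schwaller et al. (2018); the angle-bracket form of the som/eom tokens is an assumption for illustration.

```python
import re

# SMILES tokenization regex from Schwaller et al. (2018), reproduced for illustration.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str, som: str = "<som>", eom: str = "<eom>") -> list:
    """Tokenize a SMILES string and add start-/end-of-molecule tokens."""
    return [som] + SMILES_REGEX.findall(smiles) + [eom]

# Example: aspirin
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# ['<som>', 'C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1',
#  'C', '(', '=', 'O', ')', 'O', '<eom>']
```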

Model and training
Model architecture: MolXPT has the same architecture as the GPT models (Radford et al., 2019). Due to computational resource limitations, in this paper we follow the GPT-2 medium configuration with 24 layers, hidden size 1024 and 16 attention heads. The maximum input length is 2048 tokens and the vocabulary size is 44,536. In total, our model has 350M parameters.

Pre-training: The pre-training objective of MolXPT is the negative log-likelihood. Mathematically, let $\mathcal{D}=\{x_i\}_i$ denote the collection of sequences from the three types of data, where $x_i=(s_{i,1}, s_{i,2}, \cdots, s_{i,n_i})$ is the $i$-th sequence with $n_i$ tokens. The training objective is

$$\min\; -\sum_{i}\sum_{j=1}^{n_i} \log P(s_{i,j} \mid s_{i,j-1}, s_{i,j-2}, \cdots, s_{i,1}).$$

Prompt-based finetuning: Downstream tasks usually differ in input-output format from the language-modeling objective used in pre-training (Gu et al., 2022). Therefore, we adopt prompt-based finetuning (Gao et al., 2021) to unify different tasks into a sequence generation task, which is consistent with the pre-training objective. Briefly, given a task, we convert the input and output into text and/or SMILES sequences, equip the sequences with task-specific prompts and finetune with the language modeling loss. Prompts for MoleculeNet and text-molecule translation are introduced in Sections 3.1 and 3.2, respectively.

Discussion: Some other works also jointly model text and molecules. Zeng et al. (2022) propose KV-PLM, where SMILES sequences are appended after molecule names for pre-training. Su et al. (2022) propose MoMu, which aligns molecular graphs with their text descriptions via contrastive learning.
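The following is a minimal sketch of the causal language-modeling (negative log-likelihood) objective above. It assumes `model` maps token ids to next-token logits (e.g., a GPT-2-style decoder); this is illustrative, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def nll_loss(model, token_ids: torch.Tensor, pad_id: int) -> torch.Tensor:
    """token_ids: (batch, seq_len) batches of text, SMILES, or wrapped sequences."""
    logits = model(token_ids[:, :-1])              # predict token j from tokens < j
    targets = token_ids[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,                       # do not penalize padding positions
    )
```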

Experiments
We evaluate MolXPT on two downstream tasks: (1) molecular property prediction on MoleculeNet (Wu et al., 2018), which is to predict whether a given molecule has specific properties; (2) translation between text descriptions and molecules (Edwards et al., 2022), where both molecules and text must be considered. In this section, we focus on introducing the task definitions, prompt design and results, while leaving the detailed finetuning hyper-parameters to Appendix C.

Results on MoleculeNet
MoleculeNet (Wu et al., 2018) is a widely used benchmark for molecular modeling, covering more than 700k compounds and a variety of properties. We choose six molecular classification tasks for evaluation: BBBP, Tox21, ClinTox, HIV, BACE and SIDER. Details are left in Appendix A. We follow GEM (Fang et al., 2022) and split the data into training/validation/test sets based on scaffolds. For these tasks, the input is a SMILES and the output is a binary label. Previous molecular property prediction models mainly use SMILES sequences or molecular graphs as input, while we can use "wrapped" sequences. For example, one task is to predict the blood-brain barrier penetration (BBBP) of a molecule; the prompt is therefore "We can conclude that the BBB penetration of som SMILES eom is true/false". All prompts for the six tasks can be found in Appendix C.1. The finetuning hyper-parameters are summarized in Appendix C.2. We compare MolXPT with two types of baselines: (1) pre-trained language model baselines, including KV-PLM (Zeng et al., 2022), Galactica (Taylor et al., 2022) and MoMu (Su et al., 2022); (2) pre-trained graph neural network (GNN) baselines such as GEM (Fang et al., 2022).
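As an illustration of how such a prompt can be scored at inference time, the sketch below compares the likelihood of the "true" and "false" completions under a causal LM. It assumes a Hugging Face-style model and tokenizer and angle-bracket som/eom tokens; the authors' exact finetuning and scoring procedure may differ.

```python
import torch
import torch.nn.functional as F

def score_label(model, tokenizer, smiles: str, label_word: str) -> float:
    """Log-probability of a label completion ("true"/"false") given the BBBP prompt."""
    prompt = f"We can conclude that the BBB penetration of <som> {smiles} <eom> is"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    label_ids = tokenizer(" " + label_word, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, label_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # logits at positions [P-1, ..., P+L-2] predict the L label tokens
    log_probs = F.log_softmax(logits[0, prompt_ids.size(1) - 1 : -1], dim=-1)
    return log_probs.gather(1, label_ids[0].unsqueeze(1)).sum().item()

# prediction = "true" if score_label(m, t, s, "true") > score_label(m, t, s, "false") else "false"
```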
MolXPT outperforms the GNN baselines pre-trained on pure molecular data, indicating the effectiveness of pre-training with a scientific text corpus. Compared with Galactica, which also uses both SMILES and text to pre-train a GPT-like model, MolXPT obtains better performance. Note that Galactica does not purposely build and train on "wrapped" sequences, whose importance is demonstrated by our empirical results. A possible explanation for the superior performance is that SMILES describes the composition and structure of a molecule, while text describes its general properties. The two are complementary, and joint training on them yields more effective representations.

Results on text-molecule translation
We evaluate MolXPT on CheBI-20 (Edwards et al., 2021), a bidirectional text-molecule translation dataset consisting of 33,010 molecule-description pairs. We use the data split provided by MolT5 (Edwards et al., 2022), where the training, validation and test sets account for 80%, 10% and 10% of the data, respectively. For molecule-to-text generation, given a molecular SMILES S, the prompt is "The description of som S eom is: The molecule is", followed by the text description of S. For text-to-molecule generation, given a text description T, the prompt is "T. The compound is som", and the model generates the molecular SMILES, ending with eom. We compare our method with MolT5 (Edwards et al., 2022).
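The sketch below shows one way the text-to-molecule prompt above could be used for generation. It assumes a Hugging Face-style causal LM whose vocabulary contains the som/eom tokens; the token names, beam size and decoding settings are assumptions, not the authors' exact setup.

```python
def text_to_molecule(model, tokenizer, description: str, max_new_tokens: int = 256) -> str:
    """Generate a SMILES string from a text description, stopping at the eom token."""
    prompt = f"{description} The compound is <som>"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.convert_tokens_to_ids("<eom>"),  # stop at end-of-molecule
        num_beams=5,
    )
    generated = output[0, input_ids.size(1):]            # keep only the generated part
    tokens = tokenizer.convert_ids_to_tokens(generated)
    return "".join(t for t in tokens if t not in ("<som>", "<eom>"))
```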
For molecule-to-text generation, the results are evaluated by NLP metrics including BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005). "Text2Mol" is a deep-learning-based metric proposed by Edwards et al. (2022) to measure the similarity of text-molecule pairs. For text-to-molecule generation, we evaluate the following metrics: the proportion of generated SMILES that exactly match the reference SMILES (denoted as "Exact"); the Tanimoto similarity of three types of fingerprints, MACCS (Durant et al., 2002), RDK (Schneider et al., 2015) and Morgan (Rogers and Hahn, 2010); the FCD score (Preuer et al., 2018), which measures the distance between generated and reference molecules with a pre-trained model; and the percentage of generated SMILES that are valid. The results are reported in Table 2.
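For reference, the fingerprint Tanimoto similarities can be computed with RDKit as sketched below; the Morgan radius and bit size are assumptions, and the evaluation scripts used by MolT5/MolXPT may differ in details.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys, DataStructs

def fingerprint_similarities(smiles_pred: str, smiles_ref: str) -> dict:
    """Tanimoto similarity of MACCS, RDK and Morgan fingerprints for two SMILES."""
    m1, m2 = Chem.MolFromSmiles(smiles_pred), Chem.MolFromSmiles(smiles_ref)
    if m1 is None or m2 is None:                      # invalid SMILES
        return {"MACCS": 0.0, "RDK": 0.0, "Morgan": 0.0}
    return {
        "MACCS": DataStructs.TanimotoSimilarity(
            MACCSkeys.GenMACCSKeys(m1), MACCSkeys.GenMACCSKeys(m2)),
        "RDK": DataStructs.TanimotoSimilarity(
            Chem.RDKFingerprint(m1), Chem.RDKFingerprint(m2)),
        "Morgan": DataStructs.TanimotoSimilarity(
            AllChem.GetMorganFingerprintAsBitVect(m1, radius=2, nBits=2048),
            AllChem.GetMorganFingerprintAsBitVect(m2, radius=2, nBits=2048)),
    }
```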
We observe that MolXPT achieves significantly better performance than MolT5-small and MolT5-base, and comparable performance with MolT5-large. Note that MolT5-large has 800M parameters, while MolXPT uses only 44% of that. For both tasks, our model performs best on the Text2Mol metric, indicating that MolXPT better captures the alignment between text and molecules. We attribute this to the wrapped sequences, through which the model learns the relation between molecules and text explicitly. We further verify the zero-shot text-to-molecule generation ability of MolXPT: the pre-trained MolXPT takes text as input and directly generates molecules without finetuning. The top-1 and top-5 fingerprint similarities are reported in Table 3. Compared with the full-data setting, the performance drops, but the numbers remain reasonable. In addition, zero-shot MolXPT exactly recovers 33 molecules from their text descriptions (see Appendix D).

Conclusions and Future Work
We propose MolXPT, a generative model pre-trained on scientific text, molecular SMILES and their wrapped sequences. We train a 24-layer MolXPT with 350M parameters. With prompt-based finetuning, it improves over strong baselines on MoleculeNet and achieves comparable results with the best model on molecule-text translation while using much fewer parameters.
For future work, first, we will train larger MolXPT models to further verify the performance across different tasks and the zero-shot/in-context learning ability (Xie et al., 2022). Second, we will study how to further enhance the interaction between molecules and text (e.g., using contrastive learning to enforce consistency). Third, effectively adapting MolXPT to other molecule-and-text tasks, such as text-guided molecule optimization, is another direction to explore.

Limitations
One limitation of our method is that training larger models requires more computational resources, whose cost is relatively high. However, after pre-training, we will release our models so that readers can use them directly without pre-training again.

Figure 2: Examples of zero-shot text-to-molecule generation. We randomly pick three cases in which MolXPT successfully generates the reference molecule without finetuning. The input descriptions are: (1) "The molecule is a sesquiterpene lactone and active principle of Feverfew (Tanacetum parthenium). It has a role as a nonsteroidal anti-inflammatory drug, a non-narcotic analgesic, a peripheral nervous system drug, an inhibitor and a drug allergen." (2) "The molecule is the (R)-enantiomer of mevalonic acid. It is a conjugate acid of a (R)-mevalonate. It is an enantiomer of a (S)-mevalonic acid." (3) "The molecule is a bile acid taurine conjugate of ursocholic acid. It has a role as a human metabolite and a rat metabolite. It derives from an ursocholic acid. It is a conjugate acid of a tauroursocholate."

C Finetuning details of downstream tasks

C.1 Prompts for finetuning MoleculeNet
(1) BBBP: "We can conclude that the BBB penetration of som SMILES eom is true/false."
(2) Tox21: "We can conclude that the som SMILES eom activity outcome on target is active/inactive.", where target refers to the corresponding receptor or enzyme for each subtask, e.g., the target of subtask "AR" is "Androgen Receptor".
(3) ClinTox: "We can conclude that the clinical trial toxicity of som SMILES eom is true/false." for subtask CT_TOX, and "We can conclude that the FDA approval status of som SMILES eom is true/false." for subtask FDA_APPROVED.
(4) HIV: "We can conclude that the screening result of ability to inhibit HIV replication of som SMILES eom is active/inactive."
(5) BACE: "We can conclude that the binding result on beta-secretase 1 of som SMILES eom is true/false."
(6) SIDER: "We can conclude that the som SMILES eom can bring about the side effect of side-effect is true/false.", where side-effect refers to the corresponding side effect for each subtask.

C.3 Details of finetuning text-molecule generation
For text-molecule generation, MolXPT is finetuned on one P40 GPU with 1024 tokens per device and 16 gradient accumulation steps. Models are finetuned for 100 epochs. The learning rate is 0.0001 and the dropout rate is grid-searched over [0.1, 0.2, 0.3, 0.4, 0.5]. Dropout rates of 0.4 and 0.5 achieve the best validation performance on molecule-to-text and text-to-molecule generation, respectively. We use the corresponding models for testing.

D Zero-shot text-to-molecule generation
Given $K$ generated molecules $\hat{m}_1, \hat{m}_2, \cdots, \hat{m}_K$ and the reference molecule $m$, the top-$K$ fingerprint similarity is $\max_{i \in [K]} \mathrm{similarity}(m, \hat{m}_i)$.
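As a small illustration of this definition, the sketch below computes the top-K similarity given any pairwise similarity function (for example, one of the fingerprint Tanimoto similarities sketched earlier); the function names are illustrative.

```python
def top_k_similarity(reference, candidates, similarity) -> float:
    """Top-K fingerprint similarity: the best score among the K generated molecules."""
    return max(similarity(reference, c) for c in candidates)
```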
MolXPT generates 33 molecules that exactly match the reference molecules without finetuning. Figure 2 shows three of these cases; the remaining molecules are provided in the supplementary material.