Translation between Molecules and Natural Language

We present MolT5 - a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. MolT5 allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (altogether: translation between molecules and language), which we explore for the first time. Since MolT5 pretrains models on single-modal data, it helps overcome the chemistry domain shortcoming of data scarcity. Furthermore, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. Our results show that MolT5-based models are able to generate outputs, both molecules and captions, which in many cases are high quality.


Introduction
Imagine a future where a doctor can write a few sentences describing a specialized drug for treating a patient and then receive the exact structure of the desired drug. Although this seems like science fiction now, with progress in integrating natural language and molecules, it might well be possible in the future. Historically, drug creation has commonly been done by humans who design and build individual molecules. In fact, bringing a new drug to market can cost over a billion dollars and take over ten years (Gaudelet et al., 2021). Recently, there has been considerable interest in using new deep learning tools to facilitate in silico drug design, a field often called cheminformatics (Rifaioglu et al., 2018). Yet, many of these experiments still focus on molecules and their low-level properties, such as logP (the octanol-water partition coefficient) (Bagal et al., 2021). In the future, we foresee a need for higher-level control over molecule design, which can easily be facilitated by natural language. All resources are publicly available at github.com/blendernlp/MolT5.
In this work, we pursue an ambitious goal of translating between molecules and language by proposing two new tasks: molecule captioning and text-guided de novo molecule generation. In molecule captioning, we take a molecule (e.g., as a SMILES string) and generate a caption that describes it (Figure 2). In text-guided molecule generation, the task is to create a molecule that matches a given natural language description (Figure 1). These new tasks would help to accelerate research in multiple scientific domains by enabling chemistry domain experts to generate new molecules and better understand them using natural language.
While our proposed molecule-language tasks share some similarities with vision-language tasks, they have several inherent difficulties that separate them from existing vision-language analogs: 1) creating annotations for molecules requires significant domain expertise; 2) it is thus significantly more difficult to acquire large numbers of molecule-description pairs; 3) the same molecule can have many functions and thus be described in very different ways, which causes 4) existing evaluation measures based on reference descriptions, such as BLEU, to fail to adequately evaluate these tasks.
To address the issue of data scarcity (i.e., difficulties 1 and 2), we propose a new self-supervised learning framework named MolT5 (Molecular T5) that is inspired by the recent progress in pretraining multilingual models (Devlin et al., 2019; Liu et al., 2020). MolT5 first pretrains a model on a vast amount of unlabeled natural language text and molecule strings using a simple denoising objective. After that, the pretrained model is finetuned on limited gold standard annotations. Furthermore, to adequately evaluate models for molecule captioning or generation, we consider various kinds of metrics and also adopt a new metric based on Text2Mol (Edwards et al., 2021). We repurpose this retrieval model for assessing the similarity between the ground truth molecule/description and the generated description/molecule, respectively.
To the best of our knowledge, there is no prior work on molecule captioning or text-guided molecule generation. The closest existing work to molecule captioning falls within the scope of image captioning (Vinyals et al., 2015). However, molecule captioning is arguably much more challenging due to the increased linguistic variety in possible captions (Figure 2). A molecule could be described with an IUPAC name, with one of many different synthetic routes from known precursor molecules, in terms of its properties (e.g., carcinogenic or lipophilic), in terms of its applications (e.g., a dye, an antipneumonic, or an antifungal), or in terms of its functional groups (e.g., "substituted by hydroxy groups at positions 5 and 7 and a methyl group at position 8"), among other methods.
In summary, our main contributions are:
1. We propose two new tasks: 1) molecule captioning, where a description is generated for a given molecule, and 2) text-based de novo molecule generation, where a molecule is generated to match a given text description.
2. We consider multiple evaluation metrics for these new tasks, and we adopt a new cross-modal retrieval similarity metric based on Text2Mol (Edwards et al., 2021).
3. We propose MolT5: a self-supervised learning framework for jointly training a model on molecule string representations and natural language text, which can then be finetuned on a cross-modal task.

Tasks
With the ambitious goal of bi-directional translation between molecules and language, we propose two novel tasks: molecule captioning (Section 2.1) and text-based molecule generation (Section 2.2).

Molecule Captioning
For any given molecule, the goal of molecule captioning is to describe the molecule and what it does. An example is shown in Figure 2. Molecules are often represented as SMILES strings (Weininger, 1988; Weininger et al., 1989), a linearization of the molecular graph which can be interpreted as a language for molecules. Thus, this task can be considered an exotic translation task, and sequence-to-sequence models serve as excellent baselines.
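To make the representation concrete, here is a small illustrative sketch (not part of the paper's pipeline) that parses a SMILES string with RDKit; aspirin is used purely as an example molecule.

```python
# Illustrative sketch: parsing a SMILES string with RDKit (not part of the paper's pipeline).
from rdkit import Chem

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"   # aspirin, used only as an example
mol = Chem.MolFromSmiles(smiles)       # returns None if the SMILES string is invalid
print(Chem.MolToSmiles(mol))           # canonical SMILES for the same molecular graph
print(mol.GetNumAtoms())               # number of heavy atoms in the parsed graph
```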

Text-Based de Novo Molecule Generation
The goal of the de novo molecule generation task is to train a model which can generate a variety of possible new molecules. Existing work tends to focus on evaluating the model coverage of the chemical space (Polykovskiy et al., 2020). Instead, we propose generating molecules based on a natural language description of the desired molecule; this is essentially swapping the input and output of the captioning task. An example of this task is shown in Figure 1. Recent work such as DALL·E (Ramesh et al., 2021, 2022), which generates images from text, has shown the ability to seamlessly integrate multiple properties, such as chairs and avocados, in an image. This points towards similar applications in the molecule generation domain via the usage of natural language.

Text2Mol Metric
Since we are considering new cross-modal tasks between molecules and text, we also introduce a new cross-modal evaluation metric. This is based on Text2Mol (Edwards et al., 2021), which aims to train a retrieval model to rank molecules given their text descriptions. Since the ranking function uses cosine similarity between embeddings, a trained model can be repurposed for evaluating the similarity between the ground truth molecule/description and the generated description/molecule (respectively). To this end, we first train a base multi-layer perceptron (MLP) model from Text2Mol. This model is then used to generate similarities of the candidate molecule-description pairs, which can be compared to the average similarity of the ground truth molecule-description pairs. We also note that negative molecule-description pairs have an average similarity of roughly zero.
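As a rough sketch of how the metric is applied, the snippet below shows only the cosine-similarity scoring step; `embed_molecule` and `embed_text` are hypothetical stand-ins for the trained Text2Mol encoders and are not defined here.

```python
# Minimal sketch of the Text2Mol-style scoring step. The embedding functions are
# hypothetical placeholders for the trained Text2Mol MLP model.
import numpy as np

def text2mol_score(gt_embedding: np.ndarray, gen_embedding: np.ndarray) -> float:
    """Cosine similarity between a ground-truth embedding and a generated one."""
    return float(np.dot(gt_embedding, gen_embedding) /
                 (np.linalg.norm(gt_embedding) * np.linalg.norm(gen_embedding)))

# For molecule captioning, for example:
#   score = text2mol_score(embed_molecule(gt_smiles), embed_text(generated_caption))
# The dataset-level metric averages this score over all evaluation pairs; negative
# (mismatched) pairs score roughly zero on average.
```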

Evaluating Molecule Captioning
Traditionally, captioning tasks have been evaluated by natural language generation metrics such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005). Unlike captioning tasks such as COCO (Chen et al., 2015), which has several captions per image, in our task we only have one reference caption. This makes these metrics less effective, especially because there are many non-overlapping ways to describe a molecule. Nevertheless, for comparison, we still report these scores (e.g., aggregated sentence-level METEOR scores).
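As a hedged sketch (the paper does not prescribe a particular toolkit), BLEU and aggregated sentence-level METEOR could be computed with NLTK roughly as follows; the caption pair below is invented for illustration.

```python
# Hedged sketch of the captioning metrics using NLTK; the example pair is made up.
import nltk
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # METEOR relies on WordNet

references = [["the molecule is a monocarboxylic acid that is an important metabolite"]]
hypotheses = ["the molecule is a monocarboxylic acid and a human metabolite"]

ref_tok = [[r.split() for r in refs] for refs in references]
hyp_tok = [h.split() for h in hypotheses]

bleu = corpus_bleu(ref_tok, hyp_tok)
# Aggregated sentence-level METEOR (recent NLTK versions expect pre-tokenized input).
meteor = sum(meteor_score(refs, hyp) for refs, hyp in zip(ref_tok, hyp_tok)) / len(hyp_tok)
print(f"BLEU: {bleu:.3f}  METEOR: {meteor:.3f}")
```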

Evaluating Text-Based de Novo Molecule Generation
Considerable interest has grown in applying deep generative models to de novo molecule generation. Because of this, a number of metrics have been proposed, such as novelty and scaffold similarity (Polykovskiy et al., 2020). However, many of these metrics do not apply to our problem: we want our generated molecule to match the input text instead of being generally diverse. Instead, we consider metrics which measure the distance of the generated molecule to either the ground truth molecule or the ground truth description, such as our proposed Text2Mol-based metric.
We employ three fingerprint metrics: MACCS FTS, RDK FTS, and Morgan FTS, where FTS stands for fingerprint Tanimoto similarity (Tanimoto, 1958). MACCS (Durant et al., 2002), RDK (Schneider et al., 2015), and Morgan (Rogers and Hahn, 2010) are each fingerprinting methods for molecules. The fingerprints of two molecules are compared using Tanimoto similarity (also known as the Jaccard index), and the average similarity over the evaluation dataset is reported. See (Campos and Ji, 2021) for more details. We also report exact SMILES string matches, Levenshtein distance (Miller et al., 2009), and SMILES BLEU scores. Preuer et al. (2018) propose Fréchet ChemNet Distance (FCD), which is inspired by the Fréchet Inception Distance (FID) (Heusel et al., 2017). FCD is based on the penultimate layer of a network called "ChemNet", which was trained to predict the activity of drug molecules. Thus, FCD takes into account chemical and biological information about molecules in order to compare them. This allows molecules to be compared based on the latent information required to predict useful properties rather than a string-based metric.
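A minimal sketch of these fingerprint similarity metrics with RDKit is shown below; the Morgan parameters (radius 2, 2048 bits) are illustrative assumptions rather than values specified in the paper.

```python
# Sketch of the fingerprint Tanimoto similarity (FTS) metrics using RDKit.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, MACCSkeys

def fts_metrics(gt_smiles: str, gen_smiles: str):
    gt, gen = Chem.MolFromSmiles(gt_smiles), Chem.MolFromSmiles(gen_smiles)
    if gt is None or gen is None:   # invalid SMILES cannot be fingerprinted
        return None
    return {
        "MACCS FTS": DataStructs.TanimotoSimilarity(
            MACCSkeys.GenMACCSKeys(gt), MACCSkeys.GenMACCSKeys(gen)),
        "RDK FTS": DataStructs.TanimotoSimilarity(
            Chem.RDKFingerprint(gt), Chem.RDKFingerprint(gen)),
        "Morgan FTS": DataStructs.TanimotoSimilarity(
            AllChem.GetMorganFingerprintAsBitVect(gt, 2, nBits=2048),
            AllChem.GetMorganFingerprintAsBitVect(gen, 2, nBits=2048)),
    }

# Dataset-level FTS scores are averages of these similarities over all evaluation pairs.
```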
In the case of models which use SMILES strings, generated molecules can be syntactically invalid. Therefore, we also report validity as the percentage of generated molecules which can be processed by RDKit (Landrum, 2021), as in (Polykovskiy et al., 2020).

Figure 3: A diagram of our framework. We first pretrain MolT5 on a large amount of both SMILES strings and natural language text using the "replace corrupted spans" objective (Raffel et al., 2020). After the pretraining stage, MolT5 can be easily finetuned for either the task of molecule captioning or generation (or both).

MolT5 - Multimodal Text-Molecule Representation Model
We can crawl a massive amount of text from the Internet. For example, Raffel et al. (2020) built a Common Crawl-based dataset that contains over 700 GB of reasonably clean and natural English text. On the other hand, over a billion molecules are also available from public databases such as ZINC-15 (Sterling and Irwin, 2015a). Inspired by the progress in large-scale pretraining (Ramesh et al., 2021), we propose a new self-supervised learning framework named MolT5 (Molecular T5) to leverage the vast amount of unlabeled natural language text and molecule strings.
Figure 3 shows an overview of MolT5. We first initialize an encoder-decoder Transformer model (Vaswani et al., 2017) using one of the public checkpoints of T5.1.1 (available at https://tinyurl.com/t511-ckpts), an improved version of T5 (Raffel et al., 2020). After that, we pretrain the model using the "replace corrupted spans" objective (Raffel et al., 2020). More specifically, during each pretraining step, we sample a minibatch comprising both natural language sequences and SMILES sequences. For each sequence, some words in the sequence are randomly chosen for corruption. Each consecutive span of corrupted tokens is replaced by a sentinel token (shown as [X] and [Y] in Figure 3). The task is then to predict the dropped-out spans. Molecules (e.g., represented as SMILES strings) can be thought of as a language with a very unique grammar. Intuitively, our pretraining stage then essentially trains a single language model on two monolingual corpora from two different languages, with no explicit alignment between the two corpora. This approach is similar to how some multilingual language models such as mBERT (Devlin et al., 2019) and mBART (Liu et al., 2020) were pretrained. As models such as mBERT demonstrate excellent cross-lingual capabilities (Pires et al., 2019), we also expect models pretrained using MolT5 to be useful for text-molecule translation tasks.
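For intuition only, the toy sketch below mimics the span-corruption objective on whitespace- or character-split tokens; the actual corruption rate, span lengths, and tokenization are handled by the pretraining framework and are not reproduced here.

```python
# Toy illustration of the "replace corrupted spans" objective; not the actual
# pretraining code. Sentinel tokens follow the T5 <extra_id_N> convention.
import random

def corrupt_spans(tokens, corruption_rate=0.15, seed=0):
    """Drop random tokens; each consecutive corrupted span becomes one sentinel."""
    random.seed(seed)
    drop = [random.random() < corruption_rate for _ in tokens]
    inputs, targets, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if drop[i]:
            tok = f"<extra_id_{sentinel}>"
            inputs.append(tok)
            targets.append(tok)
            while i < len(tokens) and drop[i]:   # absorb the whole corrupted span
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return " ".join(inputs), " ".join(targets)

# Works identically for natural language and (character-split) SMILES sequences:
print(corrupt_spans("the molecule is an organic disulfide".split()))
print(corrupt_spans(list("CC(=O)Oc1ccccc1C(=O)O")))
```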
After the pretraining process, we can finetune the pretrained model for either molecule captioning or generation (depicted by the bottom half of Figure 3). In molecule generation, the input is a description, and the output is the SMILES representation of the target molecule. On the other hand, in molecule captioning, the input is the SMILES string of some molecule, and the output is a caption describing the input molecule.

Data
Pretraining Data

As described in Section 4, the pretraining stage of MolT5 requires two monolingual corpora: one consisting of natural language text and the other consisting of molecule representations. We use the "Colossal Clean Crawled Corpus" (C4) (Raffel et al., 2020) as the pretraining dataset for the textual modality. For the molecular modality, we directly utilize the 100 million SMILES strings used in Chemformer (Irwin et al., 2021). As these strings were selected from the ZINC-15 dataset (Sterling and Irwin, 2015b), we refer to this pretraining dataset as ZINC from this point on.

Finetuning and Evaluation Data
We use ChEBI-20 (Edwards et al., 2021) as our gold standard dataset for finetuning and evaluation. It consists of 33,010 molecule-description pairs, which are separated into 80/10/10% train/validation/test splits. We use ChEBI-20 to finetune MolT5-based models and to train baseline models. Many captions in ChEBI-20 contain a name for the molecule at the start of the string (e.g., "Rostratin D is an organic disulfide isolated from ..."). To force the models to focus on the semantics of the description, we replace the molecule's name with "The molecule is [...]" (e.g., "The molecule is an organic disulfide isolated from ...").
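A small sketch of this preprocessing step is shown below; splitting on the first " is " is an assumed implementation detail for illustration, not necessarily the exact rule applied to ChEBI-20.

```python
# Hedged sketch of the caption preprocessing: replace the leading molecule name
# with "The molecule is". The split on the first " is " is an assumption.
def anonymize_caption(caption: str) -> str:
    head, sep, tail = caption.partition(" is ")
    if not sep:          # no " is " found; leave the caption unchanged
        return caption
    return "The molecule is " + tail

print(anonymize_caption("Rostratin D is an organic disulfide isolated from ..."))
# -> "The molecule is an organic disulfide isolated from ..."
```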

Baselines
Any sequence-to-sequence model is applicable to our new tasks (i.e., molecule captioning and generation). We implement the following baselines:
1. RNN-GRU (Cho et al., 2014). We implement a 4-layer GRU recurrent neural network. The encoder is bidirectional.
2. Transformer (Vaswani et al., 2017). We train a vanilla Transformer model consisting of six encoder and decoder layers.
We train the baseline models on ChEBI-20 using SMILES representations for the molecules. Molecule captioning is trained with the molecule as input and the text as output; molecule generation is trained with the text as input and the molecule as output. More information about the baselines and the hyperparameters is in the appendix.

Pretraining Process
We first initialize an encoder-decoder Transformer model using a public checkpoint of T5.1.1 (either t5.1.1.small, t5.1.1.base, or t5.1.1.large). We then pretrain the model on the combined dataset of C4 and ZINC (i.e., C4+ZINC) for 1 million steps. Each step uses a batch size of 256, evenly split between text and molecule sequences. After this, we finetune the pretrained model on ChEBI-20 for either molecule captioning or generation. The number of finetuning steps is 50,000.

Molecule Captioning
Table 1 shows the overall molecule captioning results. The pretrained models, either T5 or MolT5, are considerably better at generating realistic language to describe a molecule than the RNN and Transformer baselines. The RNN is more capable of extracting relevant properties from molecules than the Transformer, but it generally produces ungrammatical outputs. On the other hand, the Transformer produces grammatical outputs, but they tend to repeat the same properties, such as carcinogenic, regardless of whether they apply. For this reason, the Text2Mol scores are much lower for the Transformer model, since its outputs match the given molecule much less frequently. We speculate that the ChEBI-20 dataset is too small to effectively train a Transformer without large-scale pretraining. We find that our additional pretraining of MolT5 results in a reasonable increase over T5 in captioning performance on both the traditional NLG metrics and our Text2Mol metric for each model size. Finally, we refer the reader to Section H in the appendix for information about the statistical significance of our results.
Several examples of different models' outputs are shown in Figure 4 and Appendix Figure 9. In (1), MolT5's description matches best, identifying the molecule as a "GDP-L-galactose". MolT5 is usually able to recognize what general class of molecule it is looking at (e.g., cyclohexanone, maleate salt, etc.). In general, all models often look for the closest compound they know and base their caption on that. The argon atom in example (2), with SMILES '[39Ar]', never appears in the training dataset bonded to any other atoms (likely because it is an inert noble gas). All models recognize that (2) is a single atom, but they are unable to describe it. In (3), the models try to caption a histological dye. MolT5 captions the molecule as an azure histological dye, which is very close to the ground truth "brilliant cresyl blue", while T5 does not.

Text-Based de novo Molecule Generation
In the molecule generation task, the pretrained models also perform much better than the RNN and Transformer (Table 2). Although it is well known that scaling model size and pretraining data leads to significant performance increases (Kaplan et al., 2020), it was still surprising to see the results. For example, a default T5 model, which was only pretrained on text data, is capable of generating molecules which are much closer to the ground truth than the RNN and which are often valid. This trend also persists as language model size scales, since T5-large with 770M parameters outperforms the specifically pretrained MolT5-small with 60M parameters. Still, the pretraining in MolT5 slightly improves some molecule generation results, with especially large gains in validity. Finally, Section H in the appendix has information about the statistical significance of our results.
MolT5 is able to understand rare atoms such as ruthenium (5). However, in this case it still misses the atom's charge. Some example descriptions, such as (1), lack details, so the molecules generated by MolT5 may be interesting to investigate.

Probing the Model
We conduct probing tests on the model for certain input properties, which are shown in Appendix J.
Often, the model will generate molecules that it knows match the input description from the finetuning data. It also creates solutions from these by adding various ions (e.g., ". [Na+]"). In some cases, it generates molecules not appearing in the finetuning data (sometimes successfully, sometimes not). For example, given the input "The molecule is a corticosteroid.", the first molecule generated is a well-known corticosteroid called corticosterone.
The fifth molecule generated is not present in the PubChem database. Based on a structure similarity search, it is most closely related to the androgenic steroid Fluoxymesterone and the corticosteroid Hydrocortisone.
Related Work

Multimedia Representation
Much recent work on multimedia representation falls into training large vision-language models (Su et al., 2020; Lu et al., 2019; Chen et al., 2020). CLIP (Radford et al., 2021) trains a zero-shot image classifier by using natural language labels, which can be easily extended. A modification of CLIP's contrastive loss function, which follows (Sohn, 2016), is applied by Text2Mol (Edwards et al., 2021) for retrieval between molecule and text pairs. Edwards et al. (2021) also released the ChEBI-20 dataset of molecule-description pairs, which is used for training and evaluation in this paper. Vall et al. (2021) leverage a contrastive loss between bioassay descriptions and molecules to predict activity between the two. Sun et al. (2021) use cross-modal attention with molecule structures to improve chemical entity typing. Zeng et al. (2022) pretrain a language model to learn a joint representation between molecules and biomedical text via entity linking, which they use for tasks such as relation extraction, molecule property prediction, and cross-modal retrieval like Text2Mol. Unlike our work, they do not explore generating text or molecules. Other related work generates natural language procedures for chemical reactions; however, that natural language generation is constrained to the specific reaction steps in the dataset: the main purpose of such models is to create the steps for a reaction rather than to describe molecules.

Image Captioning and Text-Guided Image Generation
Image captioning has been studied extensively (Pan et al., 2004; Lu et al., 2018; Hossain et al., 2019; Stefanini et al., 2021). Many recent studies tend to pretrain Transformer-based models on massive text-image corpora (Li et al., 2020; Hu et al., 2022). Work has also been done in the biomedical domain (Pavlopoulos et al., 2019), a close cousin of the chemistry domain, where tasks tend to be focused on diagnosis from various image types such as X-rays (Demner-Fushman et al., 2016).

Molecule Representation
Molecule representation has been a long-standing problem in the field of cheminformatics. Traditionally, fingerprinting methods have been a preferred technique to featurize molecule structural representations (Rogers and Hahn, 2010; Cereto-Massagué et al., 2015). These approaches do not allow representations to be learned from data. In recent years, advances in machine learning and NLP have been applied to this problem. A popular input for these algorithms has been SMILES strings (Weininger, 1988; Weininger et al., 1989), which are a computer-readable linearization of molecule graphs. Jaeger et al. (2018) use the Morgan fingerprinting algorithm to convert each molecule into a 'sentence' of its substructures, to which they apply the Word2vec algorithm (Mikolov et al., 2013a,b). Duvenaud et al. (2015) use neural methods to learn fingerprints. Other advances such as BERT (Devlin et al., 2019) have also been applied to the domain, such as MolBERT (Fabian et al., 2020) and ChemBERTa (Chithrananda et al., 2020), which use SMILES strings as inputs to pretrain a BERT-esque model. Work has also been done to use the molecule graph structure and known reactions for learning representations (Wang et al., 2022). Schwaller et al. (2021b) train a BERT model to learn representations of chemical reactions. Schwaller et al. (2021a) leverage unsupervised representation learning with Transformers to extract an organic chemistry grammar. Unlike existing work, MolT5's molecule representations allow for translation between molecules and natural language.

There has been particular interest in training generative models for de novo molecule discovery. Bagal et al. (2021) apply a GPT-style decoder for this task. Lu and Zhang (2022) apply a T5 model to SMILES strings for multitask reaction prediction problems. MegaMolBART trains a BART model on 500M SMILES strings from the ZINC-15 dataset (Sterling and Irwin, 2015b).

Conclusions and Future Work

In this work, we propose MolT5, a self-supervised learning framework for pretraining models on a vast amount of unlabeled text and molecule strings. Furthermore, we propose two new tasks, molecule captioning and text-guided molecule generation, for which we explore various evaluation methods. Together, these tasks allow for translation between natural language and molecules. Using MolT5, we are able to obtain high scores for both tasks.

Broader Impacts
Our proposed model and tasks will have the following broader impacts. 1) They will help to democratize molecular AI, allowing chemistry experts to take advantage of new AI technologies for discovering new life-changing drugs by interacting in natural language, since it is most natural for humans to provide explanations and requirements in natural language. 2) Text-based molecule generation enables generating molecules with specific functions (such as taste) rather than low-level properties, enabling the next generation of chemistry where custom molecules are used for each application. Specifically designed molecular solutions have the potential to revolutionize fields such as medicine and material science. 3) Our models, whose weights we will release, will allow further research in the NLP community on the applications of multimodal text-molecule models.

Risks
MolT5, like other large language models, can potentially be abused. First, there may be biases learned by the model due to its large-scale training data. These biases may affect what type of molecules are generated when the model is prompted about certain diseases. Thus, any molecules discovered by usage of MolT5 should be strictly evaluated by standard clinical processes before being considered for medicinal use. Another risk is that the model may be used to discover potentially dangerous molecules instead of beneficial ones. It is difficult to predict what exact molecules may be discovered via usage of our work. However, while there is this unfortunate potential for misuse of the technology, knowledge of a dangerous molecule's existence and structure is generally not harmful due to the technical knowledge and laboratory resources required to synthesize it in any meaningful quantity. Overall, we believe these downsides are outweighed by the benefits to the research and pharmaceutical communities.

Limitations
Since this work focuses on a new application for large language models, many of the same limitations apply here. Namely, the model is trained on a large dataset collected from the Internet, so it may contain unintended biases. One limitation of our model is its use of SMILES strings; recent work (Krenn et al., 2020) proposes SELFIES, a string representation with validity guarantees. In practice, we found this to work poorly with pretrained T5 checkpoints (which were important from a computational perspective). We also note that some compounds in ChEBI-20 can cause validity problems in the default SELFIES implementation. We leave further investigation of this to future work. Finally, we stress that MolT5 was created for research purposes, and generated molecules should not be used for medical purposes without careful evaluation by standard clinical testing first.
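For reference, the `selfies` package implements the representation proposed by Krenn et al. (2020); a minimal round-trip sketch (with an arbitrary example molecule) looks roughly as follows.

```python
# Hedged sketch of the SELFIES alternative mentioned above; every SELFIES string
# decodes to a syntactically valid molecule.
import selfies as sf

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"   # aspirin, used only as an example
selfies_str = sf.encoder(smiles)       # SMILES -> SELFIES
roundtrip = sf.decoder(selfies_str)    # SELFIES -> SMILES
print(selfies_str)
print(roundtrip)
```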

A Baselines and Hyperparameters
Any sequence-to-sequence model is applicable to our new tasks (i.e., molecule captioning and generation). We implement the following baselines:
1. RNN-GRU (Cho et al., 2014). We implement a 4-layer GRU recurrent neural network with a hidden size of 512. We use a learning rate of 1e-4 and a batch size of 128 for molecule generation. For caption generation, a batch size of 116 is used. The number of training epochs is 50. Additionally, the encoder is bidirectional. For training, teacher forcing is used 50% of the time, and gradient clipping to 50 is applied.
2. Transformer (Vaswani et al., 2017). We train a vanilla Transformer model consisting of six encoder and decoder layers. The number of training epochs is 40, the batch size is 16, and the learning rate is 1e-4. We use a linear decay with a warmup of 400 steps.
3. T5 (Raffel et al., 2020). We experiment with three public T5.1.1 checkpoints: small, base, and large. We finetune each checkpoint for molecule captioning or molecule generation using the open-sourced t5x framework (Roberts et al., 2022). The number of training steps is set to 50,000. The dropout rate is set to 0.0 for the small and base models, and it is set to 0.1 for the large model.
For other hyperparameters, we use the default values provided by the t5x framework.
We train the baseline models on the ChEBI-20 dataset using SMILES representations for the molecules. Molecule captioning is trained with the molecule as input and the text as output; molecule generation is trained with the text as input and the molecule as output. Sequences are limited to 512 tokens for input and output. During inference, a beam decoder with a beam size of 5 is used.
On the RNN and vanilla Transformer models, we use a character-split vocabulary for SMILES.

B Reproducibility Checklist
The programs, trained models, and resources will be made publicly available. For training the RNN and Transformer baselines, we use NVIDIA Tesla V100 GPUs. For pretraining and finetuning T5-related models, we use TPUs.
When testing on a MacBook Pro that has no access to GPUs, the average inference time of our MolT5-Base molecule generation model is 2.24 seconds/query. The average inference time of our MolT5-Base molecule captioning model is 9.86 seconds/query.

C Decoding with Huggingface Model
For ease of adoption, we converted our original models trained using the t5x framework (Roberts et al., 2022) to HuggingFace-based models (Wolf et al., 2019). We will release the converted models on the HuggingFace (HF) Hub. Due to implementation differences, the HF-based models produce slightly different outputs from the original models. Therefore, we also report the numbers of the HF-based models in Table 3 and Table 4.
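A hedged usage sketch for the HF-converted models is given below; the checkpoint identifier is a placeholder rather than an official model name, and the generation settings mirror the beam size and sequence-length limits reported in Appendix A.

```python
# Usage sketch for a HuggingFace-converted MolT5 checkpoint (placeholder model id).
from transformers import T5ForConditionalGeneration, T5Tokenizer

ckpt = "path/to/molt5-base-smiles2caption"   # placeholder, not an official model id
tokenizer = T5Tokenizer.from_pretrained(ckpt)
model = T5ForConditionalGeneration.from_pretrained(ckpt)

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
inputs = tokenizer(smiles, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```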

D High Validity Molecule Generation
To increase the validity score of the molecule generation models, we consider a high-validity decoding strategy. We use diverse beam search (Vijayakumar et al., 2016) with a beam width and beam group of 30 and a diversity penalty of 0.5. Then, we use RDKit (Landrum, 2021) to select the first valid beam. On rare occasions, the beam size exceeds memory limitations, so we iteratively reduce the beam size by 5 for that input and try again. In Table 4, MolT5-Small-HV, MolT5-Base-HV, and MolT5-Large-HV denote models that use this decoding process.
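A minimal sketch of this high-validity decoding procedure, assuming a HuggingFace-style model and tokenizer as in the previous sketch, could look as follows (the out-of-memory retry loop described above is omitted for brevity).

```python
# Sketch of high-validity decoding: diverse beam search, then keep the first valid beam.
from rdkit import Chem

def generate_high_validity(model, tokenizer, description, num_beams=30):
    inputs = tokenizer(description, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=num_beams,
        num_beam_groups=num_beams,     # beam width and beam group of 30, per the text
        diversity_penalty=0.5,
        num_return_sequences=num_beams,
        max_length=512,
    )
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    for smiles in candidates:          # select the first syntactically valid beam
        if Chem.MolFromSmiles(smiles) is not None:
            return smiles
    return candidates[0]               # fall back to the top beam if none are valid
```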

E Ablations
We perform ablations on MolT5-Small pretraining. For molecule captioning (Table 5), pretraining on both C4 and ZINC is clearly more beneficial than pretraining only on C4 or only on ZINC.
For molecule generation, at first glance, pretraining on C4+ZINC seems not to outperform pretraining only on C4 (Table 6). However, note that except for BLEU, Exact, Levenshtein, and Validity, the other metrics in Table 6 are computed using only syntactically valid molecules. Table 7 shows the normalized molecule generation results. After normalization, we see that pretraining on C4+ZINC outperforms pretraining only on C4 or only on ZINC according to most metrics. Finally, pretraining only on ZINC increases the validity score substantially. However, this leads to decreased similarity of the generated molecules to the ground truths.
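A small sketch of the normalization described in the caption of Table 7:

```python
# Normalize molecule-based scores by validity, per the Table 7 caption: multiply for
# scores where higher is better, divide for scores where lower is better.
def normalize(score: float, validity: float, higher_is_better: bool = True) -> float:
    return score * validity if higher_is_better else score / validity

# e.g., a Morgan FTS of 0.60 at 80% validity normalizes to 0.60 * 0.8 = 0.48.
```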

F More Examples
The example figures in this appendix show input ground-truth descriptions alongside the molecules generated by the RNN, T5, and MolT5 models; generated outputs that could not be parsed are marked as invalid.
Figure 1 :
Figure 1: An example output from our model for the molecule generation task. The left is the ground truth, and the right is a molecule generated from the given natural language caption.
Figure 2: An example of both the image captioning task (Chen et al., 2015) and molecule captioning. Molecule captioning is considerably more difficult because of the increased linguistic variety in possible captions.

Figure 4 :
Figure 4: Example captions generated by different models.

Figure 6 :
Figure 6: More examples of interesting molecules generated by different models.

Figure 9 :
Figure 9: More examples of interesting captions generated by different models.

Figure 11 :
Figure 11: Input: The molecule is a apoptosis inducer.

Figure 12 :
Figure 12: Input: The molecule is a blue dye.

Figure 14 :
Figure 14: Input: The molecule is a corticosteroid.

Figure 17 :
Figure 17: Input: The molecule is a green dye.

Figure 18 :
Figure 18: Input: The molecule is a histological dye.

Figure 19 :
Figure 19: Input: The molecule is a human metabolite.

Figure 20 :
Figure 20: Input: The molecule is a hydrocarbon which tastes really cool.

Figure 21 :
Figure 21: Input: The molecule is a liquid at room temperature.

Figure 23 :
Figure 23: Input: The molecule is a maleate salt.

Figure 24 :Figure 25 :
Figure 24: Input: The molecule is a neurotransmitter agent.

Figure 26 :
Figure 26: Input: The molecule is a photovoltaic.

Figure 27 :Figure 28 :
Figure 27: Input: The molecule is a pigment which converts sunlight into energy.

Figure 29 :
Figure 29: Input: The molecule is a purple dye.

Figure 30 :Figure 31 :
Figure 30: Input: The molecule is a red dye.

Figure 33 :Figure 34 :
Figure 33: Input: The molecule is a sweet tasting sugar additive.

Figure 35 :
Figure 35: Input: The molecule is able to lower blood pressure.

Figure 38 :
Figure 38: Input: The molecule is an anabolic agent.

Figure 41 :
Figure 41: Input: The molecule is an antibiotic.

Figure 44 :
Figure 44: Input: The molecule is an antineoplastic agent.

Figure 47 :
Figure 47: Input: The molecule is an antitubercular agent.

Figure 50 :
Figure 50: Input: The molecule is a catabolic agent.

Figure 53 :
Figure 53: Input: The molecule is an insect attractant.

Figure 54 :
Figure 54: Input: The molecule is an insecticide.

Figure 58 :
Figure 58: Input: The molecule is blue blue blue.

Figure 59 :
Figure 59: Input: The molecule is blue blue blue blue.

Table 1 :
Molecule captioning results on the test split of ChEBI-20. ROUGE scores are F1 values.

Table 2 :
Molecule generation results on the test split of ChEBI-20. Except for BLEU, Exact, Levenshtein, and Validity, other metrics are computed using only syntactically valid molecules, as in (Campos and Ji, 2021).

Table 3 :
HuggingFace model molecule captioning results for the different baseline models on the test split of ChEBI-20. ROUGE scores are F1 values.

Table 4 :
HuggingFace model de novo molecule generation results for the different baseline models on the test split of ChEBI-20. MolT5-Small-HV, MolT5-Base-HV, and MolT5-Large-HV are models that use a high-validity decoding process; see Appendix D.

Table 5 :
Pretraining ablation results of molecule captioning for MolT5-Small on the test split of ChEBI-20. ROUGE scores are F1 values.

Table 6 :
Pretraining ablation results of molecule generation for MolT5-Small on the test split of ChEBI-20.

Table 7 :
Normalized pretraining ablation results of molecule generation for MolT5-Small on the test split of ChEBI-20. Molecule-based results (FTS, FCD, Text2Mol) are normalized by multiplying by validity (for scores where higher is better) or dividing by validity (for scores where lower is better).
Figure 7: More examples of interesting molecules generated by different models.
Figure 8: More examples of interesting molecules generated by different models.