BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations

Recent advancements in biological research leverage the integration of molecules, proteins, and natural language to enhance drug discovery. However, current models exhibit several limitations, such as the generation of invalid molecular SMILES, underutilization of contextual information, and equal treatment of structured and unstructured knowledge. To address these issues, we propose BioT5, a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations. BioT5 utilizes SELFIES for 100% robust molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature. Furthermore, BioT5 distinguishes between structured and unstructured knowledge, leading to more effective utilization of information. After fine-tuning, BioT5 shows superior performance across a wide range of tasks, demonstrating its strong capability of capturing underlying relations and properties of bio-entities. Our code is available at https://github.com/QizhiPei/BioT5.


Introduction
Molecules and proteins are two essential bio-entities in drug discovery (Dara et al., 2022). Small molecule drugs have been the cornerstone of the pharmaceutical industry for nearly a century, owing to their unique advantages such as oral availability and diverse modes of action (AstraZeneca, 2023). Proteins serve as the foundation of life science, functioning as drug targets or crucial elements in disease pathways. As illustrated in Figure 1, both molecules and proteins can be represented using sequences. A molecule can be depicted by a SMILES sequence (Weininger, 1988; Weininger et al., 1989), which is derived by traversing the molecular graph through depth-first search and applying specific branching rules. A protein can be represented by a FASTA sequence (Lipman and Pearson, 1985; Pearson and Lipman, 1988), which spells out its amino acids. The sequential formats of molecules and proteins facilitate the application of Transformer models (Vaswani et al., 2017) and pre-training techniques (Liu et al., 2019; Radford et al., 2019) from natural language processing (NLP) to the biomedical field. ChemBERTa (Chithrananda et al., 2020) and ESM (Rives et al., 2021; Lin et al., 2022) apply masked language modeling to molecular SMILES and protein FASTA respectively, while MolGPT (Bagal et al., 2022) and ProtGPT2 (Ferruz et al., 2022) leverage GPT-style models for molecular and protein generation.

Figure 1: A molecule can be represented by its name, bio-sequence (SMILES and SELFIES), and 2D graph structure; a protein can be represented by its name, corresponding gene name, bio-sequence (FASTA), and 3D structure.

Databases such as PubChem (Kim et al., 2023) and Swiss-Prot (Boutet et al., 2007) serve as knowledge repositories of molecules and proteins. These resources detail properties, experimental results, and interactions between various bio-entities, which cannot be explicitly inferred from molecular or protein sequences alone. Consequently, a recent trend involves jointly modeling text along with molecules and proteins, allowing the textual descriptions to enhance molecular and protein representations. MolT5 (Edwards et al., 2022) adapts the T5 (Raffel et al., 2020) framework to molecular SMILES and biomedical literature. MolXPT (Liu et al., 2023b) and Galactica (Taylor et al., 2022) are GPT models trained on text and bio-entities, such as SMILES and FASTA sequences. DeepEIK (Luo et al., 2023) fuses the encoded features from multi-modal inputs using the attention mechanism (Vaswani et al., 2017). Despite their success, there is still significant room for improvement: (i) Prior work often relies on SMILES to represent molecules; however, the generation of invalid SMILES remains a challenge (Edwards et al., 2022; Li et al., 2023).
(ii) The contextual information surrounding molecular or protein names could offer valuable insights into the interactions and properties of bio-entities; developing effective methods to leverage this information merits further attention. (iii) Existing research tends to treat structured data (e.g., molecule-text pairs from databases) and unstructured data (e.g., text sequences in literature) equally; however, structured data could be utilized more effectively to further enhance overall performance.
To address the above challenges, in this paper, we introduce BioT5, a comprehensive pre-training framework encompassing text, molecules, and proteins. BioT5 leverages SELFIES (Krenn et al., 2020) to represent small molecules, because SELFIES offers a more robust and error-tolerant molecular representation than SMILES, eliminating the invalid structures often encountered with SMILES. There are two main steps in BioT5 pre-training: (1) Data collection & processing: We gather text, molecule, and protein data, as well as existing databases containing molecule-text and protein-text parallel data. For the text data (PubMed) from the biological domain, we employ named entity recognition and entity linking to extract molecular and protein mentions, replacing them with the corresponding SELFIES or FASTA sequences. Following Liu et al. (2023b), we refer to such data as "wrapped" text. Text tokens, FASTA sequences, and SELFIES are tokenized independently (see Section 3.2 for more details).
(2) Model training: BioT5 utilizes a shared encoder and a shared decoder to process the various modalities. The standard T5 employs the "recover masked spans" objective, wherein each masked span and its corresponding part share the same sentinel token; we refer to this training objective as the "T5 objective" for simplicity. There are three types of pre-training tasks: (i) applying the standard T5 objective to molecule SELFIES, protein FASTA, and general text independently, ensuring that the model possesses capabilities in each modality; (ii) applying the T5 objective to wrapped text from the biological domain, where all text, FASTA, and SELFIES tokens can be masked and recovered; (iii) for the structured molecule-text data, a translation objective: BioT5 is trained to translate molecule SELFIES to the corresponding description and vice versa, and likewise for protein-text data.
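The "recover masked spans" objective can be illustrated on a toy token sequence. This is a minimal sketch, assuming pre-tokenized input and T5-style `<extra_id_*>` sentinel naming; it is not BioT5's actual preprocessing code.

```python
# Sketch of the T5 "recover masked spans" (span corruption) objective.
# Spans are non-overlapping, sorted, half-open [start, end) intervals.
def corrupt_spans(tokens, spans):
    """Replace each span with a sentinel; return (corrupted input, target)."""
    corrupted, target = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted.extend(tokens[prev:start])
        corrupted.append(sentinel)          # masked span becomes one sentinel
        target.append(sentinel)             # target repeats the sentinel...
        target.extend(tokens[start:end])    # ...followed by the hidden tokens
        prev = end
    corrupted.extend(tokens[prev:])
    target.append(f"<extra_id_{len(spans)}>")  # closing sentinel, as in T5
    return corrupted, target

# A SELFIES-like toy example: mask tokens 1..2.
inp, tgt = corrupt_spans(["[C]", "[C]", "[Branch1]", "[O]", "[N]"], [(1, 3)])
print(inp)  # ['[C]', '<extra_id_0>', '[O]', '[N]']
print(tgt)  # ['<extra_id_0>', '[C]', '[Branch1]', '<extra_id_1>']
```

In wrapped text (task type ii), the same mechanism applies, except the token list mixes text, FASTA, and SELFIES tokens.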
After pre-training, we fine-tune BioT5 on 15 tasks covering molecule and protein property prediction, drug-target interaction prediction, protein-protein interaction prediction, molecule captioning, and text-based molecule generation. BioT5 achieves state-of-the-art performance on 10 tasks and exhibits results comparable to domain-specific large models on the remaining 5, demonstrating the superior ability of our proposed method. BioT5 establishes a promising avenue for integrating chemical knowledge and natural language associations to augment the current understanding of biological systems.

Related Work
In this section, we briefly review related work on cross-modal models in biology and on representations of molecules and proteins.

Cross-modal Models in Biology
Language models in the biology field have gained considerable attention. Among these, BioBERT (Lee et al., 2020) and BioGPT (Luo et al., 2022), which are pre-trained on scientific corpora, have been particularly successful in understanding scientific texts. More recently, cross-modal models focusing on jointly modeling text with bio-sequences have emerged. They can be categorized into the following three groups.

Cross Text-molecule Modalities MolT5 (Edwards et al., 2022) is a T5 (Raffel et al., 2020)-based model jointly trained on molecule SMILES and a general text corpus. MoSu (Su et al., 2022) is trained on molecular graphs and related textual data using contrastive learning. MolXPT (Liu et al., 2023b) is a GPT (Radford et al., 2018)-based model pre-trained on molecule SMILES, biomedical text, and wrapped text. Different from BioT5, these models all use SMILES to represent molecules, which leads to validity issues when generating molecules.

Cross Text-protein Modalities ProteinDT (Liu et al., 2023a) is a multi-modal framework that uses semantically related text for protein design. BioTranslator (Xu et al., 2023a) is a cross-modal translation system specifically designed for annotating biological instances, such as gene expression vectors, protein networks, and protein sequences, based on user-written text.

Cross Three or More Biology Modalities Galactica (Taylor et al., 2022) is a general GPT-based large language model trained on various scientific domains, including a scientific paper corpus, knowledge bases (e.g., PubChem (Kim et al., 2023) molecules, UniProt (uni, 2023) proteins), code, and other sources. DeepEIK (Luo et al., 2023) fuses the features from multi-modal inputs (drugs, proteins, and text), then adopts the attention mechanism (Vaswani et al., 2017) for textual information denoising and heterogeneous feature integration.

Our work differs from previous studies in several ways: (1) we primarily focus on two biological modalities, molecule and protein, with text serving as a knowledge base and bridge to enrich the underlying relations and properties in the molecule and protein domains; (2) we use multi-task pre-training to model the connections between these three modalities in a more comprehensive manner; (3) we use SELFIES instead of SMILES to represent molecules, which is more robust and resolves the validity issue in molecule generation tasks.

Representations of Molecule and Protein
Molecule Representation The representation and modeling of molecules have long been a challenge in bioinformatics. There are many ways to represent a molecule: name, fingerprint (Rogers and Hahn, 2010a), SMILES (Weininger, 1988; Weininger et al., 1989), InChI (Heller et al., 2013), DeepSMILES (O'Boyle and Dalke, 2018), SELFIES (Krenn et al., 2020), 2D molecular graph, etc. SMILES (Simplified Molecular-Input Line-Entry System), a compact textual representation of molecular structure, is the most common; it employs a sequence of characters to encode atoms, bonds, and other molecular features. However, SMILES has several drawbacks (Krenn et al., 2022), such as a lack of syntactic and semantic robustness, which significantly affects the validity of molecules generated by deep learning models (Edwards et al., 2022). To address this issue, SELFIES (Self-referencing Embedded Strings) was introduced as a 100% robust molecular string representation (Krenn et al., 2020): every permutation of symbols within the SELFIES alphabet generates a chemically valid molecular structure, ensuring that each SELFIES string corresponds to a valid molecule. Unlike the existing works introduced in Section 2.1 that use SMILES, we employ SELFIES with separate encoding in BioT5 to achieve 100% validity in downstream molecule generation tasks.

Protein Representation A protein can also be represented in various ways, such as by its name, corresponding gene name, FASTA format, or 3D geometric structure. The FASTA format, which uses single-letter codes for the 20 amino acids, is a common choice for encoding protein sequences; BioT5 also employs FASTA for protein representation.
Unlike Edwards et al. (2022) and Taylor et al. (2022), which share the dictionary between bio-sequence tokens and natural language tokens, BioT5 uses separate dictionaries and biology-specific tokenization to explicitly distinguish biological modalities. We give further analysis of this in Section 3.2.

BioT5
An overview of BioT5 pre-training is illustrated in Figure 2. We combine data from different modalities to perform multi-task pre-training.

Pre-training Corpus
As shown in Figure 2, the pre-training corpus of BioT5 falls into three classes: (1) Single-modal data, including molecule SELFIES, protein FASTA, and general text. For small molecules, we use the ZINC20 (Irwin et al., 2020) dataset and convert SMILES to SELFIES; for proteins, we use a large corpus of protein FASTA sequences; for general text, we use the C4 dataset (Raffel et al., 2020). (2) Wrapped text, where molecule names are replaced with their corresponding SELFIES and gene names are appended with the related protein FASTA. We use 33M PubMed articles (Canese and Weis, 2013) and apply BERN2 (Sung et al., 2022) for named entity recognition. Scientific sentences that are neither replaced nor appended with bio-sequences are retained as a supplement to the general text. The detailed process is depicted in Figure 4 and discussed in Appendix B.
(3) Molecule-description pairs and protein-description pairs. For molecule-text data, we collect 339K molecule SELFIES along with their corresponding names and descriptions from PubChem (Kim et al., 2019), excluding all molecules present in the downstream ChEBI-20 dataset (Edwards et al., 2022) to avoid potential data leakage. For protein-text data, we obtain 569K protein FASTA-description pairs from Swiss-Prot (Boutet et al., 2007), which contains high-quality annotations of various protein properties. Details are left in Appendix E.1.

Separate Tokenization and Embedding
In most previous works, the representation of molecules and proteins has not been modeled with sufficient attention to detail. MolT5 (Edwards et al., 2022) employs the same dictionary as the original T5, as it starts pre-training from the original T5 checkpoint. The original T5 dictionary is derived from natural language using SentencePiece (Kudo and Richardson, 2018). However, directly utilizing this dictionary for molecule SMILES is suboptimal, as some chemically meaningful tokens, such as functional groups or complete atoms, will be tokenized inaccurately. For example, in the molecule depicted in Figure 3, the bromine atom, symbolized as "Br" in SMILES, is tokenized as "B" (a boron atom) and "r" by MolT5. Consequently, MolT5 incorrectly characterizes this molecule as both dibromolit (related to "Br") and tetraborate (related to "B"). The character-based tokenization of Galactica (Taylor et al., 2022) suffers from the same issue.

Figure 3: A tokenization case. MolT5 processes "Br" (bromine atom) as "B" (boron atom) and "r", resulting in incorrect descriptions including tetraborate (related to "B"). BioT5 retains the chemically meaningful group "[Br-1]" as a complete token, thereby producing the correct output.
In addition to the tokenization method, sharing token embeddings across different modalities (Edwards et al., 2022; Taylor et al., 2022) is also questionable. In multilingual tasks, shared embeddings allow models to accurately represent the meanings of borrowed words and cognates, which retain their original meanings across languages. However, molecules, proteins, and text are entirely distinct languages: the same token carries different semantic meanings in each. For example, the token "C" signifies the character C in natural language, a carbon atom in molecules, and cysteine (one of the 20 amino acids) in proteins. Studies by Beltagy et al. (2019) and Gu et al. (2021) further emphasize the significance of domain-specific vocabulary.
To address the issues mentioned above, we employ separate vocabularies for molecules, proteins, and text. In BioT5, a molecule is represented by a SELFIES string, where each chemically meaningful atom group is enclosed within brackets and tokenized as a single SELFIES token, e.g., [Br]. For proteins, to differentiate amino acids from capital letters in text, we introduce a special prefix <p> for each amino acid; for example, <p>M<p>K<p>R → <p>M, <p>K, <p>R. For text, we use the same dictionary as the original T5. In this way, we explicitly separate the semantic spaces of the different modalities, which maintains the inherent integrity of each modality and prevents the model from conflating meanings across modalities.
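The two modality-specific tokenizers described above can be sketched in a few lines. This is an illustrative reconstruction: the helper names are ours, not from the released BioT5 code.

```python
import re

def tokenize_selfies(selfies_string):
    """Split a SELFIES string into bracketed atom-group tokens, e.g. [Br]."""
    return re.findall(r"\[[^\]]*\]", selfies_string)

def tokenize_fasta(fasta_string):
    """Prefix each amino acid with <p> to keep it distinct from text letters."""
    return [f"<p>{aa}" for aa in fasta_string]

print(tokenize_selfies("[C][=C][Br]"))  # ['[C]', '[=C]', '[Br]']
print(tokenize_fasta("MKR"))            # ['<p>M', '<p>K', '<p>R']
```

Because every SELFIES symbol is bracket-delimited, the regex split is unambiguous, unlike SMILES, where "Br" vs. "B"+"r" requires chemistry-aware rules.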

Model and Training
Model architecture BioT5 employs the same architecture as T5 models (Raffel et al., 2020), following the configuration used in T5-v1.1-base. The vocabulary size of BioT5 is 35,073, differing from the default configuration because we incorporate separate vocabularies for molecule SELFIES and protein amino acids. In total, the BioT5 model comprises 252M parameters.

Pre-training During the pre-training phase, the model is trained in a multi-task way on six tasks of three types: (1) applying the T5 objective to each single modality, namely molecule SELFIES (task #1), protein FASTA (task #2), and general text (task #3), independently; (2) applying the T5 objective to wrapped text from the scientific corpus (task #4); (3) bidirectional translation for molecule SELFIES-text pairs (task #5) and protein FASTA-text pairs (task #6). By effectively learning the underlying connections and properties of bio-entities from textual information through these pre-training tasks, BioT5 gains a holistic understanding of the biological domain, thereby facilitating enhanced prediction and generation abilities in various biological tasks.

Fine-tuning BioT5 can be fine-tuned on various downstream tasks involving molecules, proteins, and text. To unify different downstream tasks and reduce the gap between the pre-training and fine-tuning (Brown et al., 2020) stages, we adopt the prompt-based fine-tuning (Gao et al., 2021) approach, which casts various task formats into a sequence-generation format.

Experiments and Results
We evaluate BioT5 on 15 well-established downstream tasks, which can be categorized into three types: single-instance prediction, multi-instance prediction, and cross-modal generation.We include details regarding fine-tuning datasets, baselines, and prompts in Appendix F.
For the downstream binary classification tasks presented in Sections 4.1 and 4.2, evaluation metrics such as AUROC and AUPRC require the soft probability of the predicted label. As we use the prompt-based fine-tuning method, the output is either Yes for the positive label or No for the negative label. To obtain an appropriate label distribution, following Liu et al. (2023b), we first extract the probabilities of the Yes and No tokens (denoted p_pos and p_neg respectively) and normalize them: the resulting probability for the positive label is p_pos / (p_pos + p_neg), and for the negative label it is p_neg / (p_pos + p_neg).
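The normalization above reduces to a two-line helper. A minimal sketch, with made-up probability values for illustration:

```python
def label_distribution(p_yes, p_no):
    """Normalize raw Yes/No token probabilities into (P(pos), P(neg))."""
    total = p_yes + p_no
    return p_yes / total, p_no / total

# Example: the decoder assigns 0.6 to "Yes" and 0.2 to "No";
# the rest of the probability mass falls on unrelated tokens.
p_pos, p_neg = label_distribution(0.6, 0.2)
print(round(p_pos, 3), round(p_neg, 3))  # 0.75 0.25
```

The normalized positive-label probability is then fed directly into AUROC/AUPRC computation.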

Results
The results are presented in Table 1, with all statistics derived from three random runs. From these results, we can see that BioT5 surpasses the baselines on most downstream tasks in MoleculeNet.
BioT5 exhibits superior performance compared to GNN baselines pre-trained on 2D/3D molecular data, underscoring the effectiveness of the knowledge contained in text. Furthermore, BioT5 outperforms other language model baselines, which may be attributed to the presence of molecule property descriptions in scientific contextual text or existing biological database entries.

Protein Property Prediction
Protein property prediction is crucial as it provides critical insights into the behavior and functions of proteins. We concentrate on two protein property prediction tasks from the PEER benchmark (Xu et al., 2022): protein solubility prediction, which aims to predict whether a given protein is soluble, and protein localization prediction, which classifies proteins as either "membrane-bound" or "soluble".
ProtBert and ESM-1b are studied in two settings: (i) freezing the protein language model parameters and training only the prediction head; (ii) fine-tuning all model parameters.

Results
The results are displayed in Table 2, with all statistics derived from three random runs. In the protein solubility prediction task, BioT5 outperforms all baselines in the PEER (Xu et al., 2022) benchmark. In the protein localization prediction task, BioT5 is the second best among all methods. Notably, ProtBert and ESM-1b are both pre-trained on a large corpus of protein sequences, comparable to or even larger than ours, and both models are two to three times larger than BioT5. This demonstrates the potential of BioT5 to enhance predictive capabilities in protein property prediction by integrating textual information.

Drug-target Interaction Prediction
Drug-target interaction (DTI) prediction plays a crucial role in drug discovery, as it aims to predict whether a given drug (molecule) and target (protein) interact. We select three widely used DTI datasets with a binary classification setting: BioSNAP (Zitnik et al., 2018), BindingDB (Liu et al., 2007), and Human (Liu et al., 2015; Chen et al., 2020).

Results
The results on the BioSNAP, Human, and BindingDB datasets are presented in Table 3; all statistics are obtained from five random runs. On the BioSNAP and BindingDB datasets, BioT5 consistently outperforms other methods across performance metrics, including AUROC, AUPRC, and accuracy. For the Human dataset, although deep learning-based models generally exhibit strong performance, BioT5 still demonstrates a slight advantage over the baseline models. It is worth noting that, in contrast to most deep learning-based baselines, BioT5 does not rely on a design tailored specifically for molecules or proteins. A possible explanation for its superior performance is that the SELFIES and FASTA representations effectively capture the intricate structure and function of molecules and proteins, and the interaction information between them may be well described in the contextual scientific literature or the corresponding text entries in databases.

Protein-protein Interaction Prediction
Protein-protein interaction (PPI) prediction plays a vital role in understanding protein functions and structures, as it aims to determine the potential interaction between a pair of proteins.
Baselines The baselines for comparison are the same as those in Section 4.1.2.

Results
The results are shown in Table 4, with all statistics over three random runs. On the two PPI datasets, BioT5 shows superior performance compared to almost all baseline models. Remarkably, BioT5 outperforms both ProtBert and ESM-1b (with all parameters fine-tuned). This result strongly highlights the crucial role of incorporating textual information during pre-training, which effectively establishes profound connections between proteins. Our model, despite being smaller, is able to harness the unstructured information embedded in scientific text and the structured information from biological databases, encapsulating comprehensive knowledge of proteins in their varying contexts.

Cross-modal Generation
In this section, we evaluate the performance of BioT5 on cross-modal generation tasks. Specifically, we fine-tune BioT5 on molecule captioning and text-based molecule generation. These two tasks were proposed by MolT5 (Edwards et al., 2022) and both use the ChEBI-20 dataset (Edwards et al., 2021). The evaluation metrics and some interesting cases are introduced in Appendices D and G.

Molecule Captioning
Given a molecule, the goal of the molecule captioning task is to generate a description of it. As we use SELFIES sequences to represent molecules, this task can be formulated as an exotic sequence-to-sequence translation task.

Results
The results are shown in Table 5. BioT5 has nearly the same number of parameters as MolT5-base, yet outperforms all baseline models on all metrics, including those with more parameters. Its Text2Mol score of 0.603 is very close to the Text2Mol score of 0.609 computed between the ground-truth molecules and their descriptions. We attribute this superior performance to the unstructured contextual knowledge and structured database knowledge induced during BioT5 pre-training, which helps the model learn the intricate relationship between text and molecules.

Text-Based Molecule Generation
This is the reverse of molecule captioning: given a natural language description of the intended molecule, the goal is to generate a molecule that fits the description.
Baselines The compared baselines are the same as those in Section 4.3.1.

Results
The results are presented in Table 6. BioT5 uses a parameter count similar to MolT5-base yet delivers superior performance across nearly all metrics. Notably, the exact match score of BioT5 surpasses that of MolT5-Large by 32.8% while maintaining a validity of 1.0. This indicates that BioT5 not only generates molecules more relevant to the given text descriptions, but also ensures 100% validity of the generated molecules. The overall enhanced performance of BioT5 can be attributed to the incorporation of both contextual and database knowledge, as well as the utilization of SELFIES for molecular representation.

Conclusions and Future Work
In this paper, we propose BioT5, a comprehensive pre-training framework capable of capturing the underlying relations and properties of bio-entities by leveraging both structured and unstructured data sources with a 100% robust molecular representation. Our method effectively enriches cross-modal integration in biology with chemical knowledge and natural language associations, demonstrating notable improvements in various tasks.
For future work, we aim to further enrich our model by incorporating additional biological data types, such as genomics or transcriptomics data, to create a more holistic biological pre-training framework.Additionally, we plan to evaluate the interpretability of BioT5 predictions, aiming to provide more insights into the biological systems under study.Thus, we foresee our work sparking further innovation in the use of AI models in the field of computational biology, ultimately leading to a deeper understanding of biological systems and facilitating more efficient drug discovery.

Limitations
One limitation of BioT5 is that it conducts full-parameter fine-tuning on each downstream task. This is done because we do not observe generalization ability across different downstream tasks using the instruction-tuning (Wei et al., 2022) method. Another reason is that combining data from different tasks using instructions results in data leakage; for example, we have noticed overlaps between the training set of BindingDB and the test sets of BioSNAP and Human. Additionally, we only demonstrate the ability of BioT5 in the text, molecule, and protein modalities. Numerous other biological modalities, such as DNA/RNA sequences and cells, exist, and there are many other tasks within a single modality or across multiple modalities. Moreover, BioT5 primarily focuses on the sequence format of bio-entities, yet other formats, such as 2D or 3D structures, also hold significant importance. We leave further exploration of these to future work.

B NER and Entity Linking Process
We follow KV-PLM (Zeng et al., 2022) and MolXPT (Liu et al., 2023b) in conducting Named Entity Recognition (NER) and entity linking for the bio-entity names appearing in scientific text. More specifically, we first utilize BERN2 (Sung et al., 2022), an advanced neural NER tool for the biomedical field, to identify all molecule or protein mentions. Subsequently, we map them to entities in publicly accessible knowledge bases: for molecules, the ChEBI (Hastings et al., 2016) and MeSH (Lipscomb, 2000) databases, and for proteins, the NCBI Gene (Brister et al., 2015) database. We can then obtain the corresponding molecule SELFIES and protein FASTA for the matched entities. As shown in Figure 4, for molecules, we directly replace each detected name with its SELFIES string; for proteins, due to the length limitation, if a sentence contains more than one protein entity, we randomly choose only one to have the protein FASTA appended to its name. The motivation for appending the protein FASTA instead of replacing the name is that genes are transcribed and translated to generate proteins; therefore, unlike molecule names, which directly represent the molecule, the relation between gene names and protein FASTA is indirect. Note that replacement or appendage does not happen in every sentence; only sentences with detected bio-entities undergo the above process.
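The replace-vs-append rules above can be sketched as follows. This is a toy reconstruction: the entity dictionaries and sequences below are placeholders, whereas in BioT5 the mentions come from BERN2 NER plus ChEBI/MeSH/NCBI Gene linking.

```python
import random

# Hypothetical linked entities (placeholders, not real database entries).
MOLECULE_SELFIES = {"ethanol": "[C][C][O]"}
GENE_FASTA = {"TP53": "<p>M<p>E<p>E"}

def wrap_sentence(sentence, rng=random):
    # Molecule names directly represent the molecule, so replace them outright.
    for name, selfies_str in MOLECULE_SELFIES.items():
        sentence = sentence.replace(name, selfies_str)
    # Gene-protein relation is indirect, so append the FASTA to one
    # randomly chosen gene mention instead of replacing the name.
    mentioned = [g for g in GENE_FASTA if g in sentence]
    if mentioned:
        gene = rng.choice(mentioned)
        sentence = sentence.replace(gene, f"{gene} {GENE_FASTA[gene]}", 1)
    return sentence

print(wrap_sentence("ethanol binds TP53."))
# → '[C][C][O] binds TP53 <p>M<p>E<p>E.'
```

Sentences with no detected entities pass through unchanged and serve as general text.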

C Dictionary and SELFIES Conversion
For molecule-related datasets, when only SMILES is provided, we utilize the selfies package to convert SMILES into SELFIES.

D Molecule-Text Generation Metrics
We follow Edwards et al. (2022) in using the same evaluation metrics for the molecule captioning and text-based molecule generation tasks. To ensure a fair comparison, we convert the molecule SELFIES to SMILES before calculating these metrics.

D.1 Molecule Captioning Metrics
In the molecule captioning task, NLP metrics such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005) are utilized to evaluate the closeness of the generated description to the ground truth. We also adopt the Text2Mol metric, proposed by Edwards et al. (2021), which employs pre-trained models to measure the similarity between a description and a molecule. Higher similarity means the given text description is more relevant to the molecule; the Text2Mol score between the ground-truth description and molecule is also computed for comparison.

D.2 Text-based Molecule Generation Metrics
Since molecules can be represented as bio-sequences, NLP metrics such as BLEU (Papineni et al., 2002) and the Exact Match score between generated and ground-truth SMILES are directly applicable. Additionally, we report performance on molecule-specific metrics: three fingerprint (FTS) similarity scores, namely MACCS (Durant et al., 2002), RDK (Schneider et al., 2015), and Morgan (Rogers and Hahn, 2010a); the Levenshtein distance (Miller et al., 2009); the FCD score (Preuer et al., 2018), which measures molecule similarity according to biological information using the pre-trained "ChemNet"; and validity, the percentage of generated SMILES that can be processed by RDKit (Landrum, 2021). The Text2Mol metric is also used to measure the similarity between the molecule SMILES and the ground-truth description.
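Of these metrics, the Levenshtein distance is simple enough to sketch in plain Python. This is an illustrative implementation over toy SMILES strings; actual evaluations typically rely on a library implementation.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

# Generated vs. ground-truth SMILES differing by one inserted character.
print(levenshtein("CCO", "CC=O"))  # 1
```

Lower distances indicate generated SMILES that are textually closer to the ground truth, complementing the fingerprint-based similarity scores.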

E Pre-training Details

E.1 Special Tokens
In the pre-training of BioT5, we conduct translation tasks on molecule-text pairs and protein-text pairs extracted from PubChem (Kim et al., 2023) and Swiss-Prot (Boutet et al., 2007) respectively. We format the text description from these database entries using special tokens, which serve as anchors for embedding scientific context and structure. For molecules, we use MOLECULE NAME and DESCRIPTION to represent the name and a description including properties, functions, etc. For proteins, similar to Xu et al. (2023b), we use PROTEIN NAME, FUNCTION, SUBCELLULAR LOCATION, and PROTEIN FAMILIES to represent the name, functions, location and topology in the cell, and the families it belongs to. A complete text description is created by concatenating these fields sequentially, omitting any missing fields. Through special tokens, we can effectively encode the intricate information associated with each bio-entity.

E.2 Hyper-parameters
We use the nanoT5 (Nawrot, 2023) codebase for BioT5 pre-training. We pre-train BioT5 for 350K steps on eight NVIDIA 80GB A100 GPUs. The batch size is 96 per GPU, with each batch including all six types of data. The "translation" direction for each molecule-text and protein-text pair is randomly selected per sample with a probability of 0.5. We use AdamW (Loshchilov and Hutter, 2019) with Root Mean Square (RMS) scaling for optimization. The learning rate scheduler is cosine annealing with the base learning rate set to 1e-2 and the minimum learning rate set to 1e-5.
The number of warm-up steps is 10,000 and the dropout rate is 0.0. The maximum input length for pre-training is 512. Unlike absolute position encodings, T5 (Raffel et al., 2020) uses relative position encodings. This makes the model flexible to inputs of different lengths, which is helpful for downstream fine-tuning.
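A minimal sketch of the resulting learning-rate schedule, assuming linear warm-up followed by cosine annealing (the exact shape implemented in nanoT5 may differ slightly; the step counts and learning rates are the ones reported above):

```python
import math

def lr_at(step: int, total: int = 350_000, warmup: int = 10_000,
          base: float = 1e-2, minimum: float = 1e-5) -> float:
    """Linear warm-up to the base LR, then cosine annealing down to the minimum."""
    if step < warmup:
        return base * step / warmup
    progress = (step - warmup) / (total - warmup)  # 0 at end of warm-up, 1 at end
    return minimum + 0.5 * (base - minimum) * (1 + math.cos(math.pi * progress))

# lr_at(0) == 0.0, lr_at(10_000) == 1e-2, lr_at(350_000) == 1e-5 (up to float error)
```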

F Fine-tuning Details
In this section, we provide details about downstream tasks, including datasets, compared baselines, and prompts. Statistics for the downstream tasks are shown in Table 7. In the prompts, ⟨SELFIES⟩ refers to the molecule SELFIES and ⟨FASTA⟩ refers to the protein FASTA.

F.1.1 Molecule Property Prediction
All the datasets are split with an 8 : 1 : 1 ratio for train, validation, and test, respectively. We use the scaffold splitting method, in which molecules are grouped according to their Bemis-Murcko scaffold representation.
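Scaffold splitting assigns whole scaffold groups to one split, so structurally similar molecules never leak between train and test. A sketch of the grouping step, assuming the Bemis-Murcko scaffold of each molecule has already been computed (e.g. with `rdkit.Chem.Scaffolds.MurckoScaffold`, not reimplemented here); the greedy largest-group-first assignment mirrors common DeepChem-style splitters rather than the paper's exact procedure:

```python
from collections import defaultdict

def scaffold_split(smiles, scaffolds, frac=(0.8, 0.1, 0.1)):
    """Greedy scaffold split: each scaffold group goes entirely to train,
    then validation, then test, so no scaffold spans two splits.
    `scaffolds[i]` is the Bemis-Murcko scaffold string of `smiles[i]`."""
    groups = defaultdict(list)
    for i, scaf in enumerate(scaffolds):
        groups[scaf].append(i)
    # Assign larger scaffold groups first so the 8:1:1 targets fill up evenly.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(smiles)
    train, valid, test = [], [], []
    for g in ordered:
        if len(train) + len(g) <= frac[0] * n:
            train += g
        elif len(valid) + len(g) <= frac[1] * n:
            valid += g
        else:
            test += g
    return train, valid, test
```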

Datasets
(1) The BBBP (Blood-Brain Barrier Penetration) dataset is curated to aid the modeling and forecasting of barrier permeability. It comprises compounds categorized with binary labels indicating whether they can penetrate the blood-brain barrier.
(2) The Tox21 ("Toxicology in the 21st Century") initiative established a publicly accessible database that quantifies the toxicity levels of various compounds. The dataset encompasses qualitative toxicity assessments (binary labels) for approximately 8,000 compounds, targeting 12 distinct biological pathways such as nuclear receptors and stress response mechanisms.
(3) The ClinTox dataset contrasts FDA-approved drugs with those that have failed clinical trials owing to toxicity issues. It incorporates two classification objectives for 1,491 drug compounds with established chemical structures: (i) presence or absence of toxicity in clinical trials; (ii) approved or not approved by the FDA.
(4) The HIV dataset assesses the inhibitory potential of over 40,000 compounds on HIV replication. The screening outcomes were classified into three categories: Confirmed Inactive (CI), Confirmed Active (CA), and Confirmed Moderately Active (CM). The latter two labels were subsequently combined, transforming the task into a binary classification between inactive (CI) and active (CA and CM) compounds.
(5) The BACE dataset presents quantitative IC50 values and qualitative binary labels for a collection of inhibitors targeting human beta-secretase 1 (BACE-1).
(6) The SIDER (Side Effect Resource) is a comprehensive database of marketed drugs and their corresponding adverse drug reactions (ADRs). The drug side effects in SIDER are organized into 27 system organ classes following the MedDRA classifications. The dataset covers 1,427 approved drugs.

Baselines
(1) GROVER (Rong et al., 2020) incorporates Message Passing Networks within a Transformer-style architecture and is pre-trained on a large-scale molecular dataset without any supervision. G-Contextual and G-Motif are two variants of GROVER, pre-trained on the contextual property prediction task and the motif prediction task, respectively.
(2) GraphMVP (Liu et al., 2022) employs self-supervised learning by capitalizing on the correspondence and consistency between the 2D topological structures and 3D geometric views of molecules.
(3) MGSSL (Zhang et al., 2021) introduces a motif-based graph self-supervised learning strategy for molecular representation pre-training.

F.2.1 Drug-target Interaction Prediction Datasets
(2) BindingDB (Liu et al., 2007) is a publicly accessible online database of experimentally validated binding affinities, focusing mainly on interactions between small drug-like molecules and proteins. We follow Bai et al. (2023) in using a modified version of the BindingDB dataset, previously constructed by Bai et al. (2021) with reduced bias.
(3) Human (Liu et al., 2015; Chen et al., 2020) is constructed with highly credible negative samples. Following Bai et al. (2023), we use a balanced version of the Human dataset containing an equal number of positive and negative samples.

Baselines
We compare the performance of BioT5 with the following six models on the DTI task.
(3) DeepConv-DTI (Lee et al., 2019) uses a fully connected neural network to encode the ECFP4 drug fingerprint, and a Convolutional Neural Network (CNN) with a global max-pooling layer to extract features from protein sequences. The drug and protein features are then concatenated and fed into a fully connected neural network for the final prediction.
(4) GraphDTA (Nguyen et al., 2021) uses graph neural networks (GNNs) to encode drug molecular graphs and a CNN to encode protein sequences. The resulting drug and protein representation vectors are concatenated for interaction prediction.
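Both baselines share the same prediction pattern: encode each modality separately, concatenate, then classify. A minimal sketch of that head, with a single linear layer standing in for the deeper MLPs (the encoders and weights here are illustrative, not those of DeepConv-DTI or GraphDTA):

```python
import math

def predict_interaction(drug_feat, prot_feat, weights, bias=0.0):
    """Concatenate drug and protein feature vectors, then apply a linear
    layer with a sigmoid to get an interaction probability. Real models
    use learned encoders and multi-layer heads; this shows only the
    concatenate-then-classify structure."""
    joint = list(drug_feat) + list(prot_feat)          # feature concatenation
    score = sum(w * x for w, x in zip(weights, joint)) + bias
    return 1.0 / (1.0 + math.exp(-score))              # sigmoid output

# Hypothetical 2-dim drug and 3-dim protein features, uniform weights.
p = predict_interaction([0.2, 0.5], [0.1, 0.9, 0.3], weights=[1.0] * 5)
```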

F.2.2 Protein-protein Interaction Prediction Datasets
(1) Yeast (Guo et al., 2008) involves determining whether two yeast proteins interact. The negative pairs are derived from distinct subcellular locations. Following Xu et al. (2022), the dataset is split and redundancy is removed according to protein sequence similarity, which allows evaluation of generalization across dissimilar protein sequences.
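The redundancy-removal idea can be sketched as a greedy similarity filter: keep a sequence only if it is sufficiently dissimilar to every sequence already kept. The sketch below uses `difflib.SequenceMatcher` as a cheap stand-in for a proper alignment-based identity measure, and the threshold is illustrative, not the value used by Xu et al. (2022):

```python
from difflib import SequenceMatcher

def deduplicate(sequences, threshold=0.4):
    """Greedily keep a protein sequence only if its similarity ratio to every
    previously kept sequence is below `threshold`. Real pipelines use
    alignment-based identity (e.g. CD-HIT-style clustering) instead of
    difflib, but the filtering logic is the same."""
    kept = []
    for seq in sequences:
        if all(SequenceMatcher(None, seq, k).ratio() < threshold for k in kept):
            kept.append(seq)
    return kept
```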
(2) Human (Pan et al., 2010) involves determining whether two human proteins interact. It comprises positive protein pairs sourced from the Human Protein Reference Database (HPRD) (Peri et al., 2003).

F.3.1 Molecule Captioning
Baselines
(2) Transformer (Vaswani et al., 2017), containing 6 encoder and decoder layers, is trained from scratch on the ChEBI-20 dataset.
(3) T5 (Raffel et al., 2020) is directly fine-tuned on the ChEBI-20 dataset from public checkpoints with three different model sizes: small, base, and large. Note that no molecule-domain knowledge is introduced in the original T5 pre-training.
(4) MolT5 (Edwards et al., 2022) is jointly trained on molecule SMILES from the ZINC-15 dataset (Sterling and Irwin, 2015) and general text from the C4 dataset (Raffel et al., 2020), so MolT5 has prior knowledge of both domains. It also comes in three sizes: small, base, and large. These models are then further fine-tuned on the ChEBI-20 dataset.

G Case Study
In this section, we show several example outputs from different models on the molecule captioning and text-based molecule generation tasks. Figure 5 shows cases for the molecule captioning task.
In example (1), the description from BioT5 matches the ground truth best, successfully localizing the position of the substituent group and identifying the molecule as "a member of pyridines and an aryl thiol". In example (2), MolT5 mistakenly describes the molecule as containing boron, while BioT5's description is more accurate.
In example (3), while MolT5 generates repetitive output, BioT5 and T5 generate semantically coherent output, and BioT5's output matches the ground truth better. For the complex molecule in example (4), the output of BioT5 is more holistic and accurate. Notably, only BioT5 describes this molecule as an inhibitor of the SARS coronavirus main proteinase, which may come from our integration with protein knowledge. Figure 6 shows cases for the text-based molecule generation task.

[Figure 6 contents: example model outputs for text-based molecule generation, including repetitive and invalid generations from baseline models alongside the ground-truth molecule descriptions.]

Figure 1: Representations of a molecule and a protein. A molecule can be represented by its name, bio-sequence (SMILES and SELFIES), and 2D graph structure. A protein can be represented by its name, corresponding gene name, bio-sequence (FASTA), and 3D structure.

Figure 4: Wrapped text matching and mapping process.

Table 4: Performance comparison on the Yeast and Human datasets (Best, Second Best). The evaluation metric is accuracy. * indicates only tuning the prediction head.

Table 6: Performance comparison on the text-based molecule generation task (Best, Second Best). Following Edwards et al. (2022), BLEU, Exact, Levenshtein, and Validity are computed on all generated molecules, while the other metrics are computed only on syntactically valid molecules. The Text2Mol score for the ground truth is 0.609. The baseline results derive from MolT5 (Edwards et al., 2022).

Table 7: Downstream task descriptions, including task or dataset name, type, and the size of each split.
For classification tasks, the output is "Yes" for a positive label and "No" otherwise.
This is the reverse task of molecule captioning. The input is the text description of the desired molecule and the output is the corresponding molecule SELFIES. The datasets and compared baselines are the same as for molecule captioning in Section F.3.1, so we only provide the prompts here.

[Figure 5 contents: example ground-truth and model-generated captions, including descriptions of thiol- and pyridine-containing molecules.]
The molecule is a thiol that is thiol substituted by a sulfanyl group at position 4. It has a role as a metabolite.It is a thiol and a member of benzenes.It derives from a hydride of a thiol.The molecule is a monothiocarbamic ester resulting from the formal condensation of thiocyanic acid with benzene.It is a member of thiocarbamic acids and a monothiocarbamic ester.The molecule is pyridine substituted at position 2 by a sulfanyl group.It has a role as a corrosion inhibitor and an allergen.It is a member of pyridines and an aryl thiol.The molecule is pyridine substituted at C-2 by a sulfanyl group.It has a role as a fluorescence quencher and an allergen.It is a member of pyridines and an aryl thiol.