Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries

We propose a new task, Text2Mol, to retrieve molecules using natural language descriptions as queries. Natural language and molecules encode information in very different ways, which leads to the exciting but challenging problem of integrating these two modalities. Although some work has been done on text-based retrieval and structure-based retrieval, this new task requires integrating molecules and natural language more directly. Moreover, it can be viewed as an especially challenging cross-lingual retrieval problem, in which the molecules are treated as a language with a unique grammar. We construct a paired dataset of molecules and their corresponding text descriptions, which we use to learn an aligned common semantic embedding space for retrieval. We extend this to create a cross-modal attention-based model for explainability and reranking by interpreting the attentions as association rules. We also employ an ensemble approach to integrate our different architectures, which significantly improves results from 0.372 to 0.499 MRR. This new multimodal approach opens a new perspective on solving problems in chemistry literature understanding and molecular machine learning.


Introduction
Discovering new properties and applications of different molecules is critical for accelerating discovery in medicine and science. Existing databases contain tens of millions of molecules; PubChem (Kim et al., 2016) alone has 110 million compounds. Many information retrieval (IR) tools for chemistry rely on queries based on natural language descriptions of the molecules and existing chemical reactions. Hundreds of millions of possible molecules cannot all possibly undergo laboratory experimentation and be given attention by experts in order to create a description. To address this issue, it is critical to retrieve molecules directly from natural language descriptions. This approach allows newly discovered molecules to be easily integrated into the proposed IR framework. Our framework also allows for semantic-level search between natural language descriptions and molecules as well as for query expansion within traditional chemistry information retrieval systems. Over the past several years, chemists have begun to rely increasingly on computational techniques for cataloging molecules and predicting chemical reactions, products, and properties, such as yield, toxicity, and water solubility (Wu et al., 2018; Glavatskikh et al., 2019; Coley et al., 2017; Ahneman et al., 2018; Fooshee et al., 2018). However, natural language and molecules are very different modalities of data, which makes integrating them a challenging task. We argue that these two modalities are complementary and should be considered together.

Figure 1: Given a natural language description of water, we want to rank the corresponding molecule H2O first among all the possible molecules.
Much current work focuses on images and language (Mogadala et al., 2020), but it is beneficial for the community to consider modalities beyond traditional ones, increasing their work's impact and efficacy. One such example is the integration of NLP and chemistry.

Table 1: Definitions of chemistry terms used in this paper.
Molecule: An electrically neutral group of atoms bonded together.
Compound: Two or more elements held together by chemical bonds.
Chemical fingerprint: Represents a molecule or substructure using a bitstring. This allows for efficient substructure search and similarity calculation.
Morgan fingerprint: A specific type of chemical fingerprint, also known as ECFP.
SMILES string: A character-based sequence representation of a molecule (for example, C1=CC=CC=C1 is the SMILES string for benzene).
Canonical SMILES: A unique SMILES string for a molecule.

In pursuit of this goal, we propose a multimodal embedding approach for constructing an aligned semantic space between these two types of data to allow for cross-modal retrieval. No previous work has studied this retrieval problem. The closest is Zhou et al. (2010), which uses a hybrid approach to document retrieval by replacing chemicals in text with canonical keywords in order to standardize different chemical synonyms. However, this does not take the semantic information of the molecule (properties beyond the atoms and graph structure, such as being a pollutant or hydrophobic) into account.
Additionally, incorporating cross-modal attention can lead to insights on the relations between molecule substructures and text keywords. For example, we find that given "pollutant," the model focuses on the substructure F − C. This contributes to higher-level explainability between molecules and their descriptions.
Our molecular encoder is based on the Mol2vec (Jaeger et al., 2018) algorithm, which creates "sentences" of substructure identifiers from molecules; we frame Text2Mol as a new, particularly challenging type of cross-lingual information retrieval (CLIR). This problem is much more challenging than traditional CLIR since the gap between the query and target is much larger. It also provides a useful benchmark for extending CLIR to incorporate multiple data modalities. Molecules are essentially a different language with a uniquely challenging grammar. In fact, several techniques apply models developed for natural language processing to SMILES strings, machine-readable character-based representations for molecules (Weininger, 1988; Weininger et al., 1989).
The major novel contributions of this paper are:
• A new task, Text2Mol: cross-modal text-molecule information retrieval directly from natural language descriptions to molecules.
• Cross-modal attention-based association rules between molecules and text are used to improve results and for explainability.
• A new benchmark dataset with 33,010 text-compound pairs for cross-modal text-molecule IR, which can be used for cross-lingual, multimodal, and explainable IR.

Task Definition
To push the boundaries of multimodal models, we present a new IR task: Text2Mol.
Given a text query and a list of molecules without any reference textual information (represented, for example, as SMILES strings, graphs, or other equivalent representations), retrieve the molecule corresponding to the query. Figure 1 shows an example of this task. From a text description of a molecule, the model must incorporate the information in the description into a semantic representation which can be used to directly retrieve the molecule.
This requires the integration of two very different types of information: the structured knowledge represented by text and the chemical properties present in molecular graphs. We assume there is only one correct (relevant) molecule for each description, so we consider two measures for this task: Hits@1 and mean reciprocal rank (MRR).
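As a concrete sketch, the two measures can be computed as follows (plain Python, with toy ranks rather than real model output):

```python
# Minimal sketch of the two evaluation measures used for Text2Mol:
# Hits@1 and mean reciprocal rank (MRR), given one gold molecule per query.

def hits_at_1(ranks):
    """ranks: rank of the correct molecule for each query (1 = best)."""
    return sum(r == 1 for r in ranks) / len(ranks)

def mrr(ranks):
    """Mean of the reciprocal ranks of the correct molecules."""
    return sum(1.0 / r for r in ranks) / len(ranks)

ranks = [1, 3, 1, 10]        # toy ranks for four queries
print(hits_at_1(ranks))      # 0.5
print(mrr(ranks))            # (1 + 1/3 + 1 + 0.1) / 4 ≈ 0.608
```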

Multimedia Representation
Much recent work in this area has fallen into the category of vision-language models which leverage transformers (Su et al., 2020). There are also more fine-grained multimedia embedding approaches, such as integrating events from images and their descriptions or multimodal pattern mining (Li et al., 2016). CLIP (Radford et al., 2021) uses natural language to train a zero-shot image classifier which can be easily applied to different datasets. Specifically, their loss function, which follows Sohn (2016), serves as a very efficient version of binary cross-entropy loss by comparing all samples in a mini-batch with each other. To our knowledge, we are the first to apply this technique to molecules and text, and we also extend this loss function to incorporate negative samples to allow for cross-modal attention between the two encoders.

Molecule Representation
One critical problem in the field of molecular machine learning is molecule representation. Fingerprinting methods have long been employed in cheminformatics to featurize molecule structural representations (Cereto-Massagué et al., 2015;Sandfort et al., 2020). However, this approach does not allow these representations to be learned from the data. Other representations include techniques such as kernel PCA using Tanimoto similarity (Rensi and Altman, 2017;Mallory et al.). Recent advances in machine learning have begun to be applied to this problem. Jaeger et al. (2018) use the Morgan fingerprinting algorithm to convert each molecule into a 'sentence' of its substructures. A dataset of molecules can be interpreted as a corpus, and Mol2vec then applies Word2vec (Mikolov et al., 2013a,b) to create molecule representations. Additionally, other recent advances such as BERT (Devlin et al., 2019) have been applied to the domain such as MolBERT (Fabian et al., 2020) and ChemBERTa (Chithrananda et al., 2020), which use SMILES strings (Weininger et al., 1989) as inputs to pretrain a BERT-esque model.

Substructure or Description Retrieval
Although the biomedical domain has been more popular than chemistry (Zheng et al., 2014; Islamaj Dogan et al., 2019; Zhang et al., 2021; Lai et al., 2021), information retrieval in chemistry has long been studied and is summarized by Krallinger et al. (2017). Most work has focused on only a single modality: text or molecules. Text-based retrieval includes tasks such as finding relevant papers for a chemical or reaction and chemical entity recognition. Much work has also been done in graph and molecule-based retrieval (Hagadone, 1992; Barnard, 1993; Yan et al., 2005; Kratochvíl et al., 2018; Qu et al., 2019; Kratochvíl, 2019; Goyal et al., 2020). Hybrid approaches have also been attempted; Zhou et al. (2010) replace chemical entities in text with a unique canonical key (thus standardizing synonyms). This also allows them to perform query expansion by including similar molecules from their database. In contrast, our approach performs a direct semantic cross-modal retrieval task, as opposed to just augmenting queries. Work in chemical entity recognition has also incorporated hybrid approaches, mostly as chemical name to structure converters such as ChemSpot (Rocktäschel et al., 2013) and OPSIN (Lowe et al., 2011).

Cross-Lingual Retrieval
Cross-lingual information retrieval (CLIR) is a technique to retrieve documents from a target language given a query in a different source language. Two common strategies are either translating the query into the target language or translating the document corpus into the source language (Zhang and Zhao, 2020). Further, work exists combining these approaches using interlingual semantics, such as via bilingual word embeddings (Vulic and Moens, 2015) or word embeddings and a dictionary (Bhattacharya et al., 2016).
Our problem, cross-modal molecule retrieval from text, can be considered as a CLIR task which we approach using an interlingual semantic approach. The model is trained on a parallel corpus of molecules and descriptions.

Model
To accomplish this retrieval task, we need to connect text to molecules. To do so, we build an aligned semantic embedding space. Our approach consists of two distinct submodels: a text encoder and a molecule encoder. Both submodels create an embedding in the aligned space, and cosine similarity is used to rank the embeddings. A description embedding can be compared against a database of existing molecule embeddings, and this process scales easily using an approximate nearest neighbor search algorithm such as that of Johnson et al. (2017). For the text encoder, we use SciBERT (Beltagy et al., 2019) and a linear projection to the embedding space followed by layer normalization (Ba et al., 2016). For the molecule encoder, we consider two architectures. First, we use a multi-layer perceptron (MLP) that takes Mol2vec embeddings as input. Second, we integrate a graph convolutional network (GCN) (Kipf and Welling, 2017) into Mol2vec.

Figure 2: Morgan fingerprinting (Rogers and Hahn, 2010) for Butyramide. The algorithm updates the identifiers from radius r = 0 to r = 1, as shown by the green circles.
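The retrieval step can be sketched as follows; the embeddings and molecule identifiers here are made-up toy values, and a production system would use an approximate nearest-neighbor index rather than the exhaustive scoring shown:

```python
# Sketch of retrieval in the aligned embedding space: embed the query,
# then rank all candidate molecules by cosine similarity.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def retrieve(query_emb, molecule_embs):
    """molecule_embs: dict mapping molecule id -> embedding.
    Returns molecule ids sorted from most to least similar."""
    return sorted(molecule_embs,
                  key=lambda m: cosine(query_emb, molecule_embs[m]),
                  reverse=True)

# Toy 2-d "database" of molecule embeddings
db = {"H2O": [0.9, 0.1], "NaCl": [0.1, 0.9], "C6H6": [0.7, 0.7]}
print(retrieve([1.0, 0.0], db))  # 'H2O' is ranked first
```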
Mol2vec (Jaeger et al., 2018) converts molecule graph structures into "sentences" of substructures. These substructures are created using Morgan fingerprinting (Rogers and Hahn, 2010), a type of topological fingerprint historically used for quick substructure lookup. Morgan fingerprints incorporate a number of molecular properties based on the Daylight atomic invariants rule (Weininger et al., 1989). Atomic invariants such as the number of connections, number of non-hydrogen bonds, and atomic number are used to create the initial identifier for an atom. By using a circular hashing technique, they are able to create a unique identifier for a molecular substructure of some radius r centered around a central atom, as shown in Figure 2. The algorithm starts with a radius of zero, which is iteratively increased until the desired substructure size is obtained. In Mol2vec, these fingerprints are used as tokens for each atom. In this work, we use a default value of r = 1, which gives two tokens for each atom (r = 0 and r = 1). This set of tokens is canonicalized in the same way as the canonical SMILES representation (Weininger et al., 1989). The list of tokens can be interpreted as a "sentence", and Mol2vec builds a corpus of such sentences. It then uses the Word2vec skip-gram (Mikolov et al., 2013a,b) algorithm to create "word" embeddings, which it averages together to create molecule representations. We use a two-layer MLP followed by a linear projection and layer normalization to create a trainable representation from the Mol2vec embedding.
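To illustrate the idea of iterative, radius-growing substructure identifiers, here is a toy neighborhood-hashing sketch; the hashing scheme and identifiers are illustrative only and do not match RDKit's actual Morgan implementation or Mol2vec's token vocabulary:

```python
# Toy sketch of iterative neighborhood hashing in the spirit of Morgan
# fingerprinting (identifiers are illustrative, not RDKit-compatible).

def morgan_like_tokens(atoms, bonds, radius=1):
    """atoms: list of atom symbols; bonds: list of (i, j) index pairs.
    Returns one identifier per atom per radius (2 tokens/atom for radius=1),
    mirroring how Mol2vec turns a molecule into a 'sentence'."""
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    # r = 0: initial identifier from the atom's local invariants
    ids = [hash((sym, len(neighbors[i]))) for i, sym in enumerate(atoms)]
    sentence = [[x] for x in ids]
    for _ in range(radius):
        # r -> r+1: combine each atom's identifier with its (sorted)
        # neighbor identifiers to describe a larger substructure
        ids = [hash((ids[i], tuple(sorted(ids[j] for j in neighbors[i]))))
               for i in range(len(atoms))]
        for i, x in enumerate(ids):
            sentence[i].append(x)
    return sentence

# Water: O bonded to two H atoms
tokens = morgan_like_tokens(["O", "H", "H"], [(0, 1), (0, 2)])
print(tokens)  # 3 atoms x 2 radii; the two equivalent H tokens are identical
```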
While the Morgan fingerprints (substructure tokens) incorporate some implicit graph information, we explicitly introduce the molecule graph structure using a GCN that takes a molecular graph as input with Mol2vec token embeddings as features. For example, rings are very important substructures in molecules. If the description mentions "aromatic ring" or "phenyl group," we want to be able to match this substructure in the molecule. We could potentially do so by increasing the maximum radius of the Morgan fingerprinting algorithm, but then there might not be enough examples of the resulting large-radius tokens to create a good representation given our corpus size. Particularly for large molecules, to capture the global structural information, we might need a very large radius which will create a lot of rare tokens (that get replaced by the UNK token). Instead, we explicitly incorporate the graph structure by using a GCN.
The Mol2vec token features are input to a three-layer GCN to create node representations for each atom in the molecule. These representations are combined using global mean pooling and passed through two more hidden layers to produce a molecule representation. Since Mol2vec produces multiple tokens based on Morgan fingerprints of different radii, we select the corresponding token with the largest radius.
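A single propagation step with mean pooling can be sketched in plain Python as follows; the features and identity weight matrix are toy values, whereas the real model uses three GCN layers, 600 hidden units, and learned weights:

```python
# Minimal sketch of one GCN propagation step (Kipf & Welling, 2017)
# followed by global mean pooling over the atom representations.
import math

def gcn_layer(X, edges, W):
    """X: n x d node features; edges: undirected (i, j) pairs; W: d x h weights.
    Computes relu(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(X)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # self-loops
    for i, j in edges:
        A[i][j] = A[j][i] = 1.0
    deg = [sum(row) for row in A]
    H = []
    for i in range(n):
        # aggregate symmetrically normalized neighbor features
        agg = [sum(A[i][j] / math.sqrt(deg[i] * deg[j]) * X[j][k]
                   for j in range(n)) for k in range(len(X[0]))]
        # linear transform + ReLU
        H.append([max(0.0, sum(agg[k] * W[k][h] for k in range(len(W))))
                  for h in range(len(W[0]))])
    return H

def mean_pool(H):
    """Global mean pooling: average node vectors into one molecule vector."""
    n = len(H)
    return [sum(row[k] for row in H) / n for k in range(len(H[0]))]

# Three atoms in a path graph, 2-d features, identity weights
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H = gcn_layer(X, [(0, 1), (1, 2)], [[1.0, 0.0], [0.0, 1.0]])
print(mean_pool(H))  # a single 2-d molecule-level vector
```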

Cross-Modal Attention Model
To improve the explainability of our approach, we introduce a model with cross-modal attention by modifying the base model to use a transformer decoder (Vaswani et al., 2017). This decoder uses the SciBERT output as a source sequence and the node representations from the Mol2vec GCN model as a target sequence, and the attentions can be used to learn multimodal association rules. The architecture is shown in Figure 3.

Loss
To optimize the models, we base our loss on the symmetric contrastive loss presented by Radford et al. (2021). The loss takes the output embeddings of both submodels, multiplies by the exponent of a learned temperature parameter τ, and then takes the outer product of the mini-batch. The identity matrix I is used as labels. Categorical cross-entropy (CCE) is then applied along both axes, and the two losses are summed. This improves efficiency by allowing all the other samples in a mini-batch to serve as negatives. It corresponds to cosine similarity because the normalized dot product is maximized for positives and minimized for negatives. For batch embeddings m and t of length n,

L(m, t) = CCE(e^τ m t^T, I_n) + CCE(e^τ t m^T, I_n)

We find this loss to be ineffective for training the cross-modal attention model because it encourages the model to ignore the textual information, i.e., information can leak from one encoder to the other. To remedy this problem, we modify this loss function to incorporate a matching task by introducing negative samples. We randomly sample new descriptions and replace their respective entries in the diagonal of the identity matrix with zeros, creating a binary classification task: does the description match the molecule? Since the rows with all zeros are no longer probability distributions, we instead use binary cross-entropy loss. This modified loss provides more signal than a pure matching task, since it also receives signal from the other negatives, and it enforces the constraint that the model consider both the molecule and the text description.
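The symmetric loss can be sketched in plain Python on toy 2-d embeddings; a real implementation would operate on GPU tensors, and the inputs here are assumed to be L2-normalized:

```python
# Sketch of the symmetric contrastive loss (after Radford et al., 2021).
import math

def cce_rows(logits):
    """Mean categorical cross-entropy with the identity as labels:
    row i's target class is i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]          # -log softmax(row)[i]
    return total / len(logits)

def symmetric_contrastive_loss(M, T, tau=0.07):
    """M, T: n L2-normalized embeddings (molecule and text)."""
    scale = math.exp(tau)                # scale by exponent of temperature
    logits = [[scale * sum(a * b for a, b in zip(m, t)) for t in T] for m in M]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for the text axis
    return cce_rows(logits) + cce_rows(logits_t)

# Correctly aligned pairs score a lower loss than shuffled ones
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(symmetric_contrastive_loss(aligned, aligned))
print(symmetric_contrastive_loss(aligned, shuffled))
```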

Cross-Modal Reranking
We want to better understand how the base networks work, so we introduce a modified model with cross-modal attention, which we also use to rerank the output of the base models. Given a training set of molecule-text pairs, P, we first train the cross-modal matching model. We collect the attention weights of the final layer for each of these pairs. Next, the attention weights for a molecule token m and a text token t are aggregated to create association rules. We define the support for a rule from t to m as the sum over all pairs of the attention weights a_{i,j} between occurrences of t and m, where p_t and p_m are the multisets of text and molecule tokens in a pair p, respectively. This produces association rules from every text token t to every molecule token m. We calculate the confidence for each of these rules by dividing the support of the rule by the total support of all rules using t. Following this, given a molecule and text pair, we consider all association rules that can produce it, and we take the average of the top k confidence values. For association rule-based reranking, Bharadwaj et al. (2014) take the average of all confidence values; however, they have a comparatively smaller number of rules. On the other hand, AnyBURL (Meilicke et al., 2019, 2020) finds maximum aggregation to be most effective; it also shows that rule-based approaches can be very efficient (Ott et al., 2021). For our approach, we want to consider multiple one-to-one rules: we only use rules from one text token to one molecule token, since the computational cost of many-to-many rules scales combinatorially. By averaging only the top confidence values, we incorporate multiple one-to-one rules while ignoring unimportant ones. This combines the two approaches to reranking while keeping efficiency in mind.
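The rule-mining and scoring steps above can be sketched as follows; the token names and attention weights are made up for illustration:

```python
# Sketch of aggregating cross-modal attention weights into association rules.
from collections import defaultdict

def mine_rules(pairs):
    """pairs: list of dicts mapping (text_token, mol_token) -> attention weight.
    Returns conf(t -> m) = supp(t -> m) / sum over m' of supp(t -> m')."""
    support = defaultdict(float)
    for attn in pairs:
        for (t, m), a in attn.items():
            support[(t, m)] += a
    total_by_text = defaultdict(float)
    for (t, m), s in support.items():
        total_by_text[t] += s
    return {(t, m): s / total_by_text[t] for (t, m), s in support.items()}

def rule_score(rules, text_tokens, mol_tokens, k=10):
    """Average of the top-k confidences among rules applicable to a pair."""
    confs = sorted((rules.get((t, m), 0.0)
                    for t in text_tokens for m in mol_tokens), reverse=True)
    top = confs[:k]
    return sum(top) / len(top) if top else 0.0

rules = mine_rules([{("pollutant", "F-C"): 0.9, ("pollutant", "C-C"): 0.1},
                    {("acid", "O-H"): 0.8, ("pollutant", "F-C"): 0.6}])
print(rule_score(rules, ["pollutant"], ["F-C"], k=1))  # 1.5 / 1.6 = 0.9375
```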
We calculate the final score by linearly interpolating between the cosine similarity and the association rule-based score (AR), score = α · cos + (1 - α) · AR, where α ∈ [0, 1] is selected on the validation set.

Ensemble Approach
Upon investigation of the baseline models, we found that the correct molecule was very frequently found among the top molecules. However, many of the molecules ranked above the correct molecule did not occur in the top results of the same model trained with a different parameter initialization. We found that by taking an average of these rankings, the correct molecule's average rank stays roughly the same, but the average rank of the false positives increases. When these average ranks are used to reorder the results, the order of the incorrect and correct molecules switches. We find this method to be surprisingly effective, and we connect it to committees of neural networks (Drucker et al., 1994) in ensemble learning (Polikar, 2012). Additionally, we draw comparisons to Mixture-of-Experts models (Masoudnia and Ebrahimpour, 2014) such as Fan et al. (2006) and the Switch Transformer, which contains 1.6 trillion parameters (Fedus et al., 2021). We compute the score S for a molecule m as a weighted average of ranks, S(m) = Σ_i w_i R_i(m), where R_i is the rank assigned to that molecule by model i and w_i is the model weight. A lower score is more desirable.
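The rank-averaging scheme can be sketched as follows, with toy rankings standing in for real model output:

```python
# Sketch of rank-averaging ensembling: each model contributes a ranking,
# and candidates are reordered by weighted average rank (lower is better).

def ensemble_rank(rankings, weights=None):
    """rankings: list of dicts mapping candidate -> rank (1 = best).
    Returns candidates sorted by weighted average rank."""
    if weights is None:
        weights = [1.0 / len(rankings)] * len(rankings)  # uniform heuristic
    candidates = rankings[0].keys()
    score = {c: sum(w * r[c] for w, r in zip(weights, rankings))
             for c in candidates}
    return sorted(candidates, key=score.get)

# The correct molecule ("H2O") sits stably near the top in every model,
# while each model's false positives differ, so averaging pushes them down.
model_a = {"H2O": 2, "CO2": 1, "CH4": 3}
model_b = {"H2O": 2, "CO2": 3, "CH4": 1}
model_c = {"H2O": 1, "CO2": 3, "CH4": 2}
print(ensemble_rank([model_a, model_b, model_c]))  # ['H2O', 'CH4', 'CO2']
```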

Data and Evaluation
For our task, we create a dataset using PubChem (Kim et al., 2016) and Chemical Entities of Biological Interest (ChEBI) (Hastings et al., 2016). We collect ChEBI annotations of compounds scraped from PubChem, which consist of 102,980 compound-description pairs. Using this data, we create a dataset of 33,010 pairs, which we call ChEBI-20, containing only descriptions of more than 20 words. We find that longer descriptions tend to be less noisy and more informative. We remove compounds which cannot be processed by RDKit (Landrum, 2021). We separate these datasets into 80%/10%/10% train-validation-test splits. The alignment models are trained on the training data, and the results are evaluated by searching all molecules in the dataset. The molecules in the training set are processed by Mol2vec using default parameters: a radius of 1, a threshold for unknown tokens of 3, an embedding dimension of 300, and a window size of 10.

Results
To train the models, we use the Adam optimizer (Kingma and Ba, 2015) with two different learning rates. The SciBERT model uses a fine-tuning learning rate of 3e-5, as used by Devlin et al. (2019). The rest of the model uses 1e-4, as used by Vaswani et al. (2017). We linearly anneal the learning rate with 1,000 warmup steps. We train for 40 epochs with a batch size of 32. We also use a temperature value of 0.07, as suggested by Radford et al. (2021). We use the first 256 text tokens for the text encoder.

Baseline Models
The MLP and GCN encoder models both show similar performance. Three results for both are shown in Table 2. We believe the performance similarity between MLP and GCN is because the description is a bottleneck. However, they appear to be effective at different tasks. In the test set, the mean rank is significantly lower for the MLP models than the GCN models; however, the MRR values are fairly similar. This indicates that these two architectures have different strengths. Further, the difference in mean rank is much smaller in the validation set; the validation mean rank is 30.60 and 28.89 for the MLP and GCN ensembles respectively. This indicates that the GCN architecture is more effective for retrieving the most difficult examples in the validation set (since there are not outlier ranks to increase the mean), but the MLP is more effective at difficult examples in the test set. We further examine this in Section 5.4.

Ensemble
We find that the ensemble method shows significant performance improvements. The ensemble of the three GCN models increases Test Hits@1 by roughly 8% from the baseline models. It is notable that the hyperparameters for these models are exactly the same, and the models are learning different ways of ranking which are complementary. To combine the different models, we find the heuristic of using uniform weights to be very effective.
A further advantage of the ensemble approach is that it can incorporate different encoder architectures and retrieval schemes, which may have different understandings of how to solve the problem. We find that combining both architectures is more effective than either alone; this is shown in Figure 4. Ensembles that only incorporate one architecture are consistently outperformed by models that incorporate both. For example, using 3 MLP models has an MRR of 0.442, but using 2 MLP and 1 GCN model has an MRR of 0.449.

Table 2: Results. FPGrowth is the frequent pattern growth algorithm (Han and Pei, 2000). Models 1, 2, and 3 differ only in initial parameter initialization.

Cross-Modal Attention and Reranking
To better understand the behavior of the model, we apply cross-modal attention using a transformer decoder with 3 layers, and we rerank the top 10 results of MLP1 using the 10 most confident association rules. We find that cross-modal reranking slightly improves our baseline model and outperforms traditional association rule mining, which can be accomplished by the FPGrowth algorithm (Han and Pei, 2000). Hits@1 for the baseline MLP model is increased by about 0.4%, whereas normal association rules only improve it by 0.2%. Mining these rules using attention also allows us to understand the connections the model is making. Examples of these rules are shown in Table 3. We primarily examine one-to-one rules; however, these one-to-one rules will often "split" the confidence among themselves. For example, toluene is a ring containing different substructures, so multiple one-to-one rules are required to capture the substructure. The rules from toluene to the three common substructure tokens in toluene have increased confidence and support. Since we average the confidence values of all applicable rules, this is accounted for in reranking.
One interesting phenomenon we find is that the model is very interested in O-H structures (hydroxyl groups). It is also interested in positively charged metal ions in salts. The token "acid" has many different rules; however, the most confident is a hydroxyl (-OH) group, which matches basic chemical properties of acids. Rules involving rare tokens can result in high confidence values. For example, the rule from "crown" to C-C-O has a confidence of 0.325. This is because the dataset contains two "crown ether" molecules, which have multiple occurrences of C-C-O.

Figure 5: Example query descriptions. Cannabidiolate is a dihydroxybenzoate that is the conjugate base of cannabidiolic acid, obtained by deprotonation of the carboxy group. It derives from an olivetolate. It is a conjugate base of a cannabidiolic acid. Inositol: Myo-inositol is an inositol having myo-configuration. It has a role as a member of compatible osmolytes, a nutrient, an EC 3.1.4.11 (phosphoinositide phospholipase C) inhibitor, a human metabolite, a Daphnia magna metabolite, [...] Argyssfrywff: Ala-Arg-Gly-Tyr-Ser-Ser-Phe-Arg-Tyr-Trp-Phe-Phe is an oligopeptide composed of L-alanine, L-arginine, glycine, L-tyrosine, L-serine, L-serine, L-phenylalanine, L-arginine, L-tyrosine, L-tryptophan, L-phenylalanine and L-phenylalanine joined in sequence by peptide linkages.

Qualitative Analysis
Our technique is capable of retrieving large, complicated molecules as well as small ones. For example, it successfully retrieves both Argyssfrywff (C 79 H 99 N 19 O 17 ) and Inositol (C 6 H 12 O 6 ), shown in Figure 5. Argyssfrywff shows that the model is capable of composing molecules from constituent parts mentioned in the description.
The MLP and GCN models capture different aspects of the molecules, leading to different rankings. For example, MLP-Ensemble ranks an alpha-mycolic acid (C15H26O3) at 43; GCN-Ensemble ranks it 3rd. The compound contains cyclopropyl groups (the triangles), shown in Figure 6, which the GCN captures. On the other hand, Clondronate(2-) (CH2Cl2O6P2 2-) is ranked 4,915 by the GCN but 61 by the MLP, showing large differences exist between the architectures. The models are also mutually beneficial; 2-Methylideneglutaric acid (C6H8O4) is ranked 2nd by MLP and 3rd by GCN, but it is ranked 1st by All-Ensemble. Individual models trained identically (but with different initial parameters) also show this phenomenon. GCN 1, 2, and 3 rank Pierreione C (C27H28O6) 2nd. GCN 1 ranks Aspernidine A 1st, but it is ranked 49 and 64 by GCN 2 and 3, respectively. The average rank of Aspernidine A becomes 38, so GCN-Ensemble ranks Pierreione C 1st.

Figure 6: Example queries that are ranked incorrectly by All-Ensemble. Fura red is a 1-benzofuran substituted at position 2 by a (5-oxo-2-thioxoimidazolidin-4-ylidene)methyl group, and at C-5 and C-6 by heavily substituted oxygen and nitrogen functionalities [...] Clondronate(2-) is the dianion resulting from the removal of two protons from clondronic acid. It is a conjugate base of a clodronic acid. An alpha-mycolic acid is a class of mycolic acids characterized by the presence of two cis cyclopropyl groups in the meromycolic chain. It is an organic molecular entity and a mycolic acid. [...]
The model is able to ignore irrelevant description information. For example, MLP achieves rank 1 for Rostratin D (C18H20N2O6S4), whose description includes the unique and likely unuseful section "isolated from the whole broth of the marine-derived fungus Exserohilum rostratum." Instead, the model successfully identifies it from the following attributes: "bridged compound, a cyclic ketone, a lactam, an organic disulfide, an organic heterohexacyclic compound, a secondary alcohol, a dithiol and a diol." There are some very challenging queries where multiple molecules are very similar. For example, Pro-Arg and Arg-Pro share the same chemical formula C11H21N5O3. Fura red (C41H44N4O20S) is the most challenging query for the model; it is ranked at 8,320 by All-Ensemble. Its entire description is based on 1-benzofuran, but the substitutions are each larger than the original molecule and poorly defined.

Remaining Challenges
One further challenge is integrating external domain knowledge. Many current errors can be eliminated by applying this information, such as assuming "oxide" means the molecule should contain an oxygen. Although our association rule approach learns some of these, external knowledge can provide stronger rules. We observe that descriptions appear to be the limiting factor in this model, which is consistent with the similar performance of the GCN and MLP encoders. Comprehensive techniques for extracting information from external knowledge could lead to significant improvements, which we leave for future work.

Conclusions and Future Work
In this work, we present Text2Mol: a novel and challenging cross-modal information retrieval task to retrieve molecules using natural language descriptions. To tackle this problem, we apply contrastive representation learning to a BERT-based text encoder and both MLP- and GCN-based molecule encoders. We show that these models are complementary and that an ensemble approach combines them very effectively. We also show that the ensemble approach is effective for combining identically trained neural networks (with different parameter initializations), and we consider attention-based association rules. Improved encoder architectures will likely yield improvements, and further investigation of how architectural choices affect these rules and their interactions for reranking may be interesting as well. In the future, we plan to further improve results by integrating external knowledge as constraints. It should also be noted that this task is possible in the reverse direction, from molecules to descriptions. This has many possible applications, such as finding relevant descriptions for newly discovered molecules.

B Reproducibility
The MLP and GCN models were each run three times. The GCN and MLP use 600 hidden units. The Mol2vec input and the model outputs are 300-dimensional. The GCN uses the substructure representations with the largest radius. The MLP contains 110,871,865 parameters; the GCN contains 111,953,665 parameters. The cross-modal attention model contains 128,978,441 parameters and attends to the first 512 molecule substructures. It achieves about 97% classification accuracy for the matching task on the negative samples. The number of one-to-one association rules with confidence greater than 0.1 and support greater than 2 is 1,835. The MLP and GCN each take approximately 7 hours on an NVIDIA V100, and the cross-modal attention model takes approximately 9 hours. We find that early stopping is not useful and that layer normalization increases training speed. The value of α for reranking was selected by grid search for high validation MRR. For the metrics, given the list of ranks R of the correct molecules, MRR is the mean of the reciprocal ranks, (1/|R|) Σ_{r∈R} 1/r, and Hits@1 is the fraction of queries for which r = 1.