Monte Carlo Thought Search: Large Language Model Querying for Complex Scientific Reasoning in Catalyst Design

Discovering novel catalysts requires complex reasoning involving multiple chemical properties and resultant trade-offs, leading to a combinatorial growth in the search space. While large language models (LLMs) have demonstrated novel capabilities for chemistry through complex instruction-following capabilities and high-quality reasoning, a goal-driven combinatorial search using LLMs has not been explored in detail. In this work, we present a Monte Carlo Tree Search-based approach that improves beyond state-of-the-art chain-of-thought prompting variants to augment scientific reasoning. We introduce two new reasoning datasets: 1) a curation of computational chemistry simulations, and 2) diverse questions written by catalysis researchers for reasoning about novel chemical conversion processes. We improve over the best baseline by 25.8% and find that our approach can augment scientists' reasoning and discovery process with novel insights.


Introduction
Scientific discovery thrives on uncovering the optimal combinations of factors that maximize a property of interest. For example, to discover new efficient fuels (Yang et al., 2019; Tran et al., 2023; Zitnick et al., 2020) or chemical conversion processes requiring less energy, a scientist would need to consider the chemical reaction, the reactants that undergo the reaction, the catalysts that improve the rate of reaction, and find the optimal combination of operating conditions (Fig. 2). Mathematically, one could represent this as an optimization problem where we model a chemical process as a function and formulate the search problem as finding the optimal combination of all process parameters that minimizes a cost function modeled around energy efficiency. For highly empirical fields such as chemistry, these combinatorial searches require expert reasoning with knowledge of the scientific literature that dates back a century. The emerging capability of large language models (LLMs) (Wei et al., 2022; Ouyang et al., 2022; Taylor et al., 2022; Lai et al., 2023; OpenAI, 2023) provides an opportunity to automatically reason with a large knowledge space in a human-interpretable way.
Despite their promise, the brittleness of language models to their inputs and hallucination remain areas for concern (Creswell and Shanahan, 2022; Taylor et al., 2022). Our initial investigation of LLMs revealed that basic prompting (such as "What is a good catalyst for reaction X?") leads to basic answers that could be found on a Wikipedia page. To improve the quality of answers, one can incorporate desirable properties into the prompt, which leads the LLM to produce more specific answers (such as "What is a good catalyst with low cost for reaction X?"). Additionally, LLMs often hallucinate, producing answers without grounding in scientific fact. Achieving accurate, highly specific answers that use key technical terminology (Fig. 2) is essential to earn the scientific community's trust and pave the way for the adoption of machine reasoning systems.

Figure 2: Illustration of the combinatorial thinking used by human experts to reason about a catalyst (best viewed in color). They successively "think in terms of" different constraints and factors, each of which is related via scientific principles, and narrow down the set of possible candidates. Our Monte Carlo Reasoner emulates such cognitive thinking by prompting a language model with different combinations, yielding a tree-structured space of queries and potential candidates, and returns the optimal answer via efficient exploration of the possible space.
In this work, we focus on the problem of prompting an LLM to return the top-k catalysts for a given chemical reaction and generating the reasoning for each candidate. In collaboration with researchers from the catalysis community, we develop a new dataset, BioFuels Question Reasoning (BioFuelQR), consisting of complex reasoning questions and answers. We observe the reasoning pathways used by domain scientists and conclude that it is important for LLMs to progress from "thinking step-by-step" to "thinking step-by-step in terms of relevant properties". In this setting, we are given a question which has some relevant catalyst properties P, |P| = n (e.g., {"crystal planes", "toxicity"}) and we want to identify the best subset R ⊂ P, |R| = r of properties for the language model to "think" in terms of. Considering that language models are sensitive to permutations of their inputs, there are $P^n_r = \frac{n!}{(n-r)!}$ possible prompts to search through. This goal can be accomplished by learning to prompt the LLM with the most relevant subset of properties (Deng et al., 2022) or decomposing the set into a sequence of chained queries (Dohan et al., 2022). In both cases, identification of the prompt-generating property subset becomes the limiting factor.
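To make the size of this search space concrete, the following minimal sketch counts and enumerates the ordered property subsets with Python's standard library. Only the first two property names come from the text above; the others, and the function names, are illustrative assumptions.

```python
import itertools
import math

# Illustrative property list; only the first two names appear in the text.
properties = ["crystal planes", "toxicity", "cost", "selectivity"]


def count_ordered_subsets(n: int, r: int) -> int:
    """Number of ordered selections of r out of n properties: n! / (n - r)!."""
    return math.perm(n, r)


def ordered_subsets(props, r):
    """Enumerate every ordered selection of r properties."""
    return list(itertools.permutations(props, r))


n, r = len(properties), 2
print(count_ordered_subsets(n, r))         # size of the prompt space for r = 2
print(ordered_subsets(properties, r)[:3])  # first few ordered subsets
```

Even for modest n the count grows quickly (n = 10, r = 3 already gives 720 orderings), which is why exhaustive prompting is impractical.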
To solve this problem, we propose the Monte Carlo Reasoner (MCR), a generic heuristic search methodology that addresses the combinatorial challenge of query decomposition. Considering the practical challenges of learning prompts that are both human comprehensible (a key consideration for scientists) and provide the best performance (Deng et al., 2022), we pursue a stochastic, heuristic search-based approach that leverages LLMs trained on scientific literature with sophisticated instruction-following capabilities (Ouyang et al., 2022).
We formulate the task as a search problem in which an agent performs a query in an uncertain environment (represented by the LLM) and determines a query variant to pursue based on the evaluated reward. Given an initial query, we construct a tree structure of these unique query variants in order to progressively refine the original query (the root) into property-specific variations (the leaves). Our methodology demonstrates improvement over basic querying of the LLM without any additional training of the LLM. Instead, we use a Monte Carlo Tree Search (MCTS) algorithm to perform a stochastic search over the existing knowledge space of an LLM to achieve more scientifically valuable answers.
Our second major contribution is demonstrating the efficacy of using a scientific domain-specific reward function in LLM-based computations for our top-k catalyst problem. Estimation of the "adsorption energy" of a chemical structure is at the core of developing efficient chemical reactions (see Appendix A.2 for details). Finding catalysts that can enable chemical reactions with the least amount of external energy is key to developing environmentally friendly industrial processes. In this work, we implement such energy-function-specific considerations via an LLM-derived reward function. Our experiments (using questions detailed in Table 4) show that even a simplistic reward function dramatically improves the specificity of answers and their associated reasoning from the LLM. In summary, we make the following contributions:
1. We present Monte Carlo Reasoner (MCR), an algorithm to prompt LLMs for zero-shot complex reasoning tasks involving combinatorial search.
2. We introduce a new chemistry-focused dataset, BioFuelQR, that captures key reasoning challenges in hypothesis generation and testing faced daily by scientists. We present in-depth qualitative analysis of MCR on BioFuelQR.
3. We demonstrate that a domain-specific reward function that represents a fundamental scientific concept can lead to dramatic improvement in the quality and specificity of LLM answers.

Monte Carlo Reasoner
Problem definition. Our goal is to find the optimal prompt, P_o, which leads the LLM to return the best candidate catalysts for a specific problem. Starting with a general initial prompt P_0, we use a set of actions to automatically modify the prompt to improve the LLM output with respect to a reward function, R.
For instance, suppose P_0 is the prompt given in Figure 2 (left). Each prompt is a template, where we use actions a ∈ A to create better prompts, based on how experts might modify their own queries, so that the LLM will suggest superior catalysts. See Appendix C.1 for a more detailed explanation of the actions and prompt. By modifying prompts, we create a tree of prompts, answers, and rewards, as demonstrated in Figure 1. We call a path from the root to a leaf node a "reasoning pathway". These reasoning pathways can be constructed in several different ways. For instance, we can take an action to introduce additional catalyst properties to consider (such as "composition of metals" and "electronic structure" in Fig. 2 (right)) so that the LLM will include or exclude certain catalysts in its answer. Also, for each prompt P after P_0, we include the answer of P's parent node in P to provide the LLM with additional context about the previous answer. Further, at each node, we prompt the LLM to produce catalysts with either "new elements", "similar elements", or "different elements" to the parent node's answer candidates (switching between these possibilities is an action). Finally, we can take an action to change the type of catalyst requested (unary, binary, ternary, or oxide catalysts). Clearly, the number of possible reasoning pathways increases drastically with tree depth due to the possible combinations of actions. Thus, we apply Monte Carlo Tree Search, an efficient method to optimize a sequence of actions with a reward function, R.
In MCTS, each prompt P is stored as a node in a tree T, where edges are prompt-action pairs (P_i, a_j). The search tree decides at each prompt which action to take to obtain the best reward based on previous results. Typically, prompt-action pairs are weighted by a policy, which determines a priori the importance of each action for a prompt, given as prior probabilities. Here, we assign equal weight to all possible actions. Impossible actions are assigned a weight of 0 (see Appendix C.2).
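The equal-weight prior can be sketched as follows. This is a simplified illustration and not the authors' implementation: the `PromptState` fields, the action encoding as `(kind, value)` tuples, and the two validity checks are assumptions, and the check covers only part of the impossibility rules.

```python
from dataclasses import dataclass, field


@dataclass
class PromptState:
    """Hypothetical prompt state; field names are illustrative assumptions."""
    include: list = field(default_factory=list)
    exclude: list = field(default_factory=list)
    candidates: list = field(default_factory=list)


def is_possible(state: PromptState, action: tuple) -> bool:
    """Simplified validity check covering two kinds of impossible action."""
    kind, value = action
    if kind == "add_include":
        return value not in state.include      # property already listed
    if kind == "add_exclude":
        return value not in state.exclude
    if kind == "change_relation":
        return bool(state.candidates)          # needs a previous candidate list
    return True


def action_priors(state: PromptState, actions: list) -> list:
    """Equal prior over possible actions, zero for impossible ones."""
    possible = [is_possible(state, a) for a in actions]
    n_possible = sum(possible)
    if n_possible == 0:
        return [0.0] * len(actions)
    return [1.0 / n_possible if ok else 0.0 for ok in possible]


state = PromptState(include=["low cost"])
actions = [
    ("add_include", "low cost"),
    ("add_include", "toxicity"),
    ("change_relation", "similar to"),
    ("change_catalyst_type", "binary"),
]
print(action_priors(state, actions))
```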
In MCTS, each edge stores a count N(P, a), a weight representing a prior probability p(P, a), and the total downstream reward V(P, a), where

$$V(P, a) = \sum_{P'} \gamma^{d} R(P') \qquad (1)$$

Here, the sum runs over the discovered successors P′ of P reached through action a, γ is a discount factor, and d is the (tree) distance of P′ from P. If there are no discovered successors to P, then we set V(P, a) = 0. The search determines the next action to take with the policy P(V, N, p) (Equation 2), where c is an exploration-exploitation trade-off. The simulation starts at the root node each time and traverses the constructed tree until a new state is reached. Then, its answer and reward are calculated and stored, and the upstream values of V and N are updated. This is repeated to generate the desired number of prompts (in our case 300). MCTS is superior to re-sampling methods because it avoids repeatedly sampling the same prompt, and it is superior to brute-force tree search methods such as BFS and DFS because it selects trajectories in the tree that show promising results.
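The update of the upstream values of V and N can be sketched as below. This is a minimal illustration under stated assumptions: `Edge` and `backup` are hypothetical names, each new node's reward is added to every edge on its path with weight γ^d, and d is taken to be the number of edges between an edge's parent prompt and the new node (so the direct child has d = 1). The text does not state whether the direct child is discounted.

```python
class Edge:
    """Statistics stored on one prompt-action pair (P, a)."""

    def __init__(self, prior: float):
        self.N = 0        # visit count N(P, a)
        self.V = 0.0      # total discounted downstream reward V(P, a)
        self.p = prior    # prior probability p(P, a)


def backup(path, reward: float, gamma: float = 0.9) -> None:
    """Propagate a newly computed reward up the path, root edge first.

    Assumption: d counts edges from the edge's parent prompt to the new node.
    """
    steps = len(path)
    for i, edge in enumerate(path):
        d = steps - i
        edge.N += 1
        edge.V += (gamma ** d) * reward


# Two-edge path from the root to a newly evaluated prompt with reward 10.
path = [Edge(prior=0.5), Edge(prior=0.5)]
backup(path, reward=10.0, gamma=0.9)
print([(e.N, round(e.V, 3)) for e in path])
```

With γ < 1, rewards found deep in the tree contribute less to edges near the root, which biases the search toward shorter reasoning pathways.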

Reward Function
Our reward function, R, measures the effectiveness of the catalysts proposed by the LLM for a given prompt, P. Here, we measure the effectiveness of a catalyst by querying the LLM to produce adsorption energies for a given adsorbate in electron volts (eV). We describe the prompt used to generate the adsorption energy in Appendix C.1. The significance of adsorption energy for catalysis design is explained in Appendix A.2. The reward is calculated from the LLM-generated adsorption energies of C(P), the top-k catalysts from prompt P.

Algorithm 1: Run MCR search. Here, a_t indicates the t-th action from the root.
1 Require: LLM, initial prompt P_0, number of candidate catalysts k, number of prompts to generate M
2 Initialize tree T. Define nodes P and edges (P, a_j), discount γ, stored values N(P, a_j), V(P, a_j), p(P, a_j), and reward function R.
...
19 end
20 return arg max_{P ∈ T} R(P)

Experiments
Experimental setup. We conduct our experiments on two new chemistry-focused reasoning query benchmarks containing 130 queries (Table 2). First, we compile OpenCatalysis from the OC20 (Chanussot et al., 2010) and OC22 (Tran et al., 2023) catalyst datasets (Zitnick et al., 2020). Second, we develop BioFuelQR, a query dataset targeting biofuels-focused catalyst discovery (see for an example). We collected two answers from catalysis researchers for a subset of 51 queries to observe different reasoning patterns and human biases. See Section C for details on dataset design.
Baselines. We benchmark MCR's performance against three recent methods: 1) Chain-of-Thought (CoT) prompting (Kojima et al., 2022), 2) self-consistency-based CoT (Wang et al., 2022), and 3) breadth-first-search (BFS) based Tree-of-Thoughts (ToT) (Yao et al., 2023) (a contemporary work to ours). Experiments are based on GPT-3 text-davinci-003. Table 1 shows MCR improves by 25.8% and 13% over the reward obtained by BFS on OpenCatalysis and BioFuelQR, respectively. Performance improves by ∼600% over CoT.

Query Cost. Despite significant effort with the dataset creation, our results in Table 1 are obtained from 11 of the 130 queries. MCR and the baselines are implemented using OpenAI text-davinci-003 for consistency. The MCR and ToT methods are computationally expensive (Table 1): evaluation of all 130 queries over all methods requires approximately 174,470 API calls, and we could not secure compute capacity from OpenAI to evaluate more than 11 queries with each method. We further discuss the limitations that arose in Limitations (4).
Key Takeaways. We find that MCR's use of stochastic search prunes the more uniform exploration of the search space conducted by ToT (Yao et al., 2023). Table 1 shows that, given a maximum query limit, MCR was able to search significantly deeper (reported by d_max) than ToT. While MCR reached a higher reward than ToT, MCR also generated more nodes than ToT (see C.4). However, we are not able to definitively declare that both tree-based methods outperformed CoT and CoT with self-consistency.
To confirm whether the increased reward over CoT indeed translates into better reasoning quality, two catalysis experts compared the best answer generated by MCR with the GPT-3 CoT implementation. Overall, the experts preferred MCR to CoT. The experts also evaluated how the prompts and LLM answers evolve as MCR searches deeper in the prompt tree (Figures 3 and 10); in many cases they found the LLM answers to be logically coherent, and in some cases even insightful enough for follow-up experimentation (see the second user feedback in Figure 3). Overall, both experts preferred MCR for having higher specificity than CoT and for reasoning in terms of correct properties (detailed in Figures 8 and 9).

Conclusion and Future Work
LLMs offer major promise to automate the cycle of scientific hypothesis generation and testing. Our work tackles the challenge of identifying key properties for augmenting a chemist's reasoning via use of a domain-specific reward function, enabling generation of relevant scientific explanations with high specificity. MCR is a zero-shot reasoning methodology that circumvents the need for large-scale, hard-to-obtain, domain-specific training data. We apply it to catalyst research: a highly empirical, reasoning-heavy scientific field dating back a century. Future work can investigate large-scale evaluation of our benchmark, integration with atomistic prediction models trained on quantum chemistry datasets for more trustworthy reward functions, and finetuned language models.

Limitations
In this work, we consider applications of large language models in the scientific domain. In general, this comes with a number of limitations. First, LLMs display largely black-box behavior, which is exacerbated by many strong models only being accessible as APIs. Second, generative modeling in the scientific domain is incredibly difficult to evaluate, requiring laboratory verification in many settings. Third, hallucination about factual information is a concern. One benefit of our method is that it provides reasoning based on refined prompts, which we show can be inspirational to domain experts searching for a solution.
Our work demonstrates that tree-search methods have a strong value proposition over existing methods for LLM reasoning (CoT, self-consistency, etc.). Since ToT is contemporary to our methodology, an important contribution of this work is demonstrating the merit of tree-based reasoning approaches for complex scientific reasoning tasks; scientific reasoning is not discussed in Yao et al. (2023). We do not claim that MCR is necessarily superior to ToT in all settings. In fact, further experiments have shown the two methods can be quite comparable. However, the cost of experimentation limits this work, so we cannot perform an ideal comparison of MCR to ToT.
In particular, our reward function based on LLM outputs of scientific questions can be considered a limitation. However, it allows for much quicker validation of ideas and we find it to be an effective proxy (which on its own is interesting). In the future, comparatively costly atomistic simulations can be used to replace our reward function. These can be quite time-consuming and computationally expensive, so we focus on our algorithmic contribution in this work. Because of the efficacy we demonstrate using LLM rewards, it may also be possible to use a hybrid approach to save on computational chemistry simulations. This could leverage LLM embeddings as an initial reward to narrow down promising search sub-trees by selecting the most promising nodes in the first few layers of the search tree. Advanced simulations can then be used for searching final answers in these sub-trees. Alternatively, simulations can be used as a limited-use oracle, as in active learning. We leave this for future work.
Our method's improvement comes with a higher cost of inference, similar to Tree-of-Thoughts. When doing inference locally, this may not be a problem. However, we utilize third-party APIs which are both expensive and rate-limited. We found that existing open-source models trained on chemistry text did not possess sufficient instruction-following capabilities to be reliable or effective here. Because our approach requires an average of 750 API calls per tree search, we were limited in the quantity of experiments that could be done, as well as in the models that could be accessed. Although we evaluate on relatively few initial questions, our in-depth expert-performed analysis is based on ∼7,200 queries.

Ethical Considerations
We propose a zero-shot prompting methodology for LLMs that enables reasoning for complex queries in the scientific domain. Like most applications of LLMs, this carries ethical considerations, especially with regard to implicit biases from large-scale pretraining and the hallucination of false information. Thus, human oversight and careful evaluation of language model output remain important. One consideration of our method is that it may enable discovery of molecules, materials, or other scientific products which can be used for harmful applications (Urbina et al., 2022). Overall, we believe these downsides are outweighed by the benefits of this work to both the NLP community and other scientific communities which may benefit.

A.1 Scientific Drivers from Catalysis
Discovery of novel catalysts is essential for accelerating the transition to a sustainable future. Despite the significant progress in the development of highly efficient catalysts, heterogeneous catalysis remains largely an empirical science owing to the complexity of the underlying surface chemistry (Nørskov et al., 2011). Currently, there is a lack of data and design guidelines for heterogeneous catalysis because the computational cost of obtaining accurate theoretical models for such complex systems is prohibitively high, while high-throughput experimental methods that have been applied successfully in related fields have not yet been thoroughly explored (Yang et al., 2019). Experimental validation of a new catalyst and its performance is expensive (Yang et al., 2019). Artificial intelligence-driven computing approaches aim to accelerate such discovery by down-selecting candidates that are most promising and merit extensive evaluation in a laboratory (Ward et al., 2021). The past few years have seen many developments in applying AI to chemistry, ranging from predicting properties of atomistic structures to predicting outcomes of reactions (Schwaller et al., 2019; Chen and Jung, 2022). Generative models (Jin et al., 2018) or deep reinforcement learning methods (You et al., 2018) have demonstrated abilities to propose novel chemical compounds that satisfy unique property constraints, and then suggest synthesis pathways for producing such compounds (Struble et al., 2020). Generally, such models are trained on representations of atomistic structures, or reactions between multiple structures (Struble et al., 2020; Chen and Jung, 2022).

A.2 Motivation for molecular energy prediction as a reward function
Electronic structure calculations play a crucial role in developing atomistic-level understanding of the interaction of liquid or gaseous molecules with solids, as a function of the topological properties of the solid surface (Nørskov et al., 2011). Much of the literature on machine learning for atomistic systems has focused on learning system-level properties such as potential energy functions (Schütt et al., 2018; Gasteiger et al., 2021). The following paragraph explains why estimating the energy functions associated with a molecular structure is critical to discovering processes with lower energy requirements.
The amount of usable energy for a physical system with constant temperature and pressure is referred to as the Gibbs free energy, or Gibbs energy, and is defined as G = H − TS, where H is the energy contained in the bonds between atoms, T is the temperature, and S is the entropy (Zitnick et al., 2020). The entropy of a system increases when molecules break their bonds and decreases when they form new ones. The computation of H involves the potential energy between atoms. When the Gibbs energy is negative, it means that the energy contained in the bonds is higher, and a system will naturally approach a lower energy state. Thus, a reaction or process will proceed spontaneously. On the contrary, a positive Gibbs energy indicates that extrinsic energy is required to enable a target process or reaction. The path to decarbonization lies in discovering chemical processes that require smaller amounts of extrinsic energy.
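The sign criterion above can be illustrated with a short numerical sketch. The function names and all numeric values are illustrative assumptions, not data from this work, and the quantities are treated as changes over a reaction at fixed temperature.

```python
def gibbs_free_energy(enthalpy_ev: float, temperature_k: float,
                      entropy_ev_per_k: float) -> float:
    """G = H - T * S, with H in eV, T in K, and S in eV/K."""
    return enthalpy_ev - temperature_k * entropy_ev_per_k


def is_spontaneous(gibbs_ev: float) -> bool:
    """A negative Gibbs energy means no extrinsic energy is required."""
    return gibbs_ev < 0


# Illustrative values only (not measurements).
g_favorable = gibbs_free_energy(-1.0, 300.0, 0.001)   # -1.0 - 0.3
g_unfavorable = gibbs_free_energy(0.5, 300.0, 0.001)  #  0.5 - 0.3
print(g_favorable, is_spontaneous(g_favorable))
print(g_unfavorable, is_spontaneous(g_unfavorable))
```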

B Related work
We begin by providing an overview of the broader literature around language models and their applications in chemistry, then focus specifically on large language models. Finally, we give an overview of various chain-of-thought prompting methods that have been instrumental in improving the reasoning capability of LLMs.

B.2 LLMs for Chemistry
Due to recent progress in chat-oriented models such as GPT-4 (OpenAI, 2023), interest has grown in uncovering chemical knowledge and molecular discovery from existing general LLMs (Hocky and White, 2022; White et al., 2022, 2023; Castro Nascimento and Pimentel, 2023). This has been extended to work in the few-shot setting (Ramos et al., 2023; Jablonka et al., 2023). In particular, there is an interest in endowing LLMs with scientific tools (Bran et al., 2023; Boiko et al., 2023; Liu et al., 2023a). In general, these studies assess the inherent chemistry knowledge in LLMs and the effect of integrating chemistry data via in-context learning or finetuning. This differs from our contribution, where we propose an algorithmic approach for improving model output using domain-specific rewards. A future research direction may be to combine these two approaches.

B.3 Chain-of-Thought (CoT) Variants
Several works have considered improving LLM output on complex reasoning tasks via formulating multiple queries. Creswell et al. (2022) explored the decomposition of complex queries into smaller, more reliable operators. Creswell and Shanahan (2022) present a methodology for generating the answer in a step-by-step fashion that uses another model or function to pick the top-ranked answers, and avoids hallucination by constraining the output to a narrower set. Jung et al. (2022) proposed an alternate approach that generates a tree of possible explanations (both correct and incorrect), and then analyzes their relationships to infer the correct set of answers. Wang et al. (2022) improve reliability by sampling multiple explanations and answers from the model and then selecting the final answer that appears most often. Tree-of-Thoughts (ToT) (Yao et al., 2023) generalizes the CoT approach to enable exploration over coherent units of text (thoughts) to perform deliberate decision making by considering multiple different reasoning paths. We benchmark against (Kojima et al., 2022; Wang et al., 2022; Yao et al., 2023) in our work.

C Dataset Design
We propose two task datasets related to catalyst design: the first is derived from the Open Catalyst (OC) Project (Zitnick et al., 2020) and the second consists of complex reasoning queries designed by catalysis experts.Our multi-disciplinary team involves researchers who actively work on designing new catalysts for bio-fuels development.

C.1 Action-Driven Prompt Design
To apply MCR to catalyst discovery, we define a set of prompt templates and a set of actions to modify the fields of those templates.The exact structure of the prompt templates varies between task datasets, but there are several common elements.Table 3 lists the action types that we use.
Firstly, all prompts query the language model to return the "top-k" catalysts, where k is given by the user. Secondly, each template has a list of "include properties" and "exclude properties", which specify contexts for the LLM to consider when determining catalysts to include and exclude, respectively. Next, each prompt after the initial prompt, in both ToT (Yao et al., 2023) breadth-first search and MCR, uses the previous list of candidate catalysts. The LLM is prompted to either include elements "similar to" or "different from" the previous list or to "include elements from" or introduce "new elements to" the list. Finally, the template includes a field to prompt for a certain kind of catalyst: unary, binary, ternary, and oxides. Of course, a prompt can have no specification on the catalyst type.
The specific template depends on the task dataset and the original query.

C.2 Open Catalyst Dataset
The Open Catalyst project (Zitnick et al., 2020) is an online repository of datasets intended for training surrogate models for computational chemistry simulations related to catalysis. The dataset contains hundreds of thousands of adsorption energies for adsorbate-catalyst pairs calculated using density functional theory (DFT), an accurate method for computing energies of atomic configurations. We use the Open Catalyst dataset to build an evaluation dataset consisting of 79 adsorbates. This dataset targets the LLM's ability to reason about the adsorption of specific adsorbates.
We use the following template for this dataset:

Generate a list of candidate {catalyst label} {candidate list statement} for the adsorption of {adsorbate}. {include statement} {exclude statement} Let's think step-by-step and return a list of top {k} answers and their explanations as a list of pairs.

Here, {} denote fields that need to be filled. The fields provided in the base query are the number of candidate catalysts, 'k' (k = 5 for the OC dataset), and the adsorbate symbols, 'adsorbate', taken from the OC dataset. 'Include statement' and 'exclude statement' are phrases built from the lists of properties to include and exclude, respectively. These statements are affected by the Add Include Property and Add Exclude Property action types in Table 3.

C.3 BioFuelQR Dataset
Our application focus is driven by the design of catalysts for the reverse water-gas shift (RWGS) reaction, which is key to the generation of synthetic biofuels with higher selectivity (Canakci and Van Gerpen, 1999; Daza and Kuhn, 2016; Kattel et al., 2017; Artz et al., 2018; Stolarczyk et al., 2018; Xu and Carter, 2018; Mukhtar et al., 2022).
Questions in the BioFuelQR dataset use the following template:

What are the top-3 {catalyst label} {candidate list statement} that perform the RWGS reaction at a lower temperature (<200 C) and demonstrate higher adsorption energy for both CO2 and H2 (or facilitates both CO2 and H2 adsorption)? {include statement} {exclude statement} Provide scientific explanations and return a list of top 3 answers and their explanations as a list of pairs. Let's think step-by-step.
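A minimal sketch of how such a template might be filled is shown below. The template text follows the BioFuelQR template above, but the underscore field names, the `property_statement` helper, and the wording of the include and exclude statements are assumptions, since the exact phrasing is not given here.

```python
BIOFUELQR_TEMPLATE = (
    "What are the top-3 {catalyst_label} {candidate_list_statement} that "
    "perform the RWGS reaction at a lower temperature (<200 C) and demonstrate "
    "higher adsorption energy for both CO2 and H2 (or facilitates both CO2 and "
    "H2 adsorption)? {include_statement} {exclude_statement} Provide scientific "
    "explanations and return a list of top 3 answers and their explanations as "
    "a list of pairs. Let's think step-by-step."
)


def property_statement(verb: str, properties) -> str:
    """Hypothetical phrasing for the include/exclude statements."""
    if not properties:
        return ""
    return f"{verb} candidates with the following properties: {', '.join(properties)}."


def build_prompt(catalyst_label="catalysts", candidate_list_statement="",
                 include=(), exclude=()) -> str:
    prompt = BIOFUELQR_TEMPLATE.format(
        catalyst_label=catalyst_label,
        candidate_list_statement=candidate_list_statement,
        include_statement=property_statement("Include", include),
        exclude_statement=property_statement("Exclude", exclude),
    )
    # Collapse the extra spaces left behind by empty fields.
    return " ".join(prompt.split())


root_prompt = build_prompt()
child_prompt = build_prompt(
    catalyst_label="binary catalysts",
    candidate_list_statement="similar to Cu, Pt, Ni",
    include=["low cost"],
)
print(child_prompt)
```

Keeping every action as a change to one named field means a child prompt differs from its parent in exactly one place, which makes the reasoning pathway easy to read back.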

C.4 Baseline implementations
Here we define the parameters for the evaluations of the Baseline and MCR methods.
Chain-of-Thought (CoT). For the CoT baseline, we generated a prompt for each query following the templates described in Appendix C.1. We evaluated 9 adsorbates from the Open Catalysis dataset and 2 prompts from the BioFuelQR dataset. For CoT, we simply send one prompt to the LLM to generate a list of candidate catalysts, including the phrases "Provide a scientific explanation" and "Let's think step-by-step". The reward of the result is reported.
CoT with Self-Consistency. For the self-consistency baseline, the query was evaluated 10 times independently using the same prompt from CoT. We checked the answers for consistency. However, there was no consistency between the top-k answers from the LLM over the 10 trials, perhaps due to the large diversity in catalyst compositions. Thus, the reward estimate returned in Table 1 is simply the maximum reward over the 10 trials.
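The fallback described above can be sketched as follows. How consistency was checked is not specified, so this sketch assumes that two trials agree when they propose the same set of catalysts; `self_consistent_choice` and the sample trials are hypothetical.

```python
from collections import Counter


def self_consistent_choice(trials):
    """Pick an answer from (candidates, reward) trials.

    Assumption: two trials agree if they propose the same set of catalysts.
    If no answer repeats, fall back to the trial with the maximum reward.
    """
    keys = [tuple(sorted(cands)) for cands, _ in trials]
    key, votes = Counter(keys).most_common(1)[0]
    if votes > 1:
        rewards = [r for k, (_, r) in zip(keys, trials) if k == key]
        return list(key), max(rewards)
    best_cands, best_reward = max(trials, key=lambda t: t[1])
    return sorted(best_cands), best_reward


# No repeated answer: the maximum-reward trial is returned.
no_agreement = [(["Pt", "Cu"], 1.0), (["Ni", "Fe"], 3.0), (["Zn", "Co"], 2.0)]
print(self_consistent_choice(no_agreement))
```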
Tree-of-Thoughts (ToT). For ToT, keeping computational cost in mind, we set a branching factor b = 6. This controls the number of nodes expanded at each point in the search. Thus, at each level the nodes with the top 6 rewards are expanded. To reduce computational cost, we restricted the number of actions to the top 12 actions with the highest prior probability p(P, a_i). This way, we reduce the number of actions simulated at each step. If there are not 12 actions with nonzero prior probability for a node, we generate as many children as possible. This happens, for instance, at the second level of the search tree, where the action "change relation to previous answer" must be taken (of which there are 4 possibilities). This is because these nodes will pass their candidate catalysts to their successor prompts. The ToT method was run for 5 steps to generate a tree with depth 5. If all actions were possible at every level, we would generate 300 nodes in BFS (not including the root node), but only 252 nodes were generated on average. Still, we were able to select at least 6 best nodes at each level. We did not experience a similar discrepancy in MCR because MCR has a more flexible branching policy. The average observed number of nodes in the final trees is reported in Table 1.
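The node counts above can be checked with a short calculation. The sketch assumes that only the root is expanded at the first step and that the 6 best nodes are expanded at each later step; `tot_node_count` is a hypothetical helper. Attributing the whole gap between 300 and 252 to the 4-way restriction at the second level is one reading that reproduces both figures, not something the text states outright.

```python
def tot_node_count(actions_by_step, beam: int = 6) -> int:
    """Count nodes generated below the root by beam-limited BFS.

    actions_by_step[i] is the number of children generated per expanded
    node at step i; at most `beam` nodes are expanded per step.
    """
    total = 0
    expanded = 1  # only the root is expanded at the first step
    for n_actions in actions_by_step:
        created = expanded * n_actions
        total += created
        expanded = min(beam, created)
    return total


ideal = tot_node_count([12, 12, 12, 12, 12])      # every step allows 12 actions
restricted = tot_node_count([12, 4, 12, 12, 12])  # second step allows only 4
print(ideal, restricted)
```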
We did not include the depth-first-search method from Tree-of-Thoughts because our search does not support a specific ending criterion.
MCR. For MCR, we set a discount factor of γ = 0.9 and an exploration-exploitation trade-off of c = 15 to control the branching and depth of the search tree. Generally, decreasing γ decreases the length of chains in the search tree, while increasing c increases the branching of the tree. We generated 300 nodes after the root node, meaning 301 nodes were in the final search tree.
MCR utilizes the policy in Equation 2 to determine which actions to carry out at which step. However, the policy must be modified in two cases. First, if a node is a leaf node, the policy is replaced by the prior probability distribution over actions, p(P_t, a_i) (see Section 2). Second, if a node-action pair has no visits (N(P_t, a_i) = 0), then the first term of Equation 2 is dropped to avoid dividing by zero.
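The two special cases can be sketched as follows. The exact form of the policy is not shown here, so this sketch assumes a PUCT-style score whose first term is the mean downstream reward V/N and whose second term scales the prior by c; the second term in particular is an assumption, and `select_action` is a hypothetical name.

```python
import math
import random


def select_action(edges, c: float = 15.0, is_leaf: bool = False, rng=random):
    """Choose the next action from a dict mapping action -> (N, V, p).

    Assumed score: V / N + c * p * sqrt(total visits) / (1 + N).
    """
    possible = [a for a in edges if edges[a][2] > 0]
    if not possible:
        return None
    if is_leaf:
        # Case 1: at a leaf, sample from the prior distribution over actions.
        weights = [edges[a][2] for a in possible]
        return rng.choices(possible, weights=weights, k=1)[0]

    total_visits = sum(n for n, _, _ in edges.values())

    def score(action):
        n, v, p = edges[action]
        explore = c * p * math.sqrt(total_visits) / (1 + n)
        # Case 2: drop the first term when the pair has no visits.
        exploit = v / n if n > 0 else 0.0
        return exploit + explore

    return max(possible, key=score)


edges = {"a": (10, 50.0, 0.5), "b": (0, 0.0, 0.5), "c": (0, 0.0, 0.0)}
print(select_action(edges, c=15.0))  # large c favors the unvisited action
print(select_action(edges, c=0.0))   # c = 0 is purely greedy
```

Under this assumed form, a large c such as 15 pushes the search toward unvisited actions, consistent with the observation that increasing c increases branching.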

C.5 Reward Query
To query the language model to return adsorption energies, we use another prompt template:

Generate a list of adsorption energies, in eV, for the adsorbate {adsorbate} to the surface of each of the following catalysts: {candidate list}. Return the adsorption energies as a list of only {len(candidate list)} numbers in the order specified.
The LLM should return a list of numbers, in units of eV, which can be averaged to produce a final energy. Since adsorption energies are negative, we take the absolute value of the numbers listed by the LLM. If multiple adsorbates are given, as in the BioFuelQR examples, multiple prompts are generated and the results are summed. Occasionally, the LLM does not give an output that can be easily parsed into a list of floats, for example because of uncommon delimiters or sporadic phrases in the output. In these cases, the query is re-run a maximum of 3 times.
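The parsing, averaging, and retry logic described above can be sketched as below. This is an illustration under stated assumptions: `query_llm` is a hypothetical callable standing in for the LLM call, the regular expression is one possible parser, "a maximum of 3 times" is read as up to 3 re-runs after the first attempt, and raising an error when every attempt fails is an assumption because the failure behavior is not described.

```python
import re

FLOAT_RE = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")


def parse_energies(text: str, expected: int):
    """Return the floats found in text, or None if the count is wrong."""
    values = [float(m) for m in FLOAT_RE.findall(text)]
    return values if len(values) == expected else None


def reward(query_llm, adsorbates, candidates, max_retries: int = 3) -> float:
    """Sum over adsorbates of the mean absolute adsorption energy (eV)."""
    total = 0.0
    for adsorbate in adsorbates:
        energies = None
        for _ in range(1 + max_retries):  # first attempt plus re-runs
            reply = query_llm(adsorbate, candidates)
            energies = parse_energies(reply, len(candidates))
            if energies is not None:
                break
        if energies is None:
            raise ValueError(f"could not parse energies for {adsorbate}")
        total += sum(abs(e) for e in energies) / len(energies)
    return total


# Stub standing in for the LLM: the first reply cannot be parsed.
replies = iter(["I am not sure.", "-1.0, -2.0, -3.0"])
stub = lambda adsorbate, candidates: next(replies)
print(reward(stub, ["CO2"], ["Pt", "Cu", "Ni"]))
```

Requiring exactly one number per candidate is deliberately strict: a reply that mentions stray digits (for example inside a formula such as Pt3Ni) fails the count check and triggers a re-run instead of silently skewing the reward.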

Figure 1: An example prompt design via tree search. The search begins with a generic query at the root node. The answer from each node is passed to the child nodes and additional criteria are added to the prompt, for instance, low cost. Information passed to children nodes is color coded to show the reasoning pathway.

Figure 3: Domain expert evaluation of LLM answers on the reasoning path to the final node with highest reward.

Figure 4: Example queries from the BioFuelQR dataset representing reasoning with different combinations of chemical descriptors.

Figure 5: Example question and human answer from our compiled QA-dataset.

Figure 6: Response to above query returned by Chain-of-Thought prompting with GPT-3.

Figure 7: Response to above query returned by MCR.

Figure 10: Illustration of an evaluation by a domain expert on the progression of top search results found on the path to the answer with highest reward.
Algorithm 1 (continued):
... a_t)
14 Save P_t, (P_t, a_t) in T
15 r ← R(P_t)  ▷ Calculate reward using answer from LLM
16 ...

Table 1: Final catalyst suggestion results. N_P is the number of prompts evaluated and d_max is the maximum search tree depth. Values are averaged over evaluated examples.

Table 2: Dataset Summary

The Add Include Property and Add Exclude Property action types in Table 3 modify the include and exclude statements. The 'catalyst label' field determines which kind of catalyst the LLM should return. Its value is set by the Change Catalyst Type action, and the Toggle Oxide action can set this field to query for oxide catalysts. Finally, the candidate list statement is a phrase built from the list of candidates generated by the parent prompt. Since the candidate list can have an impact on the output of the LLM, we include an action to re-run the previous query with the candidate list from the previous query's output. Possible actions are weighted with equal prior probabilities p (see Section 2) and impossible actions are given prior probability zero. Actions are impossible if they: add a property to a list which already has that property, add a relationship to the previous candidate list when there is no candidate list, or would allow the next action to not have a relationship to the previous candidate list while the candidate list is not empty.

Table 3: List of actions and their possibilities.