Answering Questions by Meta-Reasoning over Multiple Chains of Thought

Modern systems for multi-hop question answering (QA) typically break questions into a sequence of reasoning steps, termed chain-of-thought (CoT), before arriving at a final answer. Often, multiple chains are sampled and aggregated through a voting mechanism over the final answers, but the intermediate steps themselves are discarded. While such approaches improve performance, they do not consider the relations between intermediate steps across chains and do not provide a unified explanation for the predicted answer. We introduce Multi-Chain Reasoning (MCR), an approach which prompts large language models to meta-reason over multiple chains of thought, rather than aggregating their answers. MCR examines different reasoning chains, mixes information between them and selects the most relevant facts in generating an explanation and predicting the answer. MCR outperforms strong baselines on 7 multi-hop QA datasets. Moreover, our analysis reveals that MCR explanations exhibit high quality, enabling humans to verify its answers.


Introduction
In chain-of-thought (CoT) prompting, a large language model (Brown et al., 2020; Chowdhery et al., 2022; Kadavath et al., 2022; Touvron et al., 2023) is prompted to generate its answer following a step-by-step explanation (Wei et al., 2022b; Nye et al., 2022). CoT prompting has been shown to dramatically improve performance on reasoning-heavy tasks (Kojima et al., 2022; Zhou et al., 2022). Furthermore, Wang et al. (2023) showed that sampling multiple chains of thought and returning their majority output further improves accuracy, a method which they term self-consistency (SC).
While SC leads to performance gains, it also has several shortcomings. First, when the space of possible outputs is large (Kalyan et al., 2021), each reasoning chain may lead to a different output, in which case no majority will be attained. Second, focusing exclusively on the final output discards relevant information that is present in the intermediate reasoning steps. Consider answering the question "Did Brad Peyton need to know about seismology?" (Fig. 1). Reasoning chain #1 leads to an incorrect answer ("No"), but its steps provide useful information. For example, the intermediate question, and following answer, on "What is seismology?" constitute an important fact that is absent from the other two chains. Last, adding SC to chain-of-thought prompting reduces interpretability, as there is no single reasoning chain that can be considered as an explanation.

Figure 1: An example from STRATEGYQA, showing the output of Multi-Chain Reasoning versus Self-Consistency. MCR uses reasoning chains as its context for QA; SC solely relies on the chains' answers.
In this work, we propose Multi-Chain Reasoning (MCR), where we prompt a large language model (LLM) to meta-reason across multiple reasoning chains and produce a final answer, alongside an explanation. Unlike prior work, sampled reasoning chains are used not for their predictions (as in SC) but as a means to collect pieces of evidence from multiple chains. Fig. 1 illustrates MCR compared to SC. While both methods rely on sampling multiple reasoning chains, SC returns the majority answer, "No" (grey box, bottom right). By contrast, MCR concatenates the intermediate steps from each chain (blue boxes, top left) into a unified context, which is passed, along with the original question, to a meta-reasoner model. The meta-reasoner is a separate LLM, prompted to meta-reason on multiple reasoning chains and produce a final answer along with an explanation (pink box, bottom left). By reasoning over multiple reasoning chains, MCR is able to mitigate the aforementioned drawbacks: it combines facts from multiple chains to produce the correct final answer, with an explanation of the answer's validity.
MCR has three main components (§3). To generate reasoning chains we use two components, a decomposition model and a retriever, which jointly generate the chain (Fig. 2), similar to prior work (Press et al., 2022; Trivedi et al., 2022a). These chains are then concatenated into a unified multi-chain context, which is fed to the aforementioned meta-reasoner. Fig. 1 highlights the ability of the meta-reasoner to combine facts from different reasoning chains (intermediate answers in pink). The output explanation combines facts from each of the three chains: (1) "Seismology is the study of earthquakes"; (2) "San Andreas is a film..."; (3) "Brad Peyton is a film director, writer...". While SC (in grey) errs due to using only the answers, the meta-reasoner reads entire reasoning chains, and is able to correctly answer the question.
We evaluate MCR on a wide range of challenging multi-hop question answering (QA) datasets, in an open-domain setting. We distinguish between two types of tasks: implicit reasoning tasks, where reasoning steps are implicit given the question text and need to be inferred using a strategy (Tafjord et al., 2019; Geva et al., 2021; Kalyan et al., 2021); and explicit reasoning tasks, where a single reasoning strategy exists and can be directly inferred given the language of the question (Yang et al., 2018; Welbl et al., 2018; Press et al., 2022; Aly et al., 2021). As our baselines, we compare MCR to SC, as well as to variants of Self-Ask (Press et al., 2022) and CoT augmented with retrieval, following Trivedi et al. (2022a). Our results show MCR consistently outperforms all other baselines, in particular beating SC by up to 5.7% while using the same reasoning chains (§4). We analyze the benefits of MCR in §5 by manually scoring its generated explanations and estimating their accuracy. Our analysis shows that MCR generates high-quality explanations for over 82% of examples. In addition, we measure how often MCR incorporates facts from multiple reasoning chains when generating its explanations.
To conclude, our main contributions are:
• We introduce the MCR method for meta-reasoning on multiple chains of thought.
• We show that MCR outperforms all baselines, including self-consistency, on all 7 multi-hop open-domain QA benchmarks.
• We analyze MCR for its explanation quality and its multi-chain reasoning capabilities.
Our codebase and prompts are publicly available.

Background
Recently, there has been a surge of interest in answering multi-hop questions through few-shot prompting of LLMs (Wei et al., 2022b; Nye et al., 2022; Yao et al., 2022). The majority of these works follow a common standard: First, given a question, plan a step-by-step reasoning chain to derive the answer and answer all intermediate steps, aided by a retriever to minimize model hallucination (Khot et al., 2023; Press et al., 2022; Yao et al., 2022; Lazaridou et al., 2023; Trivedi et al., 2022a; Khattab et al., 2022). Then, incorporate multiple reasoning chains with answers to derive the final answer (Wang et al., 2023; Li et al., 2022). In our work, we follow this template and focus on the last part. However, by reasoning over multiple reasoning chains, our meta-reasoning approach differs from prior work: namely, we use multiple chains to collect relevant evidence for question answering.

Method
We present a method for answering questions by meta-reasoning on multiple reasoning chains. Our focus is on open-domain QA, where the input is a question q, such that the evidence to answer it exists in one or more sentences in a corpus C. When answering q requires multiple reasoning steps, it can be expressed by a reasoning chain, denoted by r. The reasoning chain is a list of one or more intermediate question-evidence-answer triples (q_i, e_i, a_i). Evidence e_i ∈ C is a sentence that is relevant to answering the intermediate question q_i. Fig. 2 describes our approach when answering "How many ants would fit in The Shard?". First, we use a prompted LLM to generate multiple reasoning chains, r^(1), ..., r^(k) (steps 1-2). Each r^(j) is generated by interleaving generated intermediate questions with retrieved contexts (§3.1). Our main contribution is step 3: we introduce a second LLM that is prompted to meta-reason on multiple reasoning chains, collecting evidence facts as its explanation and generating the final answer (§3.2).
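For concreteness, the following is a minimal sketch (not the authors' code) of the data abstraction described above: a reasoning chain r is a list of (q_i, e_i, a_i) triples, and MCR operates on k such chains per question.

```python
from typing import List, Tuple

Step = Tuple[str, str, str]   # (intermediate question q_i, evidence e_i, answer a_i)
Chain = List[Step]            # one reasoning chain r
MultiChain = List[Chain]      # k sampled chains r^(1), ..., r^(k) for one question q
```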

Generating Reasoning Chains
Given a question q, we generate its reasoning chain using: (1) a decomposition model, and (2) a retriever component. Our reasoning chain generation process is largely based on prior work (Press et al., 2022; Trivedi et al., 2022a), as discussed in §2. Fig. 3 describes the interleaving of decomposition and retrieval. At each step, the decomposition model generates an intermediate question q_i, based on the original question q and the previous reasoning steps. Then, the retriever uses q_i to retrieve relevant evidence e_i ∈ C. We then feed e_i and q_i to the decomposition model (along with the previous steps) to generate the intermediate answer a_i. During answer generation, we prepend intermediate evidence to the beginning of the chain rather than interleaving it, as this improves the accuracy for all baselines. For decomposition prompts, see §D.
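The interleaving loop above could look roughly as follows; this is a hedged sketch, where `decompose_next`, `answer_step`, and `retrieve` are hypothetical placeholders for the prompted decomposition model and the retriever, not the authors' implementation.

```python
def generate_chain(question, decompose_next, answer_step, retrieve, max_steps=5):
    steps = []  # list of (q_i, e_i, a_i) triples
    for _ in range(max_steps):
        # The decomposition LLM proposes the next intermediate question q_i,
        # conditioned on the original question and the steps so far.
        q_i = decompose_next(question, steps)
        if q_i is None:  # the model signals that decomposition is complete
            break
        e_i = retrieve(q_i)  # evidence sentence e_i relevant to q_i
        # Generate intermediate answer a_i conditioned on q_i, e_i and prior
        # steps; per the paragraph above, evidence is prepended to the chain
        # at answer-generation time rather than interleaved.
        a_i = answer_step(question, steps, q_i, e_i)
        steps.append((q_i, e_i, a_i))
    return steps
```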

Reasoning over Reasoning Chains
The meta-reasoner module is the core contribution of MCR. Instead of sampling multiple reasoning chains for their predicted answers (Wang et al., 2023), we utilize the chains as a means for context generation. This context can be fed to a prompted LLM, which reads the generated chains and reasons over them to generate the final answer.
In §3.1, we described sampling a set of reasoning chains r^(1), ..., r^(k), where each chain is a list of intermediate (q_i, e_i, a_i) triples. We use all intermediate question-answer pairs (q_i, a_i) from all chains as a multi-chain context (a variant using a context with the retrieved intermediate evidence e_i is described in §C.5). Fig. 2 presents a multi-chain context from the three chains (lower pink box).
The multi-chain context is then provided as input, along with the original question, to the meta-reasoner. This model is an LLM, few-shot prompted for the task of QA given a multi-chain context. Fig. 4 presents one exemplar from the meta-reasoner prompt for the FEVEROUS dataset (the full prompts are provided in §D). We prompt the LLM to "answer the question step-by-step" given its context. The multi-chain context follows, with each line describing a single (q_i, a_i) pair from one of the sampled chains. Following the context, we append the question and a step-by-step reasoning chain followed by the final answer. This chain serves as the explanation for solving the question. The meta-reasoner LLM is prompted with 6-10 of these exemplars, depending on the dataset (§4.1).
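A minimal sketch of how such a prompt could be assembled follows; the exact line formatting and the handling of few-shot exemplars are assumptions, and the instruction string mirrors the one shown in the paper's figures.

```python
def build_meta_reasoner_prompt(question: str,
                               chains: "list[list[tuple[str, str, str]]]") -> str:
    lines = ["Given a question and a context, answer the question step-by-step.",
             "If you are unsure, answer Unknown.", "", "Context:"]
    for chain in chains:
        for q_i, _e_i, a_i in chain:  # evidence e_i is omitted in this variant
            lines.append(f"{q_i} {a_i}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")  # the LLM continues with an explanation + final answer
    return "\n".join(lines)
```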
Providing the meta-reasoner with multiple chains allows it to combine and aggregate facts across chains. Moreover, the model needs to extract the most relevant facts in the chains to serve as its explanation. This enables MCR to be both more accurate and more interpretable than past multi-chain approaches (as we analyze in §5).

Experiments
We compare MCR to existing methods on 7 multi-hop QA benchmarks, covering a wide range of reasoning skills, including commonsense, composition, comparison and fact verification. MCR consistently outperforms existing approaches on all benchmarks, setting a new state-of-the-art for the STRATEGYQA test set. Our setting is described in §4.1 and we discuss our main results in §4.2.

Datasets
As our focus is on multi-hop questions (in an open-domain setting), all datasets contain questions that require multiple reasoning steps to answer. Following prior work (Khattab et al., 2022; Trivedi et al., 2022a), and to limit the cost of model API calls, we evaluate on 500-1000 random examples from the development set of each dataset. We also evaluate on the official test sets of STRATEGYQA and FERMI, as they target implicit reasoning with multiple valid chains and their test set evaluation cost is reasonable. For all datasets, we ensure no evaluation questions appear in any of our prompts.
Tab. 1 provides example questions from each dataset. Our multi-hop QA benchmarks can be categorized based on their required reasoning skills:
• Implicit Reasoning: Questions that entail implicit reasoning steps (Geva et al., 2021). The reasoning steps for solving the question are not explicitly derived from the language of the original question and may require commonsense or arithmetic reasoning. Additionally, such questions may have multiple valid reasoning chains. We evaluate on the following datasets: STRATEGYQA (Geva et al., 2021), FERMI (Kalyan et al., 2021) and QUARTZ (Tafjord et al., 2019).
• Explicit Reasoning: Multi-hop questions where the reasoning steps are explicitly expressed in the language of the question (composition, comparison). These include HOTPOTQA (Yang et al., 2018), 2WIKIMQA (Welbl et al., 2018) and BAMBOOGLE (Press et al., 2022). We also evaluate on the FEVEROUS fact verification dataset (Aly et al., 2021), where claims require verifying multiple facts, and evidence may be in sentences, tables or both.
For evaluation, we use F1 to compare predicted and gold answers for all explicit reasoning datasets, and exact match for the binary-choice datasets. In FERMI, we use the official order-of-magnitude evaluation by Kalyan et al. (2021). We provide additional technical details on evaluation in §A.2.
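The metrics named above are standard; for reference, a sketch of the common token-level F1 and exact-match implementations (answer normalization is simplified, and this is not necessarily the authors' exact code):

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip().lower() == gold.strip().lower())

def f1(pred: str, gold: str) -> float:
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)  # clipped token overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```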

Models
Our main models and baselines are all instances of GPT-3, code-davinci-002, prompted with in-context learning exemplars (Brown et al., 2020). We discuss experiments with additional models in §4.3. Our prompt exemplars are formatted as described in §3.2. The number of exemplars varies from 6-12 between datasets (§D). Exemplars in the decomposition prompt are based on random examples from the train and development sets, coupled with their gold reasoning chain. For exemplars in the meta-reasoner prompt, we use reasoning chains sampled from the decomposition model to form the input context. We ensure that the answer can be inferred using the sampled chains and add an explanation before the final answer, as shown in Fig. 4. For the binary-choice datasets, STRATEGYQA, QUARTZ, and FEVEROUS, the prompt contains an equal number of exemplars from each label. See §D for the detailed prompts.

Meta-Reasoner
We experiment with two variants of the meta-reasoner to measure the effect of reasoning on more than a single chain.
• MCR: The meta-reasoner is given five reasoning chains as its multi-chain context (§3.2). We decode one chain with greedy decoding, and sample another four reasoning chains with temperature t = 0.7 (see the sketch after this list). This enables the meta-reasoner to review different pieces of evidence when answering the full question (§5).
• SCR: Single-Chain Reasoning (SCR) serves as an ablation for the effect of the multi-chain context on the meta-reasoner. In SCR, the meta-reasoner is given the same prompt as MCR, aside from having only the greedy-decoded chain in its context. This disentangles the effect of using multiple chains from the effect of having an LLM that is separate from the decomposition model to generate the final answer.
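A hedged sketch of the chain-sampling scheme in the MCR bullet above; `generate_chain` and its `temperature` argument are hypothetical placeholders for the decomposition model of §3.1.

```python
def sample_chains(question: str, generate_chain, k: int = 5, t: float = 0.7) -> list:
    chains = [generate_chain(question, temperature=0.0)]   # one greedy chain
    chains += [generate_chain(question, temperature=t)     # k-1 sampled chains
               for _ in range(k - 1)]
    return chains
```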
Baselines We evaluate the following baselines:
• SA: Self-Ask (Press et al., 2022) returns the answer of a single reasoning chain, generated with greedy decoding.
• SC: Self-Consistency serves as a baseline which incorporates multiple reasoning chains (Wang et al., 2023). It returns the majority answer based on multiple chains sampled from the decomposition model (a minimal vote sketch follows this list). We experiment with variants of 3, 5 and 15 sampled chains (SC@3, SC@5 and SC@15), in line with prior work (Wang et al., 2023; Khattab et al., 2022; Sun et al., 2023). As in MCR, we use the chain generated with greedy decoding along with additional chains sampled with t = 0.7.
Retrieved evidence may be either sentences or parsed lists returned by Google. Following Trivedi et al. (2022a), we also retrieve evidence for the original question q. Last, all retrieved evidence sentences are prepended to the decomposition (§3.1). Additional details and hyperparameters are described in §B.1 and §B.2.
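For reference, a minimal sketch of the self-consistency vote (Wang et al., 2023); tie-breaking and answer normalization are simplifications, not the authors' exact implementation.

```python
from collections import Counter

def self_consistency(final_answers: "list[str]") -> str:
    # Majority vote over the final answers of the sampled chains.
    votes = Counter(a.strip().lower() for a in final_answers)
    return votes.most_common(1)[0][0]

# e.g., self_consistency(["No", "No", "Yes"]) -> "no"   (SC@3)
```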

Main Results
We report results of MCR and our baselines averaged over three runs. For test set results, we report the results of a single run.
Adding Reasoning Chains We measure the gains of MCR and SC when using additional reasoning chains. As extending MCR to reason on more chains is bounded by context length (code-davinci-002's context is capped at 8,001 tokens), we follow a straightforward approach and perform self-consistency on three MCR runs. SC@15 is then compared to this MCR+SC@3 model, which uses 15 reasoning chains, 5 chains for each MCR run. The results in Tab. 3 show that MCR+SC@3 consistently outperforms SC@15. Furthermore, though MCR uses only 5 reasoning chains, its results (Tab. 2) still beat SC@15 on all datasets, save STRATEGYQA.

Test Set Results We evaluate our models on the official test sets of STRATEGYQA and FERMI, which include 490 and 558 examples respectively. The results in Tab. 4 show that on STRATEGYQA, our MCR consistently beats SC when using the same number of reasoning chains. In addition, the 75.3 accuracy of MCR+SC@3 sets a new state-of-the-art result on the official STRATEGYQA test set. In FERMI, both methods perform similarly.
Recent Approaches In Tab. 5, we compare MCR to recent CoT-based approaches for multi-hop reasoning. An apples-to-apples comparison is not possible, as these methods do not evaluate on all 7 of our tasks and use varying samples of 500-1,000 dev examples for evaluation. Moreover, different methods use different retrieval corpora, hyperparameters and LLMs, which we discuss in §B.3. Nevertheless, we argue a direct comparison serves as a measuring stick for MCR compared to similar solutions. MCR performs well on all datasets, showing the robustness of our approach.
On HOTPOTQA, our error analysis (§5) attributes part of the gap to the retrieval corpus, since we retrieve evidence from an up-to-date version of Wikipedia (§4.1) rather than the outdated corpus in HOTPOTQA. On STRATEGYQA, MCR beats all other methods, save for CoT+SC@40, which uses 40 reasoning chains compared to the 15 used by MCR. We emphasize that our focus is highlighting the potential of reasoning over reasoning chains. While task-specific improvements are possible, they are orthogonal to our work. Still, we provide a breakdown of errors by dataset in §5.
Open-Source Models

To enable reproducibility, we evaluate our models with an additional open-source retriever and LLM. As our open-source retriever, we use ColBERTv2 (Khattab and Zaharia, 2020; Santhanam et al., 2022) over the 2018 Wikipedia dump from Karpukhin et al. (2020). In addition to code-davinci-002, we experiment with Vicuna-13B (Chiang et al., 2023), a 13-billion-parameter model that has been shown to outperform LLMs like LLaMA. We use the same prompts as in code-davinci-002, trimmed to fit the 2,048-token context length. Tab. 6 presents the results for two representative datasets, STRATEGYQA and HOTPOTQA. We observe the same trends as in Tab. 2, with MCR outperforming SC. For code-davinci-002, substituting Google Search with ColBERTv2 slightly reduces model performance. However, MCR gains remain roughly the same, beating SC@5 on STRATEGYQA (+2.3%) and HOTPOTQA (+3.5%). Unsurprisingly, Vicuna-13B performance is noticeably lower than code-davinci-002, although MCR still beats SC@5 on both STRATEGYQA (+0.6%) and HOTPOTQA (+6.1%). Interestingly, for Vicuna-13B the best model on STRATEGYQA is the SCR variant, which meta-reasons over a single reasoning chain. While reasoning over multiple chains can be done effectively with larger language models, meta-reasoning on implicit questions is challenging for the smaller Vicuna-13B. Namely, it generates open-ended answers such as "Unknown" or "It depends" for over 24% of the questions in STRATEGYQA. This can be attributed to challenges in meta-reasoning and the lower quality of decompositions the meta-reasoner receives as input. Nevertheless, we observe that on the explicit HOTPOTQA dataset, both Vicuna-13B and code-davinci-002 benefit significantly from MCR.

Analysis
In §4 we presented the empirical advantage of multi-chain reasoning. Next, we measure the importance of incorporating multiple reasoning chains in MCR and qualitatively assess its output.
When are Multiple Chains Helpful? In §4.2 we observed that MCR consistently outperforms single-chain reasoning (SCR). We wish to show that this advantage lies in cases where the meta-reasoner uses chains other than the one generated through greedy decoding. To this end, we sort examples based on the similarity of their greedy-decoded chain to the MCR explanation (similarity details in §C.1). Lower similarity indicates less reliance of MCR on the greedy chain. Fig. 5 presents an example where the MCR explanation (pink box) includes relevant facts from a chain other than the greedy one (additional examples in §C.2). The results in Fig. 6 empirically demonstrate that MCR outperforms SCR on STRATEGYQA when MCR explanations have lower similarity to the greedy chain. We observe a similar trend in all datasets (§C.1), serving as further evidence for MCR's utility.
Combining Reasoning Chains A property of MCR is that its meta-reasoner can combine facts from different chains to generate the answer. We automatically estimate the prevalence of this phenomenon on STRATEGYQA and FERMI. We target these datasets as their questions typically have multiple valid chains. Given an example, we determine whether its meta-reasoner output is the result of combining multiple chains. Concretely, we examine whether one of the output sentences appears in exactly one chain while another is absent from that chain while being part of some other chain. This ensures that the meta-reasoner has incorporated facts from at least two chains. We consider sentences as similar if their ROUGE-1 precision is above 0.8.
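A hedged reimplementation of the fact-combination heuristic above: an output sentence "appears in" a chain if its ROUGE-1 precision against the chain's text exceeds 0.8. The unigram-precision function below is a simplification of ROUGE-1, not the authors' code.

```python
def rouge1_precision(sentence: str, reference: str) -> float:
    sent_toks = sentence.lower().split()
    ref_toks = set(reference.lower().split())
    if not sent_toks:
        return 0.0
    return sum(tok in ref_toks for tok in sent_toks) / len(sent_toks)

def combines_chains(output_sentences: "list[str]", chains: "list[str]",
                    threshold: float = 0.8) -> bool:
    # matches[i] = indices of the chains that output sentence i appears in
    matches = [{j for j, chain in enumerate(chains)
                if rouge1_precision(s, chain) > threshold}
               for s in output_sentences]
    for i, m_i in enumerate(matches):
        if len(m_i) != 1:
            continue  # need a sentence appearing in exactly one chain
        (only_chain,) = m_i
        for j, m_j in enumerate(matches):
            # another sentence absent from that chain but in some other chain
            if j != i and m_j and only_chain not in m_j:
                return True
    return False
```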
Explanation Quality We sampled examples (§C.3) and scored their meta-reasoner explanations. Each explanation is scored as either 1 (irrelevant), 2 (partially relevant) or 3 (highly relevant), based on its relevance to answering the question. We find the explanation is highly relevant in 82% of the cases (87% excluding FERMI, which is most challenging), and is irrelevant in less than 3%.

Next, we evaluated the faithfulness of explanations (Jacovi and Goldberg, 2020), namely, whether a person provided only with the question and the MCR explanation would give the same answer as the model. Of the aforementioned examples, we focused on those with quality explanations (score 3), as they are answerable given the explanation. We answered each question based on the model's explanation. In 90% of the cases (95% excluding FERMI), the MCR predictions matched our own, highlighting the interpretability of MCR and the faithfulness of its explanations. We attribute part of the gap between human and MCR predictions to implicit reasoning tasks, where humans lead by five points, on average. For additional details see §C.3.

Error Analysis
We manually analyze 700 errors made by MCR (100 per dataset). We consider the following error categories:
• Valid predictions: the generated answer is accurate, or the original question is ambiguous.
• Decomposition errors: none of the chains have the reasoning steps to correctly answer the question.
• Retrieval errors: the retrieved contexts were irrelevant and led the model to hallucinate.
• Explanation errors: MCR generates a wrong explanation while a correct one is present in the multi-chain context.
• Answer errors: the MCR explanation is correct, but the final answer is not.
• Contradicting facts: MCR errs due to contradicting facts appearing in the multi-chain context, as a result of model hallucination.

Conclusion
This work introduces MCR for meta-reasoning over multiple reasoning chains. We evaluate MCR on 7 multi-hop QA datasets that require both implicit and explicit reasoning in an open-domain setting, and show that it outperforms previous approaches on all evaluation benchmarks.

Limitations
In this work we introduce a meta-reasoner model to reason over multiple reasoning chains. While we opt for a prompted LLM as our meta-reasoner, we do not experiment with a fine-tuned meta-reasoning model. For the meta-reasoner context, we experiment with variants which include either generated QA pairs or retrieved evidence sentences. We leave further improvements to the meta-reasoner context as future work. Due to the inference costs of current state-of-the-art LLMs, we evaluate on the code-davinci-002 model, similar to prior work (Trivedi et al., 2022a; Wang et al., 2023). To further improve the reproducibility of our work, we added results with an open-source LLM (Chiang et al., 2023) and retriever (Khattab and Zaharia, 2020). In the future, we plan to extend the open-source model results to include additional datasets besides STRATEGYQA and HOTPOTQA.

We provide technical details, additional examples, and the exact prompts used below.

A.1 FERMI
The FERMI dataset requires approximating numeric answers for open-ended questions. Example questions are shown in Tab. 1 and Fig. 2. When providing a FERMI question to our models and baselines, we also add the gold answer's measurement units (e.g., meters, cubes, litres). While this additional input helps the model, we note that we provide it to all our baselines, for a fair comparison with MCR. Nevertheless, even given the gold units, predicting the final answers to FERMI problems remains highly challenging.

A.2 Evaluation
As we prompt LLMs to generate answers, a potential outcome is for the model to abstain from answering the question by generating the token Unknown as its answer. An additional case is when the model generates an end-of-sequence token without a final answer. In the binary-choice datasets, STRATEGYQA, QUARTZ and FEVEROUS, we assign a score of 0.5 to such examples, thereby simulating a random guess. When submitting predictions to the STRATEGYQA test set, we identify cases of model abstentions or null predictions beforehand. For these examples we assign a label of either Yes or No at random. In datasets with open-ended answers, we assign a score of 0 when the predicted answer is either Unknown or null. To make Self-Ask a stronger baseline, when the greedy-decoded chain has a null answer, we randomly choose a prediction from one of the other chains. For SC, we do not consider predictions from chains whose answers are Unknown or null.
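A minimal sketch of the abstain-aware scoring described above; `metric` stands for the dataset metric (F1, exact match, or the FERMI order-of-magnitude score), and the function names are illustrative.

```python
from typing import Callable, Optional

def score_prediction(pred: Optional[str], gold: str,
                     binary_choice: bool,
                     metric: Callable[[str, str], float]) -> float:
    # Abstentions: Unknown or null predictions.
    if pred is None or pred.strip().lower() == "unknown":
        # Binary-choice datasets get 0.5 (a simulated random guess);
        # open-ended datasets get 0.
        return 0.5 if binary_choice else 0.0
    return metric(pred, gold)
```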

B Models

B.1 Retrieval
For our retrieval, we use the Google Search Engine, via SerpAPI, and return the top-1 retrieved result as an evidence snippet. Snippets can include answer boxes and tables. We prepend the page title to the beginning of the snippet, as shown in Fig. 7.
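A hedged sketch of this top-1 snippet retrieval using SerpAPI's Python client (the google-search-results package); the exact response fields can vary by query type (e.g., answer boxes and tables), so the parsing below is an assumption rather than the authors' implementation.

```python
from serpapi import GoogleSearch  # pip install google-search-results

def retrieve_top1(query: str, api_key: str) -> str:
    results = GoogleSearch({"q": query, "api_key": api_key}).get_dict()
    organic = results.get("organic_results") or [{}]
    title = organic[0].get("title", "")
    snippet = organic[0].get("snippet", "")
    # Prepend the page title to the snippet (Fig. 7).
    return f"{title}: {snippet}" if title else snippet
```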

B.2 Hyperparameters
We describe the hyperparameters used in our MCR model, such as performing retrieval on the original question and a variant of the meta-reasoner prompt for FEVEROUS. Due to cost limitations, we evaluate our design choices at a smaller scale and avoid running an exhaustive grid search.
Retrieval for the Original Question We follow previous work (Trivedi et al., 2022a) by incorporating retrieved evidence for the original question in addition to evidence for intermediate steps (§3.1). While this has a positive or negligible effect on most datasets, it dramatically decreases the results of all models on the FERMI task. Results drop for SA (38.3±0.7 to 34.7±0.5), SC (38.3±0.8 to 34.4±0.3), SCR (38.1±0.8 to 34.4±0.8) and MCR (38.9±0.8 to 37.0±0.7). Therefore, our models are run without original-question retrieval when evaluated on FERMI. Interestingly, while all models perform roughly the same without original-question retrieval, MCR is better by 2 points when evidence for the original question is used. We hypothesize that this might be due to MCR being more robust to the addition of irrelevant evidence.
FEVEROUS Meta-Reasoner Prompt As described in §3.2, the meta-reasoner generates an explanation which precedes the final answer. FEVEROUS is distinct from all other datasets, as it requires the verification of multiple facts in order to verify or disprove a complex statement. When a statement is false, we list one or more of its false intermediate facts along with its correction. For example, in Fig. 4 we explain that Robert Broderip lived in Bristol, not London. When prompting the meta-reasoner to list both true and false intermediate facts, we observed a decrease in performance for both MCR (69.4±1.0 to 66.4±0.7) and SCR (65.1±0.4 to 62.9±0.3). We hypothesize that repeating multiple true facts excessively prompts the model to predict the label "Yes" in cases where most facts are correct.

B.3 Contemporary Work
As discussed in §6, several recent works have introduced retrieval-augmented LLMs, prompted to answer multi-hop questions step-by-step. Closest to our setting are Self-Ask (Press et al., 2022), IR-CoT (Trivedi et al., 2022a) and DSP (Khattab et al., 2022). Next, we briefly review these works and describe their different experimental settings.
We follow Self-Ask (Press et al., 2022) in our format for decomposition into intermediate questions and answers. However, Self-Ask uses Google Search to retrieve intermediate answers, rather than having the model generate answers conditioned on the retrieved sentences. They experiment with the text-davinci-002 model by OpenAI and show that Self-Ask prompting outperforms CoT on the BAMBOOGLE, 2WIKIMQA, and MUSIQUE (Trivedi et al., 2022b) datasets.
Our interleaving of decomposition and retrieval is similar to that of IR-CoT (Trivedi et al., 2022a). As their retriever they use BM25 (Robertson and Zaragoza, 2009), while experimenting with both code-davinci-002 and T5-FLAN (Wei et al., 2022a). For their evaluation tasks, they test on four multi-hop datasets that require explicit reasoning: 2WIKIMQA, HOTPOTQA, MUSIQUE and IIRC (Ferguson et al., 2020).
DSP (Khattab et al., 2022) introduces a framework for passing information between an LLM and a retriever. They experiment with ColBERTv2 (Santhanam et al., 2022) as their retriever and the text-davinci-002 LLM. They show significant gains on three tasks: HOTPOTQA for multi-hop question answering, open-domain factoid QA with SQUAD (Chen et al., 2017), and QRECC (Anantha et al., 2021) for conversational question answering.
Additional differences include the retrieval corpora. Both IR-CoT and DSP use the official Wikipedia dump provided with the HOTPOTQA dataset (Yang et al., 2018). Our retrieved evidence comes from an updated version of Wikipedia, via Google Search. As certain facts may change over time, this could partially explain the high percentage of MCR predictions labeled as valid in our error analysis (§5).

C Analysis
C.1 When are Multiple Chains Helpful?
In §5, we showed that the advantage of MCR over SCR lies in examples where the meta-reasoner uses chains other than the one generated through greedy decoding. In Fig. 8 we provide results for all datasets, in addition to the STRATEGYQA results in Fig. 6. A similar trend is shared amongst all datasets: in examples with lower similarity to the greedy chain, MCR tends to beat SCR.
The similarity between the meta-reasoner explanation and the greedy-decoded reasoning chain is defined as follows: We calculate the ROUGE-1 precision (Lin, 2004) between the explanation and the chain. Low, Medium, and High are based on thresholds of 1/3, 2/3, and 1 respectively, with the Identical category indicating an exact match.
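A minimal sketch of this binning, assuming the ROUGE-1 precision has already been computed (e.g., with a function like the one sketched in §5):

```python
def similarity_bin(rouge1_prec: float, identical: bool) -> str:
    # Identical: the explanation exactly matches the greedy chain.
    if identical:
        return "Identical"
    if rouge1_prec < 1 / 3:
        return "Low"
    if rouge1_prec < 2 / 3:
        return "Medium"
    return "High"  # precision in [2/3, 1), but not an exact match
```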

C.2 Combining Reasoning Chains
Fig. 9 provides additional examples for combining facts between multiple reasoning chains.

C.3 Explanation Quality Analysis
We provide additional details on the annotation for scoring meta-reasoner explanations. The task was performed by 4 annotators who are NLP graduate students and authors of this paper. The annotators were presented with a question and an explanation and asked to perform two tasks: (a) score the explanation for its quality, and (b) answer the question based on the meta-reasoner explanation. We provide the full instructions shown to the annotators in Fig. 10 and the full results in Tab. 8.

C.4 Error Analysis
We provide additional details regarding our error analysis (§5). In less than 5% of cases, we encountered grammatically bad questions which we were unable to comprehend and which were therefore discarded from our analysis, for example the HOTPOTQA question: "What does the goddess associated with the goddess Frigg consists of what tales?"

Our approach consists of a pipeline. The input to our meta-reasoner model is a context comprised of (q_i, a_i) pairs that were generated by the decomposition model. As the decomposition model is an LLM conditioned on retrieved evidence and prior decomposition steps, it may potentially hallucinate false intermediate answers. In cases of decomposition hallucinations, we distinguish between two error types, based on the component responsible for the error. First, Retrieval errors are cases where no relevant information was retrieved, leading the decomposition model to hallucinate an incorrect a_i, which is passed on to the meta-reasoner's context. Second, we treat cases where relevant evidence was retrieved, but the decomposition model ignored it and hallucinated an incorrect a_i, as Decomposition errors.

Figure 8: MCR and SCR accuracy on FERMI, QUARTZ, 2WIKIMQA, BAMBOOGLE, HOTPOTQA, and FEVEROUS, on examples categorized by their MCR explanation's similarity to the greedy chain. MCR performs similarly to SCR when similarity is high, and outperforms SCR when similarity is lower. Error bars indicate standard deviation, which tends to be high when the number of examples in the bin is small. For FEVEROUS we display the variant where MCR has to repeat all relevant facts (§B.2), to make sure the MCR explanation is not empty.
Errors stemming from Contradicting Facts are cases where the meta-reasoner context contains two contradicting facts, one accurate while the other was hallucinated by the decomposition model. For example, Fig. 11 displays an example where the context contains contradicting facts regarding who was the father of Eliezer Ben-Yehuda. When the meta-reasoner faces contradicting facts, it is expected to choose the correct fact based on the knowledge encoded in its parameters. Addressing such errors in future work could rely on refining generated text with methods such as RARR (Gao et al., 2022).
Overall, as our error classes mainly match the MCR components, the breakdown of MCR errors by dataset and error class may help guide future improvements.

C.5 Reasoning on Retrieved Contexts
The meta-reasoner answers questions given a multi-chain context of question-answer (q_i, a_i) pairs, extracted from multiple reasoning chains (§3.2). We experiment with an alternative multi-chain context, comprised of questions and retrieved contexts (q_i, c_i) (§3.1). This setting resembles past work (Trivedi et al., 2022a); however, our sentences are intermediate contexts from multiple reasoning chains, not just the greedy-decoded chain. We compare these variants, MCR-EV and SCR-EV, to MCR and SCR, which reason on QA pairs. Tab. 9 shows that meta-reasoning on retrieved contexts is less effective than on QA pairs. This is more evident in implicit reasoning tasks, perhaps due to retrieved contexts being less relevant on average. Example prompts for MCR-EV and SCR-EV are in §D.

Table 8: Full results for the explanation quality analysis. Sim_predictions indicates the similarity between the human and MCR prediction, calculated using the dataset-specific metrics described in §4.1. Human_acc and MCR_acc represent the accuracy of human and MCR predictions, respectively. Since only explanations with a score of 3 are guaranteed to contain the necessary information to arrive at an answer, we filter other examples when calculating sim_predictions, Human_acc, and MCR_acc.
D Prompts

Prompt instructions vary slightly between datasets. All our prompts are in the code associated with this paper. We use random examples and spend minimal effort on prompt engineering. The number of exemplars varies slightly between datasets and models. We present the exact numbers in Tab. 10.

Figure 2: An overview of MCR, given a question from the FERMI dataset. Steps 1-2 generate multiple reasoning chains by conditioning the generation of intermediate questions and answers on retrieved evidence sentences. In step 3, the meta-reasoner generates the final answer, given multiple reasoning chains from the previous steps.
Figure 5: An example from STRATEGYQA where the greedy chain is insufficient to answer the question. MCR beats SCR by having access to multiple chains.

Figure 6: MCR and SCR accuracy on STRATEGYQA, categorized by the similarity of the greedy chain to the MCR explanation. When MCR uses a chain other than the greedy one (lower similarity), it outperforms SCR.

Figure 7: Example of a retrieved evidence snippet for one of the intermediate questions from Fig. 1.

Figure 9: Examples of combining facts from multiple chains.

Figure 10: The annotation instructions for the MCR explanation quality analysis.
Figure 11: Example of a Contradicting Facts error. When generating the explanation, the meta-reasoner has to rely on knowledge encoded in its parameters to decide between multiple contradicting facts in its context about who was the father of Eliezer Ben-Yehuda:

Given a question and a context, answer the question step-by-step. If you are unsure, answer Unknown.
Context:
Who is the father of modern Hebrew? The father of modern Hebrew is Eliezer Ben-Yehuda. Who is the father of Eliezer Ben-Yehuda? The father of Eliezer Ben-Yehuda is Abraham. ...
Who is the father of modern Hebrew? The father of modern Hebrew is Eliezer Ben-Yehuda. Who is the father of Eliezer Ben-Yehuda? Eliezer Ben-Yehuda's father is Yehuda Leib.
Question: Who is the father of the father of modern Hebrew?
Answer: The father of modern Hebrew is Eliezer Ben-Yehuda. The father of Eliezer Ben-Yehuda is Abraham. So the answer is: Abraham.
Gold answer is: Yehuda Leib
Figure 4: An exemplar from the FEVEROUS meta-reasoner prompt:

Given a question and a context, answer the question step-by-step. If you are unsure, answer Unknown.
Is it true that Robert Broderip lived in London all his life and wrote a considerable quantity of music during the earlier part of the nineteenth century?
Answer: Robert Broderip lived in Bristol all his life, not in London. So the answer is: No.

Table 1: The multi-hop QA datasets in our experiments.

Table 2: Experiments using code-davinci-002 on seven multi-hop open-domain QA datasets. Results are averaged over 3 runs. BAMBOOGLE results are averaged over 5 runs due to its smaller size.

Table 3: Running SC and MCR on 15 reasoning chains.

Table 4: Test set results for STRATEGYQA and FERMI.

Table 6: Experiments using the ColBERTv2 retriever with code-davinci-002 and Vicuna-13B on the STRATEGYQA and HOTPOTQA datasets. Results are averaged over 3 runs.

Table 7: Error classes per dataset.
For a thorough survey on LLM reasoning, see Lu et al. (2022); Huang and Chang (2022); Qiao et al.

Table 9: Effect of using question-answer pairs versus question-context pairs as input to the meta-reasoner.

Table 10: The number of exemplars for each model and dataset. Since MCR and SCR, and MCR-EV and SCR-EV, use the same prompts, they have the same number of exemplars.