Self-prompted Chain-of-Thought on Large Language Models for Open-domain Multi-hop Reasoning

In open-domain question-answering (ODQA), most existing questions require single-hop reasoning over commonsense knowledge. To further extend this task, we formally introduce open-domain multi-hop reasoning (ODMR): answering multi-hop questions with explicit reasoning steps in an open-domain setting. Recently, large language models (LLMs) have found significant utility in facilitating ODQA without an external corpus. Furthermore, chain-of-thought (CoT) prompting boosts the reasoning capability of LLMs, whether with manual or automated paradigms. However, existing automated methods lack quality assurance, while manual approaches suffer from limited scalability and poor diversity, hindering the capabilities of LLMs. In this paper, we propose Self-prompted Chain-of-Thought (SP-CoT), an automated framework to mass-produce high-quality CoTs of LLMs, by LLMs and for LLMs. SP-CoT introduces an automated generation pipeline for high-quality ODMR datasets, an adaptive sampler for in-context CoT selection, and self-prompted inference via in-context learning. Extensive experiments on four multi-hop question-answering benchmarks show that our proposed SP-CoT not only significantly surpasses the previous SOTA methods on large-scale (175B) LLMs, but also nearly doubles the zero-shot performance of small-scale (13B) LLMs. Further analysis reveals the remarkable capability of SP-CoT to elicit direct and concise intermediate reasoning steps, recalling $\sim$50\% of intermediate answers on the MuSiQue-Ans dataset.


Introduction
Open-domain question-answering (ODQA) is a longstanding and challenging task that addresses factoid commonsense questions without any specific context provided. While existing works in ODQA primarily focus on resolving questions that mostly require single-hop reasoning, there is burgeoning interest in multi-hop question-answering (MHQA), which aims to derive the correct answer through multi-step reasoning over a collection of candidate articles (Mavi et al., 2022). Yet a significant disparity exists between this scenario and real-world applications, since the latter often lack an explicit set of candidate articles provided by users. In light of this, we formally introduce open-domain multi-hop reasoning (ODMR) as a progression of ODQA, which requires MHQA with explicit rationales in an open-domain setting.
For ODMR, an emerging approach is to leverage large language models (LLMs) due to the vast knowledge stored within their numerous parameters. In recent years, LLMs such as GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022) and InstructGPT (Ouyang et al., 2022) have shown powerful reasoning and instruction-following capabilities. After extensive training on vast corpora of textual data, LLMs prove to be zero-shot reasoners on complex reasoning tasks by breaking down multi-step questions into intermediate ones for step-by-step reasoning before producing the final answer (Kojima et al., 2023). Such a series of intermediate reasoning steps is known as a chain-of-thought (CoT) (Wei et al., 2023). CoTs often serve as in-context demonstrations for in-context learning (ICL) (Brown et al., 2020), which enables LLMs to generate outputs that are formally consistent with a target task via a few reference examples provided as prompt. Manual-CoT (Wei et al., 2023) adopts manually designed CoTs as in-context demonstrations to improve the reasoning performance of LLMs. However, it demands delicate and meticulous design by humans, and the demonstrations are the same for each question, which may be sub-optimal. Zero-shot-CoT (Kojima et al., 2023) was proposed to trigger automated CoTs with specific prompting techniques, such as "Let's think step by step:". Zhang et al. (2022) proposed Auto-CoT, an automated framework to mass-produce CoTs and build in-context demonstrations. However, previous works have not fully leveraged the strong instruction-following and zero-shot reasoning capabilities of LLMs.
In this paper, we propose Self-prompted Chain-of-Thought (SP-CoT), an LLM-only framework to mass-produce high-quality CoTs for ODMR. In general, SP-CoT introduces an automated generation pipeline for ODMR datasets, an adaptive sampler for CoT selection, and self-prompted inference via ICL. The automated ODMR datasets are MHQA datasets without candidate contexts, yet they include multi-hop questions with six types of complex reasoning chains and step-by-step decompositions. Each intermediate QA step is equipped with a short explanation to justify the answer. By leveraging the ICL ability of LLMs, our method is generally effective on LLMs of different scales.
We evaluate our method on four MHQA datasets in an open-domain setting: ComplexWebQuestions (CWebQ) (Talmor and Berant, 2018), HotpotQA (Yang et al., 2018), 2WikiMultiHopQA (2Wiki) (Ho et al., 2020) and MuSiQue-Ans (MSQ) (Trivedi et al., 2022). Extensive experiments show that our proposed SP-CoT not only significantly surpasses the previous SOTA methods on large-scale (175B) LLMs, but also nearly doubles the zero-shot performance of small-scale (13B) LLMs in ODMR. Further analysis reveals the outstanding capability of SP-CoT to elicit direct and concise intermediate reasoning steps, recalling ∼50% of intermediate answers on the MSQ dataset.
Our contributions can be summarized as follows: 1. We introduce an automated pipeline to generate high-quality ODMR datasets by LLMs, which include 2-4 hop questions with six types of complex reasoning chains.
2. We propose SP-CoT, an automated framework to mass-produce CoTs while ensuring quality and diversity.
3. We conduct extensive experiments to confirm the effectiveness of SP-CoT on four ODMR benchmarks. In the ODMR setting, our approach significantly boosts performance by eliciting high-quality intermediate reasoning steps.

Multi-Hop Dataset Creation
Creating an annotated MHQA dataset manually requires significant human resources. Therefore, some researchers are dedicated to automating the generation of MHQA datasets. Jiang et al. (2020) elaborated the creation of a multi-hop fact verification dataset from the existing HotpotQA dataset. Trivedi et al. (2022) introduced a bottom-up process to build a challenging multi-hop reading comprehension QA dataset through meticulous selection and composition of single-hop questions derived from existing datasets. Press et al. (2023) proposed an automatically generated dataset with compositional 2-hop questions about celebrities. Nevertheless, existing approaches are either only partially automated, still requiring crowdsourcing, or limited to less complex 1-2 hop questions. In this work, our proposed SP-CoT is capable of automatically generating 2-4 hop questions with six different types of reasoning chains (Figure 6 in Appendix).

Chain-of-Thought Prompting
Recent works on CoT prompting can be divided into two research lines. The first is prompting LLMs step by step to leverage their comprehension and reasoning abilities to answer questions. Zero-shot-CoT (Kojima et al., 2023) adopts a two-stage design, which requires LLMs to first generate intermediate rationales and then produce an answer. Wang et al. (2022) introduced iCAP, which iteratively prompts a fine-tuned small-scale LLM to generate CoTs and then combines the generated rationales to formulate answers. Least-to-Most (Zhou et al., 2023) requires LLMs to first decompose a complex question into sub-questions and then sequentially solve them to arrive at the final answer.
The second research line focuses on designing effective CoTs as demonstrations for ICL to unlock more powerful reasoning abilities of LLMs. Manual-CoT (Wei et al., 2023) was introduced to leverage manually designed CoTs as in-context demonstrations to solve arithmetic, commonsense, and symbolic problems. A recent work (Zelikman et al., 2022) shed light on the practicality of automating the generation of rationales by LLMs. Subsequently, Self-Ask (Press et al., 2023) was proposed to construct step-by-step demonstrations with an explicit decision process and intermediate answers output as CoT. Zhang et al. (2022) proposed Auto-CoT, which automatically constructs CoTs via LLMs and adopts clustering methods to dynamically build demonstrations for each question.

(Figure panels omitted here: "Generation", "Adaptive CoT Construction", and "Self-prompted Inference via In-context Learning", with example demonstrations and a test sample.)

However, existing methods have two significant limitations: 1) Over-reliance on the reasoning abilities of LLMs. Most methods are reported effective on large-scale LLMs like InstructGPT, while reproducing these methods on small-scale LLMs is quite challenging. 2) Over-confidence in the quality of intermediate results. When prompting LLMs step by step, defects in previous steps may limit the performance of subsequent steps. Similarly, when automatically constructing in-context demonstrations, the effectiveness of ICL might be compromised by the unstable quality of CoTs. Admittedly, manually constructed CoTs can ensure quality, yet they face a trade-off between content diversity and cost. To overcome the above drawbacks, our proposed SP-CoT automates CoT generation with quality ensured by leveraging the strong instruction-following capability of LLMs.

Model Enhancement via LLM Generation
With the powerful capabilities of LLMs in content generation and instruction-following, one recent research direction extensively leverages the content generated by LLMs to enhance smaller LLMs. Recent works such as GPTeacher, Alpaca (Taori et al., 2023) and Vicuna (Chiang et al., 2023) collect the content generated by GPT-4 (OpenAI, 2023) together with the corresponding prompts to train smaller-scale LLMs to achieve comparable performance. Another research line aims to boost the performance of large-scale LLMs to higher levels by leveraging their self-generated content. Some works use the self-generation as context to assist the model in answering questions, such as eliciting intermediate rationales as CoT (Kojima et al., 2023) or generating background articles for reading comprehension (Yu et al., 2023). Others instruct LLMs to generate demonstrations for ICL during inference (Zhang et al., 2022), such as prompting LLMs to generate reliable QA pairs as self-prompted in-context demonstrations (Li et al., 2022). Our work extends the second research line to ODMR by leveraging automated self-generated CoTs as in-context demonstrations. Compared to previous works (Kojima et al., 2023; Zhang et al., 2022; Li et al., 2022), our work taps the potential of self-prompting LLMs with a more complicated framework design to solve one of the most challenging tasks.

Methods
In this section, we elaborate our proposed SP-CoT in three stages (Figure 1). In the first stage, we prompt LLMs to iteratively generate 2-hop commonsense QA quadruplets comprising context, question, answer and explanation.
Figure 2: Generation steps for 2-hop QA quadruplets (prompt and response examples omitted). Each QA quadruplet comprises a question q, its corresponding answer a, a context passage p, and an explanation e, which includes the answer a from the passage p to address the question q. Text highlighted in orange refers to previously generated content, while the response of the LLM is highlighted in green.
In stage 2, we construct multi-hop reasoning chains by connecting 2-hop QA quadruplets and build an ODMR dataset via composition.
In the last stage, we adopt a clustering-based sampling approach to dynamically select and construct in-context demonstrations for inference.

2-Hop QAs via Self-Generation
In the first stage, we prompt LLMs to iteratively generate 2-hop QA quadruplets with context, question, answer and explanation, as illustrated in Figure 2. Inspired by Li et al. (2022), we design a 2-hop commonsense QA generation pipeline with the following 4 steps: Step 1: First-Hop Passage Generation To guarantee comprehensive coverage of commonsense knowledge, we manually design 29 diverse topics based on the statistics of TriviaQA (Joshi et al., 2017). For each topic, we require the LLM to name a certain number of keywords. For each collected keyword k_1, we ask the LLM to generate a Wiki-style passage p_1. Despite some factoid errors (Li et al., 2022), such generated passages contain sufficient factual information to serve as context for QA generation.
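The keyword-collection loop in Step 1 can be sketched as follows. `call_llm` is a hypothetical stand-in for an actual LLM API call; it returns canned strings here purely for illustration, and the prompt wording follows the templates above.

```python
def call_llm(prompt: str) -> str:
    """Toy LLM stub with canned replies, for illustration only."""
    if prompt.startswith("Name"):
        return "Albert Einstein, Marie Curie, Isaac Newton"
    # Echo a fake Wiki-style passage for any passage-generation prompt.
    return f"Wiki-style passage about {prompt.split('about ')[-1].rstrip('.')}."

def collect_first_hop_passages(topics, n_keywords=10):
    """For each topic, ask for keywords, then a Wiki-style passage per keyword k1."""
    passages = {}
    for topic in topics:
        reply = call_llm(f"Name {n_keywords} {topic}:")
        keywords = [k.strip() for k in reply.split(",") if k.strip()]
        for k1 in keywords:
            passages[k1] = call_llm(
                f"Generate a detailed passage from Wikipedia about {k1}.")
    return passages

passages = collect_first_hop_passages(["scientists"])
```

In practice the stub would be replaced by a real completion endpoint; the parsing assumes the model returns a comma-separated keyword list, as elicited by the "Name {Number} {Topic}:" template.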
Step 2: First-Hop QA Generation Given that the answers to commonsense questions are likely to be named entities, we use the Spacy and NLTK (Bird and Loper, 2004) libraries to extract the named entities in the passage p_1 as candidate answers. For each candidate answer a_1, we require the LLM to raise a question q_1 to which the answer is a_1, based on the passage p_1. To ensure the quality of q_1, we employ a double-check process, where we demand the LLM answer the generated question q_1 given the context p_1 to check whether the generated answer a_1' accords with a_1. Once the generated QA pair passes the double-check, we prompt the LLM to write a short explanation e_1 for it. Note that the candidate answers must exclude the keyword (a_1 ≠ k_1), because the answer in the first hop becomes the keyword for the second hop (k_2 = a_1, k_2 ≠ k_1). In addition, a valid explanation must contain the answer (a_1 ∈ e_1).
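A minimal sketch of the double-check filter described above, enforcing the three validity constraints (re-answer agreement, a_1 ≠ k_1, a_1 ∈ e_1). The `ask_llm` interface and the toy replies are assumptions for illustration, not the paper's actual prompt wiring.

```python
def double_check_qa(ask_llm, passage, keyword, candidate):
    """Return a verified (q1, a1, e1) triple, or None if any check fails.

    `ask_llm(prompt) -> str` is a hypothetical LLM interface.
    """
    if candidate == keyword:            # a1 must differ from k1: it becomes k2
        return None
    q1 = ask_llm(f"Passage about {keyword}: {passage}\n"
                 f"Generate a question to which the answer is the entity '{candidate}'.")
    recheck = ask_llm(f"Passage about {keyword}: {passage}\nQuestion: {q1}\nAnswer:")
    if recheck.strip() != candidate:    # double-check: re-answer must match a1
        return None
    e1 = ask_llm(f"Write a short explanation supporting the answer '{candidate}'.")
    if candidate not in e1:             # a valid explanation must contain a1
        return None
    return q1, candidate, e1

def toy_llm(prompt):
    """Canned replies, for illustration only."""
    if "Generate a question" in prompt:
        return "What is the city where Albert Einstein died?"
    if "Question:" in prompt:
        return "Princeton"
    return "Albert Einstein died in Princeton, New Jersey."

result = double_check_qa(toy_llm, "Albert Einstein was ...",
                         "Albert Einstein", "Princeton")
```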
Step 3: Second-Hop Passage Generation Before the first-hop answers are used as keywords for second-hop passage generation, we use Spacy to filter out answers with certain labels (QUANTITY, ORDINAL, CARDINAL, PERCENT, MONEY, DATE, TIME), which are infeasible for Wiki-style passage generation. Given a keyword k_2, we repeat the same prompts as described in Step 1 to generate the passage p_2.
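The label filter in Step 3 reduces to a membership test over (text, label) pairs, such as those obtained from spaCy via `[(ent.text, ent.label_) for ent in doc.ents]`:

```python
# NER labels that cannot anchor a Wiki-style passage (the Step 3 filter).
INFEASIBLE_LABELS = {"QUANTITY", "ORDINAL", "CARDINAL",
                     "PERCENT", "MONEY", "DATE", "TIME"}

def feasible_keywords(entities):
    """Keep only entities whose NER label supports passage generation."""
    return [text for text, label in entities
            if label not in INFEASIBLE_LABELS]
```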
Step 4: Second-Hop QA Generation We begin with extracting candidate answers in the generated passage p_2 while blocking the keyword k_1 and the answer a_1 (also known as k_2) of the first-hop QA to avoid cyclic graphs.
For each candidate answer a_2, we require the LLM to generate a question q_2 that contains the first-hop answer a_1 and can be answered by the candidate answer a_2. We examine the quality of q_2 with the same double-check as in Step 2, and ensure that the second-hop question q_2 contains the first-hop answer a_1 (a_1 ∈ q_2) for connected reasoning. Then we repeat the same prompts as in Step 2 to generate the explanation e_2. So far, we have instructed the LLM to generate a 2-hop commonsense QA quadruplet pair, (p_1, q_1, a_1, e_1) → (p_2, q_2, a_2, e_2) with a_1 ∈ q_2. Detailed prompt templates are shown in Figure 2 and Appendix B.

Multi-Hop QAs via Composition
In stage 2, we construct multi-hop reasoning chains from the connected 2-hop QA quadruplets, as illustrated in Figure 3. We propose an automated dataset construction pipeline to build ODMR datasets with 2-4 hops, which has the following 4 steps: Step 1: Reasoning Chain Composition To connect more questions, we follow the composability criterion (Trivedi et al., 2022): two single-hop QA pairs (q_1, a_1) and (q_2, a_2) are composable into a multi-hop question Q with a_2 as a valid answer if a_1 is a named entity and is mentioned in q_2. This criterion is already satisfied when our 2-hop QA pairs are generated; we use it for connecting more questions. We adopt 6 reasoning graphs with 2-4 hops to build 6 types of multi-hop reasoning chains (Figure 6 in Appendix), and we ensure that in each reasoning chain: 1) the answer a_i to an intermediate question q_i appears and ONLY appears in its next-hop question q_{i+1}, to avoid shortcuts; 2) the answer to the last question does NOT appear in any intermediate question.
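The composability criterion and the two anti-shortcut constraints can be expressed directly as substring checks. This is a simplification: real entity mentions may require normalization before matching.

```python
def composable(q2: str, a1: str) -> bool:
    """Composability criterion (Trivedi et al., 2022): (q1, a1) and (q2, a2)
    compose into a multi-hop question answered by a2 when the first-hop
    answer a1 (a named entity) is mentioned in q2."""
    return a1 in q2

def valid_chain(questions, answers) -> bool:
    """Check the two anti-shortcut constraints on a composed chain:
    each intermediate answer appears in its next-hop question and ONLY there,
    and the final answer appears in no question of the chain."""
    for i, a in enumerate(answers[:-1]):
        for j, q in enumerate(questions):
            # a must appear in q exactly when q is the next-hop question.
            if (a in q) != (j == i + 1):
                return False
    return all(answers[-1] not in q for q in questions)
```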
Step 2: Duplication Control Built by rule-based composition, our new dataset contains considerably similar reasoning chains that share duplicate intermediate questions. To ensure the diversity and simplicity of our dataset, we filter out reasoning chains that exceed a preset duplication degree, defined as the number of questions in a chain that co-exist in other chains of the same reasoning type.
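A sketch of the duplication filter, assuming each chain is a dict with `type` and `questions` keys; the threshold `max_dup` is an assumed knob, since the paper does not state the preset value.

```python
from collections import Counter

def group_by_type(chains):
    """Bucket chains by their reasoning type."""
    groups = {}
    for c in chains:
        groups.setdefault(c["type"], []).append(c)
    return groups

def filter_duplicates(chains, max_dup=2):
    """Drop chains whose duplication degree exceeds `max_dup`, where the
    degree counts a chain's questions that also occur in other chains of
    the same reasoning type."""
    kept = []
    for group in group_by_type(chains).values():
        counts = Counter(q for chain in group for q in set(chain["questions"]))
        for chain in group:
            degree = sum(1 for q in set(chain["questions"]) if counts[q] > 1)
            if degree <= max_dup:
                kept.append(chain)
    return kept
```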
Step 3: Binary Question Generation We notice that MHQA datasets also include general interrogative questions that should be answered with "Yes" or "No" rather than a named entity. Therefore, we leverage the LLM to reform the last QA (q_n, a_n) of some reasoning chains into a binary question, with 4 manually designed in-context demonstrations. For each reasoning type, we randomly sample 10% of the reasoning chains for positive question generation and 10% for negative ones. Then we form a new reasoning chain from the generated binary question together with its previous question hops and add it to the dataset.

Table 1: Comparison of different approaches on four MHQA benchmarks. The fine-tuning methods are fine-tuned on the train split of the NQ (Kwiatkowski et al., 2019) dataset. Among them, methods marked with "†" use the Wikipedia dump (Karpukhin et al., 2020) as extra corpus. For retrieval-based methods, we use a fine-tuned DPR (Karpukhin et al., 2020) … (table body omitted).
Step 4: Multi-Hop Question Generation Now we need to generate the multi-hop questions, to which the previously generated question chains serve as intermediate reasoning steps. For each question chain, we iteratively replace the answer a_i to an intermediate question q_i in the next-hop question q_{i+1} with [q_i], until the last question q_n has been rewritten; the bracketed text indicates a relative clause. Then we leverage the LLM to reform it into a natural multi-hop question with 4 manually designed in-context demonstrations.
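The iterative answer-to-clause substitution in Step 4, performed before the LLM rephrases the nested form into a natural question, can be sketched as:

```python
def compose_question(questions, answers):
    """Iteratively substitute each intermediate answer a_i in the next-hop
    question q_{i+1} with the bracketed previous question [q_i], yielding
    the nested relative-clause form that the LLM later rephrases.

    `answers` holds the intermediate answers a_1 .. a_{n-1}."""
    current = questions[0]
    for q_next, a_prev in zip(questions[1:], answers):
        current = q_next.replace(a_prev, f"[{current}]")
    return current

composed = compose_question(
    ["What is the city where Albert Einstein died?",
     "What is the county that is home to Princeton?"],
    ["Princeton"])
```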
Through the pipeline above, we construct a high-quality ODMR dataset with 2-4 hops, including the overall multi-hop questions and the decomposed reasoning chains with detailed QA quadruplets. With the double-check in generation and the composability criterion, the resulting dataset is of reliably high quality. Detailed prompt templates are presented in Figure 3 and Appendix B.

Adaptive In-context Demonstration
In this stage, we sample multi-hop questions from our generated ODMR dataset as in-context demonstrations.
Clustering-based Retrieval Some previous works (Zhang et al., 2022; Li et al., 2022) have shown that clustering-based methods benefit from the diversity of demonstrations. We adopt a clustering-based retrieval approach to adaptively sample in-context demonstrations for the input question. First, all the questions are projected into a high-dimensional hidden space by encoding with Sentence-BERT (Reimers and Gurevych, 2019). Suppose we need n in-context demonstrations.
Given a test question Q, we use k-means to cluster the question embeddings into n clusters and adaptively retrieve the question with the highest cosine similarity to Q from each cluster. Build Reasoning Chain For each sampled example, we sequentially concatenate the explanations from each hop, prefaced by "Step {i}:", to construct a reasoning chain.
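A toy sketch of the RetrieveInCluster procedure, with a minimal k-means in place of a library implementation and plain vectors standing in for Sentence-BERT embeddings:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means returning a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def retrieve_in_cluster(demo_embs, query_emb, n):
    """Cluster demo-question embeddings into n groups, then pick the demo
    with the highest cosine similarity to the query from each cluster
    (assumes every cluster ends up non-empty)."""
    labels = kmeans(demo_embs, n)
    unit = lambda v: v / np.linalg.norm(v, axis=-1, keepdims=True)
    sims = unit(demo_embs) @ unit(query_emb)
    picks = []
    for j in range(n):
        idx = np.where(labels == j)[0]
        picks.append(int(idx[np.argmax(sims[idx])]))
    return picks

# Four toy "question embeddings" forming two clusters, and a query.
demos = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
picks = retrieve_in_cluster(demos, np.array([1.0, 0.2]), n=2)
```

In real use, `demos` would come from `SentenceTransformer("all-mpnet-base-v2").encode(questions)` and `n` would be the demonstration budget (8 by default in this paper).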

Experiments
Our research questions (RQs) are: RQ1: To what extent can SP-CoT boost the LLMs on our four ODMR benchmarks, compared with other LLM-only methods?
RQ2: Is SP-CoT generally effective on recent popular instruction-following LLMs?
To this end, we conduct experiments on four MHQA datasets that require complex multi-step reasoning and compare different methods across different LLMs.

Benchmarks and Evaluation Metrics
We choose the following four MHQA datasets: CWebQ, HotpotQA, 2Wiki and MSQ. We set them as ODMR benchmarks by taking only the question and the answer from each example. Dataset introductions and statistics are detailed in Appendix A.
We adopt the exact match (EM) and F1 scores as our evaluation metrics. Based on the evaluation script of Karpukhin et al. (2020), we add a preprocessing step that ignores the content within "()" and splits the answer strings by certain delimiters to extract multiple answers.
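The preprocessing step can be sketched as follows. The normalization mirrors the standard ODQA evaluation recipe, and the exact delimiter set (',', ';', ' or ') is an assumption, since the paper does not enumerate it.

```python
import re

def normalize(s: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    s = s.lower()
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    s = re.sub(r"[^\w\s]", " ", s)
    return " ".join(s.split())

def gold_variants(answer: str):
    """Drop any '(...)' content, then split on delimiters to admit
    multiple reference answers."""
    answer = re.sub(r"\([^)]*\)", "", answer)
    return [v for v in re.split(r",|;| or ", answer) if v.strip()]

def exact_match(pred: str, gold: str) -> bool:
    return any(normalize(pred) == normalize(v) for v in gold_variants(gold))
```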

Experiment Settings
For reference, we experiment with fine-tuning methods using an extra corpus, which are fine-tuned on the training split of the NQ (Kwiatkowski et al., 2019) dataset, and most of them adopt the Wikipedia dump (Karpukhin et al., 2020) as extra corpus. We also test our implementation of the retrieval-based methods on the most recent LLMs for reference. Specifically, we use a fine-tuned DPR (Karpukhin et al., 2020) to retrieve the top-5 documents from Wikipedia as context and employ the LLM as a Reader to answer the question based on that context. Detailed prompt templates and parameter settings are provided in Appendix B.
Unless otherwise specified, we use Sentence-BERT (all-mpnet-base-v2) for question encoding following previous works.The default number of in-context demonstrations is 8 and the demonstrations are sampled by the maximum cosine similarity of questions in each cluster.
For RQ1, we adopt ChatGPT (gpt-3.5-turbo-0301) as the LLM to conduct the following experiments. According to OpenAI, gpt-3.5-turbo-0301 is an improvement over the InstructGPT text-davinci-003 model and performs at a similar capability level to text-davinci-003 for inference. We use the whole development set of each dataset in our experiments.

Experiment Results
The main results for RQ1 are shown in Table 1. Even with an extra corpus, the models fine-tuned on NQ (Kwiatkowski et al., 2019) present poor performance due to the inherent challenges of MHQA. With the same Retriever model, the performance of retrieval-based methods depends largely on the LLM Readers. Compared to previous LLM-only works, our SP-CoT significantly outperforms Auto-CoT by +6.2 EM and +6.4 F1 scores on average, and surpasses the previous SOTA method GENREAD (Yu et al., 2023) by +2.3 EM and +2.8 F1 scores on average. On the most challenging benchmark, MSQ, SP-CoT empowers ChatGPT to outperform other LLM-only methods by a decent margin.
We notice that SP-CoT significantly outperforms GENREAD on MSQ, confirming the effectiveness of providing high-quality CoTs as in-context demonstrations for complex multi-hop questions. On the other three datasets, SP-CoT delivers performance comparable to GENREAD. However, GENREAD relies heavily on the generation faithfulness of LLMs, which is challenging for small-scale LLMs. By breaking down demanding instructions into simple step-by-step ones, our method is more applicable to small-scale LLMs, which is validated by Table 2. Table 2 presents the results for RQ2. Our proposed SP-CoT proves to be generally effective, significantly boosting the performance of all four LLMs on all four benchmarks. With SP-CoT, the performance of small-scale (13B) LLMs can be boosted to be on par with directly prompting LLMs that are over 10× larger, not to mention the elicited high-quality intermediate reasoning steps.

Analysis
In this section, we explore the choice of sampling method and the number of demonstrations. Then we examine the quality of the intermediate reasoning steps elicited by SP-CoT and the quality of the self-generated data. Unless otherwise specified, we use ChatGPT (gpt-3.5-turbo-0301) to conduct the analysis on the same subsets mentioned in the RQ2 settings.

Methods of Demonstration Sampling
The performance of ICL depends largely on the quality of demonstration sampling. We test the effectiveness of the following five strategies: random sampling (Random), sampling globally by maximum cosine similarity (Retrieve), sampling the question closest to the centroid of each cluster (ClusterCenter), sampling by maximum cosine similarity within each cluster (RetrieveInCluster), and sampling the most similar QAs in each cluster of a certain reasoning type (RetrieveInTypeCluster). The reasoning type of the input question is determined by the most frequent reasoning type among its k-nearest neighbors. As indicated in Table 3, RetrieveInCluster (Li et al., 2022) is the best-performing strategy, which is exactly the strategy we adopted in the previous experiments.

Impact of Demonstration Amount
Providing more in-context demonstrations empirically improves ICL performance; however, it also increases the computational cost. To this end, we investigate the trade-off between the number of demonstrations and the resulting performance boost. We report the EM and F1 scores over the four benchmarks for 2, 4, 6, 8, and 10 in-context demonstrations, as well as the scores in a zero-shot setting. As illustrated in Figure 4, the performance of SP-CoT increases with the number of demonstrations between 2 and 8; however, using 10 demonstrations does not yield any further performance boost. In our main experiments, we opted for 8 as the default number of demonstrations, striking a balance between performance and cost.

Intermediate Reasoning Quality Analysis
Given the high-quality CoTs constructed by our proposed SP-CoT, we investigate the quality of the intermediate reasoning steps generated during inference. For this analysis, we use the development set of MSQ, as it is the most challenging of the four datasets and offers decomposed step-by-step QAs.
We compare the CoTs generated by Zero-shot-CoT, Auto-CoT and SP-CoT during inference. For fairness, we select 50 out of a total of 59 questions that all three methods answered correctly. First, we use GPT-4 to evaluate the intermediate reasoning steps in terms of clearness, conciseness, comprehensibility and directness, each on a scale of 1 to 10. Additionally, we compute the recall accuracy of intermediate answers co-occurring in the reasoning steps of each method. For fairness, we only report the intermediate answer recall accuracy of correctly answered questions for each method. As depicted in Figure 5, GPT-4 strongly favors our SP-CoT, which achieves nearly 50% recall accuracy for intermediate answers. This suggests that SP-CoT elicits high-quality reasoning steps in terms of clearness, conciseness, comprehensibility, and directness.

Conclusion
In this work, we harness the capabilities of LLMs combined with self-prompted CoTs to tackle the challenging MHQA task in an open-domain context, termed ODMR. Our SP-CoT not only surpasses preceding CoT prompting techniques but also outperforms the previous SOTA LLM-only methods in open-domain question-answering. A distinguishing feature of SP-CoT is its proficiency in eliciting high-quality intermediate reasoning steps, and its efficacy across both large- and small-scale LLMs. We anticipate that our self-generation pipeline for ODMR will not only be foundational for SP-CoT, but also pave the way for future research, catalyzing a shift towards leveraging self-generation in LLMs, by LLMs, and for LLMs.

Limitations
Our proposed method (SP-CoT) leverages the strong instruction-following power of LLMs. Such capability is easy to acquire through instruction fine-tuning for small-scale LLMs (even 7B); however, some LLMs proposed in earlier years may show poor capability in following human instructions due to a lack of corresponding training before their release. Therefore, the performance of such LLMs may not be boosted by our proposed SP-CoT. In fact, we did not succeed in boosting the performance of GPT-NeoX with any of Zero-shot-CoT, Auto-CoT or SP-CoT. GPT-NeoX is a 20B LLM released in early 2022 that shows poor instruction-following capability. Please note that neither GENREAD (Yu et al., 2023) nor Self-prompting (Li et al., 2022) boosts the performance of GPT-NeoX.
It is acknowledged that the efficacy of LLM-only approaches is predominantly reliant on the LLMs themselves. With smaller-scale LLMs, specifically those at the 13B scale, our SP-CoT and other CoT methodologies demonstrate comparable performance enhancement across the four ODMR benchmarks, as presented in Table 6. The performance of handcrafted CoTs is inconsistent across different LLMs and benchmarks; our empirical observations indicate that Manual-CoT sometimes outperforms SP-CoT and sometimes does not.
Given the potential for LLMs to generate imprecise information, the process by which SP-CoT produces datasets might also yield inaccurate QA pairs and erroneous explanations. Despite the incorporation of a double-check mechanism to ensure data integrity, certain errors and inaccuracies are inevitably present.

A Datasets
A.1 Introduction
HotpotQA (Yang et al., 2018) HotpotQA is a widely used dataset for multi-hop question-answering (MHQA) that contains 113k multi-hop questions in natural language. The questions are collected by crowdsourcing based on Wikipedia articles, with human-annotated supporting evidence and answers.
2WikiMultiHopQA (Ho et al., 2020) 2WikiMultiHopQA is a recently proposed large-scale MHQA dataset containing over 192k samples constructed jointly from Wikipedia and Wikidata.
MuSiQue-Ans (Trivedi et al., 2022) MuSiQue-Ans (MSQ) is a recent challenging MHQA dataset created via single-hop question composition. It includes 25k 2-4 hop questions with six different composition structures. Although MSQ is composed from existing datasets, it poses 3× the human-machine gap with a substantially lower disconnected reasoning score.
ComplexWebQuestions (Talmor and Berant, 2018) ComplexWebQuestions (CWebQ) is a manually generated MHQA dataset of 35k QA pairs, created by rephrasing machine-generated questions from an existing dataset.

A.2 Statistics
The statistics of the four datasets are shown in Table 4.

We follow the notations described in Section 3. The templates are:
1. Name {Number} {Topic}:
2. Generate a Wikipedia passage about {k_1}.
4. Passage about {k_2}:\n{p_2}\n\nQuestion:\n{q_2}\n\nAnswer:\n{a_2}\n\nFind the evidence from the passage that supports the answer, and rewrite it to make its meaning clear without the passage.
5. Passage about {k_1}:\n{p_1}\n\nQuestion:\n{q_1}\n\nAnswer:\n{a_1}\n\nFind the evidence from the passage that supports the answer, and rewrite it to make its meaning clear without the passage.

B.3 Binary Question Generation
• Question: {q_n}\nAnswer: {a_n}\nReform the question to a general interrogative question that can be answered with yes:
• Question: {q_n}\nAnswer: {a_n}\nReform the question to a general interrogative question that can be answered with no:

B.4 Multi-Hop Question Generation

• Raw question: {q_n}\nReplace the sentence within [] with a relative clause and make the raw question into a natural question:

B.5 CoT Construction
Suppose q* is the generated multi-hop question, e_i denotes the explanation from the intermediate hop (q_i, a_i, e_i), and a* is the answer of the last hop (a* = a_n).

B.6.2 SP-CoT
Suppose we have the input question Q and 2 demonstrations (q_1, r_1, a_1), (q_2, r_2, a_2), where q_i, r_i, a_i denote the question, CoT and answer of the i-th demonstration. The inference template is:

In most cases, the default temperature is set to 0 to obtain the most deterministic response. When we ask ChatGPT to name some terms or to generate a question, the temperature is set to 1.0 for more diversity.
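The assembly of demonstrations into an inference prompt can be sketched as follows; the function name and exact template wording here are illustrative, not the paper's verbatim template:

```python
def build_prompt(demos, question):
    """Concatenate k demonstrations (question, CoT, answer) followed by
    the test question, in the usual ICL prompt layout."""
    parts = []
    for q, r, a in demos:
        parts.append(f"Question: {q}\nAnswer: {r} So the answer is {a}.")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

# Hypothetical demonstration, only to show the layout.
demos = [
    ("Who wrote Hamlet?",
     "Hamlet is a tragedy written by William Shakespeare.",
     "William Shakespeare"),
]
prompt = build_prompt(demos, "Where was the author of Hamlet born?")
```

The model's completion after the trailing "Answer:" is then parsed for the final answer.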

D.1 Comparison of CoT Variants
To provide a more comprehensive picture of current CoT methods on ODQA, we report here (Table 5) the performance of additional CoT variants, including Manual-CoT (Wei et al., 2023) and Auto-CoT (Zhang et al., 2022), on ChatGPT (gpt-3.5-turbo-0301).
In our experiment, Manual-CoT (Cherry-Pick) adopts 8 cherry-picked questions and their CoTs manually written by the authors. The results of Manual-CoT (Random) report the mean EM scores of randomly selected questions and their manual CoTs over 2 experiments across the 4 benchmarks. Self-Consistency (Wang et al., 2023) is based on SP-CoT with 5 responses for each question.
To the best of our knowledge, Self-Consistency (Wang et al., 2023) is orthogonal to existing CoT methods, including SP-CoT. Although Self-Consistency boosts the performance of SP-CoT to a higher level (a 10%-30% increase over 5 runs), it is worth noting that the cost of Self-Consistency is also 5 times higher.
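The aggregation behind Self-Consistency is a majority vote over sampled answers, which can be sketched in a few lines (the function name is ours; ties are broken by first occurrence):

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over multiple sampled answers for one question.

    Counter.most_common breaks ties by insertion order, so the first-seen
    answer wins when counts are equal."""
    return Counter(answers).most_common(1)[0][0]

# Five hypothetical sampled responses for one question.
samples = ["April 1946", "1946", "April 1946", "April 1946", "June 1945"]
final = self_consistency(samples)  # "April 1946"
```

The 5x cost noted above comes directly from sampling 5 completions per question before this vote.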
In Table 6, we report the performance (EM) of CoT methods with recent popular LLMs on 1k subsets of the test sets. The scores are the average EM scores on the 4 ODMR benchmarks. Although Manual-CoT (Wei et al., 2023) outperforms automated methods, it requires high-quality human-labeled CoTs, which are not always accessible in real-world applications. Since the cherry-picked CoTs take the dataset features into consideration, we consider their results the theoretical upper limit of automated approaches. Compared to the previous automated SOTA method (Auto-CoT), our proposed SP-CoT shows a decent performance boost in most cases. We also report our experiment with CoT methods on a 1k subset of the test set provided by GrailQA (Gu et al., 2021). In accordance with our ODMR setting, no external knowledge is provided to the LLMs. From the results, we notice that our proposed SP-CoT is also effective on GrailQA; our results on InstructGPT (text-davinci-003) are presented in Table 8.

E.2 Topics
The 29 manually designed topics for generation are: politicians, athletes, sports teams, sports events, countries, cities, historical figures, historical events, wars, religions, singers, songs, actors or actresses, movies or TV series, writers, books, painters, paintings, composers, classical music, tourist attractions, scientists, scientific terms, video games, animals, plants, foods, enterprises, international organizations.

E.3 Quality Control
To ensure the self-generation quality, two mechanisms are included in our proposed method.
Self-validation in self-generation: To ensure the quality of generated QA pairs, we employ a double-check process, in which we ask the LLM to answer the generated question given the generated context and check whether the generated answer is consistent with the target answer.
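A minimal sketch of this double-check, with a stub standing in for the actual LLM call (the function name, prompt wording and exact-match comparison are illustrative assumptions):

```python
def double_check(llm, context, question, target_answer):
    """Ask the model to re-answer the generated question given the
    generated context; keep the QA pair only if the answers agree.
    Here agreement is case-insensitive exact match (an assumption)."""
    prompt = f"Passage: {context}\nQuestion: {question}\nAnswer:"
    predicted = llm(prompt).strip()
    return predicted.lower() == target_answer.strip().lower()

# A stub model standing in for the real LLM call.
stub = lambda prompt: "April 1946"
ok = double_check(stub, "...", "When did the ICJ begin its work?", "april 1946")
```

QA pairs for which this check fails are discarded from the generated dataset.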
Composability criteria in composition: Two single-hop QA pairs (q_1, a_1) and (q_2, a_2) are composable into a multi-hop question Q with a_2 as a valid answer if a_1 is a named entity and it is mentioned in q_2. These criteria are already satisfied when our 2-hop QA pairs are generated; we use them to connect more questions.
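These criteria translate directly into a simple predicate; this sketch assumes a precomputed set of named entities and uses plain substring matching for "mentioned in":

```python
def composable(qa1, qa2, named_entities):
    """(q1, a1) and (q2, a2) compose into a multi-hop question with a2 as
    the valid answer iff a1 is a named entity and a1 is mentioned in q2."""
    (q1, a1), (q2, a2) = qa1, qa2
    return a1 in named_entities and a1 in q2

# Hypothetical QA pairs, only to illustrate the check.
qa1 = ("Who wrote Hamlet?", "William Shakespeare")
qa2 = ("Where was William Shakespeare born?", "Stratford-upon-Avon")
result = composable(qa1, qa2, {"William Shakespeare"})
```

In practice a real pipeline would likely use NER and entity normalization rather than exact substring matching.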

E.4 Composition rather than Generation
Directly generating k-hop questions produces many highly duplicated reasoning chains, which is less effective than composition with 2-hop QA pairs.
Take a connected 3-hop QA chain as an example: suppose there are 2 valid next-hop QA pairs (Q_4, A_4) and (Q_5, A_5) for (Q_3, A_3). We then have 2 generated 4-hop reasoning chains that share the same first three hops and are therefore highly duplicated. When directly generating k-hop reasoning chains, the number of highly duplicated chains increases exponentially, and such chains are filtered out in the Duplication Control process (Stage 2, Step 2).
Furthermore, when there are more than 2 question hops in one reasoning chain, more effort must be made to ensure directed acyclic graphs (DAGs). An example of a cyclic reasoning chain is (Q_1, A_1) → (Q_2, A_2) → (Q_1, A_1), which should be avoided.
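The cycle check can be sketched as a guard applied when extending a chain with a next-hop QA pair (the helper name is ours):

```python
def extend_chain(chain, next_qa):
    """Append a next-hop QA pair only if it does not revisit an earlier hop,
    keeping the reasoning chain a simple path (no cycles)."""
    if next_qa in chain:
        return None  # would create a cycle such as Q1 -> Q2 -> Q1
    return chain + [next_qa]

chain = [("Q1", "A1"), ("Q2", "A2")]
rejected = extend_chain(chain, ("Q1", "A1"))   # cycle, rejected
accepted = extend_chain(chain, ("Q3", "A3"))   # valid extension
```

Because each chain is a simple path, acyclicity of the overall composition graph follows immediately.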

Figure 1 :
Figure 1: The overall framework of our proposed SP-CoT, including the automated generation of ODMR datasets, an adaptive sampler for CoT selection and self-prompted inference via ICL. Texts highlighted in purple refer to questions, in red to previously generated CoTs, in orange to answers, and in green to newly generated contents.

Figure 3 :
Figure 3: Generation steps for MHQA groups. In steps 3 and 4, we use 4 manually designed demonstrations for ICL. Each MHQA group includes a multi-hop question, the corresponding answer and decomposed QA quadruplets. Nodes and texts highlighted in red, blue and green successively refer to the last hops, intermediate hops and generated hops. Manually designed texts are highlighted in purple.

Figure 5 :
Figure 5: Evaluation results of the CoT generated by three methods. The first four scores are in terms of clearness, conciseness, comprehensibility (Comp.) and directness, given by GPT-4 on 50 examples. The recall accuracy of intermediate answers (IA Recall) is reported on the questions that are correctly answered by all 3 methods.
For a better understanding of the constructed ODMR datasets, we offer a well-designed figure (Figure 6) to illustrate the six types of generated questions and their step-by-step decomposition. The scale of the generated ODMR datasets is about 1-4k (Table 7); however, it largely depends on the self-generation setting (how many examples to generate) and the Duplication Control process in Stage 2, Step 2 (how many examples to keep). To be more specific, the number of topic terms for self-generation decides the scale of generated 2-hop question pairs, and the level of duplication (how many existing question hops are allowed when constructing a new reasoning chain) decides the scale of the remaining examples after filtering.
Example (binary question generation): Question: When did the International Court of Justice begin its work? Answer: April 1946. Reform the question to a general interrogative sentence that can be answered with yes: Did the International Court of Justice begin its work in April 1946?

Example (multi-hop question generation): Raw question: Is it the Mississippi River … [What is the American Gothic … about?] region in [What country did Nikola Tesla … in New York City?]? Replace the sentence within [] with a relative clause and … into a natural question:

For retrieval, top-5 documents are retrieved from Wikipedia as context, and the LLM is employed as a reader to answer the question based on the context. Methods based on ChatGPT are performed with the gpt-3.5-turbo-0301 version.

Table 2 :
The performance (EM) of our method on recent popular LLMs. We use text-davinci-003 for InstructGPT. On small-scale LLMs, SP-CoT nearly doubles the average zero-shot performance on the four MHQA benchmarks.

Table 3 :
The performance (EM) of different methods of demonstration selection.The results from random selection represent the mean value and standard deviation obtained from 3 runs, each with a different seed.

Table 4 :
Statistics of the development sets of the four MHQA benchmarks: average tokens for questions and answers, and the average number of reasoning steps. CWebQ does not provide supporting evidence or question decomposition.

Table 5 :
The performance (EM) of additional CoT variants. In our experiment, Manual-CoT (Cherry-Pick) adopts 8 cherry-picked questions and their CoTs manually written by the authors. The results of Manual-CoT (Random) report the mean EM scores of randomly selected questions and their manual CoTs over 2 experiments across the 4 benchmarks. Self-Consistency (Wang et al., 2023) is based on SP-CoT with 5 responses for each question.

Table 6 :
The performance (EM) of CoT methods with recent popular LLMs on 1k subsets of the test sets. We use gpt-3.5-turbo-0301 for ChatGPT and text-davinci-003 for InstructGPT. On smaller-scale (13B) LLMs, CoT methods achieve comparable performance boosts on the four MHQA benchmarks.

Table 7 :
The scale and average hops of each ODMR dataset generated in this paper. Please note that the scale and the average hops are largely decided by the self-generation setting and the duplication control process.

Table 8 :
Performance of CoT methods on the 1k subset of the test set provided by GrailQA.