Exploring Chain-of-Thought Style Prompting for Text-to-SQL

In-context learning with large language models (LLMs) has recently attracted increasing attention due to its superior few-shot performance on various tasks. However, its performance on text-to-SQL parsing still has much room for improvement. In this paper, we hypothesize that a crucial aspect of LLMs to improve for text-to-SQL parsing is their multi-step reasoning ability. Thus, we systematically study how to enhance LLMs' reasoning ability through chain-of-thought (CoT) style prompting, including the original chain-of-thought prompting (Wei et al., 2022b) and least-to-most prompting (Zhou et al., 2023). Our experiments demonstrate that iterative prompting as in Zhou et al. (2023) may be unnecessary for text-to-SQL parsing, and that using detailed reasoning steps tends to suffer more from error propagation. Based on these findings, we propose a new CoT-style prompting method for text-to-SQL parsing. It brings 5.2 and 6.5 point absolute gains on the Spider development set and the Spider Realistic set, respectively, over the standard prompting method without reasoning steps, and 2.4 and 1.5 point absolute gains over the least-to-most prompting method.


Introduction
Text-to-SQL parsing, the task of translating a natural language question into a SQL query, has found wide application in building natural language interfaces to databases and has thus attracted significant research interest in recent years (Wang et al., 2020; Deng et al., 2021; Yu et al., 2021; Rajkumar et al., 2022; Hongjin et al., 2023; Ni et al., 2023). To develop a text-to-SQL parser, a prevalent approach is to collect labeled data and train a model via supervised learning (Shaw et al., 2021; Scholak et al., 2021). While effective, this approach necessitates a considerable amount of training data, which is costly to obtain because annotating SQL queries requires programming expertise. Consequently, the lack of data hinders real-life applications of state-of-the-art parsers, especially on novel databases and unseen domains (Suhr et al., 2020).
As an alternative to supervised learning, in-context learning (Brown et al., 2020), an emergent capability of large language models (LLMs), alleviates the need for large-scale data. With only a few examples, in-context learning enables LLMs to demonstrate performance comparable to or even better than fully supervised models on many NLP tasks, such as question answering, machine translation, and natural language inference (Chowdhery et al., 2022; Kojima et al., 2022; Wei et al., 2022b,a; Brohan et al., 2023). When applied to text-to-SQL parsing, in-context learning also shows encouraging results, but it still lags behind supervised approaches (Rajkumar et al., 2022; Chang et al., 2023; Liu et al., 2023a).
We hypothesize that the under-performance is because text-to-SQL parsing requires complex, multi-step reasoning. Even for a seemingly simple question, such as "What is the ID of Kyle," a model has to ground it to database schemas, infer the relational algebra among schema items, and construct syntactically correct SQL clauses. Recently, chain-of-thought (CoT) style prompting methods (Wei et al., 2022b; Zhou et al., 2023) have been proposed and have shown promising multi-step reasoning capabilities. To enhance LLMs' reasoning ability, we systematically explore CoT-style prompting for text-to-SQL parsing. Specifically, we seek to answer two research questions: (1) Which prompting style is better: generating all reasoning steps in a single pass, or iterative prompting and problem solving? (2) Does including more detailed information in the reasoning steps lead to better results for text-to-SQL parsing?
To address these questions, we adopt two widely used prompting methods for text-to-SQL parsing. As the first method, we apply chain-of-thought prompting (Wei et al., 2022b) by drawing an analogy between its problem-solving process and the execution procedure of a SQL query. Referring to the logical execution order of SQL clauses (Narechania et al., 2021), we compose the intermediate execution steps in natural language and prompt LLMs to derive them before generating the SQL query. As the second method, we follow Zhou et al. (2023) to apply least-to-most prompting in two stages: (1) reduction: generate a series of sub-questions from the original question; and (2) solving: iteratively translate each sub-question into its corresponding SQL query, with the original question as the last sub-question. However, in our case study (Section 5.2), we find that directly applying chain-of-thought and least-to-most prompting leads to error propagation issues. Their rationales contain highly demonstration-specific information and are more likely to mislead the reasoning process. Furthermore, the least-to-most prompting technique incurs additional computational and time costs due to its multiple stages of reduction and solving.
Therefore, we propose a new method called question decomposition prompting (QDecomp). Similar to chain-of-thought prompting, QDecomp generates a sequence of reasoning steps and then the SQL query in one pass. However, we modify the steps to instruct LLMs to decompose the original complex question, akin to the problem reduction stage in least-to-most prompting. Also, to help LLMs ground database schemas, we design a variant of question decomposition prompting (QDecomp + InterCOL) that includes the table and column names involved in each sub-question. We conduct comprehensive evaluations on two text-to-SQL datasets, Spider (Yu et al., 2018) and Spider Realistic (Deng et al., 2021). Our proposed prompting methods substantially outperform existing ones, with 2.4 and 1.5 point absolute gains on the development set of Spider and Spider Realistic, respectively. The results suggest that iterative prompting, which is costly due to the additional computation it requires, as in least-to-most prompting, may not be necessary (RQ1). In addition, our analysis shows that the proposed question decomposition prompting methods, which do not instruct LLMs to generate detailed reasoning steps, reduce the chance of error propagation when generating the reasoning steps (RQ2). Finally, we evaluate the robustness of our proposed prompting methods by varying the number, selection, and format of in-context examples and show that they achieve consistently strong performance across different settings.

Related Work
Large Language Models and Prompting. As large language models (LLMs) advance (Brown et al., 2020; Chowdhery et al., 2022), in-context learning has emerged as a new paradigm in natural language processing (Liu et al., 2023b). Although LLMs can achieve outstanding performance when prompted with few-shot examples in context, they struggle with tasks that require complex reasoning. As a solution, Wei et al. (2022b) proposed chain-of-thought prompting. By explicitly describing intermediate reasoning steps toward answering a complex question in the prompt, chain-of-thought prompting improves the accuracy of LLMs by a large margin across several natural language reasoning tasks. However, chain-of-thought prompting has a key limitation: it often performs poorly on tasks that require generalizing to problems harder than the demonstration examples, such as compositional generalization (Zhou et al., 2023). Our work systematically explores chain-of-thought style prompting methods for the text-to-SQL parsing task. Additionally, we propose a new chain-of-thought style prompting method that guides LLMs to perform complex reasoning via question decomposition. We show that text-to-SQL parsing indeed requires multi-step reasoning, and chain-of-thought style prompting can help LLMs achieve higher parsing accuracy.
Question Decomposition. Question decomposition is a method that facilitates QA models by converting a complex problem into a sequence of simpler sub-questions (Gupta and Lewis, 2018; Min et al., 2019). In light of question decomposition, Zhou et al. (2023) proposed least-to-most prompting to solve complex problems with better compositional generalization in two stages. The method first prompts LLMs to generate a list of sub-questions as a decomposition of the given problem. Then, it uses the sub-questions to guide LLMs to incrementally solve each of them and derive a correct final answer. Our work is related to Wolfson et al. (2020, 2022), which apply question decomposition to text-to-SQL parsing, but we explore question decomposition for text-to-SQL parsing in the in-context learning setting and propose to leverage question decomposition as a novel chain-of-thought style prompting method. We conduct comprehensive experiments and show that our question decomposition prompting outperforms the two widely used methods, chain-of-thought prompting and least-to-most prompting, on several text-to-SQL datasets.
Text-to-SQL Semantic Parsing. Text-to-SQL semantic parsing has long been studied to build natural language interfaces for database applications (Dahl et al., 1994; Zelle and Mooney, 1996). Since the release of Spider (Yu et al., 2018), a cross-database text-to-SQL benchmark, many parsers have been developed on top of language models to better understand various database schemas (Wang et al., 2020; Yu et al., 2021; Deng et al., 2021). Recent work has started to explore the potential of LLMs, such as Codex (Chen et al., 2021), in text-to-SQL parsing by including database schemas in the prompts (Rajkumar et al., 2022), retrieving similar questions as few-shot examples (Hongjin et al., 2023), or reranking SQL parses with their execution results (Ni et al., 2023). Our work is parallel to these methods and extends this line by teaching LLMs to become better text-to-SQL parsers on their own, without additional engineering effort or new modules. With our question decomposition prompting, an LLM, such as Codex in our experiments, can effectively learn to decompose natural language questions and predict table and column names (Section 3) incrementally at each step with a few in-context examples.

Prompting for Multi-Step Reasoning in Text-to-SQL
In this section, we outline three prompting methods to guide an LLM to progressively derive a sequence of reasoning steps and then generate the target SQL query. We first describe how we adapt chain-of-thought and least-to-most prompting for text-to-SQL parsing. Then, we introduce a new prompting method, question decomposition prompting (QDecomp), and its variant (QDecomp + InterCOL). Figure 1 demonstrates the different prompting methods, and more examples are provided in Appendix A. For all experiments, we use Codex (Chen et al., 2021), code-davinci-002, as the LLM. The experiments were conducted between January and March 2023 through the OpenAI API, using greedy decoding with temperature 0.
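For reference, the sketch below shows how such a prompt could be submitted with greedy decoding, assuming the legacy (pre-1.0) OpenAI Python SDK; the API key placeholder, token budget, and stop sequence are illustrative rather than taken from our actual setup.

```python
import openai  # assumes the legacy (pre-1.0) OpenAI Python SDK

openai.api_key = "YOUR_API_KEY"  # placeholder


def complete_sql(prompt: str) -> str:
    """Query Codex (code-davinci-002) with greedy decoding (temperature 0)."""
    response = openai.Completion.create(
        model="code-davinci-002",
        prompt=prompt,
        temperature=0,        # greedy decoding, as in our experiments
        max_tokens=256,       # illustrative output budget
        stop=["###"],         # hypothetical delimiter separating examples
    )
    return response["choices"][0]["text"].strip()
```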

Chain-of-Thought Prompting
Chain-of-thought prompting (Wei et al., 2022b) aims to improve LLMs' reasoning ability by generating a coherent series of intermediate steps before predicting the final answer. For text-to-SQL parsing, one challenge is how to come up with intermediate reasoning steps. We are inspired by the logical execution process of SQL queries, as adopted in Narechania et al. (2021) to construct an interactive natural language interface. The SQL query in Figure 1(a), for example, has a logical execution order of FROM, followed by WHERE, and then SELECT.
Following the execution order, we put together a natural language description of all clauses as the intermediate reasoning steps for the in-context examples in CoT, as shown in Figure 1(a).
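For instance, one CoT demonstration for this example might look like the sketch below; the schema comment and the exact wording of the steps are illustrative rather than the verbatim demonstrations used in our prompts.

```python
# One chain-of-thought demonstration, with reasoning steps written in the
# logical execution order FROM -> WHERE -> SELECT. Schema and wording are
# illustrative, not the exact demonstrations from our prompts.
cot_example = """### SQLite SQL tables, with their properties:
# student (stuid, lname, fname, age, sex, major, advisor, city_code)
#
### Question: Show first name, last name, age for all female students. Their sex is F.
### Reasoning steps:
# 1. From the table student,            -- FROM clause
# 2. keep only rows where sex = 'F',    -- WHERE clause
# 3. and return fname, lname, and age.  -- SELECT clause
### The final SQL query is:
SELECT fname, lname, age FROM student WHERE sex = 'F'
"""
```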

Least-to-Most Prompting
Unlike chain-of-thought prompting, which instructs LLMs to generate all reasoning steps in a single pass, least-to-most prompting (Zhou et al., 2023) tackles complex questions by prompting LLMs in two stages: problem reduction and problem solving. During problem reduction, it prompts the LLM to generate a series of sub-questions from the original complex question. During problem solving, it prompts the LLM with one sub-question at a time and iteratively builds up the final solution.
When composing in-context examples, to derive the sub-questions for problem reduction, we segment the original question at sub-sentence boundaries or at conjunction words (such as "and," "or," and "but") and prepositions (such as "for," "with," and "without"). This segmentation allows the LLM to focus on parsing a subset of SQL clauses for each sub-question, thereby decreasing the complexity of the original problem (Wolfson et al., 2022). For instance, the question "Show first name, last name, age for all female students. Their sex is F." in Figure 1(b) would derive two sub-questions: (1) "Show first name, last name, age for all students." (2) "Show first name, last name, age for all female students. Their sex is F." For the first sub-question, the LLM only needs to construct the SELECT and FROM clauses, while for the second sub-question, it can focus solely on the WHERE clause, as shown in Figure 1(b).
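The two-stage procedure can be summarized as in the following sketch, where the prompt templates, function names, and the llm callable (e.g., a wrapper around a completion API) are our own simplifications rather than the exact prompts we use.

```python
from typing import Callable, List


def least_to_most(question: str, schema_prompt: str,
                  reduction_demos: str, solving_demos: str,
                  llm: Callable[[str], str]) -> str:
    """A simplified sketch of least-to-most prompting for text-to-SQL.

    Stage 1 (reduction): ask the LLM to list sub-questions of `question`.
    Stage 2 (solving): translate one sub-question at a time, appending earlier
    sub-question/SQL pairs to the prompt; the original question is solved last.
    """
    # Stage 1: problem reduction.
    reduction_prompt = (reduction_demos + schema_prompt
                        + f"Q: {question}\nDecompose the question:\n")
    sub_questions: List[str] = [q.strip() for q in llm(reduction_prompt).splitlines() if q.strip()]
    sub_questions.append(question)  # the original question is the last sub-question

    # Stage 2: iterative problem solving.
    history, sql = "", ""
    for sub_q in sub_questions:
        solving_prompt = solving_demos + schema_prompt + history + f"Q: {sub_q}\nSQL: "
        sql = llm(solving_prompt)
        history += f"Q: {sub_q}\nSQL: {sql}\n"
    return sql  # the SQL for the last sub-question answers the original question
```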

Question Decomposition Prompting
We propose a new prompting method, question decomposition prompting (QDecomp). Similar to chain-of-thought, QDecomp generates intermediate reasoning steps and the final SQL query in a single pass. Instead of using the logical execution procedure of SQL, we follow the problem reduction stage in least-to-most prompting and instruct the LLM to decompose the original complex question as the reasoning steps. Through this design, we hope to explore (1) the potential advantage of using question decomposition over the logical execution order of SQL clauses for composing reasoning steps; (2) whether an iterative process as in least-to-most prompting is necessary.
On top of that, we propose a variant, QDecomp + InterCOL, to address the table/column linking issue in text-to-SQL parsing (Wang et al., 2020). Specifically, we augment the in-context examples to prompt the LLM to identify any corresponding table/column names when generating each sub-question. Given a sub-question and its corresponding SQL parse, we annotate all table-column pairs mentioned in the parse as ground truth. For star operators (*), we sample a random column from the tables mentioned in the same (sub-)query. If a table-column pair has been mentioned in the SQL parse of a sub-question, we exclude it from the annotations of all subsequent steps. If a sub-question does not have any table-column pairs to annotate, we randomly choose one pair from the preceding steps.
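The de-duplication and back-off rules above can be sketched as follows; extracting the table-column pairs from each sub-question's SQL parse (including the column sampled for star operators) is assumed to happen upstream, and all names here are our own.

```python
import random
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (table, column)


def intercol_annotations(pairs_per_step: List[Set[Pair]]) -> List[Set[Pair]]:
    """Derive the table/column annotation for each sub-question, given the
    table-column pairs extracted from its SQL parse (a rough sketch)."""
    seen: Set[Pair] = set()
    annotations: List[Set[Pair]] = []
    for pairs in pairs_per_step:
        new_pairs = pairs - seen          # drop pairs already annotated earlier
        if not new_pairs and seen:        # nothing new: reuse one earlier pair
            new_pairs = {random.choice(sorted(seen))}
        annotations.append(new_pairs)
        seen |= pairs
    return annotations
```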
We include examples of these two variants in Figures 1(c) and 1(d). Following the same decomposition method as in least-to-most prompting, the example has two sub-questions. For the first sub-question, "Show first name, last name, age for all students," we expect the model to highlight the table "student" and its columns "fname," "lname," and "age," which correspond to the SQL parse of this sub-question, as shown in Figure 1(d). Then, for the follow-up question, the model is expected to identify the table "student" and its column "sex," which is not mentioned in the previous step.
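Putting these pieces together, a single QDecomp + InterCOL demonstration for this example might look like the sketch below, loosely following the prompt format shown in Appendix A; the schema comment and exact wording are illustrative.

```python
# One QDecomp + InterCOL demonstration: sub-questions with their table/column
# annotations, followed by the final SQL, all generated in a single pass.
# Schema and wording are illustrative.
qdecomp_intercol_example = """### SQLite SQL tables, with their properties:
# student (stuid, lname, fname, age, sex, major, advisor, city_code)
#
### Question: Show first name, last name, age for all female students. Their sex is F.
### Decompose the question:
1. Show first name, last name, age for all students.
   SQL table (column): student (fname, lname, age)
2. Show first name, last name, age for all female students. Their sex is F.
   SQL table (column): student (sex)
# Thus, the answer for the question is:
SELECT fname, lname, age FROM student WHERE sex = 'F'
"""
```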
In addition to the prompting methods mentioned above, we also include standard prompting for comparison. It serves as the baseline approach, in which question-SQL pairs are presented as in-context examples to guide the LLM to directly produce the corresponding SQL query for a given test question. In this paper, we conduct comprehensive experiments on the Spider development set to demonstrate the effectiveness of our question decomposition prompting methods.
Spider Realistic (Deng et al., 2021). Spider Realistic is a more challenging version of the Spider development set. The authors modify the natural language questions in Spider by removing or paraphrasing explicit mentions of column names to generate a more realistic dataset that reflects real-world scenarios, where questions rarely contain explicit mentions of column names. The final dataset comprises a total of 508 question-query pairs.

In-context Example Selection
To show the robustness of question decomposition prompting, we consider two ways of choosing in-context examples: random selection and difficulty-based selection. In our main results, we use random selection for its simplicity and ease of replication. Additionally, in Section 5.3, we compare results obtained using random selection with those obtained using difficulty-based selection.

Prompt Formats
We also experiment with two prompt formats introduced by Rajkumar et al. (2022): API Docs and Create Table + Select 3. Both formats have their own advantages and can be utilized together with any prompting method in Section 3.
The API Docs format is similar to the text-to-SQL parsing example in OpenAI's documentation. It represents a database schema as a Python-style API comment, which only includes the table and column names. This format reduces the prompt length for each example, so we may include more in-context demonstrations from different databases to increase domain diversity. In comparison, the Create Table + Select 3 format adheres more closely to the SQLite standards, but with considerably longer prompts. It represents a database schema using the CREATE TABLE command, which provides more information, such as column data types and foreign key declarations. Besides, this format includes the results of executing SELECT * FROM T LIMIT 3 for each table T in the database as SQL comments. With an ablation study (Section 5.3), we show that the API Docs format achieves competitive performance compared to the Create Table + Select 3 format. Thus, we primarily use the API Docs format in our experiments due to its efficiency.
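As a rough illustration, the sketch below serializes an SQLite database into both formats; the exact comment markers and layout of the official formats may differ, so this should be read as an approximation rather than the canonical implementation.

```python
import sqlite3


def api_docs_schema(db_path: str) -> str:
    """Serialize a schema in the API Docs style: one comment line per table
    listing its columns (an approximation of the format)."""
    con = sqlite3.connect(db_path)
    lines = ["### SQLite SQL tables, with their properties:", "#"]
    tables = [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type='table'").fetchall()]
    for t in tables:
        cols = [r[1] for r in con.execute(f'PRAGMA table_info("{t}")')]
        lines.append(f"# {t} ({', '.join(cols)})")
    lines.append("#")
    return "\n".join(lines)


def create_table_select3_schema(db_path: str) -> str:
    """Serialize a schema in the Create Table + Select 3 style: CREATE TABLE
    statements plus the first three rows of each table (an approximation)."""
    con = sqlite3.connect(db_path)
    chunks = []
    for name, ddl in con.execute(
            "SELECT name, sql FROM sqlite_master WHERE type='table'").fetchall():
        rows = con.execute(f'SELECT * FROM "{name}" LIMIT 3').fetchall()
        preview = "\n".join(f"-- {row}" for row in rows)
        chunks.append(f"{ddl};\n-- First 3 rows of {name}:\n{preview}")
    return "\n\n".join(chunks)
```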

Evaluation Metric
We use test-suite execution accuracy (Zhong et al., 2020) to evaluate different combinations of prompting methods, in-context example selection strategies, and prompt formats. Leveraging the idea of "code coverage" in software testing, the metric synthesizes a large number of databases as "test cases" and compares the execution results (denotations) of the predicted and gold SQL queries on all of them. In this way, test-suite accuracy reduces the number of false positives (different SQL queries that happen to have the same execution result) in standard execution accuracy, which compares denotations on only one database. As we can see in Table 1, standard prompting falls behind least-to-most prompting in test-suite accuracy. However, their standard execution accuracy results are very close, which can be misleading. Meanwhile, test-suite accuracy still maintains the lower false negative rate of standard execution accuracy compared with exact match accuracy.
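A stripped-down version of this comparison is sketched below; the official test-suite evaluation additionally handles row ordering, value normalization, and other query-equivalence details that we omit here, and the database paths are placeholders.

```python
import sqlite3
from typing import List, Optional


def denotation(db_path: str, sql: str) -> Optional[list]:
    """Execute a query on one database and return its rows in a canonical order."""
    try:
        with sqlite3.connect(db_path) as con:
            return sorted(con.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None  # an unexecutable prediction never matches


def test_suite_correct(pred_sql: str, gold_sql: str, test_dbs: List[str]) -> bool:
    """A simplified sketch of test-suite execution accuracy: the prediction counts
    as correct only if its denotation matches the gold query's on every
    synthesized test database."""
    return all(denotation(db, pred_sql) == denotation(db, gold_sql) for db in test_dbs)
```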

Results and Analysis
By analyzing our experimental results, we seek to answer the following two research questions:
• RQ1: Which prompting style is better: generating all reasoning steps in a single pass, or iterative prompting and solving?
• RQ2: Does including more detailed information in the reasoning steps lead to better results for text-to-SQL parsing?

Main Results
Through comprehensive experiments on Spider and Spider Realistic (Table 1), we demonstrate that our proposed question decomposition (QDecomp) prompting and its variant (QDecomp + InterCOL) consistently outperform two existing methods, chain-of-thought and least-to-most prompting. Specifically, QDecomp + InterCOL achieves 68.4% test-suite accuracy on the Spider development set and 56.5% on Spider Realistic, indicating its superiority in comparison to other prompting methods.
Additionally, the experiments show that iteratively solving a series of sub-questions may not be necessary for text-to-SQL parsing (RQ1). Although chain-of-thought prompting (56.8%) underperforms least-to-most prompting (66.0%) on the Spider development set, these two methods differ in several design choices besides iterative prompting. Comparing our QDecomp prompting with least-to-most prompting, however, we can see that generating sub-questions and the SQL query in a single pass can achieve even better performance. Thus, iterative prompting does not play an indispensable role in prompting LLMs to reason for text-to-SQL parsing.

Error Analysis
To answer RQ2, we conduct a quantitative error analysis of all four prompting methods with the component matching accuracy (Yu et al., 2018) on the Spider development set. Component matching accuracy is a fine-grained exact match metric that evaluates five SQL components, including SELECT clauses, WHERE clauses, GROUP BY clauses, ORDER BY clauses, and KEYWORDS (all SQL keywords, operators, and column names). Since exact match is too strict, we also consider a component to be correct if the whole SQL query's test-suite accuracy is 1.
As shown in Table 2, our QDecomp and QDecomp + InterCOL prompts achieve better performance than other CoT-style prompting methods across all five SQL components. Further analysis shows that a major reason why chain-of-thought prompting performs worse than prompting Codex without reasoning steps is error propagation. For example, in Table 3, Codex follows its reasoning steps faithfully to generate the corresponding SQL query, but the reasoning steps themselves contain several errors, such as choosing the "breed_name" column instead of the "name" column in the SELECT clause. Least-to-most prompting makes improvements by separating reduction from solving. However, Codex sometimes still cannot translate a sub-question into the correct SQL query, especially when hard components are involved, such as JOIN clauses (Table 4), GROUP BY clauses, and ORDER BY clauses (Table 2). As a result, the errors are propagated to subsequent reasoning steps, leading to an incorrect final SQL parse. In contrast, QDecomp + InterCOL prompting outperforms these two methods because it does not instruct Codex to generate detailed reasoning steps or intermediate SQL queries. In this way, it reduces the possibility of accumulating mistakes in reasoning steps, which can negatively impact the model's overall performance.

Robustness to Prompt Design
To further validate our conclusions in the main experiments, we conduct additional experiments to test the robustness of all four prompting methods in this section and Appendix B. For comparison, we select only standard and least-to-most prompting, as they have shown themselves to be competitive baseline approaches.

Conclusion
In this paper, we systematically explore CoT-style prompting to enhance LLMs' reasoning capability for text-to-SQL parsing. We adopt two existing methods, chain-of-thought and least-to-most prompting, and compare them with our proposed question decomposition prompting.

In addition to the experiments discussed in Section 5.3, Table 8 presents the performance of the model when different prompt formats are used to provide schema information. We use exemplars from G3, which are randomly selected from eight extra-hard exemplars. Due to restrictions on prompt length, we set the number of in-context exemplars to four. Surprisingly, we observe a significant improvement in performance when using the Create Table + Select 3 prompt format. These findings suggest that, when hard exemplars are used, the model benefits from the additional database content in Create Table + Select 3 prompting. One possible reason for this improvement is that hard exemplars contain more SQL clauses, which provide more information for the model to learn how to link phrases in the question to details of the database content provided by the Create Table + Select 3 prompt format.

Table 9 shows the performance of the model as we vary the number of in-context exemplars, again using exemplars from G3. As the number of in-context exemplars increases, we observe only a negligible improvement in performance for least-to-most prompting, and even worse performance for standard prompting. However, we found significant improvements in performance for QDecomp and QDecomp + InterCOL prompting, demonstrating the effectiveness of the proposed methods in handling challenging in-context examples.
For random selection, we uniformly sample in-context demonstrations from the Spider training set at random. For difficulty-based selection, we first group the Spider training examples into the four difficulty levels pre-defined by Yu et al. (2018): easy, medium, hard, and extra-hard. Then, we devise three methods to randomly select in-context examples based on their difficulty: (G1) sampling an equal number of examples at each difficulty level, (G2) sampling the same number of examples from the hard level and the extra-hard level respectively, and (G3) sampling all examples from the extra-hard level. For example, for 8-shot learning, we have two examples at each difficulty level for G1, four examples each from the hard level and the extra-hard level for G2, and eight examples from the extra-hard level for G3.
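A sketch of the three selection schemes is given below; the difficulty-level keys and the assumption that the shot count divides evenly among the levels are our simplifications.

```python
import random
from typing import Dict, List


def select_examples(by_difficulty: Dict[str, List[dict]], k: int, scheme: str) -> List[dict]:
    """Sample k in-context examples under one of the three schemes (G1/G2/G3).

    `by_difficulty` maps a Spider difficulty level ('easy', 'medium', 'hard',
    'extra') to its training examples; we assume k is divisible by the number
    of levels used by the scheme.
    """
    levels = {
        "G1": ["easy", "medium", "hard", "extra"],  # equal share from every level
        "G2": ["hard", "extra"],                    # equal share from hard and extra-hard
        "G3": ["extra"],                            # all from the extra-hard level
    }[scheme]
    per_level = k // len(levels)
    selected: List[dict] = []
    for level in levels:
        selected.extend(random.sample(by_difficulty[level], per_level))
    return selected
```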

Table 1:
8-shot test-suite accuracy of Codex on Spider and Spider Realistic using different prompting methods and the API Docs format. We also include standard execution accuracy (EX) in parentheses as a reference. The best performances are in bold and the second best are underlined. For each method, we repeat the experiments with 5 different sets of in-context examples and report the average performance with standard deviation.

Table 4:
An illustrative example of an error made by Codex with least-to-most prompting.
with API docs, there is only a negligible improvement in performance for standard prompting; it also results in a slight decrease in the accuracy of Codex using QDecomp and QDecomp + InterCOL prompting methods. Therefore, we use API docs as our default format and leave further experiments as future work.

Table 5:
Test-suite accuracy of Codex on the Spider development set using different numbers of in-context examples. We do not have 0-shot results for the proposed methods, as they need at least one support example to learn to solve the task step by step.

Table 8:
The impact of different prompt format designs for providing schema information on test-suite accuracy on Spider, where the number of in-context exemplars is set to 4 and the exemplars are drawn from G3.

Figure 3: An example of the Create Table + Select 3 prompt format, introduced by Rajkumar et al. (2022), on Spider.

Figure 4: An example prompt under standard prompting with the API docs format for 2-shot on Spider.

Figure 7: An example prompt under least-to-most + API docs prompting (problem solving) for 1-shot on Spider.

Figure 9: An example prompt under QDecomp + InterCOL + API docs prompting for 1-shot on Spider.

Table 9:
Test-suite accuracy with varying numbers of support examples, where the exemplars from G3 are used.