ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought

Large Language Models (LLMs) have recently been proven to have strong abilities in various domains and tasks. We study the problem of prompt design in the text-to-SQL task and attempt to improve the LLMs' reasoning ability when generating SQL queries. Besides the trivial few-shot in-context learning setting, we design our chain-of-thought (CoT) prompt with a method similar to schema linking. We provide a method named ACT-SQL to automatically generate auto-CoT exemplars, so that the whole process needs no manual labeling. Our approach is also cost-saving, since we call the LLM's API only once when generating one SQL query. Furthermore, we extend our in-context learning method to the multi-turn text-to-SQL task. The experiment results show that the LLMs' performance can benefit from our ACT-SQL approach. Our approach achieves SOTA performance on the Spider dev set among existing in-context learning approaches.


Introduction
The text-to-SQL task (Zhong et al., 2017; Xu et al., 2017) aims to translate a natural language question into the corresponding SQL query given the database schema. It is the key technique for establishing a natural language interface over relational databases, which can help common users access data from relational databases in a more convenient way.
Recent studies in text-to-SQL research have primarily centered on the development of semantic parsers within the framework of cross-domain analysis. In cross-domain text-to-SQL datasets such as Spider (Yu et al., 2018), SParC (Yu et al., 2019b), and CoSQL (Yu et al., 2019a), the databases employed in the train set, dev set, and test set do not overlap. Prior research endeavors have focused on training specialized text-to-SQL models and optimizing their structural components to enhance overall performance. Notably, these efforts have yielded impressive model performances across various datasets. Nevertheless, the construction of such models necessitates a substantial number of high-quality training examples and entails significant time investments for finetuning. Moreover, these models often possess intricate structures, rendering their deployment challenging.
Recent research has provided empirical evidence establishing the substantial capabilities of Large Language Models (LLMs), such as GPT-3 (Brown et al., 2020) and ChatGPT (Ouyang et al., 2022), across a wide spectrum of domains and tasks. As the scale of LLMs continues to expand, scholarly investigations have revealed the presence of emergent abilities (Wei et al., 2022) exclusive to larger LLMs and absent in their smaller counterparts. Therefore, the latest studies employ LLMs for the text-to-SQL task, utilizing the in-context learning method (Brown et al., 2020). Owing to the impressive performance demonstrated by LLMs in zero-shot or few-shot prompting scenarios, extensive finetuning on an abundance of training examples is rendered unnecessary. Consequently, the integration of LLMs into the text-to-SQL process yields notable time and cost savings.
Nonetheless, contemporary in-context learning approaches for text-to-SQL encounter certain challenges. For instance, Rajkumar et al. (2022) employ a simplistic prompt design that yields relatively subpar performance compared to SOTA finetuned models. Conversely, Pourreza and Rafiei (2023) employ a convoluted workflow to generate the final SQL query, achieving SOTA performance on the test set of the Spider dataset. However, this approach proves time-consuming and resource-intensive, as it necessitates multiple API calls to LLMs during the query generation process. Moreover, the recent advancements in in-context learning methods for text-to-SQL have yet to be extended to multi-turn datasets, such as SParC, CoSQL, and DIR (Li et al., 2023b).
Despite the proficiency of LLMs as zero-shot and few-shot learners, superficial prompt design alone fails to fully activate their capabilities. To address this limitation, Wei et al. (2023) propose a novel prompting technique called chain-of-thought (CoT). With the CoT method, the prompt text encompasses a comprehensive thinking process that guides LLMs towards accurate deduction of answers. Notably, the CoT method mirrors the sequential nature of human reasoning, wherein intermediate answers are obtained before arriving at a final conclusion. Given the intricate nature of the text-to-SQL task, the CoT method proves highly suitable, as generating the SQL query entails complex reasoning processes. However, existing CoT methodologies necessitate extensive time investments in the selection of canonical examples and manual labeling. The text-to-SQL task lacks an automated approach for generating CoT sequences.
In this paper, we propose our in-context learning method for the text-to-SQL task with automatically generated CoT. First, under the zero-shot setting, we study the influence of the input format of the database schema on LLMs' performance. Second, under the few-shot setting, we provide a hybrid strategy to select exemplars and study the influence of the number of exemplars on LLMs' performance. Our experiment results show that the strategy is effective. Third, we present our approach named ACT-SQL to generate auto-CoT for a dataset training example consisting of the database schema, the question, and the corresponding SQL query. The experiment results show that the generated auto-CoT can indeed improve the LLMs' performance. ACT-SQL achieves SOTA performance on the Spider dev set among existing in-context learning methods. In addition, ACT-SQL does not need extra LLMs' API calls, which means that our workflow is relatively fast and cheap. Finally, we apply our approach to multi-turn text-to-SQL datasets including SParC and CoSQL and achieve accuracy scores comparable to finetuned models. Our main contributions can be summarized as follows:

1. We explore the influence of different prompting styles and few-shot exemplar selection strategies on LLMs' performance in the text-to-SQL task.
2. We propose our approach named ACT-SQL to generate auto-CoT prompts. ACT-SQL achieves SOTA performance on the Spider dev set among existing in-context learning methods. Furthermore, our automatic method is cost- and time-saving and does not need extra LLMs' API calls.
3. We extend our method to the multi-turn text-to-SQL task and achieve performances comparable to finetuned models on the SParC and CoSQL datasets.

Related Work
Text-to-SQL models Over the past several years, text-to-SQL research has mainly focused on building well-designed deep neural networks (Chen et al., 2021b; Cao et al., 2023). The RATSQL model (Wang et al., 2020) and the LGESQL model (Cao et al., 2021) are AST-based approaches, where AST is the abbreviation of abstract syntax tree. They encode the input and decode the AST of the SQL query with a predefined grammar. AST-based approaches perform well but are generally complex to deploy. PICARD (Scholak et al., 2021) is a sequence-to-sequence model. SQL is a formal language that follows strict grammar rules, so directly finetuning pretrained language models (PLMs) on text-to-SQL datasets makes them likely to generate invalid SQL queries. The PICARD model rejects invalid tokens at each decoding step and constrains the generated results to a certain output space.
Although these specialized models have achieved excellent performances, some inevitable disadvantages still exist. In order to train a text-to-SQL model, abundant high-quality training examples are needed. Constructing and labeling a large-scale text-to-SQL dataset is never easy and consumes a lot of resources and time.
Training and finetuning such models is also a demanding undertaking that costs substantial computing resources.
In-context learning for text-to-SQL Since LLMs have shown amazing abilities across various domains and have been applied in many academic and industrial fields, the latest research has begun to activate the LLMs' ability for the text-to-SQL task. Rajkumar et al. (2022) use the trivial zero-shot and few-shot learning settings and perform an empirical evaluation of the text-to-SQL capabilities of LLMs including GPT-3 (Brown et al., 2020) and Codex (Chen et al., 2021a). They perform zero-shot prompt learning on Spider (Yu et al., 2018), a large-scale human-labeled cross-domain text-to-SQL dataset. Their work is relatively simple and the performance falls behind finetuned models. Nan et al. (2023) mainly concentrate on the strategy of exemplar selection. Their work achieves good performance on several cross-domain datasets including Spider, Spider-Syn (Gan et al., 2021a), Spider-DK (Gan et al., 2021b), and Spider-Realistic (Deng et al., 2021). However, their work requires an extra preliminary predictor to evaluate the SQL's difficulty level and needs to call the LLMs' API many times due to the majority vote method.
DIN-SQL (Pourreza and Rafiei, 2023) provides a relatively complex approach. DIN-SQL consists of a complex workflow that decomposes the problem into several simpler sub-problems. With the LLM GPT-4, DIN-SQL has surpassed previous finetuned models and has achieved the best score on the Spider dataset. But DIN-SQL's workflow is obviously slow and expensive, since it calls the LLMs' API many times to generate one SQL query.

Methodology
With the in-context learning method, the SQL generation process can be formulated as

S = LLM(I, D, Q, E),

where I represents the instruction, D represents the database schema, Q represents the question, and E = [(D_1, Q_1, P_1), ..., (D_n, Q_n, P_n)] is the list of exemplars, where P_i is the answer prompt containing the correct SQL query for the i-th exemplar. Thus the performance of LLMs is mainly influenced by the database prompt style, the exemplar selection strategy, and the exemplar prompt design.
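As a concrete illustration, the formulation above can be sketched as plain prompt assembly. The helper name and the exact textual layout below are our own assumptions for illustration, not the paper's released code:

```python
def build_prompt(instruction, db_schema, question, exemplars):
    """Assemble the in-context learning prompt for S = LLM(I, D, Q, E).

    exemplars is a list of (db_schema_i, question_i, answer_prompt_i)
    triples, where the answer prompt contains the gold SQL query (and,
    with ACT-SQL, the chain-of-thought text as well).
    """
    parts = [instruction]
    for db_i, q_i, p_i in exemplars:
        parts.append(f"{db_i}\nQuestion: {q_i}\n{p_i}")
    # The test case itself comes last, leaving the SQL for the LLM to fill in.
    parts.append(f"{db_schema}\nQuestion: {question}")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Translate the question into a SQL query.",
    "# singer(Singer_ID, Name, Citizenship)",
    "How many singers do we have?",
    [("# concert(concert_ID, Year)",
      "How many concerts are there?",
      "SELECT count(*) FROM concert")],
)
```

The resulting string is sent to the LLM in a single API call, which is what keeps the approach cheap.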
In this section, we first describe the prompt styles of the database schema. Then we state our strategy of exemplar selection for the few-shot learning setting. Furthermore, we introduce our ACT-SQL approach, i.e. the automatically generated CoT method for constructing effective answer prompts. Finally, we extend our approach to the multi-turn text-to-SQL task.

Database Prompt Style
Previous works have shown that, given the database schema, strong LLMs (e.g. GPT models) can translate relatively simple natural language questions into correct SQL queries even when no exemplar is provided. Under the zero-shot setting, the LLMs merely take the database schema and the question as the input. Thus the input format of the database schema mainly influences the LLMs' performance. Generally, we use five different database schema styles, namely Table(Column), Table(Column)(PF), Create(NoPF), Create(EoC), and Create(EoT), which are shown in Appendix C.1. Furthermore, database contents are considered: specifically, c example rows are appended to each table. Appendix C.2 shows instances where c = 3.
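For illustration, the Table(Column) style with c example rows appended per table could be rendered along the following lines. This is a sketch under our own assumptions about the exact layout; the formats actually used are shown in Appendix C:

```python
def format_table_column(schema, contents, c=3):
    """Render a database in the Table(Column) prompt style and append
    up to c example rows per table as database content."""
    lines = []
    for table, columns in schema.items():
        # One line per table: "# table(col1, col2, ...)"
        lines.append(f"# {table}({', '.join(columns)})")
        # Append at most c example rows as database content.
        for row in contents.get(table, [])[:c]:
            lines.append("# " + " | ".join(str(v) for v in row))
    return "\n".join(lines)

schema = {"singer": ["Singer_ID", "Name", "Citizenship"]}
contents = {"singer": [(1, "Joe Sharp", "France"),
                       (2, "Timbaland", "United States")]}
print(format_table_column(schema, contents, c=3))
```

Varying c here corresponds to the rows-of-database-contents experiments reported later.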

Exemplar Selection
Given a few exemplars, LLMs can benefit and acquire tips from them and thus generate SQL queries with a more standard format and higher accuracy. Exemplar selection is therefore important under the few-shot setting and can influence the LLMs' performance considerably.
We select exemplars using a hybrid strategy. Specifically, we first select n_s examples from the training dataset at random. These dataset examples are named static exemplars and are used in the context of every test case. For each specific test case, we then select n_d extra examples from the training dataset. These dataset examples are named dynamic exemplars, since they are selected according to features of the current test case. Consequently, there are n_s + n_d exemplars in total for each test case.
In order to get dynamic exemplars that are more relevant to the current test case, we compare the natural language question of the current test case with all questions in the training dataset. We calculate the similarity scores with a suitable pretrained model and then select the top-n_d training dataset examples. We believe that dynamic exemplars with more relevant questions provide more effective information to the LLMs.
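The dynamic selection step can be sketched as follows. The paper computes similarity with a pretrained embedding model; here we substitute `difflib`'s string-overlap ratio as a lightweight stand-in, so the scoring function (not the selection logic) is our own assumption:

```python
from difflib import SequenceMatcher

def select_dynamic_exemplars(test_question, train_examples, n_d=2):
    """Pick the n_d training examples whose questions are most similar
    to the test question (string overlap stands in for the paper's
    pretrained-model similarity)."""
    scored = sorted(
        train_examples,
        key=lambda ex: SequenceMatcher(None, test_question.lower(),
                                       ex["question"].lower()).ratio(),
        reverse=True,
    )
    return scored[:n_d]

train_set = [
    {"question": "How many singers do we have?",
     "sql": "SELECT count(*) FROM singer"},
    {"question": "Give the flight numbers of flights leaving from Aberdeen.",
     "sql": "SELECT FlightNo FROM flights"},
]
```

These n_d examples are then concatenated with the n_s randomly chosen static exemplars to form the few-shot context.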

Chain-of-Thought Prompt Design
Under the few-shot learning setting, it has been proven that the LLMs' performance can benefit a lot from the chain-of-thought (CoT) method (Wei et al., 2023). In the text-to-SQL task, only the database schema, the question, and the corresponding SQL query are provided in the prompt under the trivial few-shot learning setting. With the CoT method, the thought process of how to write the correct SQL query is added to the prompt. These prompting texts can help the LLMs think step by step when generating the complete SQL query and thus can activate the logical reasoning ability of the LLMs.
In previous works, some grammar-based text-to-SQL models utilize the graph encoding technique to jointly encode both the database schema and the question. Schema linking (Bogin et al., 2019; Wang et al., 2020; Cao et al., 2021) is a commonly used algorithm for building the input graph. If the question tokens exactly or partially match some schema item (i.e. table and column) names, then they are linked with a specific graph edge. It is obvious that the schema linking method can help text-to-SQL models fetch the most relevant tables and columns among a great number of schema items based on the question. We design our chain-of-thought prompt with a similar method to schema linking. Figure 1 shows an instance of the manually labeled CoT for an example from the train set of the Spider dataset (Yu et al., 2018). As suggested in Kojima et al. (2023), the CoT prompt starts with "Let's think step by step". For each slice of the question sentence that may contain some information about a schema item, we add it to the CoT prompting text in the format shown in Figure 1. Furthermore, the values mentioned in the question and the SQL query are also considered. The final SQL query is appended at the end of the CoT prompt.
Auto-CoT Although CoT prompts can be manually labeled, it costs a lot of time to find sufficient canonical and effective training dataset examples for CoT labeling. In addition, manually labeled CoT exemplars are fixed, which means that they are all static exemplars and dynamic exemplars are deficient. In order to deal with these problems in the manual labeling process, we introduce an automatic method to generate auto-CoT prompts for every example in the training dataset.
Given the question q = (q_1, q_2, ..., q_{|q|}) and the SQL query s, q_i represents the i-th token in the question sentence. We define q_{i,j} = (q_i, q_{i+1}, ..., q_j) as a slice of the original question. We first enumerate each column [tab].[col] appearing in the SQL query, where [tab] represents the table name and [col] represents the column name. For each column, we use a suitable pretrained model to compute the similarity scores between the current column and all the question sentence slices. The most relevant slice is argmax_{q_{i,j}} Sim([tab].[col], q_{i,j}), where Sim is the similarity function. We link the column and its most relevant slice and add them to the auto-CoT prompt in the same format as the manually labeled CoT prompt. Note that during this process, we ignore columns appearing in the GROUP BY clause of the SQL query, since the GROUP BY column is commonly not mentioned directly in the question.
Secondly, we enumerate each table [tab] appearing in the SQL query, where [tab] represents the table name. In this process, we eliminate tables that have already occurred in the columns, since those tables have been added to the auto-CoT prompt. The remaining tables only appear in the FROM clause and indicate some extra information. For each table, we also compute all the similarity scores and find the most relevant question slice, i.e., argmax_{q_{i,j}} Sim([tab], q_{i,j}).
We link the table and its most relevant slice and add them to the auto-CoT.
Finally, we enumerate the values in the SQL query and then add them to the auto-CoT. Figure 2 shows an instance of the auto-generated CoT from the train set of the Spider dataset.
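The column-linking step of the construction above can be sketched as follows. This is a simplified sketch: a string-overlap ratio stands in for the pretrained similarity model, a naive regex pulls [tab].[col] pairs out of the SQL, and the CoT line format is modeled loosely on Figure 2 rather than copied from the paper's code:

```python
import re
from difflib import SequenceMatcher

def question_slices(question, max_len=4):
    """Enumerate all slices q_{i,j} of the question up to max_len tokens."""
    toks = question.split()
    return [" ".join(toks[i:j])
            for i in range(len(toks))
            for j in range(i + 1, min(i + max_len, len(toks)) + 1)]

def best_slice(name, slices):
    # argmax over q_{i,j} of Sim(name, q_{i,j})
    return max(slices,
               key=lambda s: SequenceMatcher(None, name.lower(),
                                             s.lower()).ratio())

def auto_cot(question, sql):
    """Build a chain-of-thought prompt linking SQL columns to question slices."""
    slices = question_slices(question)
    lines = []
    # Columns that only occur in GROUP BY are skipped, as in the paper.
    group_by = set(re.findall(r"GROUP BY\s+([\w.]+)", sql, re.I))
    seen = set()
    for col in re.findall(r"\b(\w+\.\w+)\b", sql):
        if col in group_by or col in seen:
            continue
        seen.add(col)
        s = best_slice(col.split(".")[1], slices)
        lines.append(f'According to "{s}", columns [{col}] may be used.')
    lines.append(f"So the final answer is: {sql}")
    return "\n".join(lines)
```

Because this runs entirely offline over the train set, generating auto-CoT exemplars adds no extra LLM API calls.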

Extension for Multi-turn Text-to-SQL
The prompts described in the previous sections are designed for the single-turn text-to-SQL task. However, questions in the multi-turn text-to-SQL task are context-dependent and thus those prompts cannot be directly used. Moreover, the auto-CoT method is also inapplicable under the multi-turn setting, since it finds schema linking information based on question slices, and under the multi-turn setting this information may be distributed across several context-dependent sentences.
In order to deal with the challenge of the multi-turn text-to-SQL task, we use LLMs to convert the multi-turn task into a single-turn one. Concretely, with the help of the LLMs, we rewrite the question sentences and remove the context dependency among them. Each rewritten question and its corresponding SQL query thus turn into a new independent dataset example. We then directly apply the previous in-context learning method to the converted dataset.
The quality of the rewritten questions considerably influences the LLMs' performance. It is necessary to manually label some rewriting exemplars in order to fix the format and improve the quality of the LLMs' outputs. For each multi-turn text-to-SQL dataset, we select 10 examples from the train set at random and manually label the rewritten results.
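A rough sketch of how the rewriting prompt could be assembled is shown below. The prompt wording and the exemplar layout are our own assumptions; the paper only specifies that 10 manually rewritten interactions per dataset are used as exemplars:

```python
def build_rewrite_prompt(labeled_exemplars, interaction):
    """Ask the LLM to rewrite context-dependent questions in a
    multi-turn interaction into self-contained single-turn questions.

    labeled_exemplars: list of (original_turns, rewritten_turns) pairs.
    interaction: list of the current context-dependent questions.
    """
    parts = ["Rewrite each question so that it can be understood "
             "without the previous turns."]
    for orig, rewritten in labeled_exemplars:
        parts.append("Original: " + " | ".join(orig))
        parts.append("Rewritten: " + " | ".join(rewritten))
    # The current interaction is appended last for the LLM to complete.
    parts.append("Original: " + " | ".join(interaction))
    parts.append("Rewritten:")
    return "\n".join(parts)

rewrite_exemplars = [(
    ["Show all airlines.", "Which one has the most flights?"],
    ["Show all airlines.", "Which airline has the most flights?"],
)]
```

Each rewritten turn, paired with its gold SQL, then becomes an independent single-turn example.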

Experiment Setup
Models We mainly use the GPT-3.5-turbo model to evaluate our proposed approach. The GPT-3.5-turbo model is a low-cost LLM and is large enough to exhibit the emergent ability (Wei et al., 2022) needed for handling the text-to-SQL task. In addition, we use the GPT-4 model to evaluate our auto-CoT method on the Spider dataset (Yu et al., 2018), since the GPT-4 model has a stronger reasoning ability but is much more expensive. We use the PLM text2vec-base-chinese to compute the similarity scores when selecting dynamic exemplars and generating auto-CoT prompts.
Hyperparameters The temperature in the LLMs' API is set to 0, i.e. the greedy decoding strategy is applied. The text-to-SQL task requires the model to generate SQL queries with strict grammar rules, and the LLMs are likely to generate invalid SQL queries, or SQL queries that are not relevant to the given questions, if the temperature is too high. The number of max tokens is set to 150 for the trivial in-context learning setting and 750 when using the CoT method.
Datasets We mainly evaluate our proposed approach on Spider, a large-scale human-labeled cross-domain text-to-SQL dataset across 200 databases covering 138 domains. The Spider dataset contains 8,659 examples in the train set and 1,034 examples in the dev set. It also provides an evaluation script that divides SQL queries into four categories (i.e. easy, medium, hard, and extra) according to the difficulty level. The test set of Spider is not publicly available, so we conduct the experiments on the dev set.
In addition, we also conduct the in-context learning experiments on Spider-Syn (Gan et al., 2021a), Spider-DK (Gan et al., 2021b), and Spider-Realistic (Deng et al., 2021). Based on Spider, Spider-Syn replaces some schema-related tokens in the question with synonyms, which makes models unable to discover useful schema items with simple string matching. Spider-DK defines five types of domain knowledge and modifies some examples by adding domain knowledge that reflects real-world question paraphrases; it evaluates the models' generalization ability across domains when domain knowledge does not frequently appear in the train set. Spider-Realistic removes explicit mentions of column names to evaluate the model's ability to capture text-table alignment.
As for multi-turn text-to-SQL datasets, we conduct our experiments on SParC (Yu et al., 2019b) and CoSQL (Yu et al., 2019a). SParC consists of 4,298 coherent question sequences including 12k+ individual questions and the corresponding SQL queries. CoSQL contains 10k+ annotated SQL queries. Each dialogue in CoSQL simulates a real-world scenario where a common user is exploring the database and an expert is retrieving answers with SQL.
Evaluation metrics We use three commonly used evaluation metrics of the text-to-SQL task: exact match accuracy (EM), execution accuracy (EX), and test-suite accuracy (TS). The EM metric requires each component of the predicted SQL to be equivalent to the corresponding component of the gold SQL; values in the SQL query are not considered by the EM metric. The EX metric requires the execution result of the predicted SQL to be correct. Since different SQL queries may express the same semantics, the EX metric is commonly more precise than the EM metric. The TS metric also evaluates the execution result but requires the result to be correct under multiple database instances per database schema.
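For instance, the EX check can be approximated with an in-memory SQLite database, as in the minimal sketch below. The official Spider evaluation script and the test-suite databases used for TS are considerably more involved; the table and data here are illustrative only:

```python
import sqlite3

def execution_match(pred_sql, gold_sql, setup_sql):
    """Return True if the two queries produce the same result set on a
    freshly built in-memory database (an approximation of the EX metric)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)
    try:
        pred = sorted(conn.execute(pred_sql).fetchall())
        gold = sorted(conn.execute(gold_sql).fetchall())
        return pred == gold
    except sqlite3.Error:
        # Invalid predicted SQL simply counts as incorrect.
        return False
    finally:
        conn.close()

setup = """
CREATE TABLE singer (Singer_ID INTEGER, Name TEXT, Citizenship TEXT);
INSERT INTO singer VALUES
    (1, 'Joe Sharp', 'France'),
    (2, 'Timbaland', 'United States');
"""
```

Running the same comparison over several database instances per schema is what turns this EX-style check into the stricter TS metric.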
For multi-turn text-to-SQL datasets, we evaluate our approach with question match accuracy (QM) and interaction match accuracy (IM). The QM score is 1 if the predicted SQL query for a single question is correct. The IM score is 1 if all the predicted SQL queries in the interaction are correct.

Zero-shot Results

As discussed in Section 3.1, the LLMs' performance under the zero-shot learning setting is mainly influenced by the database prompt style and the rows of database contents. We first conduct experiments studying the influence of the rows of database contents on the LLMs' performance. We fix the LLM as the GPT-3.5-turbo model and the database style as Table(Column), and only change the rows of database contents for each table in the prompt. Figure 3 shows the result on the Spider dev set. The LLM achieves the lowest score when no database content is provided. This indicates that database contents can provide useful tips for LLMs, especially when the test case is sensitive to values in the SQL; Table 1 shows two such cases. In the first case, the 3 records from the database contain exactly one cell value "France" instead of "French" for the column "singer.Citizenship"; thus the LLM successfully predicts the correct value when these records are added to the prompt. In the second case, the database contents point out that "Aberdeen" is a city name, so the LLM can predict the correct SQL structure. The LLM gets the best score when the rows of database contents is set to 3; more database content in the prompt does not further improve the LLMs' performance. Therefore, we always set the rows of database contents to 3 in the subsequent experiments.

Few-shot Results
Table 11 shows all the few-shot experiment results on the Spider dev set, where different database styles and different numbers of static and dynamic exemplars are used. Compared with the zero-shot results, it is obvious that all the EM scores increase a lot. This is because SQL queries from the same dataset usually share similar grammar and structure, and thus the exemplars from the Spider train set lead LLMs to output a similar SQL query. Under the trivial few-shot learning setting, the TS scores also improve by 1%-3%, except for the Table(Column) database style.

The experiment results prove that our ACT-SQL approach is effective. When the GPT-3.5-turbo model uses the ACT-SQL approach with the Create(EoT) database style, it achieves the best EM score of 62.7% and the best TS score of 71.4%. The best database style changes because LLMs can learn from exemplars. Table 13 shows the case study for the ACT-SQL method. Under the trivial few-shot learning setting, a redundant column "TV_Channel.Hight_definition_TV" appears in the SELECT clause. When the ACT-SQL method is applied, the entire output generated by the LLM contains the complete thinking process, which successfully performs the schema linking. After clarifying all the tables and columns that may be used in the SQL, the LLM eventually writes the correct SQL query without any redundant schema item.
Since the GPT-4 model is expensive, we use it to evaluate our ACT-SQL approach only with the Create(EoT) database style and n_s = n_d = 2. Table 2 shows the performances of our ACT-SQL and other previous works using in-context learning with LLMs. The ACT-SQL approach uses the LLMs' API call only once for generating one SQL query and achieves the highest EM, EX, and TS scores among existing in-context learning approaches. ACT-SQL's performance is also comparable to finetuned models. Actually, finetuned models get higher scores on the dev set than on the test set, since these models are selected by the dev set performance. In contrast, in-context learning methods do not suffer from the performance gap between the dev set and the test set. Table 4 shows some previous works' performances on the Spider dev set and test set. For the finetuned approaches mentioned in the table, the performances drop from the dev set to the test set. On the contrary, for the in-context learning approaches mentioned in the table, the performances increase from the dev set to the test set. After all, finetuned models are selected by the dev set performance, which leads to overfitting on the dev set and the performance drop on the test set. For in-context learning approaches, the dev set and the test set are equal to the model; performances between the dev set and the test set are only affected by the dataset features.

Table 5, Table 6, and Table 7 show the GPT-3.5-turbo's performances on the Spider-Syn, Spider-DK, and Spider-Realistic dev sets. We use the Create(EoT) database style and set n_s = n_d = 2. The experiment results show that our approach is still comparable to finetuned models on the Spider-Syn and Spider-Realistic datasets. On the Spider-DK dataset, our approach's EX score surpasses finetuned models. This is due to the wide range of domain knowledge stored in LLMs.

Multi-turn Datasets Results
Table 8 and Table 9 show the GPT-3.5-turbo's performances on two multi-turn text-to-SQL datasets, i.e. SParC and CoSQL. The database style is set to Create(CoT) and n_d, n_s are set to 2 as before.
The ACT-SQL approach is not that effective when applied to multi-turn datasets. We believe the degraded performance stems from our two-phase method. In the first phase, we use LLMs to rewrite the questions in the interaction and convert the multi-turn dataset into a single-turn dataset. Sometimes the rewritten result's quality is poor, which impairs the schema-linking process. Table 10 shows two rewritten instances from the SParC dev set. In the first instance, the LLM correctly rewrites all sentences without missing any key information. However, in the second instance, the LLM does not remove the context dependency for the second sentence. This also leads to an error in the third sentence, where the keyword "airline" from the first sentence is missing. In general, our in-context learning method is comparable to finetuned models (GAZP + BERT), though there is still much room for improvement. Improving LLMs' performance on this difficult task is challenging future work; our work is only an initial exploration.

Conclusion
LLMs have shown strong abilities in various domains with the in-context learning method. The latest studies have attempted to use LLMs to solve the text-to-SQL task. However, previous prompting approaches either perform worse than finetuned models or need to call the LLMs' API many times. We design a CoT prompt that can be automatically generated and propose our ACT-SQL approach. The ACT-SQL approach uses the LLMs' API call only once to generate one SQL query. The experiment results prove that our approach achieves state-of-the-art performance on the Spider dev set among existing in-context learning approaches. Furthermore, we extend our approach to multi-turn text-to-SQL datasets.

Limitations
There are some limitations in our work. First of all, we use a hybrid strategy for exemplar selection. The numbers of static and dynamic exemplars are hyperparameters and still need to be determined manually. In addition, it is a relatively simple strategy that still needs improvement. Furthermore, our approach achieves relatively poor scores on some robustness variants of the Spider dataset and on some multi-turn text-to-SQL datasets. Exploration of these datasets can be conducted in future work.

Figure 1 :
Figure 1: Manually labeled CoT for the dataset example.

Figure 2 :
Figure 2: Auto-CoT for the dataset example.

Figure 3 :
Figure 3: Zero-shot performances of GPT-3.5-turbo using Table(Column) DB style with different rows of DB contents on Spider dev set.

Question: What are the names of the singers who are not French citizens?
DB content 0: SELECT Name FROM singer WHERE Citizenship != 'French'
DB content 3: SELECT Name FROM singer WHERE Citizenship != 'France'

Question: Give the flight numbers of flights leaving from Aberdeen.
DB content 0: SELECT FlightNo FROM flights WHERE SourceAirport = 'Aberdeen'
DB content 3: SELECT FlightNo FROM flights WHERE SourceAirport IN (SELECT AirportCode FROM airports WHERE City = 'Aberdeen')

Table 1 :
Case study for different rows of DB contents.

Table 2 :
Performances of ACT-SQL and other previous works on Spider dev set.

Table 3 :
Zero-shot performances of GPT-3.5-turbo with different DB styles on Spider dev set.
may be more similar to LLMs' pretraining data. Create(EoC) and Create(EoT) perform better than Create(NoPF) on the EX and TS metrics. This indicates that primary keys and foreign keys in the prompt can offer LLMs effective information.
Under the few-shot setting, Table(Column) no longer performs better than Table(Column)(PF), since the LLMs' accuracy in predicting hard and extra-hard SQL queries increases with the few-shot exemplars, and primary keys and foreign keys in the prompt thus become more important.

Table 4 :
Performances of different previous approaches on Spider dev set and test set.


Table 5 :
Performances of GPT-3.5-turbo and other previous works on Spider-Syn dev set.

Table 6 :
Performances of GPT-3.5-turbo and other previous works on Spider-DK dev set.

Table 7 :
Performances of GPT-3.5-turbo and other previous works on Spider-Realistic dev set.

Table 8 :
Performances of GPT-3.5-turbo and other previous works on SParC dev set.

Table 9 :
Performances of GPT-3.5-turbo and other previous works on CoSQL dev set.

Table 10 :
Case study for rewritten questions from SParC dev set.
C.1.1 Table(Column)

# stadium(Stadium_ID, Location, Name, Capacity, Highest, Lowest, Average)
# singer(Singer_ID, Name, Country, Song_Name, Song_release_year, Age, Is_male)
# concert(concert_ID, concert_Name, Theme, Stadium_ID, Year)
# singer_in_concert(concert_ID, Singer_ID)

We only use the Table(Column) and the Create(EoT) database styles in the following prompt examples; the other three database styles are similar. The rows of database contents is set to 3 in the following prompt examples.

We only use the Create(EoT) database style in the following prompt examples; the other four database styles are similar. The rows of database contents is set to 3 in the following prompt examples. Under the few-shot setting, the first two shots are static exemplars and the last two shots are dynamic exemplars.

Question: Find the name and savings balance of the top 3 accounts with the highest saving balance sorted by savings balance in descending order.
According to "savings balance", columns [SAVINGS.balance] may be used. According to "accounts", columns [ACCOUNTS.name] may be used. Values [3] may be used. So the final answer is:
SELECT T1.name, T2.balance FROM accounts AS T1 JOIN savings AS T2 ON T1.custid = T2.custid ORDER BY T2.balance DESC LIMIT 3

According to "flights", columns [flight.destination] may be used. Values [1] may be used. So the final answer is:
SELECT destination FROM Flight GROUP BY destination ORDER BY count(*) LIMIT 1