Adapt and Decompose: Efficient Generalization of Text-to-SQL via Domain Adapted Least-To-Most Prompting

Cross-domain and cross-compositional generalization of Text-to-SQL semantic parsing is a challenging task. Existing Large Language Model (LLM) based solutions rely on inference-time retrieval of few-shot exemplars from the training set to synthesize a run-time prompt for each Natural Language (NL) test query. In contrast, we devise an algorithm which performs offline sampling of a minimal set of few-shots from the training data, with complete coverage of SQL clauses, operators and functions, and maximal domain coverage within the allowed token length. This allows for synthesis of a fixed Generic Prompt (GP), with a diverse set of exemplars common across NL test queries, avoiding expensive test-time exemplar retrieval. We further auto-adapt the GP to the target database domain (DA-GP) to better handle cross-domain generalization, followed by decomposed Least-To-Most Prompting (LTMP-DA-GP) to handle cross-compositional generalization. Synthesizing the LTMP-DA-GP is an offline task, performed once per new database with minimal human intervention. Our approach demonstrates superior performance on the KaggleDBQA dataset, designed to evaluate generalizability for the Text-to-SQL task. We further showcase consistent performance improvement of LTMP-DA-GP over GP, across LLMs and databases of KaggleDBQA, highlighting the efficacy and model-agnostic benefits of our prompt-based adapt-and-decompose approach.


Introduction
Recently, Large Language Models (LLMs) such as GPT-3 (Brown et al., 2020a), Codex (Chen et al., 2021b) and PaLM (Chowdhery et al., 2022), pretrained on massive volumes of data, have shown improved performance on multiple reasoning tasks using in-context learning (Brown et al., 2020b; Huang and Chang, 2022), including program synthesis (Austin et al., 2021; Jain et al., 2021; Nijkamp et al., 2022) and semantic parsing (Shin and Durme, 2021; Drozdov et al., 2022; Shin et al., 2021). A few recent approaches use LLMs specifically for Text-to-SQL semantic parsing in: (i) a zero-shot setting (Rajkumar et al., 2022a; Chang and Fosler-Lussier, 2023; Nan et al., 2023), where only the test Natural Language (NL) query constitutes the prompt; (ii) a few-shot setting, where exemplars similar to the test query are retrieved from the available training data and appended to the test NL query to constitute the prompt (Poesia et al., 2022a; Chang and Fosler-Lussier, 2023; Nan et al., 2023; An et al., 2023); in this setting the available NL-SQL pairs belong to domains different from the target database domain; (iii) a few-shot setting where exemplars are NL-SQL queries sampled from the target domain itself with maximum coverage of compositions (Rajkumar et al., 2022a; Qiu et al., 2022; Hosseini et al., 2022; Yang et al., 2022; Chang and Fosler-Lussier, 2023). In this paper, we are mainly interested in (ii), i.e.
cross-domain generalization, along with the scenario where the test queries may not have the set of compositions covered in the training data (cross-compositional generalizability). Moreover, considering a purely cross-domain setting, as opposed to (iii) above, we assume NO availability of exemplars belonging to the target databases. One solution in this setting is manual synthesis of few-shots for every new target database from scratch. However, this process is tedious and time-consuming, and it does not ensure diversity in the few-shots with good coverage of SQL operators. Thus, there is a need for an efficient approach which can exploit available NL-SQL pairs from distinct domains to intelligently sample few-shots and design prompts. Synchromesh (Poesia et al., 2022a) and (Nan et al., 2023; An et al., 2023) have a similar setting, except that they retrieve exemplars at run time (during inference), selecting NL queries as exemplars based on similarity of (a) NL query semantics (Nan et al., 2023), (b) target SQLs (Target Similarity Tuning) (Poesia et al., 2022a), or (c) LLM-generated 'skill' based NL representations, focusing on program compositions and ignoring the surface NL forms (An et al., 2023). This reliance on inference-time retrieval of similar few-shots from the available data, to build a run-time prompt and generate SQL for a test NL query, results in a less efficient solution. In contrast, we devise an algorithm which samples a minimal set of few-shots from the training data, ensuring complete coverage of SQL clauses, operators and functions and maximum coverage of database domains, while fitting into the token-length restriction. We append these few-shots with the out-of-distribution test NL query to define what we term a Generic Prompt (GP), which is further used to generate the corresponding SQL. The GP is generated offline and is common across distinct test queries, resulting in a more time-efficient solution obviating the need for real-time
retrieval. We further (a) auto-adapt the GP to the target database domain, referred to as the domain-adapted GP (DA-GP), to better handle cross-domain generalization; (b) decompose it via a Least-To-Most Prompting approach (Zhou et al., 2022) (LTMP-GP) to better handle cross-compositional generalization; and (c) combine approaches (a) and (b) to exploit their complementary benefits (LTMP-DA-GP). In line with our motivation, formation of the LTMP-DA-GP is an efficient solution, as it is an offline task performed once per new database and is mostly programmatic, with minimal human intervention (needed only for validation of prompts). We further demonstrate a consistent performance improvement of (a) and (b) over the base GP, and of (c) over (a) and (b), on the databases of the KaggleDBQA dataset (Lee et al., 2021a), designed to evaluate the generalizability of the Text-to-SQL task, using distinct LLMs. Moreover, our approach not only yields an efficient solution, but also yields the best performance for the used LLMs on KaggleDBQA. Our main contributions are as follows:
• To the best of our knowledge, apart from (Nan et al., 2023), ours is the only approach implementing offline programmatic prompt generation ensuring diversity of samples for the NL-to-SQL task. Moreover, as opposed to (Nan et al., 2023), our GP-based sampling technique guarantees complete coverage of SQL operators and maximum coverage of database domains.
• Ours is the first approach to perform programmatic domain adaptation of the prompt, consistently showcasing performance improvement across multiple KaggleDBQA databases and validating NL-to-SQL domain-generalization capability.
• Ours is the first approach applying Least-to-Most Prompting (Zhou et al., 2022) for compositional generalization of the complex NL-to-SQL task, showcasing consistent performance improvement across LLMs.
• Compared to existing similarity-based exemplar sampling approaches (Poesia et al., 2022b; Chang and Fosler-Lussier, 2023; An et al., 2023), our approach of offline prompt synthesis proves to be more efficient.
• Our pipeline yields the best performance reported in the literature on the KaggleDBQA dataset for the used LLMs.

Datasets
Spider (Yu et al., 2018): A large-scale, complex, cross-domain Text-to-SQL benchmark with a total of 200 databases (140, 20 and 40 in the training, development and test splits, with 7000, 1034 and 2147 Text-to-SQL pairs, respectively). We use state-of-the-art models trained on Spider to benchmark the performance of our approach against supervised approaches. We also use the training split of Spider to sample the exemplars which serve as few-shots in our prompt (Section 4). Spider-CG (Gan et al., 2022): This dataset is designed for evaluating Text-to-SQL compositional-generalization performance. To synthesize this dataset, Text-to-SQL pairs from Spider-Train are transformed into corresponding sub-sentence and NatSQL pairs to form the Spider-SS dataset. The sub-sentences are obtained using a sentence-split algorithm, and the corresponding NatSQL, an intermediate representation of SQL, is manually annotated. We use Spider-SS for the Least-To-Most Prompting (LTMP) experiments, by retrieving the NL query decompositions in terms of sub-sentences and the corresponding intermediate NatSQL representations for the few-shots sampled from Spider-Train, to synthesize prompts for each stage of LTMP (Section 4.4).
KaggleDBQA (Lee et al., 2021b): This is a cross-domain Text-to-SQL evaluation dataset with real-world databases. We choose it over Spider-CG (Gan et al., 2022) because it is a real-web dataset and contains fewer test queries, allowing us to showcase the efficacy of our approach using commercial LLMs with low cost overheads.

Large Language Models (LLMs)
The literature has demonstrated state-of-the-art few-shot performance for NL-to-SQL (Rajkumar et al., 2022a; Nan et al., 2023; Chang and Fosler-Lussier, 2023). We use GPT-Turbo-3.5 and Text-da-Vinci-003 as the LLMs for our experiments; we do not use models such as CodeGen (Nijkamp et al., 2022) for our study as they do not offer the token length required for our prompt (4K).

Database Schema Format
As part of the prompt design, an important choice we make is the format of the schema of the databases (DBs) to which the sampled few-shot exemplars and the test query belong. In each of our experiments, we keep the DB schema format for the few-shots and the test (target) DB consistent. Existing approaches yield state-of-the-art zero- and few-shot results on the Spider-Dev dataset and its variants using the CREATE TABLE schema format with (i) TEN selected rows (Rajkumar et al., 2022b), (ii) THREE column values (Chang and Fosler-Lussier, 2023), or (iii) semantic augmentation with blocked column descriptions (Nan et al., 2023). These approaches can afford such elaborate schema formats because they use the Codex and GPT-4 LLMs with allowable token lengths > 8K. In contrast, we use models with a smaller maximum token length of 4K (Section 3). To allow inclusion of the few-shot exemplars sampled by our algorithm (Section 4.2) in the Generic Prompt, along with the reasoning stages of Least-to-Most Prompting (LTMP-GP: Section 4.4), we compromise on the elaborate schema format to fit the prompt within the allowable token length. For the GP and LTMP-GP based pipelines, we use the CREATE TABLE database schema format with Primary-Key and Foreign-Key constraints, but without column data-types, row/column values, or descriptions (Figure 1). With domain adaptation of the GP to the target database (Section 4.3), only one (target) schema needs to be included in the prompt, allowing us to use an elaborate schema format. Thus, for the domain-adapted DA-GP and its further enhancement with LTMP (LTMP-DA-GP: Section 4.5), we add the domain information to the CREATE TABLE format (Figure 1) in the form of (i) column data-types, (ii) FOUR randomly sampled values for categorical and datetime columns, (iii) the range of values for numerical columns, and (iv) additional column descriptions, wherever necessary.
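To make the compact schema format concrete, the following is a minimal sketch of serializing a database schema into the CREATE TABLE form described above (keys but no data-types or values). The schema dictionary shape and the function name are illustrative assumptions, not the paper's actual implementation.

```python
def serialize_schema(tables):
    """Render a database schema as CREATE TABLE statements with
    primary/foreign keys but no column data-types or sample values."""
    statements = []
    for table in tables:
        lines = [f"CREATE TABLE {table['name']} ("]
        entries = list(table["columns"])  # column names only, no types
        for pk in table.get("primary_key", []):
            entries.append(f"PRIMARY KEY ({pk})")
        for col, (ref_table, ref_col) in table.get("foreign_keys", {}).items():
            entries.append(f"FOREIGN KEY ({col}) REFERENCES {ref_table}({ref_col})")
        lines.append(",\n".join("  " + e for e in entries))
        lines.append(");")
        statements.append("\n".join(lines))
    return "\n\n".join(statements)
```

For the DA-GP variant, this serializer would be extended to append data-types, sampled values, value ranges, and column descriptions after each column name.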

Generic Prompt (GP) Design Algorithm
We define Algorithm 1 to sample the few-shots that form the Generic Prompt (GP). The algorithm is designed to select exemplars from a dataset with available text-SQL annotations, ensuring complete coverage of SQL clauses, operators and functions and maximum coverage of domains (databases), while fitting into the allowable token length. We assume an annotated dataset D = {db_j, {(t_ij, s_ij, a_ij)}_{i=1..N_j}}_{j=1..M}, where t_ij and s_ij are the annotated text-SQL query pairs posed on database db_j, a_ij are the answers of the SQL queries after execution, N_j is the number of query pairs of database db_j, and M is the total number of databases. We also have a test dataset T = {db_l, {t_lk}}. Thus, as we consider a completely cross-domain setting, we do not have any overlap between the training and test databases. We manually collect the SQL operators, clauses and functions covered by the queries s_ij ∈ D to form a set of primitive operations O, including (i) SQL clauses ('FROM', 'HAVING', 'WHERE', 'ORDER BY', etc.), (ii) SQL operators such as arithmetic (+, -, *, /, %), comparison (=, !=, <, >, etc.) and logical (ALL, AND, ANY, LIKE, etc.) operators, and (iii) SQL functions (AVG, COUNT, MAX, MIN, etc.).
To sample the few-shot exemplars E for the GP, we sort the databases db_j ∈ D based on the operator coverage of their SQL queries s_ij. We then traverse the query pairs (samples) of this ordered list of databases. A sample {db_j, t_ij, s_ij} becomes part of E if s_ij covers at least one uncovered primitive operation in O. If s_ij covers a super-set of the primitive operations of a query s_x in an existing exemplar {db_x, t_x, s_x} ∈ E, then that exemplar is replaced by {db_j, t_ij, s_ij}. The algorithm terminates when all the primitive operations in O are covered by the exemplars in E. This allows us to sample a minimal set of query pairs as exemplars covering the complete set of primitive operations, bringing diversity to the compositions of the SQL queries chosen as exemplars. Ordering the databases by primitive-operator coverage ensures that a minimal set of DB schemas is added to the GP, exhausting fewer tokens, while the selection of multiple DBs achieves diversity in the domains covered. We append the sampled few-shot exemplars with an NL test query, i.e. a sample {db_l, t_lk} ∈ T, retaining consistency in the schema representation, forming the GP (Figure 1).
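The greedy coverage-with-replacement idea behind Algorithm 1 can be sketched as follows. The `ops_of` extractor (naive substring matching) and the flat dataset layout are simplifying assumptions for illustration; the paper's implementation also respects the token-length budget, which is omitted here.

```python
PRIMITIVES = {"SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY", "HAVING",
              "COUNT", "AVG", "MAX", "MIN", "AND", "OR", "LIKE", "=", "<", ">"}

def ops_of(sql):
    """Primitive SQL clauses/operators/functions covered by a query
    (simplified substring check over a fixed primitive set)."""
    upper = sql.upper()
    return {op for op in PRIMITIVES if op in upper}

def sample_exemplars(dataset):
    """dataset: list of (db, text, sql). Greedily pick exemplars until every
    primitive operation occurring anywhere in the dataset is covered,
    replacing an exemplar whose operations are a strict subset of a new
    query's operations."""
    target = set().union(*(ops_of(s) for _, _, s in dataset))
    # Visit databases in decreasing order of total operator coverage,
    # so few schemas cover many primitives (fewer prompt tokens).
    by_db = {}
    for db, t, s in dataset:
        by_db.setdefault(db, []).append((db, t, s))
    order = sorted(by_db, key=lambda db: -len(
        set().union(*(ops_of(s) for _, _, s in by_db[db]))))
    exemplars, covered = [], set()
    for db in order:
        for ex in by_db[db]:
            new = ops_of(ex[2])
            if new - covered:
                # drop any existing exemplar subsumed by the new query
                exemplars = [e for e in exemplars if not ops_of(e[2]) < new]
                exemplars.append(ex)
                covered = set().union(*(ops_of(e[2]) for e in exemplars))
            if covered == target:
                return exemplars
    return exemplars
```

The subset-replacement step is what keeps the exemplar set minimal: a simpler query is dropped as soon as a compositionally richer query from the same traversal covers everything it did.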

Domain Adaptation of GP (DA-GP)
As ours is a completely cross-domain setting, the GP consists of domains defined by the databases from which the few-shots are sampled in the training dataset, which are distinct from the target database. We auto-adapt these few-shots to the domain of the target database, keeping the query compositions consistent. The hypothesis is that the adaptation should help the LLMs achieve better performance on queries of the target domain. This task is performed in three stages.
Stage 1: Generating compositionally similar SQLs in the target domain: For each few-shot SQL query in the GP, we feed the serialized source schema and SQL (without the NL) along with the serialized target schema, and prompt the LLM to generate a SQL query on the target schema which is compositionally similar to the source query, explaining in the prompt what compositional similarity is. We sample SQL queries from the beam until we find one that is executable on the target DB and whose skeleton is within a threshold tree-edit distance of the skeleton of the original few-shot SQL query. This ensures compositional similarity (Figure 1).
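The skeleton-comparison check in Stage 1 can be sketched as below. The paper uses tree-edit distance between SQL skeletons; here a token-level Levenshtein distance over skeletons stands in as a simplified proxy (an assumption), and the keyword list and threshold are illustrative.

```python
import re

def skeleton(sql):
    """Strip literals and identifiers, keeping SQL keywords/operators."""
    KEYWORDS = {"SELECT", "FROM", "WHERE", "GROUP", "ORDER", "BY", "HAVING",
                "JOIN", "ON", "AND", "OR", "COUNT", "AVG", "MAX", "MIN",
                "DESC", "ASC", "LIMIT", "=", "<", ">", "(", ")", ","}
    tokens = re.findall(r"'[^']*'|[A-Za-z_]+|\d+|[=<>(),]", sql)
    return [t.upper() if t.upper() in KEYWORDS else "_" for t in tokens]

def edit_distance(a, b):
    """Standard Levenshtein distance over token sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (x != y))
    return dp[-1]

def compositionally_similar(src_sql, cand_sql, threshold=2):
    """Accept a beam candidate whose skeleton is close to the source's."""
    return edit_distance(skeleton(src_sql), skeleton(cand_sql)) <= threshold
```

Two queries with different tables, columns, and literal values but identical clause structure thus map to the same skeleton and pass the check.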
Stage 2: Generating text queries for the SQL queries: We feed each compositionally similar SQL, generated for each few-shot exemplar in the GP, to the LLM along with the target schema, and prompt it to generate the NL question which describes the SQL query in text form. This step allows us to have an NL-SQL pair in the target domain (database) for each few-shot exemplar in the GP.
Stage 3: Using the domain-adapted GP to generate SQLs for the test NL queries: We form the Domain Adapted Generic Prompt (DA-GP) using the target schema with the available domain information (Section 4.1) and the domain-specific NL-SQL few-shot pairs. Note that the DA-GP has one database schema consistent across the few-shots as well as the test query. We append the test NL query to the DA-GP and feed it to the LLM to generate the SQL.
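The three domain-adaptation stages can be chained as in the sketch below, with the LLM stubbed out. The `llm`, `is_executable`, and `similar` callables and the prompt wording are hypothetical placeholders, not the paper's actual prompts.

```python
def adapt_exemplar(src_schema, src_sql, tgt_schema, llm, is_executable,
                   similar, beam_size=5):
    """Stage 1: ask the LLM for a compositionally similar SQL on the target
    schema; keep the first beam candidate that executes on the target DB and
    whose skeleton is close enough to the source query's."""
    prompt = (f"Source schema: {src_schema}\nSource SQL: {src_sql}\n"
              f"Target schema: {tgt_schema}\n"
              "Write a compositionally similar SQL on the target schema:")
    for cand in llm(prompt, n=beam_size):
        if is_executable(cand) and similar(src_sql, cand):
            return cand
    return None  # no acceptable candidate in the beam

def text_for_sql(tgt_schema, sql, llm):
    """Stage 2: generate an NL question describing the adapted SQL."""
    return llm(f"Schema: {tgt_schema}\nSQL: {sql}\n"
               "Question this SQL answers:", n=1)[0]

def build_da_gp(exemplars, src_schema, tgt_schema, llm, is_executable, similar):
    """Stage 3 input: domain-adapted NL-SQL few-shot pairs on one schema."""
    pairs = []
    for _, src_sql in exemplars:
        tgt_sql = adapt_exemplar(src_schema, src_sql, tgt_schema, llm,
                                 is_executable, similar)
        if tgt_sql is not None:
            pairs.append((text_for_sql(tgt_schema, tgt_sql, llm), tgt_sql))
    return pairs
```

Because every step is a deterministic function of the target schema and the fixed GP exemplars, the whole adaptation runs offline, once per new database.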

Least-to-Most-Prompting with GP (LTMP-GP)
The GP covers all the primitive SQL operations, clauses and functions. However, the few-shots cover only a few compositions of these primitive operations. We perform LTMP to help the LLMs achieve generalizability on compositions unseen in the few-shots as well as in the pre-training data. To achieve this, we decompose the NL-to-SQL task into the following three sub-tasks and semi-auto-adapt each of the few-shot exemplars in the GP for each sub-task (Figure 1). The hypothesis is that the decomposition of few-shots helps expose the underlying NL-SQL mappings at a more primitive level, through the NatSQL-based intermediate representations, which can be reused by the LLMs to synthesize SQLs for unseen compositions.
Stage 1: NL query decomposition: The few-shot NL queries in the GP, sampled from Spider-Train, along with their decompositions fetched from Spider-SS (Section 2), form the prompt for the first stage of LTMP. We append it with the test NL query to generate the decomposition for the same. To exploit the reasoning capabilities of LLMs, for each few-shot we manually include the Chain-Of-Thought (COT) behind the decomposition of the NL query, explaining the choice of split point (semantic segmentation) of the NL.
Stage 2: Mapping NatSQL to the NL decomposition: The few-shot NL queries in the GP sampled from Spider-Train, along with their decompositions and NatSQLs (an intermediate representation of the ground-truth SQLs) fetched from Spider-SS (Section 2), form the prompt for the second stage of LTMP. For each few-shot, we explain the COT behind the mapping of each decomposed NL query to the NatSQL, in terms of the selection of the SQL clause for the NatSQL (part of the skeleton of the NatSQL) and schema linking, including the specific table(s) and column(s) selected for the NL decomposition to form the NatSQL. We append the prompt with the test NL query followed by its decompositions generated in the prior stage, to generate the NatSQL for each decomposition.

Table 2: Execution accuracy (%) on KaggleDBQA.

Supervised, trained on Spider:
  (Gan et al., 2022)                 13.56
  T5-3B (Lan et al., 2023)           26.80
  SmBOP (Rubin and Berant, 2020)     27.20
  RASAT (Qi et al., 2022)            27.60
  Picard (Scholak et al., 2021)      29.80
  RESDSQL (Li et al., 2023)          31.90
  UL-20B (Lan et al., 2023)          34.90
Supervised, trained on UNITE (Lan et al., 2023):
  RASAT (Rubin and Berant, 2020)     26.80
  T5-3B (Lan et al., 2023)           33.80
  Picard (Scholak et al., 2021)      36.80
Ours:
  GPT-Turbo-3.5, LTMP-DA-GP          33.89
  Text-da-Vinci-003, LTMP-DA-GP      38.04
Stage 3: Generating SQL from NatSQL: We auto-generate the third-stage prompt of LTMP-GP to include the few-shot NL queries in the GP with their decompositions, the corresponding NatSQLs, and the ground-truth SQLs. We append this with the test NL query, its decompositions, and the corresponding NatSQLs generated in the prior stages, to generate the SQL for the test NL query.
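The chaining of the three LTMP stages, each stage's output feeding the next stage's prompt, can be sketched as below. The prompt wording and the `llm` callable are hypothetical; the three few-shot prompt prefixes are the ones built offline as described above.

```python
def ltmp(test_nl, stage_prompts, llm):
    """stage_prompts: the three few-shot prompt prefixes built offline
    (decomposition, NL->NatSQL mapping, NatSQL->SQL). Each stage's output
    is appended to the input of the subsequent stage."""
    decomposition = llm(stage_prompts[0] +
                        f"\nQuestion: {test_nl}\nDecomposition:")
    natsql = llm(stage_prompts[1] + f"\nQuestion: {test_nl}\n"
                 f"Decomposition: {decomposition}\nNatSQL:")
    sql = llm(stage_prompts[2] + f"\nQuestion: {test_nl}\n"
              f"Decomposition: {decomposition}\nNatSQL: {natsql}\nSQL:")
    return sql
```

Only the final stage's output is executed against the database; the intermediate decomposition and NatSQL serve as scaffolding that exposes the NL-to-SQL mapping at a more primitive level.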

LTMP with DA-GP (LTMP-DA-GP)
To exploit the complementary advantages of domain adaptation (domain generalization) and least-to-most prompting (compositional generalization), we perform LTMP over the DA-GP, executing all three stages explained in Section 4.4, to construct the LTMP-DA-GP. For this, the domain-adapted NL query decompositions and NatSQLs are manually created. We append the test NL query to the prompt of the first stage, and the outputs of the prior stages to the subsequent stages recursively, as explained in Section 4.4, to finally generate the SQL as the result of the last stage.

Benchmarks
State-of-the-art supervised approaches: These include models (Table 2) trained on the Spider-Train and UNITE (Lan et al., 2023) datasets, yielding SOTA results on Spider-Dev and its variants.
Zero-shot approaches: (Rajkumar et al., 2022b; Chang and Fosler-Lussier, 2023; Nan et al., 2023) yield SOTA results on Spider-Dev and its variants with Codex and GPT-4 as LLMs. For a fair comparison, we compute zero-shot KaggleDBQA results with our schema format and LLMs.
Existing few-shot approaches: Few-shots are sampled as the Top-K samples from the train set using the following sampling strategies found in the literature. For a fair comparison, we choose the number of few-shots (K) to be the same as the number of exemplars in the GP. (i) Test-query-specific TeXt-based similarity sampling (QP-TX) (Poesia et al., 2022b): exemplars having maximum semantic similarity (Reimers and Gurevych, 2019) with the test NL query. (ii) Test-query-specific Tree-Edit-distance-based similarity sampling (QP-TE) (Poesia et al., 2022b): exemplars having maximum target-program-based similarity with the test NL query. Following Synchromesh (Poesia et al., 2022b), we train Sentence BERT (Reimers and Gurevych, 2019) as a scoring function to compute the tree-edit distance between the corresponding SQLs of the input NL queries (Target Similarity Tuning (TST)). (iii) Test-query-specific skill-based similarity sampling (QP-SK): maximum skill-based similarity (An et al., 2023). Here LLMs are used to retrieve skill-based representations of the queries by eliminating unimportant surface features. (iv) Diversity-based sampling: (Nan et al., 2023) performs diversity sampling by picking the exemplars near the centroids of the training-sample clusters, formed using a combination of a continuous NL embedding and a discrete embedding with binary features representing the syntactic elements of the SQL counterpart, including keywords, operators, and identifiers. Our GP-based approach not only selects diverse samples, but also ensures SQL-operator coverage. We have not benchmarked against this approach due to the unavailability of the prompts or the code.
Chain-of-Thought (COT) approaches: DIN-SQL (Pourreza and Rafiei, 2023) performs the NL-to-SQL task by dividing it into stages, viz. schema linking, NL query classification based on difficulty, distinct well-curated COT prompting for distinct difficulty levels, along with few-shots with COT explanations, and self-refinement at the end. We use the prompts in the paper for computing its results.
Note that, for a fair comparison, we have not benchmarked KaggleDBQA results against those of LLMs not used in our experimentation (An et al., 2023; Chang and Fosler-Lussier, 2023; Nan et al., 2023).

Experimentation and Results
We use the Spider-Train set as the training set to fetch few-shots for our GP, and the KaggleDBQA test set to evaluate performance. Our algorithm yields 16 exemplars as few-shots, covering a total of 4 databases. The selected queries cover a total of 32 SQL operators and clauses. For more deterministic results, we set the LLM temperature parameter to 0.
The results are shown in Table 1. Our basic GP performs better than (i) some state-of-the-art supervised models, (ii) zero-shot prompting, and (iii) QP-TX. This is possible due to the generalization capabilities of LLMs combined with the programmatically sampled diverse few-shots in the GP. Our final adapted and decomposed LTMP-DA-GP consistently performs better than (i) all state-of-the-art supervised benchmarks and (ii) all few-shot benchmarks. This is due to the combined effect of diversity-based sampling in the GP (Nan et al., 2023; An et al., 2023) and the effects of domain adaptation and LTMP. Except on the US Wildfires database, DA-GP consistently performs better than GP, showcasing the effect of domain adaptation on domain generalizability. LTMP-GP consistently performs better than GP, showcasing the effect of LTMP on compositional generalizability. Except on the Greater Manchester and Pesticide DBs, LTMP-DA-GP consistently improves over LTMP-GP and DA-GP, demonstrating the complementary benefits of DA and LTMP for certain queries, and thus proving the efficacy of our adapt-and-decompose pipeline. We find that for the Greater Manchester DB, GPT-Turbo-3.5 performance drops with LTMP-DA-GP: with the additional DA information, the model tries to reason better, generating long COTs and leaving insufficient tokens to generate the desired output.

Qualitative Analysis
We manually analyze the test queries (Table 3). Case 1, categories (1) and (2), demonstrates samples where the inclusion of domain knowledge, in terms of table values and column descriptions, rectifies the SQLs. Case 2 demonstrates LTMP rectifying samples through better resolution of decomposed queries (e.g. 'Which country' and 'lead the total capacity of the power plants it held?'), as opposed to resolving the complete query at once with the prior approaches. Case 3 contains erroneous samples where (1) LTMP cannot fix an additional aggregation operation appearing in the SELECT clause, especially where the NL cannot be decomposed, or logically incorrect queries are generated due to insufficient domain information, such as: (2) query-specific values (e.g. 2010) not being present in the sampled values or column descriptions (e.g. season) of the schema in the prompt (the wrong column 'Year' gets picked due to its specified range of 2009-2013), and (3) absence of understanding of the domain-specific numerical formula 'downloaded numbers = sum(total snatched)' for songs in the CD Hiphop DB. There are very few samples in Case 4, where (1) an additional aggregation operation is added as part of the SELECT clause due to misinterpretation of the NL query semantics, e.g. 'state with' is interpreted by the LLM, with DA as well as LTMP, as providing some additional information along with 'state'. In Case 5, (1) LTMP fails to decompose the question due to a complex value, further propagating the error to the following stages, and (2) queries come out correct for GP and DA because the samples are arranged in order of time; with decompositions, LTMP tries to come up with the right query but fails by adding an extra DESC condition to each column, following a few-shot decomposition.

Conclusion
In this paper, we leverage LLMs for cross-domain and cross-compositional generalization of Text-to-SQL. As opposed to prior approaches, which rely on inference-time retrieval of exemplars similar to the test query, we devise an algorithm which samples a diverse set of exemplars with complete coverage of SQL operators, clauses and functions, and maximal coverage of databases, to form the Generic Prompt (GP). The GP is common across all test samples, obviating the need for dynamic exemplar retrieval and thus leading to an efficient approach. We further perform programmatic domain adaptation of this prompt (DA-GP), which consistently showcases performance improvement across multiple databases and LLMs, better achieving domain generalization. We then decompose the exemplars of the DA-GP to execute a novel pipeline of Least-to-Most Prompting (LTMP-DA-GP) for compositional generalization of the complex NL-to-SQL task. This pipeline showcases consistent improvement over the GP across multiple databases and LLMs, demonstrating the complementary benefits of the adapt and decompose steps and thus proving the efficacy of our approach. Our pipeline, being offline with minimal human intervention, is not only efficient, but also yields the best performance reported in the literature with the experimented LLMs on the KaggleDBQA dataset, designed to test the generalizability of the NL-to-SQL task.

A Supplementary Material
A.1 Existing Work on LLM-based NL-to-SQL
(Rajkumar et al., 2022a; Chang and Fosler-Lussier, 2023; Nan et al., 2023) have attempted to use LLMs for the Text-to-SQL semantic parsing task in zero-shot as well as few-shot settings. For the zero-shot setting, they experiment with various formats of the database schema, such as API docs or SQL 'CREATE TABLE' commands, with and without randomly selected data rows or columns from the database tables (elaborated in Section 4.1). For the few-shot setting, (Rajkumar et al., 2022a; Qiu et al., 2022; Hosseini et al., 2022; Yang et al., 2022) focus on cross-composition generalization and provide queries posed on the target database itself as exemplars (a cross-domain setting is not considered); the assumption is that a few queries are available for a new database. They work with datasets such as GeoQuery (Tang and Mooney, 2001; Zelle and Mooney, 1996) with data in the US geography domain, Scholar (Iyer et al., 2017) with data on academic publications, or a dataset designed for queries in the e-commerce domain (Yang et al., 2022). In our approach, we assume no availability of annotated data, in terms of SQL programs, for the test database, and thus a completely cross-domain setting.
Along similar lines to our work, Synchromesh (Poesia et al., 2022a) and (Nan et al., 2023; An et al., 2023) assume a cross-domain setting. These approaches select few-shot exemplars from the training set based on (a) semantic similarity with the NL test query (Nan et al., 2023), (b) Target Similarity Tuning (TST), where NL queries with similar target programs are selected as exemplars (Poesia et al., 2022a), or (c) selecting similar NL queries as exemplars via their LLM-generated 'skill' representation, which focuses on program compositions and ignores the surface NL forms (An et al., 2023). In addition to TST, Synchromesh (Poesia et al., 2022a) performs Constrained Semantic Decoding (CSD), which reduces implementation errors in the generated SQLs by ensuring that the generated tokens lead to correct programs following a pre-specified grammar. As constrained decoding is not the focus of our approach, we compare our performance against Target Similarity Tuning (TST) without Constrained Semantic Decoding (CSD). All the above similarity-based approaches rely on inference-time retrieval of similar few-shot samples from the available data to build a run-time prompt and generate SQL for a test NL query, leading to a less efficient solution. In contrast, we devise an algorithm to generate a prompt with diverse exemplars, generic across test queries, in an offline fashion, resulting in a more time-efficient solution obviating the need for real-time retrieval. Moreover, with further adaptation of this offline prompt via the adapt and decompose techniques, our approach yields better performance than the existing similarity-based sampling techniques (Poesia et al., 2022b; Chang and Fosler-Lussier, 2023; An et al., 2023). (Nan et al., 2023) define a 'diversity'-based sampling method for few-shot exemplar selection, where the samples in the training set are clustered using a combination of a continuous representation of the NL queries and a discrete representation of
the SQL counterparts, to pick the near-centroid samples as few-shots. They showcase that this diversity-based sampling method performs better than similarity-based sampling, which is further enhanced by similarity-diversity sampling. Our GP-based sampling technique ensures diversity in the samples by guaranteeing complete coverage of SQL operators and maximum coverage of database domains.
A.2 Addressing LLM Memorization Concerns
(Rajkumar et al., 2022a) have addressed concerns around possible memorization of existing datasets such as Spider (Yu et al., 2018) by large language models trained on code data. The possibility of memorization arises as the Spider Dev split file (dev.sql) resides on GitHub. However, prompting LLMs with verbatim fragments of this file leads to generations which do not match the file contents. For example, given a question in the format specified in the file, the table-aliasing strategy followed in the generated SQLs does not match the gold SQLs provided in the file. Along similar lines to (Rajkumar et al., 2022a), our prompting format for text queries (explained in Section 4.1) is completely different from the format in which NL-SQL pairs are stored in the Spider Dev split file. Moreover, to avoid memorization concerns, we assess the performance of our pipelines using the KaggleDBQA (Lee et al., 2021b) dataset, for which the evaluation files do not reside on GitHub. We also showcase performance improvements with our approach over the zero-shot setting on this dataset for distinct LLMs.

A.3 Prompts
In this section we provide the prompts generated by our pipeline, including (i) the GP, which is common across all the KaggleDBQA databases; (ii) the DA-GP for the GeoNuclear database (the same prompt template can be used to recreate the prompts for other databases); (iii) the LTMP-GP, which is common across all the KaggleDBQA databases; and (iv) the LTMP-DA-GP for the GeoNuclear database (again, the same prompt template can be used to recreate the prompts for other databases). The yellow part indicates the few-shot schemas, the blue part the test schema and queries, and the orange part the domain information.


Table 3: Qualitative analysis. NL: Natural Language, GT: Ground Truth, ✓: Correct SQL, ✗: Incorrect SQL.