Weakly Supervised Text-to-SQL Parsing through Question Decomposition

Text-to-SQL parsers are crucial in enabling non-experts to effortlessly query relational data. Training such parsers, by contrast, generally requires expertise in annotating natural language (NL) utterances with corresponding SQL queries. In this work, we propose a weak supervision approach for training text-to-SQL parsers. We take advantage of the recently proposed question meaning representation called QDMR, an intermediate between NL and formal query languages. Given questions, their QDMR structures (annotated by non-experts or automatically predicted), and the answers, we are able to automatically synthesize SQL queries that are used to train text-to-SQL models. We test our approach by experimenting on five benchmark datasets. Our results show that the weakly supervised models perform competitively with those trained on annotated NL-SQL data. Overall, we effectively train text-to-SQL parsers, while using zero SQL annotations.


Introduction
The development of natural language interfaces to databases has been extensively studied in recent years (Affolter et al., 2019; Kim et al., 2020; Thorne et al., 2021). The current standard is Machine Learning (ML) models which map utterances in natural language (NL) to executable SQL queries (Rubin and Berant, 2021). These models rely on supervised training examples of NL questions labeled with their corresponding SQL queries. Labeling copious amounts of data is cost-prohibitive, as it requires experts who are familiar both with SQL and with the underlying database structure (Yu et al., 2018). Furthermore, it is often difficult to re-use existing training data in one domain in order to generalize to new ones (Suhr et al., 2020). Adapting the model to a new domain requires new NL-SQL training examples, which results in yet another costly round of annotation.
In this paper we propose a weak supervision approach for training text-to-SQL parsers. We avoid the use of manually labeled NL-SQL examples and rely instead on data provided by non-expert users. Fig. 1 presents a high-level view of our approach. The input (left corner, in red) is used to automatically synthesize SQL queries (step 3, in green) which, in turn, are used to train a text-to-SQL model. The supervision signal consists of the question's answer and, uniquely, a structured representation of the question decomposition, called QDMR. The annotation of both these supervision sources can be effectively crowdsourced to non-experts (Berant et al., 2013; Pasupat and Liang, 2015; Wolfson et al., 2020). In a nutshell, QDMR is a series of computational steps, expressed by semi-structured utterances, that together match the semantics of the original question. The bottom left corner of Fig. 1 shows an example QDMR of the question "Which authors have more than 10 papers in the PVLDB journal?". The question is broken into five steps, where each step expresses a single logical operation (e.g., select papers, filter those in PVLDB) and may refer to previous steps. As QDMR is derived entirely from its question, it is agnostic to the underlying form of knowledge representation and has been used for questions over images, text and databases (Subramanian et al., 2020; Geva et al., 2021; Saparina and Osokin, 2021). In our work, we use QDMR as an intermediate representation for SQL synthesis. Namely, we implement an automatic procedure that, given an input QDMR, maps it to SQL. The QDMR can either be manually annotated or effectively predicted by a trained model, as shown in our experiments.
We continue to describe the main components of our system, using the aforementioned supervision (Fig. 1). The SQL Synthesis component (step 1) attempts to convert the input QDMR into a corresponding SQL query. To this end, Phrase DB Linking matches phrases in the QDMR with relevant columns and values in the database. Next, SQL join paths are automatically inferred given the database schema structure. Last, the QDMR, DB-linked columns and inferred join paths are converted to SQL by the SQL Mapper. In step 2, we rely on question-answer supervision to filter out incorrect candidate SQL. Thus, our Execution-guided SQL Search returns the first candidate query which executes to the correct answer.
Given our synthesis procedure, we evaluate its ability to produce accurate SQL using weak supervision. To this end, we run our synthesis on 9,313 examples of questions, answers and QDMRs from five standard text-to-SQL benchmarks (Zelle and Mooney, 1996; Li and Jagadish, 2014; Yaghmazadeh et al., 2017; Yu et al., 2018). Overall, our solution successfully synthesizes SQL queries for 77.8% of examples, thereby demonstrating its applicability to a broad range of target databases.
Next, we show our synthesized queries to be an effective alternative to training on expert-annotated SQL. We compare a text-to-SQL model, trained on the queries synthesized from questions, answers and QDMRs, to one trained using gold SQL. As our model of choice we use T5-large, which is widely used for sequence-to-sequence modeling tasks (Raffel et al., 2020). Following past work (Herzig et al., 2021), we fine-tune T5 to map text to SQL. We experiment with the SPIDER and GEO880 datasets (Yu et al., 2018; Zelle and Mooney, 1996) and compare model performance based on the training supervision. When training on manually annotated QDMRs, the weakly supervised models achieve 91% to 97% of the accuracy of models trained on gold SQL. We further extend our approach to use automatically predicted QDMRs, requiring zero annotation of in-domain QDMRs. Notably, when training on predicted QDMRs, models still reach 86% to 93% of the fully supervised versions' accuracy. In addition, we evaluate cross-database generalization of models trained on SPIDER (Suhr et al., 2020). We test our models on four additional datasets and show that the weakly supervised models are generally better than the fully supervised ones in terms of cross-database generalization. Overall, our findings show that weak supervision, in the form of questions, answers and QDMRs (annotated or predicted), is nearly as effective as gold SQL when training text-to-SQL parsers.
Our codebase and data are publicly available. 1

Background
Weakly Supervised ML The performance of supervised ML models hinges on the quantity and quality of their training data. In practice, labeling large-scale datasets for new tasks is often cost-prohibitive. This problem is further exacerbated in semantic parsing tasks (Zettlemoyer and Collins, 2005), as utterances need to be labeled with formal queries. Weak supervision is a broad class of methods aimed at reducing the need to manually label large training sets (Hoffmann et al., 2011; Ratner et al., 2017). An influential line of work has been dedicated to weakly supervised semantic parsing using question-answer pairs, referred to as learning from denotations (Clarke et al., 2010; Liang et al., 2011). Past work has shown that non-experts are capable of annotating answers over knowledge graphs (Berant et al., 2013) and tabular data (Pasupat and Liang, 2015). This approach could potentially be extended to databases by sampling subsets of their tables, such that question-answer examples can be manually annotated. A key issue in learning text-to-SQL parsers from denotations is the vast search space of potential candidate queries. Therefore, past work has focused on constraining the search space, which limited applicability to simpler questions over single tables (Wang et al., 2019). Here, we handle arbitrary SQL by using QDMR to constrain the search space.
Question Decomposition QDMR expresses the meaning of a question by breaking it down into simpler sub-questions. Given a question x, its QDMR s is a sequence of reasoning steps s_1, ..., s_|s| required to answer x. Each step s_k is an intermediate question which represents a relational operation, such as projection or aggregation. Steps may contain phrases from x, tokens signifying a query operation (e.g., "for each") and references to previous steps. Operation tokens indicate the structure of a step, while its arguments are the references and question phrases. A key advantage of QDMR is that it can be annotated by non-experts and at scale (Wolfson et al., 2020). Moreover, unlike SQL, annotating QDMR requires zero domain expertise, as it is derived entirely from the original question.
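As a concrete illustration, a step sequence with "#k" back-references of the kind described above might be represented as follows. The parsing code, step format, and example decomposition are an illustrative sketch, not the paper's actual data format.

```python
# A minimal sketch of a QDMR structure in code: steps are semi-structured
# utterances, and '#k' tokens reference the outputs of previous steps.
import re
from dataclasses import dataclass

@dataclass
class QDMRStep:
    text: str   # the step's (semi-structured) utterance
    refs: list  # indices of previous steps this step refers to

def parse_qdmr(qdmr: str) -> list:
    """Split a semicolon-delimited QDMR string into steps and
    extract '#k' references to previous steps."""
    steps = []
    for part in qdmr.split(";"):
        part = part.strip()
        refs = [int(m) for m in re.findall(r"#(\d+)", part)]
        steps.append(QDMRStep(text=part, refs=refs))
    return steps

steps = parse_qdmr(
    "papers; #1 in the PVLDB journal; authors of #2; "
    "number of #2 for each #3; #3 where #4 is more than 10"
)
```

Representing references explicitly makes it easy to later map each step to SQL while re-using the SQL of the steps it refers to.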

Weakly Supervised SQL Synthesis
Our input data contains examples of a question x_i, a database D_i, the answer a_i, and s_i, which is the QDMR of x_i. The QDMR is either annotated or predicted by a trained model f, such that f(x_i) = s_i. For each example, we attempt to synthesize a SQL query Q̂_i that matches the intent of x_i and executes to its answer, Q̂_i(D_i) = a_i. The successfully synthesized examples ⟨x_i, Q̂_i⟩ are then used to train a text-to-SQL model.

Synthesizing SQL from QDMR
Given QDMR s_i and database D_i, we automatically map s_i to SQL. Alg. 1 describes the synthesis process, where the candidate query Q̂_i is incrementally synthesized by iterating over the QDMR steps. Given step s_i^k, its phrases are automatically linked to columns and values in D_i. Then, relevant join paths are inferred between the columns. Last, each step is automatically mapped to SQL based on its QDMR operator and its arguments (see Table 1).

Phrase DB Linking
As discussed in §2, a QDMR step may have a phrase from x_i as its argument. When mapping QDMR to SQL, these phrases are linked to corresponding columns or values in D_i. For example, in Table 1 the two phrases "ships" and "injuries" are linked to the columns ship.id and death.injured, respectively. We perform phrase-column linking automatically by ranking all columns in D_i and returning the top one. The ranked list of columns is later used in §3.2 when searching for a correct assignment to all phrases in the QDMR. To compute phrase-column similarity, we tokenize both the phrase and the column, then lemmatize their tokens using the WordNet lemmatizer. The similarity score is the average GloVe word embedding similarity (Pennington et al., 2014) between the phrase and column tokens. All columns in D_i are then ranked based on their word overlap and similarity with the phrase: (1) we first return columns whose lemmatized tokens are identical to those of the phrase; (2) we then return columns that share (non stop-word) tokens with the phrase, ordered by phrase-column similarity; (3) we return the remaining columns, ordered by similarity.

Algorithm 1: SQL Synthesis
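The three-tier ranking can be sketched as follows. The suffix-stripping "lemmatizer" and token-overlap similarity below are crude stand-ins for the WordNet lemmatizer and averaged GloVe similarity that the paper uses; the stop-word list and example columns are likewise assumptions for illustration.

```python
# Sketch of three-tier phrase-column ranking (tier 1: identical lemmas,
# tier 2: shared content tokens ordered by similarity, tier 3: the rest).
def lemmatize(token):
    # crude stand-in for the WordNet lemmatizer
    return token[:-1] if token.endswith("s") else token

def tokens(text):
    return [lemmatize(t)
            for t in text.lower().replace(".", " ").replace("_", " ").split()]

def similarity(phrase_toks, col_toks):
    # token-overlap stand-in for averaged GloVe embedding similarity
    shared = set(phrase_toks) & set(col_toks)
    return len(shared) / max(len(set(phrase_toks) | set(col_toks)), 1)

STOP_WORDS = {"the", "of", "a", "in"}

def rank_columns(phrase, columns):
    p = tokens(phrase)
    exact, shared, rest = [], [], []
    for col in columns:
        c = tokens(col)
        if set(p) == set(c):                 # tier 1: identical lemmas
            exact.append(col)
        elif set(p) & set(c) - STOP_WORDS:   # tier 2: shared content tokens
            shared.append(col)
        else:                                # tier 3: everything else
            rest.append(col)
    key = lambda col: -similarity(p, tokens(col))
    return exact + sorted(shared, key=key) + sorted(rest, key=key)

ranked = rank_columns("ships", ["death.injured", "ship.id", "ship.name"])
```

For the phrase "ships", columns of the ship table share the lemma "ship" and are ranked above unrelated columns such as death.injured.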
We assume that literal values in D_i, such as strings or dates, appear verbatim in the database as they do in the question. Therefore, using string matching, we can identify the columns containing all literal values mentioned in s_i. If a literal value appears in multiple columns, they are all returned as potential links. This assumption may not always hold in practice due to DB-specific language, e.g., the phrase "women" may correspond to the condition gender = 'F'. Consequently, we measure the effect of DB-specific language in §4.2.
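Locating the columns that contain a literal value verbatim can be sketched with exact string matching over the database, as below. The in-memory SQLite schema and data are assumptions for illustration only.

```python
# Sketch: find every (table, column) whose column contains a literal value
# verbatim, by exact string matching over an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE journal (jid INTEGER, name TEXT);
    INSERT INTO journal VALUES (1, 'PVLDB'), (2, 'TACL');
""")

def columns_with_value(conn, value):
    """Return every (table, column) pair whose column contains `value`."""
    hits = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for table in tables:
        cols = [r[1] for r in conn.execute(f"PRAGMA table_info({table})")]
        for col in cols:
            n = conn.execute(
                f"SELECT COUNT(*) FROM {table} WHERE {col} = ?", (value,)
            ).fetchone()[0]
            if n > 0:
                hits.append((table, col))
    return hits

hits = columns_with_value(conn, "PVLDB")
```

When a value appears in several columns, all of them would be returned as potential links, exactly as described above.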

Join Path Inference
In order to synthesize SQL (Codd, 1970), we infer join paths between the linked columns returned in §3.1.1. Following past work (Guo et al., 2019; Suhr et al., 2020), we implement a heuristic returning the shortest join path connecting two sets of columns. To compute join paths, we convert D_i into a graph whose nodes are its tables and whose edges correspond to the foreign-key constraints connecting two tables.
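The shortest-path heuristic over the schema graph can be sketched with a breadth-first search, as below. The toy foreign-key edges are assumptions for illustration.

```python
# Sketch of shortest join-path inference: the schema is a graph whose
# nodes are tables and whose edges are foreign-key constraints; BFS
# returns the shortest path between two tables.
from collections import deque

# foreign-key edges of a toy schema (e.g., death references ship)
FOREIGN_KEYS = [("death", "ship"), ("battle", "ship")]

def shortest_join_path(src, dst, fks):
    graph = {}
    for a, b in fks:                      # FK edges are traversable both ways
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path                   # first hit found by BFS = shortest
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None                           # tables are not connected

path = shortest_join_path("death", "battle", FOREIGN_KEYS)
```

Here joining death with battle requires going through ship, mirroring how multi-hop join paths arise in real schemas.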

Table 1: Mapping the QDMR of the question "What is the ship name that caused most total injuries?" to SQL (columns: QDMR Step, Phrase-DB Linking, SQL; table body omitted).
x: "What are the populations of states through which the Mississippi river runs?"
s: the Mississippi river; states #1 runs through; the populations of #2
Figure 2: Previously mapped QDMR steps (with operations and arguments) used as nested SQL queries.
The JOINP procedure in Alg. 1 joins the tables of the columns mentioned in step s_k (cols) with those mentioned in the previous steps that s_k refers to (other_cols). If multiple shortest paths exist, it returns the first path which contains either c_i ∈ cols as its start node or c_j ∈ other_cols as its end node.
Step 3 of Table 1 underlines the join path between the death and ship tables.

QDMR to SQL Mapper
The MAPSQL procedure in Alg. 1 maps QDMR steps into executable SQL. First, the QDMR operation of each step is inferred from its utterance template using the OPTYPE procedure of Wolfson et al. (2020). Then, following the previous DB linking phase, the arguments of each step are either the linked columns and values or references to previous steps (second column of Table 1). MAPSQL uses the step's operation type and arguments to automatically map s_k to a SQL query Q̂_k. Each operation has a unique mapping rule to SQL, as shown in Table 2. SQL mapping is performed incrementally for each step. Then, when previous steps are referenced, the process can re-use parts of their previously mapped SQL (stored in the mapped array). Furthermore, our mapping procedure is able to handle complex SQL that may involve nested queries (Fig. 2) and self-joins (Fig. 3).
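The incremental mapping with re-use of previously mapped SQL can be sketched as follows. Only three operator types are shown, and the mapping rules are heavily simplified; the paper's full rule set (Table 2) covers all QDMR operations, nested queries, and self-joins.

```python
# Illustrative sketch of incremental QDMR-to-SQL mapping: each step is
# mapped by its operator's rule, and '#k' references re-use the SQL
# stored for previous steps in the `mapped` array.
def map_qdmr_to_sql(steps):
    """Each step is (op, args); '#k' args refer to previous steps."""
    mapped = []
    for op, args in steps:
        if op == "SELECT":
            # args: a linked column; its table becomes the FROM clause
            sql = f"SELECT {args[0]} FROM {args[0].split('.')[0]}"
        elif op == "FILTER":
            # args: referenced step, condition over a linked column/value
            prev = mapped[int(args[0].lstrip('#')) - 1]
            sql = f"{prev} WHERE {args[1]}"
        elif op == "AGGREGATE":
            # args: aggregate function, referenced step
            prev = mapped[int(args[1].lstrip('#')) - 1]
            sql = prev.replace("SELECT ", f"SELECT {args[0]}(", 1) \
                      .replace(" FROM", ") FROM", 1)
        else:
            raise ValueError(f"unsupported operator: {op}")
        mapped.append(sql)
    return mapped[-1]

sql = map_qdmr_to_sql([
    ("SELECT", ["ship.name"]),
    ("FILTER", ["#1", "ship.country = 'UK'"]),
    ("AGGREGATE", ["COUNT", "#2"]),
])
```

The final step's SQL is returned as the candidate query for the whole decomposition.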
x: "What papers were written by both H. V. Jagadish and also Yunyao Li?"
Figure 3: Handling self-joins in QDMR mapping. The two FILTER steps have conflicting assignments to the same column and are identified as a self-join. This is resolved by using a nested query in the SQL of step 3.

Execution-guided SQL Candidate Search
At this point we have Q̂_i, a potential SQL candidate. However, this candidate may be incorrect, either due to a wrong phrase-column linking or due to its original QDMR structure. To mitigate these issues, we search for accurate SQL candidates using the answer supervision. Following phrase DB linking (§3.1.1), each phrase is assigned its top-ranked column in D_i. However, this assignment may potentially be wrong. In step 1 of Fig. 1, the phrase "authors" is incorrectly linked to author.aid instead of author.name. Given our weak supervision, we do not have access to the gold phrase-column linking and rely instead on the gold answer a_i. Namely, we iterate over phrase-column assignments and synthesize their corresponding SQL. Once an assignment whose SQL executes to a_i has been found, we return it as our result. We iterate only over assignments that cover the top-k ranked columns for each phrase, which is shown to work very well in practice (§4.2).
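The core of this execution-guided search can be sketched as below: candidates induced from successive phrase-column assignments are executed until one matches the gold answer. The toy database, the candidate list, and the error-handling choice are assumptions for illustration.

```python
# Sketch of execution-guided search: return the first candidate SQL
# query that executes to the gold answer on the database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (aid INTEGER, name TEXT);
    INSERT INTO author VALUES (7, 'H. V. Jagadish'), (8, 'Yunyao Li');
""")

def execute(conn, sql):
    try:
        return set(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None          # malformed candidates simply fail the check

def search_sql(conn, candidates, gold_answer):
    for sql in candidates:
        if execute(conn, sql) == gold_answer:
            return sql
    return None              # no candidate executed to the gold answer

gold = {("H. V. Jagadish",), ("Yunyao Li",)}
found = search_sql(conn, [
    "SELECT aid FROM author",    # wrong linking: executes to ids, not names
    "SELECT name FROM author",   # correct linking
], gold)
```

This mirrors the example above: the candidate linking "authors" to author.aid executes to the wrong result, and the search moves on to the assignment using author.name.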
Failing to find a correct candidate SQL may be due to the QDMR structure rather than phrase-column linking. As s_i is derived entirely from the question, it may fail to capture database-specific language. E.g., in the question "How many students enrolled during the semester?" the necessary aggregate operation may change depending on the database structure. If D_i has the column course.num_enrolled, the query should sum its entries for all courses in the semester. Conversely, if D_i has the column course.student_id, the corresponding query would need to count the number of enrolled students. We account for these structural mismatches by implementing three additional search heuristics which modify the structure of a candidate Q̂_i. If the candidate executes to the correct result following modification, it is returned by the search process. These modifications are described in detail in Appendix B. Namely, they include the addition of a DISTINCT clause, converting QDMR FILTER steps into SUPERLATIVE steps, and switching between the COUNT and SUM operations.

Experiments
Our experiments target two main research questions. First, given access to weak supervision in the form of question-answer pairs and QDMRs, we wish to measure the percentage of SQL queries that can be automatically synthesized. Therefore, in §4.2 we measure SQL synthesis coverage using 9,313 examples taken from five benchmark datasets. Second, in §4.3 we use the synthesized SQL to train text-to-SQL models and compare their performance to models trained on gold SQL annotations.

Setting
Datasets We evaluate both the SQL synthesis coverage and text-to-SQL accuracy using five text-to-SQL datasets (see Table 3). The first four datasets contain questions over a single database: ACADEMIC (Li and Jagadish, 2014) has questions over the Microsoft Academic Search database; GEO880 (Zelle and Mooney, 1996) concerns US geography; IMDB and YELP (Yaghmazadeh et al., 2017) contain complex questions over a film and a restaurant database, respectively. The SPIDER dataset (Yu et al., 2018) measures domain generalization between databases, and therefore contains questions over 160 different databases. For QDMR data we use the BREAK dataset (Wolfson et al., 2020). The only exception is 259 questions of IMDB and YELP, outside of BREAK, which we (the authors) annotated with corresponding QDMRs and release with our code. See Appendix C for licenses.
Training We fine-tune the T5-large sequence-to-sequence model (Raffel et al., 2020) for both text-to-SQL parsing and QDMR parsing (§4.2). Namely, for each task we fine-tune the pre-trained model on its specific data. For text-to-SQL, we fine-tune on mapping utterances x_i; cols(D_i) to SQL, where the sequence cols(D_i) is a serialization of all columns in database D_i, in an arbitrary order. In QDMR parsing, input questions are mapped to output QDMR strings. We use the T5 implementation by HuggingFace (Wolf et al., 2020) and train using the Adam optimizer (Kingma and Ba, 2014). Following fine-tuning on the dev sets, we set the batch size to 128 and the learning rate to 1e-4 (after experimenting with 1e-5, 1e-4 and 1e-3). All models were trained on an NVIDIA GeForce RTX 3090 GPU.
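The input serialization for text-to-SQL fine-tuning might look as follows. The separator tokens and exact format here are assumptions; the paper only specifies that the utterance is paired with a flat listing of all database columns in an arbitrary order.

```python
# Sketch: serialize the model input as the question followed by a flat
# listing of all database columns, cols(D_i). The '|' separator and the
# 'table.column' spelling are illustrative choices, not the paper's.
def serialize_input(question, schema):
    """schema: mapping of table name -> list of column names."""
    cols = ", ".join(
        f"{table}.{col}" for table, columns in schema.items()
                         for col in columns)
    return f"{question} | {cols}"

src = serialize_input(
    "How many ships ended up being captured?",
    {"ship": ["id", "name", "disposition_of_ship"]},
)
```

The serialized string would then be tokenized and fed to the sequence-to-sequence model, with the (gold or synthesized) SQL as the target sequence.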

SQL Synthesis Coverage
Our first challenge is to measure our ability to synthesize accurate SQL. To evaluate SQL synthesis, we define its coverage to be the percentage of examples for which it successfully produces a SQL query Q̂ that executes to the correct answer. To ensure our procedure is domain independent, we test it on five different datasets, spanning 164 databases (Table 3).

Annotated QDMRs
The upper rows of Table 3 list the SQL synthesis coverage when using manually annotated QDMRs from BREAK. Overall, we evaluate on 9,313 QDMR annotated examples, reaching a coverage of 77.8%. Synthesis coverage for single-DB datasets tends to be slightly higher than that of SPIDER, which we attribute to its larger size and diversity. To further ensure the quality of synthesized SQL, we manually validate a random subset of 100 queries out of the 7,249 that were synthesized. Our analysis reveals 95% of the queries to be correct interpretations of their original question. In addition, we evaluate synthesis coverage on a subset of 8,887 examples whose SQL denotations are not the empty set. As SQL synthesis relies on answer supervision, discarding examples with empty denotations eliminates the false positives of spurious SQL which incidentally execute to an empty set. Overall, coverage on examples with non-empty denotations is nearly identical, at 77.6% (see Appendix D). We also perform an error analysis on a random set of 100 failed examples, presented in Table 4. SQL synthesis failures are mostly due to QDMR annotation errors or implicit database-specific conditions. E.g., in GEO880 the phrase "major river" should implicitly be mapped to the condition river.length > 750. As our SQL synthesis is domain-general, it does not memorize any domain-specific jargon or rules.
Predicted QDMRs While QDMR annotation can be crowdsourced to non-experts, moving to a new domain may incur annotation costs. We therefore also evaluate SQL synthesis using QDMRs predicted by a trained QDMR parser (§4.1). In Table 3, the last row shows that coverage on SPIDER dev is nearly identical to that of manually annotated QDMRs (77.6% vs. 77.2%).

Training Text-to-SQL Models
Next, we compare text-to-SQL models trained on our synthesized data to training on expert-annotated SQL. Given examples ⟨x_i, D_i⟩, we test the following settings: (1) a fully supervised training set that uses gold SQL annotations; (2) a weakly supervised training set, where given answer a_i and QDMR s_i, we synthesize queries Q̂_i. As SQL synthesis coverage is less than 100%, the process returns a subset of m < n examples {⟨x_i, Q̂_i, D_i⟩} (i = 1, ..., m) on which the model is trained.

Training Data
We train models on two text-to-SQL datasets: SPIDER (Yu et al., 2018) and GEO880 (Zelle and Mooney, 1996). As our weakly supervised training sets, we use the synthesized examples ⟨x_i, Q̂_i, D_i⟩ described in §4.2 (using annotated QDMRs). We successfully synthesized 5,349 training examples for SPIDER and 454 for GEO880 train.

Models and Evaluation
Models We fine-tune T5-large for text-to-SQL, using the hyperparameters from §4.1. We choose T5 as it is agnostic to the structure of its input sequences. Namely, it has been shown to perform competitively on different text-to-SQL datasets, regardless of their SQL conventions (Herzig et al., 2021). This property is particularly desirable in our cross-database evaluation (§4.3). We train and evaluate the following models: T5-SQL-G, trained on gold SQL annotations; T5-QDMR-G, trained on SQL synthesized from annotated QDMRs; and T5-QDMR-P, trained on SQL synthesized from predicted QDMRs. Comparing against the gold-SQL model helps us measure the degree to which the smaller size of the synthesized training data and its different query structure (compared to gold SQL) affect performance.

Evaluation Metric Because our SQL is automatically synthesized, its syntax often differs from that of the gold SQL (see Appendix E.2). As a result, the exact set match (ESM) metric of Yu et al. (2018) does not fit our evaluation setup. Instead, we follow Suhr et al. (2020) and evaluate text-to-SQL models using the execution accuracy of their output queries. We define execution accuracy as the percentage of output queries which, when executed on the database, result in the same set of tuples (rows) as a_i.
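The execution-accuracy metric can be sketched as below: a prediction counts as correct when it executes to the same set of rows as the gold answer, regardless of its surface syntax. The toy database and queries are assumptions for illustration.

```python
# Sketch of execution accuracy: compare the row sets produced by
# predicted queries against the gold answer sets.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE state (name TEXT, population INTEGER);
    INSERT INTO state VALUES ('Texas', 29), ('Ohio', 11);
""")

def rows(conn, sql):
    try:
        return set(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None           # queries that fail to execute are incorrect

def execution_accuracy(conn, predictions, gold_answers):
    correct = sum(
        1 for sql, gold in zip(predictions, gold_answers)
        if rows(conn, sql) == gold)
    return correct / len(predictions)

acc = execution_accuracy(
    conn,
    ["SELECT name FROM state WHERE population > 20",
     "SELECT population FROM state"],     # executes to the wrong rows
    [{("Texas",)}, {("Texas",)}],
)
```

Comparing denotations rather than query strings is what makes the metric robust to the syntactic differences between synthesized and gold SQL.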

Training on Annotated QDMRs
We begin by comparing the models trained using annotated QDMRs to those trained on gold SQL; the discussion of T5-QDMR-P, trained using predicted QDMRs, is left for §4.3.4. The results in Tables 5-7 list the average accuracy and standard deviation of three model instances, trained using separate random seeds. Tables 5-6 present the results of the SPIDER-trained models. As SPIDER is used to train cross-database models, we further evaluate our models' performance on cross-database semantic parsing (XSP) (Suhr et al., 2020). In Table 6 we test on four additional text-to-SQL datasets (sizes in parentheses): ACADEMIC (183), GEO880 (877), IMDB (113) and YELP (66). For ACADEMIC, IMDB and YELP we removed examples whose execution results in an empty set. Otherwise, the significant percentage of such examples would produce false positives from predictions which incidentally execute to an empty set. In practice, evaluation on the full datasets remains mostly unchanged and is provided in Appendix E. Similarly to Suhr et al. (2020), the results in Table 6 show that SPIDER-trained models struggle to generalize to XSP examples. However, T5-QDMR-G performance is generally better on XSP examples, which further indicates that QDMR and answer supervision is effective compared to gold SQL. Example predictions are shown in Appendix E.2. Table 7 lists the execution accuracy of models trained on GEO880. Models were trained for 300 epochs, fine-tuned on the dev set and then evaluated on the 280 test examples. We note that T5-QDMR-G achieves 90.7% of the performance of T5-SQL-G (74.5 vs. 82.1). The larger performance gap, compared to SPIDER models, may be partly due to the dataset size. As GEO880 has 547 training examples, having fewer synthesized SQL queries to train T5-QDMR-G on (454) may have had a greater effect on its accuracy.

Training on Predicted QDMRs
We extend our approach by replacing the annotated QDMRs with the predictions of a trained QDMR parser (a T5-large model, see §4.1). In this setting, we now have two sets of questions: (1) questions used to train the QDMR parser; (2) questions used to synthesize NL-SQL data. We want these two sets to be as separate as possible, so that training the QDMR parser would not require new in-domain annotations. Namely, the parser must generalize to questions in the NL-SQL domains while being trained on as few of these questions as possible.
SPIDER Unfortunately, SPIDER questions make up a large portion of the BREAK training set, used to train the QDMR parser. We therefore experiment with two alternatives to minimize the in-domain QDMR annotations of NL-SQL questions. The first is to train the parser using few-shot QDMR annotations for SPIDER. The second is to split SPIDER into questions used as the NL-SQL data, while the rest are used to train the QDMR parser.
In Table 5, T5-QDMR-P is trained on 5,075 queries, synthesized using predicted QDMRs (and answer supervision) for SPIDER train questions. The predictions were generated by a QDMR parser trained on a subset of BREAK, excluding all SPIDER questions save 700 (10% of SPIDER train). Keeping few in-domain examples minimizes additional QDMR annotation while preserving prediction quality. Training on the predicted QDMRs, instead of the annotated ones, results in accuracy that is 2.9 points lower (62.9 vs. 65.8), while the model achieves 92.5% of T5-SQL-G performance on SPIDER dev. On XSP examples, T5-QDMR-P is competitive with T5-QDMR-G (Table 6).
In Table 8, we experiment with training T5-QDMR-P without in-domain QDMR annotations. We avoid any overlap between the questions and domains used to train the QDMR parser and those used for SQL synthesis. We randomly sample 30-40 databases from SPIDER and use their corresponding questions exclusively as our NL-SQL data. For training the QDMR parser, we use BREAK while discarding the sampled questions. We experiment with 3 random samples of SPIDER train, numbering 1,348, 2,028 and 2,076 examples, with synthesized training data of 1,129, 1,440 and 1,552 examples respectively. Results in Table 8 show that, on average, T5-QDMR-P achieves 95.5% of the performance of T5-SQL-G. This indicates that even without any in-domain QDMR annotations, data induced from answer supervision and out-of-domain QDMRs is effective in training text-to-SQL models, compared to gold SQL.
GEO880 For predicted QDMRs on GEO880, we train the QDMR parser on BREAK while discarding all of its 547 GEO880 questions. Therefore, the parser was trained without any in-domain QDMR annotations for GEO880. SQL synthesis using the predicted QDMRs resulted in 432 queries. In Table 7, T5-QDMR-P reaches 85.7% of T5-SQL-G performance while being trained using question-answer supervision and no in-domain QDMR annotations.

Table 8: SPIDER model results on the dev set. T5-QDMR-P is trained without using any QDMR annotations for training-set questions. We train separate models on the three randomly sampled training sets.

Limitations
Our approach uses question decompositions and answers as supervision for text-to-SQL parsing. As annotating SQL requires expertise, our solution serves as a potentially cheaper alternative. Past work has shown that non-experts can provide the answers for questions on knowledge graphs (Berant et al., 2013) and tables (Pasupat and Liang, 2015). However, manually annotating question-answer pairs on large-scale databases may present new challenges, which we leave for future work. During SQL synthesis we assume that literal values (strings or dates) appear verbatim in the database as they do in the question. We observe that, for multiple datasets, this assumption generally holds true (§4.2). Still, for questions with domain-specific jargon (Lee et al., 2021) our approach might require an initial step of named-entity recognition. Failure to map a QDMR to SQL may be due to a mismatch between a QDMR and its corresponding SQL structure (§3.2). We account for such mismatches by using heuristics to modify the structure of a candidate query (Appendix B). A complementary approach could train a model mapping QDMR to SQL, to account for cases where our heuristic rules fail. Nevertheless, our SQL synthesis covers a diverse set of databases and query patterns, as shown in our experiments.

Related Work
For a thorough review of NL interfaces to databases see Affolter et al. (2019); Kim et al. (2020). Research on parsing text-to-SQL gained significant traction in recent years with the introduction of large supervised datasets for training models and evaluating their performance (Zhong et al., 2017; Yu et al., 2018). Recent approaches relied on specialized architectures combined with pre-trained language models (Guo et al., 2019; Lin et al., 2020; Yu et al., 2021; Deng et al., 2021; Scholak et al., 2021). As our solution synthesizes NL-SQL pairs (using weak supervision), it can be used to train supervised text-to-SQL models.
Also related is the use of intermediate meaning representations (MRs) in mapping text-to-SQL. In past work MRs were either annotated by experts (Yaghmazadeh et al., 2017;Kapanipathi et al., 2020), or were directly induced from such annotations as a way to simplify the target MR (Dong and Lapata, 2018;Guo et al., 2019;Herzig et al., 2021). Instead, QDMR representations are expressed as NL utterances and can therefore be annotated by non-experts. Similarly to us, Saparina and Osokin (2021) map QDMR to SPARQL. However, our SQL synthesis does not rely on the annotated linking of question phrases to DB elements (Lei et al., 2020). In addition, we train models without gold QDMR annotations and test our models on four datasets in addition to SPIDER.

Conclusions
This work presents a weakly supervised approach for generating NL-SQL training data, using answer and QDMR supervision. We implemented an automatic SQL synthesis procedure, capable of generating effective training data for dozens of target databases. Experiments on multiple text-to-SQL benchmarks demonstrate the efficacy of our synthesized training data. Namely, our weakly supervised models achieve 91%-97% of the performance of fully supervised models trained on annotated SQL. Further constraining our models' supervision to few or zero in-domain QDMRs still reaches 86%-93% of the fully supervised models' performance. Overall, we provide an effective solution to train text-to-SQL parsers while requiring zero SQL annotations.


A QDMR to SQL Mapping Rules

Table 9 lists all of the QDMR operations along with their mapping rules to SQL. For a thorough description of QDMR semantics, please refer to Wolfson et al. (2020).

B SQL Candidate Search Heuristics
We further describe the execution-guided search process for candidate SQL queries that was introduced in §3.2. Given the search space of candidate queries, we use four heuristics to find candidates Q̂_i which execute to the correct answer, a_i.

1. Phrase linking search:
We avoid iterating over each phrase-column assignment by ordering the assignments according to their phrase-column ranking, as described in §3.1.1. The query Q̂_i^(1) is induced from the top-ranked assignment, where each phrase in s_i is assigned its top-ranked column. If Q̂_i^(1)(D_i) ≠ a_i, we continue the candidate search using heuristics 2-4 (described below). Assuming that the additional search heuristics failed to find a modified candidate Q̂_i^(1)′ such that Q̂_i^(1)′(D_i) = a_i, we return to the phrase linking component and resume the process using the candidate SQL induced from the following assignment, Q̂_i^(2), and so forth. In practice, we limit the number of assignments and review only those covering the top-k most similar columns for each phrase in s_i, where k = 20. Our error analysis (Table 4) reveals that only a small fraction of failures are due to limiting k.
Step 2 in Fig. 1 represents the iterative process, where Q̂_i^(1) executes to an incorrect result, while the following candidate Q̂_i^(2) correctly links the phrase "authors" to column author.name and executes to a_i, thereby ending the search.
2. Distinct modification: Given a candidate SQL query Q̂_i such that Q̂_i(D_i) ≠ a_i, we add DISTINCT to its SELECT clause. In Table 10 the SQL executes to the correct result following its modification.

3. Superlative modification:
This heuristic automatically corrects semantic mismatches between annotated QDMR structures and the underlying database. Concretely, steps in s_i that represent PROJECT and FILTER operations may entail an implicit ARGMAX/ARGMIN operation. Take, for example, the question "What is the size of the largest state in the USA?" in the third row of Table 10.
Step (3) of the question's annotated QDMR is the PROJECT operation, "state with the largest #2". While conforming to the PROJECT operation template, the step entails an ARGMAX operation. Using the NLTK part-of-speech tagger, we automatically identify any superlative tokens in the PROJECT and FILTER steps of s i . These steps are then replaced with the appropriate SUPERLATIVE type steps. In Table 10, the original step (3) is modified to the step "#1 where #2 is highest".

4. Aggregate modification:
This heuristic replaces instances of COUNT in QDMR steps with SUM operations, and vice-versa. In Table 10, the question "Find the total student enrollment for different affiliation type schools." is incorrectly mapped to a candidate query involving a COUNT operation on university.enrollment. By modifying the aggregate operation to SUM, the new Q̂_i correctly executes to a_i and is therefore returned as the output.
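Two of these structural modifications (heuristics 2 and 4) amount to simple query rewrites, sketched below. In the actual search, each modified candidate would be re-executed and kept only if it matches a_i; here we show just the rewriting. The example query is an assumption for illustration.

```python
# Sketch of two candidate-modification heuristics: adding DISTINCT to
# the SELECT clause, and swapping COUNT for SUM (and vice versa).
import re

def add_distinct(sql):
    # rewrite only the leading SELECT keyword
    return re.sub(r"^SELECT\b", "SELECT DISTINCT", sql, count=1)

def swap_aggregate(sql):
    if "COUNT(" in sql:
        return sql.replace("COUNT(", "SUM(")
    return sql.replace("SUM(", "COUNT(")

q = "SELECT COUNT(enrollment) FROM university"
modified = [add_distinct(q), swap_aggregate(q)]
```

The superlative modification (heuristic 3) operates on the QDMR steps themselves rather than on the SQL string, so it is not shown here.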

C Data License
We list the license (when publicly available) and the release details of the datasets used in our paper. The BREAK dataset (Wolfson et al., 2020) is under the MIT License. SPIDER (Yu et al., 2018) is under the CC BY-SA 4.0 License. GEO880 (Zelle and Mooney, 1996) is available under the GNU General Public License 2.0.
The IMDB and YELP datasets were publicly released by Yaghmazadeh et al. (2017) at: goo.gl/DbUBMM.

D SQL Synthesis Coverage
We provide additional results of SQL synthesis coverage, evaluated on the subset of examples whose denotations are not the empty set. These results further indicate the effectiveness of the SQL synthesis procedure. Namely, this ensures the synthesis results in Table 3 are faithful, despite the potential noise introduced by SQL with empty denotations.

E.1 Evaluation on the Full XSP Datasets
We provide additional results of the models trained on SPIDER. Namely, we evaluate on all examples of the ACADEMIC, IMDB and YELP datasets, including examples whose denotations are empty. Table 12 lists the results of all the models trained on the original training set of SPIDER. In Table 13 we provide the XSP results of the models trained on the random subsets of SPIDER train, used in §4.3.4. Similar to our previous experiments, T5-QDMR-P is generally better than T5-SQL-G in terms of its cross-database generalization.

Question: Find the actor with most number of films.
Target SQL: select actor_0.name from actor as actor_0, cast as cast_0, movie as movie_0 where actor_0.aid = cast_0.aid and cast_0.msid = movie_0.mid order by count(distinct(movie_0.title)) desc limit 1;
T5-SQL-G: select t1.name from actor as t1 join cast as t2 on t1.aid = t2.id group by t1.aid order by count(*) desc limit 1; ✗
Table 14: Example predictions of the SPIDER-trained models from Tables 5-6. We denote correct and incorrect predictions by ✓ and ✗.