Semi-Automatic Construction of Text-to-SQL Data for Domain Transfer

Strong and affordable in-domain data is a desirable asset when transferring trained semantic parsers to novel domains. As previous methods for semi-automatically constructing such data cannot handle the complexity of realistic SQL queries, we propose to construct SQL queries via context-dependent sampling, and introduce the concept of topic. Along with our SQL query construction method, we propose a novel pipeline of semi-automatic Text-to-SQL dataset construction that covers the broad space of SQL queries. We show that the created dataset is comparable with expert annotation along multiple dimensions, and is capable of improving domain transfer performance for SOTA semantic parsers.


Introduction
Due to the broad use of SQL in real-world databases, the task of mapping natural language questions to SQL queries (Text-to-SQL) has drawn considerable attention. Several large-scale cross-domain Text-to-SQL datasets have been manually constructed and have advanced the development of Text-to-SQL semantic parsing (Zhong et al., 2017; Yu et al., 2018).
While these datasets are built for domain-general semantic parsing, current state-of-the-art (SOTA) semantic parsers still suffer a sharp performance drop when generalising to unseen domains (Guo et al., 2019).
This could be attributed to the observation that the mapping from text to SQL varies vastly across different domains, particularly in terms of the expressions of predicates (an example illustrating such a difference is presented in Appendix A). It is very difficult for models to generalize to those variations in a zero-shot fashion. Thus, additional in-domain data is desirable when applying semantic parsers to novel domains. Unfortunately, the cost and scarcity of supervised data have been a major barrier to the wider application of the Text-to-SQL task, as creating pairs of natural language questions and SQL queries is a complex task demanding expertise in both the SQL language and the specific domains. Take the SPIDER dataset (Yu et al., 2018) for example: its 10,181 Text-SQL pairs over 200 databases (from 138 domains) required 11 computer science graduates to invest 1,000 human hours.
In semantic parsing, some semi-automatic dataset construction methods have been proposed. Wang et al. (2015) built logical forms compositionally, converted them to rigid pseudo natural language (Pseudo-NL) questions with rules, then crowd-sourced those Pseudo-NL questions into NL questions. Cheng et al. (2018) further broke down the pseudo-NL questions into question sequences to make them more digestible for crowd workers.
While these approaches shed light on the methodology of semi-automatic construction of semantic parsing datasets, applying them to collect broad-coverage Text-to-SQL data for domain transfer is not trivial. Firstly, the SQL language has a much larger variety of realistic queries than Lambda-DCS logical forms (Liang, 2013), which were the focus of earlier work; blind enumeration up to a certain depth from a CFG is therefore intractable in size. Secondly, Herzig and Berant (2019) have discovered a mismatch between semi-automatically constructed queries and real-world queries, in terms of the distribution of logical forms and the style of natural language expressions. As achieving accuracy gains in domain transfer demands high-quality in-domain data, narrowing these mismatches is crucial.
In this paper, we propose a novel semi-automatic pipeline for robust construction of Text-SQL pairs as training data in novel domains, with broad semantic coverage, starting from databases only. Following Wang et al. (2015), our pipeline consists of three parts: automatic SQL query construction, SQL-to-Pseudo-NL conversion and Pseudo-NL-to-NL paraphrasing. In SQL query construction, we use a context-dependent probabilistic approach: first, we choose a topic of interest, a random n-tuple of tables in the novel domain, such as Concerts and Stadiums; then, we sample from a set of grammar rules, at each step pruning the generation space based on decision history. In SQL-to-Pseudo-NL conversion, we follow Cheng et al. (2018) in breaking down constructed SQL queries with templates, but also assign a "dominant concept" to topics to simplify the Pseudo-NL questions. In Pseudo-NL-to-NL paraphrasing, we run crowd annotation, providing crowd workers with various annotation scaffolds to collect quality NL questions. An example run through our pipeline is shown in Figure 1.
We show that by using schema-inspired topics and context-dependent sampling instead of blind enumeration, SQL queries analogous to real-world queries can be constructed, and effective fine-tuning datasets for domain transfer can be built. Our experiments show that even a modest amount of our data facilitates domain transfer across a range of semantic parsers, raising accuracy in novel domains by up to 1% (our code and data will be released at https://github.com/Teddy-Li/SemiAuto_Data_Text_SQL).

Related Work
Alternative Supervision To resolve the difficulty in gathering supervised data for semantic parsing, various methods have been proposed from different perspectives.
Numerous approaches have explored distant supervision to bypass the use of expensive annotated data. Kwiatkowski et al. (2013); Berant et al. (2013); Yao and Van Durme (2014); Berant and Liang (2014) used question-answer pairs as supervision instead of logical form annotations; Reddy et al. (2014) used web-scale corpora of descriptive sentences, formalizing semantic parsing as a graph matching problem. These methods perform well on factoid questions; however, for more complex questions, it is harder to infer the underlying queries from the answers alone.
Later on, semi-automatic data collection methods, which reduce the cost of annotation, have received considerable attention. Wang et al. (2015) proposed to compose logical forms by combining sub-components through a grammar, and to translate the resulting trees into NL questions via rigid Pseudo-NL questions and crowd-paraphrasing. Cheng et al. (2018) further replaced the Pseudo-NL questions with question sequences to simplify the annotation of individual sentences.

Figure 1: An example of the data construction pipeline, from the topic of "concerts and stadiums" to the final results: the SQL query and the paired natural language question.
While those approaches pioneered semi-automatic data collection, their query construction method based on exhaustive enumeration has its weaknesses. Herzig and Berant (2019) showed that there exists a significant mismatch between the distributions of created logical forms and real query logical forms, and between the language style of paraphrases and that of real questions.
Text-to-SQL Text-to-SQL, as a semantic parsing task, has attracted increasing interest, and multiple large-scale datasets have been released. Zhong et al. (2017) created a large single-table Text-to-SQL dataset, WikiSQL, from Wikipedia entries, upon which many semantic parsers have been trained, achieving accuracies surpassing 80% (Chang et al., 2020; Lyu et al., 2020; Hwang et al., 2019; He et al., 2019). Yu et al. (2018) proposed SPIDER, another large-scale Text-to-SQL dataset with multi-table databases, much wider grammar coverage, and more complex queries, involving features such as GROUP-BY aggregation, ORDER-BY and nested queries. This has made SPIDER a more realistic, and also more challenging, parsing task, with SOTA semantic parsers achieving accuracies above 60% (Guo et al., 2019).
Although SPIDER is considered a domain-general semantic parsing dataset, semantic parsers trained on it still suffer a sharp performance drop when generalizing to unseen domains (see the SPIDER leaderboard: https://yale-lily.github.io/spider).
Thus, additional resources for domain transfer are appealing. However, in this more complex multi-table Text-to-SQL task, previous semi-automatic dataset construction methods face an even greater challenge: with multi-table SQL queries and more complex clauses, exhaustive enumeration is intractable in size and prone to mismatches.
More recently, various methods of Text-to-SQL dataset construction have been proposed (Yu et al., 2020; Zhong et al., 2020; Zhang et al., 2021), further automating the SQL-to-NL step with neural question generation. However, for query construction, they either perform vanilla grammar-based SQL query sampling (Zhang et al., 2021) or use template sketches from existing datasets (Zhong et al., 2020; Yu et al., 2020). We focus instead on the context-dependent construction of SQL queries that both generalize beyond existing datasets and remain realistic.

Method
Our pipeline takes a database schema as input, and outputs a set of aligned pairs of NL questions and SQL queries. We start by selecting a topic of interest from the schema, followed by sampling production rules from a concise SQL query grammar as in Figure 2, resulting in a constructed SQL query. We then convert the query into a sequence of pseudo-NL questions, and finally crowd-paraphrase the sequence into a fluent NL question.
Specifically, we highlight two key features of our SQL query construction algorithm, which assert control over the sampling process:
• We set the topic, namely the attended subset of tables in the database, with an algorithm based on Pagerank and Prim's MST algorithm;
• At each step of sampling, we prune the space of candidates by conditioning the distribution of production rules heuristically on the decision history, namely ancestor nodes, left siblings and left siblings' descendants.

Setting the Topic
For each valid NL question, there is one topic referring to a concrete concept; similarly, for each real SQL query, however complex, there is one topic, the set of entities it attends to, typically specified in the 'FROM' clause, that should reflect some concrete concept. For queries involving one table, the topic is simply the table itself; for queries involving multiple tables, which are an iconic feature of SPIDER, it is crucial to identify which tables should be bound together and how. Our approach here, which is a novel contribution of this paper, takes inspiration from observations in existing datasets. In SPIDER, 93.9% of 'join-on' clauses are between columns with foreign-key relations, and an additional 4.2% are between columns sharing the same name. This is consistent with our intuition about the SQL language, where join-on clauses are most closely related to foreign-key relations. The popularity of same-name columns apart from foreign keys is partly a result of missing foreign-key relations and partly a reflection of the fact that columns with the same names are likely relevant.
We therefore set up a table relation graph for each database to model the probabilities that a topic defined by join-on clauses is meaningful. When two columns have a foreign-key relation or share the same name, we add an edge between their corresponding tables (foreign-key relations are given higher weights for frequency). Multiple edges between pairs of tables are reduced by sum. To maintain the completeness of the grammar space, we assign a small 'background radiation' weight between each pair of tables.
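As a concrete illustration, the following is a minimal sketch of how such a table relation graph could be assembled. The specific edge weights, the same-name matching and the background value are illustrative assumptions, not the exact constants of our implementation.

```python
from collections import defaultdict

# Illustrative weights; the actual constants are tuned in our implementation.
FK_WEIGHT = 3.0          # foreign-key edges count more towards meaningful joins
SAME_NAME_WEIGHT = 1.0   # same-name columns are weaker evidence of relatedness
BACKGROUND_WEIGHT = 0.05 # keeps every table pair reachable (grammar completeness)

def build_table_relation_graph(tables, columns, foreign_keys):
    """tables: list of table names;
    columns: dict mapping table name -> list of column names;
    foreign_keys: list of ((table, column), (table, column)) pairs."""
    weights = defaultdict(float)

    # Foreign-key relations: the strongest signal for join-on clauses.
    for (t1, _c1), (t2, _c2) in foreign_keys:
        weights[(t1, t2)] += FK_WEIGHT
        weights[(t2, t1)] += FK_WEIGHT

    # Same-name columns: weaker signal, partly covering missing foreign keys.
    for i, t1 in enumerate(tables):
        for t2 in tables[i + 1:]:
            shared = set(columns[t1]) & set(columns[t2])
            weights[(t1, t2)] += SAME_NAME_WEIGHT * len(shared)
            weights[(t2, t1)] += SAME_NAME_WEIGHT * len(shared)

    # Background radiation: a small weight between every pair of tables,
    # so any multi-table topic remains possible (edges are summed, not replaced).
    for t1 in tables:
        for t2 in tables:
            if t1 != t2:
                weights[(t1, t2)] += BACKGROUND_WEIGHT
    return weights
```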
The topic for each SQL query can then be modelled as a sub-graph of this table relation graph, and the transition distribution given previously chosen tables can be modelled with a stochastic version of Prim's MST algorithm (Prim, 1957), formalized as

P(t_k = e.To | Φ) ∝ e.w · e.To.w,   for edges e leaving Φ,

where Φ = {t_1, ..., t_{k-1}} is the set of previously chosen tables, e ranges over edges from a table in Φ to a table outside Φ, e.To is the candidate table, e.w is the edge weight, and e.To.w is the Pagerank weight of the candidate table. With these transition probabilities, the problem is reduced to choosing the first table. To do this, we need a prior distribution over all tables. Again, we use Pagerank for this purpose: first, each table is assigned an initial importance according to its columns; then we run Pagerank on the table relation graph with a random-jump probability of 0.2 to get a context-aware importance among tables, which is then normalized into a distribution. Note that we reverse the edge directions so that weight accumulates from the primary side to the foreign side of foreign-key relations, making the foreign sides more likely to be chosen as the first table, as we would hope.
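A minimal sketch of the topic sampling step under these definitions is given below. Function and variable names are illustrative; the halting criterion and the Pagerank computation (assumed precomputed on the reversed graph) are simplified.

```python
import random

def sample_topic(tables, weights, pagerank, num_tables):
    """Sample a topic: pick an initial table from the Pagerank prior,
    then expand with a stochastic version of Prim's algorithm.
    `weights[(a, b)]` is the table-relation edge weight and `pagerank[t]`
    the context-aware table importance from the previous step."""
    # Prior over the first table: normalized Pagerank scores.
    first = random.choices(tables, weights=[pagerank[t] for t in tables])[0]
    chosen = [first]

    while len(chosen) < num_tables:
        # Candidate edges cross the cut between chosen and unchosen tables;
        # each is scored by its edge weight times the candidate's Pagerank weight.
        cands, scores = [], []
        for t_from in chosen:
            for t_to in tables:
                if t_to not in chosen and weights.get((t_from, t_to), 0) > 0:
                    cands.append((t_from, t_to))
                    scores.append(weights[(t_from, t_to)] * pagerank[t_to])
        if not cands:
            break
        edge = random.choices(cands, weights=scores)[0]
        chosen.append(edge[1])
    return chosen
```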
In sum, as the first step of constructing an SQL query from scratch, we settle its topic. We first sample an 'initial table' from a prior distribution over tables, which is 'Concert' in the case of our example in Figure 1; we then iteratively expand to other tables, halting after a random number of steps, which in the case of our example results in the value of the Topic row in Figure 1.

Context-dependent Sampling
After sampling a topic from the table relation graph, we move on to sampling the whole SQL query from a concise SQL query grammar as in Figure 2. We start from a root node Z and recursively dive down until all branches have hit terminal nodes (colored red in Figure 2). An example is shown in Figure 3.
We follow a depth-first traversal order and, to create SQL query sets analogous to real-world queries, we condition the candidate distribution at each step on the decision history. Namely, we assign a larger probability mass to relevant candidates and avoid contradictory or redundant candidates, thereby asserting control over clause structures.
On one hand, we want the resulting SQL queries to make sense in the real world; on the other hand, we do not want their distribution to over-fit to existing domains. Thus, in practice we employ a conservative heuristic approach: we set up rules by collecting patterns of 'bad' queries and other domain-agnostic patterns from trials, and prune the space by tuning distributions toward 'good' combinations and against 'bad' ones.
For example, the rule "a column is more likely to be chosen for a 'WHERE' clause if it was chosen in the previous 'WHERE' clause" tunes probabilities against queries like select editors' names from journal committees and their corresponding journals and editors, whose age is smaller than 50 and journal's sales is larger than 1600, and in favour of those like select editors' names whose age is larger than 40 and smaller than 50.
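The following sketch illustrates how such history-conditioned re-weighting can be wired into the sampling of a production rule. The rule interface, candidate representation and multipliers are illustrative assumptions, not our exact implementation.

```python
import random

def sample_production(candidates, base_probs, history, rules):
    """Sample one expansion conditioned on the decision history.
    `candidates`: possible expansions at this node; `base_probs`: their
    context-free probabilities; `history`: a summary of ancestor nodes,
    left siblings and their descendants (a dict here for simplicity);
    `rules`: heuristics mapping (candidate, history) to a multiplier."""
    scores = []
    for cand, p in zip(candidates, base_probs):
        for rule in rules:
            p *= rule(cand, history)  # >1 favours, <1 penalises, 0 prunes
        scores.append(p)
    if sum(scores) == 0:              # everything pruned: fall back to the prior
        scores = list(base_probs)
    return random.choices(candidates, weights=scores)[0]

# Example heuristic: prefer re-using the column of the previous WHERE condition.
def prefer_previous_where_column(cand, history):
    prev = history.get("last_where_column")
    if prev is not None and cand.get("type") == "where_column":
        return 3.0 if cand["column"] == prev else 0.5
    return 1.0
```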
Additionally, to reduce redundancy, we validate all candidate clauses at each step by executing them against the databases and collecting responses. We compare query response before and after adding a candidate clause, and screen out clauses that either make no difference or result in empty responses. We present the full set of rules in Appendix B.
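The execution-based validation can be sketched as follows, assuming the databases are available as SQLite files and that both the partial query and the query extended with the candidate clause are executable strings (a simplification of our actual procedure).

```python
import sqlite3

def clause_is_useful(db_path, query_without, query_with):
    """Keep a candidate clause only if it changes the query response
    and does not make the response empty."""
    with sqlite3.connect(db_path) as conn:
        before = conn.execute(query_without).fetchall()
        after = conn.execute(query_with).fetchall()
    if not after:            # empty response: screen the clause out
        return False
    return after != before   # no difference: the clause is redundant
```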

From SQL to Pseudo-NL
Following previous work, to translate SQL queries into NL questions, we first use a template-based approach to convert them into pseudo-NL questions. Similarly to Cheng et al. (2018), we deterministically convert complex SQL queries into sequences of pseudo-NL questions to make annotation easier. In practice, with the more complex SQL clause structures, we find it not ideal to split questions into sequences as granular as those of Cheng et al. (2018), because annotators get lost in the labyrinth of coreferences between questions in the same sequence. Thus we re-balance the trade-off between the number of sentences and their individual complexity towards longer but fewer sentences, so that crowd workers can still follow them.
Notably, while the SQL language generally looks similar to natural language, its FROM clauses with table joining are very unnatural and, when many tables are involved, can make literal translations impenetrable. Unlike in NL questions, where there is an integrated topic, in the SQL language the topic defined by the 'FROM' clause can be long and confusing. Take the example in Figure 1: its topic 'Concerts and Stadiums' in the form of an SQL query becomes 'Concert join Stadium on Concert.stadium id = Stadium.id'. Worse still, multiple tables also make the meaning of the wildcard column '*' confusing in clauses such as 'select count(*)'.
Luckily, we have observed that for a pair of tables joined by a foreign-key relation, we can always consider the foreign side as the 'main' table of the pair, since it is the one from which the primary side is extended. Therefore, we define a directionality for a topic sub-graph of the table relation graph: primary → foreign as root-wise and foreign → primary as leaf-wise; for table pairs linked by a same-name relation, an edge is kept in both directions.
Then, for multi-table queries, we assume that the table(s) at the root-most position form the "dominant concept" of the topic. Since topic sub-graphs are predominantly trees, there is mostly one dominant concept for each query. Whenever possible, we replace all pseudo-NL phrases for table joining, such as the one above, with expressions like 'Concerts and their corresponding Stadiums', and replace all phrases for the wildcard column 'count(*)' with expressions like 'the number of Concerts'. This way, the pseudo-NL questions are simplified and the annotation burden is eased.
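A minimal sketch of how the dominant concept can be read off the directed topic sub-graph, under the root-wise orientation defined above; the function and argument names are illustrative, and same-name edges (kept in both directions) are ignored here for simplicity.

```python
def dominant_concepts(topic_tables, foreign_keys):
    """Find the root-most table(s) of a topic.
    `foreign_keys`: list of ((foreign_table, col), (primary_table, col)) pairs.
    Root-wise edges point from the primary side to the foreign side, so a
    table with no outgoing root-wise edge inside the topic is root-most."""
    topic = set(topic_tables)
    outgoing = {t: 0 for t in topic}
    for (foreign_table, _fc), (primary_table, _pc) in foreign_keys:
        if foreign_table in topic and primary_table in topic:
            # primary -> foreign is root-wise, so the primary side gains an
            # outgoing edge and cannot be the dominant concept.
            outgoing[primary_table] += 1
    return [t for t in topic_tables if outgoing[t] == 0]

# For the topic {Concert, Stadium} with Concert.stadium_id referencing
# Stadium.id, this returns ['Concert'], matching the example in Figure 1.
```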

From Pseudo-NL to NL
The last part of our pipeline involves crowd-paraphrasing these pseudo-NL question sequences into fluent NL questions. We recruit workers on the Amazon Mechanical Turk (AMT) platform, present tasks to AMT workers randomly, and pay $0.25 for each completed task.
In each task, we present the workers with a pseudo-NL question sequence paired with examples from the DB response. In pilot trials, we found that annotators tend to keep fragments from the pseudo-NL questions even when these are clearly rigid. We hypothesize that this is an exposure effect: annotators' exposure to the pseudo-NL questions influences their own style of expression.
As another of our novel contributions, we propose a countermeasure to this exposure effect. First, we frame the task for annotators as helping their foreign friends sound local. Further, we present a personalized example specific to each generated SQL query. These personalized examples are taken from expert-annotated datasets in other domains, but give crowd workers a general idea of what level of naturalness is expected and in what way.
Each personalized example consists of a pseudo-NL question sequence, an expert-annotated NL question and an example DB response. To provide the most relevant hint, we retrieve from existing data the entries most similar to each created SQL query, where similarity is measured as the cosine similarity between engineered feature vectors (detailed in Appendix C). We retrieve the top 10 example queries with the highest similarity and randomly pick one as the personalized example to display.
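A minimal sketch of this retrieval step, assuming the weighted feature vectors of Appendix C have already been computed for the created query and for every candidate example (names and the random-generator interface are illustrative).

```python
import numpy as np

def retrieve_personalized_example(query_vec, example_vecs, rng, top_k=10):
    """Return the index of one expert-annotated example for a created query.
    `query_vec` and `example_vecs` are the engineered, weighted feature
    vectors; we take the top_k most similar examples by cosine similarity
    and randomly pick one to display."""
    q = np.asarray(query_vec, dtype=float)
    E = np.asarray(example_vecs, dtype=float)
    sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-12)
    top = np.argsort(-sims)[:top_k]      # indices of the most similar examples
    return int(rng.choice(top))          # e.g. rng = np.random.default_rng(0)
```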
We employed only English-speaking AMT workers with a 95%+ acceptance rate and 50+ accepted tasks, to restrict this paraphrasing task to a set of competent workers. However, empirically we still found a considerable number of workers submitting nonsensical paraphrases, apparently not understanding the task or giving up on the complex input. Thus, we further restricted access to trusted workers who had previously performed well on our task.

Corpus Construction Details
We experiment on the basis of the SPIDER dataset, with the data split into seen domains and novel domains as detailed in Table 1 (the 10 selected novel domains are: orchestra, singer, real estate properties, tvshow, battle death, voter 1, student transcripts tracking, concert singer, world 1, and course teach). Accordingly, we define the seen-domain-dev-set and novel-domain-test-set from the SPIDER dev set, and rename the SPIDER train set the seen-domain-train-set. Additionally, we collect two data sets in the novel domains, novel-domain-ours-set and novel-domain-oracle-set. Novel-domain-ours-set is our target dataset, with entries constructed from scratch using only the databases and our full pipeline. Novel-domain-oracle-set entries start from the novel-domain gold SQL queries, annotated with our SQL-to-NL method. Moreover, we create a new data split called seen-domain-small-set, which has the same size as novel-domain-ours-set but is randomly sampled from the expert-annotated seen-domain-train-set.
We use a linear regression to derive the number of SQL queries to construct for each database, w.r.t. the number of tables, columns and foreign-key relations. Each created SQL query is paraphrased into natural language by 2 annotators, as in SPIDER; paraphrases that are too short or too long are filtered out. The resulting data sizes are shown in Table 1.
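A minimal sketch of how such a regression could be fit and applied, assuming per-database counts of tables, columns, foreign keys and gold queries are available from the seen domains; the exact fitting targets and procedure are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_query_count_model(seen_dbs):
    """Fit #queries ~ #tables + #columns + #foreign-keys on seen-domain
    databases. `seen_dbs` is a list of dicts with those counts plus the
    number of gold queries each database has (an assumed input format)."""
    X = np.array([[db["n_tables"], db["n_columns"], db["n_foreign_keys"]]
                  for db in seen_dbs], dtype=float)
    y = np.array([db["n_queries"] for db in seen_dbs], dtype=float)
    return LinearRegression().fit(X, y)

def queries_to_construct(model, n_tables, n_columns, n_foreign_keys):
    # Predict a query budget for a novel database and round to a count.
    pred = model.predict([[n_tables, n_columns, n_foreign_keys]])[0]
    return max(1, int(round(pred)))
```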
In total, the paraphrasing and human evaluation (to be elaborated below) cost us $349.34.

Human Evaluation
To test the effectiveness of our SQL query construction and SQL-to-NL conversion methods respectively, we conduct two human evaluation experiments with participants recruited on AMT.
To evaluate the constructed SQL queries, we use the corresponding computer-generated pseudo-NL as a proxy for the SQL queries so that crowd workers can be involved. We present each participant with a pseudo-NL question and ask them to indicate whether it is succinct, sensible, relevant and complex. Each question is randomly chosen either from our created novel-domain-ours-set or from expert-annotated SQL queries in novel-domain-oracle-set, where the choice is hidden from participant workers. We evaluate on all entries in the 10 novel domains, with results presented in Table 2. As shown, compared to expert-annotated queries, participants considered a larger proportion of our queries complex, but also a larger proportion concise, sensible and relevant. Although human evaluation scores are subjective and can fluctuate across individuals, this at least shows that, from the annotators' point of view, our created SQL queries are comparable to expert-annotated ones on several dimensions.
To evaluate our SQL-to-NL conversion method, we compare our crowd-sourced questions from novel-domain-oracle-set with expert annotated NL questions from novel-domain-test-set. Since both are aligned to the same set of gold SQL queries, we show participants pairs of NL questions referring to the same SQL query. We ask the participants how similar the question pairs are, and which of them is smoother in language.
With crowd workers as judges, we cannot directly measure how rigorously our NL questions preserve the semantics of the SQL queries. However, by taking the expert-annotated questions as the gold standard of query semantics, and inspecting the similarity between our questions and the expert-annotated questions for the same gold-standard SQL queries, we can indirectly measure the level of rigorousness our conversion method is capable of. Results show that the average similarity between our questions and the gold questions is 3.78 on a scale of 1 to 5, indicating a good alignment between the meanings of the question pairs. As for preference, 45% of the time our question is preferred, 39% of the time the gold question is preferred, and 16% of the time the two are considered equally smooth in expression. This result verifies the validity of our SQL-to-NL conversion method.

Evaluation Setting
To further evaluate our pipeline's effectiveness, we do an extrinsic evaluation in the context of domain transfer. Namely, we test the capability of our created data to help semantic parsers generalize to novel domains. Below we first define 4 dataset settings and 2 training scenarios:

Dataset Settings
• PRETRAIN Trains on seen-domain-train-set (for definitions of the data splits, please refer to Table 1), validates on seen-domain-dev-set, tests on novel-domain-test-set. This reflects the process of training a model on seen domains and then applying the trained model to novel domains;
• OURS Trains on novel-domain-ours-set, tests on novel-domain-test-set. This is our target setting, which reflects training with our semi-automatically constructed in-domain data and then testing on real-world queries in the same domains;
• GOLD Trains on half (220 entries) of the novel-domain-test-set, tests on the other half. This setting approximates a theoretical upper bound, reflecting how much accuracy gain can be achieved with gold in-domain data;
• TRAIN(SMALL) Trains on seen-domain-small-set, tests on novel-domain-test-set. This setting is an out-of-domain expert-annotated baseline for the OURS setting in the RANDOM-INIT scenario, and answers the question "how powerful is our created in-domain data compared to the expert-annotated but out-of-domain data, in terms of training models from random initialization?".

Training Scenarios
• RANDOM-INIT Start training from randomly initialized model parameters. For models with BERT, initialize BERT parameters with pretrained checkpoints;
• FINETUNE Start training from model checkpoints acquired from RANDOM-INIT training on PRETRAIN data.

We conducted experiments with 3 popular recent Text-to-SQL semantic parsers: lang2logic (Dong and Lapata, 2016), IRNet-BERT (Guo et al., 2019), and RAT-SQL-BERT. lang2logic is the first semantic parser to employ the Seq2Seq paradigm, IRNet-BERT is the first practically effective semantic parser on the SPIDER challenge, and RAT-SQL-BERT is the latest reproducible SOTA when using a BERT-Large encoder. The lang2logic models were originally written in Lua and are re-implemented in PyTorch; to serve as a vanilla baseline, the seq2seq setting is used instead of the more complex seq2tree setting. For RAT-SQL-BERT models, due to the memory limits of our 1080 Ti GPUs and for a fair comparison with IRNet-BERT, we use the bert-base-uncased version as in IRNet-BERT, instead of the bert-large-whole-word-masking version of the original implementation.
For each parser, we first train under the RANDOM-INIT scenario with PRETRAIN data; we then FINETUNE the pretrained model separately with OURS and GOLD data. Additionally, we train each parser under the RANDOM-INIT scenario with OURS and TRAIN(SMALL) data respectively, to evaluate our data outside the scope of domain transfer.
For all RANDOM-INIT models we use the same hyper-parameters as in their original settings. For FINETUNE models, following the intuition that fine-tuning should use learning rates no larger than pretraining, we do log-uniform sampling over 1/3, 1/10, 1/30 and 1/100 of the original learning rates, as well as the original learning rates themselves; for models with BERT encoders, we further grid-search both keeping the BERT learning rate fixed and aligning it with the other parameters.
We have attempted to compare with previous work on semi-automatic dataset construction. However, the method of Wang et al. (2015) is restricted to LambdaDCS logical forms and is not applicable to our setting of multi-table SQL queries; the data construction method of Cheng et al. (2018) is not contained in their open-source codebase, and the first author was unfortunately not reachable for that implementation. Nonetheless, we discuss Zhang et al. (2021) for comparison, a recent work on large-scale Text-to-SQL data augmentation with context-free sampling of SQL queries and hierarchical neural NL generation without human intervention.

Table 3: Exact Match accuracy in percentage points. Markers next to the scores denote the learning rates used (1e-4, 3e-4 (†) and 1e-5). To faithfully reproduce a domain transfer setting, the accuracies of PRETRAIN models are reported on the novel-domain-test-set; reported accuracies may therefore differ from those reported on the development set of SPIDER.

Results and Discussions
Evaluation results are shown in Table 3. Accuracies are evaluated on the novel-domain-test-set of SPIDER. As shown, extra in-domain data collected with our pipeline improves novel domain accuracy by more than 1% for both lang2logic and IRNet-BERT models.
We also tried applying the same fine-tuning to the IRNet and RAT-SQL parsers without BERT. However, that did not increase accuracy. We attribute this difference to the fact that, while our additional data provides information about the novel domain, its language style is not the same as that of the SPIDER train set: our crowd-paraphrased questions are bound to differ in style from expert-annotated questions. The conventional recurrent encoders, trained only on the SPIDER train set, fail to capture the meaning of questions in our fine-tuning dataset. With BERT contextualizers, on the other hand, which have been trained on text at the scale of billions of tokens, the language of our questions looks more familiar to the models, and they can more successfully absorb the domain-related information encoded in our data. As BERT-Large models are bigger and more powerful, we would expect the accuracy gain to be larger with RAT-SQL-BERT-Large, and expect the same trend for other semantic parsers in general.
The finetune-gold result provides an approximate upper bound, obtained from expert-annotated data pairs of roughly the same size as ours. It is encouraging that we are able to correct roughly 20% of the correctable errors by this standard. Further gains could reasonably be expected from increasing scale.
Comparisons between models trained under the RANDOM-INIT scenario also show interesting findings. Since with both random-init-ours and random-init-train(small) the data sizes are a magnitude smaller than the SPIDER training set used for PRETRAIN, parsers trained under these two settings perform less competitively than their FINETUNE counterparts. But among themselves, training from RANDOM-INIT with OURS data is generally comparable to training with the expert-annotated TRAIN(SMALL) data of the same size, and in the case of RAT-SQL-BERT it even exceeds TRAIN(SMALL).

The results in Table 3 are for the best fine-tuned model checkpoint on our test data, the novel-domain-test-set. This choice reflects the scenario of our chosen task: we start with novel domains, having only the databases available for use. We semi-automatically create an in-domain fine-tuning dataset and train semantic parsers with it. We envisage deploying the fine-tuned models and arriving at the best model checkpoint by the use of feedback from users, for which optimizing on the novel-domain-test-set is used as a proxy.
We are nevertheless interested in knowing if we can choose the best model checkpoint independently of our test. In terms of our hypothetical scenario, it reflects whether we can achieve good performance with real queries, by assuming access to a fixed development set, and choosing the best model checkpoint according to performance on that fixed set.
The standard practice here is to validate on a set-aside development set, and then report the accuracy on the test set. However, the 440 entries of the novel-domain-test-set are all the expert-annotated data that we have for our novel domains. In response to this dilemma, we split the novel-domain-test-set in half, into two subsets A and B, and do cross-validation between them: we use one subset for picking the model checkpoint with the highest accuracy, and assess its accuracy on the other (and vice versa). Under this setting, we still achieve an accuracy gain of 0.74% with IRNet-BERT and 1.25% with lang2logic. Contemporaneously with our work, Zhang et al. (2021) generated a much larger set of 50,000+ synthetic training examples (which they were able to do by not involving human judges). Under the same evaluation strategy as ours in Table 3, and starting from a different IRNet-BERT pretrained baseline of 59.5% (the difference from our baseline is mainly due to their use of the SPIDER development set and our use of the novel-domain-test-set), they report an augmented accuracy of 61.7% on the SPIDER development set, an increase of 2.2%. However, we achieve half of that increase with only around 1% of their amount of data. In the absence of access to their code, we have been unable to determine the performance that Zhang et al. (2021) would obtain from a comparably small dataset, but we feel confident that gains comparable to their full dataset could be obtained from our method with more modest increases in scale.
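Returning to the cross-validated checkpoint selection described above, a minimal sketch of the protocol is given below; variable names are illustrative, and the per-checkpoint accuracies on the two subsets are assumed to be precomputed.

```python
def cross_validated_accuracy(checkpoint_accs_A, checkpoint_accs_B):
    """Two-fold checkpoint selection on a split of the novel-domain-test-set.
    `checkpoint_accs_A[i]` / `checkpoint_accs_B[i]` are checkpoint i's
    accuracies on subsets A and B. Pick the checkpoint on one subset,
    report its accuracy on the other, and average the two directions."""
    best_on_A = max(range(len(checkpoint_accs_A)), key=lambda i: checkpoint_accs_A[i])
    best_on_B = max(range(len(checkpoint_accs_B)), key=lambda i: checkpoint_accs_B[i])
    return (checkpoint_accs_B[best_on_A] + checkpoint_accs_A[best_on_B]) / 2
```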

Conclusion
We have presented a novel approach to constructing in-domain Text-to-SQL data, following the three-step paradigm of Wang et al. (2015). We randomly select a topic of interest from a table relation graph prior to building the actual query, and sample query clauses in a context-dependent manner. We identify the "dominant concept" of each topic to simplify the converted Pseudo-NL, and retrieve personalised examples as annotation scaffolds for crowd-paraphrasing. Our experiments show that our in-domain data is comparable with expert-annotated data, and capable of increasing the accuracy of the SOTA IRNet-BERT semantic parser by up to 1%.
For future work, we plan to explore more sophisticated probabilistic models to control SQL query construction, and to pair our query construction method with recent work on SQL-to-NL translation, so as to bring our method to larger scale.

B Full Set of Rules

• Subject columns in 'GROUP BY' clauses are likely to be selected.
• 'GROUP BY' clauses always take effect either by aggregating 'SELECT' columns, 'HAVING' conditions or 'ORDER BY' clauses.
• Columns present in 'SELECT' are likely also present in 'ORDER BY'.
• When two queries are linked together via 'UNION', 'EXCEPT' or 'INTERSECT', it is likely that the two queries share a similar structure, differing only in one or two components such as 'WHERE' conditions.
Example: Foreign key
SQL: SELECT avg(T1.killed) FROM perpetrator AS T1 JOIN people AS T2 ON T1.people_id = T2.id WHERE T2.height < 1.8
NL: How many people were killed by perpetrators shorter than 1.8m, on average?

Example: Same Name
SQL: SELECT T1.id FROM trip AS T1 JOIN weather AS T2 ON T1.zip_code = T2.zip_code GROUP BY T2.zip_code HAVING avg(T2.mean_temperature_f) > 60
NL: Give me ids for all the trips that took place in a zip code area with average mean temperature above 60.

C Feature Vector for Personalized Examples
In retrieving the personalized examples, the feature vectors for measuring similarity between SQL queries involve the following feature values. Features related to the "SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY" and set-operation clauses are listed in Tables 6, 7, 8, 9, 10 and 11, respectively.
Feature about "SELECT" Weight Number of "SELECT" clauses 1.0 Wildcard "*" in column 4.0 MAX in column 2.0 MIN in column 2.0 COUNT in column 2.0 SUM in column 2.0 AVG in column 2.0 Feature about "FROM" Weight Number of tables in "FROM" 1.0 All tables connected by "JOIN ON" 2.0 Table 7: Feature vector values regarding the "FROM" clauses, with weights specified to the right of each feature. Figure 4: Flowchart illustration of how SQL queries are split into sequences of computer-generated questions. Triangle boxes and connection lines indicate the sequential relationship between these templates, in rectangle boxes are individual templates. In these templates, colored red are the slots to fill in. To build a sequence of Pseudo-NL questions, a program walks through the chart from start to finish, and lists the resulting instantiated templates iteratively as output.
Feature about "WHERE" Weight Empty "WHERE" clause 2.0 Number of "WHERE" clauses 1.0 "*" in column 4.0 MAX in column 1.0 MIN in column 1.0 COUNT in column 1.0 SUM in column 1.0 AVG in column 1.0 Sub-query in clauses 4.0 Column-valued clauses 4.0 'between' as operator 1.0 'equal' as operator 1.0 'larger than' as operator 1.0 'smaller than' as operator 1.0 'not larger than' as operator 1.0 'not smaller than' as operator 1.0 'not equal' as operator 1.0 'in' as operator 1.0 'like' as operator 1.0 Feature about "GROUP BY" Weight Number of "GROUP BY" clauses 1.0 Involves "HAVING" clause 2.0 Involves "HAVING" with Sub-query 2.0 Table 9: Feature vector values regarding the "GROUP BY" clauses, with weights specified to the right of each feature.