Searching for Better Database Queries in the Outputs of Semantic Parsers

The task of generating a database query from a question in natural language suffers from ambiguity and insufficiently precise description of the goal. The problem is amplified when the system needs to generalize to databases unseen at training. In this paper, we consider the case when, at the test time, the system has access to an external criterion that evaluates the generated queries. The criterion can vary from checking that a query executes without errors to verifying the query on a set of tests. In this setting, we augment neural autoregressive models with a search algorithm that looks for a query satisfying the criterion. We apply our approach to the state-of-the-art semantic parsers and report that it allows us to find many queries passing all the tests on different datasets.


Introduction
Generating a database query from a naturallanguage description of the user's intent is a longstanding and important task.In the recent years, most of the community focus was on the Spider dataset (Yu et al., 2018), which poses the task in the zero-shot regime, meaning that a method has to generalize to databases unseen at training.The Spider dataset contains English questions and SQL queries.The progress has been remarkable, and the accuracy has moved from below 30% to above 70%.A part of this success can be attributed to the adoption of pre-trained transformer models like BERT (Devlin et al., 2019) into most of the pipelines.
Given such progress, it is natural to ask whether we are getting closer to solving the problem.Several recent studies have noted that the task might be harder than it looks.Finegan-Dollak et al. (2018) found that many single-database datasets had identical queries in both train and test sets and showed that using such splits effectively reduced the problem to classifying the queries from the train set.Shaw et al. (2021) continued the study in the multidatabase setting and showed that the compositional generalization was hard to achieve, and even to measure it, one should be very careful with splits.In a different line of thought, Suhr et al. (2020) examined how the models trained on Spider generalize to other datasets and reported that generalization was challenging.Even within one dataset, many questions have several interpretations leading to different queries, and annotation policies do not cover these ambiguities or cover them differently.
Acknowledging that the zero-shot setting might be too difficult to tackle as is, we aim to better define and simplify the problem to achieve better results in terms of the number of correctly generated queries.In this task, most modern models produce a distribution over all possible outputs, which can guide the search at the test time.
We observe that if the search algorithm has access to a criterion that can evaluate the output by treating it as a database query, the overall method can produce much better results.We consider the following criteria ordered by their "strength": a query is executed without errors, a query produces output from correct columns, a query produces a correct result on one test database, and a query produces correct results on a set of test databases.We experiment with different search methods and report that the complete anytime beam search (Zhang, 1998) outperforms sampling-based alternatives.
Many practical cases arise when the user is willing to trade off some of their time to improve the output query.Our approach allows the user to obtain a better query by interactively guiding the search via providing the target output columns or the answer on one or more test databases.The user is expected to supply this information without the gold query.Notably, the execution criterion does not require extra user input but relies on executing generated queries.
In addition, the test time database can be out of domain w.r.t. the training set.One can annotate gold queries for fine-tuning on new databases to improve out-of-domain performance, but this can be prohibitively expensive.Our approach gives a way to improve out-of-domain results without extra annotated training data.We view our approach as a way to control the trained model leading to a more reliable and responsible query synthesis.
In this paper, we make several observations.The complete anytime beam search works with different search criteria and with all the three models.Reasonable results can be obtained with the maximum beam size of 100, which fits on a single modern GPU.Searching with the execution criterion can significantly improve the quality of the decoders that generate output as an unconstrained token sequence, e.g., T5, and using such criterion does not require extra user input.Searching for the queries that return a correct output on one database allows finding many queries that provide a correct output on that database.Therefore, such searching w.r.t.one database can result in false positives, and it is important to evaluate the queries on a set of databases.Based on the method of Zhong et al. (2020), we built the test suite of databases for the evaluation.With these test suites, we show that there are multiple false positives among the queries that pass one test, while searching for the queries that pass the test suite produces the outputs of higher quality.
Finally, we experiment with the generalization of the models trained on Spider to the Geo-Query (Zelle and Mooney, 1996), IMDB, Yelp (Yaghmazadeh et al., 2017) and Academic (Li and Jagadish, 2014) datasets.We show that searching w.r.t.different criteria still works under this distribution shift, and searching w.r.t. the criteria with tests is often comparable to fine-tuning the network to a test dataset directly.This paper is organized as follows.In Section 2, we review our setting.In Section 3, we provide the details of our method.Section 4 provides the details of the test-suite construction procedure, Section 5 describes the experimental setup.In Section 6, we provide the experimental results and discussion.We review some related works in Section 7 and conclude in Section 8.

Preliminaries
We consider the problem of generating queries to databases given the description of the user's intent in natural language in the cross-database setting where the train, validation and test splits contain different databases.Models trained in this setting, in theory, can be evaluated on any database.
A typical model for the cross-database setting is an encoder-decoder neural network.Encoders typically consist of a pre-trained BERT-like transformer followed by a specialized encoder that can incorporate the database structure in some form (Guo et al., 2019;Wang et al., 2020;Cao et al., 2021;Cai et al., 2021).Sometimes the BERT part is further fine-tuned on database-related objectives (Yu et al., 2021;Deng et al., 2021).The encoder input is a concatenation of the tokenized question and a sentence representation of the database schema separated by the special delimiter token.The representation of the database schema consists of the tokenized table and column names and values related to the question.These values are commonly extracted by string matching with question tokens (Lin et al., 2020).Decoders are typically autoregressive based on LSTM or transformers.Some decoders do not check the syntactic correctness of the output and its consistency with the database.Some provide output w.r.t. a grammar (Yin and Neubig, 2017); some use post-hoc checks with parsers.
In this paper, we experiment with three models: T5-3B fine-tuned on the Spider dataset by Scholak et al. (2021), BRIDGE (Lin et al., 2020) and SQ-QDMR (Saparina and Osokin, 2021).We provide a detailed description of these models in Section 5.3.

Search with Models
We now describe our approach to searching for queries on top of a learned model.We first generate full query candidates using a search method and then select the first one that passes the selected search criterion.We show possible search criteria in Section 3.1 and search methods in Section 3.2.

Search Criteria
Execution criterion.To avoid syntactically incorrect queries, we can prune the search with the execution criterion.The query passes this criterion if it can be executed on the input database without errors.In particular, the query has to contain valid table and column names.These properties are not guaranteed for the unconstrained decoders as T5.Thus the execution criterion can be extremely useful for such models.
Output column match.With this criterion, we compare the output columns (that the query will select) with the correct ones.Firstly, wrong output columns is a common mistake in the text-to-SQL parsers (Guo et al., 2019;Lin et al., 2020;Suhr et al., 2020).Secondly, the output columns can provide domain knowledge and shed some light on the user intent in a realistic scenario when the input question is ambiguous (Suhr et al., 2020;Lee et al., 2021).
One test.This criterion compares the result of query execution on a given database (one test case) with the correct one.With this criterion, we search for a correct query in terms of execution accuracy on the input database (Section 5.2).
Test suite.This criterion checks if a query passes a set of tests.Each test case corresponds to a particular database, and all test databases share the same schema.This criterion is inspired by a test suite of databases with high code coverage proposed by Zhong et al. (2020).The set of databases is designed to distinguish the correct queries from potential false positives.Searching with this criterion is equivalent to the search for a correct query in terms of test-suite accuracy (Section 5.2) .

Search Methods
Top-k and Top-p (Nucleus) Sampling (Fan et al., 2018;Holtzman et al., 2020) draw samples from the truncated distribution: the probability mass is re-weighted between the k most probable elements in top-k sampling and between the elements with cumulative probability mass exceeding p in top-p sampling (k, p are hyperparameters).
UniqueRandomizer (Shi et al., 2020) is a method to incrementally sample sequences without replacement.The samples are drawn until the stopping condition is reached (one of the search criteria in our case).The probabilities of selected elements are reduced after each iteration of sampling to improve diversity in samples.
Complete Anytime Beam (CAB) Search of Zhang (1998) extends the regular beam search by running it several times with increasing beam sizes.Importantly, the beams produced by beam search In these works, the authors limit the number of hypotheses coming from each element of the previous beam (we will refer to this upper bound as the width of the beam search).Between outer CAB iterations, we also increase the width by a constant value and multiply the beam size by a constant factor.The schedules of the beam size and beam-search width are important hyperparameters.

Test Suite Construction
Testing on one database is generally not enough to ensure the semantic correctness of the generated query, but running the query on too many databases can be computationally inefficient.The inefficiency problem is especially acute in our task due to the large number of query candidates that should be tested and several rounds of the searching process.We build our test suites by modifying the method of Zhong et al. (2020), which relies on generating the so-called neighbor queries from a given set of gold queries and randomly sampling databases to distinguish gold queries from as many neighbors as possible.We observed two key drawbacks of the test databases generated by Zhong et al. (2020).First, the test suites contained many databases, and some were unnecessarily large, which resulted in very long testing on them.Second, outputs of many queries were often empty (or zero for queries with aggregators) on these test databases.If the output of the gold query is empty on all the elements of the test suite, it cannot be distinguished from trivial dummy queries.This effect is more salient if the gold query returns empty output on the original database.We alleviate these issues by independently generating test databases for each gold query, explicitly limiting the number of rows in each table and putting extra effort into generating at least one database where the gold query returns a non-empty output.The details of our procedure are provided in Appendix A.
We compare our test suites with the original ones of Zhong et al. (2020) on the five considered datasets described in Section 5.1.For a fair comparison, we generate the independent sets of neighbor queries for each gold query.These neighbor queries are different from the neighbors generated in the process of creating the test suites.In Table 1, we compare the initial databases (1 test), original test suites and our test suites.It can be seen that, in the space of neighbor queries, our test suites have higher code coverage (Cover) than the original databases and the original test suites.With new test suites, the average query execution time (on all the corresponding tests) is reduced 50x across the datasets.We release the test suites we created. 1 5 Experiment Setup
Spider.We use dev and test sets (451 and 521 examples) from the work of Saparina and Osokin (2021): they are parts of the original Spider dev, but some examples (from dev) were repaired.
GeoQuery, IMDB, Yelp and Academic.We use query splits created by Finegan-Dollak et al. (2018), additionally filtered from duplicates and 1 github.com/ramild/TestSuiteexamples with gold SQL queries that crash or execute longer than 5 minutes with the Python package sqlite3.The dataset statistics are provided in Table 7 of Appendix D.

Evaluation Metrics
Exact-set Match (Yu et al., 2018) is an SQLto-SQL comparison metric that reflects the fraction of the predicted queries matching the ground-truth queries.In the matching process, each query is decomposed into fragments that are compared individually so that the metric is not too sensitive to the ordering of independent clauses.This metric does not take into account predicted values and can give a high score to incomplete queries.As SQ-QDMR model produces queries in SPARQL, we cannot use the exact-set match as a primary evaluation metric.
Execution accuracy is designed to compare the queries by their execution output on an original database.In contrast to the exact-set match, this prevents false-negative queries but leaves space for potential false positives.The version provided by Yu et al. (2018) for Spider evaluation has issues in SPARQL-SQL comparison, so we use the version provided by Saparina and Osokin (2021) unless explicitly mentioned otherwise.
Test-suite accuracy (Zhong et al., 2020) approximates the semantic accuracy of the query synthesis models.This metric refers to the share of predicted queries producing the correct answers on all databases from the test suite.We build the test suites for Spider dev and test sets and for all the queries in the other four considered datasets.

Models
We consider three models: T5-3B fine-tuned on Spider (Scholak et al., 2021), BRIDGE (Lin et al., 2020) and SQ-QDMR (Saparina and Osokin, 2021).These models have top execution accuracy among publicly available models on Spider.We also tried to search under our search criteria on top of the bottom-up semi-autoregressive model of Rubin and Berant (2021), but we could not make the search increase the number of correct queries.
For evaluation on Spider, we use the released checkpoints of the best models.BRIDGE training data included question splits of single-database datasets, so we re-train it on Spider-only data to evaluate on query splits of these datasets.For retraining BRIDGE and fine-tuning all models, we use official implementations (see Appendix E).
T5 (Raffel et al., 2020) (Lin et al., 2020).The output of the T5 model is the sequence of tokens representing the SQL query.Note that this model generates output sequence without explicitly considering the SQL grammar and schema consistency.
BRIDGE (Lin et al., 2020) consists of the BERT-based encoder and pointer-generator decoder.The input sequence is formed from the concatenation of question, table and column names, and relevant database values separated by special token and encoded with BERT.The relevant values are selected with fuzzy string matching between question and database values.Column encodings are further enriched with meta-data features such as primary or foreign keys and data types obtained from the feed-forward layer.The LSTMbased decoder with multi-head attention at each step copies question or schema tokens or generates the SQL keywords.During decoding, model chooses columns only from the predicted table to provide schema consistency.An additional static SQL analyzer filters incorrect output queries.
SQ-QDMR (as we refer to the model of Saparina and Osokin ( 2021)) contains RAT-transformer, GraPPa encoders (Wang et al., 2020;Yu et al., 2021) and a grammar-guided LSTM-based decoder (Yin and Neubig, 2017).The SQ-QDMR decoder produces output in the form of grounded intermediate representations derived from QDMRs of the Break dataset (Wolfson et al., 2020).The grounded QDMRs are not directly related to any execution engine and cannot be executed as is, but Saparina and Osokin (2021) implemented a non-trainable translator from grounded QDMR to the SPARQL query language, in which queries can be executed.We can think of grounded QDMRs augmented with this translator as executable database queries.We compare different decoding strategies in our setting (Table 2): top-k and top-p (nucleus) sampling (Fan et al., 2018;Holtzman et al., 2020), Uni-queRandomizer (Shi et al., 2020) and CAB search (Zhang, 1998).We measure the execution accuracy of these search methods under the 1-test criterion on Spider dev.We use the same sampling budgets (1000 for BRIDGE and SQ-QDMR and 800 for T5-3B due to the memory limits), tune p in top-p sampling and the temperature for all methods, more implementation details in Appendix B).
The results demonstrate that a significant number of output queries pass one test after searching with any of these methods, so different decoding strategies can be compatible with our approach.UniqueRandomizer is very time-consuming since it generates samples sequentially in contrast to other methods that generate beams of samples in parallel.CAB search is demanding in terms of the device memory as it has to process the whole beam jointly.
For further experiments, we choose CAB search because it works best for two models.

Impact of the Search Criteria
We apply search under different selection criteria (execution, output column match, test on one database) to T5-3B, BRIDGE and SQ-QDMR on Spider dataset and compare with the greedy and beam search baselines.Table 3 shows the results measured with execution accuracy.Searching on top of all models with different selection criteria increases execution accuracy in almost all cases.
One exception is the search with the execution criterion on top of BRIDGE and SQ-QDMR, the results of which are close to the greedy decoding.The outputs of these systems are almost always executable because BRIDGE runs a static SQL analyzer for filtering, SQ-QDMR decodes according to the QDMR grammar and both models have schema-consistent decoding.The T5 model, in contrast, does not have any grammar or schema The results obtained with PICARD and with searching under the execution criterion are comparable -our search gets most of the benefit over the baseline.PICARD provides slightly better quality and works with smaller beams but requires more effort to incorporate because it is tightly connected with the decoder output vocabulary and grammar.For both approaches, the percent of output queries with execution errors is around 2% in contrast to the baseline T5-3B decoding with 12%.
The search criterion based on the matching of output columns provides even better results.As Table 3 shows, all models benefit from this criterion: execution accuracy increases by 3-4% nearly everywhere.This criterion largely simplifies the task with extra information at the test time.
Search for the queries that pass one test allows finding a significant number of such queries.Passing one test means correct execution result on the input database, so all the queries found with this criterion are correct in terms of the execution accuracy.Thus, these results indicate that our searching approach works, and we can find the correct queries with the corresponding criterion.However, we ob-  4 confirms our findings: searching with the execution criterion helps T5-3B, and searching for correct output columns improves the results of all models.Search for the queries that pass one test results in a significant number of false-positive queries.The correct queries can be found by searching with the test-suite criterion directly.

Efficiency
Time Measurements.The running time during the search is dominated by the time of the decoder for all three models: executing each considered query takes 3% of the decoder time for T5-3B, which is 0.01 sec per run; 53%, 0.02 sec -for BRIDGE; 72%, 0.03 sec -for SQ-QDMR.The total running time depends on the effective beam size used during the search.
The T5-3B model with the execution criterion on top runs in 1.7 sec compared to 3.1 sec reported by the PICARD paper, where both systems were run on 1 NVIDIA A100 GPU.The main reason is that due to CAB, we do not set one beam size in advance and thus, process at least 70% of examples with an effective beam size of 1.
Impact of the Maximum Size of Beam.The maximum size of the beam is an important parameter.Figure 1 shows the dependence between the obtained test-suite accuracy on the Spider dev set and the maximum beam size in the search under the test suite criterion.For all models, we start with the maximum beam size equal to 1, which is equivalent to the greedy decoding and finish with the maximum beam size allowed by our implementation and hardware: 10k for BRIDGE and SQ-QDMR on 1 NVIDIA V100 GPU and 800 for T5-3B on 8 NVIDIA A100 GPUs.Test-suite accuracy improves as the maximum beam size increases.BRIDGE with the maximum beam size equal to 10k achieves 86% of test-suite accuracy and SQ-QDMR -94%.However, the search works well enough even with smaller beams: with the maximum beam size of 100, BRIDGE achieves almost 80%, SQ-QDMR -92%, and T5-3B achieves 88% (with 800, T5-3B achieves 90%).Importantly, the search with the beam size of 100 does not require multiple GPUs for T5-3B.

Experiments on Single-Database Data
To show more benefits of searching under selection criteria, we evaluate it on single-database datasets, GeoQuery, IMDB, Yelp and Academic, with test-suite accuracy (Table 5; see Appendix F for execution accuracy).We use query splits of Finegan-Dollak et al. (2018) and two model types: trained on Spider only and fine-tuned on a particular dataset.The Academic database is very large, so we cannot evaluate one-test criterion on this dataset and fine-tune SQ-QDMR (other datasets do not have QDMR annotation required for fine-tuning).More fine-tuning details are in Appendix E.
The results show that models trained on Spider struggle to generalize to other datasets, which is consistent with the findings of Suhr et al. (2020).More information about the problem (in the form of additional train data or selection criteria) helps improve the quality.
IMDB, Yelp and Academic are more challenging datasets for cross-database semantic parsers than GeoQuery, but they are significantly smaller (Table 7), and the models are less stable while testing on them (with all random seeds fixed).Stronger criteria, such as passing one test or test suite, do work even with datasets of such difficulty, when weaker criteria fail.On GeoQuery, the search under the one-test and test-suite criteria leads to even better T5-3B 5.9 11.8 17.6 29.4 41.2 + fine-tune 52.9 52.9 52.9 52.9 58.8 BRIDGE 11.8 11.8 17.6 17.6 17.6 + fine-tune 52.9 52.9 52.9 52.9 52.9 SQ-QDMR 5.9 5.9 quality than fine-tuning.Our test-suite criterion is especially useful when one test is difficult to run on the large original database, e.g., on Academic.
As a result, we conclude that criterion-guided search on top of a pre-trained model is a good alternative to fine-tuning in cases when training data is not available, but the user is ready to provide more information on each test question.

Related Works
Search for Database queries.The task of translating NL questions into database queries implies the ability to query databases with natural language.To ensure this, it is essential to generate syntactically correct queries that refer to valid table and column names for the given database schema.Wang et al. (2019) noticed that a partially decoded SQL query can be executed, and thus, the result of this execution can guide the decoding process.At each decoding step, partial queries that crush or give an empty result during the execution are removed from beams.In this work, we also consider the execution criterion of search but apply it to the finished hypotheses, which allows us to search on top of the models with different output formats, including intermediate representations.Lin et al. (2020) generated SQL queries in the execution order to keep the consistency between the predicted database entities and checked output correctness with the static SQL parser.Suhr et al. (2020) executed the top-10 generated queries in beam search to filter the inexecutable ones, which is close to checking the execution criterion in our work but differs by the search method.
Task-specific decoders such as autoregressive grammar-based (Yin and Neubig, 2017;Lin et al., 2018) and tree decoders (Dong and Lapata, 2016), semi-autoregressive decoder (Rubin and Berant, 2021) provide some guarantees as they control the output structure.However, as noticed by Scholak et al. (2021), these decoding methods are incompatible with pre-trained decoders of language models.These pre-trained decoders, like the one of T5, can also be successfully applied to the text-to-SQL task (Shaw et al., 2021).Scholak et al. (2021) proposed to check hypotheses in beam search on the lexical and grammatical levels at each step of the beam search.However, compared to us, their approach required heuristics to prune incomplete queries.
The concurrent work by Wolfson et al. (2022) uses several components similar to ours but in a very different way.They use the QDMRs of Wolfson et al. (2020) with textual arguments as a form of weak supervision to generate SQL queries for the training set.Their synthesis process results in many candidate SQL queries and relies on tests to select the one as an annotation.Such a process is similar to the method of Saparina and Osokin (2021) for constructing groundings of QDMR arguments.However, the search process of Wolfson et al. (2022) is not connected to any neural model and is not used at the test time.
Search for Programs with Neural Networks.Our approach to searching for queries is closely related to the field of program synthesis if we interpret queries as programs.Recently, neural networks have been applied in a wide range of program synthesis tasks, see the excellent work of Chaudhuri et al. (2021) for a recent review.
When programs are synthesized from large language models generating multiple outputs, selecting the one that, e.g., passes some or all tests is a common practice.For example, the Codex model (Chen et al., 2021) for synthesizing Python code includes some sample tests into the input prompt to give the model more information to de-fine the user intent.Chen et al. (2021) following Kulal et al. (2019), among others, also uses the pass@k metric, which effectively means that the model generates k outputs, and the best ones are selected based on tests.The pass@k metric can be interpreted as test-suite accuracy after search w.r.t. the tests with the beam of size k.
Overall, it is widely accepted that tests are useful to precisely define the user intent.However, they are hard to collect at a large scale, especially when coupled with a description in natural language.Because of this, large-scale benchmarks related to code, e.g., CodeXGLUE (Lu et al., 2021), primarily used text-based metrics like BLEU.The attempts to specialize BLEU to code by combining it with abstract syntax trees extracted from code, like CodeBLEU (Ren et al., 2020), are in some sense similar to the SQL-based exact match metric of the Spider dataset (Yu et al., 2018).
Approaches to Simplify the Cross-database Setting.The community has made multiple attempts to modify the cross-domain setting to make the problem easier to solve.Yu et al. (2019b) collected the SParC dataset with coherent question sequences, which can allow sharing of information between the sequences.Yu et al. (2019a) collected the CoSQL dataset with an interactive conversational setting with SQL queries, making it possible to explore user interactions with the system.Lee et al. (2021) collected KaggleDBQA with database documentation in the form of textual description for database columns that can potentially allow language models to provide outputs better corresponding to the user's intent.Our approach provides users with a way to interact with the model by supplying tests.Given the initial question in natural language, the users can provide extra tests until they are satisfied with the generated query.

Conclusion
We studied the search over the outputs of the neural autoregressive models for better database query generation.We considered three state-of-the-art models: T5-3B, BRIDGE and SQ-QDMR.We observed that the search algorithms work with multiple criteria for selecting the output query.We also compared the search-augmented methods with the fine-tuned models on the GeoQuery, IMDB, Yelp and Academic datasets (under the distribution shift) and observed that the method with search can sometimes work even better than fine-tuning.Com-pared to fine-tuning, the search-based method does not require additional training data but relies on additional information on each test example.With such properties, our search based methods might be helpful for use cases like interactive query generation or annotating new datasets.

Limitations
In our experiments, we work only with the datasets where the user question was written in English.This might have simplified the task for T5 as the keywords and entity names of the query languages were also in English.
The test suites we built were still not perfect.In particular, it was hard to generate a test database such that the query SELECT year FROM concert GROUP BY year HAVING count(*) >= 50 had non-empty output because it needed at least 50 rows with the same value in the column year.We also noticed that the value 1 as the gold query output also caused many false positives and should probably have been considered the empty value for the queries outputting the count aggregator.
The results of the search methods were also not perfect for out-of-domain data, even with strong search criteria based on tests.One reason for that was that we started to hit the limitations of the models, which were built with mostly Spider in mind.In particular, the database preprocessing stage to select values for the query was, in some cases, slow and inaccurate.

A Construction of Test Suites
The method of Zhong et al. (2020) relies on generating the so-called neighbor queries from a given set of gold queries Q.A set of neighbors N q is obtained through slightly changing the original query q ∈ Q.Now, a database w distinguishes q and g ∈ N q if their execution results on w are different.The suite S is formed greedily: a new sampled database is added to S if it distinguishes a pair in N = {(q, g) | q ∈ Q, g ∈ N q } that is not distinguished by any database previously added to S.
In the algorithm above, to sample the databases, the original queries are parsed to derive the constant values and the corresponding columns from the WHERE clauses.For instance, in a query SELECT * from cars WHERE mpg > 20, the constants 20 and 21 are assigned some probability to be generated as values of the cars.mpgcolumn.We improve the query parser so that it accounts for common cases when table aliases are used (e.g., SELECT * from cars as T WHERE T.mpg > 20).Still, if a gold query has too many WHERE clauses, its execution on a randomly generated database is likely to be an empty table or, in the case of SELECT-ing an aggregate function, zeros or NULL values.This issue causes many potential false positives.
For this reason, for every query, we first search for a test database on which the query execution is not empty and only then proceed to check if the database distinguishes the query and its neighbors.This way, the chances are higher to sample at least one database yielding non-empty output on a gold query, leading to better code coverage.We also propose to build a separate test suite for each query.This modification allows dramatically reducing the time of the test-suite evaluation of a query.
Additionally, we make several important changes to the original implementation: • We limit the number of rows in the tables to 100.
As a result, the test suite databases are smaller in size and faster in evaluation.• We adjust the types of the columns in the original and the sampled databases so that they match the type of the corresponding values.In the float type columns, we also reduce the precision of numbers to 16-bit.These changes are crucial for SPARQL language, as it is more sensitive to data types than SQL.• If a column has only unique values in the original database and its name contains special substrings such as 'Name', 'ID', and 'Phone', we treat it as a unique key and generate its values accordingly.This heuristic allows performing the GROUP BY operations on any unique key.

B Implementation Details of Search
For the search on top of T5-3B, we use the (Scholak et al., 2021) model provided in the Transformers library (Wolf et al., 2020).To search with large beam sizes, we modify the Transformers implementation of T5 inference: instead of caching keys and values in attention blocks, we cache the attention outputs, which reduces memory usage at the cost of speed.
For working with SQL queries outside the Spider dataset, we had to replace the SQL parsing coming with Spider due to its limited functionality.We use the mo-sql-parsing library instead. 2 We execute all SQL queries in the sqlite33 package for python.For executing SPARQL queries, we use the opensource version of the Virtuoso system. 4 During our search, the system generated many cumbersome queries, so we had to impose a strict time limit for each query and make our implementation robust to the database engine crashes.The time limit for Spider queries is 30 seconds and for other datasets is 300 seconds.
The grid of beam sizes for CAB search on top of T5-3B is 2, 10, 100, 800, the corresponding widths are 2, 2, 2, 5, the grid for BRIDGE is 1, 10, 100, 1000 and the widths are 1, 2, 2, 5, the grid for SQ-QDMR is 1, 100, 1000 and the widths are 1, 5, 10.We infer BRIDGE and SQ-QDMR on one NVIDIA V100 GPU and T5-3B on 8 NVIDIA A100 GPUs (while searching with maximum beam size).For the search with the sampling methods, we use the same schema: we run sampling several times, increasing the number of samples (k in topk) until the selection criterion is passed.The grids for all the methods are the same.For searching with UniqueRandomizer, we run the methods until the selection criterion is passed or the maximum number of iterations is reached (equal to the maximum beam size: 800 for T5-3B and 1000 for BRIDGE and SQ-QDMR).We tune the temperature and p on Spider dev (the grid for temperature tuning was from 0.5 to 3.5 with 0.25 step; the grid for p was 0.85, 0.9, 0.95).The best p = 0.95 and the best temperature values are the following: While testing on single-database data, we use the same temperature parameters for the models trained on Spider only.For the fine-tuned models, we also tune temperature on dev sets and choose the values 2.0 for BRIDGE and T5-3B and 1.0 for SQ-QDMR.
Test-suite Accuracy In Table 6, we show the results of the search on Spider with two criteria: passing one test and passing the test suite.For both criteria, we compute execution accuracy (checks the execution result on one database) and test-suite accuracy (checks the execution results on the test suite).For all the models, test-suite accuracy is significantly lower than execution accuracy for the search with the one-test criterion.

E Fine-tuning Details
For fine-tuning on single-database data, we use official implementations of all the models. 5We experimented with different training strategies: initializing from released checkpoints of the models trained on Spider and training from scratch (in this case, models contain transformers pre-trained on textual data).We choose the best approach for each model and refer to it as fine-tuning.We fine-tune T5-3B pre-trained on textual data (Raffel et al., 2020) for 300 epochs on one NVIDIA A100 GPU with the same parameters as Scholak et al. (2021) used: Adafactor optimizer (Shazeer and Stern, 2018) with learning rate 1e-4 and batch size 625.
The released checkpoint of the BRIDGE model was trained on data that includes question splits of single-database data that we consider.We re-train this model on Spider-only data to evaluate it on query splits of single-database datasets.We use the same training parameters as Lin et al. (2020) used: Adam optimizer (Kingma and Ba, 2014) with the same scheduler (L-inv learning rate decay) and batch size 32.We choose the best checkpoint on the development set as Lin et al. (2020) did in their work (execution accuracy and test-suite accuracy of our and authors' checkpoints are the same on Spider dev and test).We use the same training procedure for fine-tuning on single-database data but create training data from both Spider train set and the train set of a particular dataset.
For fine-tuning SQ-QDMR on GeoQuery, we use the corresponding part of the Break dataset (Wolfson et al., 2020) and generate the 366 train groundings using the automatic annotation model of Saparina and Osokin (2021).We cannot finetune on the Academic dataset because its database is large and preprocessing of Saparina and Osokin (2021) failed.QDMR forms for other datasets were not provided in the Break dataset, so we could not fine-tune on them.For fine-tuning on GeoQuery, we start with the released checkpoint and param-eters saved on 73000 iterations of Spider training and continue up to 81000 iterations with GeoQuery train data.We also use the same parameters as Saparina and Osokin (2021) used: the optimizer is Adam (Kingma and Ba, 2014) with polynomial decay scheduler used by (Wang et al., 2020), the batch size is 24.

F Execution Accuracy of Search on
Single-Database Data Table 8 shows execution accuracy of search on different single-database datasets.We consider two types of models: trained only on Spider data and fine-tuned on single-database data.Comparing with Table 5, the figures are higher because many false-positive queries pass one test (the execution accuracy metric) and do not pass the test suite.For this reason, the results of searching with the testsuite criterion are lower in terms of execution accuracy: if no tested query satisfies the test-suite criterion, the system defaults to the result of the greedy decoding, which may fail one test, while the one-test criterion would select a false positive.

Figure 1 :
Figure1: Test-suite accuracy on the dev set for the T5-3B, BRIDGE and SQ-QDMR models tested with CAB for the test-suite criterion.

Table 1 :
Comparison of the test suites' statistics.NoEmpty is the percentage of SQL queries for which at least one test database with non-empty execution result is found; Cover is the percentage of neighbor queries distinguished by the test databases; Tests is the average number of test databases per query; Time is the average wall-clock execution time per query; Size is the total size of all test databases.
is a pre-trained seq2seq model based on Transformer.Recently, Shaw et al. (2021); Scholak et al. (2021) successfully applied T5 for the text-to-SQL task.The input sequence contains question tokens and tokens of column and table names.The database values matched with the question tokens are appended to the corresponding column names

Table 2 :
Comparing the execution accuracy of different search approaches under the 1-test criterion on the Spider dev split.

Table 3 :
Yu et al. (2018):021))search under different selection criteria (execution, output column match and 1 test) on Spider; beam s. refers to beam search.For T5-3B, we also compare the results of the search with PICARD(Scholak et al., 2021), our search with the execution criterion, and greedy decoding.For a fair comparison withScholak et al. (2021), we use the same data, the official Spider dev set, and the same metrics: exact-set matching accuracy and execution accuracy provided byYu et al. (2018):

Table 4 :
Test-suite accuracy of search under different selection criteria (execution, output column match, 1 test and test suite) on Spider.

Table 5 :
Different search criteria (execution, output column match, 1 test and test suite) on top of pre-trained models on the query test splits of different datasets with the test-suite accuracy.

Table 6 :
Execution (EX) and test-suite (TS) accuracy of search with 1-test and test-suite criteria on Spider dev and test.

Table 7 :
Statistics of single-database data.

Table 8 :
Different search criteria (execution, output column match, 1 test and test suite) on top of pre-trained models on the query test splits of different datasets with execution accuracy.