Diverse Parallel Data Synthesis for Cross-Database Adaptation of Text-to-SQL Parsers

Text-to-SQL parsers typically struggle with databases unseen at training time. Adapting Text-to-SQL parsers to new database schemas is a challenging problem owing to the vast diversity of schemas and the absence of natural language queries for the new schemas. We present ReFill, a framework for synthesizing high-quality and textually diverse parallel datasets for adapting Text-to-SQL parsers. Unlike prior methods that utilize SQL-to-Text generation, ReFill learns to retrieve-and-edit text queries from existing schemas and transfer them to the new schema. ReFill utilizes a simple method of retrieving diverse existing texts, masking their schema-specific tokens, and refilling them with tokens relevant to the new schema. We show that this process yields significantly more diverse text queries than standard SQL-to-Text generation models. Through experiments on several databases, we show that adapting a parser by fine-tuning it on datasets synthesized by ReFill consistently outperforms prior data-augmentation methods.


Introduction
Natural Language Interfaces to Databases (NLIDB) that translate text queries to executable SQL address a challenging task in the field of Semantic Parsing (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Berant et al., 2013). In addition to understanding natural language and generating an executable output, Text-to-SQL also requires the ability to reason over the schema structure of relational databases. Recently, datasets such as Spider (Yu et al., 2018), comprising parallel (Text, SQL) pairs over hundreds of schemas, have been released, and these have been used to train state-of-the-art neural Text-to-SQL models (Wang et al., 2020; Scholak et al., 2021a; Rubin and Berant, 2021; Scholak et al., 2021b; Xu et al., 2021). However, several studies have independently shown that such Text-to-SQL models fail catastrophically when evaluated on unseen schemas from real-world databases (Suhr et al., 2020; Lee et al., 2021; Hazoom et al., 2021). Adapting existing parsers to new schemas is challenging due to the lack of parallel data for fine-tuning the parser.
Synthesizing parallel data that is representative of natural human-generated queries (Wang et al., 2015; Herzig and Berant, 2019) is a long-standing problem in semantic parsing. Several methods have been proposed for supplementing training data with synthetic data, ranging from grammar-based canonical queries to full-fledged conditional text generation models (Wang et al., 2015; Herzig and Berant, 2019; Zhong et al., 2020; Yang et al., 2021; Zhang et al., 2021; Wang et al., 2021). For Text-to-SQL, data-augmentation methods are primarily based on training an SQL-to-Text model using labeled data from pre-existing schemas and generating data in the new schemas. We show that the text generated by these methods, while more natural than canonical queries, lacks the rich diversity of natural multi-user queries. Fine-tuning with such data often deteriorates model performance, since the lack of diversity leads to a biased model.
We propose a framework called REFILL (§ 2) for generating diverse text queries for a given SQL workload, which is often readily available (Baik et al., 2019). REFILL leverages parallel datasets from several existing schemas, such as Spider (Yu et al., 2018), to first retrieve a diverse set of texts paired with SQLs that are structurally similar to a given SQL q (§ 2.1). It then trains a novel schema translator model for converting text from a training schema to the target schema of q. The schema translator is decomposed into a mask step and a fill step to facilitate training without direct parallel examples of schema translation. Our design of the mask module and our method of creating labeled data for the fill module entail non-trivial details that we explain in this paper (§ 2.2). REFILL also incorporates a method of filtering out inconsistent (Text, SQL) pairs using an independent binary classifier (§ 2.3), which provides more useful quality scores than cycle-consistency based filtering (Zhong et al., 2020). Our approach is related to retrieve-and-edit models that have been used for semantic parsing (Hashimoto et al., 2018), dialogue generation (Chi et al., 2021), translation (Cai et al., 2021), and question answering (Karpukhin et al., 2020). However, our method of casting the "edit" as a two-step mask-and-fill schema translation model differs from prior work.
We summarize our contributions as follows: (i) We propose the idea of retrieving and editing natural text from several existing schemas to transfer it to a target schema, obtaining higher text diversity than standard SQL-to-Text generators. (ii) We design strategies for masking schema-specific words in the retrieved text and training the REFILL model to fill the masked positions with words relevant to the target schema. (iii) We filter high-quality parallel data using a binary classifier and show that it is more effective than existing methods based on cycle-consistency filtering. (iv) We compare REFILL with prior data-augmentation methods across multiple schemas and consistently observe that fine-tuning Text-to-SQL parsers on data generated by REFILL leads to more accurate adaptation.

Diverse data synthesis with REFILL
Our goal is to generate synthetic parallel data to adapt an existing Text-to-SQL model to a target schema unseen during training. A Text-to-SQL model M : X × S → Q maps a natural language question x ∈ X over a database schema s ∈ S to an SQL query q ∈ Q. We assume a Text-to-SQL model M trained on a dataset D_train = {(x_i, s_i, q_i)}_{i=1}^{N} consisting of text queries x_i for a database schema s_i and the corresponding gold SQLs q_i. The train set D_train typically consists of examples from a wide range of schemas s_i ∈ S_train. For example, the Spider dataset (Yu et al., 2018) contains roughly 140 schemas in its train set. We focus on adapting the model M to perform well on a target schema s different from the training schemas in S_train. To achieve this, we present a method of generating synthetic data D_syn of Text-SQL pairs containing diverse text queries for the target schema s. We fine-tune the model M on D_syn to adapt it to the schema s. Our method is agnostic to the exact model used for Text-to-SQL parsing. We assume that on the new schema s we have a workload QW_s of SQL queries. Often in existing databases, a substantial SQL workload is already available in the query logs by the point a DB manager decides to incorporate natural-language querying capabilities (Baik et al., 2019). The workload is assumed to be representative but not exhaustive. In the absence of a real workload, a grammar-based SQL generator may be used (Zhong et al., 2020; Wang et al., 2021).

[Algorithm 1: Data Synthesis with REFILL. For each q in QW_s: retrieve structurally similar pairs, mask and refill templates, then D_syn ← D_syn ∪ Filter(q, {x_r^q}); finally M_new ← fine-tune(M, D_syn).]
Figure 1 and Algorithm 1 summarize our method for converting a workload QW_s of SQL queries into a synthetic dataset D_syn of Text-SQL pairs containing diverse text queries. Given an SQL query q ∈ QW_s for the target schema s, our method first retrieves related SQL-Text pairs {(q_r, x_r)}_{r=1}^{R} from D_train on the basis of a tree-edit-distance measure, such that the SQLs {q_r}_{r=1}^{R} in the retrieved pairs are structurally similar to the SQL q (§ 2.1). We then translate each retrieved text query x_r so that its target SQL changes from q_r to q on schema s (§ 2.2). We decompose this task into two steps: masking out schema-specific tokens in x_r, and filling the masked text to make it consistent with q using a conditional text generation model B such as BART (Lewis et al., 2020). The translated text may be noisy since we do not have direct supervision to train such models. Thus, to improve the overall quality of the synthesized data, we filter out inconsistent SQL-Text pairs using an independent binary classifier (§ 2.3). Finally, we adapt the Text-to-SQL model M to the target schema s by fine-tuning it on the diverse, high-quality filtered data D_syn synthesized by REFILL.
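The loop of Algorithm 1 can be sketched as follows; `retrieve_similar`, `mask_schema_tokens`, `fill_masked`, and `filter_pairs` are hypothetical stand-ins for the components of §§ 2.1–2.3, not the paper's actual implementation.

```python
def synthesize(workload, d_train, retrieve_similar, mask_schema_tokens,
               fill_masked, filter_pairs):
    """Turn an SQL workload into filtered (text, sql) training pairs."""
    d_syn = []
    for q in workload:                                   # each SQL in the target-schema workload
        candidates = []
        for q_r, x_r in retrieve_similar(q, d_train):    # structurally similar (SQL, text) pairs
            template = mask_schema_tokens(x_r)           # mask schema-specific words
            candidates.append(fill_masked(template, q))  # refill to be consistent with q
        d_syn.extend(filter_pairs(q, candidates))        # keep only consistent pairs
    return d_syn
```

The fine-tuning step M_new ← fine-tune(M, D_syn) then runs on the returned pairs.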

Retrieving related queries
Given an SQL q ∈ QW_s sampled from the SQL workload, we extract SQL-Text pairs {(q_r, x_r)} ∈ D_train from the train set such that the retrieved SQLs {q_r} are structurally similar to the SQL q. We utilize the tree-edit-distance (Pawlik and Augsten, 2015, 2016) between the relational algebra trees of the SQLs q and q_r; a smaller distance implies higher structural similarity. Since the retrieved SQLs come from different schemas, we modify the tree-edit-distance algorithm to ignore schema names and database values. The tree-edit-distance is further normalized by the size of the larger tree. We only consider the (q_r, x_r) pairs where the SQLs {q_r} have a distance of less than 0.1 w.r.t. the SQL q. Within datasets like Spider that span hundreds of schemas, it is often possible to find several SQLs structurally similar to a given SQL q. For example, in Spider we found that 76% of the train SQLs have at least three zero-distance (structurally identical) neighbours in other schemas. In Figure 2, we present more detailed statistics.
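The retrieval step might be sketched as below. The paper computes tree-edit distance over relational-algebra trees; as a simplified stand-in, this sketch uses a token-level edit distance over SQL "skeletons" in which schema names and values are replaced by a placeholder. The normalization by the larger size and the 0.1 threshold follow the paper; everything else is an assumption.

```python
def skeleton(sql, schema_words):
    # Replace schema identifiers and quoted values with a placeholder so that
    # only the SQL structure remains.
    out = []
    for tok in sql.replace(",", " ").split():
        if tok.lower() in schema_words or tok.startswith("'"):
            out.append("_")
        else:
            out.append(tok.upper())
    return out

def edit_distance(a, b):
    # Standard one-row dynamic-programming Levenshtein distance over tokens.
    dp = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ta != tb))
    return dp[len(b)]

def retrieve(q, schema_q, train_pairs, threshold=0.1):
    """Return (q_r, x_r) pairs whose normalized distance to q is below threshold."""
    sq = skeleton(q, schema_q)
    hits = []
    for q_r, x_r, schema_r in train_pairs:
        sr = skeleton(q_r, schema_r)
        dist = edit_distance(sq, sr) / max(len(sq), len(sr))
        if dist < threshold:
            hits.append((q_r, x_r))
    return hits
```

Structurally identical SQLs from other schemas reduce to the same skeleton and obtain distance zero, mirroring the "zero-distance neighbours" statistic above.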

Translating text of related queries
Our next goal is to translate the retrieved x_r from being a text for the SQL q_r to a text x for the SQL q, where q ≈ q_r structurally. However, we do not have a readily labeled dataset to learn a model that translates x_r to x while being consistent with q. We therefore decompose this task into two steps: 1) a simpler task of masking schema-specific tokens in x_r to get a template x_r^masked, and 2) a conditional text generation model that maps (x_r^masked, q) to the text x consistent with q, by filling the masked positions in x_r^masked as per q. We re-purpose D_train to get indirect supervision for training the text generation model. We now present each step in detail.
Masking the retrieved text: Converting the retrieved text queries {x_r} to masked templates {x_r^masked} is a critical component of REFILL's pipeline, since irrelevant tokens like references to schema elements of the original database can potentially misguide the text generation module. Our initial approach was to mask tokens based on a match of text tokens with schema names and manually refined schema-to-text link annotations as in Lei et al. (2020). However, this approach failed to mask all schema-related terms, since their occurrences in natural text often differ significantly from the schema names in the database. Table A7 shows some examples. Consequently, we designed a simple frequency-based method of masking that is significantly more effective for our goal of using the masked text to guide diversity. For each word that appears in the text queries of the train set, we count the number of distinct databases where that word is mentioned at least once. For example, common words like {'show', 'what', 'list', 'order'} are mentioned in more than 90% of the schemas, while domain-specific words like {'countries', 'government'} occur only in the text queries of a few schemas. We mask out all words that appear in fewer than 50% of the schemas. The words to be masked are replaced by a special token MASK, and consecutive occurrences of MASK are collapsed into a single MASK token. We thus obtain masked templates {x_r^masked} retaining minimal information about their original schema.
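The frequency-based masking described above can be sketched as follows; whitespace tokenization and lowercasing are simplifying assumptions, while the 50% threshold and the MASK-collapsing follow the paper.

```python
from collections import defaultdict

def schema_frequencies(train_texts_by_schema):
    """Map each word to the fraction of schemas whose text queries mention it."""
    schemas_with_word = defaultdict(set)
    for schema, texts in train_texts_by_schema.items():
        for text in texts:
            for word in text.lower().split():
                schemas_with_word[word].add(schema)
    n = len(train_texts_by_schema)
    return {w: len(s) / n for w, s in schemas_with_word.items()}

def mask(text, freq, threshold=0.5):
    # Keep words mentioned in at least `threshold` of schemas; replace the rest
    # with MASK, collapsing consecutive MASKs into one.
    out = []
    for word in text.lower().split():
        if freq.get(word, 0.0) >= threshold:
            out.append(word)
        elif not out or out[-1] != "MASK":
            out.append("MASK")
    return " ".join(out)
```

Running `mask` over a retrieved question strips domain words like "countries" while preserving the query skeleton built from common words.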
Editing and Filling the masked text: Given a masked template x_r^masked and an SQL query q, we wish to edit and fill the masked portions in x_r^masked to make it consistent with the SQL q. We utilize a conditional text generation model B such as BART (Lewis et al., 2020) for this purpose. We first convert q into a pseudo-English representation q_Eng, similar to Shu et al. (2021), to make it easier for B to encode q. In addition, we wrap the table, column, and value tokens in q_Eng with special tokens to provide explicit signals to the text generation model B that such tokens are likely to appear in the output text x. Next, we concatenate the tokens in x_r^masked and q_Eng to jointly encode them as input to B. The output of B's decoder is the text x, which is expected to be consistent with the SQL q.
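The encoder input might be assembled as sketched below. The pseudo-English rendering q_Eng follows Shu et al. (2021) and is taken as given here; the special-token names `<sch>`/`</sch>` and the `|` separator are assumptions for illustration, not the paper's actual vocabulary.

```python
def wrap_schema_tokens(q_eng_tokens, schema_tokens):
    """Wrap table/column/value mentions so B knows they may appear in the output."""
    out = []
    for tok in q_eng_tokens:
        if tok.lower() in schema_tokens:
            out.extend(["<sch>", tok, "</sch>"])
        else:
            out.append(tok)
    return out

def build_encoder_input(x_masked, q_eng_tokens, schema_tokens):
    # Concatenation [x_masked | q_Eng]; x_masked may be None when B falls back
    # to pure SQL-to-Text generation (see the Robust training below).
    q_part = " ".join(wrap_schema_tokens(q_eng_tokens, schema_tokens))
    if x_masked is None:
        return q_part
    return x_masked + " | " + q_part
```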
Since we do not have direct supervision to fine-tune B for this task, we present a method of re-purposing D_train for fine-tuning B. D_train contains SQL-Text pairs (q_i, x_i) from various schemas s_i. A Naïve way to train B is to provide [x_i^masked | q_i^Eng], the concatenation of x_i^masked and q_i^Eng, as input to the encoder and maximize the likelihood of x_i in the decoder's output. This way, the decoder of B learns to refill the masked tokens in x_i^masked by attending to q_i^Eng to recover x_i in the output. While useful for learning to refill the masked positions, this Naïve method of training B is mismatched with its use during inference in two ways: (i) For a given SQL q, REFILL might fail to retrieve a structurally similar neighbour of q from D_train. In such cases, B should be capable of falling back to a pure SQL-to-Text generation mode that directly translates q into x. (ii) During inference, x_r^masked and q come from different schemas, whereas during Naïve training, the masked text x_i^masked and the SQL q_i are derived from the same example (q_i, x_i). To address these two limitations, we train B in a more Robust manner as follows: (a) For a random one-third of the train steps, we train B in the Naïve way, allowing B to learn to fill the masked tokens using q_i^Eng. (b) For another one-third, we pass only q_i^Eng as input and maximize the likelihood of x_i. This ensures that the model is capable of generating the text from q_i^Eng alone when the templates x_i^masked are unavailable or noisy. (c) For the remaining one-third, we first retrieve an SQL-Text pair (q_j, x_j) from a different schema such that the SQL q_j is structurally similar to q_i (§ 2.1) and the word edit distance between the masked templates x_i^masked and x_j^masked is small. We then replace x_i^masked with x_j^masked and encode [x_j^masked | q_i^Eng] as input to B, maximizing the likelihood of x_i in the decoder's output. This step makes the training more consistent with the inference, as x_j^masked and q_i^Eng now come from different schemas. In § 5.4, we justify Robust training compared to Naïve training.
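The three-way Robust training schedule can be sketched as below; `cross_schema_template`, which fetches a masked template of a structurally similar example from another schema, is a hypothetical stand-in for the retrieval of step (c).

```python
def make_example(x_masked, q_eng, x_target, cross_schema_template, rng):
    """Build one (encoder input, decoder target) pair for a train step."""
    r = rng.random()
    if r < 1 / 3:
        enc = x_masked + " | " + q_eng                       # (a) Naive: own template
    elif r < 2 / 3:
        enc = q_eng                                          # (b) SQL-only fallback
    else:
        enc = cross_schema_template(q_eng) + " | " + q_eng   # (c) cross-schema template
    return enc, x_target                                     # target is always x_i
```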

Filtering the Generated Text
Since the data synthesized using REFILL is used to fine-tune a downstream Text-to-SQL parser, we learn a filtering model F : (X, Q) → R to discard inconsistent examples from the generated dataset. F assigns lower scores to inconsistent Text-SQL pairs. For each SQL q ∈ QW_s, we select the top-5 sentences generated by REFILL and discard all sentences that score below a fixed threshold as per the filtering model. Existing work depends on a trained Text-to-SQL parser M to assign cycle-consistency scores (Zhong et al., 2020). However, we show that cycle-consistency filtering favors text on which M already performs well, and hence does not result in a useful dataset for fine-tuning M.
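Applying the filter amounts to a top-k selection with a score cutoff, as sketched below; `score_fn` stands in for the trained classifier F, and the threshold value 0.5 is an assumption (the paper only states that a fixed threshold is used).

```python
def apply_filter(q, sentences, score_fn, k=5, threshold=0.5):
    """Keep the top-k generated sentences for q that score above the threshold."""
    scored = sorted(((score_fn(x, q), x) for x in sentences), reverse=True)
    return [(x, q) for s, x in scored[:k] if s >= threshold]
```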
We instead train a filtering model F as a binary classifier, independent of M. The Text-SQL pairs {(x_i, q_i)} in the training set D_train serve as positive (consistent) examples, and we synthetically generate the negative (inconsistent) examples as follows: (i) Replace DB values in the SQL q_i with arbitrary values sampled from the same column of the database. (ii) Replace SQL-specific tokens in q_i with their corresponding alternates, e.g., replace ASC with DESC, or '>' with '<'. (iii) Cascade the previous two perturbations. (iv) Replace the entire SQL q_i with a randomly chosen SQL q_j from the same schema. (v) Randomly drop tokens in the text query x_i with a fixed probability of 0.3. (vi) Shuffle a span of tokens in the text query x_i, with the span length set to 30% of the length of x_i. Thus, for a given Text-SQL pair (x_i, q_i), we obtain six corresponding negative pairs {(x_j^n, q_j^n)}_{j=1}^{6}. Let s_i be the score assigned by the filtering model to the original pair (x_i, q_i) and {s_j}_{j=1}^{6} be the scores assigned to the corresponding negative pairs. We supervise the scores from the filtering model using a binary cross-entropy loss over the Sigmoid activations of the scores, as in Equation 1.
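The six perturbations might be implemented as sketched below; whitespace tokenization, the token-flip table, and the quoted-value convention are simplifying assumptions.

```python
import random

FLIPS = {"ASC": "DESC", "DESC": "ASC", ">": "<", "<": ">"}

def flip_sql_tokens(sql):
    # (ii) swap SQL-specific tokens for their alternates
    return " ".join(FLIPS.get(t, t) for t in sql.split())

def replace_values(sql, column_values, rng):
    # (i) replace quoted DB values with other values from the same column
    return " ".join("'%s'" % rng.choice(column_values) if t.startswith("'") else t
                    for t in sql.split())

def drop_tokens(text, rng, p=0.3):
    # (v) drop text tokens with probability p
    return " ".join(t for t in text.split() if rng.random() >= p)

def shuffle_span(text, rng, frac=0.3):
    # (vi) shuffle a contiguous span covering frac of the tokens
    toks = text.split()
    span = max(2, int(len(toks) * frac))
    start = rng.randrange(0, len(toks) - span + 1)
    chunk = toks[start:start + span]
    rng.shuffle(chunk)
    return " ".join(toks[:start] + chunk + toks[start + span:])

def negatives(x, q, other_sql, column_values, rng):
    """The six inconsistent pairs for one (text, SQL) training example."""
    return [
        (x, replace_values(q, column_values, rng)),                   # (i)
        (x, flip_sql_tokens(q)),                                      # (ii)
        (x, flip_sql_tokens(replace_values(q, column_values, rng))),  # (iii)
        (x, other_sql),                                               # (iv)
        (drop_tokens(x, rng), q),                                     # (v)
        (shuffle_span(x, rng), q),                                    # (vi)
    ]
```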
To explicitly contrast an original pair with its corresponding negative pairs, we further add a Softmax-Cross-Entropy loss term.
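Equation 1 itself is not reproduced in this excerpt; a standard form consistent with the description (binary cross-entropy over the Sigmoid activations of the positive score s_i and the six negative scores s_j, plus a contrastive Softmax-Cross-Entropy term) would be:

```latex
% Hedged reconstruction: Equation 1 is missing from this excerpt, so the exact
% form below is an assumption consistent with the surrounding description.
\mathcal{L}_{\mathrm{BCE}} \;=\; -\log \sigma(s_i) \;-\; \sum_{j=1}^{6} \log\bigl(1 - \sigma(s_j)\bigr)
\qquad
\mathcal{L}_{\mathrm{CE}} \;=\; -\log \frac{e^{s_i}}{e^{s_i} + \sum_{j=1}^{6} e^{s_j}}
```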
Related Work

Retrieve-and-Edit Methods: Our method is related to the retrieve-and-edit framework, which has been previously applied to various NLP tasks. In semantic parsing, question and logical-form pairs relevant to the test-input question are retrieved from the training data and edited to generate the output logical forms in different ways (Shaw et al., 2018; Das et al., 2021; Pasupat et al., 2021; Gupta et al., 2021). In machine translation, memory augmentation methods retrieve and edit examples from a translation memory to guide the decoder's output (Hossain et al., 2020; Cai et al., 2021). Our editing step, masking followed by refilling, is similar to style transfer methods that minimally modify the input sentence with the help of retrieved examples corresponding to a target attribute (Li et al., 2018). In contrast to learning a retriever, we find simple tree-edit distance to be an effective metric for retrieving relevant examples for our task.
Experimental Set-up

We adapt pretrained Text-to-SQL parsers on multiple database schemas unseen during training.
Here, we describe the datasets, models, and evaluation metrics used in our experiments.
Datasets: We primarily experiment with the Spider dataset (Yu et al., 2018): schemas held out from Spider's dev set are grouped into four evaluation groups (Table A1), and the data synthesized for each group is then used to fine-tune a base Text-to-SQL parser.
We further experiment with four datasets outside Spider in Section 5.6: GeoQuery (Zelle and Mooney, 1996), Academic (Li and Jagadish, 2014), IMDB, and Yelp (Navid Yaghmazadeh and Dillig, 2017). We utilize the preprocessed versions of these datasets open-sourced by Yu et al. (2018). In appendix Table A2, we present statistics for each of the four datasets.
Text-to-SQL parser: We experiment with SMBOP (Rubin and Berant, 2021) as our base Text-to-SQL parser and utilize the authors' implementation. The SMBOP model is initialized with a ROBERTA-BASE model, followed by four RAT layers, and trained on the train split of the Spider dataset. The dev set used for selecting the best model excludes data from the four held-out evaluation groups.
Edit and Fill model: We utilize a pre-trained BART-BASE as our conditional text generation model for editing and filling the masked text. The model is fine-tuned on the train split of the Spider dataset as described in Section 2.2.
Filtering Model: We train a binary classifier based on a ROBERTA-BASE checkpoint on Spider's train split to filter out inconsistent SQL-Text pairs as described in Section 2.3.
Baselines: For baseline SQL-to-Text generation models, we consider recently proposed models: L2S (Wang et al., 2021), GAZP (Zhong et al., 2020), and SNOWBALL (Shu et al., 2021). All the baselines utilize pre-trained language models such as BART (Lewis et al., 2020) or BERT (Devlin et al., 2018) for translating SQL tokens to natural text in a standard seq-to-seq set-up. The baselines differ mostly in how SQL tokens are encoded as input to the language model. In Section 3, we reviewed the recent SQL-to-Text methods.
Evaluation Metrics: We evaluate the Text-to-SQL parsers using the Exact Set Match (EM) and the Execution Accuracy (EX) metrics (Yu et al., 2018). The EM metric measures set match for all the SQL clauses and returns 1 only if there is a match across all clauses; it ignores the DB values (constants) in the SQL query. The EX metric directly compares the results obtained by executing the predicted query and the gold query on the database.
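The two metrics can be illustrated with the simplified sketch below; the official Spider evaluation parses SQL properly, whereas this sketch splits on a few keywords and strips quoted literals and numbers, since EM ignores constants.

```python
import re

KEYWORDS = r"(SELECT|FROM|WHERE|GROUP BY|ORDER BY|HAVING|LIMIT)"

def clauses(sql):
    # Split the query into clauses keyed by keyword; sets ignore token order.
    parts = re.split(KEYWORDS, " " + sql.strip(), flags=re.I)
    out = {}
    for kw, body in zip(parts[1::2], parts[2::2]):
        body = re.sub(r"'[^']*'|\b\d+\b", "_", body)  # EM ignores DB values
        out[kw.upper()] = frozenset(body.split())
    return out

def exact_set_match(pred_sql, gold_sql):
    """1 iff every clause matches as a set, ignoring constants."""
    return int(clauses(pred_sql) == clauses(gold_sql))

def execution_match(pred_rows, gold_rows):
    """1 iff executing both queries returns the same result set."""
    return int(sorted(pred_rows) == sorted(gold_rows))
```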
We provide more implementation details including the hyperparameter settings in appendix A.5.

Results and Analysis
We first demonstrate the effectiveness of the synthetic data generated using REFILL for fine-tuning Text-to-SQL parsers on new schemas. We compare with recent methods that utilize SQL-to-Text generation for training-data augmentation (§ 5.1). We then evaluate the intrinsic quality of the synthetic data generated by different methods in terms of text diversity and the agreement of the generated text with the ground truth (§ 5.2). We demonstrate that higher text diversity results in better performance of the adapted parsers (§ 5.3). We then justify the key design choices related to the masking of the retrieved text and the training of the schema translator module that improve the quality of REFILL-generated text (§ 5.4). Finally, we demonstrate the importance of using an independent binary classifier over cycle-consistency filtering (§ 5.5).

Evaluating adapted parsers
In Table 1, we compare the performance of parsers fine-tuned on Text-SQL pairs generated using REFILL and other SQL-to-Text generation baselines. We observe that fine-tuning on the high-quality and diverse text generated by REFILL provides consistent performance gains over the base model across all the database groups. On average, REFILL improves the base model by 8.0 EM, in comparison to a gain of 2.8 EM by the best baseline (GAZP). We observe that the gains from baseline methods are often small or even negative. REFILL continues to yield positive gains even for smaller workload sizes. In Figure 3, we plot the fraction of the total SQL workload used on the x-axis and the EM of the fine-tuned parsers, averaged across all four groups, on the y-axis. When using the data synthesized by REFILL, the performance of the parser improves steadily with an increasing size of the SQL workload. In contrast, the baseline SQL-to-Text generation methods fail to provide significant improvements. Interestingly, the data synthesized by REFILL using the 30% SQL workload leads to better downstream performance of the adapted parsers than any of the baselines utilizing the 70% SQL workload for SQL-to-Text generation.

Quality and Diversity of generated text
We attribute our gains over existing methods to the increased quality and diversity of the generated text. We measure quality using the BLEU score of the set S(q) of text generated for an SQL q, with the gold text of q as the reference. To measure diversity, we utilize SelfBLEU (Zhu et al., 2018), which measures the average BLEU score among the texts in S(q); lower SelfBLEU implies higher diversity. We evaluate on all the gold SQL-Text pairs available in Spider's dev set. In Table 2, we report these scores: for each method, we generate 10 hypotheses per SQL query and pick the hypothesis with the highest BLEU to report the overall BLEU score. To allow baselines to generate more diverse text than standard beam search, we utilize beam-sampling (Fan et al., 2018; Holtzman et al., 2019). For REFILL, the 10 hypotheses come from using up to 10 retrieved-and-masked templates. We observe that our method of masking and refilling natural text retrieved from existing datasets allows REFILL to generate higher quality text (+8.4 BLEU) with naturally high text diversity.
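The diversity measure can be sketched as below. A simplified BLEU (modified n-gram precision up to bigrams with a brevity penalty) stands in for the full BLEU used in the paper; SelfBLEU averages each hypothesis's BLEU against the remaining hypotheses, so lower SelfBLEU means higher diversity.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, refs, max_n=2):
    # Simplified BLEU: clipped n-gram precision with a brevity penalty.
    hyp_toks = hyp.split()
    precisions = []
    for n in range(1, max_n + 1):
        h = ngrams(hyp_toks, n)
        clipped = sum(min(c, max(ngrams(r.split(), n)[g] for r in refs))
                      for g, c in h.items())
        total = sum(h.values())
        precisions.append(clipped / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0
    ref_len = min(len(r.split()) for r in refs)  # reference-length simplification
    bp = 1.0 if len(hyp_toks) >= ref_len else math.exp(1 - ref_len / len(hyp_toks))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def self_bleu(sentences):
    """Average BLEU of each sentence against the rest; lower = more diverse."""
    scores = [bleu(s, sentences[:i] + sentences[i + 1:])
              for i, s in enumerate(sentences)]
    return sum(scores) / len(scores)
```

A set of identical hypotheses has SelfBLEU 1.0, while hypotheses with no shared n-grams score 0.0.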

Importance of Text Diversity
Retrieving and editing text from multiple existing examples enables REFILL to generate diverse text.
In Figure 4, we show that increased diversity of the generated text leads to improved performance of the fine-tuned parser. We vary the number of retrieved-and-masked templates on the x-axis and plot the performance of the fine-tuned parsers on the y-axis for each group. To keep the number of synthesized examples the same, the product of the number of beam samples and the number of retrieved templates is held constant. We observe that fine-tuning the parser on the more diverse data generated using 5 retrieved templates per SQL provides consistently superior EM performance across all four groups compared to using the less diverse data obtained by retrieving just one or two templates per SQL. The consistent drop in EM when increasing the retrieved templates from 5 to 10 is explained by the reduction in text diversity: using 5 retrieved templates yields a 100 − SelfBLEU score of 46.7, while with 10 retrieved templates 100 − SelfBLEU reduces to 33.8. This reduction is due to the inclusion of more similar templates as their number increases from 5 to 10. Finally, the drop in REFILL's performance with reduced text diversity reconfirms the worse performance of the SQL-to-Text baselines reported in Section 5.1, which do not offer enough text diversity.

Design choices of Schema Translator
In Section 2.2, we described two important design choices: (1) the method of masking schema-relevant tokens and (2) the method of training the Edit-and-Fill model for editing and refilling the masked text. We justify these design choices by comparing the quality of the generated text under each combination of these choices in Table 3. Comparing across rows (Schema-Match vs. Frequency), we observe that frequency-based masking results in 2 to 3 point improvements in BLEU compared to masking by matching schema names. Table A7 shows specific examples where the schema-match method fails to mask sufficiently. In contrast, even though the frequency-based method might over-mask, it still suffices for our goal of guiding the text generation model. Comparing across columns (Naïve Train vs. Robust Train), we observe that specifically training the template-filling model to be robust to the input templates also improves the quality of the generated text by 3.6 to 4.6 points.

Importance of Filtering model
Cycle-consistency based filtering (Zhong et al., 2020) rejects a synthesized SQL-Text pair (q, x) if the SQL generated by the base Text-to-SQL parser for the input text x does not match the SQL q. We argue that cycle-consistency based filtering is sub-optimal: it favors text on which the base parser M already performs well, and therefore does not yield a dataset that is useful for fine-tuning M. In Table 4, we compare the base Text-to-SQL parser with models fine-tuned without any filtering, with cycle-consistency filtering, and with our filtering model. We focus on Group-4, where the base Text-to-SQL parser is significantly weaker than on the other groups, and use REFILL to synthesize data for the 30% query workload. Using no filtering, or using cycle-consistency filtering, results in worse performance, while applying our filtering model offers significant improvements over the base model.

Experiments on datasets outside Spider
We further validate our method on four single-database datasets outside Spider, namely GeoQuery (Zelle and Mooney, 1996), Academic (Li and Jagadish, 2014), IMDB, and Yelp (Navid Yaghmazadeh and Dillig, 2017). In Table 5, we first report the performance of our base Text-to-SQL parser and observe poor cross-database generalization, with an average EM of just 9.7. We then adapt the parser by fine-tuning it on data synthesized by REFILL for each of these databases.

Limitations
This work focuses on synthesizing parallel data containing diverse text queries for adapting pretrained Text-to-SQL models to new databases. Thus, our current effort toward diverse text-query generation using REFILL is limited to the Text-to-SQL semantic parsing task. Extending REFILL for data augmentation in other semantic parsing or question-answering tasks is an exciting direction we hope to explore in future work. Our experimental set-up assumes a small workload of real SQL queries. As per Baik et al. (2019), a small workload of real SQL queries is a reasonable assumption, since SQL query logs are often available for existing in-production databases that are to be supported by a Text-to-SQL service. Synthesizing realistic SQL-query workloads for newly instantiated databases is a challenging and promising direction, but different from the diverse text-query generation aspect of our work.

Ethical Considerations
Our goal with REFILL is to synthesize parallel data for adapting Text-to-SQL parsers to new schemas. We believe that the real-world deployment of Text-to-SQL or any semantic parser trained on text generated by language models must go through a careful review for any harmful biases. Also, the intended users of any Text-to-SQL service must be made aware that the answers generated by these systems may be incorrect. We do not immediately foresee any serious negative implications of the contributions that we make through this work.

Figure 2 :
Figure 2: Frequency distribution of the average tree-edit-distance between SQLs and their three nearest neighbours from other schemas within Spider's train set.
SQL-to-Text generation: Many prior works perform training-data augmentation via pre-trained text generation models that translate SQLs into natural text (Guo et al., 2018; Zhong et al., 2020; Shi et al., 2020; Zhang et al., 2021; Wang et al., 2021; Yang et al., 2021; Shu et al., 2021). For example, Wang et al. (2021) fine-tune BART (Lewis et al., 2020) on parallel SQL-Text pairs to learn an SQL-to-Text translation model. Shu et al. (2021) propose a similar model that is trained in an iterative adversarial way along with an evaluator model. The evaluator learns to identify inconsistent SQL-Text pairs, similar to our filtering model. To retain high-quality synthesized data, Zhong et al. (2020) additionally filter the synthesized pairs using a pre-trained Text-to-SQL model based on cycle consistency, which we show to be sub-optimal (§ 5.5). The SQL workload in most prior work was typically sampled from hand-crafted templates, from a grammar such as a PCFG induced from existing SQLs, or by crawling SQLs from open-source repositories (Shi et al., 2020). However, database practitioners have recently drawn attention to the fact that SQL workloads are often pre-existing and should be utilized (Baik et al., 2019).

Figure 3 :
Figure 3: Average EM performance of Text-to-SQL models on the four groups vs. the size of the SQL workload (§ 5.1). Data generated by REFILL using the 30% SQL workload yields better performance than data from the best existing baseline utilizing the 70% workload.

Figure 4 :
Figure 4: Accuracy of fine-tuned parsers Vs. the number of templates per SQL used by REFILL ( § 5.3).

Table 1 :
Results for finetuning a base semantic parser (SMBOP) on Text-SQL pairs generated using various SQL-to-Text baselines and REFILL ( § 5.1).REFILL provides consistent gains over the base model across all the database groups, while gains from other methods are often negative or small.
Table 2 :
Quality and diversity of the text generated using REFILL and prior SQL-to-Text generation methods (§ 5.2). For each method, we generate 10 hypotheses per SQL query and pick the hypothesis with the highest BLEU to report the overall BLEU score.

Table 3 :
Analyzing the impact of design choices related to Schema Translation, by observing BLEU-4 scores of the text generated by REFILL ( § 5.4).Frequency based masking and Robust training leads to a higher quality of the generated text.

Table A1 :
Number of schemas and statistics of query workload for each group.Related schemas were grouped together in order to obtain larger evaluation sets per group.

Table A3 :
Evaluation on four groups of schemas held out from Spider's dev set, for varying sizes of query workload {30%, 50%, 70%} used for SQL-to-Text translation.

Table A4 :
EM evaluation on four additional datasets outside Spider, for varying sizes of query workload {30%, 50%, 70%} used for SQL-to-Text translation. Since the contents of the Acad, IMDB, and Yelp databases were not publicly accessible to us, we are unable to report EX results on these databases. EX results for GeoQuery appear in Table A5.