STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing

In this paper, we propose STAR, a novel SQL-guided pre-training framework for context-dependent text-to-SQL parsing, which leverages contextual information to enrich natural language (NL) utterance and table schema representations for text-to-SQL conversations. Concretely, we propose two novel pre-training objectives that respectively explore the context-dependent interactions of NL utterances and SQL queries within each text-to-SQL conversation: (i) a schema state tracking (SST) objective that tracks and explores the schema states of context-dependent SQL queries by predicting and updating the value of each schema slot during the interaction; (ii) an utterance dependency tracking (UDT) objective that employs weighted contrastive learning to pull together two semantically similar NL utterances and push apart the representations of semantically dissimilar NL utterances within each conversation. In addition, we construct a high-quality, large-scale context-dependent text-to-SQL conversation corpus to pre-train STAR. Extensive experiments show that STAR achieves new state-of-the-art performance on two downstream benchmarks (SParC and CoSQL), significantly outperforming previous pre-training methods and ranking first on the leaderboard. We believe the release of the constructed corpus, codebase and pre-trained STAR checkpoints will push forward research in this area. For reproducibility, we release our code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/star.


Introduction
Text-to-SQL parsing (Zhong et al., 2017; Yu et al., 2018; Wang et al., 2022; Qin et al., 2022b) aims to translate natural language (NL) questions into executable SQL queries, which enables the users
who are unfamiliar with SQL to query databases in natural language. Pre-trained language models (PLMs) have proved powerful for text-to-SQL parsing and yield impressive performance, benefiting from the rich linguistic knowledge in large-scale corpora. However, as revealed in previous works (Yin et al., 2020; Yu et al., 2021a; Qin et al., 2022a), there is an intrinsic discrepancy between the distributions of tables and plain texts, leading to sub-optimal performance of general PLMs such as BERT (Devlin et al., 2019), ROBERTA (Liu et al., 2019) and ELECTRA (Clark et al., 2020). Recently, some studies (Yu et al., 2021a,b; Shi et al., 2021; Deng et al., 2021; Liu et al., 2021a,b) alleviate the above limitation by designing tailored tabular language models (TaLMs) for text-to-SQL parsing, which simultaneously encode NL questions and tables. Despite the remarkable progress of previous TaLMs, they still suffer from technical challenges in the context-dependent setting. First, existing TaLMs merely explore contextual information to enrich utterance representations, without considering the interaction states determined by historical SQL queries, which are relevant to the user intent of the current utterance. Nevertheless, tracking and using historical SQL information can contribute greatly to modeling the current SQL query, since SQL conveys user intent in a compact and precise manner. As shown in Figure 1, the second SQL query is more likely to select contents from the "Campuses" table since the first SQL query mentioned that table. Although tracking schema states is essential to keep track of user requests in context-dependent text-to-SQL parsing, how to model, track and utilize schema states throughout a conversation has not yet been explored by previous TaLMs. Second, context-dependent text-to-SQL parsing needs to effectively process context information so as to help the system better parse the current NL utterance, since users may omit previously mentioned entities as
well as constraints and introduce substitutions to what has already been stated. Taking Figure 1 as an example, the second utterance omits the implicit constraint of "campuses in year 2000" mentioned in the first utterance. However, most prior TaLMs primarily model stand-alone NL utterances without considering context-dependent interactions, which results in suboptimal performance. Although SCORE (Yu et al., 2021b) models the turn contextual switch by predicting a context switch label between two consecutive user utterances, it ignores the complex interactions of context utterances and cannot track the dependency between distant utterances. For instance, in Figure 1, SCORE fails to capture the long-term dependency between the first and the fourth utterances since there is a switch between the second and the third utterances.
In this paper, we propose STAR, a novel pre-training framework for context-dependent text-to-SQL parsing, which explores the multi-turn interactions of NL utterances and SQL queries within each conversation. First, we propose a schema state tracking (SST) objective to keep track of SQL queries in the form of schema states, which predicts the value (a SQL keyword) of each schema slot of the current SQL query given the schema-state representation of the previously predicted SQL query. By introducing schema states to represent SQL queries, we can better capture the alignment between the historical and current SQL queries, especially for long and complex SQL queries. Second, we propose an utterance dependency tracking (UDT) objective to capture the complex semantic dependency of sequential NL questions, which employs weighted contrastive learning to pull together semantically similar NL utterances and push apart dissimilar NL utterances within each conversation. A key insight is that utterances corresponding to similar SQL queries are more semantically relevant, as SQL is a highly structured indication of user intent. Concretely, we propose two novel similarity functions (SQL semantic similarity and SQL structure similarity) to comprehensively construct appropriate positive and negative NL question pairs. We summarize our main contributions as follows.
(1) To the best of our knowledge, we are the first to propose a schema state tracking (SST) objective for a context-dependent TaLM, which tracks and updates the schema states of context-dependent SQL queries. (2) We propose an utterance dependency tracking (UDT) objective to capture the complex semantic information of sequential NL questions, which employs weighted contrastive learning with two novel SQL-oriented similarity functions to pull together semantically similar NL utterances and push apart the representations of dissimilar NL utterances within each conversation. (3) We construct a high-quality, large-scale context-dependent text-to-SQL conversation corpus to pre-train STAR. Experiments show that STAR achieves new state-of-the-art performance on two downstream benchmarks (SPARC and COSQL) and ranks first on the leaderboard.

Task Definition
In this section, we first provide the formal task definition for context-dependent text-to-SQL parsing. Let U = {u_1, . . ., u_T} denote the utterances in a context-dependent text-to-SQL conversation with T turns, where u_i represents the i-th NL question. Each NL question u_i contains n_i tokens, denoted as u_i = [w_1, . . ., w_{n_i}]. In addition, there is a corresponding database schema s, which consists of N tables {T_i}_{i=1}^N. The number of columns of all tables in the schema is m. We use s_i to denote the name of the i-th item in schema s. At the current turn t, the goal of text-to-SQL parsing is to generate the SQL query o_t given the current utterance u_t, the historical utterances {u_1, . . ., u_{t-1}}, the schema s, and the last predicted SQL query o_{t-1}. STAR primarily consists of a stack of Transformer layers, which converts a sequence of L input tokens x = [x_1, . . ., x_L] into a sequence of contextualized vector representations h = [h_1, . . ., h_L].
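The inputs available at each turn can be made concrete with a small data structure. This is an illustrative sketch only; the `Turn`/`Conversation` names and the example schema items are our own, not part of the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Turn:
    utterance: str   # NL question u_t
    sql: str         # gold (or previously predicted) SQL query o_t

@dataclass
class Conversation:
    schema_items: List[str]          # table/column names s_1 .. s_m
    turns: List[Turn] = field(default_factory=list)

    def parser_input(self, t: int) -> dict:
        """Everything available when predicting the SQL query at turn t (1-indexed):
        the current utterance u_t, the history u_1..u_{t-1}, the schema s,
        and the last SQL query o_{t-1}."""
        return {
            "current": self.turns[t - 1].utterance,
            "history": [x.utterance for x in self.turns[: t - 1]],
            "schema": self.schema_items,
            "last_sql": self.turns[t - 2].sql if t > 1 else None,
        }

conv = Conversation(
    schema_items=["campuses.campus", "campuses.year", "degrees.degrees"],
    turns=[
        Turn("Which campuses opened in 2000?",
             "SELECT campus FROM campuses WHERE year = 2000"),
        Turn("How many degrees did they confer?",
             "SELECT count(degrees) FROM degrees JOIN campuses ..."),
    ],
)
inp = conv.parser_input(2)
```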

Pre-training Objectives
As illustrated in Figure 2, we propose two novel pre-training objectives, SST (Schema State Tracking) and UDT (Utterance Dependency Tracking), to explore the complex context interactions of NL utterances and SQL queries within each text-to-SQL conversation, respectively. In addition, we also employ the MLM (Masked Language Modeling) objective to help learn better contextual representations of the conversations. Next, we introduce the pre-training objectives in detail.

Schema State Tracking
The usage of context SQL information contributes greatly to modeling the current SQL query. Inspired by dialogue state tracking (Ouyang et al., 2020; Wang et al., 2021), we represent each SQL query as a set of schema states {(s^i_{t-1}, v^i_{t-1})}_{i=1}^m, where s^i_{t-1} denotes the schema-state slot, v^i_{t-1} denotes the schema-state value of the slot s^i_{t-1}, and m represents the number of schema items. At the t-th turn, the goal of SST is to predict the value v^i_t of each schema-state slot s^i_t of the t-th SQL query, given all the history utterances {u_1, . . ., u_{t-1}}, the current utterance u_t and the schema states {(s^i_{t-1}, v^i_{t-1})}_{i=1}^m of the last query o_{t-1}. That is, at the t-th turn, the input I_t of the SST task is:

I_t = [u_1; . . .; u_t; (s^1_{t-1}, v^1_{t-1}); . . .; (s^m_{t-1}, v^m_{t-1})]    (1)

Note that the SQL queries within a conversation share the same schema s, thus the schema states of the t-th and (t-1)-th SQL queries have the same schema-state slots (i.e., s^i_{t-1} = s^i_t = s^i).
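Turning a SQL query into schema states can be sketched with a simple clause scan. This is a minimal heuristic of our own for illustration (the keyword list and the substring matching are simplifying assumptions, not the paper's implementation):

```python
import re

SQL_KEYWORDS = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY", "LIMIT"]

def schema_states(sql: str, schema_items: list) -> dict:
    """Map every schema slot to the SQL keyword of the clause it appears in,
    or [NONE] if the query does not mention it (a simplified heuristic)."""
    states = {item: "[NONE]" for item in schema_items}
    # split the query into (keyword, clause-body) segments
    pattern = "|".join(k.replace(" ", r"\s+") for k in SQL_KEYWORDS)
    parts = re.split(f"({pattern})", sql, flags=re.IGNORECASE)
    current_kw = None
    for part in parts:
        normalized = re.sub(r"\s+", " ", part.strip()).upper()
        if normalized in SQL_KEYWORDS:
            current_kw = normalized
            continue
        if current_kw is None:
            continue
        for item in schema_items:
            column = item.split(".")[-1].lower()
            if column in part.lower() and states[item] == "[NONE]":
                states[item] = current_kw
    return states

states = schema_states(
    "SELECT campus FROM campuses WHERE year = 2000",
    ["campuses.campus", "campuses.year"],
)
```

Each slot keeps the keyword of the first clause that mentions it, which matches the slot-value view of Figure 1 for simple queries.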
Since each schema state c^i_{t-1} = (s^i_{t-1}, v^i_{t-1}) contains multiple words, we apply an attentive layer to obtain its representation. Concretely, given the output contextualized representations h_1, . . ., h_k of the k tokens of each schema state c^i_{t-1}, the attentive schema-state representation c̄^i_{t-1} of the schema state c^i_{t-1} can be calculated as:

α_j = softmax_j(v_1^T tanh(W_1 h_j)),  c̄^i_{t-1} = Σ_j α_j h_j

where v_1 and W_1 are trainable parameters. We use the attentive schema-state representation c̄^i_{t-1} of the last SQL query to predict the value v^i_t of the current schema state c^i_t:

P(v^i_t | c^i_{t-1}) = softmax(W_2 c̄^i_{t-1} + b_2)

where W_2 and b_2 are trainable parameters.
Finally, the pre-training loss function of SST is defined as the cross-entropy between the predicted schema-state value distribution P(v^i_t | c^i_{t-1}) and the gold schema-state value v̂^i_t:

L_SST = − Σ_{i=1}^m log P(v^i_t = v̂^i_t | c^i_{t-1})

where m is the number of schema slots.
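The SST head above (attentive pooling over a schema state's tokens, then a softmax over SQL keywords) can be sketched in NumPy. The dimensions, random parameters and gold values below are toy assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_keywords = 8, 3, 5   # hidden size, tokens per schema state, keyword vocab

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# trainable parameters of the SST head (randomly initialized here)
W1 = rng.normal(size=(d, d)); v1 = rng.normal(size=(d,))
W2 = rng.normal(size=(n_keywords, d)); b2 = np.zeros(n_keywords)

def sst_head(H):
    """H: (k, d) token representations of one schema state c^i_{t-1}.
    Returns P(v^i_t | c^i_{t-1}) over the SQL-keyword vocabulary."""
    scores = np.tanh(H @ W1.T) @ v1      # (k,) attention logits
    alpha = softmax(scores)              # attention weights
    c = alpha @ H                        # (d,) attentive schema-state vector
    return softmax(W2 @ c + b2)          # (n_keywords,) value distribution

def sst_loss(H_list, gold_values):
    """Cross-entropy averaged over the m schema slots."""
    losses = [-np.log(sst_head(H)[v]) for H, v in zip(H_list, gold_values)]
    return float(np.mean(losses))

H_list = [rng.normal(size=(k, d)) for _ in range(4)]   # m = 4 slots
loss = sst_loss(H_list, gold_values=[0, 2, 1, 4])
```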

Utterance Dependency Tracking
We propose an utterance dependency tracking (UDT) objective to capture the complex semantic dependency of sequential NL questions within each text-to-SQL conversation. A key challenge behind UDT is how to construct appropriate positive and negative labels by way of self-supervision. Generally, it is intuitive that we can construct negative utterance pairs by selecting NL utterances from different conversations. However, it is nontrivial to construct positive utterance pairs, since the current utterance may be irrelevant to historical utterances with prominent contextual shifts, as with the second and third utterances shown in Figure 1. Hence, we treat the NL utterances within the same conversation as positive pairs, which are assigned different similarity scores. SQL is a highly structured indication of the user utterance, so by measuring the similarity of the current SQL query to historical SQL queries, pseudo-labels of utterance semantic dependencies can be obtained to guide STAR in contextual modelling. Here we propose to measure SQL similarity from two perspectives.

SQL Semantic Similarity
To compute the similarity of two SQL queries, we first convert each SQL query into m schema states as described in Section 3.1, where the schema slots are the names of all schema items and their values come from SQL keywords. As illustrated in Figure 3, given two SQL queries (denoted as o_x and o_y), we obtain the schema states of the SQL queries o_x and o_y respectively. Since all the schema states share the same schema slots, we have s^i_x = s^i_y. Then, we adopt the Jaccard similarity (Niwattanakul et al., 2013) to compute the semantic similarity of the SQL queries o_x and o_y by comparing v^i_x and v^i_y. Mathematically, we compute the SQL semantic similarity of o_x and o_y as:

f_sem(o_x, o_y) = (1 / |ŝ_{x,y}|) Σ_i Jaccard(v^i_x, v^i_y)

where |ŝ_{x,y}| represents the number of non-duplicate schema states whose values are not [NONE] in o_x and o_y, and the Jaccard function computes the ratio of the intersection over the union of v^i_x and v^i_y.

SQL Structure Similarity

To take advantage of the tree structure of SQL queries, we first parse each SQL query o_x into a SQL tree G_x, as illustrated in Figure 3. Given two SQL trees G_x and G_y for SQL queries o_x and o_y, we leverage the Weisfeiler-Lehman sub-tree kernel (Shervashidze et al., 2011) to compute the SQL tree-structure similarity score f_tree(o_x, o_y) as follows:

f_tree(o_x, o_y) = Norm(K_WL(G_x, G_y)),  K_WL(G_x, G_y) = Σ_{i=0}^h K(G^i_x, G^i_y)

where Norm() is a normalization function, K_WL() is the Weisfeiler-Lehman sub-tree kernel function and K() is the base kernel on graphs. G^i_x denotes the Weisfeiler-Lehman graph at height i of the tree G_x and h is the number of Weisfeiler-Lehman iterations. We refer the readers to Shervashidze et al. (2011) for the implementation details of the Weisfeiler-Lehman sub-tree kernel.
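The SQL semantic similarity above can be sketched directly, assuming (for illustration) that each slot's value is a set of SQL keywords and that `[NONE]` marks an unused slot:

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def sql_semantic_similarity(states_x: dict, states_y: dict) -> float:
    """Average Jaccard similarity of slot values over the schema states that
    are active (not [NONE]) in at least one of the two queries."""
    active = [s for s in states_x
              if states_x[s] != {"[NONE]"} or states_y[s] != {"[NONE]"}]
    if not active:
        return 0.0
    return sum(jaccard(states_x[s], states_y[s]) for s in active) / len(active)

# toy schema states for two queries over the same schema slots
ox = {"teacher.name": {"SELECT"}, "teacher.age": {"ORDER BY"}}
oy = {"teacher.name": {"[NONE]"}, "teacher.age": {"SELECT", "ORDER BY"}}
score = sql_semantic_similarity(ox, oy)
```

Here the two active slots contribute Jaccard scores 0 and 1/2 respectively, so the semantic similarity is 0.25.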
Overall, we define the final similarity score of two SQL queries o_x and o_y as:

f(o_x, o_y) = λ f_sem(o_x, o_y) + (1 − λ) f_tree(o_x, o_y)

where λ is a hyper-parameter controlling the impact of the two kinds of similarity.
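A minimal Weisfeiler-Lehman sub-tree kernel and the λ-weighted combination can be sketched as follows. The string-relabeling scheme, the toy trees and the histogram dot-product base kernel are simplifying assumptions for illustration, not the paper's implementation:

```python
from collections import Counter

def wl_labels(tree: dict, labels: dict, h: int):
    """Yield node-label multisets for h Weisfeiler-Lehman refinement rounds.
    tree: node -> list of child nodes; labels: node -> initial label."""
    cur = dict(labels)
    yield Counter(cur.values())                       # height-0 labels
    for _ in range(h):
        # relabel each node with its own label plus its sorted child labels
        cur = {n: cur[n] + "(" + ",".join(sorted(cur[c] for c in tree.get(n, ()))) + ")"
               for n in cur}
        yield Counter(cur.values())

def wl_kernel(tx, lx, ty, ly, h=2):
    """K_WL = sum over heights of the base kernel (label-histogram dot product)."""
    k = 0
    for cx, cy in zip(wl_labels(tx, lx, h), wl_labels(ty, ly, h)):
        k += sum(cx[lab] * cy[lab] for lab in cx)
    return k

def sql_similarity(sem, tx, lx, ty, ly, lam=0.5, h=2):
    """f = lam * f_sem + (1 - lam) * f_tree, with f_tree cosine-normalized to [0, 1]."""
    kxy = wl_kernel(tx, lx, ty, ly, h)
    norm = (wl_kernel(tx, lx, tx, lx, h) * wl_kernel(ty, ly, ty, ly, h)) ** 0.5
    return lam * sem + (1 - lam) * (kxy / norm if norm else 0.0)

# two toy SQL trees: SELECT -> {column, ORDER BY -> LIMIT}, differing in the column
tx = {"r": ["a", "b"], "b": ["c"]}
lx = {"r": "SELECT", "a": "name", "b": "ORDER BY", "c": "LIMIT"}
ty = {"r": ["a", "b"], "b": ["c"]}
ly = {"r": "SELECT", "a": "age", "b": "ORDER BY", "c": "LIMIT"}
s = sql_similarity(0.25, tx, lx, ty, ly)
```

Structurally identical trees give f_tree = 1, so two queries with identical semantics and structure score exactly 1.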
Weighted Contrastive Loss

After obtaining the SQL similarity, we employ weighted contrastive learning (Oord et al., 2018) to pull together semantically similar NL utterances and push apart the representations of semantically dissimilar NL utterances within each conversation. We first convert the input sequence into a sequence of contextualized vectors h_t = [h^1_t, . . ., h^L_t], where L represents the length of the input sequence. We then leverage an attention mechanism to learn the input representation ĥ_t:

ĥ_t = Σ_j softmax_j(v_3^T tanh(W_3 h^j_t)) h^j_t

where v_3 and W_3 are trainable parameters. Specifically, we minimize a weighted contrastive loss function L_UDT to optimize the network:

L_UDT = − Σ_{x∈D} Σ_{p∈D^+_x} f(o_x, o_p) log ( exp(sim(ĥ_x, ĥ_p)/τ) / Σ_{a∈D^−_x} exp(sim(ĥ_x, ĥ_a)/τ) )

where τ is a temperature hyper-parameter and sim(·, ·) is a similarity function (e.g., cosine similarity) between representations. D = {1, . . ., N} denotes the index set of the training utterances. D^+_x denotes the index set of positive utterances that co-occur in the same conversation with utterance x. D^−_x denotes the index set of positive utterances other than x and p, plus negative utterances chosen from other conversations.
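The weighted contrastive objective can be sketched in NumPy: an InfoNCE term per positive, scaled by its SQL-similarity weight. The random representations, cosine similarity and example weights are toy assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
W3 = rng.normal(size=(d, d)); v3 = rng.normal(size=(d,))

def attentive_rep(H):
    """Pool the token vectors H = (L, d) of one utterance into a single vector."""
    scores = np.tanh(H @ W3.T) @ v3
    alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()
    return alpha @ H

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def udt_loss(reps, anchor, pos, neg, weights, tau=0.1):
    """Weighted InfoNCE: positives are in-conversation utterances weighted by
    their SQL similarity to the anchor; negatives come from other conversations."""
    loss = 0.0
    for p, w in zip(pos, weights):
        logits = np.array([cos(reps[anchor], reps[p]) / tau] +
                          [cos(reps[anchor], reps[n]) / tau for n in neg])
        # log-softmax of the positive against the negatives (stable logsumexp)
        log_prob = logits[0] - (logits.max() + np.log(np.exp(logits - logits.max()).sum()))
        loss += -w * log_prob
    return loss / len(pos)

# pooled representations of 5 utterances (6 tokens each)
reps = {u: attentive_rep(rng.normal(size=(6, d)))
        for u in ["anchor", "u2", "u3", "n1", "n2"]}
loss = udt_loss(reps, "anchor", pos=["u2", "u3"], neg=["n1", "n2"],
                weights=[0.9, 0.3])
```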

Masked Language Modeling
In order to jointly learn the contextual representations of utterances and schema, we retain the masking mechanism in the pre-training stage. Concretely, given the input I_t (defined in Eq. 1) of the t-th turn, masked language modeling (MLM) selects a random set of positions, replaces these positions with [MASK], and then learns to predict the original tokens at the masked-out positions. We follow the hyperparameters of prior work (Devlin et al., 2019), which randomly masks utterance and schema tokens with a 15% probability. We denote the MLM loss as L_MLM, which is computed by minimizing the cross-entropy on the masked tokens.
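The masking step can be sketched as follows (the special-token handling and the example token sequence are illustrative assumptions; BERT's additional 80/10/10 replacement scheme is omitted for brevity):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Randomly replace mask_prob of the positions with [MASK]; return the
    corrupted sequence and the (position -> original token) targets."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):        # never mask special tokens
            continue
        if rng.random() < mask_prob:
            corrupted[i] = mask_token
            targets[i] = tok
    return corrupted, targets

# utterance tokens followed by schema tokens, as in the input I_t
tokens = "[CLS] show all campuses opened in 2000 [SEP] campuses . year [SEP]".split()
corrupted, targets = mask_tokens(tokens, seed=1)
```

The model is then trained with cross-entropy to recover `targets` from `corrupted`.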

Joint Pre-training Objective
In this paper, we combine three pre-training objectives to learn a pre-training framework for context-dependent text-to-SQL parsing. Instead of combining the objectives by simply performing a weighted linear sum of the individual losses, we jointly learn the three objectives by considering the homoscedastic uncertainty of each objective (Kendall et al., 2018). In this way, we avoid the huge expense of tuning weight hyper-parameters. We define the joint loss function based on homoscedastic uncertainty as:

L = (1/(2σ_1^2)) L_SST + (1/(2σ_2^2)) L_UDT + (1/(2σ_3^2)) L_MLM + log σ_1 + log σ_2 + log σ_3

where σ_1, σ_2, σ_3 represent the model's observation noise parameters, capturing how much noise we have in the outputs.
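The uncertainty weighting can be sketched as a small function over the three losses; parameterizing log σ_k (a common numerical-stability choice, assumed here) keeps the learned noise strictly positive:

```python
import math

def joint_loss(l_sst, l_udt, l_mlm, log_sigma):
    """Homoscedastic-uncertainty weighting (Kendall et al., 2018): each task
    loss is scaled by 1 / (2 * sigma_k^2), and a log(sigma_k) regularizer
    keeps sigma_k from growing unboundedly. log_sigma holds the three
    learnable log-noise parameters."""
    total = 0.0
    for l_k, ls_k in zip([l_sst, l_udt, l_mlm], log_sigma):
        precision = math.exp(-2.0 * ls_k)      # 1 / sigma_k^2
        total += 0.5 * precision * l_k + ls_k
    return total

# with all sigma_k = 1 this reduces to half the plain sum of the losses
total = joint_loss(1.2, 0.8, 2.5, log_sigma=[0.0, 0.0, 0.0])
```

During training the `log_sigma` entries are optimized jointly with the model parameters, so noisier objectives are automatically down-weighted.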

Data Construction for Pre-training
The cost of expensive SQL annotation poses a challenge to the construction of large-scale pre-training data. Previous works (Yu et al., 2021a,b) resort to data augmentation to address this issue. Typically, in a conversational setting, context-dependent data augmentation requires two steps: (1) a single-turn context-free grammar for utterance-SQL pair generation, and (2) a follow-up context-free grammar to expand single-turn data into context-dependent conversations. SCORE synthesized a total of 435k text-to-SQL conversations following this setup, and we notice two limitations with it. Firstly, it relies on template filling to convert SQL into utterances, resulting in rather rigid generated utterances in step (1). Secondly, SPARC is the only data resource employed to induce the follow-up context-free grammar in step (2). Nevertheless, the contextual diversity in SPARC is insufficient to simulate complex contextual dependencies.
To this end, we propose a new pre-training data construction method. Inspired by the SNOWBALL framework (Shu et al., 2021), we harness a generative model, i.e., BART, to bring more diversity to the generated utterances. For the follow-up conversational context-free grammar induction, we consider both the COSQL and SPARC datasets and manually craft 100 templates. Overall, we synthesize a new large-scale pre-training dataset that consists of about 480K high-quality context-dependent text-to-SQL conversations. We provide examples of the induced grammar rules and the synthesis procedure in Appendix D.

Experimental Setup
Downstream Datasets We evaluate STAR on two context-dependent semantic parsing benchmarks: SPARC (Yu et al., 2019b) and COSQL (Yu et al., 2019a). SPARC is a cross-domain context-dependent dataset, which consists of about 4.3k question sequences and 12k+ individual questions annotated with SQL queries. COSQL is a conversational text-to-SQL corpus, which contains about 3k dialogues and 10k+ annotated SQL queries. Both SPARC and COSQL query 200 complex databases spanning 138 domains. We provide more detailed statistics of these two datasets in Appendix B.

Evaluation Metrics
We employ two official evaluation metrics (Yu et al., 2019b,a) to verify the effectiveness of STAR: question match accuracy (QM) and interaction match accuracy (IM). Concretely, QM denotes the exact set match accuracy over individual questions, and IM denotes the ratio of interactions in which all questions are correctly predicted.
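The two metrics can be sketched as follows; for brevity, the official exact-set match over SQL components is simplified here to string equality, which is an illustrative assumption:

```python
def question_match(pred_sqls, gold_sqls):
    """QM: fraction of individual questions whose predicted SQL matches the
    gold SQL (exact-set match simplified to string equality)."""
    correct = sum(p == g for p, g in zip(pred_sqls, gold_sqls))
    return correct / len(gold_sqls)

def interaction_match(interactions):
    """IM: fraction of interactions in which *every* question is correct.
    interactions: list of (pred_sqls, gold_sqls) pairs, one per conversation."""
    full = sum(all(p == g for p, g in zip(ps, gs)) for ps, gs in interactions)
    return full / len(interactions)

# two toy interactions: the second has one wrong question
inter = [(["A", "B"], ["A", "B"]), (["A", "X"], ["A", "B"])]
qm = question_match([s for ps, _ in inter for s in ps],
                    [s for _, gs in inter for s in gs])
im = interaction_match(inter)
```

Note that IM is strictly harder than QM: a single wrong question invalidates the whole interaction, which is why IM scores drop much faster at later turns.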
Implementation Details In pre-training, STAR is initialized with ELECTRA (Clark et al., 2020). Similar to ELECTRA, we also employ the replaced token detection objective to further improve the text-to-SQL pre-training. The maximum length of each input sequence is set to 256. The batch size is set to 80 and the Adam optimizer (Kingma and Ba, 2015) is employed for optimization with an initial learning rate of 1e-6. Gradient clipping is applied to STAR with a maximum gradient value of 1. For computing the SQL similarity, the impact factor λ is set to 0.5. We provide more implementation details in Appendix A.

Model Comparison on Downstream Tasks
In the experiments, we choose LGESQL (Cao et al., 2021) as our base model given its superior performance. Since LGESQL was originally developed for the single-turn setting, we extend it to the context-dependent setting by taking as input the concatenation of historical and current utterances. For a fair comparison, the four compared PLMs also leverage LGESQL as the base model. The experimental results on SPARC and COSQL are summarized in Table 1. STAR outperforms all the compared methods on the two datasets by a noticeable margin. First, STAR achieves substantially better results than the four strong PLMs. In particular, STAR surpasses the well-known SCORE by 7.4% QM score and 7.5% IM score on the COSQL dev set. Second, LGESQL+STAR achieves better results than the compared downstream methods which use BERT, ROBERTA, SCORE or GRAPPA as the PLMs, such as the best-performing baseline HIE-SQL+GRAPPA.

Effectiveness of Pre-training Objectives

We conduct an ablation test to investigate the effectiveness of each pre-training objective in STAR. We report the results of removing the MLM loss (called w/o MLM), the SST loss (called w/o SST), the UDT loss (called w/o UDT), and both the SST and UDT losses (called w/o SST+UDT), respectively. Table 2 shows the ablation test results on both SPARC and COSQL. We observe that removing the SST or UDT objective brings the most significant performance drop. Not surprisingly, combining all three objectives achieves the best results on both datasets.

Effectiveness of SQL Similarity Metrics
To analyze the impact of the metrics for calculating SQL similarity in STAR, we also conduct an ablation test by removing the structural similarity metric (called w/o structural), the semantic similarity metric (called w/o semantic), and both (called w/o UDT), respectively. Table 3 shows the ablation test results on the dev sets of SPARC and COSQL. As expected, both similarity metrics contribute substantially to the improvements of STAR.

Effectiveness of Synthesized Pre-training Data
We also analyze the quality of our constructed pre-training data. We compare our pre-training data with the data created by SCORE (Yu et al., 2021b), using MLM + L_SST (denoted as STAR w/ MLM + SST) as the pre-training objectives in the experiments. As shown in Table 4, our pre-training data is more effective than the pre-training data created by SCORE.

Discussion
Model Comparison on Samples with Different Levels of Difficulty The SQL queries in both SPARC and COSQL can be divided into four levels based on their difficulty: easy, medium, hard and extra hard, which can be used to better evaluate model performance on different queries. As shown in Figure 4a-b, STAR achieves better results than the compared methods on all four kinds of data, even on the extra hard samples.

Model Comparison on Samples at Different Turns

Figure 4c-d illustrate the QM results of STAR and the compared methods as the number of conversation turns increases on the SPARC and COSQL dev sets. The QM results of the baselines decrease sharply as the conversation turns increase, while STAR achieves much more stable performance even for the third and fourth turns. This suggests that STAR can better track and explore the interaction states in history utterances to help the model better parse the current utterance.

Case Study
To evaluate STAR qualitatively, we choose two exemplary conversations from the CoSQL dev set and illustrate the SQL queries generated by SCORE and STAR in Figure 5. In the first case, we observe that STAR can exploit the table information in history queries (e.g., [car_names.Model]) to correctly generate the third SQL query, while SCORE fails to track this kind of schema state. In the second case, STAR successfully tracks the long-term utterance dependency between the first and fourth utterances, and generates the correct SQL keyword [SELECT COUNT(*)] in the fourth SQL query by tracking and referring to the query "the number of" in the second utterance. However, SCORE fails to track such a long-term dependency, being disturbed by the third utterance.

Limitation Analysis
To better analyze the limitations of STAR, we carry out an analysis of the errors made by STAR on the CoSQL dev dataset. We identify several causes of errors, which can be divided into the following categories. First, STAR fails to select the correct names from table schemas in some hard or extra hard samples, where NL questions use synonyms to refer to tables or columns in SQL queries without an explicit correspondence between the NL questions and table schemas. One possible solution is to exploit the rich semantic information contained in PLMs to capture the implicit schema linking information via knowledge probing techniques. Second, for some samples, STAR incorrectly inherits part of the previous turn's SQL query. One possible solution is to design an additional classifier to predict the changes (e.g., RETAIN, MODIFY, DELETE) between the schema state of the current turn and that of the previous turn. Third, there are some SQL grammar errors, such as redundancy of the [WHERE] clause, repetition of table names, and structure errors of [SELECT NEST]. The reason may be that the schema state tracking objective only tracks the state of the database schema in the conversation and does not consider the overall grammatical structure of SQL queries. One possible remedy is to add an extra objective to predict the general structure of SQL (e.g., the abstract syntax tree) so as to capture the overall grammatical structure of SQL queries.
Related Work

Context-dependent Text-to-SQL Parsing

In particular, Zhang et al. (2019) exploited the conversation history by editing the previously predicted SQL query to improve generation quality. The schema interaction graph in IGSQL (Cai and Wan, 2020) and the two kinds of interaction states in IST-SQL (Wang et al., 2021) are designed to capture the historical schema evolution in context. Furthermore, Zheng et al. (2022) improve contextual accuracy by incorporating additional SQL encoders to integrate historical SQL into the input. In contrast to the above works, STAR focuses on the pre-training stage, expecting to extract general knowledge from large-scale unsupervised or self-supervised data that is useful for downstream parsing tasks.

Pre-training Models for Text-to-SQL Parsing
In parallel, tabular language models (TaLMs) have been proposed to simultaneously encode tables and texts, which further improves the results of downstream text-to-SQL parsing tasks. For example, TABERT (Yin et al., 2020) and TAPAS (Herzig et al., 2020) jointly encoded texts and tables with self-supervised or weakly-supervised objectives, trained on a large corpus of tables. STRUG (Deng et al., 2021) proposed a structure-grounded pre-training technique, and GAP (Shi et al., 2021) introduced a generation-augmented pre-training framework to capture the alignment between utterances and tables. Similarly, GRAPPA (Yu et al., 2021a) introduced a grammar-augmented pre-training framework for text-to-SQL parsing, which explored schema linking by encouraging the model to identify table schema components that can be grounded to logical form constituents. SCORE (Yu et al., 2021b) was the state-of-the-art pre-training approach for context-dependent text-to-SQL parsing, designed to induce representations that capture the switches between adjacent turns. Unlike these TaLMs, STAR is the first to leverage both historical SQL and complex utterance dependency in the pre-training stage.

Conclusion
In this paper, we proposed STAR, a pre-trained TaLM which jointly learns user utterance and table schema representations for context-dependent text-to-SQL parsing via two novel pre-training objectives, schema state tracking and utterance dependency tracking. We also constructed a large-scale context-dependent text-to-SQL conversation corpus to pre-train STAR. Extensive experiments demonstrated that STAR achieves new state-of-the-art results on two downstream benchmarks, SPARC and COSQL.

A More Implementation Details
In the pre-training, STAR is initialized with ELECTRA (Clark et al., 2020). Similar to ELECTRA, which consists of a generator G and a discriminator D, we also employ the replaced token detection objective to further improve the text-to-SQL pre-training. Concretely, given the input I_t (defined in Eq. 1) of the t-th turn, the generator with masked language modeling (MLM) selects a random set of positions and replaces these positions with [MASK], and then learns to predict the original tokens at the masked-out positions. The chance of each token being masked out is 15%. We denote the loss function of the generator as L_MLM. In addition, we train the discriminator to predict whether each token is the same as the original token. We denote the loss function for training the discriminator as L_Dis. Finally, we combine the loss functions of the generator G and the discriminator D to form the overall objective for replaced token detection (RTD):

L_RTD = γ L_MLM + L_Dis

where γ is a hyperparameter controlling the impact of L_MLM. We refer the readers to Clark et al. (2020) for the implementation details of the RTD objective. In this work, the impact factor γ is set to 5. Our codebase is built on the huggingface library (Wolf et al., 2019).
We use LGESQL as our downstream model. For a fair comparison, all LGESQL experiments are trained for 100 epochs. The learning rate is 1e-4 and the weight decay is 0.1. We adopt a more careful optimization for our STAR encoder, with a layer-wise learning rate decay coefficient of 0.8. The batch size is 10 and the maximum gradient norm is 5. Other hyperparameters are the same as in (Cao et al., 2021).

B Details of SPARC and COSQL
We evaluate the effectiveness of STAR on two context-dependent text-to-SQL parsing benchmarks: SPARC (Yu et al., 2019b) and COSQL (Yu et al., 2019a). Concretely, SPARC is a cross-domain context-dependent dataset, which consists of about 4.3k question sequences and 12k+ individual questions annotated with SQL queries. COSQL is a conversational text-to-SQL corpus, which contains about 3k dialogues and 10k+ annotated SQL queries. Both SPARC and COSQL query 200 complex databases spanning 138 domains. Table 5 reports the statistics of the SPARC and COSQL datasets in detail.

C Generalization of STAR
We also evaluate the generalization of our pre-training objectives by using ROBERTA rather than ELECTRA as the initialization model. The experimental results are shown in Table 6. In a similar trend, STAR initialized with ROBERTA performs significantly better than the original ROBERTA, which to some extent verifies the generalization of the proposed pre-training objectives, regardless of the initialization model used to train STAR.

D Details of Data Construction
In this paper, we synthesize a new large-scale pre-training dataset which consists of about 480K high-quality context-dependent text-to-SQL conversations. Specifically, we first generate single-turn question-SQL pairs by exploiting the SPIDER, SPARC and COSQL datasets.

D.1 Single-turn Question-SQL Pairs
To obtain sufficient high-quality single-turn question-SQL pairs, we carefully examine currently available sources and generate question-SQL pairs from the SPIDER, SPARC and COSQL datasets. Specifically, we collect the original single-turn question-SQL pairs from SPIDER, which is one of the largest single-turn cross-domain text-to-SQL corpora. For the context-dependent text-to-SQL datasets SPARC and COSQL, we generate a new question for each SQL query instead of using the original NL questions, since the latter may contain ellipsis and anaphora that refer to earlier items in the conversations, resulting in low-quality question-SQL pairs. In particular, we employ the SNOWBALL framework (Shu et al., 2021) with BART to generate the question based on each SQL query, which uses an iterative training procedure that recursively augments the training set with quality control.

D.2 Context-dependent Text-to-SQL Conversations
To expand the single-turn question-SQL pairs into context-dependent text-to-SQL conversations, we first convert SQL queries into their structured formats. Then, following (Yu et al., 2021b), we study 600 examples from the training sets of both the SPARC and CoSQL datasets, and induce about 100 follow-up question-grammar templates. Each template consists of a pair of (i) a context-free question template (e.g., "Could you please tell me the [COLUMN0] of those?") where the typed slot [COLUMN0] represents a schema mention, and (ii) its corresponding operation grammar (e.g., "replace select column") that contains the context switch labels of the question template.
Finally, for a single-turn question-SQL pair constructed in Section D.1 with database d, we randomly choose a created question-grammar template. We sample the values for the typed slots in the template and obtain the synthesized NL question as well as its corresponding SQL query if the previous SQL query satisfies the constraints in the sampled template (e.g., the SQL query contains the mentioned schema); otherwise, another question-grammar template is sampled until we successfully synthesize the next question-SQL pair. Then, we treat the synthesized question-SQL pair as a new start and repeat the above process until we obtain a context-dependent text-to-SQL conversation consisting of T turns of question-SQL pairs. Figure 6 shows an example of a synthetic text-to-SQL conversation with five turns.
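The expansion loop above can be sketched as follows. The two operation names, the slot-filling rule and the SQL rewrites are toy assumptions of ours (the paper induces about 100 templates with constraint checks; only two simplified operations are shown here):

```python
import random

def expand_conversation(seed_pair, templates, schema_columns, turns=3, seed=0):
    """Grow a single-turn (question, sql) pair into a multi-turn conversation
    by repeatedly sampling a follow-up template and filling its typed slots."""
    rng = random.Random(seed)
    conv = [seed_pair]
    for _ in range(turns - 1):
        question_tpl, sql_op = rng.choice(templates)
        column = rng.choice(schema_columns)
        question = question_tpl.replace("[COLUMN0]", column)
        prev_sql = conv[-1][1]
        if sql_op == "replace select column":
            # swap the projected column of the previous SQL query
            head, _, tail = prev_sql.partition(" FROM ")
            new_sql = f"SELECT {column} FROM {tail}"
        else:  # "add order by" (the only other toy operation here)
            new_sql = f"{prev_sql} ORDER BY {column}"
        conv.append((question, new_sql))
    return conv

templates = [
    ("Could you please tell me the [COLUMN0] of those?", "replace select column"),
    ("Sort them by [COLUMN0].", "add order by"),
]
conv = expand_conversation(
    ("Which campuses opened in 2000?",
     "SELECT campus FROM campuses WHERE year = 2000"),
    templates, ["campus", "year", "location"], turns=3, seed=0)
```

Each synthesized turn is then treated as the new starting point, mirroring the iterative procedure described above.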

Figure 1 :
Figure 1: An example of a cross-domain context-dependent text-to-SQL conversation. Here, each database schema refers to the table/column names of databases and each schema state refers to a slot-value pair, whose slot is a column/table name (e.g., Degrees.campus) and whose value is a SQL keyword (e.g., SELECT). "x" indicates that the semantic intent is switched between the Turn 2 and Turn 3 utterances.

Figure 2 :
Figure 2: The overview of the proposed STAR framework consisting of two novel pre-training objectives: (a) the utterance dependency tracking and (b) the schema state tracking. For brevity, we do not show the masked language modeling objective here.

Figure 3 :
Figure 3: Two metrics for calculating SQL similarity, including semantic similarity and structure similarity.

Figure 4 :
Figure 4: The results of STAR and baselines on SPARC and COSQL dev sets (a-b) by varying the difficulty levels of the data and (c-d) by varying the conversation turns.

Figure 5 :
Figure 5: Two cases on the COSQL dev dataset.

Figure 6 :
Figure 6: An example of synthetic text-to-SQL conversation.


Table 1 :
Experimental results of various methods in terms of question match (QM) accuracy and interaction match (IM) accuracy on both SPARC and COSQL datasets."-" means that the test results are not accessible since the test accuracy needs to be officially evaluated and only two models can be submitted every two months.

Table 2 :
Ablation study of STAR in terms of question match accuracy (QM) and interaction match accuracy (IM) on the dev sets of both SPARC and COSQL.

Table 3 :
Results of STAR on the dev sets of SPARC and COSQL by using different metrics for calculating SQL similarity.

Table 4 :
Results of STAR on the dev set of COSQL with MLM and SST objectives (STAR w/ MLM + SST) by using different pre-training data.

Table 5 :
Details of SParC and CoSQL Dataset.

Table 6 :
Results of STAR initialized with ROBERTA on the dev sets of both SPARC and COSQL.