Augmenting Multi-Turn Text-to-SQL Datasets with Self-Play

The task of context-dependent text-to-SQL aims to convert multi-turn user utterances to formal SQL queries. This is a challenging task because training data from which to learn complex contextual dependencies is scarce and because models must generalize to unseen databases. In this paper we explore augmenting the training datasets using self-play, which leverages contextual information to synthesize new interactions and adapt the model to new databases. We first design a SQL-to-text model conditioned on a sampled goal query, which represents a user's intent; this model then converses with a text-to-SQL semantic parser to generate new interactions. We then filter the synthesized interactions and retrain the models with the augmented data. We find that self-play improves the accuracy of a strong baseline on SParC and CoSQL, two widely used cross-domain text-to-SQL datasets. Our analysis shows that self-play simulates various conversational thematic relations, enhances cross-domain generalization, and improves beam search.


Introduction
Multi-turn text-to-SQL translation is a powerful semantic parsing paradigm that converts natural language user utterances into executable SQL queries in a conversational environment. Compared to regular text-to-SQL tasks such as Spider (Yu et al., 2018b) and GeoQuery (Zelle and Mooney, 1996), conversational text-to-SQL requires interpreting coreference and omission phenomena that frequently appear in human conversations. To be effective, text-to-SQL models must uncover complex contextual dependencies while grounding user utterances in task-specific database schemas.
Numerous architectures and pretraining methods have been proposed for tackling context-dependent text-to-SQL (Suhr et al., 2018; Zhang et al., 2019; Hui et al., 2021; Scholak et al., 2021; Yu et al., 2021; Xie et al., 2022). However, the size of the datasets used has been limited due to the high cost of annotating multi-turn dialogue and SQL pairs, which often requires trained experts. Existing multi-turn text-to-SQL datasets, such as SParC (Yu et al., 2019b) and CoSQL (Yu et al., 2019a), require text-to-SQL parsers to generalize to unseen databases at test time, but doing so is difficult with limited training context.
In this paper we propose the use of self-play to augment multi-turn text-to-SQL datasets in order to achieve more robust generalization. Self-play simulates interactions between multiple artificial agents in order to generate a training signal in addition to supervised data. It has been successfully applied in a wide range of tasks, e.g. board games (Silver et al., 2016, 2018) and multiplayer battle games (Vinyals et al., 2019; Berner et al., 2019). It has also been applied in dialogue simulations, during which a dialogue model converses with a user simulator to generate synthetic dialogues (Schatzmann et al., 2006; Gür et al., 2018; Tseng et al., 2021). In our work, we extend self-play to semantic parsing.
Although self-play has been adopted in task-oriented dialogue, the need to pre-define a domain-specific ontology of slot-value pairs (e.g. the slot value "price=expensive" for a restaurant booking) (Henderson et al., 2014; Wen et al., 2016; Budzianowski et al., 2018) prevents self-play from simulating interactions in a new domain. Adding a new domain for task-oriented dialogue is difficult and labor-intensive. On the other hand, text-to-SQL tasks (Yu et al., 2018b, 2019b,a) use a domain-independent formalism, i.e. SQL queries. We demonstrate that self-play is well-suited to simulating interactions in a new domain given a database schema, improving cross-domain generalization.
Figure 1: Multi-turn text-to-SQL with self-play. We transform an interaction from SParC on the left to seq2seq formats (top: text-to-SQL, bottom: SQL-to-text). User utterances, SQL queries, databases, and user goals are concatenated with a " | " symbol and shown in green, blue, yellow, and purple respectively. We use self-play to generate synthetic interactions. The synthetic interactions are filtered and used to retrain the text-to-SQL and SQL-to-text models.

We use PICARD (Scholak et al., 2021) as the base of our text-to-SQL model. When generating a new interaction, we first sample a SQL query with the method of Zhong et al. (2021) as the goal query and condition the SQL-to-text model on this sampled SQL. The text-to-SQL model converses with the SQL-to-text model to simulate a new interaction. We filter out the interactions that are not grounded to the sampled goals and employ self-training (Yarowsky, 1995; Zoph et al., 2020) to retrain the text-to-SQL model and the SQL-to-text model. We conduct extensive experiments on SParC and CoSQL. Our main findings are:
• Self-play helps the text-to-SQL model learn various conversational thematic relations (§5.3) and improves cross-domain generalization (§5.1).
• Self-play improves the performance on the majority of SQL types. Models after self-play perform particularly well on queries of medium difficulty (§5.1).
• Self-play improves beam search. Models after self-play are less sensitive to the beam size and can perform well with even small beam sizes (§5.2).

Preliminary
In this section, we formally define the multi-turn text-to-SQL task and introduce the PICARD (Scholak et al., 2021) model, which we use as our baseline. PICARD obtains state-of-the-art results on several text-to-SQL tasks.

Task Definition
In context-dependent text-to-SQL tasks, we are given interactions between a user and a system. Each interaction spans multiple turns. The user ends the interaction when the query returns the required information from the database. Formally, at each turn $t$ (where $1 \le t \le T$), multi-turn text-to-SQL produces a valid and executable SQL query $Q_t$ given a database $D$, a current user utterance $U_t$, and a dialogue context $C_t$ (which is usually the previous user utterances $U_{<t}$):

$$p(Q_t \mid U_t, C_t, D) \tag{1}$$

Baseline: PICARD
We use PICARD (Scholak et al., 2021) as our baseline conditional model for Equation 1. PICARD serializes the database schema $D$ into a sequence following Lin et al. (2020). An example of the input and output format is shown in Figure 1. PICARD finetunes T5 (Raffel et al., 2019), a sequence-to-sequence transformer, on these input and output sequences. PICARD proposes an incremental parsing method for constrained decoding during beam search. Specifically, it rejects inadmissible tokens at each beam search step subject to parsing rules that encode lexical and grammatical constraints. Only the beam hypotheses that pass all the constraint checks are kept. PICARD also leverages SQL schema information, such as the column names of each table, to impose checks on the validity of the generated SQL. PICARD greatly reduces the likelihood of decoding invalid SQL queries.
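The pruning idea described above can be illustrated with a toy sketch. This is not PICARD's actual incremental parser: the function names `prune_beam` and `make_schema_check`, the tiny keyword set, and the token-level vocabulary check are all assumptions made for illustration. The sketch only shows the general pattern of discarding hypotheses that fail a validity check before keeping the best-scoring ones.

```python
from typing import Callable, Iterable, List, Tuple

Hypothesis = Tuple[float, List[str]]  # (log-probability score, tokens so far)

# Hypothetical, deliberately tiny set of SQL tokens for this toy example.
SQL_KEYWORDS = {"select", "from", "where", "count", "(", ")", "*", ","}


def make_schema_check(schema_names: Iterable[str]) -> Callable[[List[str]], bool]:
    """Build a check that admits only SQL keywords and known schema names."""
    allowed = SQL_KEYWORDS | {name.lower() for name in schema_names}

    def is_admissible(tokens: List[str]) -> bool:
        return all(token.lower() in allowed for token in tokens)

    return is_admissible


def prune_beam(hypotheses: List[Hypothesis],
               is_admissible: Callable[[List[str]], bool],
               beam_size: int) -> List[Hypothesis]:
    """Drop inadmissible hypotheses, then keep the best `beam_size` by score."""
    kept = [h for h in hypotheses if is_admissible(h[1])]
    kept.sort(key=lambda h: h[0], reverse=True)
    return kept[:beam_size]
```

In this toy setting, a hypothesis that mentions a column missing from the schema is removed even if it has the highest score, which mirrors the role that the constraint checks play during beam search.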

Method
Here we introduce how we use self-play for data augmentation. We first design a SQL-to-text model (§3.1). Next, we describe how to use self-play to generate synthetic interactions (§3.2). Finally, we explain how we incorporate the generated data for self-training (§3.3).

The SQL-to-Text Model
We design a user simulator, which is a SQL-to-text model, to converse with the text-to-SQL model to generate synthetic interactions. Specifically, at each turn $t$ we would like the user simulator to produce a meaningful question that would naturally be asked by a human user. In each interaction, a user has a goal to achieve. We explicitly condition the SQL-to-text model on a user goal, $G$, to encourage the user simulator to ask questions that are grounded to this goal. Formally, the SQL-to-text model calculates the following conditional at each turn:

$$p(U_t \mid Q_{t-1}, C_t, G, D) \tag{2}$$

where the context $C_t$ contains the previous user utterances $U_{<t}$. During training, $G$ is the SQL query of the final turn $T$, i.e. $Q_T$. During inference we adopt the method of Zhong et al. (2021) to sample a new goal query, as described in §3.2.

We employ the seq2seq approach and parameterize the SQL-to-text model (Eq. 2) with T5. We concatenate the user goal $G$, the last SQL query $Q_{t-1}$, the previous user utterances $U_{<t}$, and the serialized schema $D$ to predict the next user utterance $U_t$. For example, one input would be: "user goal | previous utterances | last SQL query | serialized database". Its target label is the correct user utterance for the next turn. We append a special stop-of-interaction symbol to the last utterance. In SQL-to-text, there could be multiple reasonable questions to ask for the next turn, i.e. a one-to-many relation. A well-trained SQL-to-text model can generate new questions, thereby increasing the diversity of user dialogue flows in the dataset and improving generalization.
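The input format quoted above can be sketched as a small helper. This is a minimal illustration under stated assumptions: the function names, the `<eoi>` spelling of the stop-of-interaction symbol, the choice to drop empty fields at the first turn, and the example schema string are all hypothetical and are not taken from the actual implementation.

```python
from typing import List

SEP = " | "      # separator used to join fields, as described for Figure 1
STOP = "<eoi>"   # hypothetical spelling of the stop-of-interaction symbol


def build_sql_to_text_input(goal_sql: str, previous_utterances: List[str],
                            last_sql: str, serialized_schema: str) -> str:
    """Join goal, previous utterances, last SQL query, and schema in that order."""
    fields = [goal_sql, *previous_utterances, last_sql, serialized_schema]
    # Assumption: empty fields (e.g. no context at the first turn) are dropped.
    return SEP.join(field for field in fields if field)


def build_target(next_utterance: str, is_last_turn: bool) -> str:
    """Return the training target, marking the final utterance with STOP."""
    return f"{next_utterance} {STOP}" if is_last_turn else next_utterance
```

At the first turn the context and the last SQL query are empty, so under the assumption above the input reduces to the goal followed by the serialized schema.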

Self-Play
We pretrain both the text-to-SQL and SQL-to-text models on the gold training data by minimizing the negative log likelihood:

$$\mathcal{L} = -\sum_{i=1}^{N} \sum_{j=1}^{V} \log p\left(y^{i}_{j} \mid y^{i}_{<j}, x^{i}\right)$$

where $N$ is the number of training examples, $V$ is the sequence length, each $y^{i}_{j}$ is a token in the reference sequence, and $x^{i}$ is the corresponding input sequence.

With the models pretrained on the gold dialogues, we can generate synthetic interactions using self-play. First, we need to specify a SQL query as the eventual goal $G$ of the interaction. We adopt the query sampling method proposed in Zhong et al. (2021) for synthesizing a goal $G$. Zhong et al. (2021) first build and sample coarse SQL templates from the SQL queries in the training set by replacing the column and value mentions in the queries with typed slots. For example, SELECT T1.id, T2.name is converted to the template SELECT key1, text1. To adapt the models to an unseen environment, they sample an unseen database and fill in the typed slots with columns and values from the sampled database to form a new SQL query. We follow this approach to synthesize goals in new domains for cross-domain generalization. The complete sampling procedure is given in Appendix A.1.

We concatenate the sampled goal $G$ with an empty context and the serialized schema as shown in Eq. 2 and feed it into the SQL-to-text model to produce the first user utterance. Then, the text-to-SQL model and the SQL-to-text model continue the interaction with Eq. 1 and Eq. 2 until the end. A synthetic interaction ends whenever the SQL-to-text model decodes the stop-of-interaction symbol.
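The alternation between the two models can be sketched as a simple loop. In this illustration both models are passed in as plain callables standing in for Eq. 2 (`user_model`) and Eq. 1 (`parser`). The `<eoi>` symbol, the function name, and the `max_turns` safety cap are assumptions added for the sketch; the text above only specifies that an interaction ends when the stop-of-interaction symbol is decoded.

```python
from typing import Callable, List, Tuple

STOP = "<eoi>"  # hypothetical spelling of the stop-of-interaction symbol

# user_model(goal, previous_utterances, last_sql, schema) -> next utterance
UserModel = Callable[[str, List[str], str, str], str]
# parser(utterance, previous_utterances, schema) -> SQL query
Parser = Callable[[str, List[str], str], str]


def self_play_rollout(goal: str, schema: str, user_model: UserModel,
                      parser: Parser, max_turns: int = 10) -> List[Tuple[str, str]]:
    """Generate one synthetic interaction as a list of (utterance, SQL) turns."""
    utterances: List[str] = []
    turns: List[Tuple[str, str]] = []
    last_sql = ""  # empty at the first turn
    for _ in range(max_turns):  # assumed cap to guarantee termination
        raw = user_model(goal, list(utterances), last_sql, schema)
        finished = STOP in raw
        utterance = raw.replace(STOP, "").strip()
        last_sql = parser(utterance, list(utterances), schema)
        utterances.append(utterance)
        turns.append((utterance, last_sql))
        if finished:
            break
    return turns
```

The SQL query of the last returned turn plays the role of $Q_T$ when the interaction is later compared against the sampled goal.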
Filtering. Synthetic conversations generated by self-play may diverge from the sampled goals. To filter these low-quality conversations, we compare the generated SQL query $Q_T$ from the last turn $T$ with the sampled goal $G$ (see §3.2) using a similarity score $\mathrm{score}(Q_T, G)$. We follow Yu et al. (2018b) and decompose the SQL queries $Q_T$ and $G$ into SQL substructures $Q_T^{s}$ and $G^{s}$ (e.g. select, where, group_by, order_by statements) and calculate the accuracy on each substructure. We let $\mathrm{score}(Q_T, G)$ be the average of the accuracy over all the substructures. We keep a synthetic conversation if $\mathrm{score}(Q_T, G)$ is larger than a threshold value $w$. A high score means that the synthetic conversation is grounded to the sampled goal.
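The filtering rule can be sketched as follows, assuming each SQL query has already been decomposed into a mapping from clause name to a set of clause items. The decomposition itself, the binary per-clause accuracy, the choice to average over clauses present in either query, and the function names are simplifying assumptions; the component matching of Yu et al. (2018b) is more detailed. The default threshold of 0.5 matches the value of $w$ reported for the experiments.

```python
from typing import Dict, Set

Clauses = Dict[str, Set[str]]  # e.g. {"select": {"name"}, "where": {"age > 30"}}


def similarity_score(final_query: Clauses, goal: Clauses) -> float:
    """Average per-clause accuracy over clauses present in either query."""
    clause_names = set(final_query) | set(goal)
    if not clause_names:
        return 1.0  # two empty decompositions are treated as identical
    matches = sum(
        final_query.get(name, set()) == goal.get(name, set())
        for name in clause_names
    )
    return matches / len(clause_names)


def keep_interaction(final_query: Clauses, goal: Clauses, w: float = 0.5) -> bool:
    """Keep a synthetic interaction only if its score exceeds the threshold w."""
    return similarity_score(final_query, goal) > w
```

For instance, a final query that matches the goal on its select and where clauses but omits the goal's order_by clause scores 2/3 under these assumptions and is kept at $w = 0.5$.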

Self-Training
We retrain a new text-to-SQL model and a new SQL-to-text model with both the gold training data and the filtered synthetic interactions. Algorithm 2 shows the overall procedure. The complete self-play and self-training steps are shown in Figure 1. Our method is an instance of self-training, as the models are retrained with their own outputs. To retrain the text-to-SQL and SQL-to-text models, we can either combine the filtered synthetic data with the gold interactions, or pretrain on the synthetic interactions before fine-tuning on the gold interactions. We employ the second approach, as we observe that it performs slightly better than the first.
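The overall procedure, with the two-stage retraining schedule described above, can be sketched with hypothetical callables. Everything named here (`sample_goal`, `rollout`, `score`, `train`, and the single `model` object standing in for the pair of models) is a placeholder for illustration rather than an actual API.

```python
from typing import Any, Callable, List, Tuple


def self_training(gold: List[Any],
                  sample_goal: Callable[[], Any],
                  rollout: Callable[[Any, Any], Tuple[Any, Any]],
                  score: Callable[[Any, Any], float],
                  train: Callable[[Any, List[Any]], Any],
                  k: int, w: float = 0.5) -> Any:
    """Generate k synthetic interactions, filter them, and retrain in two stages."""
    model = train(None, gold)  # initial training on gold interactions
    synthetic: List[Any] = []
    for _ in range(k):
        goal = sample_goal()
        interaction, final_query = rollout(model, goal)  # self-play
        if score(final_query, goal) > w:                 # keep grounded dialogues
            synthetic.append(interaction)
    new_model = train(None, synthetic)   # stage 1: pretrain on synthetic data
    return train(new_model, gold)        # stage 2: fine-tune on gold data
```

Here `train(None, data)` is assumed to start from a fresh pretrained checkpoint, while `train(model, data)` continues from an existing model. Replacing the last two lines with a single call on `gold + synthetic` would give the first, combined strategy instead.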

Datasets and Main Results
In this section, we evaluate the performance of self-play on cross-domain multi-turn semantic parsing. We first introduce the datasets (§4.1), then detail the evaluation metrics (§4.2), and finally show the main results (§4.3).

Datasets
We evaluate our method on two large-scale benchmark datasets, SParC (Yu et al., 2019b) and CoSQL (Yu et al., 2019a). Table 1 summarises the statistics of the two datasets. Following PICARD, we additionally pretrain the text-to-SQL model on Spider (Yu et al., 2018b), a single-turn text-to-SQL dataset. All these datasets require generalization to new domains, as they use different databases for training, development, and testing in order to evaluate cross-domain performance. We discuss SParC and CoSQL in detail below.
SParC. SParC is a multi-turn text-to-SQL dataset that spans 200 databases whose tables cover 138 different domains. Each question in an interaction belongs to one of four thematic relations: refinement, theme-entity, theme-property, and answer refinement (Bertomeu et al., 2006). For example, given the question "Which major has the fewest students?", the next question can be a "refinement" query, "What is the most popular one?", which asks for the same entity as the previous question but with a different constraint.
CoSQL. CoSQL is the dialogue version of SParC. In CoSQL, besides a SQL query, the system also generates a natural language response. It is collected with the Wizard-of-Oz setting (Budzianowski et al., 2018). The dataset is used for three tasks: state-tracking, user act prediction, and response generation. We use this dataset for state-tracking, where the goal is to map user utterances into a SQL query at each turn.

Settings and Evaluation Metrics
Following Yu et al. (2018b), we measure the performance with question match (QM) and interaction match (IM), both of which are based on the exact set match accuracy. The exact set match is computed by decomposing the predicted SQL queries into clauses such as SELECT, WHERE, and GROUP BY and calculating the set matching score on each. QM is 1 if the exact set match for a question in an interaction is 1. IM is 1 if the exact set matches for all questions in an interaction are 1. The amount of self-play generated training data is 100,000 for both SParC and CoSQL before filtering, and 49,623 (SParC) and 48,291 (CoSQL) after filtering. Appendix A.2 shows implementation details of our experiments.
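Given per-question exact set match outcomes, the two metrics can be aggregated as in the sketch below. It assumes the exact set match has already been computed for every question and is supplied as one list of booleans per interaction; the function names are illustrative.

```python
from typing import List


def question_match(interactions: List[List[bool]]) -> float:
    """Fraction of questions whose predicted SQL is an exact set match."""
    outcomes = [match for interaction in interactions for match in interaction]
    return sum(outcomes) / len(outcomes)


def interaction_match(interactions: List[List[bool]]) -> float:
    """Fraction of interactions in which every question is an exact set match."""
    return sum(all(interaction) for interaction in interactions) / len(interactions)
```

A single wrong turn leaves QM largely intact but sets IM to 0 for that interaction, which is why IM is the stricter of the two metrics.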

Main Results
We report the main results in Table 2. We observe that the configuration "w/ PICARD w/ self-play" achieves the best results on both datasets (measured by QM and IM). This demonstrates the benefit of self-play. The improvement brought by self-play is more salient on SParC than on CoSQL. Moreover, T5-Large w/ PICARD w/ self-play outperforms the vanilla T5-3B reported by Scholak et al. (2021). Therefore, we conclude that self-play is an effective data augmentation method to improve performance on cross-domain context-dependent text-to-SQL. Appendix A.3 shows the system's performance under different configurations of the generated synthetic data.

Analysis
In this section we take SParC and systematically analyze the effect of self-play. First, to gain more insight into how a question's position or the query template affects the models, we examine self-play performance stratified by turn number and SQL template in §5.1. Then, we study whether self-play improves decoding during beam search (§5.2). We further conduct a case study of self-play interactions in §5.3.

Turn and Template Analysis
We first plot the distribution of interaction lengths in Figure 2. Self-play produces shorter interactions with a mean length of 2.53, whereas the mean of the training data is 2.97. Figure 3 shows Question Match (QM) accuracy stratified by question turns. The performance after self-play increases for turn numbers up to 3 and decreases for turn number 4. This is because self-play does not generate enough long interactions, as shown in Figure 2.
Next we compare the performance of the models with and without self-play stratified by the difficulty of the SQL template. We first convert SQL queries into templates using the method in Zhong et al. (2021). As a measure of the overlap between the self-play and training templates, 85% of self-play templates occur among the training templates; that is, 15% of self-play templates are new templates that are unseen during training. As shown in Figure 4, self-play interactions have higher proportions of easy and extra hard templates and lower proportions of medium and hard templates. We compare the performance before and after self-play on the SParC validation set in Table 3. Self-play brings the largest improvement to interactions of medium difficulty, followed by hard and easy ones.
On manually inspecting the performance for individual templates, we observe that the performance on most of them improves after self-play. Of the 72 unique templates in the SParC validation set, there are only 12 query templates whose performance decreases. The performance on the templates with the operator "select counts" improves significantly (on average an increase of 12 for the 11 templates with "select counts"), possibly because the word "count" appears more often in the generated data than in the training data. For example, "select count (*_col_0)" is one of the top templates in the generated dataset, as shown in Table 4. We find that the accuracy of the template "select sum (number_col_0)" increases from 50 to 100 after self-play. Self-play also reduces hallucination to some extent. For example, when asked to display certain record companies, the model would hallucinate the constraint "having count(*)>2" before self-play, but the system gives the correct result after self-play. These results support the effectiveness of self-play. Appendix A.4 shows more examples of generated templates and the improvement brought by self-play.

Beam Search Analysis
In this section we study whether self-play improves beam search by increasing the recall of the correct SQL. We first define "Recall at beam size k" as the probability that the ground-truth SQL is contained in the hypotheses of the beam search. This metric measures the recall of the ground-truth SQL when using the beam size k. Figure 5 plots "Recall at beam size k" for k from 1 to 20.
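The metric defined above can be estimated empirically as in the following sketch. It assumes that, for each beam size, decoding has already been run and the returned hypotheses are available per example. Matching hypotheses against the ground truth by string equality is a simplification made here; the function names are illustrative.

```python
from typing import Dict, List


def recall_of_gold(hypotheses: List[List[str]], golds: List[str]) -> float:
    """Fraction of examples whose ground-truth SQL appears among the hypotheses."""
    hits = sum(gold in beam for beam, gold in zip(hypotheses, golds))
    return hits / len(golds)


def recall_curve(beams_by_size: Dict[int, List[List[str]]],
                 golds: List[str]) -> Dict[int, float]:
    """Recall at each beam size k, given the hypotheses decoded with that k."""
    return {k: recall_of_gold(beams, golds)
            for k, beams in sorted(beams_by_size.items())}
```

Note that the hypotheses for each k come from a separate beam search run with that beam size, not from truncating the output of a larger beam.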
We observe that the model after self-play has higher recall than the model before self-play at different beam sizes. For example, the recall is 76.1 before self-play and 79.0 after self-play when k is 20. This shows that self-play can improve the recall of the ground-truth SQL. The recall at beam size 20 after self-play is 13.5 higher than the corresponding exact match score (79.0 vs. 65.5), demonstrating that the model has a high recall of the ground-truth SQL, yet the ground-truth SQL may not have the highest beam score. As shown in Figure 6, we further plot the exact match score of the T5-Large model with 4 configurations at different beam sizes to understand the effect of beam size on model performance. In general, the exact match score improves with larger beam sizes. We observe that (1) the model with self-play outperforms the model without self-play in all configurations; (2) the models with PICARD are more sensitive to beam sizes because the performance improves significantly with increasing beam sizes; (3) the models with self-play are less sensitive to beam sizes as they can obtain high exact match scores with even small beam sizes; (4) self-play can improve the performance with or without PICARD.

Case Study of Self-Play Interactions
Table 5 shows successful (5a) and failed interactions (5b, 5c) generated with self-play. In Table 5a, given the sampled goal (the same as the final system query from turn 3), the SQL-to-text model can decompose it into smaller questions over multiple turns. After asking for the locations of gas stations in the first turn, the SQL-to-text model asks for the company names of the gas stations, a theme-entity question, and then proceeds to query the assets of these companies in descending order, an answer refinement question. This demonstrates that different thematic relations are learned by the SQL-to-text model. Meanwhile, the text-to-SQL model produces the correct SQL queries in a context-dependent way.
Next, we analyze the failure cases. In Table 5b, the user utterances are not grounded to the sampled goal in the course of the dialogue, e.g. the question in the final turn does not match the semantics of the goal query. Although the final system query matches the sampled goal, language drift happens in the middle of the conversation. For example, the user utterance mentions templates in the second turn, but the text-to-SQL model ignores this keyword in the SQL query. Another failure case is the repetition of user utterances. Figure 7 shows the proportion of generated interactions in which the user utterances repeat. Repetition happens more frequently with increasing interaction lengths. Table 5c shows an example of a repetitive interaction. Although the SQL-to-text model produces the sampled goal in the first turn, the stop-of-interaction symbol does not appear in the user utterance. As a result, the conversation continues, and the user simply repeats its first question in the third turn.
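A statistic of the kind plotted in Figure 7 can be computed as in the sketch below. How repetition is detected is not specified above, so this sketch assumes that an interaction counts as repetitive when two of its user utterances are identical after lowercasing and trimming whitespace; the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def has_repetition(utterances: List[str]) -> bool:
    """True if any user utterance occurs more than once (assumed exact match)."""
    normalized = [utterance.strip().lower() for utterance in utterances]
    return len(set(normalized)) < len(normalized)


def repetition_rate_by_length(interactions: List[List[str]]) -> Dict[int, float]:
    """Proportion of repetitive interactions, grouped by number of turns."""
    totals: Dict[int, int] = defaultdict(int)
    repeats: Dict[int, int] = defaultdict(int)
    for utterances in interactions:
        totals[len(utterances)] += 1
        repeats[len(utterances)] += int(has_repetition(utterances))
    return {length: repeats[length] / totals[length] for length in sorted(totals)}
```

The same check could also serve as an additional filter on synthetic interactions, although that use is not part of the procedure described in this paper.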

Related Work
Data Augmentation for Semantic Parsing. Data augmentation (Feng et al., 2021) is an effective strategy to increase the diversity of training data without manually collecting new data. Data augmentation has been applied in NLP (Jia and Liang, 2016) on various tasks such as paraphrase extraction (Barzilay and McKeown, 2001), machine translation (Sennrich et al., 2016; Liu et al., 2021), and question-answering (Longpre et al., 2019). Data augmentation is also widely adopted in semantic parsing tasks (Jia and Liang, 2016; Hou et al., 2018; Yu et al., 2020; Zhong et al., 2021; Wang et al., 2021; Yang et al., 2022). Most previous work on data augmentation for text-to-SQL tasks uses single-turn datasets such as Spider. Yu et al. (2018a) create cross-domain augmentation data with question-SQL patterns extracted from Spider. Guo et al. (2018) use a syntax-and-table-aware semantic parser and a copy-based latent variable model to generate SQL queries and questions, respectively. Wu et al. (2021) apply the abstract syntax tree grammar for SQL generation and a hierarchical SQL-to-question generation model to generate questions.
Data augmentation for multi-turn SQL-to-text datasets is under-explored. The task is more challenging than in the single-turn setting, as it requires sequential generation that takes into consideration complex contextual dependencies and user goals. Zhong et al. (2021) combine a forward semantic parser with a backward utterance generator to generate multi-turn interactions. [...] procedure that happens when the model is only exposed to ground-truth interactions without being conditioned on its own predicted interactions. Several methods (Ranzato et al., 2015; Shen et al., 2016; Leblond et al., 2017; Welleck et al., 2019) have been proposed to bridge this discrepancy between training and test time. Our method demonstrates the benefits of using high-quality predicted interactions to retrain the original model; it is a reasonable way to condition the model on its own predictions and mitigate the exposure bias issue.

Conclusion
We explore using self-play as a data augmentation method for generating synthetic dialogues in the cross-domain conversational semantic parsing task to address the challenges of data scarcity and cross-domain generalization. Self-play learns various thematic relations in dialogues, improves beam search, and encourages the model's generalization to different domains. Experiments on a T5 text-to-SQL semantic parser demonstrate the benefit of our proposed method. In the future, we will study using rewards in an RL setting to guide self-play to produce better synthetic dialogues.

Limitations
Although the filtered dialogues after self-play are mostly grounded to the sampled user goal, some synthetic dialogues are unnatural, as illustrated in §5.3. Therefore, a more controlled generation procedure for self-play that penalizes repetitive questions and encourages dialogues that last more turns would be desirable. The experiments require large GPU resources, which restricts us to running self-play for a single round. Running self-play iteratively with the retrained models for multiple rounds may further improve the results. Dataset-wise, as humans in the real world do not ask questions in a controlled setting as in SParC and CoSQL, the data distribution may be noisier and more complicated. Self-play is not able to generate synthetic dialogues that diverge from the training data to simulate real-world scenarios.

In our experiments, filtering is applied to discard low-quality synthetic interactions that diverge from the user's goals. We find that training on low-quality interactions hurts the final performance. We study the effect of changing the filter threshold value $w$, as shown in Table 7. The final threshold $w$ for filtering interactions is set to 0.5, as a larger threshold aggressively filters out most synthetic dialogues of hard and extra-hard difficulty. We further study whether conditioning on a user goal $G$ when generating interactions is necessary. When we ablate the user goal, the QM score on SParC drops from 62.4 to 59.8. We argue that it is important to condition on the user goal to obtain grounded interactions.
We also reimplement the method used in Zhong et al. (2021) by ablating both the user goal and the context. The QM score on SParC drops from 62.4 to 58.3. We argue that it is important to condition on the user goal and the full context to obtain grounded interactions.

Algorithm 2: Self-training.
Input: Gold interactions $I$, number of iterations $k$ for synthetic data generation, threshold $w$.
Output: A text-to-SQL model and a SQL-to-text model.
Pretrain a text-to-SQL model $p(Q_t \mid U_t, C_t, D)$ and a SQL-to-text model $p(U_t \mid Q_{t-1}, C_t, G, D)$ on $I$.
$I' = \emptyset$
for $i$ in $(1, \ldots, k)$ do
    Sample a goal query $G$.
    Generate a synthetic interaction $I_S$ by self-play between text-to-SQL and SQL-to-text.
    Calculate $\mathrm{score}(Q_T, G)$ on $I_S$.
    if $\mathrm{score}(Q_T, G) > w$ then
        Add $I_S$ to $I'$.
Retrain $p(Q_t \mid U_t, C_t, D)$ and $p(U_t \mid Q_{t-1}, C_t, G, D)$ on $I \cup I'$.
return the retrained text-to-SQL model and the SQL-to-text model.

Figure 2: The distribution of interaction lengths of gold interactions and self-play interactions.

Figure 5: Recall at k plot. Models after self-play have higher recall at different beam sizes.

Figure 6: Question Match with different beam sizes.

Figure 7: The proportion of repetition in generated interactions grouped by interaction length.
Table 7: The QM score on the SParC validation set with T5-base trained with different filter threshold values $w$.

Table 1: Comparison of cross-domain context-dependent text-to-SQL datasets.

Table 2: Main results. Models after self-play outperform the baselines under different configurations.

Table 3: The performance before and after self-play on the SParC validation set, grouped by template difficulty.

Table 4: The top templates and their proportions in the SParC training and generated data.

Table 6: The QM score after self-play on the SParC validation set with T5-base trained on different amounts of synthetic data before filtering.

Table 8: Examples of generated templates unseen in the SParC training set.