SeaD: End-to-end Text-to-SQL Generation with Schema-aware Denoising

In the text-to-SQL task, seq-to-seq models often achieve sub-optimal performance due to limitations in their architecture. In this paper, we present a simple yet effective approach that adapts a transformer-based seq-to-seq model to robust text-to-SQL generation. Instead of imposing constraints on the decoder or reformulating the task as slot filling, we propose to train the seq-to-seq model with Schema-aware Denoising (SeaD), which consists of two denoising objectives that train the model to either recover the input or predict the output under two novel noise types, erosion and shuffle. These denoising objectives act as auxiliary tasks for better modeling of structural data in S2S generation. In addition, we propose an improved, clause-sensitive execution-guided (EG) decoding strategy to overcome the limitations of EG decoding for generative models. Experiments show that the proposed method improves the performance of the seq-to-seq model in both schema linking and grammar correctness and establishes a new state-of-the-art on the WikiSQL benchmark. The results indicate that the capacity of the vanilla seq-to-seq architecture for text-to-SQL may have been underestimated.


Introduction
Text-to-SQL aims at translating natural language into valid SQL queries. It enables laymen to explore structural database information with semantic questions instead of dealing with the complex grammar required by logical-form queries. Though text-to-SQL is a typical seq-to-seq (S2S) task, auto-regressive models (LSTM, Transformer, etc.) that predict the target sequence token by token fail to achieve state-of-the-art results on it. Previous works attribute the sub-optimal results to three major limitations. First, SQL queries with different clause orders may have exactly the same semantic meaning and return the same results upon execution. This token interchangeability may confuse models based on S2S generation. Second, the grammar constraints induced by the structural logical form are ignored during auto-regressive decoding; the model may therefore predict SQL with an invalid logical form. Third, schema linking, which has been suggested to be the crux of the text-to-SQL task, is not specially addressed by the vanilla S2S model.

[1] https://github.com/salesforce/WikiSQL

Figure 1: SeaD regards text-to-SQL as an S2S generation task. During inference, given a natural language question and the related database schema, SeaD directly generates the corresponding SQL sequence in an auto-regressive manner.

In recent years, various works have been proposed to improve the performance of the S2S model with sketch-based slot filling, constrained (structure-aware) decoders or explicit schema-linking modules, which try to mitigate these limitations respectively. Though most of these works share the same encoder-decoder architecture with the S2S model, they only consider it a simple baseline. On the other side, generative models exhibit huge potential in other tasks that also require structural output and share similar properties with text-to-SQL. This raises the question: is the capacity of the generative model underestimated?
In this paper, we investigate this question based on the Transformer architecture. Instead of building extra modules or placing constraints on the model output, we propose novel schema-aware denoising objectives trained along with the original S2S task. These denoising objectives deal with the intrinsic attributes of the logical form and therefore facilitate the schema linking required for the text-to-SQL task. The inductive schema-aware noises can be categorized into two types: erosion and shuffle. Erosion acts on the schema input by randomly permuting, dropping and adding columns in the current schema set. The related schema entities in the target SQL query are jointly modified according to the erosion result. Shuffle is applied by randomly re-ordering the mentioned entities and values in the NL or SQL sequence with respect to the schema columns. During training, shuffle is performed as monolingual self-supervision that trains the model to recover the original text given the noised one. Erosion is applied to the S2S task that trains the model to generate the corrupted SQL sequence, given the NL question and the eroded schema as input. These denoising objectives are combined with the original S2S task to train a SeaD model. In addition, to deal with the limitation of execution-guided (EG) decoding, we propose a clause-sensitive EG strategy that decides the beam size with respect to the clause token being predicted.
We compare the proposed method with other top-performing models on the WikiSQL benchmark. The results show that our model surpasses previous work and establishes a new state-of-the-art for WikiSQL. This demonstrates the effectiveness of the schema-aware denoising approach and also sheds light on the importance of task-oriented denoising objectives.

Related Work
Semantic Parsing The problem of mapping natural language to meaningful executable programs has been widely studied in natural language processing research. Logic forms (Zettlemoyer and Collins, 2012; Zettlemoyer, 2011, 2013; Cai and Yates, 2013; Reddy et al., 2014; Liang et al., 2013; Quirk et al., 2015; Chen et al., 2016) can be considered a special instance of the more generic semantic parsing problem. As a sub-task of semantic parsing, the text-to-SQL problem has been studied for decades (Warren and Pereira, 1982; Popescu et al., 2003; Li et al., 2006; Giordani and Moschitti, 2012; Bodik). Slot-filling models (Hwang et al., 2019; He et al., 2019a; Lyu et al., 2020) translate the clauses of SQL into subtasks, and (Ma et al., 2020) treat the task as a two-stage sequence labeling problem. However, the convergence rates of the subtasks may be inconsistent, and the interaction between multiple subtasks may prevent the model from converging well. Like much previous work (Dong and Lapata, 2016; Lin et al., 2018; Zhong et al., 2017; Suhr et al., 2020; Raffel et al., 2019), we treat text-to-SQL as a translation problem, taking both the natural language question and the DB as input.

Hybrid Pointer Networks Proposed by (Vinyals et al., 2015), the copying mechanism (CM) uses attention as a pointer to copy discrete tokens from the input sequence into the output. It has been successfully used in machine reading comprehension (Wang and Jiang, 2016; Trischler et al., 2016; Kadlec et al., 2016; Xiong et al., 2016), interactive conversation (Gu et al., 2016; Yu and Joty, 2020; He et al., 2019b), geometric problems (Vinyals et al., 2015) and program generation (Zhong et al., 2017; Xu et al., 2017; Dong and Lapata, 2016; Yu et al., 2018; McCann et al., 2018; Hwang et al., 2019). In text-to-SQL, CM can not only facilitate the extraction of condition values from the source input, but also help to protect the privacy of the database.
In this paper, we use a Hybrid Pointer Generator Network similar to (Jia and Liang, 2016; Rongali et al., 2020) to generate the next token.

Denoising Self-training Language model pre-training (Devlin et al., 2018; Yang et al., 2019; Lan et al., 2019) has been shown to improve downstream performance on many NLP tasks and has brought significant gains. (Radford et al., 2018; Peters et al., 2018; Song et al., 2019) are beneficial to S2S tasks, while being problematic for some others. BART (Lewis et al.) is a denoising S2S pre-training model that is effective for both generative and discriminative tasks and reduces the mismatch between pre-training and generation tasks. Inspired by this, we propose a denoising self-training procedure that trains the model to map corrupted documents to the original.

Methodology
Given a natural language question Q and a schema S, our goal is to obtain the corresponding SQL query Y. Here the natural question Q = {q_1, ..., q_|Q|} denotes a word sequence, and the schema S = {c_1, ..., c_|S|} is composed of a set of columns, where each column c_i = {c_i,1, ..., c_i,|c_i|} is a sequence of words. Y = y_1, ..., y_|Y| denotes the token-wise raw SQL sequence. We approach this task with direct auto-regressive generation, i.e., predicting the SQL sequence token by token. We choose the Transformer as our base architecture, which is widely adopted in S2S translation and generation tasks. In this section, we first present the sample formulation that transforms text-to-SQL into a typical S2S task, followed by a brief introduction to the Transformer architecture with a pointer generator. Then we describe the proposed schema-aware denoising method and the clause-sensitive EG decoding strategy.

Figure 2: The proposed schema-aware denoising procedure. (a) Erosion denoising randomly drops, adds and re-permutes schema columns. The related column entities in the ground-truth SQL sequence are jointly modified or masked out with respect to the erosion results of the current schema set. The erosion objective trains the model to predict the modified SQL sequence given the noised input. (b) The shuffle denoising objective re-permutes the mentioned entities in the SQL or NL sequence, and trains the model to reconstruct the sequence with the correct entity order.

Sample Formulation
The structural target sequence and the unordered schema set require reformulation to perform the text-to-SQL task through S2S generation. For the schema formulation, each column name is prefixed with a separate special token <coli>, where i denotes the i-th column in the schema set. The type of each column is also appended to the name sequence, forming the template <coli> [col name] : [col type] for a schema column. All columns in the schema are formulated and concatenated to obtain a sequence representing the schema. The schema sequence is further concatenated with the NL sequence to form the model input. For the SQL sequence, we initialize it with the raw SQL query and perform several modifications: 1) surrounding entities and values in the SQL query with a "'" token, and dropping other surroundings if they exist; 2) replacing column entities with their corresponding separate tokens from the schema; 3) inserting spaces between punctuation and words. The formulated SQL sequence is illustrated in Figure 1. This formatting procedure improves the consistency between the tokenized source and target sequences, and contributes to the identification and linking of schema entities.
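As a concrete illustration, the formulation above can be sketched as follows; the function names and the naive string replacement used for column anonymization are illustrative simplifications, not the authors' implementation:

```python
def format_input(question, columns):
    """Serialize the schema and NL question into one source sequence.

    `columns` is a list of (name, type) pairs; column i is rendered with
    its separate token as "<coli> name : type", and all columns are
    concatenated before the question.
    """
    schema = " ".join(
        f"<col{i}> {name} : {ctype}" for i, (name, ctype) in enumerate(columns)
    )
    return schema + " " + question


def format_sql(sql, columns):
    """Anonymize column mentions in a raw SQL query with their <coli>
    tokens (a naive string replace; the real pipeline also quotes
    values with ' and inserts spaces around punctuation)."""
    for i, (name, _) in enumerate(columns):
        sql = sql.replace(name, f"<col{i}>")
    return sql
```

For example, with columns [("week", "text"), ("attendance", "number")], the question "Which week had an attendance of 53,677" is prefixed with "<col0> week : text <col1> attendance : number", and the SQL clause "WHERE attendance = '53677'" becomes "WHERE <col1> = '53677'".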

Transformer with Pointer
Following previous works on S2S semantic parsing, we use the Transformer proposed by (Vaswani et al., 2017) in our architecture. A complete standard SQL query not only contains the inherent keywords, but also requires text spans extracted from the source input, especially for condition value generation. In the traditional seq2seq architecture, target words are generated from the decoder hidden states through a linear affine transformation that produces unnormalized scores over the target vocabulary. Different from this, we use a Hybrid Pointer Generator Network inspired by (Jia and Liang, 2016) to generate target words; that is, words can either be picked from the target vocabulary V = {Q_v, S_v, V_sql} or be pointers that copy from the source sequence. Q_v denotes words from the corpus, S_v denotes words from columns, and V_sql denotes SQL keywords (SELECT, MAX, MIN, etc.).
In the decoding process, for each input sequence I_source, we use the Transformer encoder to encode it into hidden states. At step t, the Transformer decoder produces the hidden state h_t based on the previously generated sequence and the encoder output, as described in (Vaswani et al., 2017). We then apply an affine transformation to h_t to obtain scores score_vocab = {s_1, ..., s_|V|} over the target vocabulary V, and use h_t to compute unnormalized attention scores score_source = {i_1, ..., i_|input|} against the encoded sequence. Concatenating score_vocab and score_source yields the hybrid scores score_hybrid = {s_1, ..., s_|V|, i_1, ..., i_|input|}, as in (Rongali et al., 2020): the first |V| positions form the output distribution over the target vocabulary, and the last |input| positions point to the source tokens. We compute the final probability distribution as P = softmax(score_hybrid). P is used in the loss function during training and for next-token generation during inference.
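The scoring above amounts to a single softmax over the concatenation of vocabulary and copy scores. A minimal numerical sketch in pure Python (names are illustrative; the actual model computes these scores with learned projections):

```python
import math

def hybrid_distribution(score_vocab, score_source):
    """Softmax over the concatenation of unnormalized vocabulary scores
    and source attention scores: the first |V| output positions are
    generate-from-vocabulary probabilities, the remaining |input|
    positions are copy-from-source probabilities."""
    score_hybrid = list(score_vocab) + list(score_source)
    m = max(score_hybrid)                       # subtract max for stability
    exps = [math.exp(s - m) for s in score_hybrid]
    z = sum(exps)
    return [e / z for e in exps]
```

At each step, a token is then emitted either from the vocabulary or by copying a source token, depending on which position of P is selected.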

Schema-aware Denoising
Similar to masked language modeling and other denoising tasks, we propose two schema-aware objectives, erosion and shuffle, that train the model either to reconstruct the original sequence from noised input or to predict a correspondingly corrupted output. The denoising procedure is illustrated in Figure 2.

Erosion
Given an input sequence X = {Q, S}, where Q denotes the NL sequence, erosion corrupts the schema sequence S with a serial composition of three noising operations. Permutation: re-order the concatenated sequence of schema columns during schema formulation. Removal: remove each column with a dropping probability p_drop. Addition: with an addition probability p_add, extract a column from another schema in the training database and insert it into the current schema set. During all of the above operations, the order of the separating special tokens remains unchanged; the corresponding anonymous entities in the SQL query must therefore be updated along with the erosion operations on the schema sequence. In particular, if a column entity mentioned in the SQL query is removed during erosion, we substitute the corresponding column token in the SQL with a masking token <unk> to cope with the absence of the schema information. With this joint modification of the schema and SQL sequences, the model is required to identify the schema entities that are truly related to the NL question, and learns to raise an unknown exception whenever the schema information is insufficient to compose the target SQL.
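A sketch of the erosion noise under assumed data structures (column mentions in the SQL are represented as integer indices into the column list; all names here are illustrative, not the authors' code):

```python
import random

def erode(columns, sql_col_ids, p_drop=0.1, p_add=0.1, extra_pool=None):
    """Apply erosion noise to a schema and jointly rewrite the SQL.

    columns:     list of column names; index i is the column behind <coli>
    sql_col_ids: indices of the columns mentioned in the target SQL
    Returns (noised columns, rewritten mentions), where a mention becomes
    the new index of the surviving column, or '<unk>' if it was dropped.
    """
    order = list(range(len(columns)))
    random.shuffle(order)                                     # Permutation
    order = [i for i in order if random.random() >= p_drop]   # Removal
    new_cols = [columns[i] for i in order]
    if extra_pool and random.random() < p_add:                # Addition
        pos = random.randrange(len(new_cols) + 1)
        new_cols.insert(pos, random.choice(extra_pool))
        order.insert(pos, None)     # placeholder: not an original column
    # Map old column index -> new index; dropped columns become <unk>
    remap = {old: new for new, old in enumerate(order) if old is not None}
    mentions = [remap.get(c, "<unk>") for c in sql_col_ids]
    return new_cols, mentions
```

The joint rewrite is what forces the model to track columns by content rather than position: after permutation the same column sits behind a different <coli> token, and after removal the target contains <unk>.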
Algorithm 1: Training procedure for schema-aware denoising. Input: training corpus.

Shuffle
Given an input sequence X = {Q', S}, where the query Q' is either the NL question Q or the SQL sequence Y, the shuffle noise re-orders the sequence in which entities are mentioned in the source query while the schema sequence S is kept fixed. The denoising objective trains the model to reconstruct the query sequence Q' with the entities in the correct order. Recovering shuffled entity orders trains the model to capture the inner relations between different entities, and therefore contributes to schema-linking performance. It is also notable that, as a self-supervision objective, both Q and Y are engaged in this denoising task and trained separately. Though we depend on the SQL query to identify the value entities in the NL query, shuffling the order of only the column entities is sufficient to obtain promising performance. Since no parallel data is required, additional monolingual corpora for both SQL and NL could help with the re-ordering task; this is one of the future directions of this work.
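The shuffle noise can be sketched as follows; the entity detector and token-level representation are assumptions for illustration:

```python
import random

def shuffle_entities(tokens, is_entity):
    """Permute the entity tokens of a query among their own positions,
    keeping all other tokens fixed.  The training pair is then
    (noised sequence -> original `tokens`), i.e. the model must put the
    entities back into the correct order."""
    slots = [i for i, t in enumerate(tokens) if is_entity(t)]
    entities = [tokens[i] for i in slots]
    random.shuffle(entities)
    noised = list(tokens)
    for i, e in zip(slots, entities):
        noised[i] = e
    return noised
```

For an SQL query, `is_entity` would match the anonymized <coli> tokens (and optionally values); for an NL question, the entity spans identified via the aligned SQL.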

Training Procedure
Inspired by previous works on denoising self-training (Song et al.; Lewis et al.), we propose to train the schema-aware denoising objectives along with the main S2S task. During training, we apply a noising pipeline to each training sample before feeding it into the model. Noises of different types are applied to the sample individually; through the control of their activation probabilities, they can share the same weights in the overall objective. This continual noising pipeline generates randomly corrupted samples throughout training, which prevents the model from over-fitting quickly and can yield results with better generalization (Siddhant et al.). The whole procedure is summarized in Algorithm 1.
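The per-sample decision of which objective fires can be sketched as below; the activation probabilities follow the values reported in the implementation details, and the function name is illustrative:

```python
import random

def sample_objectives(p_erosion=0.5, p_shuffle=0.3):
    """Draw which denoising objectives are applied to one training
    sample.  Each noise fires independently with its activation
    probability; when none fires, the sample is used for the plain S2S
    objective.  Re-drawing every epoch yields fresh corrupted samples."""
    tasks = []
    if random.random() < p_erosion:
        tasks.append("erosion")   # eroded schema -> corrupted SQL target
    if random.random() < p_shuffle:
        tasks.append("shuffle")   # shuffled query -> original query
    return tasks or ["s2s"]
```

Because the corruption is re-sampled on every pass, the model never sees a fixed noised dataset, which is what makes the pipeline "continual".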

Clause-sensitive EG Decoding
During the inference of the text-to-SQL task, the predicted SQL may contain errors related to inappropriate schema linking or grammar. EG decoding is proposed to amend these errors through an executor-in-loop iteration: the SQL queries in the candidate list are fed into the executor in sequence, and queries that fail to execute or return an empty result are discarded. Such a decoding strategy, while effective, assumes that the major disagreement within the candidate list concerns schema linking or grammar. Directly performing EG on candidates generated with beam search yields only trivial improvement, as the candidates consist of redundant variations in selection, schema naming, etc. This problem can be addressed by setting the beam size to 1 for most of the predicted tokens and widening it only for tokens related to schema linking (e.g., after WHERE). We also notice cases that combine incorrect schema linking with some aggregation in the SELECT clause, which return trivial results such as 0 and thus slip past the EG filter. To mitigate this issue, we drop the aggregation operator in SELECT during EG to maximize its effectiveness. Note that with this strategy, conditions with inequalities in the WHERE clause must be dropped together to ensure the validity of the ground-truth SQL results.
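The two ideas, widening the beam only at schema-linking positions and stripping aggregation before execution, can be sketched as follows; the token sets and the `execute` callback are assumptions for illustration, not the authors' implementation:

```python
AGG_OPS = {"MAX", "MIN", "COUNT", "SUM", "AVG"}
LINKING_TOKENS = {"SELECT", "WHERE"}  # clauses whose continuation decides schema linking

def beam_size_for(prev_token, wide=5):
    """Clause-sensitive beam schedule: beam size 1 almost everywhere,
    widened only after clause tokens that govern schema linking, so EG
    candidates differ in linking rather than in trivial variations."""
    return wide if prev_token in LINKING_TOKENS else 1

def eg_accepts(sql_tokens, execute):
    """Execution-guided filter with the aggregation operator stripped,
    so a trivial aggregate result cannot mask incorrect schema linking.
    `execute` is a caller-supplied function returning the result rows,
    or None when the query fails to execute."""
    stripped = [t for t in sql_tokens if t not in AGG_OPS]
    rows = execute(" ".join(stripped))
    return rows is not None and len(rows) > 0
```

As noted above, inequality conditions in WHERE would also need to be dropped alongside the aggregation for the stripped query to remain a valid check.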

Experiment
To demonstrate the effectiveness of the proposed method, we evaluate the proposed model on the WikiSQL dataset and compare it with other state-of-the-art models.

Dataset
The WikiSQL dataset provides separate splits for training, validation and testing. All ground-truth SQL queries are guaranteed to return at least one query result. Each SQL query contains a SELECT clause with at most one aggregation operator and a WHERE clause with at most 4 conditions connected by AND. Each SQL query is associated with a schema in the database.

Implementation details
We implement our method using AllenNLP (Gardner et al.) and PyTorch (Paszke et al.). For the model architecture, we use a Transformer with 12 layers in each of the encoder and decoder and a hidden size of 1024. We initialize the model weights with the bart-large pretrained model provided by the Huggingface community (Wolf et al.) and fine-tune it on the training dataset for 20 epochs. The batch size during training is set to 8 with a gradient accumulation step of 2. We choose Adam (Kingma and Ba) as the optimizer and set the learning rate to 7e-5 with a warm-up step ratio of 1%. The weight decay for regularization is set to 0.01. We set the activation probabilities P_swap = 0.5 and P_shuffle = 0.3 to balance the weight between the self-supervision and S2S objectives. P_drop for column removal in erosion is set to 0.1. The early-stop patience is set to 5 with respect to the BLEU metric (Papineni et al.) on the validation set. The overall training procedure takes around 3 hours on an Ubuntu server with 8 NVIDIA V100 GPUs.

Comparison with State-of-the-art Models
The comparison results are summarized in Table 1. Models marked with ♣ leverage additional annotation of the dataset. Models marked with ♦ utilize database content during training. Without EG, SeaD significantly outperforms all models that do not use table content or schema-linking annotation. When combined with EG decoding, SeaD achieves the best performance even compared to models that utilize additional training information. This indicates the effectiveness of the proposed denoising objectives for modeling text-to-SQL through vanilla S2S. Notably, annotation noise makes aggregation prediction a major challenge for WikiSQL. Previous works suggested improving AGG prediction via rule-based annotation amendment; as the comparison shows, less human effort is involved in our approach. Combined with the AGG-dropped clause-sensitive EG, the SeaD model establishes a new state-of-the-art on the WikiSQL benchmark.
To analyze the detailed improvements of SeaD on the text-to-SQL task, in Table 3 we report the accuracy on the WikiSQL test set with respect to several SQL components, with and without EG decoding. SeaD shows promising results on column selection, aggregation, where-column and where-value prediction. It outperforms all methods except SDSQL, which leverages rule-based annotation of schema linking. After applying EG decoding, SeaD achieves the best performance on four of the five components among all competitors.

Ablation Study
To evaluate the contribution of each proposed objective, we perform an ablation study of SeaD on the WikiSQL dataset. We start from the Bart model and add components in sequence. The pointer network contributes a 1.2% absolute improvement in Acc_lf on the test set. Adding text infilling, an effective denoising objective used by Bart, to the training procedure brings a further 0.3% absolute Acc_lf improvement. The erosion and shuffle objectives contribute 1.5% and 0.6% absolute Acc_lf improvements on the test set, respectively. This demonstrates the effectiveness of the schema-aware denoising objectives for improving S2S generation in the text-to-SQL task.

Conclusions
In this paper, we proposed training models with novel schema-aware denoising objectives, which improve the performance of S2S generation for the text-to-SQL task. The proposed SeaD model outperforms previous works and achieves state-of-the-art performance on the WikiSQL benchmark. The success of SeaD highlights the potential of task-oriented denoising objectives for enhancing S2S models.