Error Detection for Text-to-SQL Semantic Parsing

Despite remarkable progress in text-to-SQL semantic parsing in recent years, the performance of existing parsers is still far from perfect. Specifically, modern text-to-SQL parsers based on deep learning are often over-confident, thus casting doubt on their trustworthiness when deployed for real use. In this paper, we propose a parser-independent error detection model for text-to-SQL semantic parsing. Using a language model of code as its bedrock, we enhance our error detection model with graph neural networks that learn structural features of both natural language questions and SQL queries. We train our model on realistic parsing errors collected from a cross-domain setting, which leads to stronger generalization ability. Experiments with three strong text-to-SQL parsers featuring different decoding mechanisms show that our approach outperforms parser-dependent uncertainty metrics. Our model can also effectively improve the performance and usability of text-to-SQL semantic parsers regardless of their architectures. (Our implementation is available at https://github.com/OSU-NLP-Group/Text2SQL-Error-Detection)


Introduction
Recent years have witnessed a renewed interest in text-to-SQL semantic parsing (Bogin et al., 2019; Lin et al., 2020; Wang et al., 2020; Rubin and Berant, 2021; Cao et al., 2021; Gan et al., 2021; Scholak et al., 2021; Qi et al., 2022; Li et al., 2023), which allows users with a limited technical background to access databases through a natural language interface. Although state-of-the-art semantic parsers have achieved remarkable performance on Spider (Yu et al., 2018), a large-scale cross-domain text-to-SQL benchmark, their performance is still far from satisfactory for real use. While syntax errors can be automatically caught by SQL execution engines, detecting semantic errors in executable SQL queries can be non-trivial and time-consuming even for experts (Jorgensen and Shepperd, 2007; Weiss et al., 2007). Therefore, an accurate error detector that can flag parsing issues and accordingly trigger error correction procedures (Chen et al., 2023) can contribute to building better natural language interfaces to databases.
Researchers have proposed multiple approaches for error detection in text-to-SQL parsing. Yao et al. (2019, 2020) detect errors by setting a threshold on the prediction probability or dropout-based uncertainty of the base parser. However, using these parser-dependent metrics requires the base parser to be calibrated, which limits their applicability. Several interactive text-to-SQL systems detect parsing errors based on uncertain span detection (Gur et al., 2018; Li et al., 2020; Zeng et al., 2020). Despite having high coverage for errors, this approach is reported to be of low precision. Finally, text-to-SQL re-rankers (Yin and Neubig, 2019; Kelkar et al., 2020; Bogin et al., 2019; Arcadinho et al., 2022), which estimate the plausibility of SQL predictions, can be seen as on-the-fly error detectors. Nevertheless, existing re-rankers are trained on in-domain parsing errors, limiting their generalization ability.
In this work, we propose a generalizable and parser-independent error detection model for text-to-SQL semantic parsing. Since syntax errors can be easily detected by an execution engine, we focus on detecting semantic errors in executable SQL predictions. We start developing our model with CodeBERT (Feng et al., 2020), a language model pre-trained on multiple programming languages. On top of that, we use graph neural networks to capture compositional structures in natural language questions and SQL queries, improving the performance and generalizability of our model. We train our model on parsing mistakes collected from a realistic cross-domain setting, which is indispensable to the model's strong generalization ability. Furthermore, we show that our model is versatile and can be used for multiple tasks, including error detection, re-ranking, and interaction triggering. To summarize, our contributions include:

• We propose the first generalizable and parser-independent error detection model for text-to-SQL parsing that is effective on multiple tasks and different parser designs without any task-specific adaptation. Our evaluations show that the proposed error detection model outperforms parser-dependent uncertainty metrics and maintains its high performance under cross-parser evaluation settings.
• Our work is the first comprehensive study on error detection for text-to-SQL parsing. We evaluate the performance of error detection methods on both correct and incorrect SQL predictions. In addition, we show through simulated interactions that a more accurate error detector can significantly improve the efficiency and usefulness of interactive text-to-SQL parsing systems.
2 Related Work

Text-to-SQL Semantic Parsing
Most existing neural text-to-SQL parsers adopt one of three decoding mechanisms. The first is sequence-to-sequence with constrained decoding, where a parser models query synthesis as a sequence generation task and prunes syntactically invalid parses during beam search. Several strong text-to-SQL parsers apply this simple idea, including BRIDGE v2 (Lin et al., 2020), PICARD (Scholak et al., 2021), and RESDSQL (Li et al., 2023). Another popular decoding mechanism is grammar-based decoding (Yin and Neubig, 2017), where parsers first synthesize an abstract syntax tree based on a pre-defined grammar and then convert it into a SQL query. Parsers using intermediate representations, such as IRNet (Guo et al., 2019) and NatSQL (Gan et al., 2021), also fall into this category. Grammar-based decoding ensures syntactic correctness but makes the task harder to learn due to the introduction of non-terminal syntax tree nodes. Different from the above autoregressive decoding strategies, SmBoP (Rubin and Berant, 2021) applies bottom-up decoding, where a SQL query is synthesized by combining parse trees of different depths using a ranking module. We evaluate our model with semantic parsers using each of these three decoding strategies and show that our model is effective on all of them.

Re-ranking for Text-to-SQL Parsing
Noticing the sizable gap between the accuracy and beam hit rate of semantic parsers, researchers have explored building re-ranking models to bridge this gap and improve parser performance. Global-GNN (Bogin et al., 2019) re-ranks beam predictions based on the database constants that appear in the predicted SQL query. This re-ranker is trained together with its base parser. More recently, Bertrand-DR (Kelkar et al., 2020) and T5QL (Arcadinho et al., 2022) fine-tune a pre-trained language model for re-ranking. However, both report that directly re-ranking all beams using re-ranker scores hurts performance. To gain from re-ranking, Bertrand-DR only raises the rank of a prediction if its re-ranking score exceeds that of the preceding prediction by a threshold, while T5QL combines the re-ranking score and the prediction score by a weighted sum. Both approaches require tuning hyper-parameters. In contrast, when directly using the proposed parser-independent error detection model as a re-ranker, we observe performance improvements on BRIDGE v2 and NatSQL without any constraint, showing that our approach is more generalizable and robust.

Interactive Text-to-SQL Parsing Systems
Interactive text-to-SQL parsing systems improve the usability of text-to-SQL semantic parsers by correcting potential errors in the initial SQL prediction through interactive user feedback. MISP (Yao et al., 2019, 2020) initiates interactions by setting a confidence threshold on the base parser's prediction probability. While this approach is intuitive, it requires the base parser to be well-calibrated when decoding, which does not hold for most modern parsers using deep neural networks. In addition, this design can hardly accommodate some recent parsers, such as SmBoP (Rubin and Berant, 2021), whose bottom-up decoding mechanism does not model a distribution over the output space. Several other interactive frameworks (Gur et al., 2018; Li et al., 2020; Zeng et al., 2020) trigger interactions when an incorrect or uncertain span is detected in the input question or predicted SQL query. While these approaches have high coverage for parsing errors, they tend to trigger unnecessary interactions for correct initial predictions. For example, PIIA (Li et al., 2020) triggers interactions on 98% of the questions in Spider's development set while its base parser has an accuracy of 49%. Compared to these methods, the proposed method strikes a better balance between performance and efficiency, and thus can improve the user experience of interactive text-to-SQL parsing systems.
3 Parser-independent Error Detection

Given a natural language question X and a SQL query ŷ predicted by a text-to-SQL parser, the error detection model estimates the probability of ŷ being correct, denoted by s. We perform error detection and action triggering by setting a threshold on s. For re-ranking, we directly use s as the ranking score without modification.

Cross-domain Error Collection
We consider two factors that could lead to text-to-SQL parsing errors: insufficient training data and the cross-domain generalization gap. To simulate such errors, we collect data from weakened versions of the base parsers in a cross-domain setting. More specifically, we split the Spider training set into two equal-sized subsets by databases and train the base parser on each subset. Then we perform inference on the complementary subset and collect beam predictions as data for error detection. We keep executable SQL queries and label them based on execution accuracy. We use a fixed version of Spider's official evaluation script (Appendix A) and keep up to five parser predictions for each question after deduplication. The collected samples are divided into training and development sets by an 80:20 ratio, again split by databases. In this way, we obtain high-quality training data for our error detection model in a setting that approximates the real cross-domain testing environment. For testing, we train each base parser on the full Spider training set and collect executable beam predictions on the Spider development set. Beams with un-executable top predictions are skipped. We report the number of beams, the total number of question-SQL pairs, and the average number of such pairs per beam for each split in Table 1. Following existing literature (Kelkar et al., 2020), we refer to correct SQL queries in a beam as beam hits and incorrect ones as beam misses.
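The filtering-and-labeling step above can be sketched as follows. This is a minimal sketch, not the released implementation: the beam format and the `execute` callable (returning an execution result, or `None` for an un-executable query) are hypothetical stand-ins for the parser's beam output and the fixed Spider evaluation script.

```python
def label_beam(beam, gold_result, execute, max_keep=5):
    """Filter and label one beam of predicted SQL queries.

    beam: list of SQL strings, ranked by the parser.
    gold_result: execution result of the gold SQL query.
    execute: hypothetical callable returning the execution result,
             or None if the query is not executable.
    """
    labeled, seen = [], set()
    for sql in beam:
        if sql in seen:          # deduplicate, keeping the highest-ranked copy
            continue
        seen.add(sql)
        result = execute(sql)
        if result is None:       # drop queries that fail to execute
            continue
        # label 1 (beam hit) iff execution matches the gold query's result
        labeled.append((sql, 1 if result == gold_result else 0))
        if len(labeled) == max_keep:   # keep up to five predictions per question
            break
    return labeled
```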
We notice that BRIDGE v2 generates significantly fewer executable SQL queries on all data splits. This is due to its unconstrained decoder with rule-based filtering. In addition, BRIDGE v2 generates a default SQL query that counts the number of entries in the first table when there are no valid predictions in the beam. Although this mechanism ensures the parser generates at least one executable query, such negative samples do not fit the overall error distribution and may harm the error detection model. NatSQL and SmBoP take the grammatical constraints of SQL into account during decoding and thus generate more executable queries than BRIDGE v2. Table 1 also shows that NatSQL and SmBoP produce similar amounts of beam hits and beam misses on the training and development splits. However, the number of executable beam misses generated by SmBoP on the test split is noticeably lower, while the behavior of NatSQL is more consistent.

Model Architecture
Figure 1 illustrates the architecture of the proposed error detection models. We use CodeBERT (Feng et al., 2020) as our base encoder to jointly encode the input question and SQL query. Following CodeBERT's input construction during pre-training, we concatenate questions and SQL queries with special tokens, namely [CLS] X [SEP] ŷ [EOS], as input, and obtain their contextualized representations h_X and h_ŷ. We only use the question and SQL query as input since we found in preliminary experiments that adding database schema information (table and column names) to the input hurts performance.
In light of the compositional nature of questions and SQL queries, we propose to model their structural features via graph neural networks. For natural language questions, we obtain their dependency parse trees and constituency parse trees from Stanza (Qi et al., 2020) and merge them together. This is possible since edges in dependency parse trees connect actual tokens, which correspond to leaf nodes in constituency parse trees. For SQL queries, we extract their abstract syntax trees via Antlr4. To make the input graphs more compact and lower the risk of overfitting, we further simplify the parse trees by removing non-terminal nodes that only have one child in a top-down order. Additionally, for SQL queries, we remove the subtrees for join constraints, which do not carry much semantic information but are often quite long. Finally, we add sequential edges connecting the leaf nodes in the parse trees by their order in the original questions and SQL queries to preserve natural ordering features during graph learning.
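The tree simplification and sequential-edge construction can be sketched as follows. The `(label, children)` tuple representation is an assumption for illustration; Stanza and Antlr4 expose their own tree objects, and the join-constraint pruning is omitted here.

```python
def simplify(node):
    """Remove, top-down, non-terminal nodes that have exactly one child,
    promoting the child in their place."""
    label, children = node
    while len(children) == 1:              # collapse chains of unary nodes
        child_label, grandchildren = children[0]
        if not grandchildren:              # the single child is a leaf: keep it
            return (child_label, [])
        label, children = child_label, grandchildren
    return (label, [simplify(c) for c in children])

def leaves(node):
    """Leaf tokens in left-to-right order."""
    label, children = node
    return [label] if not children else [t for c in children for t in leaves(c)]

def sequential_edges(node):
    """Edges linking consecutive leaf tokens, preserving the original order."""
    toks = leaves(node)
    return list(zip(toks, toks[1:]))
```

For instance, a unary chain such as SELECT_CLAUSE → COL → "name" collapses into the single leaf "name", while nodes with multiple children are kept.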
We initialize the representations of parse tree leaf nodes with CodeBERT's contextualized representations and randomly initialize the representations of other nodes according to their types in the parse tree. The two input graphs are encoded by two separate 3-layer graph attention networks (Brody et al., 2022). Then we obtain the global representation of each graph via average pooling and concatenate them to get an aggregated global representation v. We denote models with graph encoders as CodeBERT+GAT in Section 4. When simply fine-tuning CodeBERT, we use the [CLS] representation as v instead. Finally, a 2-layer feed-forward neural network with tanh activation is used to score the aggregated representation v. The score s for each input question-SQL pair is s = σ(FFNN(v)), where σ represents the sigmoid function. We train our model by minimizing a binary cross-entropy loss L = −[y log s + (1 − y) log(1 − s)], where the label y is 1 if ŷ executes to the same result as the gold SQL query y* and 0 otherwise. During training, we supply the model with samples from K beams at each step, where K is the batch size.
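The scoring head and training loss can be sketched in NumPy as follows. This is a toy-dimension sketch under assumed weight shapes, not the trained model: in the actual architecture, the pooled inputs come from 3-layer GATs over CodeBERT-initialized node features.

```python
import numpy as np

def score(h_q_nodes, h_sql_nodes, W1, b1, W2, b2):
    """s = sigmoid(FFNN(v)), where v concatenates the average-pooled node
    representations of the question graph and the SQL graph."""
    v = np.concatenate([h_q_nodes.mean(axis=0), h_sql_nodes.mean(axis=0)])
    hidden = np.tanh(W1 @ v + b1)                      # layer 1, tanh activation
    return 1.0 / (1.0 + np.exp(-(W2 @ hidden + b2)))   # layer 2 + sigmoid

def bce_loss(s, y):
    """Binary cross-entropy; y = 1 iff the predicted SQL executes to the
    same result as the gold query y*."""
    return -(y * np.log(s) + (1 - y) * np.log(1 - s))
```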

Experiments
In this section, we first evaluate the performance (Section 4.2.1) and generalization ability (Section 4.2.2) of our error detection model on the binary error detection task. Then we investigate our model's effectiveness when used for re-ranking (Section 4.2.3) and action triggering (Section 4.2.4).

Experiment Setup
Baseline Methods We compare our parser-independent error detectors with parser-dependent uncertainty metrics, including prediction probability and dropout-based uncertainty. Since SmBoP (Rubin and Berant, 2021) uses bottom-up decoding, which separately scores and ranks each candidate prediction, we deduplicate SmBoP's beam predictions by keeping the maximum score and apply softmax over the deduplicated beam to get a probability distribution over candidate predictions, which can be seen as a reasonable approximation of its confidence. BRIDGE v2 (Lin et al., 2020) and NatSQL (Gan et al., 2021) use autoregressive decoders, and we directly use the log probability of their predictions as the confidence score. Probability-based methods are denoted by superscript p. For dropout-based uncertainty, we follow MISP (Yao et al., 2019) and measure the standard deviation of the scores (SmBoP) or log probabilities (BRIDGE v2 and NatSQL) of the top-ranked prediction over 10 forward passes. Dropout-based uncertainty is denoted by superscript s.
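The SmBoP confidence approximation described above (deduplicate by maximum score, then softmax) can be sketched as:

```python
import math

def smbop_confidence(beam):
    """beam: list of (sql, score) pairs; duplicate queries may appear.
    Keep the maximum score per distinct query, then softmax over scores."""
    best = {}
    for sql, score in beam:
        best[sql] = max(score, best.get(sql, float("-inf")))
    z = max(best.values())                       # shift for numerical stability
    exps = {sql: math.exp(s - z) for sql, s in best.items()}
    total = sum(exps.values())
    return {sql: e / total for sql, e in exps.items()}
```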

Evaluation Metrics
We first evaluate our model on the error detection task. After that, we test its performance when used for re-ranking and action triggering. For error detection, we report precision, recall, and F1 scores for each method on both positive and negative samples. However, these metrics depend on the threshold used. To evaluate the overall discriminative ability of each method more comprehensively, we also report the area under the receiver operating characteristic curve (AUC), which is not affected by the choice of threshold. We apply 5-fold cross-validation and report performance using the threshold that maximizes the accuracy of each method. Test samples are partitioned by databases.
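The threshold-free AUC and the accuracy-maximizing threshold can be computed as in the following sketch, which uses the rank-statistic formulation of AUC (probability that a random positive outscores a random negative):

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney statistic; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_threshold(scores, labels):
    """Threshold maximizing accuracy; predict 1 iff score >= t."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(scores)):
        acc = sum((s >= t) == y for s, y in zip(scores, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```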
For the re-ranking task, we evaluate on the final beam predictions of fully trained base parsers on Spider's development set and report top-1 accuracy.
For action triggering, we evaluate system performance under two settings: answer triggering and interaction triggering.In answer triggering, we measure system answer precision when answering different numbers of questions.In interaction triggering, we measure system accuracy using different numbers of interactions.
Error detection and re-ranking results are averaged over 3 different random seeds. For action triggering, we evaluate the checkpoints with the highest accuracy on the development split of our collected data.
Implementation Our models are trained with a batch size of 16 and optimized by the AdamW (Loshchilov and Hutter, 2019) optimizer with default parameters. Training lasts 20 epochs with a learning rate of 3e-5 following a linear decay schedule with 10% warm-up steps. All models are trained on an NVIDIA RTX A6000 GPU.

Error Detection
To evaluate error detection methods in a realistic setting, we use the executable final SQL predictions made by SmBoP, BRIDGE v2, and NatSQL on Spider's development set as test datasets. As shown in Table 2, our model yields the largest performance gain in both accuracy and AUC with NatSQL and reasonable gains with the other two parsers, possibly due to the higher quality of its training data and better behavior consistency on the test split.

Cross-parser Generalization
We evaluate our models' cross-parser generalization ability by training error detectors on data collected from one parser and testing on the other two, following the same 5-fold cross-validation setting. Table 3 summarizes cross-parser transfer performance on each parser. Even in this setting, our error detectors still outperform parser-dependent metrics, except on SmBoP, where our models fall slightly short in AUC. On all parsers, we observe better performance from models trained with stronger parsers. For example, on SmBoP, the CodeBERT+GAT model trained with NatSQL is better than the one trained with BRIDGE v2 by 2.1% in accuracy and 1.2% in AUC. Meanwhile, the models trained with SmBoP perform the best on BRIDGE v2 and NatSQL in negative F1, accuracy, and AUC. We hypothesize that errors made by stronger parsers are more diverse and of higher quality and thus allow models trained on them to generalize better to weaker parsers. We found that the errors generated by BRIDGE v2 and NatSQL, two autoregressive parsers, are more likely to share prefixes and differ in simple operations, such as the choice of columns, aggregation functions, or logic operators (examples in Appendix E). In contrast, the bottom-up decoder of SmBoP generates more diverse errors with complex structures, such as subqueries and set operations. The higher diversity of SmBoP's predictions increases the coverage of the data collected from it, which contributes to the stronger generalization ability of the corresponding error detectors.

Re-ranking

We evaluate the re-ranking performance of our error detection models in two settings. In re-ranking-all (RR), we re-rank all beams based on the score assigned by the error detector. In error detection then re-ranking (ED+RR), we only re-rank the beams whose top-ranked prediction has a score below a given threshold. For simplicity, we use a decision threshold of 0.5 for error detection.

As shown in Table 4, our error detectors can improve the performance of BRIDGE v2 and NatSQL in both settings without training on any re-ranking supervision. Compared with existing re-rankers, our model does not need extra hyper-parameters for performance gain, even in the re-ranking-all setting. However, re-ranking hurts the performance of SmBoP. We attribute this to the larger train-test discrepancy due to the bottom-up nature of SmBoP's decoder. As discussed in Sections 3.2 and 4.2.2, SmBoP produces more diverse beam predictions, but its behavior is less consistent on the test split. While the diversity benefits the quality of data for training error detectors, the inconsistency makes re-ranking on the test split harder. Although SmBoP is the strongest parser among the three, state-of-the-art text-to-SQL parsers predominantly use autoregressive decoders. Therefore, we still expect our approach to be generally applicable. We further perform zero-shot re-ranking evaluation on the more challenging KaggleDBQA (Lee et al., 2021) dataset (Appendix B). CodeBERT+GAT improves BRIDGE v2's accuracy from 20.5% to 21.8%, showing good generalization to unseen datasets.
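The RR and ED+RR decision rules can be sketched as follows; the 0.5 threshold matches the error detection setting described above.

```python
def rerank(beam_scores, threshold=0.5, mode="ED+RR"):
    """beam_scores: detector scores s for a beam, in the parser's original order.
    Returns the index of the prediction the system should output.

    RR     re-ranks every beam by the detector score.
    ED+RR  re-ranks only beams whose top prediction is flagged (s < threshold).
    """
    if mode == "ED+RR" and beam_scores[0] >= threshold:
        return 0                          # keep the parser's own top prediction
    return max(range(len(beam_scores)), key=beam_scores.__getitem__)
```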

Action Triggering in Interactive Systems
In this section, we evaluate the potential gain of using our error detection model as an answer trigger and interaction trigger in interactive semantic parsing systems.
Answer triggering When using error detectors for answer triggering, the interactive semantic parsing system refrains from answering the user's question when an error is detected. The upper half of Figure 2 shows how precision changes as the decision threshold varies. In general, a higher threshold on p (or a lower threshold on s) reduces the number of questions answered in exchange for higher precision. Conversely, a lower threshold on p (or a higher one on s) encourages the system to answer more questions at the cost of making more mistakes.
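The answer-triggering sweep can be sketched as follows, using detector scores where higher means more confident (for the uncertainty baseline s, the comparison direction would be flipped):

```python
def precision_coverage_curve(scores, correct):
    """Sweep the decision threshold; at each point the system answers only
    questions whose top prediction scores at least t, and abstains otherwise.
    Returns (threshold, #answered, precision) triples."""
    curve = []
    for t in sorted(set(scores)):
        answered = [c for s, c in zip(scores, correct) if s >= t]
        if answered:
            curve.append((t, len(answered), sum(answered) / len(answered)))
    return curve
```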
Because of the high precision on positive samples, the proposed error detectors outperform both baseline methods and allow the system to answer more questions at higher precision. As shown by Table 5, when maintaining a precision of 95%, our error detectors allow the system to answer 76% to 175% more questions compared to parser-dependent metrics.

Table 7: The number of interactions each parser needs with interaction triggering to reach an accuracy of 95%.
Interaction triggering We simulate the potential gain of more accurate interaction triggers by assuming oracle error correction interactions, where any detected error is fixed through interactions with users. Ideally, we want to reach higher system accuracy with fewer interactions. The lower half of Figure 2 illustrates how accuracy changes under different interaction budgets. Our parser-independent models consistently improve upon parser-dependent metrics, resulting in more efficient interactive semantic parsing systems. Due to higher precision and recall on erroneous base predictions, systems using our models can correct more errors and avoid unnecessary interactions. As shown by Table 7, depending on the base parser, our model brings a 3.4% to 33% reduction in the number of interactions required to reach an accuracy of 95%.
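The oracle-interaction simulation can be sketched as follows: the least-confident predictions are sent for interaction first, and an interaction on an incorrect prediction is assumed to fix it.

```python
def accuracy_at_budget(scores, correct, budget):
    """Spend `budget` oracle interactions on the least-confident predictions;
    each interaction on an incorrect prediction corrects it."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # lowest s first
    fixed = set(order[:budget])
    return sum(c or (i in fixed) for i, c in enumerate(correct)) / len(correct)
```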

Ablation
We perform ablation studies on the impact of cross-domain error collection and graph learning using the CodeBERT+GAT model. We report the models' accuracy, AUC, and re-ranking performance in the re-ranking-all setting (RR) on the test split of NatSQL. We also test the models on BRIDGE v2 and SmBoP to evaluate their generalization ability.

Cross-domain error collection
We train a NatSQL model using the full Spider training set and perform inference on the same data to get its beam predictions. Then we create training data for error detection following the procedure described in Section 3.2. In this way, we collect in-domain parsing errors in the same setting as Bertrand-DR and T5QL. As shown by Table 6, the error detector trained on in-domain errors significantly underperforms the one trained on cross-domain errors. The performance of NatSQL deteriorates after re-ranking, which is consistent with the findings of previous re-rankers. Thus, we conclude that collecting high-quality parsing errors in a realistic cross-domain setting is critical to building an accurate and generalizable error detector.

Simplified graph input
In this setting, we do not simplify constituency parse trees and SQL abstract syntax trees when constructing input graphs for the graph neural networks. Table 6 shows that the model without graph simplification slightly outperforms the one using simplified graphs in AUC. Despite that, its re-ranking and cross-parser generalization performance are lower. We hypothesize that graph simplification maintains the important structural features of the input and improves the model's generalization ability by alleviating overfitting during training.

Conclusion
In this work, we propose the first generalizable and parser-independent error detection model for text-to-SQL semantic parsing. By learning compositional structures in natural language questions and SQL queries, the proposed model significantly outperforms parser-dependent uncertainty metrics and generalizes well to unseen parsers. We further demonstrate the versatility of our approach in error detection, re-ranking, and action triggering through a case study with three state-of-the-art text-to-SQL parsers featuring different decoding mechanisms.
Our experiments highlight the important role of structural features and cross-domain training data in building strong and generalizable error detectors for semantic parsing. Potential future work includes (1) developing more advanced architectures to better evaluate the semantic correctness of synthesized SQL queries, and (2) exploring data synthesis strategies to automatically create high-quality training data for error detection models.

Limitations
This work is a first attempt towards building a versatile error detector for text-to-SQL semantic parsing. Although our model is parser-independent, the current data collection process depends on the choice of base parsers. As a result, the collected data may inherit certain biases of the base parsers. Our experiments show that data collected from stronger base parsers helps the model generalize to weaker parsers. However, how to collect high-quality training data for error detection with stronger base parsers like SmBoP remains an open problem. A promising future direction may be developing a comprehensive data synthesis approach to improve the quality of training data. Grappa (Yu et al., 2021) uses a context-free grammar to synthesize SQL queries for pre-training Transformer encoders for text-to-SQL parsing. This approach could be adapted to generate syntactically correct but semantically incorrect SQL queries in a controllable way.
Another major limitation is that our current model does not consider database schema information. Since SQL queries are grounded in databases, in principle the database schema (tables, columns, and foreign-key relationships) should be an important part of error detection. The common practice in text-to-SQL semantic parsing is to linearize the database schema and concatenate all table and column names to the input of the Transformer encoder. However, our preliminary experiments show that this operation actually hurts error detection performance. A similar observation is also reported by Kelkar et al. (2020). Nevertheless, our approach performs strongly for error detection as it can still effectively capture semantic errors that are free from schema linking mistakes. This can be explained by the high column mention rate in Spider (Pi et al., 2022). Future work could develop more effective entity linking mechanisms to extend our model to more challenging testing environments where schema linking errors are more common.

The original grammar for select_core: notice the excessive use of * in the original grammar, which fails to represent the hierarchical relationship between the SELECT statement and each clause.

E Qualitative Beam Examples
As mentioned in Section 3.2, the three text-to-SQL parsers behave differently. We present their beam predictions on two samples from our training split in Table E.4. We can observe that SmBoP and NatSQL generate more executable SQL queries than BRIDGE v2. Both SmBoP and NatSQL are capable of generating diverse errors, but NatSQL's beam predictions are more likely to share prefixes. As an example, SmBoP generates diverse SELECT clauses on both samples, while the SELECT clauses predicted by BRIDGE v2 and NatSQL do not change.
Figure 1: Architecture of our error detection models.

Figure 2: Performance in simulated interactive semantic parsing with three base parsers.

Table 2: Error detection performance with three base parsers on Spider's development set. We highlight the best performance with each parser in bold.

Table 3: Cross-parser generalization performance with three base parsers on Spider's development set. We highlight the best performance with each target parser in bold.

Table 5: The number of questions each parser could answer when maintaining a precision of 95%.

Table 6: Ablation results using the CodeBERT+GAT model trained on data collected from NatSQL. We report accuracy, AUC, and re-ranking-all (RR) performance on NatSQL's test split as in-domain evaluation, and report accuracy and AUC when tested on SmBoP and BRIDGE v2 as generalization evaluation. NatSQL^p is included for reference.