TURING: an Accurate and Interpretable Multi-Hypothesis Cross-Domain Natural Language Database Interface

A natural language database interface (NLDB) can democratize data-driven insights for non-technical users. However, existing text-to-SQL semantic parsers cannot achieve high enough accuracy in the cross-database setting to allow good usability in practice. This work presents TURING, an NLDB system that works toward bridging this gap. The cross-domain semantic parser of TURING, with our novel value prediction method, achieves 75.1% execution accuracy and 78.3% top-5 beam execution accuracy on the Spider validation set (Yu et al., 2018b). To benefit from the higher beam accuracy, we design an interactive system where the SQL hypotheses in the beam are explained step by step in natural language, with their differences highlighted. The user can then compare and judge the hypotheses to select the one that reflects their intention, if any. The English explanations of SQL queries in TURING are produced by our high-precision natural language generation system based on synchronous grammars.


Introduction
Today a vast amount of knowledge is hidden in structured datasets, not directly accessible to non-technical users who are unfamiliar with the corresponding database query language, such as SQL or SPARQL. Natural language database interfaces (NLDBs) enable everyday users to interact with databases (Zelle and Mooney, 1996; Popescu et al., 2003; Li and Jagadish, 2014; Zeng et al., 2020). However, correctly translating natural language to executable queries is challenging, as it requires resolving all the ambiguities and subtleties of natural utterances for a precise mapping. Furthermore, quick deployment and adoption of an NLDB require zero-shot transfer to new databases without an in-domain text-to-SQL parallel corpus, i.e., cross-database semantic parsing (SP), which lowers translation accuracy even further. Finally, unlike in other NLP applications where partially correct results can still provide partial utility, a SQL query with a slight mistake can cause negative utility if trusted blindly, or confuse users.
The recent Spider benchmark (Yu et al., 2018a) captures this cross-domain problem, and state-of-the-art methods merely achieve around 70% execution accuracy at the time of this submission. Meanwhile, generalization to datasets collected under different protocols is even weaker (Suhr et al., 2020). Finally, users generally have no way to know whether the NLDB made a mistake except in very obvious cases. The high error rate, combined with the overall system's opacity, makes it hard for users to trust any output from the NLDB.
Our key observation is that our model's top-5 accuracy on Spider is 78.3%, significantly higher than both the previous best single-model method at around 68% and our own top-1 accuracy. Top-5 accuracy is the proportion of examples for which one of the top five hypotheses from beam-search inference is correct (under execution accuracy evaluation). For top-5 accuracy to be relevant in practice, a non-technical user needs to be able to pick the correct hypothesis from the candidate list. To this end, we design a feedback system that unambiguously explains the top beam-search results while presenting their differences intuitively and visually. Users can then judge which, if any, of the parses correctly reflects their intentions. The explanation system uses a hybrid of two synchronous context-free grammars, one shallow and one deep. Together, they achieve good readability for the most frequent 299 query patterns while maintaining near-complete coverage overall.
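To make the metric concrete, top-k beam accuracy can be computed as in the following sketch (the function name and data layout are illustrative, not from our system):

```python
def top_k_accuracy(beam_results, gold_results, k=5):
    """Fraction of examples where any of the first k beam hypotheses
    matches the gold execution result. Names and layout are illustrative."""
    hits = sum(1 for hyps, gold in zip(beam_results, gold_results)
               if gold in hyps[:k])
    return hits / len(gold_results)

# Two examples: the gold result appears in the first beam but not the second.
top_k_accuracy([["r1", "r2"], ["r3"]], ["r2", "r4"], k=5)  # -> 0.5
```

Top-1 accuracy is the special case k=1, which is why widening the beam can only increase this number.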
Our system, TURING, is not only interpretable but also a highly accurate cross-domain NLDB. Our semantic parser is based on the one in Xu et al. (2020), which, like many other previous state-of-the-art models on Spider, does not handle value prediction. Compared to previous executable semantic parsers, we achieve significant gains with a number of techniques, but predominantly by drastically simplifying the learning problem in value prediction. The model only needs to identify the text span providing evidence for the ground-truth value. The noisy, long-tail text normalization step required for producing the actual value is offloaded to a deterministic search phase in post-processing.
In summary, this work presents two steps towards a more robust NLDB:
1. A state-of-the-art text-to-SQL parsing system with the best top-1 execution accuracy on the Spider development set.
2. A way to relax the usability requirement from top-1 accuracy to top-k accuracy by explaining the different hypotheses in natural language with visual aids.

System Overview
As shown in Figure 1, TURING's interface has two main components: the database browser, which shows the schema and selected database content, and the search panel, where users interact with the parser. The caption of Figure 1 describes a typical user interaction using an example. Behind the front-end interface, TURING consists of an executable cross-domain semantic parser trained on Spider that maps user utterances to SQL query hypotheses, a SQL execution engine that runs the queries to obtain answers, and an explanation generation module that produces the explanation text and the metadata powering explanation highlighting. The next sections describe the semantic parsing and explanation modules.

Semantic Parser
The backbone of TURING is a neural semantic parser that generates an executable SQL query T given a user question Q and a database schema S. We follow the state-of-the-art system of Xu et al. (2020), but extend it to generate executable SQL queries instead of ignoring values in the SQL query, as many other top systems on the Spider leaderboard do (Wang et al., 2019; Guo et al., 2019).
At a high level, our SP adopts the grammar-based framework of TranX (Yin and Neubig, 2018) with an encoder-decoder neural architecture. A grammar-based transition system turns the generation of the SQL abstract syntax tree (AST) into a sequence of tree-constructing actions to be predicted by the parser. The encoder f_enc jointly encodes the user question Q = q_1 ... q_|Q| and the database schema S = {s_1, ..., s_|S|}, which consists of the tables and columns in the database. The decoder f_dec is a transition-based abstract syntax decoder that uses the encoded representation H to predict the target SQL query T. The decoder also relies on the transition system to convert the AST constructed by the predicted action sequence into the executable surface SQL query.
To alleviate unnecessary burden on the decoder, we introduce two novel modifications to the transition system for schema and value decoding. With simple but effective value-handling, inference, and regularization techniques applied to this transition system, we are able to push the execution accuracy much higher for better usability.

Transition System
Our transition system has four types of actions to generate the AST: (1) ApplyRule[r], which applies a production rule r to the latest generated node in the AST; (2) Reduce, which completes the generation of the current node; (3) SelectColumn[c], which chooses a column c from the database schema S; and (4) CopyToken[i], which copies a token q_i from the user question Q.
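The four action types can be sketched as simple tagged records; the rule names, schema indices, and action sequence below are purely illustrative, not our actual grammar:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ApplyRule:      # expand the frontier AST node with production rule `rule`
    rule: str

@dataclass(frozen=True)
class Reduce:         # complete the generation of the current node
    pass

@dataclass(frozen=True)
class SelectColumn:   # choose column index `col` from the flattened schema
    col: int

@dataclass(frozen=True)
class CopyToken:      # copy the question token at position `idx` (value evidence)
    idx: int

# A hypothetical (simplified) action sequence for a query like
# "average age of dogs that appear in treatments":
actions = [
    ApplyRule("sql -> select where"),
    ApplyRule("select -> agg column"),
    ApplyRule("agg -> avg"),
    SelectColumn(3),          # e.g., dogs.age
    ApplyRule("where -> in_subquery"),
    SelectColumn(0),          # e.g., dogs.dog_id
    Reduce(),
]
```

Replaying such a sequence against the grammar reconstructs the AST deterministically.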
There are two key distinctions between our transition system and previous ones. First, our transition system omits the SelectTable action type used by other transition-based SP systems (Wang et al., 2019; Guo et al., 2019). This is made possible by attaching the corresponding table to each column, so that the tables in the target SQL query can be deterministically inferred from the predicted columns. Second, we simplify value prediction by always trying to copy from the user question, instead of applying the GenToken[v] action (Yin and Neubig, 2018), which generates tokens from a large vocabulary or chooses from a pre-processed picklist. Both changes constrain the output space of the decoder to ease the learning process, but the latter change unrealistically assumes that values are always explicitly mentioned in the question. To retain generation flexibility without putting excessive burden on the decoder, we propose a conceptually simple but effective strategy to handle values next.

Figure 1: TURING system in action: the user selected the database "Dog kennels"; the left and top panels show the database schema and table content. The user then entered "What is the average age of the dogs who have gone through any treatments?" in the search box. The question is run through the semantic parser, producing multiple SQL hypotheses from beam search, which are then explained step by step as shown. The differences across the hypotheses are highlighted, and the tokens corresponding to tables and columns are in bold. If there were more valid hypotheses, a "Show more" button would appear to reveal the additional ones.
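The deterministic table inference that replaces SelectTable can be sketched as follows; the column-to-table mapping here is a toy example, not the Spider schema:

```python
# Each schema column carries its owning table, so the FROM clause can be
# recovered from the predicted columns alone. Mapping is illustrative.
COLUMN_TABLE = {
    "dogs.age": "dogs",
    "dogs.dog_id": "dogs",
    "treatments.dog_id": "treatments",
}

def infer_tables(predicted_columns):
    """Deterministically infer the ordered set of tables from predicted columns."""
    seen, tables = set(), []
    for col in predicted_columns:
        table = COLUMN_TABLE[col]
        if table not in seen:
            seen.add(table)
            tables.append(table)
    return tables

infer_tables(["dogs.age", "treatments.dog_id"])  # -> ["dogs", "treatments"]
```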

Handling Values
Value prediction is a challenging but crucial component of NLDBs; however, only limited effort has been committed to handling values properly in the current cross-domain SP literature. Value mentions in the question are usually noisy, if present at all, and often require commonsense or domain knowledge to be resolved. On the other hand, the number of possible values in a database can be huge, leading to sparse learning signals if the model tries to choose among all value candidates.
Instead of attempting to predict the actual values directly, our SP simply learns to identify the input text spans providing evidence for the values. As mentioned earlier, we introduce the CopyToken action to copy an input span from the user question, indicating the clue for the value. The ground-truth CopyToken[i] actions are obtained from a tagging strategy based on heuristics and fuzzy string matching between the user question and the gold values. As a result, the decoder can focus on understanding the question without considering the complexities of the actual values, which are difficult to learn. If a value is only implicitly mentioned in the user question, nothing is copied. We leave the identification of the actual values to a deterministic search-based inference step in post-processing, after decoding. This yields a simpler learning task, as the neural network does not need to perform domain-specific text normalization such as mapping "female" to "F" for some databases.
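A minimal sketch of such a tagging strategy is shown below; the span length limit, scoring, and threshold are illustrative assumptions, not our exact heuristics:

```python
from difflib import SequenceMatcher

def tag_value_span(question_tokens, gold_value, threshold=0.75):
    """Find the question span whose text best fuzzy-matches the gold SQL
    value; returns (start, end) token indices, or None if the value is only
    implicitly mentioned. Thresholds and scoring are illustrative."""
    best, best_score = None, threshold
    for i in range(len(question_tokens)):
        # Consider spans of up to 5 tokens starting at i.
        for j in range(i + 1, min(i + 5, len(question_tokens)) + 1):
            span = " ".join(question_tokens[i:j])
            score = SequenceMatcher(None, span.lower(), gold_value.lower()).ratio()
            if score > best_score:
                best, best_score = (i, j), score
    return best

q = "what is the average age of abandoned dogs".split()
tag_value_span(q, "abandoned")  # -> (6, 7)
```

The resulting (start, end) indices directly supervise the CopyToken[i] actions during training.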
Given the schema, the predicted SQL AST, and the database content, the post-processing step first identifies the corresponding column type (number, text, time), operation type (like, between, >, <, =, ...), and aggregation type (count, max, sum, ...). Based on these types, it infers the type and normalization required for the value. If needed, it then performs a fuzzy search over the corresponding column's values in the database. When nothing is copied, a default value is chosen based on heuristics: when the column contains only the two elements "Yes" and "No", the default value is "Yes"; otherwise, the most frequent element in the column is chosen. Searching the database content can also be restricted to a picklist for privacy reasons, as in previous work (Zeng et al., 2020).
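The search step for text columns can be sketched as follows; the function name, similarity cutoff, and default-value heuristics are illustrative assumptions, not our exact implementation:

```python
from difflib import get_close_matches

def resolve_value(copied_span, column_values):
    """Map the copied question span to an actual cell value via fuzzy search;
    fall back to a default when nothing was copied. Heuristics are illustrative."""
    if not copied_span:
        unique = set(column_values)
        if unique == {"Yes", "No"}:
            return "Yes"                               # binary-column default
        return max(unique, key=column_values.count)    # most frequent value
    match = get_close_matches(copied_span, column_values, n=1, cutoff=0.6)
    return match[0] if match else copied_span

resolve_value("Asian", ["Asia", "Europe", "Africa"])  # -> "Asia"
```

Because this step is deterministic and operates on database content, it needs no training data and can be extended with domain-specific term mappings instantly.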
Another benefit of this simple value-handling strategy is that it is easy to explain; the details are presented in Sec. 4.
The encoder, f_enc, maps the user question Q and the schema S to a joint representation H = {φ_{q_1}, ..., φ_{q_|Q|}} ∪ {φ_{s_1}, ..., φ_{s_|S|}}. It contextualizes the question and schema jointly, through the RoBERTa-Large model similar to Guo et al. (2019), as well as through an additional stack of 24 relation-aware transformer (RAT) layers (Wang et al., 2019). As mentioned in Section 3.1, tables are not predicted directly but inferred from the columns, so we augment the column representations by adding the corresponding table representations after the encoding process.
We use an LSTM decoder f_dec to generate the action sequence A. Formally, the generation process is factorized as Pr(A | H) = ∏_t Pr(a_t | a_<t, H), where H is the encoded representation output by the encoder f_enc.
The LSTM state is updated following Wang et al. (2019): m_t, h_t = f_LSTM([a_{t-1} ‖ z_{t-1} ‖ h_{p_t} ‖ a_{p_t} ‖ n_{p_t}], m_{t-1}, h_{t-1}), where m_t is the LSTM cell state, h_t is the LSTM output at step t, a_{t-1} is the action embedding of the previous step, z_{t-1} is the context representation computed using multi-head cross-attention of h_{t-1} over H, p_t is the step corresponding to the parent AST node of the current node, and n_{p_t} is the node type embedding. For ApplyRule[r], we compute Pr(a_t = ApplyRule[r] | a_<t, H) = softmax_r(g(z_t)), where g(·) is a 2-layer MLP. For SelectColumn[c], we use the memory-augmented pointer network of Guo et al. (2019). For CopyToken[i], a pointer network is employed to copy tokens from the user question Q, with a special token indicating the termination of copying.

Column Label Smoothing
One of the core challenges for cross-domain SP is to generalize to unseen domains without overfitting to the specific domains seen during training. Empirically, we observe that applying uniform label smoothing (Szegedy et al., 2016) to the objective term for predicting SelectColumn[c] effectively addresses this overfitting problem in the cross-domain setting. Formally, the cross-entropy we optimize for a ground-truth column c* becomes (1 − ε) · log p(c*) + (ε/K) · Σ_c log p(c), where K is the number of columns in the schema, ε is the weight of the label smoothing term, and p(·) = Pr(a_t = SelectColumn[·] | a_<t, H).
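The smoothed objective can be written as a short sketch (pure-Python for clarity; the function name is ours, not the paper's):

```python
import math

def column_nll_with_smoothing(p, gold, eps=0.2):
    """Label-smoothed negative log-likelihood over K schema columns.
    `p` is the model's column distribution, `gold` the ground-truth index.
    Implements -( (1-eps)*log p(c*) + (eps/K) * sum_c log p(c) )."""
    K = len(p)
    ce = (1 - eps) * math.log(p[gold]) + (eps / K) * sum(math.log(pc) for pc in p)
    return -ce
```

A sanity check of the formula: under a uniform prediction over K columns, the smoothing term coincides with the standard term, so the loss reduces to log K regardless of eps.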

Weighted Beam Search
During inference, we use beam search to find high-probability action sequences. As mentioned above, column prediction is prone to overfitting in the cross-domain setting. In addition, value prediction depends on column prediction: if a column is predicted incorrectly, the associated value has no chance of being predicted correctly. We therefore introduce two hyperparameters that reweight the beam score by action type, with a larger weight α > 1 for SelectColumn and a smaller weight 0 < β < 1 for CopyToken.
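One way to realize this reweighting, sketched under the assumption that the beam score is a weighted sum of per-action log-probabilities (data layout is illustrative):

```python
import math

def weighted_score(actions, alpha=3.0, beta=0.1):
    """Weighted beam score: each action's log-probability is scaled by its
    action type, up-weighting SelectColumn and down-weighting CopyToken.
    `actions` is a list of (action_type, probability) pairs."""
    total = 0.0
    for kind, prob in actions:
        if kind == "SelectColumn":
            weight = alpha
        elif kind == "CopyToken":
            weight = beta
        else:
            weight = 1.0
        total += weight * math.log(prob)
    return total
```

With α > 1, an uncertain column choice is penalized more heavily than other actions, so hypotheses differing in columns are ranked more conservatively; with β < 1, noisy value copies cannot drown out an otherwise good parse.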

Explanation Generation
The goal of the explanation generation system is to unambiguously describe what the semantic parser understood as the user's command and to allow the user to easily interpret the differences across the multiple hypotheses. Therefore, unlike in a typical dialogue system setting where language generation diversity is essential, controllability and consistency are of primary importance here. The generation not only needs to be 100% factually correct, but the differences in the explanations also need to reflect the differences in the predicted SQL queries, no more and no less. We therefore use a deterministic rule-based generation system instead of a neural model.
Our explanation generator is a hybrid of two synchronous context-free grammar (SCFG) systems combined with additional heuristic post-processing steps. The two grammars trade off readability and coverage. One SCFG is shallow and simple, covering the most frequent SQL queries; the other is deep and more compositional, covering the tail of the query distribution that our SP can produce, for completeness. An SCFG produces SQL and its English explanation in parallel: given a SQL query, we parse it under the grammar to obtain a derivation, which we then follow to generate the explanation text. At inference time, for a given question, if any of the SQL hypotheses cannot be parsed by the shallow SCFG, we fall back to the deep one.
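The core idea of synchronous production can be illustrated with a toy rule; the rule name, templates, and slot names below are invented for illustration and are far simpler than the actual grammars:

```python
# A toy synchronous rule: each non-terminal expands to a (SQL, English) pair
# sharing the same slots, so the two sides stay aligned by construction.
RULES = {
    "simple_select": (
        "SELECT {agg}({col}) FROM {table}",
        "find the {agg_en} of {col} in the {table} table",
    ),
}

def generate(rule, **slots):
    """Fill both sides of a synchronous rule with the same slot values."""
    sql_template, en_template = RULES[rule]
    return sql_template.format(**slots), en_template.format(**slots)

sql, en = generate("simple_select", agg="AVG", agg_en="average",
                   col="age", table="dogs")
# sql -> "SELECT AVG(age) FROM dogs"
# en  -> "find the average of age in the dogs table"
```

Because both sides are produced from one derivation, any difference between two explanations necessarily corresponds to a difference between the underlying SQL queries.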

Details of the Grammars
Using the deep SQL syntax trees allows almost complete coverage of the Spider domains. However, these explanations can be unnecessarily verbose, as the generation process faithfully follows the re-ordered AST without (1) compressing repeated mentions of schema elements when possible or (2) summarizing tedious details of the SQL query into higher-level logical concepts. Even though these explanations are technically correct, a practical explanation should allow users to spot the differences between queries easily. To this end, we design the shallow grammar similarly to the template-based explanation system of Elgohary et al. (2020), which simplifies the SQL parse trees by collapsing large subtrees into single tree fragments. In the resulting shallow parses, production rules yield non-terminal nodes corresponding to (1) anonymized SQL templates, (2) UNION, INTERSECT, or EXCEPT operations over two templates, or (3) a template pattern followed by an ORDER-BY-LIMIT clause. Our shallow but wide grammar has 64 rules with these non-terminal nodes. The pre-terminal nodes are placeholders in the anonymized SQL queries, such as table names, column names, and aggregation operators; the terminal nodes are the values filling in the placeholders. The advantage of this grammar is that each high-level SQL template can be associated with an English explanation template that conveys the high-level logic and abstracts away the details of the concrete queries. To further reduce redundancy, we make assumptions to avoid unnecessarily repeating table and column names. Table 1 showcases some rules from the shallow SCFG and one example explanation. In practice, for around 75% of the examples in the Spider validation set, all beam hypotheses from our SP model are parsable by the shallow grammar, with the rest handled by the deep grammar. The deep grammar has fewer than 50 rules.
But because the deep grammar is more compositional, it covers 100% of the valid SQL queries that can be generated by our semantic parser. Some sample explanations by the deep grammar can be found in Table 2.
Finally, whenever the final value in the query differs from the original text span due to post-processing, a sentence in the explanation states the change explicitly for clarity, for example: "'Asian' in the question is matched to 'Asia', which appears in the column Continent."

Quantitative Evaluations
Implementation Details. We apply the DT-Fixup technique from Xu et al. (2020) to train our semantic parser and mostly reuse their hyperparameters. The weight ε of the column label smoothing term is 0.2. Inference uses a beam size of 5 for the beam search. We set the column weight to α = 3 and the value weight to β = 0.1.

Dataset. We use Spider (Yu et al., 2018b), a complex, cross-domain text-to-SQL semantic parsing benchmark with 10,180 questions and 5,693 queries covering 200 databases in 138 domains. All our experiments are evaluated on the development set, using execution match with values (Exec) as the evaluation metric.

Results on Spider. We compare TURING with the top systems on the Spider execution leaderboard that have published reports with execution accuracy on the development set as well.

Table 1: Sample shallow grammar production rules and one example explanation. The example explanation reads: step 1: find the average of product price in the products table; step 2: find the different values of the product type code in the products table whose average of the product price is greater than the results of step 1.
Table 2: Sample explanations generated by the deep grammar. One example reads: Step 1: find the entries in the employee table whose age is less than 30.0. Step 2: among these results, for each city of the employee table, where the number of records is more than 1, find city of the employee table. "30" in the question is converted to 30; "one" in the question is converted to 1.
Ablation Study. Table 4 shows an ablation study of the various techniques in TURING. Removing the value post-processing decreases accuracy significantly, showing that copying alone is not enough due to the mismatch between linguistic variation and schema-specific normalization. The effectiveness of the proposed column label smoothing and weighted beam search is also reflected in the Exec accuracy on Spider. Furthermore, simply adding more hypotheses in the beam significantly boosts the coverage of correct predictions, leading to a 4.5% accuracy gain over the top-1 accuracy. By combining all these techniques, TURING achieves an overall performance gain of more than 10% over the previous best single-model system (68.0% for Bridge v2).

Related Work
Executable Cross-database Semantic Parsing.
Early NLDB systems use rule-based parsing (Zelle and Mooney, 1996; Li and Jagadish, 2014) and cannot handle the diversity of natural language in practice. Neural semantic parsing is more promising for coverage but is still brittle in real-world applications where queries can involve novel compositions of learned patterns (Finegan-Dollak et al., 2018). Furthermore, to allow plug-and-play deployment on new databases, the underlying semantic parser cannot be trained on an in-domain parallel corpus but needs to transfer across domains in a zero-shot fashion. Executable cross-database semantic parsing is even more challenging. Much of the previous work only tackles the cross-domain part, omitting the value prediction required for executable queries (Guo et al., 2019; Wang et al., 2019; Choi et al., 2020; Xu et al., 2020). Unlike the output space for predicting the SQL sketch or columns, the output space for value prediction is much less constrained: the correct value depends on the source question, the SQL query, the type information of the corresponding column, and the database content. This complexity, combined with the limited training data in standard benchmark datasets like Spider, makes the task very difficult. Some previous works directly learn to predict the values (Yin and Neubig, 2018; Guo and Gao, 2020) on WikiSQL (Zhong et al., 2017), but this does not generalize to cross-domain settings. On Spider, Zeng et al. (2020) and Lin et al. (2020) build a candidate list of values first and learn a pointer network to select from the list. TURING instead learns a pointer network to identify the input span that provides evidence for the value, rather than the value itself, as described earlier; identification of the actual value is offloaded to post-processing.
From a system perspective, it is also simpler for a power user of the NLDB to upload a domain-specific term description or mapping, which can extend the heuristic-search-based value post-processing instantly, rather than relying on re-training.
Query Explanation. Explaining structured query languages has been studied in the past (Simitsis and Ioannidis, 2009; Koutrika et al., 2010; Ngomo et al., 2013; Xu et al., 2018). Full NLDB systems can leverage explanations to correct mistakes with user feedback (Elgohary et al., 2020) or to prevent mistakes by giving clarifications (Zeng et al., 2020). However, these methods can only handle cases where the mistake or ambiguity concerns table, column, or value prediction; there is no easy way to resolve structural mistakes or ambiguities when the query sketch is wrong. TURING, on the other hand, offers the potential to recover from such mistakes when the correct query is among the top beam results. This is an orthogonal contribution that could be integrated with other user-interaction modes. Finally, the NaLIR system (Li and Jagadish, 2014) has a similar feature allowing the user to pick from multiple interpretations of the input question. However, NaLIR's interpretations are based on syntactic parses of the question rather than on the final semantic parses directly; a rule-based semantic parser then maps the selected syntactic parse to SQL. As the syntactic parse is not guaranteed to map to the correct SQL, this interpretation does not completely close the gap between what the NLDB performs and what the user thinks it does.

Conclusion
We presented TURING, a natural language interface to databases (NLDB) that is accurate, interpretable, and works on a wide range of domains. Our system explains its actions in natural language so that the user can select the right answer from multiple hypotheses, capitalizing on the much higher beam accuracy instead of top-1 accuracy. TURING provides a complementary way to resolve mistakes and ambiguities in NLDB.