LGESQL: Line Graph Enhanced Text-to-SQL Model with Mixed Local and Non-Local Relations

This work aims to tackle the challenging heterogeneous graph encoding problem in the text-to-SQL task. Previous methods are typically node-centric and merely utilize different weight matrices to parameterize edge types, which 1) ignore the rich semantics embedded in the topological structure of edges, and 2) fail to distinguish local and non-local relations for each node. To this end, we propose a Line Graph Enhanced Text-to-SQL (LGESQL) model to mine the underlying relational features without constructing meta-paths. By virtue of the line graph, messages propagate more efficiently through not only connections between nodes, but also the topology of directed edges. Furthermore, both local and non-local relations are integrated distinctively during the graph iteration. We also design an auxiliary task called graph pruning to improve the discriminative capability of the encoder. Our framework achieves state-of-the-art results (62.8% with Glove, 72.0% with Electra) on the cross-domain text-to-SQL benchmark Spider at the time of writing.


Introduction
The text-to-SQL task (Zhong et al., 2017; Xu et al., 2017) aims to convert a natural language question into a SQL query, given the corresponding database schema. It has been widely studied in both academic and industrial communities to build natural language interfaces to databases (NLIDB, Androutsopoulos et al., 1995).
One daunting problem is how to jointly encode the question words and database schema items (including tables and columns), as well as the various relations among these heterogeneous inputs. Typically, previous literature utilizes a node-centric graph neural network (GNN, Scarselli et al., 2008) to aggregate information from neighboring nodes. GNNSQL (Bogin et al., 2019a) adopts a relational graph convolution network (RGCN, Schlichtkrull et al., 2018) to take into account different edge types between schema items, such as the T-HAS-C relationship, primary key and foreign key constraints. However, these edge features are directly retrieved from a fixed-size parameter matrix and suffer from a drawback: they are unaware of contextualized information, especially the structural topology of edges. A meta-path is defined as a composite relation linking two objects, which can be used to capture multi-hop semantics. For example, in Figure 1(a), the relations Q-EXACTMATCH-C and C-BELONGSTO-T can form a 2-hop meta-path indicating that some table t has one column exactly mentioned in the question.
Although RATSQL (Wang et al., 2020a) introduces some useful meta-paths such as C-SAMETABLE-C, it treats all relations, whether 1-hop or multi-hop, in the same manner (relative position embedding, Shaw et al., 2018) in a complete graph. Without distinguishing local and non-local neighbors (see Figure 1(b)), each node attends to all the other nodes equally, which may lead to the notorious over-smoothing problem (Chen et al., 2020a). Besides, meta-paths are currently constructed by domain experts or explored by breadth-first search (Kong et al., 2012). Unfortunately, the number of possible meta-paths increases exponentially with the path length, and selecting the most important subset among them is an NP-complete problem (Lao and Cohen, 2010).
To address the above limitations, we propose a Line Graph Enhanced Text-to-SQL model (LGESQL), which explicitly considers the topological structure of edges. According to the definition of a line graph (Gross and Yellen, 2005), we first construct an edge-centric graph from the original node-centric graph. These two graphs capture the structural topology of nodes and edges, respectively. Iteratively, each node in either graph gathers information from its neighborhood and incorporates edge features from the dual graph to update its representation. As for the node-centric graph, we combine both local and non-local edge features into the computation. Local edge features denote 1-hop relations and are dynamically provided by node embeddings in the line graph, while non-local edge features are directly extracted from a parameter matrix. This distinction encourages the model to pay more attention to local edge features while maintaining information from multi-hop neighbors. Additionally, we propose an auxiliary task called graph pruning. It introduces an inductive bias that the heterogeneous graph encoder of text-to-SQL should be able to extract the golden schema items relevant to the question from the entire database schema graph.
Experimental results on the benchmark Spider (Yu et al., 2018b) demonstrate that our LGESQL model promotes the exact set match accuracy to 62.8% (with GLOVE, Pennington et al. 2014) and 72.0% (with the pre-trained language model ELECTRA, Clark et al. 2020). Our main contributions are summarized as follows:
• We propose to model the 1-hop edge features with a line graph in text-to-SQL. Both non-local and local features are integrated during the iteration process of node embeddings.
• We design an auxiliary task called graph pruning, which aims to determine whether each node in the database schema graph is relevant to the given question.
• Empirical results on the Spider dataset demonstrate that our model is effective, and we achieve state-of-the-art performance both with and without pre-trained language models.

Preliminaries
Problem definition Given a natural language question $Q = (q_1, q_2, \cdots, q_{|Q|})$ with length $|Q|$ and the corresponding database schema $S = T \cup C$, the target is to generate a SQL query $y$. The database schema $S$ contains multiple tables $T = \{t_1, t_2, \cdots\}$ and columns $C = \{c^{t_1}_1, c^{t_1}_2, \cdots, c^{t_2}_1, c^{t_2}_2, \cdots\}$. Each table $t_i$ is described by its name, which is further composed of several words $(t_{i1}, t_{i2}, \cdots)$. Similarly, we use the word phrase $(c^{t_i}_{j1}, c^{t_i}_{j2}, \cdots)$ to represent column $c^{t_i}_j \in t_i$. Besides, each column $c^{t_i}_j$ also has a type field $c^{t_i}_{j0}$ to constrain its cell values (e.g. TEXT and NUMBER).
The entire input node-centric heterogeneous graph $G^n = (V^n, R^n)$ consists of all three types of nodes mentioned above, that is $V^n = Q \cup T \cup C$ with the number of nodes $|V^n| = |Q| + |T| + |C|$, where $|T|$ and $|C|$ are the numbers of tables and columns respectively.
Meta-path As shown in Figure 1(a), a meta-path represents a path $\tau_1 \xrightarrow{r_1} \tau_2 \xrightarrow{r_2} \cdots \xrightarrow{r_l} \tau_{l+1}$, where the target vertex type of the previous relation $r_{i-1}$ equals the source vertex type $\tau_i$ of the current relation $r_i$. It describes a composite relation $r = r_1 \circ r_2 \circ \cdots \circ r_l$ between nodes of types $\tau_1$ and $\tau_{l+1}$. In this work, $\tau_i \in \{\text{QUESTION}, \text{TABLE}, \text{COLUMN}\}$. Throughout our discussion, we use the term local to denote relations with path length 1, while non-local relations refer to meta-paths longer than 1. The relational adjacency matrix $R^n$ contains both local and non-local relations; see Appendix A for the enumeration.
Line Graph Each vertex $v^e_i$, $i = 1, 2, \cdots, |V^e|$, in the line graph $G^e = (V^e, R^e)$ can be uniquely mapped to a directed edge $r^n_{st} \in R^n$, or $v^n_s \to v^n_t$, in the original node-centric graph $G^n = (V^n, R^n)$. Function $f$ maps the source and target node index tuple $(s, t)$ to the "edge" index $i = f(s, t)$ in $G^e$; the reverse mapping is $f^{-1}$. In the line graph $G^e$, a directed edge $r^e_{ij} \in R^e$ exists from node $v^e_i$ to $v^e_j$ iff the target node of edge $r^n_{f^{-1}(i)}$ and the source node of edge $r^n_{f^{-1}(j)}$ in $G^n$ are exactly the same node. Intuitively, $r^e_{ij}$ captures the information flow in the meta-path $r^n_{f^{-1}(i)} \circ r^n_{f^{-1}(j)}$. We prevent backtracking cases, in which two reverse edges are not connected in $G^e$, as illustrated in Figure 2.
We only utilize local relations in $R^n$ as the node set $V^e$ to avoid creating too many nodes in the line graph $G^e$. Symmetrically, each edge in $R^e$ can be uniquely identified by a node in $V^n$. For example, in the upper right part of Figure 2, the edge between nodes "e1" and "e2" in the line graph can be represented by the middle node with double solid borderlines in the original graph.
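The construction above can be sketched in a few lines of plain Python. This is an illustrative implementation under the assumption that edges are given as `(source, target, relation)` triples; the released code builds the graphs with DGL instead, and the function and variable names here are hypothetical.

```python
def build_line_graph(edges):
    """Construct the edge list of the line graph G_e from the directed
    edge list of a node-centric graph G_n.

    edges: list of (src, tgt, rel) triples; the index of each triple is
    its vertex id in the line graph. The backtracking constraint from
    the paper is enforced: an edge is never connected to its own
    reverse edge in G_e.
    """
    line_edges = []
    for i, (s1, t1, _) in enumerate(edges):
        for j, (s2, t2, _) in enumerate(edges):
            if i == j or t1 != s2:
                continue  # edge j must start where edge i ends
            if s1 == t2:
                continue  # skip backtracking over a reverse-edge pair
            line_edges.append((i, j))
    return line_edges
```

For example, with edges `(0, 1)`, `(1, 0)`, and `(1, 2)`, only the pair `(0, 1) -> (1, 2)` is connected in the line graph; `(0, 1) -> (1, 0)` is dropped as backtracking.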

Method
After constructing the line graph, we utilize the classic encoder-decoder architecture (Sutskever et al., 2014; Bahdanau et al., 2015) as the backbone of our model. LGESQL consists of three parts: a graph input module, a line graph enhanced hidden module, and a graph output module (see Figure 3 for an overview). The first two modules aim to map the input heterogeneous graph $G^n$ into node embeddings $X \in \mathbb{R}^{|V^n| \times d}$, where $d$ is the graph hidden size. The graph output module retrieves and transforms $X$ into the target SQL query $y$.

Graph Input Module
This module aims to provide the initial embeddings for both nodes and edges. Initial local edge features $Z^0 \in \mathbb{R}^{|V^e| \times d}$ and non-local edge features $Z^{nlc} \in \mathbb{R}^{(|R^n| - |V^e|) \times d}$ are directly retrieved from a parameter matrix. For nodes, we can obtain their representations from either GLOVE word vectors (Pennington et al., 2014) or a pre-trained language model (PLM) such as BERT (Devlin et al., 2019).
GLOVE Each word $q_i$ in the question $Q$ or schema item $t_i \in T$ or $c^{t_i}_j \in C$ can be initialized by looking up the embedding dictionary without considering the context. Then, these vectors are passed into three type-aware bidirectional LSTMs (BiLSTM, Hochreiter and Schmidhuber, 1997) respectively to attain contextual information. We concatenate the forward and backward hidden states for each question word $q_i$ as the graph input $x^0_{q_i}$. As for table $t_i$, after feeding $(t_{i0}, t_{i1}, t_{i2}, \cdots)$ into the BiLSTM (with special type $t_{i0} = $ "table", $\forall i$), we concatenate the last hidden states in both directions as the graph input $x^0_{t_i}$ (similarly for column $c^{t_i}_j$). These node representations are stacked together to form the initial node embedding matrix $X^0 \in \mathbb{R}^{|V^n| \times d}$.
PLM Firstly, we flatten all question words and schema items into a sequence, where columns belonging to the same table are clustered together: [CLS] $q_1 q_2 \cdots q_{|Q|}$ [SEP] $t_{10}\, t_1\, c^{t_1}_{10}\, c^{t_1}_1\, c^{t_1}_{20}\, c^{t_1}_2 \cdots t_{20}\, t_2\, c^{t_2}_{10}\, c^{t_2}_1\, c^{t_2}_{20}\, c^{t_2}_2 \cdots$ [SEP]. The type information $t_{i0}$ or $c^{t_i}_{j0}$ is inserted before each schema item. Since each word $w$ is tokenized into sub-words, we append a subword attentive pooling layer after the PLM to obtain word-level representations. Concretely, given the output sequence of subword features $w^s_1, w^s_2, \cdots, w^s_{|w|}$ for each subword $w^s_i$ in $w$, the word-level representation $w$ is
$$a_i = \mathrm{softmax}_i\big(v_s \tanh(W_s w^s_i)\big), \qquad w = \sum_{i=1}^{|w|} a_i\, w^s_i,$$
where $v_s$ and $W_s$ are trainable parameters. After obtaining the word vectors, we also feed them into three BiLSTMs according to the node types and get the graph inputs $X^0$ for all nodes.
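The subword attentive pooling step can be sketched as follows. This is a minimal NumPy version assuming the tanh-then-project scoring form above; the paper only names the trainable parameters $v_s$ and $W_s$, so the exact scoring function is an assumption.

```python
import numpy as np

def subword_attentive_pooling(subwords, W_s, v_s):
    """Collapse a (|w|, d) matrix of subword features into one
    word-level vector of shape (d,) via attentive pooling."""
    scores = np.tanh(subwords @ W_s) @ v_s   # one logit per subword
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ subwords                # attention-weighted sum
```

A sanity check on the design: with zero parameters every subword receives the same weight, so pooling reduces to a plain average of the subword features.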

Line Graph Enhanced Hidden Module
It contains a stack of $L$ dual relational graph attention network (Dual RGAT) layers. In each layer $l$, two RGATs (Wang et al., 2020b) capture the structure of the original graph and the line graph, respectively. Node embeddings in one graph play the role of edge features in the other graph. For example, the edge features used in graph $G^n$ are provided by the node embeddings in graph $G^e$. We use $X^l \in \mathbb{R}^{|V^n| \times d}$ to denote the input node embedding matrix of graph $G^n$ in the $l$-th layer, $l \in \{0, 1, \cdots, L-1\}$, and $x^l_i$ for each specific node $v^n_i \in V^n$. Similarly, matrix $Z^l \in \mathbb{R}^{|V^e| \times d}$ and vector $z^l_i$ denote node embeddings in the line graph. Following RATSQL (Wang et al., 2020a), we use multi-head scaled dot-product attention (Vaswani et al., 2017) to calculate the attention weights. For brevity, we formulate the entire computation in one layer as two basic modules:
$$X^{l+1} = \mathrm{RGAT}^n\big(X^l, Z^l, Z^{nlc}, G^n\big), \qquad Z^{l+1} = \mathrm{RGAT}^e\big(Z^l, X^l, G^e\big),$$
where $Z^{nlc}$ is the aforementioned matrix of non-local edge features in the original graph $G^n$.

RGAT for the Original Graph
Given the node-centric graph $G^n$, the output representation $x^{l+1}_i$ of the $l$-th layer is computed by
$$\tilde{\alpha}^h_{ji} = \frac{\big(x^l_i W^h_q\big)\big(x^l_j W^h_k + [\psi(r^n_{ji})]^h_H\big)^{\mathrm{T}}}{\sqrt{d / H}}, \qquad \alpha^h_{ji} = \mathrm{softmax}_{j \in N^n_i}\, \tilde{\alpha}^h_{ji},$$
$$\tilde{x}_i = \Big\Vert_{h=1}^{H} \sum_{j \in N^n_i} \alpha^h_{ji} \big(x^l_j W^h_v + [\psi(r^n_{ji})]^h_H\big),$$
$$x^{l+1}_i = \mathrm{LayerNorm}\big(x^l_i + \tilde{x}_i W_o\big), \qquad x^{l+1}_i = \mathrm{LayerNorm}\big(x^{l+1}_i + \mathrm{FFN}(x^{l+1}_i)\big),$$
where $\Vert$ represents vector concatenation, the matrices $W^h_q, W^h_k, W^h_v, W_o$ are trainable parameters, $H$ is the number of heads, and $\mathrm{FFN}(\cdot)$ denotes a feedforward neural network. $N^n_i$ represents the receptive field of node $v^n_i$ and function $\psi(r^n_{ji})$ returns a $d$-dim feature vector of relation $r^n_{ji}$. Operator $[\cdot]^h_H$ first evenly splits the vector into $H$ parts and returns the $h$-th partition. Since there are two genres of relations (local and non-local), we design two schemes to integrate them:

Mixed Static and Dynamic Embeddings If $r^n_{ji}$ is a local relation, $\psi(r^n_{ji})$ returns the node embedding $z^l_{f(j,i)}$ from the line graph. Otherwise, $\psi(r^n_{ji})$ directly retrieves the vector from the non-local embedding matrix $Z^{nlc}$, see Figure 4. The neighborhood function $N^n_i$ for node $v^n_i$ returns the entire node set $V^n$ and is shared across different heads.

Multi-head Multi-view Concatenation An alternative is to split the multi-head attention module into two parts. In half of the heads, the neighborhood function $N^n_i$ of node $v^n_i$ only contains nodes that are reachable within 1 hop. In this case, $\psi(r^n_{ji})$ returns the layer-wise updated feature $z^l_{f(j,i)}$ from $Z^l$. In the other heads, each node has access to both local and non-local neighbors, and $\psi(\cdot)$ always returns static entries in the embedding matrix $Z^{nlc} \cup Z^0$, see Figure 5 for illustration.

In either scheme, the RGAT module treats local and non-local relations differently and manipulates the local edge features relatively more carefully.
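The core of the Mixed Static and Dynamic Embeddings scheme is the relation feature lookup $\psi(\cdot)$. The sketch below illustrates that branching logic in plain Python; the container types and names (`f`, `Z_line`, `Z_nlc`) are hypothetical stand-ins for the line-graph node embeddings and the static non-local embedding matrix, not the released implementation.

```python
def psi(rel, src, tgt, f, Z_line, Z_nlc, local_relations):
    """Mixed static and dynamic edge features.

    rel:             the relation type between nodes src and tgt
    f:               mapping (src, tgt) -> line-graph vertex index
    Z_line:          dynamic line-graph node embeddings (updated per layer)
    Z_nlc:           static embeddings for non-local (meta-path) relations
    local_relations: the set of 1-hop relation types
    """
    if rel in local_relations:
        # local relation: dynamic feature from the dual line graph
        return Z_line[f[(src, tgt)]]
    # non-local relation: static parameter lookup
    return Z_nlc[rel]
```

In the Multi-head Multi-view Concatenation scheme, the same lookup would instead be chosen per attention head rather than per relation type.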

RGAT for the Line Graph
Symmetrically, given the edge-centric graph $G^e$, the updated node representation $z^{l+1}_i$ from $z^l_i$ is calculated similarly with minor modifications:
$$\tilde{\beta}^h_{ji} = \frac{\big(z^l_i W^h_q + [\phi(r^e_{ji})]^h_H\big)\big(z^l_j W^h_k\big)^{\mathrm{T}}}{\sqrt{d / H}}, \qquad \beta^h_{ji} = \mathrm{softmax}_{j \in N^e_i}\, \tilde{\beta}^h_{ji},$$
$$\tilde{z}_i = \Big\Vert_{h=1}^{H} \sum_{j \in N^e_i} \beta^h_{ji}\, z^l_j W^h_v, \qquad z^{l+1}_i = \mathrm{LayerNorm}\big(z^l_i + \tilde{z}_i W_o\big),$$
followed by the same FFN sub-layer. Here $\phi(r^e_{ji})$ returns the feature vector of relation $r^e_{ji}$ in $G^e$. Since we only consider local relations in the line graph, $N^e_i$ only includes 1-hop neighbors, and $\phi(r^e_{ji})$ equals the source node embedding in $X^l$ of edge $v^e_i$. Note that the relational feature is added on the "query" side instead of the "key" side when computing the attention logits $\tilde{\beta}^h_{ji}$, because it is irrelevant to the incoming edges. For example, in Figure 3, the connecting nodes of the two edge pairs (1-4, 4-5) and (2-4, 4-5) are the same node. The output matrices of the final layer $L$ are the desired outputs of the encoder: $X = X^L$, $Z = Z^L$.

Graph Output Module
This module includes two components: a decoder for the main text-to-SQL task and a module that performs an auxiliary task called graph pruning. We use subscripts to denote the collection of node embeddings of a specific type, e.g., $X_q$ is the matrix of all question node embeddings.

Text-to-SQL Decoder
We adopt the grammar-based syntactic neural decoder (Yin and Neubig, 2017) to generate the abstract syntax tree (AST) of the target query $y$ in depth-first-search order. The output at each decoding timestep is either 1) an APPLYRULE action that expands the current non-terminal node in the partially generated AST, or 2) a SELECTTABLE or SELECTCOLUMN action that chooses one schema item $x_{s_i}$ from the encoded memory $X_s = X_t \cup X_c$. Mathematically, $P(y|X) = \prod_j P(a_j \mid a_{<j}, X)$, where $a_j$ is the action at the $j$-th timestep. For more implementation details, see Appendix B.
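The interplay between APPLYRULE and schema-selection actions can be illustrated with a toy replay of an action sequence. The grammar below is a deliberately tiny, hypothetical stand-in for the ASDL grammar in Appendix B; it only shows how a depth-first stack turns actions into an AST traversal.

```python
# Toy ASDL-style grammar; rule and leaf names are illustrative only.
RULES = {
    "sql": ["select", "from"],
    "select": ["SELECTCOLUMN"],
    "from": ["SELECTTABLE"],
}

def run_actions(actions):
    """Replay a depth-first action sequence into (frontier, action) pairs.

    APPLYRULE expands the current non-terminal; any other action is
    treated as the chosen schema item for a SELECT* frontier node.
    """
    stack, trace = ["sql"], []
    for act in actions:
        frontier = stack.pop()
        if act == "APPLYRULE":
            trace.append(("APPLYRULE", frontier))
            # push children reversed so they pop in depth-first order
            stack.extend(reversed(RULES[frontier]))
        else:
            trace.append((frontier, act))
    return trace
```

Replaying `["APPLYRULE", "APPLYRULE", "singer.name", "APPLYRULE", "singer"]` expands `sql`, then `select`, fills the column, then expands `from` and fills the table, mirroring the depth-first-search decoding order.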

Graph Pruning
We hypothesize that a powerful encoder should distinguish irrelevant schema items from the golden schema items used in the target query. In Figure 6, the question-oriented schema sub-graph (above the shaded region) can be easily extracted. The intent c2 and the constraint c5 are usually explicitly mentioned in the question and identified by the dot-product attention mechanism or schema linking. The linking nodes such as t1, c3, c4, t2 can be inferred from the 1-hop connections of the schema graph to form a connected component. To introduce this inductive bias, we design an auxiliary task that aims to classify each schema node $s_i \in S = T \cup C$ based on its relevance to the question and the sparse structure of the schema graph. Firstly, we compute the context vector $\tilde{x}_{s_i}$ from the question node embeddings $X_q$ for each schema node $s_i$ via multi-head attention:
$$\tilde{x}_{s_i} = \mathrm{MultiHeadAttn}\big(x_{s_i}, X_q, X_q\big)\, W_{so},$$
where the per-head projections $W^h_{sq}, W^h_{sk}, W^h_{sv} \in \mathbb{R}^{d \times d/H}$ and the output matrix $W_{so} \in \mathbb{R}^{d \times d}$ are network parameters. Then, a biaffine (Dozat and Manning, 2017) binary classifier is used to determine whether the compressed context vector $\tilde{x}_{s_i}$ and the schema node embedding $x_{s_i}$ are correlated.
The ground truth label $y^g_{s_i}$ of a schema item is 1 iff $s_i$ appears in the target SQL query. The training objective can be formulated as the binary cross-entropy loss
$$\mathcal{L}_{gp} = -\sum_{s_i \in S} \Big[ y^g_{s_i} \log P(y_{s_i} = 1) + \big(1 - y^g_{s_i}\big) \log\big(1 - P(y_{s_i} = 1)\big) \Big].$$
This auxiliary task is combined with the main text-to-SQL task in a multitasking way. Similar ideas (Bogin et al., 2019b; Yu et al., 2020) and other association schemes are discussed in Appendix C.
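The graph pruning objective can be sketched numerically as follows. This is a simplified NumPy version: the biaffine score is reduced to a single bilinear term (the full classifier also carries linear and bias terms), so the parameterization is an assumption for illustration.

```python
import numpy as np

def graph_pruning_loss(ctx, nodes, U, labels):
    """Binary cross-entropy over bilinear relevance scores.

    ctx:    (n, d) question-aware context vectors, one per schema node
    nodes:  (n, d) schema node embeddings
    U:      (d, d) bilinear weight (linear/bias terms omitted here)
    labels: (n,) gold 0/1 relevance of each schema item
    """
    logits = np.einsum("nd,de,ne->n", ctx, U, nodes)  # x~ U x^T per node
    probs = 1.0 / (1.0 + np.exp(-logits))             # sigmoid
    eps = 1e-9                                        # numerical safety
    return -np.mean(labels * np.log(probs + eps)
                    + (1 - labels) * np.log(1 - probs + eps))
```

With an untrained (all-zero) weight matrix every score is 0, every probability is 0.5, and the loss equals log 2, a useful sanity check when wiring the auxiliary head into the multitask objective.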

Experiments
In this section, we evaluate our LGESQL model in different settings. Our code is publicly available.

Experiment Setup
Dataset Spider (Yu et al., 2018b) is a large-scale cross-domain zero-shot text-to-SQL benchmark. It contains 8659 training examples across 146 databases in total, and covers several domains from other datasets such as the Restaurants (Popescu et al., 2003), GeoQuery (Zelle and Mooney, 1996), Scholar (Iyer et al., 2017), Academic (Li and Jagadish, 2014), and Yelp and IMDB (Yaghmazadeh et al., 2017) datasets. The detailed statistics are shown in Table 1. We follow the common practice of reporting the exact set match accuracy on the validation and test datasets. The test dataset contains 2147 samples with 40 unseen databases but is not publicly available; we submitted our model to the organizer of the challenge for evaluation.

Implementations We preprocess the questions, table names, and column names with the toolkit Stanza (Qi et al., 2020) for tokenization and lemmatization. Our model is implemented with PyTorch (Paszke et al., 2019), and the original and line graphs are constructed with the library DGL (Wang et al., 2019a). Within the encoder, we use GLOVE (Pennington et al., 2014) word embeddings with dimension 300 or pre-trained language models (PLMs), BERT (Devlin et al., 2019) or ELECTRA (Clark et al., 2020), to leverage contextual information. With GLOVE, embeddings of the most frequent 50 words in the training set are fixed during training while the remaining ones are fine-tuned. The schema linking strategy is borrowed from RATSQL (Wang et al., 2020a), which is also our baseline system. During evaluation, we adopt beam search decoding with beam size 5.
Hyper-parameters In the encoder, the GNN hidden size $d$ is set to 256 for GLOVE and 512 for PLMs. The number of GNN layers $L$ is 8. In the decoder, the dimensions of the hidden state, action embedding, and node type embedding are set to 512, 128, and 128 respectively. The recurrent dropout rate (Gal and Ghahramani, 2016) is 0.2 for the decoder LSTM. The number of heads in multi-head attention is 8 and the feature dropout rate is set to 0.2 in both the encoder and decoder. Throughout the experiments, we use the AdamW (Loshchilov and Hutter, 2019) optimizer with a linear warmup scheduler; the warmup ratio of total training steps is 0.1. For GLOVE, the learning rate is 5e-4 and the weight decay coefficient is 1e-4; for PLMs, we use a smaller learning rate of 2e-5 (base) or 1e-5 (large) and a larger weight decay rate of 0.1. The optimization of the PLM encoder is carried out more carefully with a layer-wise learning rate decay coefficient of 0.8. The batch size is 20 and the maximum gradient norm is 5. The number of training epochs is 100 for GLOVE and 200 for PLMs.
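The layer-wise learning rate decay for the PLM encoder can be sketched as below. This follows the common recipe (top layer at the base rate, each lower layer scaled by the decay coefficient once more); how the released code groups embedding and non-layer parameters may differ.

```python
def layerwise_learning_rates(base_lr, num_layers, decay=0.8):
    """Per-layer learning rates for a PLM encoder.

    The top transformer layer (index num_layers - 1) trains at base_lr;
    each layer below is multiplied by `decay` one more time, so lower
    layers, which hold more general features, move more slowly.
    """
    return [base_lr * decay ** (num_layers - 1 - depth)
            for depth in range(num_layers)]
```

For a 12-layer base PLM at 2e-5, the top layer keeps 2e-5 while the next one down trains at 1.6e-5, and so on geometrically toward the bottom.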
Taking one step further, we compare the more fine-grained performance of our model to the baseline system RATSQL (Wang et al., 2020a), classified by level of difficulty, in Table 3. We observe that LGESQL surpasses RATSQL across all subdivisions in both the validation and test datasets regardless of the application of a PLM, especially at the Medium and Extra Hard levels. This validates the superiority of our model in exploiting the structural relations among edges in the line graph.

Ablation Studies
In this section, we investigate the contribution of each design choice. We report the average accuracy on the validation dataset over 5 random seeds. RGATSQL is our baseline system, in which the line graph is not utilized; it can be viewed as a variant of RATSQL with our tailored grammar-based decoder. From Table 4, we can discover that: 1) if non-local relations or meta-paths are removed (w/o NLC), the performance decreases by roughly 2 points in LGESQL, and by 3 points in RGATSQL. However, our LGESQL with merely local relations is still competitive. This consolidates our motivation that, by exploiting the structure among edges, the line graph can capture long-range relations to some extent. 2) The graph pruning task contributes more in LGESQL (+1.2%) than in RGATSQL (+0.7%), on account of the fact that local relations are more critical to structural inference. 3) The two strategies of combining local and non-local relations introduced in § 3.2.1 (w/ MSDE or MMC) are both beneficial to the eventual performance of LGESQL (2.0% and 2.1% gains, respectively). This corroborates the assumption that local and non-local relations should be treated with distinction. However, the performance remains unchanged in RGATSQL when merging a different view of the graph (w/ MMC) into multi-head attention, which may be caused by the over-smoothing problem of a complete graph.

Next, we analyze the effects of different pre-trained language models in Table 5. From the overall results, we can see that: 1) by involving the line graph in the computation, LGESQL outperforms the baseline model RGATSQL with different PLMs, further demonstrating the effectiveness of explicitly modeling edge features. 2) Large-series PLMs consistently perform better than base models on account of their model capacity and generalization capability to unseen domains. 3) Task-adaptive PLMs, especially ELECTRA, are superior to vanilla BERT irrespective of the upper GNN architecture.
We hypothesize the reason is that ELECTRA is pre-trained with a tailored binary classification task, which aims to individually distinguish whether each input word is substituted given the context. Essentially, this self-supervised task is similar to our proposed graph pruning task, which focuses on enhancing the discriminative capability of the encoder.

Case Studies
In Figure 7, we compare the SQL queries generated by our LGESQL model with those created by the baseline model RGATSQL. We notice that LGESQL performs better than the baseline system, especially on examples that involve JOIN operations over multiple tables. For instance, in the second case, which requires joining three tables, RGATSQL fails to identify the existence of the table flights. Thus, it is unable to predict the WHERE condition about the destination city and performs redundant work. In the third case, our LGESQL still successfully constructs a connected schema sub-graph by linking table "template" to "documents". Unfortunately, the RGATSQL model neglects the occurrence of "documents" again. However, in the last case, our LGESQL erroneously introduces an unnecessary table "airports". It ignores the fact that table "flights" has one column "source airport" which already satisfies the requirement.

Related Work
Encoding Problem for Text-to-SQL To tackle the joint encoding problem of the question and database schema, Xu et al. (2017) propose a "column attention" strategy to gather information from columns for each question word. TypeSQL (Yu et al., 2018a) incorporates prior knowledge of column types and schema linking as additional input features. Bogin et al. (2019a), among others, deal with the graph structure of the database schema via GNNs. EditSQL (Zhang et al., 2019b) considers "co-attention" between question words and database schema nodes, similar to the common practice in text matching (Chen et al., 2017). BRIDGE (Lin et al., 2020) further leverages the database content to augment the column representations. The most advanced method, RATSQL (Wang et al., 2020a), utilizes a complete relational graph attention neural network to handle various pre-defined relations. In this work, we further consider both local and non-local, dynamic and static edge features among different types of nodes with a line graph.
Heterogeneous Graph Neural Network Apart from its structural topology, a heterogeneous graph (Shi et al., 2016) also contains multiple types of nodes and edges. To address the heterogeneity of node attributes, prior work designs type-based content encoders, and Fu et al. (2020) utilize a type-specific linear transformation. For edges, the relational graph convolution network (RGCN, Schlichtkrull et al., 2018) and relational graph attention network (RGAT, Wang et al., 2020b) have been proposed to parameterize different relations. HAN (Wang et al., 2019b) converts the original heterogeneous graph into multiple homogeneous graphs and applies a hierarchical attention mechanism to the meta-path-based sub-graphs. Similar ideas have been adopted in dialogue state tracking (Chen et al., 2020b, 2019a), dialogue policy learning (Chen et al., 2018), and text matching (Chen et al., 2020c; Lyu et al., 2021) to handle heterogeneous inputs. In another branch, Chen et al. (2019b), Zhu et al. (2019), and Zhao et al. (2020) construct the line graph of the original graph and explicitly model the computation over edge features. In this work, we borrow the idea of a line graph and update both node and edge features via iteration over the dual graphs.

Conclusion
In this work, we utilize the line graph to update the edge features in the heterogeneous graph for the text-to-SQL task. Through the iteration over the structural connections in the line graph, local edges can incorporate multi-hop relational features and capture significant meta-paths. By further integrating non-local relations, the encoder can learn from multiple views and attend to remote nodes with shortcuts. In the future, we will investigate more useful meta-paths and explore more effective methods to deal with different meta-path-based neighbors.

A Local and Non-Local Relations
In this work, meta-paths with length 1 are local relations, and all other meta-paths are non-local relations. Specifically, Table 6 provides the list of all local relations according to the types of the source and target nodes. Notice that we preserve the NOMATCH relation because, in some cases, there is no overlap between the entire question and any schema item. This relaxation dramatically increases the number of edges in the line graph. To resolve it, we remove edges in the line graph whose source and target nodes both represent MATCH-series relation types. In other words, we prevent information from propagating between these bipartite connections during the iteration of the line graph.
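The MATCH-MATCH edge removal described above amounts to a simple filter over the line-graph edge list. The sketch below assumes the representation from the earlier construction (edge indices mapped to relation-type strings); the names are illustrative, not the released code.

```python
def prune_match_match_edges(line_edges, edge_rel, match_series):
    """Drop line-graph connections whose two endpoint edges both carry a
    MATCH-series relation, so no messages flow across these dense
    bipartite links during line-graph iteration.

    line_edges:   list of (i, j) line-graph edges over edge indices
    edge_rel:     mapping edge index -> relation type string
    match_series: set of MATCH-series relation type names
    """
    return [(i, j) for i, j in line_edges
            if not (edge_rel[i] in match_series
                    and edge_rel[j] in match_series)]
```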
The checklist in Table 6 is only a subset of all relations defined in RATSQL (Wang et al., 2020a). We treat the remaining relations as non-local relations for a fair comparison to the baseline system RATSQL.

B.1 ASDL Grammar
The complete grammar used to translate the SQL into a series of actions is provided in Figure 8.
Here are some criteria we followed when designing the abstract syntax description language (ASDL, Wang et al., 1997) for the target SQL queries:
1. Keep the length of the action sequence short to prevent the long-term forgetting problem in the auto-regressive decoder. To achieve this goal, we remove the optional operator "?" defined in Wang et al. (1997) and extend the number of constructors by enumeration. For example, we expand all solutions of type sql unit according to the existence of different clauses.
2. Hierarchically group and re-use the same type in a top-down manner for parameter sharing. For example, we use the same type col unit when choosing columns in different clauses, and create the type val unit such that both the SELECT clause and CONDITION clauses can refer to it.
3. When generating a list of items of the same type, instead of emitting a special action REDUCE as the symbol of termination (Yin and Neubig, 2017), we enumerate all possible numbers of occurrences in the training set (see the constructors for types select and from in Figure 8). Then, we generate each item under this quantitative limitation. Preliminary experimental results show that deciding the quantity in advance is better than a lazy termination decision.
Our grammar can cover 98.7% and 98.2% cases in the training and validation dataset, respectively.

B.2 Decoder Architecture
Given the encoded memory $X = [X_q; X_t; X_c] \in \mathbb{R}^{|V^n| \times d}$, where $|V^n| = |Q| + |T| + |C|$, the goal of the text-to-SQL decoder is to produce a sequence of actions which constructs the corresponding AST of the target SQL query. In our experiments, we utilize a single-layer ordered neurons LSTM (ON-LSTM, Shen et al., 2019) as the auto-regressive decoder. Firstly, we initialize the decoder state $h_0$ via attentive pooling over the memory $X$:
$$a = \mathrm{softmax}\big(\tanh(X W_0)\, v_0^{\mathrm{T}}\big), \qquad h_0 = \tanh\Big(W_1 \sum_i a_i x_i\Big),$$
where $v_0$ is a trainable row vector and $W_0, W_1$ are parameter matrices. Then, in the structured ON-LSTM decoder, the hidden state at each timestep $j$ is updated as
$$h_j, m_j = \text{ON-LSTM}\big([a_{j-1};\, a^p_j;\, h^p_j;\, n_j],\, h_{j-1},\, m_{j-1}\big),$$
where $m_j$ is the cell state of the $j$-th timestep, $a_{j-1}$ is the embedding of the previous action, $a^p_j$ is the embedding of the parent action, $h^p_j$ is the parent hidden state, and $n_j$ denotes the type embedding of the current frontier node. Given the current decoder state $h_j$, we adopt a multi-head attention (8 heads) mechanism to calculate the context vector $\tilde{h}_j$ over $X$. This context vector is concatenated with $h_j$ and passed into a 2-layer MLP with a tanh activation unit to obtain the attention vector $h^{att}_j$. The dimension of $h^{att}_j$ is 512. For the APPLYRULE action, the probability distribution is computed by a softmax classification layer:
$$P\big(a_j = \text{APPLYRULE}[R] \mid a_{<j}, X\big) = \mathrm{softmax}_R\big(W_r h^{att}_j\big).$$
For the SELECTTABLE action, we directly copy the table $t_i$ from the encoded memory $X_t$. To be consistent, we also apply the multi-head attention mechanism here with $H = 8$ heads. The calculation of the SELECTCOLUMN action is similar, with different network parameters.

Table 6: Local relations between a source node x and a target node y.

x type | y type | Relation     | Description
Q      | Q      |              | y is the next word of x.
C      | C      | FOREIGNKEY   | y is the foreign key of x.
T      | C      | HAS          | The column y belongs to the table x.
T      | C      | PRIMARYKEY   | The column y is the primary key of the table x.
Q      | T      | NOMATCH      | No overlapping between x and y.
Q      | T      | PARTIALMATCH | x is part of y, but the entire question does not contain y.
Q      | T      | EXACTMATCH   | x is part of y, and y is a span of the entire question.
Q      | C      | NOMATCH      | No overlapping between x and y.
Q      | C      | PARTIALMATCH | x is part of y, but the entire question does not contain y.
Q      | C      | EXACTMATCH   | x is part of y, and y is a span of the entire question.
Q      | C      | VALUEMATCH   | x is part of the candidate cell values of column y.

C Graph Pruning
Similar ideas have been proposed by Bogin et al. (2019b) and Yu et al. (2020). Our proposed task differs from their methods in two aspects: Prediction target Yu et al. (2020) devises several syntactic roles for schema items and performs multi-class classification instead of binary discrimination. Based on our assumption, the encoder is responsible for the discrimination capability while the decoder organizes different schema items and components into a complete semantic frame. Thus, we simplify the training target into binary labels.
Combination method Bogin et al. (2019b) utilize another RGCN to calculate a relevance score for each schema item in Global-GNNSQL. This score is incorporated into the encoder RGCN as a soft input coefficient. Different from this cascaded method, graph pruning is employed in a multitasking manner. We tried different approaches to combining this auxiliary module with the primary text-to-SQL model in our preliminary experiments, such as: 1) Similar to Bogin et al. (2019b), we utilize a separate graph encoder to conduct graph pruning first, and use another refined graph encoder (of the same architecture, e.g., RGAT) to jointly encode the pruned schema graph and the question. These two encoders can share the network parameters of only the embeddings or of more upper GNN layers. If they share all 8 layers, the entire encoder degenerates from the pipelined mode into our multitasking fashion. Empirical results in Table 7 demonstrate that when these two encoders share more layers, the performance of the text-to-SQL model is better.

Table 7: Variation of performances when gradually increasing the number of layers shared between the pruning and the main encoders.
2) We can constrain the text-to-SQL decoder to only attend to and retrieve schema items from the pruned encoded memory when calculating attention vectors and selecting columns or tables. In other words, the graph pruning module and the text-to-SQL decoder are connected in a cascaded way. Through pilot experiments, we observe a flagrant training-inference inconsistency problem: the text-to-SQL decoder is trained upon the golden