SmBoP: Semi-autoregressive Bottom-up Semantic Parsing

The de-facto standard decoding method for semantic parsing in recent years has been to autoregressively decode the abstract syntax tree of the target program using a top-down depth-first traversal. In this work, we propose an alternative approach: a Semi-autoregressive Bottom-up Parser (SmBoP) that constructs at decoding step t the top-K sub-trees of height ≤ t. Our parser enjoys several benefits compared to top-down autoregressive parsing. From an efficiency perspective, bottom-up parsing allows decoding all sub-trees of a certain height in parallel, leading to logarithmic runtime complexity rather than linear. From a modeling perspective, a bottom-up parser learns representations for meaningful semantic sub-programs at each step, rather than for semantically-vacuous partial trees. We apply SmBoP on Spider, a challenging zero-shot semantic parsing benchmark, and show that SmBoP leads to a 2.2x speed-up in decoding time and a ~5x speed-up in training time, compared to a semantic parser that uses autoregressive decoding. SmBoP obtains 71.1 denotation accuracy on Spider, establishing a new state-of-the-art, and 69.5 exact match, comparable to the 69.6 exact match of the autoregressive RAT-SQL+Grappa.


Introduction
Semantic parsing, the task of mapping natural language utterances into programs (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Clarke et al., 2010; Liang et al., 2011), has converged in recent years on a standard encoder-decoder architecture. Recently, meaningful advances emerged on the encoder side, including developments in Transformer-based architectures (Wang et al., 2020a) and new pretraining techniques (Yin et al., 2020; Herzig et al., 2020; Yu et al., 2020; Deng et al., 2020; Shi et al., 2021). Conversely, the decoder has remained roughly constant for years: the abstract syntax tree of the target program is autoregressively decoded in a top-down manner (Yin and Neubig, 2017; Krishnamurthy et al., 2017; Rabinovich et al., 2017).
Bottom-up decoding in semantic parsing has received little attention (Cheng et al., 2019; Odena et al., 2020). In this work, we propose a bottom-up semantic parser, and demonstrate that, equipped with recent developments in Transformer-based architectures (Vaswani et al., 2017), it offers several advantages. From an efficiency perspective, bottom-up parsing can naturally be done semi-autoregressively: at each decoding step t, the parser generates in parallel the top-K program sub-trees of height ≤ t (akin to beam search). This leads to runtime complexity that is logarithmic in the tree size, rather than linear, contributing to the growing interest in efficient and greener artificial intelligence technologies (Schwartz et al., 2020). From a modeling perspective, neural bottom-up parsing provides learned representations for meaningful (and executable) sub-programs, which are sub-trees computed during the search procedure, in contrast to top-down parsing, where hidden states represent partial trees without clear semantics. Figure 1 illustrates a single decoding step of our parser. Given a beam $Z_t$ with K = 4 trees of height t (blue vectors), we use cross-attention to contextualize the trees with information from the input question (orange). Then, we score the frontier, that is, the set of all trees of height t + 1 that can be constructed using a grammar from the current beam, and the top-K trees are kept (purple). Last, a representation for each of the new K trees is generated and placed in the new beam $Z_{t+1}$. After T decoding steps, the parser returns the highest-scoring tree in $Z_T$ that corresponds to a full program. Because we have gold trees at training time, the entire model is trained jointly using maximum likelihood.

[Figure 1: An overview of the decoding procedure of SMBOP on the question "What are the names of actors over 60?". $Z_t$ is the beam at step t, $\bar{Z}_t$ is the contextualized beam after cross-attention, $\mathcal{F}_{t+1}$ is the frontier (κ, σ, ≥ are logical operations applied on trees, as explained below), $\hat{\mathcal{F}}_{t+1}$ is the pruned frontier, and $Z_{t+1}$ is the new beam. At the top we see the new trees created in this step. For t = 0 (depicted here), the beam contains the predicted schema constants and DB values.]

We evaluate our model, SMBOP (SeMi-autoregressive Bottom-up semantic Parser), on SPIDER (Yu et al., 2018), a challenging zero-shot text-to-SQL dataset. We implement the RAT-SQL+GRAPPA encoder (Yu et al., 2020), currently the best model on SPIDER, and replace the autoregressive decoder with the semi-autoregressive SMBOP. SMBOP obtains an exact match accuracy of 69.5, comparable to the autoregressive RAT-SQL+GRAPPA at 69.6 exact match, and to the current state-of-the-art at 69.8 exact match (Zhao et al., 2021), which applies additional pretraining. Moreover, SMBOP substantially improves state-of-the-art in denotation accuracy, improving performance from 68.3 → 71.1. Importantly, compared to autoregressive semantic parsing, we observe an average speed-up of 2.2x in decoding time, where for long SQL queries the speed-up is between 5x-6x, and a training speed-up of ∼5x.

Background
Problem definition  We focus in this work on text-to-SQL semantic parsing. Given a training set $\{(x^{(i)}, y^{(i)}, S^{(i)})\}_{i=1}^{N}$, where $x^{(i)}$ is an utterance, $y^{(i)}$ is its translation to a SQL query, and $S^{(i)}$ is the schema of the target database (DB), our goal is to learn a model that maps new question-schema pairs $(x, S)$ to the correct SQL query $y$. A DB schema $S$ includes: (a) a set of tables, (b) a set of columns for each table, and (c) a set of foreign key-primary key column pairs describing relations between table columns. Schema tables and columns are termed schema constants, and denoted by $\mathcal{S}$. (Our code is available at https://github.com/OhadRubin/SmBop.)
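To make the setup concrete, a single training triple (x, y, S) might look as follows; this is a toy example of our own in the spirit of Figure 1, not the data format of the released code.

```python
# A toy training triple (x, y, S), our own illustration in the spirit of
# Figure 1; the released code may use a different data format.
example = {
    "x": "What are the names of actors over 60?",       # utterance
    "y": "SELECT name FROM actor WHERE age >= 60",       # gold SQL query
    "S": {                                               # DB schema
        "tables": ["actor"],
        "columns": {"actor": ["name", "age"]},
        "foreign_primary_key_pairs": [],                 # (foreign, primary) column pairs
    },
}
```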

RAT-SQL encoder
This work is focused on decoding, and thus we implement the state-of-the-art RAT-SQL encoder (Wang et al., 2020b), on top of GRAPPA (Yu et al., 2020), a pre-trained encoder for semantic parsing. We now briefly review this encoder for completeness.
The RAT-SQL encoder is based on two main ideas. First, it provides a joint contextualized representation of the utterance and schema. Specifically, the utterance x is concatenated to a linearized form of the schema S, and they are passed through a stack of Transformer (Vaswani et al., 2017) layers. Then, tokens that correspond to a single schema constant are aggregated, which results in a final contextualized representation $(\mathbf{x}, \mathbf{s}) = (\mathbf{x}_1, \ldots, \mathbf{x}_{|x|}, \mathbf{s}_1, \ldots, \mathbf{s}_{|s|})$, where $\mathbf{s}_i$ is a vector representing a single schema constant. This contextualization of x and S leads to better representation and alignment between the utterance and schema.
Second, RAT-SQL uses relation-aware self-attention (Shaw et al., 2018) to encode the structure of the schema and other prior knowledge on relations between encoded tokens. Specifically, given a sequence of token representations $(\mathbf{u}_1, \ldots, \mathbf{u}_{|u|})$, relation-aware self-attention computes a scalar similarity score between pairs of token representations, $e_{ij} \propto \mathbf{u}_i W_Q (\mathbf{u}_j W_K + \mathbf{r}^{K}_{ij})^{\top}$. This is identical to standard self-attention ($W_Q$ and $W_K$ are the query and key parameter matrices), except for the term $\mathbf{r}^{K}_{ij}$, which is an embedding that represents a relation between $\mathbf{u}_i$ and $\mathbf{u}_j$ from a closed set of possible relations. For example, if both tokens correspond to schema tables, an embedding will represent whether there is a primary-foreign key relation between the tables. If one of the tokens is an utterance word and the other is a table column, a relation will denote whether there is a string match between them. The same principle is also applied for representing the self-attention values, where another relation embedding matrix is used. We refer the reader to the RAT-SQL paper for details.
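As a concrete illustration, the sketch below computes relation-aware attention scores with the extra relation-embedding term; the tensor names, shapes, and the final softmax normalization are our own choices, not the official RAT-SQL implementation.

```python
# A minimal sketch of relation-aware self-attention scores (Shaw et al., 2018),
# as used by RAT-SQL. Names and shapes are our own illustration.
import torch

def relation_aware_scores(u, W_Q, W_K, r_K):
    """u: (n, d) token representations; W_Q, W_K: (d, d) projections;
    r_K: (n, n, d) embedding of the discrete relation between each token pair."""
    q = u @ W_Q                           # (n, d) queries
    k = u @ W_K                           # (n, d) keys
    # e_ij ∝ (u_i W_Q) · (u_j W_K + r^K_ij): the relation embedding biases the key.
    e = torch.einsum("id,ijd->ij", q, k.unsqueeze(0) + r_K)
    return torch.softmax(e / (u.shape[-1] ** 0.5), dim=-1)

n, d = 5, 16
u = torch.randn(n, d)
W_Q, W_K = torch.randn(d, d), torch.randn(d, d)
r_K = torch.randn(n, n, d)                # in practice: an embedding lookup per relation type
attn = relation_aware_scores(u, W_Q, W_K, r_K)  # (n, n) attention weights
```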
Overall, RAT-SQL jointly encodes the utterance, schema, the structure of the schema and alignments between the utterance and schema, and leads to state-of-the-art results in text-to-SQL parsing.
RAT-SQL layers are typically stacked on top of a pre-trained language model, such as BERT (Devlin et al., 2019). In this work, we use GRAPPA (Yu et al., 2020), a recent pre-trained model that has obtained state-of-the-art results in text-to-SQL parsing. GRAPPA is based on ROBERTA (Liu et al., 2019), but is further fine-tuned on synthetically generated utterance-query pairs using an objective for aligning the utterance and query.
Autoregressive top-down decoding  The prevailing method for decoding in semantic parsing has been grammar-based autoregressive top-down decoding (Yin and Neubig, 2017; Krishnamurthy et al., 2017; Rabinovich et al., 2017), which guarantees decoding of syntactically valid programs. Specifically, the target program is represented as an abstract syntax tree under the grammar of the formal language, and linearized to a sequence of rules (or actions) using a top-down depth-first traversal. Once the program is represented as a sequence, it can be decoded using a standard sequence-to-sequence model with encoder attention (Dong and Lapata, 2016), often combined with beam search. We refer the reader to the aforementioned papers for further details on grammar-based decoding.
We now turn to describe our method, which provides a radically different approach for decoding in semantic parsing.

The SMBOP parser
We first provide a high-level overview of SMBOP (see Algorithm 1 and Figure 1). As explained in §2, we encode the utterance and schema with a RAT-SQL encoder. We initialize the beam (line 3) with the K highest-scoring trees of height 0, which include either schema constants or DB values. All trees are scored independently and in parallel, in a procedure formally defined in §3.3. Next, we start the search procedure. At every step t, attention is used to contextualize the trees with information from the input question representation (line 5). This representation is used to score every tree on the frontier: the set of sub-trees of height ≤ t + 1 that can be constructed from sub-trees on the beam with height ≤ t (lines 6-7). After choosing the top-K trees for step t + 1, we compute a new representation for them (line 8). Finally, we return the top-scoring tree from the final decoding step, T. Steps in our model operate on tree representations independently, and thus each step is efficiently parallelized.
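Since Algorithm 1 is referenced but not reproduced in this extraction, the following Python-style pseudocode summarizes the decoding loop; the helper names (init_leaves, contextualize, score_frontier, top_k, represent, best_returnable_tree) are hypothetical stand-ins for the components defined in §3.2-§3.3.

```python
# Python-style pseudocode for the SMBOP decoding loop (our paraphrase of
# Alg. 1). Helper functions are hypothetical stand-ins for the components
# defined in §3.2-§3.3; this is a structural sketch, not the released code.
def smbop_decode(x_enc, schema_enc, K, T):
    # Line 3: top-K trees of height 0 (schema constants and DB value spans).
    beam = init_leaves(x_enc, schema_enc, K)
    for t in range(T):
        # Line 5: cross-attention from beam trees to the question tokens.
        ctx_beam = contextualize(beam, x_enc)
        # Lines 6-7: score all frontier trees of height t+1 and prune to top-K.
        frontier = score_frontier(ctx_beam)
        beam = top_k(frontier, K)
        # Line 8: recursive (non-contextualized) representations for new trees.
        beam = represent(beam)
    # Return the highest-scoring tree with semantic type Relation.
    return best_returnable_tree(beam)
```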
SMBOP resembles beam search as in each step it holds the top-K trees of a fixed height. It is also related to (pruned) chart parsing, since trees at step t + 1 are computed from trees that were found at step t. This is unlike sequence-to-sequence models where items on the beam are competing hypotheses without any interaction.
We now provide the details of our parser. First, we describe the formal language (§3.1); then we provide precise details of our model architecture (§3.2), including beam initialization (§3.3); we describe the training procedure (§3.4); and last, we discuss the properties of SMBOP compared to prior work (§3.5).

[Table 1: The grammar of relational algebra operations used by SMBOP, listing each Operation, its Notation, and its input and output semantic types.]

Representation of Query Trees
Relational algebra  Guo et al. (2019) have recently shown that the mismatch between natural language and SQL leads to parsing difficulties. Therefore, they proposed SemQL, a formal query language with better alignment to natural language. In this work, we follow their intuition, but instead of SemQL, we use the standard query language relational algebra (Codd, 1970). Relational algebra describes queries as trees, where leaves (terminals) are schema constants or DB values, and inner nodes (non-terminals) are operations (see Table 1). Like SemQL, its alignment with natural language is better than SQL's. However, unlike SemQL, it is an existing query language, commonly used by SQL execution engines for query planning.
We write a grammar for relational algebra, augmented with SQL operators that are missing from relational algebra. We then implement a transpiler that converts SQL queries to relational algebra for parsing, and back from relational algebra to SQL for evaluation. Table 1 shows the full grammar, including the input and output semantic types of all operations. A relation (R) is a tuple (or tuples), a predicate (P) is a Boolean condition (evaluating to True or False), a constant (C) is a schema constant or DB value, and $\mathcal{C}$ is a set of constants/values. Figure 2 shows an example relational algebra tree with the corresponding SQL query. More examples illustrating the correspondence between SQL and relational algebra (e.g., for the SQL JOIN operation) are in Appendix B. While our relational algebra grammar can also be adapted for standard top-down autoregressive parsing, we leave this endeavour for future work.
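To illustrate the correspondence (our own example in the spirit of Figure 1, since Figure 2 is not reproduced here): the SQL query `SELECT name FROM actor WHERE age >= 60` corresponds to the relational algebra tree $\Pi_{\text{name}}(\sigma_{\text{age} \geq 60}(\text{actor}))$, where the selection σ filters the actor relation with the predicate age ≥ 60, and the projection Π keeps only the name column.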
Tree balancing  Conceptually, at each step SMBOP should generate new trees of height ≤ t + 1 and keep the top-K trees computed so far. In practice, it is convenient to assume that trees are balanced. Thus, we want the beam at step t to only contain trees that are of height exactly t (t-high trees).
To achieve this, we introduce a unary KEEP operation that does not change the semantics of the subtree it is applied on. Hence, we can always grow the height of trees in the beam without changing the formal query. For training (which we elaborate on in §3.4), we balance all relational algebra trees in the training set using the KEEP operation, such that the distance from the root to all leaves is equal. For example, in Figure 2, two KEEP operations are used to balance the column actor.name. After tree balancing, all constants and values are at height 0, and the goal of the parser at step t is to generate the gold set of t-high trees.
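A minimal sketch of this balancing step is given below; the Tree class and function names are our own illustration, not the released implementation.

```python
# A minimal sketch of tree balancing with the unary KEEP operation (our own
# illustration; the Tree class and function names are hypothetical).
from dataclasses import dataclass, field

@dataclass
class Tree:
    op: str                        # e.g. "Pi", "sigma", ">=", "keep", or a leaf symbol
    children: list = field(default_factory=list)

def height(t: Tree) -> int:
    return 0 if not t.children else 1 + max(height(c) for c in t.children)

def balance(t: Tree) -> Tree:
    """Pad shallow children with KEEP nodes so all leaves end up at height 0."""
    if not t.children:
        return t
    kids = [balance(c) for c in t.children]
    h = max(height(k) for k in kids)
    padded = []
    for k in kids:
        while height(k) < h:       # KEEP is semantics-preserving, so padding is safe
            k = Tree("keep", [k])
        padded.append(k)
    return Tree(t.op, padded)

# E.g. Pi_name(sigma_{age>=60}(actor)): the column `name` sits two levels below
# the projection, so balancing wraps it in two KEEP nodes, as described above.
query = Tree("Pi", [Tree("name"),
                    Tree("sigma", [Tree(">=", [Tree("age"), Tree("60")]),
                                   Tree("actor")])])
balanced = balance(query)
```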

Model Architecture
To fully specify Alg. 1, we need to define the following components: (a) scoring of trees on the frontier (lines 5-6), (b) representation of trees (line 8), and (c) representing and scoring of constants and DB values during beam initialization (leaves). We now describe these components. Figure 3 illustrates the scoring and representation of a binary operation.
Scoring with contextualized beams  SMBOP maintains at each decoding step a beam $Z_t = \{(z_i^{(t)}, \mathbf{z}_i^{(t)})\}_{i=1}^{K}$, where $z_i^{(t)}$ is a symbolic representation of the query tree, and $\mathbf{z}_i^{(t)}$ is its corresponding vector representation. Unlike standard beam search, trees on our beams do not only compete with one another, but also compose with each other (similar to chart parsing). For example, in Fig. 1, the beam $Z_0$ contains the column age and the value 60, which compose using the ≥ operator to form the tree age ≥ 60.
We contextualize tree representations on the beam using cross-attention. Specifically, we use standard attention (Vaswani et al., 2017) to give tree representations access to the input question: the tree representations $(\mathbf{z}_1, \ldots, \mathbf{z}_K)$ are the queries, and the input tokens $(\mathbf{x}_1, \ldots, \mathbf{x}_{|x|})$ are the keys and values.
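A minimal sketch of this contextualization step, assuming a standard multi-head cross-attention layer; the shapes and names are our own.

```python
# Beam contextualization: trees attend to the question tokens via standard
# cross-attention. Shapes and names are our own illustration.
import torch
import torch.nn as nn

d, K, n_tokens = 256, 30, 40
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

z = torch.randn(1, K, d)          # beam tree representations (queries)
x = torch.randn(1, n_tokens, d)   # encoded question tokens (keys and values)
z_bar, _ = cross_attn(query=z, key=x, value=x)   # contextualized beam \bar{Z}_t
```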
Next, we compute scores for all (t + 1)-high trees on the frontier. Trees can be generated by applying either a unary operation u ∈ U (including KEEP) or a binary operation b ∈ B to beam trees. Let $\mathbf{w}_u$ be a scoring vector for a unary operation (such as $\mathbf{w}_\kappa$, $\mathbf{w}_\delta$, etc.), let $\mathbf{w}_b$ be a scoring vector for a binary operation (such as $\mathbf{w}_\sigma$, $\mathbf{w}_\Pi$, etc.), and let $\bar{\mathbf{z}}_i, \bar{\mathbf{z}}_j$ be contextualized tree representations on the beam. We define a scoring function for frontier trees, where the score for a new tree $z_{\text{new}}$ generated by applying a unary rule u on a tree $z_i$ is defined as follows:
$$s(z_{\text{new}}) = \mathbf{w}_u^{\top} \mathrm{FF}_U(\bar{\mathbf{z}}_i),$$
where $\mathrm{FF}_U$ is a 2-hidden layer feed-forward network with ReLU activations, and $[\cdot;\cdot]$ denotes concatenation. Similarly, the score for a tree generated by applying a binary rule b on the trees $z_i, z_j$ is
$$s(z_{\text{new}}) = \mathbf{w}_b^{\top} \mathrm{FF}_B([\bar{\mathbf{z}}_i; \bar{\mathbf{z}}_j]),$$
where $\mathrm{FF}_B$ is another 2-hidden layer feed-forward network with ReLU activations.
We use semantic types to detect invalid rule applications and fix their score to $s(z_{\text{new}}) = -\infty$. This guarantees that the trees SMBOP generates are well-formed, and the resulting SQL is executable. Overall, the total number of trees on the frontier is at most $K|\mathcal{U}| + K^2|\mathcal{B}|$. Because scores of different trees on the frontier are independent, they are efficiently computed in parallel. Note that we score new trees from the frontier before creating a representation for them, which we describe next.
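Before turning to tree representations, here is a minimal sketch of binary-frontier scoring with type masking as just described; FF_B mirrors the two-hidden-layer scorer from the text, while the shapes, names, and placeholder type mask are our own.

```python
# A sketch of binary-frontier scoring with semantic-type masking (our own
# illustration; FF_B mirrors the 2-hidden-layer feed-forward scorer above).
import torch
import torch.nn as nn

d, K, n_binary = 256, 30, 8
FF_B = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d), nn.ReLU())
w_b = torch.randn(n_binary, d)           # one scoring vector per binary operation

z_bar = torch.randn(K, d)                # contextualized beam representations
# Build all K^2 left/right pairs and score every (op, left, right) combination.
left = z_bar.unsqueeze(1).expand(K, K, d)
right = z_bar.unsqueeze(0).expand(K, K, d)
pair_repr = FF_B(torch.cat([left, right], dim=-1))      # (K, K, d)
scores = torch.einsum("od,ijd->oij", w_b, pair_repr)    # (n_binary, K, K)

# Invalid applications (wrong input semantic types) are masked to -inf,
# so only well-formed trees can enter the beam.
type_ok = torch.randint(0, 2, (n_binary, K, K), dtype=torch.bool)  # placeholder mask
scores = scores.masked_fill(~type_ok, float("-inf"))
```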
Recursive tree representation  After scoring the frontier, we generate a recursive vector representation for the top-K trees. While scoring is done with contextualized trees, representations are not contextualized. We empirically found that contextualized tree representations slightly reduce performance, possibly due to optimization issues.

[Figure 3: Illustration of our tree scoring and representation mechanisms. $z$ is the symbolic tree, $\mathbf{z}$ is its vector representation, and $\bar{\mathbf{z}}$ its contextualized representation.]
We represent trees with another standard Transformer layer. Let $\mathbf{z}_{\text{new}}$ be the representation for a new tree, let $\mathbf{e}$ be an embedding for a unary or binary operation, and let $\mathbf{z}_i, \mathbf{z}_j$ be non-contextualized tree representations from the beam we are extending. We compute a new representation as follows:
$$\mathbf{z}_{\text{new}} = \mathrm{Transformer}(\mathbf{e}, \mathbf{z}_i, \mathbf{z}_j),$$
(with $\mathrm{Transformer}(\mathbf{e}, \mathbf{z}_i)$ for unary operations), where for the unary KEEP operation, we simply copy the representation from the previous step.
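A sketch of this step, assuming the operation embedding and child vectors are fed as a short sequence to a single Transformer encoder layer; the exact pooling of the layer's output is our assumption, not a detail stated in the text.

```python
# Recursive tree representation: one Transformer layer over (operation
# embedding, child representations). Pooling choice is our assumption.
import torch
import torch.nn as nn

d = 256
layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)

e_op = torch.randn(1, 1, d)      # embedding of the (binary) operation
z_i = torch.randn(1, 1, d)       # left child (non-contextualized)
z_j = torch.randn(1, 1, d)       # right child (non-contextualized)

seq = torch.cat([e_op, z_i, z_j], dim=1)   # (1, 3, d)
z_new = layer(seq)[:, 0]                   # pool: take the operation position
```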
Return value As mentioned, the parser returns the highest-scoring tree in Z T . More precisely, we return the highest-scoring returnable tree, where a returnable tree is a tree that has a valid semantic type, that is, Relation (R).

Beam initialization
As described in line 3 of Alg. 1, the beam $Z_0$ is initialized with K schema constants (e.g., actor, age) and DB values (e.g., 60, "France"). In particular, we independently score schema constants and choose the top-$\frac{K}{2}$, and similarly score DB values and choose the top-$\frac{K}{2}$, resulting in a total beam of size K.

Schema constants
We use a simple scoring function $f_{\text{const}}(\cdot)$. Recall that $\mathbf{s}_i$ is a representation of a constant, contextualized by the rest of the schema and the utterance. The function $f_{\text{const}}(\cdot)$ is a feed-forward network that scores each schema constant independently: $f_{\text{const}}(\mathbf{s}_i) = \mathbf{w}_{\text{const}}^{\top} \tanh(W_{\text{const}} \mathbf{s}_i)$, and the top-$\frac{K}{2}$ constants are placed in $Z_0$.

DB values  Because the number of values in the DB is potentially huge, we do not score all DB values. Instead, we learn to detect spans in the question that correspond to DB values. This leads to a setup that is similar to extractive question answering (Rajpurkar et al., 2016), where the model outputs a distribution over input spans, and thus we adopt the architecture commonly used in extractive question answering. Concretely, we compute the probability that a token is the start token of a DB value, $P_{\text{start}}(x_i) \propto \exp(\mathbf{w}_{\text{start}}^{\top} \mathbf{x}_i)$, and similarly the probability that a token is the end token of a DB value, $P_{\text{end}}(x_i) \propto \exp(\mathbf{w}_{\text{end}}^{\top} \mathbf{x}_i)$, where $\mathbf{w}_{\text{start}}$ and $\mathbf{w}_{\text{end}}$ are parameter vectors. We define the probability of a span $(x_i, \ldots, x_j)$ to be $P_{\text{start}}(x_i) \cdot P_{\text{end}}(x_j)$, and place in the beam $Z_0$ the top-$\frac{K}{2}$ input spans, where the representation of a span $(x_i, x_j)$ is the average of $\mathbf{x}_i$ and $\mathbf{x}_j$.
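A sketch of both initialization scorers, following the formulas above; the module names and dimensions are our own illustration.

```python
# Beam initialization sketch: score schema constants with a feed-forward net,
# and score question spans as DB values with start/end pointers (names ours).
import torch
import torch.nn as nn

d, n_const, n_tok, K = 256, 20, 40, 30
W_const, w_const = nn.Linear(d, d, bias=False), nn.Linear(d, 1, bias=False)
w_start, w_end = nn.Linear(d, 1, bias=False), nn.Linear(d, 1, bias=False)

s = torch.randn(n_const, d)                       # contextualized schema constants
const_scores = w_const(torch.tanh(W_const(s))).squeeze(-1)
top_consts = const_scores.topk(K // 2).indices    # top-K/2 schema constants

x = torch.randn(n_tok, d)                         # contextualized question tokens
p_start = torch.softmax(w_start(x).squeeze(-1), dim=0)
p_end = torch.softmax(w_end(x).squeeze(-1), dim=0)
# Probability of span (i, j) = P_start(i) * P_end(j); keep top-K/2 spans (i <= j).
span_p = p_start.unsqueeze(1) * p_end.unsqueeze(0)
span_p = torch.triu(span_p)                       # zero out spans with j < i
flat = span_p.flatten().topk(K // 2).indices
top_spans = [(int(ix) // n_tok, int(ix) % n_tok) for ix in flat]
```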
A current limitation of SMBOP is that it cannot generate DB values that do not appear in the input question. This would require adding a mechanism such as "BRIDGE", proposed by Lin et al. (2020).

Training
To specify the loss function, we need to define the supervision signal. Recall that given the gold SQL program, we convert it into a gold balanced relational algebra tree $z_{\text{gold}}$, as explained in §3.1 and Figure 2. This lets us define for every decoding step the set of t-high gold sub-trees $Z^{\text{gold}}_t$. For example, $Z^{\text{gold}}_0$ includes all gold schema constants and input spans that match a gold DB value (in SPIDER, all gold DB values appear as input spans in 98.2% of the training examples), $Z^{\text{gold}}_1$ includes all 1-high gold trees, etc.
During training, we apply "bottom-up teacher forcing" (Williams and Zipser, 1989): we populate the beam $Z_t$ with all trees from $Z^{\text{gold}}_t$ (identified with an efficient tree-hashing procedure; see Appendix A) and then fill the rest of the beam (of size K) with the top-scoring non-gold predicted trees. This guarantees that we will be able to compute a loss at each decoding step, as described below.
Loss function  During search, our goal is to give high scores to the (possibly multiple) sub-trees of the gold tree. Because of teacher forcing, the frontier $\mathcal{F}_{t+1}$ is guaranteed to contain all gold trees $Z^{\text{gold}}_{t+1}$. We first apply a softmax over all frontier trees, $p(z_{\text{new}}) = \mathrm{softmax}\{s(z_{\text{new}})\}_{z_{\text{new}} \in \mathcal{F}_{t+1}}$, and then maximize the probabilities of gold trees:
$$\mathcal{L} = -\frac{1}{C} \sum_{t} \sum_{z \in Z^{\text{gold}}_{t+1}} \log p(z),$$
where the loss is normalized by C, the total number of summed terms. In the initial beam, $Z_0$, the probability of an input span is the product of the start and end probabilities, as explained in §3.3.
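A sketch of the per-step term of this loss, following our reconstruction of the formula above; the normalization by C over all steps is left as a comment.

```python
# Per-step loss sketch: softmax over all frontier scores, then the summed
# negative log-probability of the gold frontier trees (our reconstruction).
import torch
import torch.nn.functional as F

def step_loss(frontier_scores, gold_mask):
    """frontier_scores: (F,) scores s(z_new); gold_mask: (F,) bool, True for gold trees."""
    log_p = F.log_softmax(frontier_scores, dim=0)
    return -log_p[gold_mask].sum()

scores = torch.randn(100)
gold = torch.zeros(100, dtype=torch.bool)
gold[:3] = True
loss = step_loss(scores, gold)   # summed over steps and divided by C during training
```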

Discussion
To our knowledge, this work is the first to present a semi-autoregressive bottom-up semantic parser. We discuss the benefits of our approach.
SMBOP has theoretical runtime complexity that is logarithmic in the size of the tree instead of linear for autoregressive models. Figure 4 shows the distribution over the height of relational algebra trees in SPIDER, and the size of equivalent SQL query trees. Clearly, the height of most trees is at most 10, while the size is 30-50, illustrating the potential of our approach. In §4, we demonstrate that indeed semi-autoregressive parsing leads to substantial empirical speed-up.
Unlike top-down autoregressive models, SMBOP naturally computes representations $\mathbf{z}$ for all sub-trees constructed at decoding time, which are well-defined semantic objects. These representations can be used in setups such as contextual semantic parsing, where a semantic parser answers a sequence of questions. For example, given the questions "How many students are living in the dorms?" and then "What are their last names?", the pronoun "their" refers to a sub-tree from the SQL tree of the first question. Having a representation for such sub-trees can be useful when parsing the second question, in benchmarks such as SPARC (Yu et al., 2019).
Another potential benefit of bottom-up parsing is that sub-queries can be executed while parsing (Berant et al., 2013;Liang et al., 2017), which can guide the search procedure. Recently, Odena et al. (2020) proposed such an approach for program synthesis, and showed that conditioning on the results of execution can improve performance. We do not explore this advantage of bottom-up parsing in this work, since executing queries at training time leads to a slow-down during training.
SMBOP is a bottom-up semi-autoregressive parser, but it could potentially be modified to be autoregressive by decoding one tree at a time. Past work (Cheng et al., 2019) has shown that the performance of bottom-up and top-down autoregressive parsers is similar, but it is possible to re-examine this given recent advances in neural architectures.

Experimental Evaluation
We conduct our experimental evaluation on SPIDER (Yu et al., 2018), a challenging large-scale dataset for text-to-SQL parsing. SPIDER has become a common benchmark for evaluating semantic parsers because it includes complex SQL queries and a realistic zero-shot setup, where schemas at test time are different from training time.

Experimental setup
We encode the input utterance x and the schema S with GRAPPA, consisting of 24 Transformer layers, followed by another 8 RAT-SQL layers, which we implement inside AllenNLP (Gardner et al., 2018). Our beam size is K = 30, and the number of decoding steps is T = 9 at inference time, which is the maximal tree depth on the development set. The transformer used for tree representations has one layer, 8 heads, and dimensionality 256. We train for 60K steps with batch size 60, and perform early stopping based on the development set.
Evaluation  We evaluate performance with the official SPIDER evaluation script, which computes exact match (EM), i.e., whether the predicted SQL query is identical to the gold query after some query normalization. The evaluation script uses anonymized queries, where DB values are converted to a special value token. In addition, for models that output DB values, the evaluation script computes denotation accuracy, that is, whether executing the output SQL query results in the right denotation (answer). As SMBOP generates DB values, we evaluate using both EM and denotation accuracy.

Models  We compare SMBOP to the best non-anonymous models on the SPIDER leaderboard at the time of writing. Our model is most comparable to RAT-SQL+GRAPPA, which has the same encoder, but an autoregressive decoder. In addition, we perform the following ablations and oracle experiments:

• NO X-ATTENTION: We remove the cross-attention that computes $\bar{Z}_t$ and use the representations in $Z_t$ directly to score the frontier. In this setup, the decoder only observes the input question through the 0-high trees in $Z_0$.
• WITH CNTX REP.: We use the contextualized representations not only for scoring, but also as input for creating the new tree representations in $Z_{t+1}$. This tests whether contextualized representations on the beam hurt or improve performance.
• NO DB VALUES: We anonymize all SQL queries by replacing DB values with value, as described above, and evaluate SMBOP using EM. This tests whether learning from DB values improves performance.
• $Z_0$-ORACLE: An oracle experiment where $Z_0$ is populated with the gold schema constants (but predicted DB values). This shows results given perfect schema matching.

Results

Table 2 shows test results of SMBOP compared to the top (non-anonymous) entries on the leaderboard (Zhao et al., 2021; Shi et al., 2021; Yu et al., 2020; Deng et al., 2020; Wang et al., 2020a). SMBOP obtains an EM of 69.5%, only 0.3% lower than the best model, and 0.1% lower than RAT-SQL+GRAPPA, which has the same encoder but an autoregressive decoder. Moreover, SMBOP outputs DB values, unlike other models that output anonymized queries that cannot be executed. SMBOP establishes a new state-of-the-art in denotation accuracy, surpassing an ensemble of BRIDGE+BERT models by 2.9 denotation accuracy points, and by 2 EM points.
Turning to decoding time, we compare SMBOP to RAT-SQLv3+BERT, since the code for RAT-SQLv3+GRAPPA was not available; to the best of our knowledge, the decoder in both is identical, so this should not affect decoding time. We find that the decoder of SMBOP is on average 2.23x faster than the autoregressive decoder on the development set. Figure 5 shows the average speed-up for different query tree sizes, where we observe a clear linear speed-up as a function of query size. For long queries the speed-up factor reaches 4x-6x. When the encoder is also included, the average speed-up obtained by SMBOP is 1.55x.
In terms of training time, SMBOP leads to much faster training and convergence. We compare the learning curves of SMBOP and RAT-SQLv3+BERT, both trained on an RTX 3090, and also to RAT-SQLv3+GRAPPA using performance as a function of the number of examples, sent to us in a personal communication from the authors. SMBOP converges much faster than RAT-SQL (Fig. 7). E.g., after 120K examples, the EM of SMBOP is 67.5, while for RAT-SQL+GRAPPA it is 47.6. Moreover, SMBOP processes 20.4 training examples per second, compared to only 3.8 for the official RAT-SQL implementation. Combining these two facts leads to much faster training time (Fig. 6): slightly more than one day for SMBOP vs. 5-6 days for RAT-SQL.

Conclusions
In this work we present the first semi-autoregressive bottom-up semantic parser, which enjoys logarithmic theoretical runtime, and show that it leads to a 2.2x speed-up in decoding and ∼5x faster training, while maintaining state-of-the-art performance. Our work shows that bottom-up parsing, where the model learns representations for semantically meaningful sub-trees, is a promising research direction that can contribute in the future to setups such as contextual semantic parsing, where sub-trees often repeat and can enjoy the benefits of execution at training time. Future work can also leverage work on learning tree representations (Shiv and Quirk, 2019) to further improve parser performance.