Grammar-Constrained Neural Semantic Parsing with LR Parsers

Target meaning representations for semantic parsing tasks are often based on programming or query languages, such as SQL, and can be formalized by a context-free grammar. Assuming a priori knowledge of the target domain, such grammars can be exploited to enforce syntactical constraints when predicting logical forms. To that end, we assess how syntactical parsers can be integrated into modern encoder-decoder frameworks. Specifically, we implement an attentional SEQ2SEQ model that uses an LR parser to maintain syntactically valid sequences throughout the decoding procedure. Compared to other approaches to grammar-guided decoding that modify the underlying neural network architecture or attempt to derive full parse trees, our approach is conceptually simpler, adds less computational overhead during inference, and integrates seamlessly with current SEQ2SEQ frameworks. We present preliminary evaluation results against a recurrent SEQ2SEQ baseline on GEOQUERY and ATIS and demonstrate improved performance while enforcing grammatical constraints.


Introduction
Semantic parsing aims at delivering granular, structured representations of natural language utterances, referred to as meaning representations or logical forms. Thus, it goes beyond shallow semantic analysis involving argument identification and role labeling (Collobert et al., 2011; Roth and Lapata, 2016). Meaning representations based on programming or query languages (PYTHON, SQL) are describable by (deterministic) context-free grammars and are used for general-purpose code generation (Xiao et al., 2016; Yin and Neubig, 2017). Our work targets this particular subset of logical forms. A context-free grammar may be exploited to constrain a semantic parser to only produce token sequences derivable from the grammar. Specifically, we investigate how syntax constraints can be enforced in semantic parsers based on modern encoder-decoder frameworks in a non-intrusive, computationally inexpensive way at inference time. We show that enforcing grammatical constraints with LR parsers is particularly well suited for modern autoregressive neural network architectures used in neural machine translation (Sutskever et al., 2014; Vaswani et al., 2017). We do not require any modifications to standard SEQ2SEQ neural network architectures and make very few assumptions about the inputs and outputs of such models. In contrast, most grammar-constrained decoders attempt to model the grammar explicitly within the neural network, complicating the architecture. Moreover, they predict complete syntax trees or derivation sequences. Our approach predicts source code token streams, preserving syntactic validity throughout the decoding procedure. Enforcing syntactical constraints relieves neural network models from having to learn the syntactic structure of the target language, which is particularly beneficial for ensuring balanced expressions over long ranges (Bahdanau et al., 2014; Ling et al., 2016).
Also, when integrating our models into larger application environments, we may want to preclude specific failure modes (i.e., syntax errors) when executing the generated program snippets to increase robustness. Preliminary evaluation results on the GEOQUERY and ATIS data sets demonstrate that simply enforcing syntactical constraints on the predicted lexical tokens at inference time improves the performance of the semantic parser against a recurrent SEQ2SEQ baseline.

Related Work
Enforcing grammatical constraints within neural network models has sparked a fair amount of research interest. Xiao et al. (2016) take a derivational viewpoint when decoding derivation trees, demonstrating improved performance when accounting for grammatical constraints. They predict leftmost derivation sequences, each uniquely associated with a corresponding derivation tree. Constraints are enforced at training time by employing a constrained loss over the probabilities $p(\hat{y}_t)$, where $\hat{y}_t$ are the permissible continuations of a derivation sequence. We take inspiration from Xiao et al. (2016) but enforce grammatical constraints at inference time and on lexical token streams in a bottom-up fashion, eliminating the need to derive entire syntax trees and effectively reducing the sequence length.

Similarly, Yin and Neubig (2017) predict entire syntax trees sequentially using a SEQ2SEQ model, starting from the root node and generating tree nodes in depth-first, left-to-right order, deterministically converting them to the corresponding surface code. They define a dedicated grammar model that predicts action sequences that either apply a production rule or generate a lexical token. Krishnamurthy et al. (2017) additionally ensure that decoder predictions satisfy type constraints by providing a type-constrained grammar. Rabinovich et al. (2017) propose a decoder that employs a separate neural network module for each construct in the grammar. The decoder generates an abstract syntax tree (AST) through mutual recursion between modules. At each decoding step, the decoder either generates a symbol or propagates the decoder state to the next module. Yin and Neubig (2018) developed a transition-based abstract syntax parser (TRANX) guided by a grammar specified under ASDL (Wang et al., 1997). TRANX uses ASTs as general-purpose intermediate meaning representations, decoupling the semantic parsing procedure from domain-specific grammars. A user-defined grammar converts ASTs to domain-specific meaning representations. As in Yin and Neubig (2017), an AST is generated using a sequence of tree-constructing actions.

All of these approaches enforce syntactical constraints by first predicting the tree-structured syntax tree top-down. Instead, we propose to directly generate lexical tokens (the values of syntax tree leaves) and constrain the decoding process by means of a bottom-up LR parser.

Problem Statement
Informally, we aim at translating a set of natural language utterances X to a structured representation of their meaning. We assume the syntax of target meaning representations is describable by a deterministic context-free grammar and that it is known at training time. Given a grammar G, our goal is to enforce the constraints imposed by G during decoding. That is, the image of our model f shall be the language generated by G.
We achieve this by means of a recurrent encoder-decoder model as proposed by Sutskever et al. (2014) and an LR parser. We briefly introduce recurrent encoder-decoder NMT models and the specifics of our model.

SEQ2SEQ Model
All modern encoder-decoder frameworks define a probability distribution $P(y \mid x)$ (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014), where in our case, $x$ represents a natural language input. For a target source code string, $y$ represents the token stream generated by a lexical analyzer and a corresponding lexical grammar (see section 3.2). $P(y \mid x)$ is factorized as:

$$P(y \mid x) = \prod_{t=1}^{\kappa} P(y_t \mid y_1, \ldots, y_{t-1}, x) \tag{2}$$

The encoder portion of the neural network encodes $x$ into a vector-valued, so-called context. Conditioned on the context and all previous decoder hidden states, the decoder generates the output tokens $y = (y_1, \ldots, y_\kappa)$. Both encoder and decoder are distinct recurrent neural networks (LSTMs in our case). The decoder generates a sequence of hidden states and outputs a hidden state $h_t^L$ at the topmost, $L$-th layer at timestep $t$. The individual factors in Eq. 2 are finally obtained using a feedforward neural network with a softmax layer that maps each hidden state to a probability distribution over the token vocabulary $V_D$ of the decoder. We optimize the standard cross-entropy loss.
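To make the factorization concrete, the following minimal sketch computes the log-probability of a token sequence as the sum of per-step log-softmax factors. It assumes the per-step decoder logits are already available as plain lists; the function names `softmax` and `sequence_log_prob` are illustrative and not taken from the paper's implementation.

```python
import math


def softmax(logits):
    """Map a list of logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sequence_log_prob(step_logits, token_ids):
    """Sum of log P(y_t | y_<t, x) over all timesteps.

    step_logits[t] holds the decoder logits over the vocabulary at step t
    (assumed to be already conditioned on the input and the previous tokens);
    token_ids[t] is the index of the target token at step t.
    """
    return sum(
        math.log(softmax(logits)[tok])
        for logits, tok in zip(step_logits, token_ids)
    )


# Two steps over a two-token vocabulary with uniform logits:
# each factor is 0.5, so the sequence probability is 0.25.
log_p = sequence_log_prob([[0.0, 0.0], [0.0, 0.0]], [0, 1])
```

The negative of this quantity, averaged over the training samples, corresponds to the cross-entropy loss mentioned above.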

Grammar-Constrained Decoder
LR parsers are used to verify that a given token stream is derivable from a deterministic context-free grammar G. The parsing stage is usually preceded by lexical analysis. During lexical analysis, a source code string is converted into a sequence of tokens, ready to be consumed by the parser. LR parsers employ ACTION and GOTO tables associated with the grammar, governing the applicable shift-reduce decisions the parser can make on each token input and determining the error states of the parser. The decoder stage in the SEQ2SEQ model can be viewed as taking the role of the lexical analyzer in the parsing process (see Figure 1).
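As an illustration of the lexical analysis step that the decoder replaces, the sketch below tokenizes a small SQL-like string with regular expressions. The token classes, the keyword set, and the function name `tokenize` are assumptions made for this example; they do not reproduce the lexical grammar used in the paper.

```python
import re

# Hypothetical token classes for a tiny SQL-like language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("STRING", r"'[^']*'"),
    ("NAME", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("OP", r"[<>=]+|[(),.*;]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
KEYWORDS = {"SELECT", "FROM", "WHERE"}


def tokenize(source):
    """Convert a source string into a list of (token_type, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "SKIP":
            continue  # whitespace carries no meaning for the parser
        if kind == "NAME" and text.upper() in KEYWORDS:
            kind = text.upper()  # keywords become their own token type
        tokens.append((kind, text))
    return tokens


tokens = tokenize("SELECT name FROM city WHERE population > 100")
```

In the constrained decoder, no such tokenizer runs at inference time: the decoder emits the tokens directly, and the parser consumes them one at a time.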

[Figure 1: Decoder and parser; the decoder takes the role of the lexical analyzer.]

The decoder determines the set of applicable tokens in its vocabulary by consulting the parser's ACTION table. We generate a probability distribution as described in section 3.1 over the actions (identified by lexical tokens) in the current parser state and return the most likely token to the parser. The parser consumes the token, updates its state, and requests the next token. This process continues until the parser encounters a token that indicates acceptance (an EOF token "$").

The neural network model is implemented using PYTORCH. The parser implementation relies on the parsing toolkit LARK. The output vocabulary $V_D$ consists of all source code tokens defined by the given grammar. Literals, such as string or integer values, are usually tokenized by matching them with a regular expression. We explicitly include all occurrences of literal values in the data sets as distinct tokens in the vocabulary.

Algorithm 1 describes our modified LR parsing procedure, which relies on the decoder module to provide the token stream. The procedure is initialized with the context vector obtained from the encoder and the parser start state. Given the state $s$, we determine the set of possible tokens $e_t$ by looking in the parser's ACTION table. Conditioned on the previous hidden state, we invoke the DECODE function and generate an output distribution $\hat{y}$ over all output vocabulary tokens. We finally choose the most probable admissible token as our prediction, $a = \arg\max_{k \in e_t} \hat{y}_k$, i.e., the maximum is taken only over the elements of $\hat{y}$ indexed by $e_t$. The next hidden state $h_t$ is returned, and the parser updates its state by parsing $a$. On shift actions, we push the associated state $i$ onto the stack and request the next token. On reduce actions, we pop the recognized handle off the stack and push the left-hand side of the production onto the stack. The decoding procedure concludes when the parser encounters a token that indicates acceptance (corresponding to action "acc").
Note that the decoder is invoked only if $|e_t| > 1$, i.e., when there is more than one applicable token. Otherwise, we simply set $a$ to the single element of $e_t$. The additional computational overhead of running a single parsing step is constant at each decoding step. Although most programming languages are close to deterministic, generalizing our approach to GLR parsers (and thus to context-free grammars) may incur an additional computational cost proportional to the non-determinism in the grammar (Tomita, 1985).
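The decoding loop described above can be sketched in plain Python. Everything in this example is an assumption made for illustration: the toy grammar (L -> "(" L ")" | "x"), its hand-built SLR tables, and the stand-in scoring function `toy_scores`, which replaces the neural decoder. Because SLR tables can postpone error detection until after a reduction, the sketch determines the admissible tokens by simulating each candidate on a copy of the stack rather than by a single table lookup.

```python
# Toy grammar (hypothetical):  rule 1: L -> "(" L ")"   rule 2: L -> "x"
ACTION = {
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, ")"): ("reduce", 2), (3, "$"): ("reduce", 2),
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
}
GOTO = {(0, "L"): 1, (2, "L"): 4}
RULES = {1: ("L", 3), 2: ("L", 1)}  # rule id -> (left-hand side, handle length)
VOCAB = ["(", ")", "x", "$"]


def feed(stack, token):
    """Consume one token; return (new_stack, accepted) or None on a syntax error."""
    stack = list(stack)  # work on a copy so callers can probe tokens safely
    while True:
        entry = ACTION.get((stack[-1], token))
        if entry is None:
            return None
        kind, arg = entry
        if kind == "shift":
            stack.append(arg)
            return stack, False
        if kind == "accept":
            return stack, True
        lhs, handle_len = RULES[arg]          # reduce: pop the handle ...
        del stack[len(stack) - handle_len:]
        stack.append(GOTO[(stack[-1], lhs)])  # ... and push the GOTO state


def admissible(stack):
    """Tokens that the parser can consume in its current configuration."""
    return [tok for tok in VOCAB if feed(stack, tok) is not None]


def constrained_greedy_decode(score_fn, max_steps=50):
    """Greedy decoding restricted to admissible tokens; skips forced steps."""
    stack, output, decoder_calls = [0], [], 0
    for _ in range(max_steps):
        e_t = admissible(stack)
        if len(e_t) == 1:
            a = e_t[0]                        # forced token, decoder not invoked
        else:
            scores = score_fn(output)
            decoder_calls += 1
            a = max(e_t, key=lambda tok: scores[tok])
        stack, accepted = feed(stack, a)
        if accepted:
            return output, decoder_calls
        output.append(a)
    raise RuntimeError("no accepting state reached")


def toy_scores(prefix):
    """Stand-in for the decoder: always prefers "$", which is rarely valid."""
    if len(prefix) < 2:
        return {"(": 2.0, "x": 1.0, ")": 0.5, "$": 3.0}
    return {"(": 1.0, "x": 2.0, ")": 0.5, "$": 3.0}


tokens, calls = constrained_greedy_decode(toy_scores)
```

In this toy run, an unconstrained greedy decoder would emit "$" immediately, whereas the constrained loop produces a balanced sequence and queries the scorer only at the three steps where more than one token is admissible.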

Model Training
Algorithm 1 is only used during inference. Thus, during model training, the decoder may generate sequences $y \notin L(G)$. Furthermore, since the decoder is only invoked when $|e_t| > 1$, there is a mismatch between sequences seen during training and during test time. To account for this mismatch, each target sample is parsed prior to training, and for each parser state $s$ for which $|\mathrm{EXPECTED}(s)| = 1$, i.e., for which exactly one token is admissible, we filter the corresponding target sequence element from the target sample.
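A minimal sketch of this preprocessing step is shown below. It assumes a callable `expected(prefix)` that returns the set of admissible next tokens for a given prefix; in the paper this information comes from the LR parser, whereas here a hypothetical position-based stand-in (`toy_expected`) for a fixed "SELECT col FROM tab" pattern is used to keep the example self-contained.

```python
def filter_forced_tokens(target, expected):
    """Drop target tokens at positions where exactly one token is admissible."""
    kept = []
    for t, token in enumerate(target):
        allowed = expected(target[:t])
        if token not in allowed:
            raise ValueError(f"target token {token!r} at position {t} is not admissible")
        if len(allowed) > 1:
            kept.append(token)  # the decoder has a real choice to learn here
    return kept


def toy_expected(prefix):
    """Hypothetical stand-in for the parser's expected-token set."""
    pattern = [{"SELECT"}, {"a", "b"}, {"FROM"}, {"t", "u"}]
    return pattern[len(prefix)] if len(prefix) < len(pattern) else {"$"}


filtered = filter_forced_tokens(["SELECT", "a", "FROM", "u"], toy_expected)
```

Here only the column and table names remain in the training target, since the keywords are fully determined by the grammar.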

Experimental Evaluation
We present preliminary evaluation results and compare our approach to a recurrent SEQ2SEQ baseline (see section 3.1) and an attentional SEQ2SEQ model as reported by Finegan-Dollak et al. (2018). Our attentional model extends the recurrent baseline with an attention layer as proposed by Bahdanau et al. (2014).

Datasets
For our trials, we use the canonicalized and annotated semantic parsing data sets for text-to-SQL tasks provided by Finegan-Dollak et al. (2018). Compared to data sets like WIKISQL, GEOQUERY and ATIS feature complex queries with low levels of redundancy. We hypothesize that the benefits of a grammar-constrained decoder will be particularly pronounced in data sets with high complexity and variability. To ensure comparability, we use the same training, validation, and test splits as Finegan-Dollak et al. (2018).

Setup
We run trials without entity anonymization and with anonymized entities. We refer to trials with the standard dataset, i.e., the trials without anonymized entities, as standard trials. Trials with entity anonymization are referred to as oracle trials. Greedy search was used for generating output sequences. We measured the exact match accuracy: a predicted token sequence that is identical to the token sequence in the corresponding test set example constitutes an exact match. Stochastic gradient descent with momentum (0.9) and a learning rate of 0.1 was used for each trial. The batch sizes ({16, 32, 64} for GEOQUERY and {128, 256} for ATIS), hidden and embedding dimensions ({64, 96, 128, 256}), the dropout rate for embeddings and hidden units ({0.05, 0.1, 0.2, 0.4}), the number of layers ({1, 2}), and the teacher-forcing ratio ({1.0, 0.9, 0.8, 0.7}) were determined using grid search. For testing, we selected the models with the best validation set performance during training and applied an early stopping criterion.
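The exact match metric described above can be sketched as follows. The function name and the example token sequences are illustrative assumptions and do not stem from the evaluation data.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predicted token sequences identical to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty test set")
    matches = sum(pred == ref for pred, ref in zip(predictions, references))
    return matches / len(references)


# Hypothetical token sequences: the second prediction differs in one token.
predictions = [["SELECT", "a", "FROM", "t"], ["SELECT", "b", "FROM", "t"]]
references = [["SELECT", "a", "FROM", "t"], ["SELECT", "b", "FROM", "u"]]
accuracy = exact_match_accuracy(predictions, references)
```

Note that this metric is strict: a prediction that differs from the reference in a single token counts as a miss, even if the two queries would return the same result.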

Results
In Table 1 and Table 2, we present the results of the evaluation. We see the greatest improvements in the oracle trials without an attention layer. This suggests that the main utility of enforcing syntactical constraints lies in resolving the complex syntactical structures of target logical forms. Correctly recognizing entities and inserting the appropriate literals into the query is more akin to a slot-filling task than a semantic parsing task, and we observe no added value in enforcing grammatical constraints to resolve such literals in the standard trials. Applying an attention mechanism to both our approach and the basic recurrent model of Finegan-Dollak et al. (2018) further puts the results into perspective. An attention layer in recurrent SEQ2SEQ models helps with resolving long-range dependencies that may occur when expanding non-terminals, for example, involving long sub-queries (Bahdanau et al., 2014). Similarly, long-range dependencies are resolved by virtue of the LR parser, ensuring that any non-terminal node is fully expanded, even if it involves sub-expressions that are expanded to long token sequences. Thus, using an attention mechanism, syntactic relationships between tokens can be learned much better, although, unlike with an LR parser, syntax errors cannot be completely precluded.

Conclusion and Future Work
We showed that grammatical constraints can be enforced with LR parsers, imposing no assumptions on the neural machine translation model used and adding little computational overhead. We intend to expand the trials to include logical forms other than SQL and comparable approaches to enforcing grammatical constraints (Xiao et al., 2016; Yin and Neubig, 2017). Moreover, we intend to generalize our approach to context-free grammars using GLR parsers and to enforce grammatical constraints at training time.