A Globally Normalized Neural Model for Semantic Parsing

In this paper, we propose a globally normalized model for context-free grammar (CFG)-based semantic parsing. Instead of predicting a probability, our model predicts a real-valued score at each step and does not suffer from the label bias problem. Experiments show that our approach outperforms locally normalized models on small datasets, but it does not yield improvement on a large dataset.


Introduction
Semantic parsing has received much interest in the NLP community (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Jia and Liang, 2016; Guo et al., 2020). The task is to map a natural language utterance to executable code, such as λ-expressions, SQL queries, and Python programs.
Recent work integrates the context-free grammar (CFG) of the target code into the generation process. Instead of generating the tokens of the code (Dong and Lapata, 2016), CFG-based semantic parsing predicts the grammar rules in the abstract syntax tree (AST). This guarantees that the generated code complies with the CFG, and thus it has been widely adopted (Guo et al., 2019; Bogin et al., 2019; Sun et al., 2019, 2020).
Typically, neural semantic parsing models are trained by maximum likelihood estimation (MLE). The models predict the probability of the next rule in an autoregressive fashion, and are known as locally normalized models. However, local normalization is often criticized for the label bias problem (Lafferty et al., 2001; Andor et al., 2016; Wiseman and Rush, 2016; Stanojević and Steedman, 2020). In semantic parsing, for example, grammar rules that generate identifiers (e.g., variable names) have much lower probability than other grammar rules. Thus, the model is biased towards rules that avoid predicting identifiers. More generally, a locally normalized model prefers early-step predictions that lead to low entropy in future steps.
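The label bias effect described above can be made concrete with a toy numeric sketch (the numbers and the `softmax` helper are ours, purely for illustration, not from the paper):

```python
import math

def softmax(logits):
    """Locally normalize a list of logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy example: two first-step actions with identical logits.
# Branch A leads to a step with 2 candidate actions;
# branch B leads to a step with 100 candidates (e.g., identifier names).
p_first = softmax([1.0, 1.0])        # both branches tie at step 1

p_next_a = softmax([0.0] * 2)[0]     # 0.5  for the correct action in branch A
p_next_b = softmax([0.0] * 100)[0]   # 0.01 for the correct action in branch B

joint_a = p_first[0] * p_next_a      # 0.25
joint_b = p_first[1] * p_next_b      # 0.005
# The locally normalized model prefers branch A purely because its future
# candidate set is smaller -- an instance of the label bias problem.
```

Even though both first-step actions are equally plausible, the product of local probabilities favors the branch whose future candidate sets are smaller.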
In this work, we propose to apply global normalization to neural semantic parsing. Our model scores every grammar rule with an unbounded real value, instead of a probability, so that the model does not have to avoid high-entropy predictions and does not suffer from label bias. Specifically, we use max-margin loss for training, where the ground truth is treated as the positive sample and beam search results are negative samples. In addition, we accelerate training by initializing the globally normalized model with the parameters from a pretrained locally normalized model.
We conduct experiments on three datasets: ATIS (Dahl et al., 1994), CoNaLa, and Spider (Yu et al., 2018). Compared with local normalization, our globally normalized model is able to achieve higher performance on the small ATIS and CoNaLa datasets with the long short-term memory (LSTM) architecture, but does not yield improvement on the massive Spider dataset when using a BERT-based pretrained language model.

Related Work
Early approaches to semantic parsing mainly rely on predefined templates and are domain-specific (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Kwiatkowksi et al., 2010). Later, researchers applied sequence-to-sequence models to semantic parsing. Dong and Lapata (2016) propose to generate tokens along the syntax tree of a program. Yin and Neubig (2017) generate a program by predicting the grammar rules; our work uses the TranX tool with this framework.
Globally normalized models, such as the conditional random field (CRF; Lafferty et al., 2001), are able to mitigate the label bias problem. However, their training is generally difficult due to the global normalization process. To tackle this challenge, Daumé and Marcu (2005) propose learning as search optimization (LaSO), and Wiseman and Rush (2016) extend it to the neural network regime as beam search optimization (BSO). Specifically, they obtain negative partial samples whenever the ground truth falls out of the beam during search, and "restart" the beam search with the ground truth partial sequence teacher-forced.
Our work is similar to BSO. However, we search for an entire output, and do not train with partial negative samples. This is because our decoder is tree-structured, and different partial trees cannot be implemented in batch efficiently. We instead perform locally normalized pretraining to ease the training of our globally normalized model.

Methodology
In this section, we first introduce the neural semantic parser TranX, which serves as the locally normalized base model in our work. We then elaborate on how to construct its globally normalized version.

The TranX Framework
TranX is a context-free grammar (CFG)-based neural semantic parsing system. TranX first encodes a natural language input X with a neural network encoder. Then, the model generates a program by predicting the grammar rules (also known as actions) along the abstract syntax tree (AST) of the program.
In TranX, these actions are predicted in an autoregressive way, conditioned on the input X and the partially generated tree:

\[ P_L(a_t \mid a_{<t}, X; \theta_L) = \frac{\exp o(a_t \mid a_{<t}, X)}{\sum_{a' \in A_t(a_{<t})} \exp o(a' \mid a_{<t}, X)} \tag{1} \]

where θ_L denotes the parameters of the neural network model, and the subscript L emphasizes that the probability is locally normalized. o(·) denotes the logit at this step, and a_t is an action (i.e., a grammar rule) in A_t(a_{<t}), the set of all possible actions at this step, which depends on the previously predicted rules a_{<t}.

In other words, the prediction probability is normalized at every step, and the training objective is to maximize the log-likelihood

\[ \mathcal{L}(\theta_L) = \sum_{t=1}^{n} \log P_L(a_t \mid a_{<t}, X; \theta_L) \tag{2} \]

where n is the total number of steps.
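As a minimal sketch of the locally normalized probability and the MLE objective (function names such as `local_prob` are ours for illustration, not TranX's actual API; the logit o is assumed to be given as a plain Python function):

```python
import math

def local_prob(o, a_t, candidates, prefix):
    """Locally normalized probability: softmax of the logit o(a_t | a_<t)
    over the candidate action set A_t(a_<t)."""
    logits = {a: o(a, prefix) for a in candidates}
    m = max(logits.values())
    z = sum(math.exp(v - m) for v in logits.values())
    return math.exp(logits[a_t] - m) / z

def mle_objective(o, actions, candidate_fn):
    """MLE training objective: sum of log local probabilities over all steps.
    candidate_fn maps a prefix a_<t to the candidate set A_t(a_<t)."""
    total = 0.0
    for t, a_t in enumerate(actions):
        prefix = actions[:t]
        total += math.log(local_prob(o, a_t, candidate_fn(prefix), prefix))
    return total
```

With a uniform logit function, each step contributes log(1/|A_t|), which already hints at label bias: steps with smaller candidate sets contribute higher log-probabilities.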

Globally Normalized Training
A locally normalized model may suffer from the label bias problem (Lafferty et al., 2001). This is because such a model normalizes the probability to 1 at every step. However, the candidate action set A_t(a_{<t}) may have different sizes, and the actions from a smaller A_t(a_{<t}) typically have higher probabilities. Thus, the model would prefer prefixes a_{<t} that yield a smaller A_t(a_{<t}) in future steps.

We propose to adapt TranX into a globally normalized model to alleviate label bias. Instead of predicting a probability P(a_t | a_{<t}, X) as in (1), our globally normalized model predicts a positive score at each step:

\[ s(a_t \mid a_{<t}; \theta_G) = \exp o(a_t \mid a_{<t}, X) \tag{3} \]

where o(·) is the same logit as in (1), and θ_G denotes the parameters.
The probability of the sequence a_{1:n} is normalized only once, in a global manner:

\[ P_G(a_{1:n} \mid X) = \frac{1}{Z_G} \prod_{t=1}^{n} s(a_t \mid a_{<t}; \theta_G) \tag{4} \]

where \( Z_G = \sum_{a'_{1:n}} \prod_{t=1}^{n} s(a'_t \mid a'_{<t}; \theta_G) \) is the partition function, summing over all valid action sequences.
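The global probability can be sketched as follows (our own illustrative code, not TranX's implementation; the partition function is computed by brute-force enumeration, which is feasible only on a toy action space):

```python
import math
from itertools import product

def sequence_score(o, actions):
    """Unnormalized sequence score: the product of per-step scores
    s(a_t | a_<t) = exp o(a_t | a_<t)."""
    return math.exp(sum(o(a_t, actions[:t]) for t, a_t in enumerate(actions)))

def global_prob(o, actions, vocab):
    """Globally normalized probability: normalize once over all action
    sequences of the same length, rather than at every step."""
    n = len(actions)
    z = sum(sequence_score(o, list(seq)) for seq in product(vocab, repeat=n))
    return sequence_score(o, actions) / z
```

For real grammars the enumeration is exponential in n, which is exactly why the partition function is intractable and an approximation is needed.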
A globally normalized model alleviates the label bias problem, because it does not normalize the probability at every prediction step, as seen from (4). Thus, it is not biased by the size of A t (a <t ).
The training objective is still to maximize the likelihood, albeit normalized in a global way. However, computing the partition function Z_G in (4) requires enumerating all combinations of actions a_{1:n}, which is generally intractable.
In practice, the maximum likelihood training is approximated by a max-margin loss between a positive sample a_{1:n} and a negative sample a⁻_{1:n}:

\[ J = \max\left(0,\; \Delta - o(a_{1:n} \mid X) + o(a^-_{1:n} \mid X)\right) \tag{5} \]

where \( o(a_{1:n} \mid X) = \frac{1}{n} \sum_{t=1}^{n} o(a_t \mid a_{<t}) \) is the average of the logits, and ∆ is a positive constant (the margin).
The positive sample is simply the ground truth actions, whereas the negative samples are obtained by beam search. In other words, we perform beam search inference during training, and the sequences in the beam (other than the ground truth) serve as the negative samples.
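The max-margin training over beam-search negatives can be sketched as below (a hedged illustration with our own function names; the negatives are assumed to be action sequences already extracted from the beam):

```python
def avg_logit(o, actions):
    """Average of per-step logits over the whole action sequence."""
    return sum(o(a_t, actions[:t]) for t, a_t in enumerate(actions)) / len(actions)

def max_margin_loss(o, gold, negatives, delta=0.1):
    """Hinge loss: the gold sequence should outscore each beam-search
    negative by at least the margin delta."""
    gold_score = avg_logit(o, gold)
    return sum(max(0.0, delta - gold_score + avg_logit(o, neg))
               for neg in negatives)
```

A negative that already scores well below the gold sequence contributes zero loss, so only the hard negatives found by beam search drive the gradient.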
Similar to MLE training for (4), the max-margin loss increases the logits of the ground truth sample, while decreasing the logits for others. It is noted that the quality of negative samples will largely affect the max-margin training, as only a few samples are used to approximate Z G .
To address this issue, we initialize the parameters of the globally normalized model θ G with θ L in a pretrained locally normalized model. Thus, our negative samples are of higher quality, so that the max-margin training is easier and more stable.

Handling the Copy Mechanism
TranX has a copy mechanism (Gu et al., 2016) as an important component for predicting the terminal nodes of the AST, since the target program largely overlaps with the source utterance, especially for entities (e.g., "file.csv" in Figure 1). In the locally normalized TranX, the copy mechanism marginalizes over generating a token from the vocabulary and copying it from the source:

\[ P_L(a_t = \text{GenToken}[v] \mid a_{<t}, X) = P(\text{gen} \mid a_{<t}, X)\, P(v \mid \text{gen}, a_{<t}, X) + P(\text{copy} \mid a_{<t}, X)\, P(v \mid \text{copy}, a_{<t}, X) \]

where GenToken[·] denotes generating a terminal token v, P(copy | ·) is the predicted probability of copying the token v from the source utterance, and P(gen | ·) = 1 − P(copy | ·) is the probability of generating v from the vocabulary.
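The marginalization above amounts to a gated mixture of two distributions, as in this sketch (names like `gen_token_prob` are ours, not TranX's API):

```python
def gen_token_prob(p_copy, p_vocab, p_src, v):
    """Locally normalized copy mechanism: marginalize the probability of
    generating token v from the vocabulary and copying it from the source.
    p_copy is the copy-gate probability; p_vocab and p_src map tokens to
    probabilities under generation and copying, respectively."""
    p_gen = 1.0 - p_copy
    return p_gen * p_vocab.get(v, 0.0) + p_copy * p_src.get(v, 0.0)
```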
However, the copy mechanism cannot be directly combined with global normalization, because we use unbounded, real-valued logits instead of probabilities: multiplying two logits is not meaningful, e.g., when both logits are negative, their product is positive.
Therefore, we propose a variant of the copy mechanism for the globally normalized setting. Specifically, we keep the probabilities P(copy | ·) and P(gen | ·), and use them to weight the logits of generating and copying a token v:

\[ o(a_t = \text{GenToken}[v] \mid a_{<t}, X) = P(\text{gen} \mid a_{<t}, X)\, o(v \mid \text{gen}, a_{<t}, X) + P(\text{copy} \mid a_{<t}, X)\, o(v \mid \text{copy}, a_{<t}, X) \tag{6} \]

Here, o(a_t = GenToken[v] | ·) is a linear interpolation of the two logits, and thus fits the max-margin loss (5) naturally.
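A sketch of this variant (our own illustrative function; the `default` fallback logit for out-of-vocabulary tokens is an assumption of ours, not specified in the paper):

```python
def gen_token_logit(p_copy, o_vocab, o_src, v, default=-10.0):
    """Copy variant for the globally normalized setting: interpolate the
    generation and copy *logits* with the gate probabilities, so the result
    stays an unbounded real-valued score rather than a product of logits."""
    p_gen = 1.0 - p_copy
    return p_gen * o_vocab.get(v, default) + p_copy * o_src.get(v, default)
```

Because the gate weights are probabilities in [0, 1], the interpolated logit always lies between the two input logits, avoiding the sign pathology of multiplying them.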
Experiments

It should be pointed out that much previous work adopts data anonymization techniques that replace entities with placeholders (Dong and Lapata, 2016; Yin and Neubig, 2017, 2019; Sun et al., 2020). Unfortunately, this causes a large number of duplicate samples between the training and test sets. This was recently noticed by Guo et al. (2020); thus, in our work, we only compare models on the original, unanonymized ATIS dataset.
Settings. Our globally normalized semantic parser is developed upon the open-source TranX. We adopt the CFG grammars provided by TranX to convert lambda calculus and Python programs into ASTs and sequences of grammar rules (actions). For the ATIS and CoNaLa datasets, we use LSTM models as both the encoder and the decoder, with dimensions set to 256. For the Spider dataset, we use a pretrained BERT model (Devlin et al., 2019) and the relation-aware Transformer (Wang et al., 2020) as the encoder, and an LSTM as the decoder; the architecture generally follows the work by Xu et al. (2021). The beam size is set to 20 when searching for negative samples, and to 5 for inference. The margin ∆ in (5) is set to 0.1. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 5e-4 for training.
For both ATIS and CoNaLa datasets, we report the best results on the development sets and the corresponding results on the test set. For the Spider dataset, we only report the results on the development set as the ground truth of the test set is not publicly available.

Results
ATIS dataset. Following Yin and Neubig (2017) and Sun et al. (2020), we report the exact match accuracy for ATIS. We first replicate locally normalized models with and without the copy mechanism and achieve similar results to Jia and Liang (2016) and Guo et al. (2020), shown in Table 1. This verifies that we have a fair implementation and are ready for the study of global normalization.
We observe that the copy mechanism largely affects the accuracy on the test set, although it has little effect on the development set. This is because the training and validation distributions closely resemble each other, whereas the test distribution differs substantially. Therefore, the copy mechanism is important for handling unseen entities in the test set, and our proposed copy variant in Section 3.3 is likewise essential to globally normalized models.
We then train our model with the max-margin loss. Our globally normalized model consistently improves the accuracy on both the development and test sets, compared with its locally normalized counterpart. This shows the effectiveness of our approach.

Table 3: Accuracy on the Spider development set.
Model                      Dev Acc.
Rubin and Berant (2020)    73.4%
Yu et al. (2021)           74.7%
Ours (local)               73.79%
 + Global                  73.69%
In addition, we notice that a large number of entities in ATIS have a form like "ap:denvor" (Denver Airport). We thus use a combination of character-level ELMo embeddings (Peters et al., 2018) and word-level GloVe embeddings (Pennington et al., 2014). This further improves the accuracy, outperforming the previous methods by ∼1.9% in the setting without data augmentation.
CoNaLa dataset. For CoNaLa, BLEU is treated as the main metric in previous work (Yin and Neubig, 2019), because accuracy is generally very low (<3%) on this dataset. From Table 2, we observe that our globally normalized model improves the BLEU scores on both the development and test sets compared with the locally normalized baseline. Such improvement is consistent with that on ATIS.
We further compare our model with Yin and Neubig (2019), who rerank beam search results by heuristics. Our method is outperformed by this reranking approach. Note that reranking can be viewed as alleviating label bias through post-processing, since the locally normalized model fails to assign the highest joint probability to the correct sequence. However, the reranking method requires training several reranking scorers, combined with an ad hoc feature (namely, length). By contrast, our global normalization does not rely on such ad hoc human engineering.
Spider dataset. Table 3 lists the results on the Spider dataset. Here, our locally normalized model uses BERT as the encoder, and its performance is on par with recent state-of-the-art approaches (Rubin and Berant, 2020; Yu et al., 2021). However, our global normalization does not improve the performance. Note that BERT is a more powerful model than an LSTM, and Spider has a much larger training set than CoNaLa and ATIS. We conjecture that BERT learns the step-by-step local prediction probabilities very well, which in turn yields a satisfactory joint probability and largely mitigates label bias by itself. Therefore, the globally normalized model does not exhibit its superiority on the Spider dataset.

Conclusion
In this work, we propose to apply global normalization to neural semantic parsing. Our approach predicts unbounded scores for grammar rules at each autoregressive step, and thus does not suffer from the label bias problem. We observe that our proposed method improves performance on small datasets with LSTM-based encoders. However, global normalization becomes less effective on a large dataset with a BERT architecture.