Graph-Based Decoding for Task Oriented Semantic Parsing

The dominant paradigm for semantic parsing in recent years is to formulate parsing as a sequence-to-sequence task, generating predictions with auto-regressive sequence decoders. In this work, we explore an alternative paradigm. We formulate semantic parsing as a dependency parsing task, applying graph-based decoding techniques developed for syntactic parsing. We compare various decoding techniques given the same pre-trained Transformer encoder on the TOP dataset, including settings where training data is limited or contains only partially-annotated examples. We find that our graph-based approach is competitive with sequence decoders on the standard setting, and offers significant improvements in data efficiency and settings where partially-annotated data is available.


Introduction
Semantic parsing, the task of mapping natural language queries to structured meaning representations, remains an important challenge for applications such as dialog systems. To support compositional utterances in a task-oriented dialog setting, Gupta et al. (2018) introduced the Task Oriented Parse (TOP) representation and released a dataset consisting of pairs of natural language queries and associated TOP trees. As illustrated in Figure 1, TOP trees are hierarchically structured representations consisting of intents, slots, and query tokens.
We propose a novel formulation of semantic parsing for TOP as a graph-based parsing task, presenting a graph-based parsing model (hereafter, GBP). Our approach is motivated by the success of such approaches in dependency parsing (McDonald et al., 2005; Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017; Kulmizev et al., 2019) and AMR parsing (Zhang et al., 2019). Recently, sequence-to-sequence (seq2seq) models have become a dominant approach to semantic parsing (e.g., Dong and Lapata 2016; Jia and Liang 2016; Wang et al. 2019a), including on TOP (e.g., Rongali et al. 2020; Aghajanyan et al. 2020; Shao et al. 2020). Unlike such approaches, which predict outputs auto-regressively, GBP decomposes parse tree scores over parent-child edge scores, predicting all edge scores in parallel.

* Work done while on internship at Google.
First, we compare GBP with seq2seq and other decoding techniques, within the context of a fixed encoder and pretraining scheme: in this case, BERT-Base (Devlin et al., 2019). This allows us to isolate the role of the decoding method. We compare these models across the standard setting, as well as additional settings where training data is limited, or when fully annotated examples are limited but partially annotated examples are available. We find that GBP outperforms other methods, especially when learning from partial supervision. Second, we compare GBP with seq2seq models that additionally leverage pretrained decoders. We find that GBP remains competitive, and continues to outperform in the partial supervision setting.

Task Formulation
We present a novel formulation of the TOP semantic parsing task as a graph-based parsing task. Our goal is to predict a TOP tree y given a natural language query x as input. The nodes in y consist of intent and slot symbols from a vocabulary of output symbols V and the tokens in x. However, y cannot be predicted directly by a conventional

Figure 2: The graph-based model predicts parent assignments across a set of nodes consisting of query tokens, output symbols for intents and slots, and special UNUSED and ROOT symbols. This is the parse tree corresponding to the TOP tree shown in Figure 1. Not all output symbols are drawn; omitted symbols are attached to UNUSED. Intent and slot names are abbreviated.
graph-based approach (McDonald et al., 2005) for two reasons. First, given x, we do not know the subset of intent and slot [1] symbols that occur in y. Second, intent and slot symbols can occur more than once in y. [2] To address this, we consider a parse tree z in a space of valid trees Z(x). The parse tree z can be deterministically mapped to and from y. It consists of: (1) the tokens in x; (2) every symbol in V, replicated up to a maximum number of occurrences [3] and assigned a corresponding index; and (3) a special UNUSED node in addition to the standard ROOT node. Let N(x) be this set of nodes, which all trees in Z(x) share. When mapping from y to z, output symbols occurring multiple times are indexed following a pre-order traversal, and any output symbol that does not occur in y is attached to the UNUSED node in z. For example, Figure 1 illustrates an example TOP tree, y, and Figure 2 illustrates the corresponding parse tree, z.
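The construction of the node set N(x) can be sketched as follows. This is an illustrative reading of the formulation above, not the paper's implementation; the function and argument names are hypothetical, and the k + 2 replication rule is taken from the footnote on symbol repetitions.

```python
# Sketch: build the fixed node set N(x) for a query. Each output symbol
# is replicated (max training occurrences + 2) times; tokens, ROOT, and
# UNUSED are also nodes. Names here are illustrative assumptions.

def build_node_set(tokens, vocab, max_occurrences):
    """Return the node list: ROOT, UNUSED, replicated symbols, then tokens.

    `max_occurrences[s]` is the maximum count of symbol `s` observed in
    any single training tree.
    """
    nodes = ["ROOT", "UNUSED"]
    for symbol in vocab:
        for index in range(max_occurrences[symbol] + 2):
            nodes.append((symbol, index))  # e.g. ("SL:DATE_TIME", 0)
    nodes.extend(tokens)  # query tokens are nodes too
    return nodes

nodes = build_node_set(
    tokens=["events", "this", "weekend"],
    vocab=["IN:GET_EVENT", "SL:DATE_TIME"],
    max_occurrences={"IN:GET_EVENT": 1, "SL:DATE_TIME": 2},
)
```

Every tree in Z(x) is then a parent assignment over this fixed node set, which is what allows all edge scores to be predicted in parallel.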

Scoring Model
Given that the mapping between y and z is deterministic, our goal is to model p(z | x). We follow a conventional edge-factored graph-based approach (McDonald et al., 2005), decomposing parse tree scores over directed edges between parent and child node pairs (p, c) in z:

    p(z | x) = exp( Σ_{(p,c) ∈ z} φ(p, c, x) ) / Σ_{z' ∈ Z(x)} exp( Σ_{(p',c') ∈ z'} φ(p', c', x) ),

[1] Note that one could imagine treating slots as edge labels instead of nodes, but as the set is large (36 slots for 25 intents), little advantage would be expected.
[2] See Figure 5 in the Appendix for an example.
[3] The number of replications per output symbol is determined from the training data: if a symbol has a maximum of k occurrences in a TOP tree in the training data, it will have k + 2 replications. See Appendix C for more information.
where edge scores φ(p, c, x) are computed similarly to the model of Dozat and Manning (2017):

    φ(p, c, x) = (e_p^x)ᵀ U e_c^x + uᵀ e_p^x,

where e_p^x and e_c^x are contextualized vector representations of the nodes p and c, respectively, and U and u are a parameter matrix and vector, respectively. Node representations are computed differently for each node type in N(x). Encodings for token nodes are based on the output of a BERT (Devlin et al., 2019) encoder; replicated output symbols are embedded based on their symbol and index; the ROOT and UNUSED nodes likewise have unique embeddings. All nodes are then jointly encoded with a Transformer (Vaswani et al., 2017) encoder, which produces the contextualized node representations e_p^x and e_c^x used in the above equations to produce the factored edge scores.
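A minimal numpy sketch of a biaffine edge scorer in this style is shown below. This follows the general form of Dozat and Manning (2017); the exact placement of the bias term (here on the parent side) and the dimensions are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def edge_scores(E, U, u):
    """Score all directed (parent p, child c) edges in parallel.

    E: (n, d) contextualized node representations.
    U: (d, d) parameter matrix; u: (d,) parameter vector.
    Returns S where S[p, c] = E[p] @ U @ E[c] + u @ E[p].
    """
    bilinear = E @ U @ E.T        # (n, n): bilinear term for every pair
    bias = (E @ u)[:, None]       # (n, 1): parent-side bias, broadcast over c
    return bilinear + bias

rng = np.random.default_rng(0)
n, d = 5, 8
E = rng.normal(size=(n, d))
U = rng.normal(size=(d, d))
u = rng.normal(size=d)
S = edge_scores(E, U, u)      # S[p, c] scores p as the parent of c
```

Because every entry of S is computed independently, no sequential decoding is needed to obtain the full edge-score matrix.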
The scoring model is trained using a standard maximum likelihood objective.

Chu-Liu-Edmonds Algorithm
The Chu-Liu-Edmonds (CLE) algorithm finds a maximum spanning arborescence over a directed graph (Chu and Liu, 1965; Edmonds, 1965). It has commonly been used for parsing dependency trees from edge-factored scoring models (e.g., McDonald et al. 2005; Dozat and Manning 2017). Note that in an arborescence (hereafter, tree), each node has at most one 'parent', or incoming edge. Thus, the algorithm first chooses the highest-scoring parent for each node as the initial best parent. These initial best parents may already form a tree; however, they may instead produce a graph with cycles. In that case, CLE recursively breaks the cycles until the optimal tree is found. Note that CLE takes the index of the tree's root as an input, and begins by deleting all of the root's incoming edges to enforce this constraint. Conventionally, in dependency parsing, the root of the tree is the special ROOT node.
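The first phase described above can be sketched as follows. This illustrates only the greedy best-parent step and cycle detection; the recursive cycle contraction of full CLE is omitted, and the function names are our own.

```python
# Sketch of the first phase of Chu-Liu-Edmonds over a dense score
# matrix: pick each node's best parent, then look for a cycle.
# Node 0 plays the role of the root and receives no parent.

def initial_best_parents(scores):
    """scores[p][c] = score of edge p -> c; returns parent[c] for c >= 1."""
    n = len(scores)
    return {
        c: max(range(n), key=lambda p: scores[p][c] if p != c else float("-inf"))
        for c in range(1, n)
    }

def find_cycle(parent):
    """Return the set of nodes on a cycle in the parent map, or None."""
    for start in parent:
        seen = []
        node = start
        while node in parent and node not in seen:
            seen.append(node)
            node = parent[node]
        if node in seen:  # walked back onto the path: cycle found
            return set(seen[seen.index(node):])
    return None
```

For example, if nodes 1 and 2 each score highest as the other's parent, the greedy step produces the cycle {1, 2}, which full CLE would then contract and re-score.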
This algorithm is optimal for dependency parsing; however, our formulation differs due to additional constraints based on how TOP trees are mapped to and from dependency trees. First, by convention, the parent of the UNUSED subtree must be ROOT. Second, the UNUSED subtree must be of depth 2: it cannot have any grandchildren. Finally, as valid TOP trees have only one root, the ROOT node must have only one 'child', or outgoing edge.

Unused Node Preprocessing
As stated, our UNUSED subtree must have depth 2 to follow our task formulation. Otherwise, the final tree score would be computed incorrectly when translating to a TOP tree, as the entire UNUSED subtree is effectively discarded. Thus, we first preprocess the UNUSED subtree to ensure depth 2. In practice, simply using the initial best parents results in a UNUSED subtree of depth 3 or greater about 1% of the time.
We resolve such cases by making a decision for each node a whose initial best parent is UNUSED and that has children of its own. One option is to delete the edge from UNUSED to a, making the next-highest-scoring edge the new best parent of a. The cost of this action is the difference in scores between the two edges. Alternatively, we can take a similar action on each child of a: delete its edge from a, making the next-highest-scoring edge its new best parent. The cost of this action is the corresponding score difference summed over every child of a. We iterate over the children of UNUSED that themselves have children, selecting the action with the lower cost, until the constraint is met. Then, we allow no further modifications to the UNUSED subtree, effectively deleting it for the remaining stages of the algorithm.
Note that this algorithm is not necessarily optimal: the order in which we consider the children of UNUSED can affect the final result. However, we find this approximation to work well in practice.
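The cost comparison at the core of this step can be sketched as below. The helper structure is hypothetical: `best[v]` and `second[v]` stand for the scores of node v's best and next-best incoming edges, which a real implementation would read off the edge-score matrix.

```python
# Sketch of choosing the cheaper repair when node `a` hangs off UNUSED
# but has children of its own. Re-parenting a node costs the score drop
# from its best incoming edge to its next-best one.

def cheaper_action(a, children_of_a, best, second):
    """Return which repair is cheaper: move `a`, or move all its children."""
    cost_a = best[a] - second[a]
    cost_children = sum(best[c] - second[c] for c in children_of_a)
    return "reparent_a" if cost_a <= cost_children else "reparent_children"

best = {"a": 5.0, "c1": 4.0, "c2": 4.0}
second = {"a": 1.0, "c1": 3.5, "c2": 3.5}
action = cheaper_action("a", ["c1", "c2"], best, second)
```

In this toy case, moving `a` would cost 4.0 while moving both children costs only 1.0, so the children are re-parented instead.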

Multiple Root Resolution
Our second modification to the CLE algorithm concerns the ROOT node. Valid TOP trees are single-rooted: in our formalism, this means the ROOT node can have only a single child. To enforce this constraint, we want to choose the single child of ROOT that results in the highest-scoring tree. We then provide this child's index to the CLE subroutine and delete all edges from ROOT, effectively discarding it. To find the best root, we start with the set of nodes whose initial best parent is the ROOT node. If this set is a singleton, we simply choose that node as the tree's root, providing its index to the CLE subroutine. In about 0.5% of trees, there is more than one such node. In that case, we run the CLE algorithm with each such node as the given root index, taking the highest-scoring tree. This is still not guaranteed to be optimal: the optimal choice of root node could have an initial best parent other than ROOT. However, we did not observe this in our experiments, and trying every node drastically increases the computational cost.
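The selection over candidate roots can be sketched as below. Here `cle` is a stand-in for a full Chu-Liu-Edmonds routine returning a (tree, score) pair for a given root; a trivial stub makes the selection logic itself runnable.

```python
# Sketch of multiple-root resolution: run CLE once per candidate root
# and keep the highest-scoring resulting tree.

def best_single_rooted_tree(candidate_roots, cle):
    """cle(root) -> (tree, score); return the best (tree, score) pair."""
    return max((cle(root) for root in candidate_roots), key=lambda t: t[1])

# Hypothetical stub standing in for a real CLE routine: pretend that
# rooting the tree at node 2 yields the best score.
stub_cle = lambda root: ({"root": root}, 10.0 if root == 2 else 7.5)
tree, score = best_single_rooted_tree([1, 2, 3], stub_cle)
```

Since the candidate set is small in practice (more than one candidate in only about 0.5% of trees), the extra CLE calls add little overhead compared with trying every node.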

Experiments
The TOP dataset consists of trees where every token in the query is attached to either an intent (prefixed with IN:) or a slot label (prefixed with SL:). Intents and slot labels can also attach to each other, forming compositional interpretations. We evaluate several models on the standard setup of the TOP dataset. We also devise new setups comparing the abilities of several models to learn from a smaller amount of fully annotated data, both with and without additional partially annotated data. Models are compared on exact match accuracy. Following Rongali et al. (2020), Einolghozati et al. (2018), and Aghajanyan et al. (2020), we filter out queries annotated as unsupported, leaving 28,414 training examples and 8,241 test examples.

Standard Supervision
We use standard supervision to refer to settings where all training examples contain a complete output tree. We also evaluate data efficiency, by comparing the performance when training data is limited to 1% or 10% of the original dataset.

Partial Supervision
We use partial supervision to refer to settings where we discard labels for certain nodes in the output trees of some or all training examples. Such partially annotated examples could arise in practice: for instance, when there is annotator disagreement on part of the output tree, or when changes to the set of possible slots or intents render parts of previously annotated trees obsolete.
As semantic parsing datasets normally require expert annotators, extending fully annotated examples with additional partial annotation can be an effective strategy. For instance, Choi et al. (2015) scaled their semantic parsing model with partial ontologies, and Das and Smith (2011) used additional semi-supervised data for their frame semantic parsing model. We consider two types of partially annotated output trees described below.
Terminal-only Supervision For this type of partial supervision, only the label of each token (i.e., terminal) is preserved. See Figure 3 for an example. The label for each individual token is known, but the full set of intents and slots, and their tree structure, is unknown. This is similar to utilizing span labels when full trees are not available.
Nonterminal-only Supervision For this type of partial supervision, token (i.e., terminal) labels are discarded. This is equivalent to deleting all of the token nodes from the tree. See Figure 4 for an example. This provides the opposite type of supervision from the terminal-only case. The complete set of intents and slots and their tree structure is known, but their anchoring to the query text is unknown. For instance, if a query is known to have the same parse as a fully annotated query, its grounding may still be unknown.
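The two variants can be illustrated by filtering a fully annotated tree's edges. The representation here is our own simplification: edges are a child-to-parent map, tokens are plain strings, and intent/slot nodes are (symbol, index) tuples.

```python
# Sketch of deriving the two partial-supervision variants from a fully
# annotated tree, represented as a child -> parent map.

def terminal_only(edges):
    """Keep only the edges whose child is a token (its immediate label)."""
    return {c: p for c, p in edges.items() if isinstance(c, str)}

def nonterminal_only(edges):
    """Keep only edges among intent/slot symbols; token edges are dropped."""
    return {c: p for c, p in edges.items() if not isinstance(c, str)}

edges = {
    ("IN:GET_EVENT", 0): "ROOT",
    ("SL:DATE_TIME", 0): ("IN:GET_EVENT", 0),
    "events": ("IN:GET_EVENT", 0),
    "this": ("SL:DATE_TIME", 0),
    "weekend": ("SL:DATE_TIME", 0),
}
```

Terminal-only supervision keeps each token's parent label but loses the structure among symbols; nonterminal-only supervision keeps the symbol structure but loses the anchoring to tokens.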

Comparisons with Fixed Encoder
We first compare GBP with other methods using the same pre-trained encoder (BERT-Base; Devlin et al. 2019). We compare with a standard sequence decoder (a pointer-generator network; Vinyals et al. 2015; See et al. 2017) implemented using a Transformer-based (Vaswani et al., 2017) decoder (PtrGen). We report the previous results from Rongali et al. (2020) and new results from an implementation based on that of Suhr et al. (2020), which provides a slightly stronger baseline. We also compare with the factored span parsing (FSP) approach of Pasupat et al. (2019). Notably, we report new results for FSP using a BERT-Base encoder, which are significantly stronger than previously published results using GloVe (Pennington et al., 2014) embeddings (85.1% vs. 81.8%).
Results can be found in Table 1. We evaluate these models across both the standard and partial supervision settings. Notably, GBP can incorporate partial supervision in a straightforward way because scores for parse trees are factored over conditionally independent scores for each edge. Training proceeds as described in Section 3; however, the loss from edges not given by the example is masked. Additional training details can be found in Appendix B. For PtrGen, each type of partial supervision is given a task-specific prefix; details are in Appendix A. Similar to GBP, FSP factors parse scores across local components, but also considers chains of length > 1. Therefore, terminal-only supervision uses only length-1 chains; there is no trivial way to use nonterminal-only supervision without very substantial changes.
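The loss masking for GBP can be sketched as follows. This is a pure-Python illustration of masking unsupervised edges out of a per-node parent cross-entropy; the function name and the per-node softmax formulation are our own simplifications, and a real implementation would be vectorized.

```python
import math

# Sketch: each node's parent distribution contributes to the loss only
# when its gold parent is annotated; edges without annotation (None)
# are masked out, as in the partial supervision setting.

def masked_parent_loss(logits, gold_parent):
    """logits[c][p]: score of p as parent of c.
    gold_parent[c]: an index, or None when the edge is unannotated."""
    total, count = 0.0, 0
    for c, gold in enumerate(gold_parent):
        if gold is None:
            continue  # partially annotated: skip this edge's loss
        log_z = math.log(sum(math.exp(s) for s in logits[c]))
        total += log_z - logits[c][gold]  # negative log-likelihood
        count += 1
    return total / max(count, 1)
```

Because each edge's loss term is independent, discarding terms for unannotated edges requires no change to the rest of training, which is what makes partial supervision straightforward for an edge-factored model.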
GBP is the highest-performing of the BERT-base models on the standard setup. Both GBP and FSP show better data efficiency than PtrGen. Only GBP appears to effectively benefit from partially annotated data in our experiments; the other models perform worse when incorporating this data.

Comparisons with Pretrained Decoders
Recently, sequence-to-sequence models with pretrained decoders, such as BART (Lewis et al., 2019) and T5 (Raffel et al., 2020), have demonstrated strong performance on a variety of tasks. Careful comparisons isolating the effects of model size and pretraining tasks are limited by the availability of pretrained checkpoints for such models. Nevertheless, we compare GBP (with BERT-Base) directly with such models. On the standard setting for TOP, Aghajanyan et al. (2020) report state-of-the-art performance with BART (87.1%), outperforming GBP. We also report new results comparing GBP with T5 on both the standard supervision and partial supervision settings in Table 2.
Notably, T5 is able to leverage partially-annotated examples much more effectively than PtrGen, which is also a Transformer-based sequence-to-sequence model but lacks a pretrained decoder. While T5 outperforms GBP on the standard setting, GBP outperforms T5 in the data efficiency and partial supervision settings.

Related Work
The most recent state-of-the-art results on TOP have come from new methods of pretraining: Rongali et al. (2020), Shao et al. (2020), and Aghajanyan et al. (2020) all use seq2seq methods with similar model architectures, enhanced by better pretraining from BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and BART (Lewis et al., 2020). In this work, we instead investigate the choice of decoder. While the FSP model (Pasupat et al., 2019) similarly uses a factored approach, it is more specific to TOP, as its trees must be projective and anchored to the input text. In dependency parsing, the performance of graph-based and transition-based parsing is compared in both Zhang and Clark (2008) and Kulmizev et al. (2019). Graph-based parsing has also been used in AMR parsing (Zhang et al., 2019), which translates sentences into structured graph representations. Similar methods have also been used in semantic role labeling (He et al., 2018), which requires labeling arcs between text spans. This work is the first to adapt graph-based parsing to tree-structured task-oriented semantic parses.

Conclusions
We propose a novel framing of semantic parsing for TOP as a graph-based parsing task. We find that our proposed method is a competitive alternative to the standard paradigm of seq2seq models, especially when fully annotated data is limited and/or partially-annotated data is available.

Ethical Considerations
We fine-tune all models using 32 Cloud TPU v3 cores. Additional training details are in Appendix A and Appendix B. We reused existing pretrained checkpoints for both BERT and T5, reducing the resources needed to run experiments. Our evaluation focuses on the existing TOP dataset: details of its collection can be found in Gupta et al. (2018). TOP is an English-only dataset, which limits our ability to claim that our findings generalize across languages. A deployed dialog system has additional ethical considerations related to access, given its potential to make certain computational functions faster, easier, or more hands-free.

B GBP Training Details
Nodes are jointly encoded with a Transformer (Vaswani et al., 2017) with 4 attention heads, 768 dims, and a dropout rate of 0.3. We use a hidden size of 1024 for computing edge scores, similarly to Dozat and Manning (2017). Cross-entropy loss is minimized with the optimizer described in Devlin et al. (2019).
For partial supervision experiments, the loss is masked for unsupervised edges.
The model is trained for 20,000 steps with a learning rate of 0.0001 and 2,000 warmup steps. All hyperparameters were chosen by grid search based on validation set exact match accuracy. BERT-Base has approximately 110M parameters, and GBP introduces approximately 13M additional parameters, for a total of approximately 123M parameters. Note that larger versions of BERT did not lead to performance improvements in our experiments.
A comparison of validation performance can be found in Table 3 (the validation set without unsupported queries has 4,032 examples). All tested hyperparameter values can be found in Table 4. We estimate approximately 1,000 total training runs during the development cycle. After tuning hyperparameters on the full training set, no re-tuning occurred: partial supervision and data efficiency experiments used the same setup. Model training takes approximately 45 minutes.

Figure 5: Example TOP tree with two occurrences of SL:DATE_TIME. When mapping from TOP trees to the parse trees predicted by our model, each instance of SL:DATE_TIME is assigned an index based on its pre-order position in the TOP tree.

C Repeated Nodes
See Figure 5 for an example of a TOP tree with repeated nodes.
We chose to pad occurrences based on the observation that certain nodes can occur more times at test time than they do in the training set. About half of the nodes only ever occur once. On the validation set, 2 additional replications was the highest value before performance degraded.
There are many alternatives to our handling of repeated nodes. For instance, Zhang et al. (2019) addressed a slightly different task, but we could have adopted their approach of generating the node set auto-regressively. Unfortunately, this would have complicated our method of partial supervision. Another option is a fixed number of duplications for every symbol; this worked slightly worse in practice, based on validation set performance. Alternatively, the model could have learned to predict the number of occurrences via regression, as has been used in non-autoregressive machine translation (e.g., Wang et al. 2019b). We leave such approaches to future work.

D Full Data
Results on the full dataset (including unsupported intents) can be found in