TreePiece: Faster Semantic Parsing via Tree Tokenization

Autoregressive (AR) encoder-decoder neural networks have proved successful in many NLP problems, including Semantic Parsing, a task that translates natural language to machine-readable parse trees. However, the sequential prediction process of AR models can be slow. To accelerate AR for semantic parsing, we introduce a new technique called TreePiece that tokenizes a parse tree into subtrees and generates one subtree per decoding step. On the TOPv2 benchmark, TreePiece shows 4.6 times faster decoding speed than standard AR, and comparable speed but significantly higher accuracy compared to Non-Autoregressive (NAR) methods.


Introduction
Autoregressive (AR) modeling (Sutskever et al., 2014) is a commonly adopted framework in NLP in which the next prediction is conditioned on previously generated tokens. This paper focuses on the AR approach to Semantic Parsing (Wong, 2005), an NLP task that converts a natural language utterance into a machine-interpretable symbolic representation called a logical form. The sequence of actions that derives a logical form is isomorphic to a directed tree and is often referred to as a parse tree (Zettlemoyer and Collins, 2005).
The runtime latency of AR correlates linearly with the output length and can result in low inference speed (Gu et al., 2017; Wang et al., 2018). Non-Autoregressive (NAR) modeling (Gu et al., 2017; Wei et al., 2019; Ma et al., 2019), on the other hand, produces outputs in parallel and reduces latency by an order of magnitude (Ghazvininejad et al., 2019). However, NAR performs considerably worse than its AR counterparts without extra training recipes (Wang et al., 2019; Zhou and Keung, 2020; Su et al., 2021). The quality benefits of AR models therefore motivate us to improve their speed rather than explore NAR.

Our contributions
• We propose a novel approach that tokenizes parse trees into larger units called TreePiece units and builds an AR model that predicts one TreePiece unit at a time, thus reducing the number of steps needed to generate a full parse tree. To the best of our knowledge, we are the first to extend subword-tokenization algorithms to semantic trees such that each token is a subtree.
• We validate our approach on the TOPv2 benchmark and show that TreePiece decoding is 6.1 times faster than standard AR with less than 0.15% accuracy degradation, and nearly as fast as NAR with up to 0.7% accuracy gains.
• We provide theoretical proofs to support our main algorithms and their variants.

Parse tree
In this paper, we utilize hierarchical semantic representations based on intents and slots (Gupta et al., 2018), which allow modeling complex compositional queries in task-oriented dialog systems; see Figure 2 for an illustration. We define a few recurring notions:

Definition 2.1 (Ontology). A parse tree node is called an ontology iff it represents an intent/slot, prefixed by in: and sl: respectively.

Definition 2.2 (Skeleton). The skeleton of a parse tree is the subtree that consists of all ontologies.

Definition 2.3 (Utterance leaf). A text-span node is called an utterance leaf iff its parent is a slot.
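To make these notions concrete, here is a minimal Python sketch of a parse tree with an ontology test and skeleton extraction; the intent/slot names are invented for illustration, and the example tree mirrors the Figure 2 utterance.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str                     # e.g. "in:CREATE_REMINDER", "sl:TODO", or a text span
    children: List["Node"] = field(default_factory=list)

def is_ontology(node: Node) -> bool:
    # Definition 2.1: a node is an ontology iff it is an intent ("in:") or slot ("sl:").
    return node.label.startswith(("in:", "sl:"))

def skeleton(node: Node) -> Node:
    # Definition 2.2: keep exactly the ontology nodes, dropping utterance
    # leaves (text-span children of slots, Definition 2.3).
    return Node(node.label, [skeleton(c) for c in node.children if is_ontology(c)])

# Hypothetical tree for the Figure 2 utterance (labels are invented):
tree = Node("in:CREATE_REMINDER", [
    Node("sl:PERSON_REMINDED", [Node("me")]),
    Node("sl:TODO", [Node("send Susan an email about the meeting tonight")]),
])
print(skeleton(tree))  # the intent with its two slot children, no text spans
```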

We propose Algorithm 1 as our TreePiece tokenizer: a Viterbi-type (Viterbi, 1967) forward-backward (Nagata, 1994a) algorithm that computes the optimal tokenization and its probability for a given skeleton S.

Notations in Algorithm 1:
(1) T is the set of all subtrees of S that share the root of S; T admits a natural filtration, where T_d is the set of all depth-d subtrees.
(2) L is the log probability on T, computed recursively during the forward step as
$$L(t) = \max_{t' \in \mathrm{Filter}(\mathcal{T},\, t)} \big[ L(t') + \log p(t' \Delta t) \big],$$
where V is the TreePiece vocabulary, p the TreePiece simplex over V, and Filter is described in (3) below.
(3) For efficiency we apply Filter(•, t) to restrict to subtrees t′ such that (a) t′ is a subtree of t, and (b) the set difference t′∆t has exactly one connected component, which is a TreePiece unit.
In summary, the forward step uses dynamic programming, inducting on tree depth, to update all subtrees' log probabilities and eventually obtain P(S; p), the probability of the skeleton S. The forward step also returns a map P that stores, for each t ∈ T, the optimal position of its previous partition. In the backward step we then backtrack along the path S, P(S), P(P(S)), ... to recover the optimal partition π_S(p).

Stage 1, Build the vocabulary: following BPE-style merging (Gage, 1994; Sennrich et al., 2015), we obtain the TreePiece vocabulary V and a map F_0 between TreePiece units and their frequencies in the training skeleton corpus $\mathcal{S}$. For details see Appendix A.
Stage 2, Update p: initialize the TreePiece simplex p_0 as the normalized frequency p_0(t) =: F_0(t)/Σ_{τ∈V} F_0(τ) for all t ∈ V, and then solve for p_{i+1} iteratively:
$$p_{i+1} = \operatorname*{argmax}_p \sum_{S \in \mathcal{S}} \mathbb{E}_{\Pi_S}\big[\log \mathbb{P}(S, \pi; p) \,\big|\, S; p_i\big]. \tag{1}$$
In general, problem (1) is NP-hard, as it involves summing over Π_S, the set of all possible partitions π of a skeleton S:
$$\mathbb{E}_{\Pi_S}\big[\log \mathbb{P}(S, \pi; p) \,\big|\, S; p_i\big] = \sum_{\pi \in \Pi_S} \mathbb{P}(\pi \mid S; p_i) \sum_{\tau \in \pi} \log p(\tau). \tag{2}$$
To solve (1) in polynomial time, we propose Algorithm 2 (whose E-step uses Algorithm 1) and impose the following assumption on the joint distribution of S and π:
$$\mathbb{P}(\pi \mid S; p) = \mathbb{1}_{\{\pi = \pi_S(p)\}}, \tag{3}$$
where π_S(p) = argmax_{π∈Π_S} Π_{τ∈π} p(τ). Applying (3), we see that all but one summand in (2) vanish.

Theorem 2.6. Under assumption (3), each update p_{i+1} computed by Algorithm 2 solves problem (1), and the log likelihood Σ_{S∈𝒮} log P(S; p_i) is non-decreasing in i.
For proof of Theorem 2.6, see Appendix B.
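For concreteness, the following Python sketch mirrors the E/M loop of Algorithm 2 under assumption (3); viterbi_partition is a stand-in for Algorithm 1 (assumed to return the Viterbi partition π_S(p) of a skeleton as a list of units), and freqs plays the role of F_0.

```python
from collections import Counter

def algorithm2(skeletons, vocab, freqs, viterbi_partition, num_iters=5):
    # Stage 2 initialization: p_0(t) = F_0(t) / sum_tau F_0(tau).
    total0 = sum(freqs[t] for t in vocab)
    p = {t: freqs[t] / total0 for t in vocab}
    for _ in range(num_iters):
        counts = Counter()
        for S in skeletons:
            # E-step: under assumption (3) the expectation over Pi_S collapses
            # onto the single Viterbi partition pi_S(p_i) found by Algorithm 1.
            for tau in viterbi_partition(S, p):
                counts[tau] += 1            # accumulates n(pi_S(p_i), tau)
        total = sum(counts.values())
        # M-step: the maximizer of the collapsed objective is the normalized
        # count vector on the TreePiece simplex.
        p = {t: counts[t] / total for t in vocab}
    return p
```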

Modeling
We describe the model that generates parse tree components and the method to piece them together.

Modeling mechanism
As illustrated in Figure 1, an encoder computes the hidden states of a given utterance; an AR decoder then consumes the encoder hidden states and generates TreePiece units autoregressively. The technique in Subsection 2.3.2 allows us to put these units together and obtain a full skeleton. The skeleton uniquely determines the number (denoted by N) and positions of all utterance leaves (see Figure 2), which lets us use an NAR decoder to generate all utterance leaves within one step. For the NAR utterance decoder, we closely follow Ghazvininejad et al. (2019): we prepare the NAR decoder's input by concatenating the embeddings of the predicted TreePiece units and N mask tokens. Each decoder layer then performs self-attention as well as cross-attention with the encoder hidden states. Lastly, the decoder generates utterance predictions at these N masked positions.
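A minimal PyTorch sketch of this one-step utterance decoder follows; the module sizes and the learned mask embedding are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LeafDecoder(nn.Module):
    def __init__(self, tp_vocab_size, word_vocab_size, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        self.tp_embed = nn.Embedding(tp_vocab_size, d_model)
        self.mask_embed = nn.Parameter(torch.zeros(1, 1, d_model))  # learned [MASK]
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.out = nn.Linear(d_model, word_vocab_size)

    def forward(self, treepiece_ids, n_leaves, enc_hidden):
        # treepiece_ids: (B, T) units from the AR decoder; enc_hidden: (B, L, D).
        B = treepiece_ids.size(0)
        masks = self.mask_embed.expand(B, n_leaves, -1)   # one [MASK] per leaf
        x = torch.cat([self.tp_embed(treepiece_ids), masks], dim=1)
        # Bidirectional self-attention over the concatenated input plus
        # cross-attention with the encoder hidden states, in a single step.
        h = self.decoder(x, enc_hidden)
        return self.out(h[:, -n_leaves:])                 # logits at the N masked slots
```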

Assemble TreePiece units
Unlike subword tokenization, where the original sentence can be trivially recovered from subword units via string concatenation, there is no canonical way to reassemble TreePiece units. To overcome this issue, we allow TreePiece units to contain placeholders (finding all possible placeholder patterns is NP-hard and unnecessary; Appendix D provides a practical solution) and require that two units can only be joined at a placeholder node. This design provides a unique way to glue a sequence of ordered (e.g., pre- or level-ordered) TreePiece units, as shown in Figure 2.
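One natural implementation of this rule for pre-ordered units fills the first remaining placeholder in pre-order with each incoming unit; the sketch below assumes the Node class from the parse-tree sketch above and a hypothetical <ph> placeholder label.

```python
PLACEHOLDER = "<ph>"  # hypothetical label marking a missing child

def find_placeholder(node):
    # Return (parent, child_index) of the first placeholder in pre-order.
    for i, child in enumerate(node.children):
        if child.label == PLACEHOLDER:
            return node, i
        hit = find_placeholder(child)
        if hit is not None:
            return hit
    return None

def assemble(units):
    # Glue pre-ordered TreePiece units: each incoming unit replaces the first
    # remaining placeholder, the only position where joining is allowed.
    root = units[0]
    for unit in units[1:]:
        parent, i = find_placeholder(root)
        parent.children[i] = unit
    return root
```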

Datasets
We train, validate, and test our approach on the publicly available benchmark TOPv2 (Chen et al., 2020), a multi-domain task-oriented semantic parsing dataset. The dataset provides a training/validation/test split. Throughout our experiments, we use the training split to train the models; the validation split for early stopping, model checkpointing, and hyperparameter tuning; and the test split to report the best model's performance.

Metrics
We evaluate model performance on two metrics: Exact Match (EM) and Exact Match of Skeleton (EM-S), defined as the percentage of utterances whose logical forms (respectively, skeletons) are correctly predicted (Shrivastava et al., 2022).
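Both metrics reduce to exact tree comparison; a small sketch, assuming the Node and skeleton helpers from the earlier sketch:

```python
def exact_match(preds, golds, project=lambda t: t):
    # Percentage of utterances whose (projected) trees are predicted exactly.
    hits = sum(project(p) == project(g) for p, g in zip(preds, golds))
    return 100.0 * hits / len(golds)

# EM compares full logical forms; EM-S first projects both sides onto their
# skeletons (dropping utterance leaves) before comparing:
#   em   = exact_match(preds, golds)
#   em_s = exact_match(preds, golds, project=skeleton)
```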

Baselines
We compare our approach against two baselines: AR and NAR. Both baselines are sequence-to-sequence (seq2seq) models that produce subword units of serialized logical forms. Their output space consists of ontologies (prefixed by the left bracket "["), utterance leaves, and the right bracket "]".
The AR baseline has a standard AR structure: an autoregressive decoder generates serialized logical forms one token at a time.
The NAR baseline adopts mask-predict (Ghazvininejad et al., 2019) with beam size 1: it first predicts the output length and then generates all tokens in one step using a non-autoregressive decoder.

Experiment setup
For TreePiece models, we experiment with 9 different expansion sizes (used in Stage 1, Subsection 2.2.1) varying from 0 to 1200. We optimize the hyperparameters for both baselines and for TreePiece-600, and apply TreePiece-600's hyperparameters to all other TreePiece models. We defer model configurations, training details, and hyperparameter choices to Appendix E.

Quality
As shown in Table 1, the TreePiece model achieves up to 0.7% relative improvement over NAR and less than 0.15% degradation from AR in terms of EM, while achieving the best EM-S score among all approaches, including a 0.8% relative improvement over NAR. We attribute TreePiece's strong skeleton predictions to its ability to respect the tree structure of logical forms: it generates 100% valid outputs by design, so the model can focus on utterance understanding without being distracted by structural issues.
Table 2 further shows TreePiece's advantage over NAR on tasks of higher complexity, with improvements of more than 2% on frames containing more than one intent.

Latency
Table 1 indicates that TreePiece makes decoding 6.1/5.8 times faster and overall inference 3.0/2.5 times faster than AR on GPU/CPU, with only 5% inference latency growth compared to NAR. Figure 3 plots the averaged number of TreePiece decoding steps against AR decoding steps for skeleton generation.

Related work

The Sequence-to-Tree scheme, in particular, was adopted by Dong and Lapata (2016). To speed up inference, non-autoregressive modeling was introduced in Machine Translation (Gu et al., 2017; Lee et al., 2018; Libovický and Helcl, 2018) and later became popular in Semantic Parsing as well (Ghazvininejad et al., 2019; Babu et al., 2021; Shrivastava et al., 2021). However, to match the quality of AR, extra training stages are necessary, such as Knowledge Distillation from AR models (Gu et al., 2017; Lee et al., 2018; Wei et al., 2019; Stern et al., 2019).

Conclusion
This paper proposes a novel way to model and speed up Semantic Parsing by tokenizing parse trees into subtrees. We provide thorough explanations and theoretical support for our technique, and demonstrate significant improvements in speed and quality over common AR and NAR baselines on the TOPv2 benchmark.

Limitations
The proposed TreePiece technique, while evaluated on the TOPv2 dataset, is not intrinsically bound to it. Indeed, our approach requires only two conditions on a dataset for applicability:
• it offers a closed vocabulary of ontologies;
• its logical forms inherently carry tree structures.
As a matter of fact, TreePiece can seamlessly adapt to a broad range of datasets, including WikiSQL (Zhong et al., 2017), WEBQUESTIONS (Berant et al., 2013), SequentialQA (Iyyer et al., 2017), GEOquery (Davis and Meltzer, 2007), Spider (Yu et al., 2018), ATIS (Hemphill et al., 1990), etc. Despite this, we solely focused on showcasing its effectiveness in the specific case of task-oriented natural language understanding based on intents and slots. Additionally, our approach employs standard autoregressive decoding for subtree generation, leaving more efficient decoding techniques unexplored. Lastly, our current tokenization algorithm may introduce out-of-vocabulary (OOV) tokens; while we propose effective ways to reduce the OOV rate, they cannot fully eliminate the OOV phenomenon.

Ethics Statement
Our proposed method presents a novel tokenization operation for tree-like data, with substantial practical implications in semantic parsing domains such as natural language understanding, SQL generation, and code generation. However, it is crucial to acknowledge that, like many other tokenization algorithms, our approach may introduce biases from the training data into the vocabulary and tokenization patterns. Consequently, practitioners need to be mindful of this when curating the training corpus before utilizing our method.

A Appendix: Vocabulary generation
This stage resembles the merging operation in Byte Pair Encoding (BPE) (Gage, 1994; Sennrich et al., 2015). Given a training corpus, denote its skeletons by $\mathcal{S}$. We initialize the TreePiece vocabulary V as the set of ontologies extracted from $\mathcal{S}$, and F_0 as the map between ontologies and their frequencies in $\mathcal{S}$. We then repeat the steps below until V reaches a pre-determined size:
• Count the frequencies of all adjacent but unmerged TreePiece unit pairs in $\mathcal{S}$. Find the most frequent pair p* and its frequency n*.
• Merge p* in every $S \in \mathcal{S}$ that contains p*, add p* to V, and update F_0 with F_0(p*) = n*.
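A simplified sketch of this merge loop follows; it represents each skeleton as a tree of units whose labels are serialized TreePiece units joined by a hypothetical "+" separator, and it elides the placeholder bookkeeping of Appendix D.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List

@dataclass
class Unit:
    label: str                      # serialized TreePiece unit
    children: List["Unit"] = field(default_factory=list)

def count_pairs(skeletons):
    # Frequencies of adjacent (parent-unit, child-unit) pairs; every edge of a
    # unit tree is by construction "unmerged".
    pairs = Counter()
    def walk(u):
        for v in u.children:
            pairs[(u.label, v.label)] += 1
            walk(v)
    for s in skeletons:
        walk(s)
    return pairs

def merge_pair(u, pair, joiner="+"):
    # Merge each adjacent occurrence of `pair` left to right; as in sequence
    # BPE, a parent that was just merged carries a new label and no longer
    # matches `pair` within the same pass.
    i = 0
    while i < len(u.children):
        v = u.children[i]
        if (u.label, v.label) == pair:
            u.label = joiner.join(pair)
            u.children[i:i + 1] = v.children   # promote v's children in place
        else:
            i += 1
    for v in u.children:
        merge_pair(v, pair, joiner)

def build_vocab(skeletons, vocab, freqs, target_size):
    # `vocab`/`freqs` start as the ontology set and its frequency map F_0.
    while len(vocab) < target_size:
        pairs = count_pairs(skeletons)
        if not pairs:
            break
        p_star, n_star = pairs.most_common(1)[0]
        for s in skeletons:
            merge_pair(s, p_star)
        vocab.add("+".join(p_star))
        freqs["+".join(p_star)] = n_star
    return vocab, freqs
```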
B Appendix: Proof of Theorem 2.6

For convenience we adopt the following notations. For a partition π and a unit τ, let $n(\pi, \tau) =: \sum_{\tau_i \in \pi} \mathbb{1}_{\tau = \tau_i}$; in other words, n(π, τ) is the number of appearances of τ in π. Let $\chi(\pi, p) =: \mathbb{1}_{\{\pi = \pi_S(p)\}}$; then $\chi: \Pi_S \times [0,1]^{|V|} \to \{0,1\}$ is locally smooth almost everywhere (under the Lebesgue measure on $[0,1]^{|V|}$); in other words, for a.e. $p \in [0,1]^{|V|}$ and every $\pi \in \Pi_S$ there exists a neighborhood $B_\epsilon(p)$ on which $\chi(\pi, \cdot)$ is constant.

Assumption (3) says that the probability measure P(π|S; p_k) is supported on the singleton {π_S(p_k)}; therefore the following holds for all τ ∈ V:
$$\mathbb{E}_{\Pi_S}\big[n(\pi, \tau) \,\big|\, S; p_k\big] = n\big(\pi_S(p_k), \tau\big). \tag{7}$$
Inserting identity (7) into the right-hand side of (2), the objective of the M-step of Algorithm 2 can be rewritten using only the counts $\sum_{S \in \mathcal{S}} n(\pi_S(p_k), \tau)$. Invoking Lemma B.1, we see that p_{k+1} is the solution to problem (1), which proves the first conclusion in Theorem 2.6. Secondly, the monotonicity of log P(S; p_i) follows by the standard EM argument.

C Appendix

Algorithm 3 modifies Algorithm 1 by keeping the forward step (initialized with T ← all subtrees of S with the same root) and replacing the deterministic backward step with sampling: we set t_curr ← S and π_S(p) ← ∅, and while t_curr is not the BOS token, we call Sampling(P, t_curr, θ) to randomly sample a previous subtree of t_curr with respect to the distribution
$$\frac{\exp\big(\theta \cdot \log P(t', t)\big)}{\sum_{s \in \mathcal{T}(t)} \exp\big(\theta \cdot \log P(s, t)\big)}, \quad t' \in \mathcal{T}(t),$$
where $\mathcal{T}(t) =: \{t' \in \mathcal{T} : P(t, t') > 0\}$. Here a smaller θ leads to a more uniform sampling distribution among all partitions, while a larger θ tends to select the Viterbi partition picked by Algorithm 1 (Kudo, 2018).

Without assumption (3), computing the expectation $\mathbb{E}_{\Pi_S}[n(\pi, \tau) \mid S; p_k]$ becomes NP-hard, but Algorithm 3 yields an approximate solution. Indeed, if we iteratively run Algorithm 3 in place of the E-step in Algorithm 2 K times to obtain a partition sequence $\pi_S(p)^{(1)}, \pi_S(p)^{(2)}, \cdots, \pi_S(p)^{(K)}$, and use the averaged partitions to update the frequency F*, then, following similar lines as in Appendix B, we can prove an asymptotic version of Theorem 2.6 by showing that the averaged frequency over the K partitions converges to $\mathbb{E}[n(\pi, \tau) \mid S; p_k]$ as K tends to infinity, a direct consequence of the Law of Large Numbers. We omit the details.
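A minimal sketch of the Sampling call, assuming the transition probabilities P(t′, t) over T(t) are given as a dictionary (note that exp(θ · log P) equals P^θ):

```python
import random

def sampling(candidates, theta):
    # candidates maps each previous subtree t' in T(t) to P(t', t) > 0.
    # Weighting by P ** theta: theta -> 0 gives near-uniform sampling, while
    # a large theta concentrates mass on the Viterbi choice.
    keys = list(candidates)
    weights = [candidates[t] ** theta for t in keys]
    return random.choices(keys, weights=weights, k=1)[0]
```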

D Appendix
As discussed in Subsection 2.3.2, a placeholder structure is necessary for well-defined assembly of TreePiece units. However, adding all possible placeholder patterns to the vocabulary is impractical in both time and memory. Instead, we include only those patterns that are most likely to occur.
To do so, we tokenize every training skeleton and add the resulting units to the TreePiece vocabulary. As illustrated by the "Tokenize" direction in Figure 2, when a node loses a child during tokenization, we attach a placeholder at the missing child's position.
Remark D.1. New placeholder patterns may be out-of-vocabulary (OOV) at inference time. To mitigate OOV, we apply Algorithm 3 (in place of Algorithm 1) to tokenize each training skeleton K_0 times. Both K_0 and the sampling coefficient θ_0 are specified in Appendix E.1.2. Intuitively, with a larger K_0 and a smaller θ_0, Algorithm 3 generates more abundant placeholder patterns, covering as many OOV placeholders as possible.
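A sketch of this mitigation; tokenize_sample is a hypothetical handle to Algorithm 3 that returns serialized units, and the K_0, θ_0 defaults here are placeholders rather than the values in Appendix E.1.2:

```python
def collect_patterns(skeletons, tokenize_sample, K0=4, theta0=0.5):
    # Tokenize each training skeleton K0 times with the sampling tokenizer
    # (Algorithm 3) at temperature theta0, collecting every unit produced so
    # that rare placeholder patterns enter the vocabulary.
    patterns = set()
    for S in skeletons:
        for _ in range(K0):
            patterns.update(tokenize_sample(S, theta=theta0))
    return patterns
```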

E Appendix
E.1 Model configurations

Figure 2: Illustration of tokenizing a parse tree / assembling TreePiece units with the placeholder design, for the utterance "Remind me to send Susan an email about the meeting tonight".

Figure 3: Averaged TreePiece decoding steps plotted against AR decoding steps for skeleton generation.

Table 1: Quality and latency of all models on TOPv2. We train each model with 3 random seeds and report the averaged EM/EM-S scores and latency on the test split of the TOPv2 dataset. We measure the decoding and overall inference latency of all models on both CPU and an NVIDIA V100 32 GB GPU, and report the averaged milliseconds over all test samples. The numeric suffix of a TreePiece model denotes the expansion size used when creating the TreePiece vocabulary. The best entry for each metric is bolded.