StructSP: Efficient Fine-tuning of Task-Oriented Dialog System by Using Structure-aware Boosting and Grammar Constraints

We have investigated methods utilizing hierarchical structure information representation in the semantic parsing task and have devised a method that reinforces the semantic awareness of a pre-trained language model via a two-step fine-tuning mechanism: hierarchical structure information strengthening followed by fine-tuning on the final target task. The resulting model is better than existing ones at learning the contextual representations of utterances embedded within their hierarchical semantic structure and thereby improves system performance. In addition, we created a mechanism using inductive grammar to dynamically prune unpromising directions in the semantic structure parsing process. Finally, through experiments on the TOP and TOPv2 (low-resource setting) datasets, we achieved state-of-the-art (SOTA) performance, confirming the effectiveness of our proposed model. The source code of this work is released at https://github.com/truongdo619/StructSP.


Introduction
Task-oriented dialog systems are computer systems designed to perform a specific task (Bai et al., 2022). These systems have a wide range of applications in modern business and daily life (Yan et al., 2017). The semantic parsing model at the core of these systems plays an essential role in converting user utterances into machine-understandable representations for use in capturing the semantic meaning and returning appropriate responses. The introduction of hierarchical representation by Gupta et al. (2018) demonstrated the importance of nested sublogic composition in a task-oriented dialog system. Although this representation is flexible enough to capture the meaning of complicated queries, it also challenges models to identify labels and select the corresponding spans of constituent semantics in the natural sentence. Pre-trained language models (e.g., BERT (Devlin et al., 2019)) have recently achieved impressive results on many tasks, including semantic parsing (Ziai, 2019; Herzig and Berant, 2021; Rubin and Berant, 2021). Among them, the Recursive INsertion-based Encoding (RINE) model (Mansimov and Zhang, 2022) produces SOTA results by using a recursive insertion-based mechanism that directly injects the decoded tree from the previous step as input to the current step. Although many semantic parsing systems have shown impressive results, there is still much room for improvement in their overall performance.
A promising way to improve the performance of a task-oriented dialog system is to adapt a pre-trained language model to the characteristics of this task. Language models are pre-trained on large-scale unstructured text, which means that they do not explicitly learn the logical structure of sentences, even though information about sentence structure is crucial for the semantic parsing task. Besides, the hierarchical semantic representation in a task-oriented dialog system usually follows a grammar based on a specific task or domain. For example, the parent node of slot SL:NAME_EVENT is always the intent IN:GET_EVENT, etc. However, existing methods ignore this information (Mansimov and Zhang, 2022). We thus focused on making two improvements: strengthening the hierarchical structure awareness of the pre-trained language model and dynamically pruning the unpromising decoding directions by using inductive grammar (Figure 2).
In this paper, we introduce StructSP, a framework for effectively embedding the structured logic information of utterances into a pre-trained language model and thereby enhancing the performance of the semantic parsing system. Our method is general and can be adapted to any logical structure, such as a logical form expressed in λ-calculus or Prolog syntax (Dong and Lapata, 2016). In particular, we exploit the hierarchical representation, a deep fine-grained structure widely used in semantic parsing (Zhao et al., 2022b; Desai and Aly, 2021; Louvan and Magnini, 2020). Our method consists of two phases: structure-aware boosting and grammar-based RINE. The structure-aware boosting phase aims to enhance the structured information of the pre-trained language model and includes two subtasks. The first is structure-focused masked language modeling (MLM), which extends the standard MLM task (Devlin et al., 2019) by focusing more on the logical units in the linearized hierarchical representation. The second subtask is relative tree agreement, where relative trees are the trees parsed in the intermediate steps of the parsing process (Figure 1). This subtask enables the model to represent the relative trees with closer hidden vectors and learn the hidden relationship between them. The second phase, grammar-based RINE, uses the grammar extracted from annotated data to support node label prediction. Incorporating structure information into the model enabled StructSP to achieve an exact-match score on the TOP dataset 0.61 points higher than that of a SOTA method, with similarly promising results on the TOPv2 dataset (low-resource setting).
This work makes three contributions: (1) an effective fine-tuning approach is introduced for incorporating hierarchical semantic structured information into a pre-trained language model; (2) the use of grammar is introduced in the parsing process to reduce unpromising node label prediction; (3) the StructSP framework is shown to outperform existing models in task-oriented semantic parsing on two datasets, TOP and TOPv2.

Related Work
Overview Task-oriented parsing (TOP) (Gupta et al., 2018) and its variant TOPv2 (Chen et al., 2020) were created as benchmarks for assessing the performance of task-oriented semantic parsing models. Various approaches have been proposed for tackling the semantic parsing task on these datasets. Zhu et al. (2020) introduced a non-autoregressive sequence-to-sequence semantic parsing model, which is based on the Insertion Transformer model (Stern et al., 2019). Aghajanyan et al. (2020) introduced an extension of the hierarchical representation, "decoupled representation", and used sequence-to-sequence (S2S) models based on the pointer-generator architecture to parse this representation. Rongali et al. (2020) introduced a unified architecture based on S2S models and a pointer-generator network for semantic parsing. Einolghozati et al. (2019) introduced a shift-reduce parser based on recurrent neural network grammars (RNNGs) with three improvements to the base RNNG model: incorporating contextualized embeddings, ensembling, and pairwise reranking based on a language model. Additionally, Zhao et al. (2022a) showed that compositional TOP could be formulated as abstractive question answering (QA), where the parse tree nodes are generated by posing queries to a QA model. The RINE model (Mansimov and Zhang, 2022) splits the work of parsing an utterance into multiple steps, where the input of each step is the output of the previous step. However, even SOTA methods ignore the hierarchical structure information during the parsing process. We thus focused on utilizing the hierarchical structure information by using inductive grammar extracted from annotated data to guide the node label predictions.
Pre-trained Language Model Adaptation Several recent studies have demonstrated the value of adapting pre-trained language models to specific tasks with different training objectives, such as in summarization (Zhang et al., 2020a) and knowledge inference (Sun et al., 2019; Liu et al., 2020). In the realm of semantic parsing, the SCORE pre-training method (Yu et al., 2020b) focuses on inducing representations that capture the alignment between dialogue flow and structural context in conversational semantic parsing tasks. Grappa (Yu et al., 2020a) is a pre-training approach designed for table semantic parsing and seeks to learn an inductive compositional bias through the joint representation of textual and tabular data. Bai et al. (2022) introduced a semantic-based pre-training approach that uses abstract meaning representation (AMR) as explicit semantic knowledge to capture the core semantic information of utterances in dialogues. In comparison, to strengthen hierarchical structure information, we continue training the model by using only the annotated data of the fine-tuning task instead of generating (or augmenting) a large artificial dataset to learn structure information (Yu et al., 2020b,a). This enables the model to effectively perform the hierarchical semantic parsing task using fewer computational resources.
Grammar-Constrained Neural Network Models Incorporating grammar constraints into text generation with neural networks is a topic of interest to researchers. Yin and Neubig (2017) proposed an approach using a grammar model to generate an abstract syntax tree through a series of actions. Krishnamurthy et al. (2017) presented a type-constrained semantic parsing model, which ensures that the decoder only generates well-typed logical forms. Shin et al. (2021) demonstrated that a constrained decoding process with an intermediate sub-language mapped by a large language model can support parsing the target language. More recently, Baranowski and Hochgeschwender (2021) extracted a context-free grammar from the target logical form of the semantic parsing task. These grammar constraints were then enforced with an LR parser to maintain syntactically valid sequences throughout decoding. In comparison, our approach utilizes grammar as additional structured information during training with a conditional loss function. This key difference sets our approach apart from previous works.

Overview
Our StructSP framework consists of two fine-tuning phases: structure-aware boosting and grammar-based RINE (Figure 2). In the first phase, Structure-Aware Boosting, we use a pre-trained RoBERTa model (Liu et al., 2019) as a backbone and continue to train it in a multi-task setting. This improves the model's ability to express the structured information of hierarchical representations by using two sub-tasks: structure-focused MLM (Section 3.2.1) and relative tree agreement (Section 3.2.2).
In the next phase, grammar-based RINE, we use the RINE approach (Mansimov and Zhang, 2022) augmented with grammar rules to tackle the hierarchical semantic parsing task. The parsing process is split into multiple steps, where the input of each step is the output of the previous one. The structure-aware model from the previous phase is used as an encoder to make predictions. In particular, an inductive grammar synthesized from the training data is used to prune the unpromising decoding directions. The grammar rules not only help correct the parsing process but also reduce resource usage by eliminating unnecessary predictions.

Structure-focused MLM
Before introducing our proposed structure-focused masking strategy, we describe the original MLM (Devlin et al., 2019).

Original MLM Given a sequence of tokens X = (x_1, x_2, ..., x_n), a sequence of masking probabilities T = (t_1, t_2, ..., t_n) corresponding to the input sequence is generated, where t_i indicates the probability that token x_i will be chosen for masking. Following Devlin et al. (2019), all t_i are equal to 15%, meaning that every token has the same probability of being chosen in the masking process. Let D be the set of tokens randomly chosen for masking (e.g., D = {x_2}), and let X′ be the sentence with masked tokens (e.g., X′ = (x_1, [MASK], x_3, ..., x_n)). The cross-entropy loss function is computed for all masked tokens:

$$\mathcal{L}_{mlm} = -\frac{1}{|D|} \sum_{x_i \in D} \sum_{j=1}^{|V|} y_{ij} \log p(x_{ij} \mid X'),$$

where |D| is the total number of masked tokens in a sentence, |V| is the size of the vocabulary of the language model, y_{ij} indicates whether token x_i corresponds to the j-th token in the vocabulary, and p(x_{ij} | X′) is the probability that the model assigns to token x_i being the j-th token in the vocabulary given the context provided by the masked sequence. The model is trained to minimize this loss function, which measures how well it can predict the masked tokens on the basis of the context provided by the unmasked ones.

Structure-focused MLM In the linearized hierarchical representation of an utterance (Table 1), there are two types of tokens: normal and logical. The logical tokens can be divided into two sub-categories: bracket tokens ("[" or "]"), which indicate the span of a label, and label tokens, which represent the label itself (e.g., IN:GET_EVENT). The correct prediction of a masked bracket token demonstrates the model's understanding of the label span. Similarly, the correct prediction of a masked label token demonstrates the model's understanding of the label type. Therefore, masking logical tokens to train the model for learning the hierarchical structure is a reasonable approach (Bai et al., 2022). However, the original MLM treats all tokens equally, regardless of whether they are normal or logical. In our approach, we modify the original MLM by assigning a higher masking probability to logical tokens, thereby pushing the model to pay more attention to the hierarchical structure (logical token) information. This enables the model to learn the structural characteristics of natural sentences while retaining the knowledge learned from the original MLM (Yu et al., 2020a). More specifically, our structure-focused MLM strategy gives logical tokens a higher masking probability, defined as α (with α > 15%). Let T′ = (t′_1, t′_2, ..., t′_n) be the revised sequence of masking probabilities, where t′_i = α if the corresponding i-th token is a logical one. The structure-focused MLM loss (L_str_mlm) of masked tokens is computed similarly to the original MLM loss.
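To make the masking schedule concrete, the following is a minimal sketch of how per-token masking probabilities could be assigned and applied, assuming whitespace-separated tokens, a 15% base probability for normal tokens, α = 0.4 for logical tokens, and a generic `<mask>` placeholder; the function names are illustrative, and the 80/10/10 replacement heuristic of the original MLM is omitted for brevity.

```python
import random

LOGICAL_PREFIXES = ("IN:", "SL:")  # label tokens (intents and slots)
BRACKETS = {"[", "]"}              # bracket tokens marking label spans


def masking_probs(tokens, base_p=0.15, alpha=0.40):
    """Assign each token a masking probability: alpha (> base_p) for
    logical tokens (brackets and labels), base_p for normal tokens."""
    return [
        alpha if tok in BRACKETS or tok.startswith(LOGICAL_PREFIXES) else base_p
        for tok in tokens
    ]


def apply_masking(tokens, probs, mask_token="<mask>"):
    """Replace each token by the mask token with its own probability."""
    masked, targets = [], []
    for tok, p in zip(tokens, probs):
        if random.random() < p:
            masked.append(mask_token)
            targets.append(tok)    # tokens the MLM loss must recover
        else:
            masked.append(tok)
            targets.append(None)   # not included in the MLM loss
    return masked, targets


if __name__ == "__main__":
    linearized = "[ IN:GET_EVENT Concerts by [ SL:NAME_EVENT Chris Rock ] ]".split()
    masked, targets = apply_masking(linearized, masking_probs(linearized))
    print(masked)
```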

Relative Tree Agreement
In hierarchical representation, leaf nodes are words, while non-terminal nodes are semantic tokens: either intents or slots.
We define a non-terminal list as an ordered list of non-terminal nodes obtained by a breadth-first search. For example, using the tree shown in Figure 2, we obtain the non-terminal list [IN:GET_EVENT, SL:CATE_EVENT, SL:NAME_EVENT]. This list is then used to form positive training samples. The goal of this task is to improve the ability of the model to produce similar representations for trees in the same parsing process, enabling it to effectively perform the recursive insertion-based semantic parsing task. To achieve this goal, a pruned tree is generated at each training iteration by randomly selecting a node from the non-terminal list of the full tree (i.e., the annotated tree) and pruning all non-terminal nodes to the right of it in the list. This pruned tree is relative to the full tree. Formally, we denote the linearized representation of the full tree as P_full and that of the pruned tree as P_pruned. The hidden vector representations of these two trees are encoded using a pre-trained language model:

$$h_{full} = \mathrm{Encoder}(P_{full}), \qquad h_{pruned} = \mathrm{Encoder}(P_{pruned}).$$

We use contrastive learning (Frosst et al., 2019; Gao et al., 2021; Bai et al., 2022; Luo et al., 2022) to train our model for this task. The aim is to align the representations of positive pairs in the latent space by minimizing the distance between them while simultaneously maximizing the distance between negative pairs. Specifically, we define a contrastive loss function that measures the similarity between the hidden states h_full and h_pruned. This information is then used to update the model's parameters. Minimizing this loss function enables our model to learn how to produce more similar representations for the full and pruned trees. In particular, given a training batch B, the positive pair at the i-th position is (h^i_full, h^i_pruned), and the negative pairs are the other samples in the batch. The training objective is given as follows:

$$\mathcal{L}_{rta} = -\sum_{i \in B} \log \frac{\exp(\mathrm{sim}(h^i_{full}, h^i_{pruned}) / \tau)}{\sum_{j \in B} \exp(\mathrm{sim}(h^i_{full}, h^j_{pruned}) / \tau)},$$

where sim is the similarity function based on cosine similarity, h^i_full and h^i_pruned are the hidden states for the full and pruned trees, respectively, at the i-th position in the batch, τ is a temperature hyperparameter (Gao et al., 2021), and B is the set of samples in the mini-batch.
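For concreteness, the following is a minimal sketch of this in-batch contrastive objective, assuming that h_full and h_pruned are batched encoder representations (e.g., the [CLS] vectors of the linearized full and pruned trees) and that cosine similarity is used as sim; the function name and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F


def relative_tree_agreement_loss(h_full, h_pruned, tau=1.0):
    """In-batch contrastive loss: for each full tree, its own pruned tree is
    the positive; the pruned trees of the other samples are negatives.

    h_full, h_pruned: (batch_size, hidden_dim) encoder representations of the
    linearized full and pruned trees (e.g., their [CLS] vectors).
    """
    h_full = F.normalize(h_full, dim=-1)
    h_pruned = F.normalize(h_pruned, dim=-1)
    # sim[i, j] = cosine similarity between full tree i and pruned tree j
    sim = h_full @ h_pruned.T / tau
    # the positive pruned tree for row i sits on the diagonal
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)


if __name__ == "__main__":
    torch.manual_seed(0)
    h_full, h_pruned = torch.randn(8, 768), torch.randn(8, 768)
    print(relative_tree_agreement_loss(h_full, h_pruned).item())
```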

Objective Function
We combine the objectives of the two sub-tasks to formulate the structure-aware boosting loss (L_sab):

$$\mathcal{L}_{sab} = \mathcal{L}_{str\_mlm} + \lambda \, \mathcal{L}_{rta},$$

where λ is a hyperparameter that balances the contributions of the two objectives. As proposed elsewhere (Lee et al., 2020; Nandy et al., 2021; Bai et al., 2022), we do not train our model from scratch. Instead, we use a pre-trained language model to initialize its parameters. This enables us to leverage the knowledge contained in the language model.

Grammar-based RINE
RINE In this phase, we use a recursive insertion-based approach (Mansimov and Zhang, 2022), with our structure-aware boosted model created in the previous phase serving as the backbone. The parsing process can be formally represented as a chain of incremental sub-parsed trees, denoted as P = [P_0, P_1, ..., P_gold] (Table 1). At the i-th step, the model receives the tree from the previous step, P_{i-1}, as input and predicts the current node label and its span to decode the tree P_i. The tree P_i is updated to include the newly predicted node, and processing recursively moves to the next step i + 1 until a special end-of-prediction (EOP) signal is found.
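The recursive decoding loop can be sketched as follows; `predict_node` stands in for the trained model (it returns a label and the start and end indices of its span in the current tree, or None for EOP), and both the function names and the scripted example predictions are illustrative.

```python
def parse_utterance(utterance_tokens, predict_node, max_steps=50):
    """Recursive insertion-based decoding sketch: repeatedly feed the current
    linearized tree to the model and insert the predicted node until EOP."""
    tree = list(utterance_tokens)            # P_0 is the bare utterance
    for _ in range(max_steps):
        prediction = predict_node(tree)
        if prediction is None:               # EOP: the tree is complete
            break
        label, start, end = prediction
        # wrap the predicted span in "[ label ... ]" to obtain the next tree
        tree = tree[:start] + ["[", label] + tree[start:end + 1] + ["]"] + tree[end + 1:]
    return tree


if __name__ == "__main__":
    # scripted predictions reproducing the example tree from Figure 2
    steps = iter([("IN:GET_EVENT", 0, 3), ("SL:CATE_EVENT", 2, 2),
                  ("SL:NAME_EVENT", 7, 8), None])
    result = parse_utterance("Concerts by Chris Rock".split(), lambda tree: next(steps))
    print(" ".join(result))
    # [ IN:GET_EVENT [ SL:CATE_EVENT Concerts ] by [ SL:NAME_EVENT Chris Rock ] ]
```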
Grammar constraint Observation revealed that the relationships between nodes in the semantic tree are important information in the parsing process. For example, the intent IN:GET_EVENT typically comes with slots for event information, SL:CATE_EVENT or SL:NAME_EVENT. Therefore, we introduce a grammar integration mechanism into RINE to reduce movement in unpromising decoding directions.
Specifically, we construct a grammar G = {A → B | A, B are non-terminal nodes} on the basis of the training data (e.g., IN:GET_EVENT → SL:CATE_EVENT). Integrating this grammar into the RINE model means that, at each step of node label prediction, the parser considers only the potential nodes in a candidate set (e.g., C = G(A)) determined by the parent node (e.g., B ∈ C).
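A minimal sketch of how such a grammar could be induced from linearized training trees and queried for a candidate set is given below; representing G as a parent-to-children mapping and the helper names are illustrative choices rather than the exact data structures of our implementation.

```python
from collections import defaultdict


def extract_grammar(linearized_trees):
    """Build G as {parent label -> set of allowed child labels} from annotated
    linearized trees such as
    "[IN:GET_EVENT [SL:CATE_EVENT Concerts ] by [SL:NAME_EVENT Chris Rock ] ]"."""
    grammar = defaultdict(set)
    for tree in linearized_trees:
        stack = []  # currently open non-terminal labels
        tokens = tree.replace("[", " [ ").replace("]", " ] ").split()
        for i, tok in enumerate(tokens):
            if tok == "[":
                label = tokens[i + 1]          # the label follows the bracket
                if stack:                      # record the parent -> child rule
                    grammar[stack[-1]].add(label)
                stack.append(label)
            elif tok == "]":
                stack.pop()
    return grammar


def candidates(grammar, parent_label):
    """Candidate set C = G(parent) used to prune node label predictions."""
    return grammar.get(parent_label, set())


if __name__ == "__main__":
    trees = ["[IN:GET_EVENT [SL:CATE_EVENT Concerts ] by [SL:NAME_EVENT Chris Rock ] ]"]
    g = extract_grammar(trees)
    print(candidates(g, "IN:GET_EVENT"))  # {'SL:CATE_EVENT', 'SL:NAME_EVENT'}
```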
Modeling For mathematical modeling, given the subtree P_i at the i-th step, our model first encodes this tree as a sequence and obtains the corresponding contextualized hidden states. Following Mansimov and Zhang (2022), we use the hidden state of the [CLS] token in the last transformer encoder layer to predict the label of the next node. Furthermore, to predict the span of this node, we use all hidden states in the last two layers to compute the probabilities of the start and end positions:
$$p_{nodeLb} = \mathrm{softmax}\big(W_{lb}\, h^{(l)}_{[CLS]} + b_{lb}\big),$$
$$p_{start}(w_k) = \mathrm{softmax}_k\big(W_{st}\, [h^{(l)}_{w_k}; h^{(l-1)}_{w_k}] + b_{st}\big), \qquad p_{end}(w_k) = \mathrm{softmax}_k\big(W_{ed}\, [h^{(l)}_{w_k}; h^{(l-1)}_{w_k}] + b_{ed}\big),$$

where the W and b terms are learnable parameters, l is the number of transformer encoder layers in the pre-trained language model, w_k is the k-th token (word) of the input sequence, h^{(l)}_{w_k} is the hidden state of w_k in layer l, and [·;·] denotes concatenation. We inject a special penalty score (s_penalty), computed on the basis of the candidate set C, into the loss function to cause unpromising node label predictions to be ignored, where C is the set of candidates generated from the grammar and the parent node. Finally, the cross-entropy (CE) loss is synthesized in accordance with the gold label (g_nodeLb) and these output probabilities:

$$\mathcal{L}_{nodeLb} = \mathrm{CE}(g_{nodeLb}, p_{nodeLb}) + s_{penalty}.$$

Inference challenge In the inference step, the challenge with this approach is that the parent node of the current step, which is needed to generate the candidate set, is not known in advance. We propose a simple strategy to meet this challenge: use the result of span prediction to determine the parent node. For example, considering the 3rd step in Table 1 and given input P_2, we have no information about the next node to be predicted and do not know whether the parent node is IN:GET_EVENT or SL:CATE_EVENT. We predict the span first and obtain the position of "Chris Rock". We thus determine that the parent of the current node is IN:GET_EVENT, and can generate candidates C by using this node.
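As one plausible realization of this penalty (the exact formulation in our loss may differ), the sketch below adds to the node-label cross-entropy a term equal to the probability mass assigned to labels outside the candidate set C; the `penalty_weight` argument and the function name are illustrative.

```python
import torch
import torch.nn.functional as F


def node_label_loss(logits, gold_label, candidate_ids, penalty_weight=1.0):
    """Cross-entropy on the node label plus a penalty on the probability mass
    assigned to labels outside the grammar candidate set C.

    logits:        (num_labels,) unnormalized scores for the next node label
    gold_label:    index of the gold label
    candidate_ids: label indices allowed by the grammar for the current parent
    """
    probs = F.softmax(logits, dim=-1)
    allowed = torch.zeros_like(probs)
    allowed[list(candidate_ids)] = 1.0
    s_penalty = (probs * (1.0 - allowed)).sum()   # mass outside C
    ce = F.cross_entropy(logits.unsqueeze(0), torch.tensor([gold_label]))
    return ce + penalty_weight * s_penalty


if __name__ == "__main__":
    torch.manual_seed(0)
    logits = torch.randn(5)                       # 5 hypothetical node labels
    print(node_label_loss(logits, gold_label=2, candidate_ids=[2, 3]).item())
```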

Experiment
To verify the effectiveness of our proposed method, we conducted experiments on the TOP and TOPv2 (low-resource setting) datasets. Following Mansimov and Zhang (2022), we ran each experimental setting with three random seeds and reported the average results with standard deviations.

Datasets and Evaluation Metric
The TOP dataset (Gupta et al., 2018) is a collection of utterances divided into two domains: navigation and event. Following previous works (Rongali et al., 2020; Zhu et al., 2020; Mansimov and Zhang, 2022), we used exact match (EM) as the performance metric. The EM score was calculated as the percentage of utterances for which the fully parsed tree was correctly predicted.
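For illustration, the EM score can be computed as in the following sketch, assuming predicted and gold parses are compared as whitespace-normalized linearized tree strings; the function name is illustrative.

```python
def exact_match(predictions, references):
    """Percentage of utterances whose predicted parse exactly matches the gold
    tree after whitespace normalization of the linearized form."""
    normalize = lambda s: " ".join(s.split())
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, references))
    return 100.0 * correct / len(references)


if __name__ == "__main__":
    preds = ["[IN:GET_EVENT Concerts by [SL:NAME_EVENT Chris Rock ] ]"]
    golds = ["[IN:GET_EVENT [SL:CATE_EVENT Concerts ] by [SL:NAME_EVENT Chris Rock ] ]"]
    print(exact_match(preds, golds))  # 0.0: the prediction misses one slot
```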

Experimental setting
Structure-aware boosting: We used the RoBERTa model (Liu et al., 2019) as our backbone. With the TOP dataset, we used a peak learning rate of 1e-05 and continued training for ten epochs using the Adam optimizer (Kingma and Ba, 2014) with epsilon 1e-08 and a batch size of 16 sequences of 512 tokens. For the hyperparameters, we set τ = 1.0 and selected λ from {0.5, 1.0}. With the TOPv2 dataset, we used the same settings as for the TOP dataset except that we increased the number of training epochs to 50 for the 25 SPIS setting. Training was performed on a single NVIDIA A100 GPU.
Grammar-based RINE: We used the model trained in the structure-aware boosting phase as an encoder. For the TOP dataset, we set the number of warmup steps to 1000, selected the learning rate from {1e-05, 2e-05}, and performed training for 50 epochs. The best checkpoint was chosen on the basis of the model's performance on the validation set. We utilized the Adam optimizer and a batch size of 32 sequences containing 512 tokens. For the TOPv2 dataset, we used the same settings except for adjusting the batch size to 16 for the 500 SPIS setting and 8 for the 25 SPIS setting.

Baseline:
We reproduced the RINE model (Mansimov and Zhang, 2022) as a strong baseline and kept the same hyperparameter values as in the original paper.

Main Results
TOP dataset We evaluated the performance of our proposed StructSP method on the TOP dataset and compared it with that of other works. The results are shown in Table 2. Our model outperformed all the other models. Specifically, it achieved a higher score than the autoregressive seq2seq model with pointer (Rongali et al., 2020) by 1.51 EM points and outperformed the best previous method, RINE (Mansimov and Zhang, 2022), by 0.61 EM points. These results show the effectiveness of our proposed method in injecting prior semantic hierarchical structure information of a natural sentence into the pre-trained language model. Moreover, our result surpassed that of Bai et al. (2022) by a large margin, even though they also strengthened the structure information of pre-trained models, using a general AMR structure. Additionally, using grammar improved the performance of our model by 0.29 points compared with a version of the model trained without grammar. This demonstrates the value of incorporating grammar into our model when using the TOP dataset.
TOPv2 dataset We evaluated the performance of our proposed StructSP method on the low-resource data of the reminder and weather domains of the TOPv2 dataset and compared it with that of the other methods. The results are shown in Table 3.
Our model outperformed the others for all SPIS settings. At the 25 SPIS setting, StructSP outperformed the baseline by 0.93 EM points in the weather domain and 1.02 EM points in the reminder domain. Note that, in these low-resource settings, the structure-aware boosting phase also continues training with only the limited available data, yet the model still achieves impressive improvements.
In addition, at the 25 SPIS setting of the weather domain, we observed that StructSP performed better when trained without grammar. We argue that this is because the training set at this setting is extremely small (176 samples), so the grammar extracted from it is not expressive enough to cover the grammar in the validation and test sets.

Ablation Study
To evaluate the effect of using each component in our framework, we compared the model's performance for each combination of component settings with that of the RINE baseline model on the validation set of the TOP dataset (Table 4).
We found that using grammar in the second phase (grammar-based RINE) led to improved performance on the TOP dataset (an EM score 0.12 points higher than that of the baseline model). Additionally, using the structure-aware boosting phase substantially improved the EM score compared with the baseline (0.52 points higher). To further understand the contributions of each training subtask in the structure-aware boosting phase, we conducted two additional experiments. The results show that using the structure-focused masked language modeling subtask improves EM performance by 0.36 points compared with the baseline, while using the relative tree agreement subtask leads to only a 0.05-point improvement.
Furthermore, we conducted a t-test (Lehmann and Romano, 1986) with the null hypothesis that the expected values of our full-setting model (StructSP) and the baseline model (RINE) are identical (Table 4). Based on the experimental outcomes, the p-value was found to be below 0.05, which suggests that the proposed method outperforms the baseline model significantly.

Effect of masking probability α
In another experiment, we analyzed the effect of the logical-token masking probability (α in Section 3.2.1) used in the structure-aware boosting phase on overall performance (Figure 3). High performance was achieved when α was set to 0.3 or 0.4. We attribute this to our mechanism pushing the model to pay more attention to the logical tokens, helping it to better capture the structure. In summary, these results show that our StructSP method achieved better performance than the baseline for all values of α, which demonstrates the robustness of our approach.

Case Study

We also qualitatively compared the outputs of StructSP and the baseline on several examples (their tree representations are given in Appendix C). In the first example, our model returned the correct tree when grammar was used. This demonstrates the effectiveness of incorporating grammar into our model. The second example presents a ground-truth tree with a complex structure, requiring the model to identify the span "after a Pacers game" as a date time slot (SL:DATE_TIME) and then correctly parse the structure within this span. Our model was able to correctly return the tree, whereas the baseline model was not. The final example is a particularly challenging one: the tree has a depth of 5, indicating that the tree structure is highly complex. Both models failed to return the correct predictions, suggesting that learning to handle such complicated queries is an interesting topic for future work.

Conclusion
We have presented a novel approach to improving the performance of SOTA semantic parsing models on hierarchical representation datasets. In this approach, a model is created that incorporates knowledge of utterance structures into the semantic parsing process. This is achieved by learning contextual representations from the hierarchical representation of utterances with objective functions targeted at the semantic parsing task as well as by using grammar rules containing knowledge about the structure of the data for training and label prediction. In experiments on the TOP and TOPv2 datasets, our model outperformed previous SOTA approaches.

Limitations
There are two main limitations to our work.
(1) Grammar constraint: The results of the StructSP method at the 25 SPIS setting of the TOPv2 dataset (Table 3) suggest that the benefit of using grammar with low-resource data can be uncertain. The reason is that the grammar extracted from the training data in a low-resource setting is not general enough to capture the grammar of unseen data (the validation or test set). Therefore, for our StructSP method to work effectively, the provided grammar should cover all grammar rules if possible.
(2) Prediction time: A recursive insertion-based strategy is used for prediction. This means that the output of the previous parsing step is used as input for the current parsing step, and this process continues until a terminal signal is encountered. As a result, parsing a complex tree with multiple intents/slots (labels) can be a lengthy process due to the recursive nature of this method. Future work includes improving parsing prediction time by predicting all labels at the same level in the parsed tree rather than predicting them one by one.

B Model Hyper-Parameters
The hyper-parameters used in our models on the TOP dataset are shown in Table 6. On the TOPv2 dataset, we used the same hyper-parameters with the following exceptions: for the 25 SPIS setting in the structure-aware boosting phase, the number of training epochs was adjusted to 50; for the 500 SPIS and 25 SPIS settings in the label prediction phase, the batch size was adjusted to 16 and 8, respectively. Our framework is implemented using PyTorch and HuggingFace Transformers. Our source code and extracted grammar can be found at: [masked] (access will be granted upon acceptance).

C Case study outputs with tree representation

Figure 5 presents the example of the parsing process described in Section 1 using tree representation. Additionally, Figure 5 displays the outputs of the examples discussed in Section 5.3 using tree representation.