An Empirical Analysis of Leveraging Knowledge for Low-Resource Task-Oriented Semantic Parsing

Task-oriented semantic parsing has drawn a lot of interest from the NLP community, especially for voice assistant use cases, as it enables representing the meaning of user requests with arbitrarily nested semantics, including multiple intents and compound entities. SOTA models are large seq2seq transformers and require hundreds of thousands of annotated examples to be trained. However, annotating such data to bootstrap new domains or languages is expensive and error-prone, especially for requests with nested semantics. In addition, large models easily break the tight latency constraints imposed in a user-facing production environment. In this work we explore leveraging external knowledge as a replacement for additional annotated data in order to improve model accuracy in low-resource and low-compute settings. We demonstrate that using knowledge-enhanced encoders inside seq2seq models does not result in performance gains by itself, but that multitask learning to uncover entities in addition to the parse generation is a simple yet effective way of improving performance across domains and data regimes. We show this is especially true in the low-compute, low-data setting and for entity-rich domains, with relative gains of up to 74.48% in some cases on the TOPv2 dataset.


Introduction
Fostered by NLP advances, virtual assistants such as Google Home or Alexa are becoming increasingly capable of addressing complex yet natural, everyday user needs. While requests as simple as "turn off the living room lights when the movie starts" could not be fulfilled with legacy systems that assigned a single user intent to each utterance and a single slot label to each token in an utterance (Mesnil et al., 2013; Liu and Lane, 2016), recent works on task-oriented semantic parsing (Gupta et al., 2018; Aghajanyan et al., 2020) represent utterance semantics with arbitrarily nested trees (Figure 1), thus handling the above use case among others (e.g. multiple intents, cross-domain intents, compound entities, etc.). The research community tackles this task with success by treating it as a seq2seq generation task where a linearized semantic tree is predicted iteratively (Rongali et al., 2020), but such approaches fall short when confronted with real-life constraints such as strict run-time latency and scarcity of quality training data. Manual annotation of training examples is a costly and error-prone process, which is exacerbated as utterance target representations become richer (more nested). The impact of data scarcity has been quantified in recent years with the introduction of the TOPv2 benchmark (Chen et al., 2020), which provides low-resource scenarios for task-oriented parsing.
Popular approaches to overcome data scarcity include synthetic data augmentation (Feng et al., 2021; Jia and Liang, 2016; Schick and Schütze, 2021), transfer learning (Ruder et al., 2019; Fan et al., 2017), and meta-learning (Gu et al., 2018; Huang et al., 2018; Wang et al., 2020). In this paper, we explore whether we can model richer token representations for mentions by leveraging external knowledge, as mentions are fundamental to generating the correct parse. The underlying motivation lies in the observation that many everyday NLP applications involve real-life entities referenced in knowledge bases (e.g. street names, sports events, or public figures), and this information can be utilized to enhance downstream NLP tasks. For example, the request "play the green line" could refer to either a movie name or a song name; modeling this mention appropriately for the decoder could improve performance when generating a parse. This is particularly appealing in the low-data regime, in which rare entities are unlikely to be represented in the training data at all. Additionally, building entity embeddings through entity-focused modeling objectives has shown promising results in entity-based NLP tasks such as named entity recognition (Yamada et al., 2020) and entity linking (Wu et al., 2020).
While there has been prior work on leveraging knowledge for generation tasks (Guu et al., 2020; Izacard et al., 2022; Cao et al., 2020), it has largely focused on unstructured text generation tasks such as Question Answering or Entity Linking. To the best of our knowledge, we are the first to investigate its use in seq2seq models for task-oriented semantic parsing, a complex and structured text generation task.
We present an empirical analysis of using knowledge to improve the accuracy of semantic parsing models, with a special focus on low-latency models such as small-decoder seq2seq models and non-auto-regressive models like RINE (Mansimov and Zhang, 2022). Our contributions are as follows:
• We benchmark three popular knowledge-enhanced encoders inside seq2seq models and show that this way of leveraging knowledge does not consistently improve accuracy in the low-data regimes for task-oriented semantic parsing generation. However, when the task is reformulated as a classification task, we see promising results with knowledge-enhanced encoders.
• We propose a joint training objective combining semantic parsing and mention detection as a simple and effective approach to leverage external knowledge and improve accuracy. We find up to 74.48% relative gains over baselines for low-data settings and entity-rich domains.
• We quantify the benefits of source training for regular, knowledge-enhanced and low-latency models, in gradually increasing low-data scenarios.

Related Work
Task-oriented Semantic Parsing Semantic parsing refers to the task of mapping natural language queries into machine-executable representations. Voice assistants typically transform a voice recording into text, which is then mapped to a backend-exploitable representation containing the semantics of the request: the user intent, the invoked entities, relations between those entities, etc. Task-oriented parsing was popularized with the introduction of the TOP dataset (Gupta et al., 2018), and is usually treated as a seq2seq task where utterance tokens are copied into a semantic tree constructed auto-regressively (Rongali et al., 2020; Arkoudas et al., 2022). However, such models are not always applicable in production environments with strict memory and latency constraints. This limitation is commonly addressed by reducing model sizes (Jiao et al., 2019; Kasai et al., 2020) and leveraging non-auto-regressive modeling (Gu et al., 2017; Zhu et al., 2020; Mansimov and Zhang, 2022).
Knowledge-Enhanced LMs Retrieval-based seq2seq models such as REALM (Guu et al., 2020) and ATLAS (Izacard et al., 2022) leverage factual knowledge from a corpus or knowledge graph during training and inference, and hence incur a considerable latency cost, despite attempts to make the retrieval more efficient (Wu et al., 2022). Given our low-latency setup, we focus on parametric knowledge that is learnt during the pre-training or fine-tuning of large language models (LLMs), resulting in embeddings that do not require explicit knowledge retrieval at inference. Knowledge-enhanced pretraining focuses on modeling entities: WKLM (Xiong et al., 2019) learns to determine if an entity was replaced with another entity of the same type in addition to Masked Language Modeling (MLM), and shows gains on downstream knowledge-intensive tasks such as Question Answering (QA) and Relation Extraction (RE). LUKE (Yamada et al., 2020) explicitly models entity embeddings through entity-embedding prediction during MLM and entity-entity self-attention layers during fine-tuning, with gains on Named Entity Recognition (NER), QA and RE. KBIR (Kulkarni et al., 2022) learns to reconstruct keyphrases in a combination and extension of WKLM and SpanBERT (Joshi et al., 2020), improving keyphrase extraction/generation tasks. Lastly, BLINK (Wu et al., 2020) learns entity disambiguation by aligning entity surface forms to their descriptions, resulting in rich entity embeddings. Work on parametric knowledge-enhanced seq2seq models is limited to KeyBART (Kulkarni et al., 2022) for Keyphrase Generation and GENRE (Cao et al., 2020) for Entity Disambiguation.

Methods
We explore two complementary methods for leveraging knowledge: (1) fine-tuning knowledge-enhanced encoders for task-oriented semantic parsing inside seq2seq models, and (2) multi-tasking the parse generation with a mention detection task.
Task formulation We follow the task formulation of the Seq2Seq-PTR model as a sequence-to-sequence generation setup (Rongali et al., 2020). The source sequence is an utterance and the target sequence is a linearized representation of the semantic parse. The target sequence is modified to contain only intent and slot labels or pointers to tokens in the utterance. Following Aghajanyan et al. (2020) and subsequent work, we use the decoupled format that limits prediction to tokens that are leaves of slots, as it yielded better downstream performance in previous work. We illustrate the format with an example from the TOPv2 dataset below. Each @ptr_i token points to the i-th token in the source sequence; here @ptr_3 corresponds to the word minneapolis.
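A minimal sketch of this format (the utterance, the intent and slot labels, and the 0-based token indexing below are illustrative assumptions rather than the exact TOPv2 example):

```python
# Hypothetical utterance; indices assume 0-based source tokens.
source = ["get", "directions", "to", "minneapolis"]

# Decoupled linearized target: only intent/slot labels and pointers to slot
# leaves are kept, so non-slot tokens such as "get" never appear in the target.
target = "[IN:GET_DIRECTIONS [SL:DESTINATION @ptr_3 ] ]"

assert source[3] == "minneapolis"  # @ptr_3 points at the fourth source token
```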
Proposed Architecture Based on the observation that many slot values in our task are actual real-life entities, we hypothesize that learning more effective representations of these slot values may result in generating more accurate semantic parses, as mentions play a critical role in understanding the utterance. We use knowledge-enhanced pretrained encoders (as described in Section 2) inside the Seq2Seq-PTR architecture of Rongali et al. (2020), extended to multitask training that combines parse generation with training the encoder to perform token classification (mention detection), as the latter aligns with the classification-based pre-training of the encoder. We anticipate that the multitask training will allow the knowledge-enhanced encoder representations to be attended to and leveraged more effectively by the decoder generating the parse. Further, since we model mentions that are inherently present in the annotated data, this serves low-resource use cases well, maximizing what can be learned from the available data. Figure 1 illustrates our proposed architecture, whereby for a given input utterance [x_1, ..., x_n] we obtain encoder representations [e_1, ..., e_n], from which we jointly learn two tasks: a) Mention Detection and b) Parse Generation.

Mention Detection
We frame this as a token classification task that identifies spans corresponding to mentions using the BIO tagging schema. Given an input sequence containing two mention spans [x_0, x_1] and [x_3], the corresponding target labels are [B-MEN, I-MEN, O, B-MEN], where B marks the beginning of a mention span, I marks a token inside a mention span, and O marks a token outside any mention span. We only use this coarse-grained single entity-type label (MEN) because the labels are not used at inference; they only guide the encoder towards learning better representations for the decoder. We learn these model parameters with a token-level cross-entropy loss:

L_md = - Σ_i log p(y_i | e_i)

where y_i is the BIO label of token x_i.

Parse Generation
Given the first t-1 generated tokens, the decoder generates the token at step t as follows: it first produces a hidden state d_t through multi-layer, multi-head self-attention (MHA) over the encoder hidden states and the decoder states so far, in line with the transformer decoder of Vaswani et al. (2017). The hidden state d_t is fed into a dense layer to produce scores over the target vocabulary, and the weights are learnt using a reconstruction loss L_r.
As the loss scales are similar, we use an equally weighted joint loss combining the losses from both tasks to update the model parameters: L = L_md + L_r.
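A minimal sketch of this joint objective (PyTorch-style; the tensor shapes and the use of -100 as the padding label are illustrative assumptions):

```python
import torch.nn.functional as F

def multitask_loss(md_logits, md_labels, dec_logits, dec_labels):
    """Equally weighted joint loss L = L_md + L_r (sketch).

    md_logits:  (batch, src_len, 3)      B-MEN / I-MEN / O scores per source token
    md_labels:  (batch, src_len)         gold BIO tag ids (-100 on padding)
    dec_logits: (batch, tgt_len, vocab)  decoder scores over the target vocabulary
    dec_labels: (batch, tgt_len)         gold target token ids (-100 on padding)
    """
    l_md = F.cross_entropy(md_logits.flatten(0, 1), md_labels.flatten(), ignore_index=-100)
    l_r = F.cross_entropy(dec_logits.flatten(0, 1), dec_labels.flatten(), ignore_index=-100)
    return l_md + l_r  # equal weighting, since the loss scales are similar
```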

Experimental Setup
Dataset We use the crowdsourced TOPv2 dataset (Chen et al., 2020) for this empirical analysis. The dataset maps user queries to hierarchical representations as exemplified in Figure 1. It contains 8 domains, such as Reminder (used to set alarms and reminders) and Navigation (used to get driving directions and traffic information). Some domains are more complex than others, having larger catalogs and more nested semantics overall. TOPv2 is a relevant testbed for virtual assistant understanding models in low-data settings, as it comes with different data regimes called Samples Per Intent and Slot (SPIS); for example, 10 SPIS means that each intent and slot label is present in only 10 different annotations.
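One plausible way to construct such a split (a greedy sketch; the official TOPv2 SPIS splits are produced by the dataset authors and may be built differently):

```python
from collections import Counter

def spis_subsample(examples, labels_fn, k=10):
    """Greedy Samples-Per-Intent-and-Slot subsampling (illustrative sketch).

    examples  -- list of annotated utterances
    labels_fn -- function returning the set of intent/slot labels in an example
    k         -- desired number of samples per intent and slot label
    """
    counts, subset = Counter(), []
    for ex in examples:
        labels = labels_fn(ex)
        # keep the example if any of its labels is still under-represented
        if any(counts[label] < k for label in labels):
            subset.append(ex)
            counts.update(labels)
    return subset
```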

Mention Distribution
We use the FLAIR NER model (Akbik et al., 2018, 2019) to tag entities and then leverage BLINK (Wu et al., 2020) to link entities and obtain their canonical surface form when available. Entity-type information is only used to facilitate linking. Table 1 shows the entity distribution across the various domains of the TOPv2 dataset. This leads us to pick the following domains for our analysis:
• Event, which has the highest percentage of utterances that contain entities, serving as an ideal candidate to test our hypothesis.
• Reminder, which has the second-fewest entities per utterance. We consider this domain to evaluate the impact of our proposed method on entity-scarce domains.
Because the FLAIR NER tagger is limited to three entity types, Organizations (ORG), Persons (PER) and Locations (LOC), we extend our entity set using slot values present in the TOPv2 annotations. We manually select slot labels that are close to real-life entity types but whose values might not be recognized by the NER tagger. We describe the slots used for each domain in Appendix A.2.
The updated mention distribution is shown in Table 2. Trends between domains stay relatively the same, but significantly more utterances now contain entities. Event and Navigation almost double the average number of entities per utterance: from 1.04 to 1.76 for Event, and from 1.31 to 1.86 for Navigation. For Reminder it remains roughly the same as before (1.03 vs. 1.07); even with these additional slots, there is little salient information to be captured in the form of entities in Reminder.
Our experiments show that using a combination of the entities tagged by FLAIR NER + BLINK and those tagged by the slot-matching mechanism described in Appendix A.2 was more effective than using either method independently. We use the spans of the tagged entities as labels. When the two systems flag overlapping spans of text, longer spans override shorter ones, as in the nested-entity example shown in Appendix A.3.
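A small sketch of this span handling (token-index spans with exclusive ends; the helper names are ours):

```python
def resolve_overlaps(spans):
    """Keep the longest span whenever NER and slot-matching spans overlap (sketch)."""
    kept = []
    for span in sorted(spans, key=lambda s: s[1] - s[0], reverse=True):
        if all(span[1] <= other[0] or span[0] >= other[1] for other in kept):
            kept.append(span)  # no overlap with any longer span already kept
    return sorted(kept)

def bio_labels(num_tokens, spans):
    """Convert (start, end) mention spans into coarse B-MEN / I-MEN / O tags."""
    labels = ["O"] * num_tokens
    for start, end in spans:
        labels[start] = "B-MEN"
        for i in range(start + 1, end):
            labels[i] = "I-MEN"
    return labels

# e.g. bio_labels(4, resolve_overlaps([(0, 2), (0, 1), (3, 4)]))
# == ["B-MEN", "I-MEN", "O", "B-MEN"]
```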
Source Training A common scenario for deployed production systems that serve N domains is to scale to a new (N+1)-th domain. We assume the existing N domains have longer-established, larger datasets that we use as source training data to bootstrap the new domain, on which we then fine-tune and evaluate.
Models Given our resource-constrained setting, all models we evaluate are base variants of the publicly available models, unless specified otherwise. We work with both seq2seq pre-trained transformer models and pre-trained transformer encoders stitched with a transformer decoder, as done in Rongali et al. (2020). We primarily experiment with:
• BART: We use the pre-trained encoder-decoder BART-base (https://huggingface.co/facebook/bart-base) as our baseline for the sequence generation task.
• RoBERTa2BART: We use RoBERTa-base (Liu et al., 2019) (https://huggingface.co/roberta-base) as the encoder and randomly initialize a six-layer decoder in the same configuration as the BART-base decoder. This largely serves as a baseline to LUKE, being a parametric non-knowledge-enhanced encoder, i.e. a vanilla encoder.
• LUKE2BART: We use LUKE-base (https://huggingface.co/studio-ousia/luke-base) as the encoder and randomly initialize a six-layer decoder in the same configuration as the BART-base decoder. LUKE serves as our parametric knowledge-enhanced encoder in the evaluations; it is directly comparable to RoBERTa in architecture and size since we use only its token embeddings and not the entity-entity self-attention layers (for results including these, see Section 6).
Lightweight Architecture Variants As we explore the computation-constrained setting with a limited latency budget, we also implement our models with a Single Layer Decoder (SLD) while keeping the encoder size unchanged. We do this because the largest portion of the latency footprint comes from the passes through the decoder: auto-regressive decoding requires token representations to travel all the way up through the decoder as many times as there are tokens to generate. As such we propose BART2SLD, RoBERTa2SLD, and LUKE2SLD variants with a randomly initialized single-layer decoder. Another angle on latency reduction is non-auto-regressive modeling, such as RINE (Mansimov and Zhang, 2022), a RoBERTa-based approach that achieves state-of-the-art accuracy on the low- and high-resource TOP dataset while being 2-3.5 times faster than auto-regressive counterparts. In this work we experiment with rine-roberta (the original RINE model) and rine-luke, where we instead initialize the encoder weights with the LUKE-base parameters.
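For illustration, a BART-style single-layer decoder can be configured as below (a sketch of the decoder configuration only; the dimensions mirror BART-base, and wiring a RoBERTa or LUKE encoder plus a pointer head onto it requires additional custom code):

```python
from transformers import BartConfig, BartForConditionalGeneration

# Same per-layer configuration as BART-base, but with a single decoder layer.
sld_config = BartConfig(
    d_model=768,
    encoder_layers=6,
    decoder_layers=1,            # the lightweight (SLD) variant
    decoder_attention_heads=12,
    decoder_ffn_dim=3072,
)
model = BartForConditionalGeneration(sld_config)  # randomly initialized weights
```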

Implementation Details We use HuggingFace Transformers (Wolf et al., 2020) for the seq2seq modeling architecture to ensure reproducibility. We do not tokenize intent and slot tags, but instead learn their embeddings from scratch. For all our experiments we use 8 NVIDIA V100 GPUs, with a batch size of 32 per GPU, gradient accumulation of 2, and FP16 enabled. Source training uses a learning rate of 1e-5 over 100 epochs and fine-tuning uses a learning rate of 8e-5 over 50 epochs. Both use the Adam optimizer (Kingma and Ba, 2015). We use beam search decoding with a beam size of 3 and a maximum generation length of 128.
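The fine-tuning configuration, expressed with the HuggingFace Trainer API as a sketch (argument names follow Seq2SeqTrainingArguments; the output directory is a placeholder):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="outputs",              # placeholder path
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,
    learning_rate=8e-5,                # 1e-5 for source training
    num_train_epochs=50,               # 100 for source training
    fp16=True,
)

# Decoding at inference time:
# predictions = model.generate(input_ids, num_beams=3, max_length=128)
```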
Evaluation We report Exact Match (EM) accuracy in line with previous literature (Aghajanyan et al., 2020; Rongali et al., 2020). Exact match accuracy is the most important metric to report as it strictly penalizes any incorrectly generated intermediate token: in a deployed semantic parsing system, even a partially correct parse results in a failed request.
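For clarity, the metric reduces to the following computation (the whitespace normalization here is an illustrative choice):

```python
def exact_match(predictions, references):
    """Exact Match accuracy: a parse counts only if it matches the reference exactly."""
    correct = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```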

Results
All our results are with source training + fine-tuning, unless specified otherwise. We perform 3 runs for each experiment setting and report average scores and standard deviations. Our findings are as follows:
Knowledge-enhanced encoders don't improve generative semantic parsing Table 3 shows results for the six-layer (full) decoder setting and Table 4 shows results for a single-layer decoder. In both the multitasking and non-multitasking settings, the best performing model across data regimes and domains is not consistently the knowledge-enhanced encoder LUKE. In the full decoder setting, LUKE-encoder based models perform on par with, but no better than, the vanilla RoBERTa-encoder based models. We also note that both these models underperform BART, though the gap closes as more training samples are added. In the light-decoder setting we see similar trends; however, an interesting finding is that BART tends to underperform RoBERTa and LUKE, even in the full-data setting. This could be attributed to the smaller encoder size of BART.
The above findings are contrary to the performance improvements typically seen when using knowledge-enhanced encoders for other entity-related tasks such as NER, RE and QA. We believe the reason is that those tasks are all classification-based and can leverage the entity representations directly when making decisions over class types, whereas task-oriented semantic parsing is a complex generation task. Even though entities play a critical role, the entity representations are not able to effectively guide the from-scratch decoder. This problem is alleviated to a certain extent through the multitask training, which we hypothesize jointly learns representations of entities that guide the decoder, but these jointly learnt representations do not necessarily benefit from the knowledge-enhanced encoder. Further, the application of source training potentially wipes out any gains the knowledge-enhanced encoder had over the vanilla counterparts, as the models have seen sufficient data to negate the gains from knowledge enhancement, as discussed in Section 6.
However, knowledge-enhanced encoders can bring gains when parsing is reformulated as a classification task, as shown in Table 5 with the RINE approach, which inserts utterance tokens into a semantic tree by recursively predicting (label, start position, end position) triplets until it predicts termination. We do not penalize misplaced non-semantic tokens in the metric calculation. Recasting the generation task as a classification task is more in line with how LUKE was pre-trained. Further, we do not require any form of source training in this setting. We observe that rine-luke outperforms rine-roberta in most scenarios for the two entity-rich domains, but not on the entity-poor Reminder domain.
Multitasking with mention detection is an efficient way to leverage knowledge and improves performance across the board on the two TOPv2 domains with a strong entity presence (Navigation and Event), especially in the lightweight decoder setting (up to 74.48%, Table 4), but the gains are also non-negligible in the full decoder setting (up to 8.60%, Table 3). When training on a domain with a weak entity presence (Reminder), multitasking acts as noise in the loss and results in a worse performing model for both the full (-31.14%) and lightweight decoder (-82.83%). We also observe a minor regression at 10 SPIS in Event but not in the other data regimes for the domain, leading us to believe this may be an aberration. We find that while knowledge-enhanced encoders outperform their vanilla counterparts in certain settings, such as Navigation with the lightweight decoder trained with multitasking, this behavior is not consistent across domains and decoder settings. Hence, while the gains through multitasking remain consistent throughout, KE encoders do not play a large role in these gains. However, we also find that in the full decoder setting on the Navigation domain, LUKE seems to benefit the most from the multitasking across all data regimes, albeit still performing slightly worse than RoBERTa. Finally, we observe that as more data is added to the training set, the effectiveness of the multitask learning drops drastically. We believe this demonstrates that multitask learning is most effective in the lower-data regime, by leveraging knowledge already available in the data.
Source-training is essential as shown in Table 8, in which KE models on their own are not sufficient to reach reasonable accuracy, as is also the case for BART and has been reported previously. We show that source training improves accuracy by up to 86.36% in the full data regime, with larger percentage gains for LUKE and RoBERTa than for BART, further demonstrating that source training is required to tune the encoders to the generation task, as knowledge-enhanced pre-training is typically classification-based. We also find that source training drastically improves performance in low-data regimes, with gains of up to 1262.20%. However, as more training data is made available, the impact of source training drops quickly. In the absence of further pretraining of KE models, source training is a required step and can in fact be viewed as a pretraining step. We also explored whether using a pre-trained decoder from BART-base improves performance, but found no significant gains, so we omit those results for brevity.

Case Study on Knowledge-enhanced encoders
To better understand the lack of a performance boost from KE encoders, we take a deeper dive into LUKE as well as two alternative KE encoders.
Further enhancements to LUKE only result in limited gains In our previous experiments we restrict ourselves to LUKE's token embeddings to make a fair comparison with RoBERTa. However, the original LUKE encoder comes with many more parameters, including the entity-entity self-attention that allows us to leverage richer entity embeddings. We explore using the entity embeddings in various forms, as reported in Table 6. luke2bart+linked entities finds the corresponding entity representation in LUKE's entity vocabulary and concatenates the embedding to the token representation. We also explore luke2bart+unlinked entities, which does not rely on finding a match in LUKE's entity vocabulary but instead generates the entity embedding from the surface form alone. While the two aforementioned approaches are run only on entities tagged by FLAIR NER and linked with BLINK, we also try luke2bart+multitask entities, where the setup is similar to luke2bart+unlinked entities but leverages the larger entity set used for the multitasking, with an entity embedding for each surface form. We find that luke2bart+linked entities is the most effective methodology at 10 SPIS (+2.5 EM); however, the gains are neutralized as data is added (-0.2 EM). luke2bart+unlinked entities serves as a slightly more resource-efficient way of improving performance as it skips the need to link entities before using them (+1.56 EM). Most interestingly, and in contrast to the multitask learning setup, we find that merely concatenating representations of the slot values in luke2bart+multitask entities actually hurts model performance (-0.86 EM). We believe the reason is that, without the jointly learnt embeddings, a higher number of concatenations to token representations introduces more noise than useful information, especially in low-data settings where there is insufficient data to learn the many additional parameters. Lastly, along the same lines of having too many parameters to learn from too little data, we additionally find that in the pointer-generator network used by the decoder, Dot Product Attention (DPA) is more effective than Multi-Head Attention (MHA), as it has fewer parameters to learn.
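A sketch of the parameter-free dot-product pointer scoring, in contrast to MHA which adds learned projection matrices (tensor shapes here are illustrative):

```python
import torch

def pointer_scores_dpa(d_t, encoder_states):
    """Dot-product attention scores over source positions (sketch).

    d_t:            (batch, d_model)          decoder hidden state at step t
    encoder_states: (batch, src_len, d_model) encoder representations e_1..e_n
    Returns unnormalized scores over source tokens, i.e. over the @ptr_i outputs.
    """
    return torch.einsum("bd,bsd->bs", d_t, encoder_states)
```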
Other KE encoders than LUKE lead to similar conclusions We explore two other knowledge-enhanced encoders: KBIR and BLINK. KBIR is potentially better suited as it is pre-trained to exploit keyphrases, which are closer to slot values than entities. However, Table 7 shows that KBIR performs worse than its LUKE and RoBERTa counterparts (-3.87 EM). Using BLINK as the pre-trained encoder also results in sub-par performance (-6.33 EM). This further strengthens our claim that knowledge-enhanced encoders do not automatically enhance model performance. However, multitasking still largely benefits both these encoders, with BLINK making the largest gains of up to 23.19%.
Any potential KE encoder gains are diluted by Source Training We further investigated whether KE encoders could have had a larger impact with less source training, e.g. over fewer training epochs. We plot training curves for all our settings in Figure 2. Our main observation is that in the multitask setting LUKE outperforms RoBERTa in the single-layer decoder setups early in training. However, as we train over more steps, the performance of the two models converges. Further, in all other settings LUKE shows no discernible edge over RoBERTa during source training.

Conclusion & Future Work
We presented an empirical analysis of how external knowledge can be leveraged for task-oriented semantic parsing in low-resource and low-compute settings, by conducting a rigorous set of experiments. We demonstrated that simply using a knowledge-enhanced encoder is not sufficient to improve performance over baselines for the complex task of sequence generation, but shows promising results when the task is reformulated as a classification task. We presented a multitask learning framework that leverages external knowledge and requires little to no extra data annotation, and demonstrated its effectiveness in the low-data and low-compute settings. Future work could probe the type of knowledge learned by this method and attempt to apply it to other entity-rich tasks, across model architectures. It could also include an in-depth error analysis of where knowledge-enhanced encoders fail in order to address these shortcomings. Further, this work could be extended to retrieval-based seq2seq models to improve task-oriented semantic parsing.

Limitations
We concede that there are differences in the number of parameters between the BART models and their RoBERTa and LUKE counterparts. However, as per our result discussions and observations, the gains are orthogonal to the encoder used, and the differences between the base models are not as significant when comparing the larger counterparts. We also explored seq2seq pre-trained knowledge-enhanced models such as KeyBART and GENRE, but both resulted in underwhelming performance compared to BART; further exploration is required to improve performance for such models. We also note that while we demonstrate gains by switching to a classification-based approach with RINE, such models are limited in other generation capabilities such as translation or summarization. We will release the data and code used for this work, but emphasize that some processing was done over the raw TOPv2 dataset, namely reconstructing source utterances directly from the provided target instead of using the provided source, as we encountered mismatches when constructing pointers. The source was then lowercased.

Ethics Statement
We use publicly available data sets in our experiments with permissive licenses for research experiments. We do not release new data or annotations as part of this work.