Diagnosing Transformers in Task-Oriented Semantic Parsing

Modern task-oriented semantic parsing approaches typically use seq2seq transformers to map textual utterances to semantic frames composed of intents and slots. While these models are empirically strong, their specific strengths and weaknesses have largely remained unexplored. In this work, we study BART and XLM-R, two state-of-the-art parsers, across both monolingual and multilingual settings. Our experiments yield several key results: transformer-based parsers struggle not only with disambiguating intents/slots, but surprisingly also with producing syntactically-valid frames. Though pre-training imbues transformers with syntactic inductive biases, we find the ambiguity of copying utterance spans into frames often leads to tree invalidity, indicating span extraction is a major bottleneck for current parsers. However, as a silver lining, we show transformer-based parsers give sufficient indicators of whether a frame is likely to be correct or incorrect, making them easier to deploy in production settings.


Introduction
Task-oriented semantic parsing, the task of mapping textual utterances to semantic frames, is a critical component of modern conversational AI systems (Gupta et al., 2018; Aghajanyan et al., 2020). Recent methodology casts parsing as transduction, using seq2seq pre-trained transformers to produce linearized parse trees (Aghajanyan et al., 2020; Chen et al., 2020; Li et al., 2021); here, each frame token is either copied from the utterance or generated from an ontology. Compared to explicit grammar-based approaches (Gupta et al., 2018), this plug-and-play use of transformers simplifies the learning objective and scales to multilingual settings, but the lack of provenance makes it challenging to understand model behavior "under the hood."

Figure 1: Example decoupled semantic frame representation (Aghajanyan et al., 2020) for the utterance "Directions to the Warriors game."

In this work, we investigate the strengths and weaknesses of transformer-based semantic parsers and provide modeling directions based on data-driven insights. Specifically, we study BART and XLM-R (Conneau et al., 2020), two state-of-the-art conversational semantic parsers, on both monolingual (TOP/TOPv2; Gupta et al., 2018; Chen et al., 2020) and multilingual (MTOP; Li et al., 2021) datasets. The compositionality of utterances in these datasets provides a strong testbed for resolving both complex syntactic structure and semantic ambiguity, mirroring the types of challenges our parsers are likely to encounter in practice.
We design our experiments around three main questions. First, broadly speaking, what types of errors do transformer-based parsers make? We begin by annotating 500+ predicted frames across 6 languages, categorizing each error with a fine-grained type. We find transformer-based parsers struggle not only with classification (i.e., disambiguating intents/slots) but also with planning (i.e., switching between copying and generating). Planning errors are more egregious: misplacing a close bracket, for example, can violate tree constraints, rendering the entire frame unusable.
Next, we investigate transformer-based parsers' abilities to generate syntactically-valid trees. Specifically, are planning mistakes caused by general uncertainty, or worse, a pathology of seq2seq learning? To address this, we devise an oracle setting where a model conditions on partially gold information (either utterance spans or syntactic structure) and predicts the remaining parts of the frame. Surprisingly, we find conditioning on gold spans, not gold structures, results in near-perfect trees at most depths, pointing towards span extraction as a major bottleneck for current parsers.

Finally, though transformer-based parsers are susceptible to error, ideally, we should be able to proactively diagnose mistakes. Using features from model generations (e.g., confidence), can we intrinsically judge if a sequence is correct or incorrect? Encouragingly, we show that a confidence estimation system combining a transformer-based parser and a feature-based classifier can detect correct frames with 90%+ F1, indicating usability in production settings.
Experimental Setup

Each dataset sample consists of a textual utterance x and a (linearized) semantic frame y. Here, frames are in decoupled form (Aghajanyan et al., 2020): each token is derived either by copying from the utterance or by generating from the ontology (see Figure 1). Following prior work, we fine-tune seq2seq transformers to maximize the log likelihood of the gold frame token at each timestep: ∑_{(x,y)} ∑_t log P(y_t | y_{<t}, x; θ).

On TOP/TOPv2, we fine-tune BART, a seq2seq transformer pre-trained with a denoising autoencoder objective on monolingual corpora. On MTOP, we fine-tune XLM-R (Conneau et al., 2020), a transformer encoder pre-trained with a masked language modeling objective on multilingual corpora; because XLM-R lacks a decoder, we attach a randomly-initialized one (see Table 2). Table 3 shows model performance as judged by exact match. Hyperparameters for all models are listed in Table 4.

Error Analysis
In this section, we seek to better understand the types of errors transformer-based parsers make across both monolingual and multilingual settings.

Error Types
To standardize our analysis, we categorize model errors under the following types: intent (incorrect intent prediction), slot (incorrect slot prediction), out-of-domain (incorrect out-of-domain intent prediction), mode (confusion between copying an utterance token and generating an ontology token), and leaf (incorrect span in a frame leaf slot). In addition, we report the syntactic validity of parse trees separately, though we note mode errors typically result in invalid constructions.

One complicating factor is that a predicted sequence may contain several errors, and because decoding is conducted autoregressively, a given error may be influenced by earlier errors (if any exist). Therefore, to reduce the number of confounding variables, we only consider errors made with gold history, i.e., positions t where the prediction diverges from the gold frame y* but argmax_{y_i} P(y_i | y*_{<i}, x) = y*_i for all earlier positions i < t; put another way, we only count the first error in a sequence.
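The gold-history convention can be sketched as a simple divergence check over token sequences. The frame tokens below are hypothetical examples, not drawn from the annotated data.

```python
def first_error_index(predicted, gold):
    """Return the index of the first position where the predicted frame
    diverges from the gold frame, or None if no divergence is found.

    Mirrors the annotation protocol: only the first error counts, since
    under gold-history decoding every earlier token equals the gold token,
    so later mistakes cannot be blamed on compounding earlier ones.
    """
    for i, (p, g) in enumerate(zip(predicted, gold)):
        if p != g:
            return i
    return None

# Hypothetical prediction with a slot error at position 1.
pred = ["[in:get_directions", "[sl:destination", "Warriors", "]", "]"]
gold = ["[in:get_directions", "[sl:event", "Warriors", "game", "]", "]"]
```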
Using the framework discussed above, we annotate 700 errors from BART and XLM-R on TOP and MTOP, respectively: 100 errors from TOP and 6×100 errors from MTOP (100 per language).

Results
Table 5 benchmarks overall model performance and Figure 2 categorizes errors with fine-grained types; from these results, we draw the following conclusions.

Transformer-based parsers struggle with both classification and planning. In the seq2seq formulation, models must jointly classify (i.e., provide intent and slot labels) and plan (i.e., switch between copying and generating) when producing a semantic frame. Our results show intent/slot and mode errors, which generally fall under the themes of classification and planning, respectively, account for nearly 70-80% of errors. A key observation, however, is that classification and planning error statistics are relatively consistent across languages, suggesting our models may not need language-specific fine-tuning to address these particular errors.
Nearly 40% of incorrectly predicted frames are syntactically invalid. Surprisingly, a large percentage of incorrectly predicted frames violate tree constraints; for linearized frames, this means the number of open brackets ([in: or [sl:) does not match the number of close brackets (]). Though well-formedness is correlated with depth, we see tree validity (1) is not substantially improved by increasing the number of monolingual samples (TOP → TOPv2) and (2) drops off quite rapidly for multilingual samples (TOP/TOPv2 → MTOP).
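The well-formedness check described here amounts to a single pass over the linearized frame, tracking bracket depth. The token prefixes follow the decoupled format shown in the paper; the specific frames below are illustrative.

```python
def is_valid_frame(tokens):
    """Check bracket well-formedness of a linearized decoupled frame.

    Open brackets are intent/slot tokens (prefixed "[in:" / "[sl:");
    the close bracket is "]". A valid tree never closes more brackets
    than are currently open and ends at depth exactly zero.
    """
    depth = 0
    for tok in tokens:
        if tok.startswith("[in:") or tok.startswith("[sl:"):
            depth += 1
        elif tok == "]":
            depth -= 1
            if depth < 0:  # close bracket with no matching open
                return False
    return depth == 0

valid = ["[in:get_directions", "[sl:destination", "Warriors", "]", "]"]
missing_close = ["[in:get_directions", "[sl:destination", "Warriors", "]"]
```

A frame failing this check is unusable downstream regardless of how accurate its labels are, which is why validity is reported separately from the error types.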
Span extraction is more challenging in multilingual settings. Leaf errors in English (TOP|MTOP)-en are typically about half as frequent as those in non-English languages MTOP-(es|fr|de|hi|th).
Upon closer inspection, we find most leaf errors in English are relatively benign; the model may drop a preposition when copying a span (e.g., Monday as opposed to on Monday). However, for languages beyond English, extracted spans in leaf slots typically consist of hallucinated or duplicated subwords, which are much more serious in nature. Finally, though languages with non-projective structures (e.g., German) can populate leaf slots with non-contiguous spans, we noticed errors on these types of samples were infrequent.
Out-of-domain detection is also a significant source of error. TOP, in particular, mixes the canonical semantic parsing task with out-of-domain detection by assigning such utterances the frame [in:unsupported ]. Though well-motivated, roughly 20% of errors are related to incorrect out-of-domain predictions, suggesting our models have not precisely learned the boundary between in-domain and out-of-domain utterances. If high detection accuracy is preferred, multi-tasking parsers in this fashion may not be an effective use of parameters (assuming more data is not available); instead, out-of-domain detection can be conducted independently with alternate methodology (Gangal et al., 2019).

Syntactic Structure
Our case study above demonstrates transformer-based parsers produce syntactically-invalid frames at a high rate. These structural errors are more serious than disambiguation errors: they render the frame unusable, potentially causing cascading failures in a task-oriented dialog system. Therefore, in this section, we dive deeper into why tree constraints are not satisfied and whether perfect tree validity is achievable.
While transduction models do not explicitly impose tree constraints, there is precedent that strong neural representations do implicitly model tree structures; recent studies demonstrate large-scale pre-training, in particular, imbues strong notions of syntax (Goldberg, 2019; Jawahar et al., 2019; Tenney et al., 2019). Taking these results together, we hypothesize that transformer representations may be "good enough," but that there exist ambiguous aspects of task-oriented semantic parsing itself which cause tree invalidity.
Previously, we saw transformer-based semantic parsers largely struggle with classification- and planning-related errors. Therefore, the question we pose is: if we resolve these ambiguities by creating oracle models, can we achieve perfect tree validity? This setup also enables us to gain a deeper understanding of the upper-bound performance of transformer-based semantic parsers, even as their representations get stronger.

Oracle Models. Because classification and planning target inherently different phenomena, creating an oracle that simultaneously makes both less ambiguous is challenging. Instead, we experiment with two separate oracles, span oracle and structure oracle models, which map an utterance x along with a "partially gold" snippet z to the frame y, inducing the objective ∑_{(x,y,z)} ∑_t log P(y_t | y_{<t}, x, z; θ). Providing z as input helps the model learn y \ z: span oracle models receive gold spans and optimize for correct structure, while structure oracle models receive gold structure and optimize for correct spans. Table 6 shows example source and target pairs for the regular, span oracle, and structure oracle models.

Figure 3 shows the oracle model results; we measure both exact match and tree validity error. A key phenomenon we observe is that conditioning on gold spans results in near-zero tree validity error at most depths. Surprisingly, conditioning on gold structures (to stress, the exact syntactic structure) never consistently results in well-formed trees, especially as depth increases. Structure oracle models still suffer from mode errors during generation: augmenting a leaf span with an extra word instead of placing a close bracket, for example, is a typical mistake. Furthermore, this problem is magnified in MTOP, which connects to the notion that span extraction tends to be difficult in multilingual settings.
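One way to picture the two oracles is as different projections of the gold frame appended to the source. The serialization below (separator token, mask token, token prefixes) is an illustrative assumption, not the exact format from Table 6.

```python
BRACKETS = ("[in:", "[sl:")

def make_oracle_source(utterance, frame_tokens, oracle):
    """Build a hypothetical source sequence for the oracle experiments.

    - "span" oracle: append the gold leaf spans (copied tokens), so the
      model only has to predict labels and bracket structure.
    - "structure" oracle: append the gold bracket skeleton with leaf
      spans masked, so the model only has to fill in spans.
    """
    is_struct = lambda t: t.startswith(BRACKETS) or t == "]"
    if oracle == "span":
        snippet = [t for t in frame_tokens if not is_struct(t)]
    elif oracle == "structure":
        snippet = [t if is_struct(t) else "<mask>" for t in frame_tokens]
    else:
        raise ValueError(oracle)
    return utterance.split() + ["<sep>"] + snippet

frame = ["[in:get_directions", "[sl:event", "Warriors", "game", "]", "]"]
span_src = make_oracle_source("Directions to the Warriors game", frame, "span")
struct_src = make_oracle_source("Directions to the Warriors game", frame, "structure")
```

Under this framing, the span oracle removes all copy ambiguity from the target, which is exactly the condition that yields near-zero tree validity error.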
Our experiments suggest seq2seq transformer-based parsers can achieve near-perfect tree validity, even at large depths, provided that span extraction is precise. Currently, however, span extraction is a major source of ambiguity our parsers are not well-equipped to handle, especially when scaling to languages beyond English.

Confidence Estimation
Despite the criticisms we have presented of state-of-the-art, transformer-based conversational semantic parsers, these models do demonstrate strong performance over prior baselines and correctly parse the vast majority of samples. A property that can make these models easier to deploy in practice is if they "know what they don't know" (Desai and Durrett, 2020); besides interpretability, this is particularly useful for identifying and correcting errors in tail scenarios via active learning (Dredze and Crammer, 2008; Duong et al., 2018; Sen and Yilmaz, 2020). We frame this problem as confidence estimation (Blatz et al., 2004): given an utterance x, predicted frame y′, and gold frame y, we seek to learn a binary classifier that uses target-side features f(y′) to estimate P(y′ = y) = sigmoid(w⊤f(y′)).
To make our approach as generalizable as possible, we constrain f(y′) to be model-agnostic and recall-oriented. We select the following features: (1) length: |y′|; (2) validity: the difference between the number of tokens of y′ in V+ and in V−, where V+ and V− are the sets of open and close brackets, respectively; and (3) confidence: (1/|y′|) ∑_t P(y′_t | y′_{<t}, x). Using our best transformer-based parsers, we obtain predictions on a held-out set D_dev and test set D_test. Then, we train and test an SVM on D_dev and D_test, respectively, using the features defined above.

Table 7: Precision (P), recall (R), and F1 of the SVM-based confidence estimator. -x indicates an ablation of feature x (i.e., it is omitted during learning).
In addition to the standard hinge loss, we add a class imbalance penalty, as positive examples are typically 5-8× as prevalent depending on the dataset. We chiefly evaluate the binary classifier's ability to identify semantic frames which are correct (i.e., the positive class). From an active learning standpoint, getting positive samples wrong is more serious than getting negative samples wrong; annotation resources are best directed towards boundary or incorrect predictions.

Table 7 shows the performance and ablations of our confidence estimator. In both monolingual and multilingual settings, using transformer-based features, we can detect correct semantic frames with 90%+ F1. In particular, we see length and validity largely capture the space of correct frames (recall), while confidence effectively distinguishes between correct and incorrect frames (precision). Practitioners may select an SVM variant depending on whether precision or recall is preferred.
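The feature extraction and scoring side of the estimator can be sketched as follows. The exact form of the validity feature was not fully stated, so the signed bracket-count difference below is an assumption; the weights and probabilities are placeholders (in the paper, weights come from training an SVM with hinge loss and a class imbalance penalty).

```python
import math

OPEN_BRACKETS = ("[in:", "[sl:")  # members of V+; "]" is the sole member of V-

def confidence_features(frame_tokens, token_probs):
    """Target-side features f(y') for the confidence estimator:
    length, validity (assumed: open-minus-close bracket count), and
    mean per-token probability under the parser.
    """
    n_open = sum(1 for t in frame_tokens if t.startswith(OPEN_BRACKETS))
    n_close = sum(1 for t in frame_tokens if t == "]")
    return [
        float(len(frame_tokens)),            # length: |y'|
        float(n_open - n_close),             # validity: 0 for balanced frames
        sum(token_probs) / len(token_probs), # confidence: mean P(y'_t | y'_<t, x)
    ]

def estimate_correct(features, weights, bias=0.0):
    """P(y' = y) = sigmoid(w^T f(y')), with placeholder weights."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

feats = confidence_features(
    ["[in:x", "[sl:y", "a", "]", "]"],
    [0.9, 0.8, 0.7, 0.95, 0.99],
)
```

In practice one would fit the linear weights with a class-weighted SVM rather than hand-setting them; only the feature definitions are meant to track the paper.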

Conclusion
In this work, we assess the strengths and weaknesses of seq2seq transformers for task-oriented semantic parsing. These models "know what they don't know", making them easier to deploy in practice, but cannot perfectly model compositional utterances, as indicated by the challenges of span extraction. We believe that modeling efforts in this direction, as opposed to simply annotating more data, can improve parsers substantially.