Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models

End-to-end spoken language understanding (SLU) systems are gaining popularity over cascaded approaches due to their simplicity and ability to avoid error propagation. However, these systems model sequence labeling as a sequence prediction task, diverging from its well-established token-level tagging formulation. We build compositional end-to-end SLU systems that explicitly separate the added complexity of recognizing spoken mentions in SLU from the NLU task of sequence labeling. By relying on intermediate decoders trained for ASR, our end-to-end systems transform the input modality from speech to token-level representations that can be used in the traditional sequence labeling framework. This composition of ASR and NLU formulations in our end-to-end SLU system offers direct compatibility with pre-trained ASR and NLU systems, allows performance monitoring of individual components, and enables the use of globally normalized losses like CRF, making these systems attractive in practical scenarios. Our models outperform both cascaded and direct end-to-end models on the labeling task of named entity recognition across SLU benchmarks.


Introduction
Sequence labeling (SL) is a class of natural language understanding (NLU) tasks. These systems tag each word in a sentence to provide insights into the sentence structure and meaning (Jurafsky and Martin, 2009). An SL system that processes unstructured text first encodes the context and relationships of words in the sentence using an encoder and then labels each token (Lample et al., 2016; Dozat et al., 2017; Akbik et al., 2018). However, when dealing with spoken utterances, sequence labeling introduces the additional complexity of also recognizing the mentions of the labels (Kubala et al., 1998; Zhai et al., 2004). SL in spoken language understanding (SLU) has been approached by two schools of thought: (1) approaches that recognize the spoken words using an Automatic Speech Recognition (ASR) engine and then tag the mentions using an NLU engine in a cascaded manner (Palmer and Ostendorf, 2001; Horlock and King, 2003; Béchet et al., 2004), and (2) approaches that recognize and tag the mentions directly from speech in an end-to-end (E2E) framework (Arora et al., 2022; Ghannay et al., 2018). Prior work has shown that cascaded systems suffer from error propagation (Tran et al., 2018) from the ASR into the NLU engine, which can be overcome in an E2E framework. However, unlike cascaded models, E2E systems cannot utilize the vast abundance of NLU research (Shon et al., 2022), as they re-define the SL problem as a complex sequence prediction problem where the sequence contains both the tags and their mentions.

Our code and models are publicly available as part of the ESPnet-SLU toolkit: https://github.com/espnet/espnet. (* Equal contribution. Siddharth is now at Google.)
Inspired by the principles of task compositionality in SL for SLU, we seek to bring both schools of thought together. Our conjecture is that we can build compositional E2E systems that first convert the spoken utterance to a sequence of token representations (Dalmia et al., 2021), which can then be used to train token-wise classification systems as per the NLU formulation. By also conditioning our token-wise classification on speech, our compositional E2E system allows recovery from errors made while creating token representations. We instantiate our formulation on a popular SL task, named entity recognition (NER), and (1) present the efficacy of our compositional E2E NER-SLU system on benchmark SLU datasets (Bastianelli et al., 2020; Shon et al., 2022), surpassing both the cascaded and direct E2E systems (§5.2). (2) Our compositional model consists of ASR and NLU components compatible with pre-trained ASR and NER-NLU models (§5.3). (3) Our E2E systems exhibit transparency towards categorizing errors by enabling the evaluation of individual components of our model in isolation (§5.4).
The paper first describes the traditional SL formulation (§2) and discusses shortcomings in current SLU formulations (§3). Section 4 presents our compositional E2E model that can overcome these shortcomings. We then evaluate these approaches on the SL task of NER (§5).

Sequence Labeling (SL)
SL systems tag each word, w_i, of a text sequence, S = {w_i ∈ V | i = 1, . . . , N}, of length N with vocabulary V, with a label from a label set L, {w_i → y_i | y_i ∈ L}. This produces a label sequence, Y = {y_i ∈ L | i = 1, . . . , N}, of the same length N. Using decision theory, sequence labeling models seek to output

    Ŷ = argmax_{Y ∈ L^N} F(Y, S),    (1)

where L^N is the set of all possible tag sequences and F(Y, S) is the global score of the tag sequence Y given S. This is modeled using a linear-chain CRF, which computes the global score as a sum of local scores f(·) over the positions in Y:

    F(Y, S) = Σ_{l=1}^{N} f(y_{l−1}, y_l, S).

Lample et al. (2016) and Yan et al. (2019) use contextualized neural encoders like LSTMs and Transformers to model the context of the entire sequence S for every word w_l. This allows effective modeling of f(·) by using the encoder representation h_l of each word as the emission score and maintaining a separate transition score t_{y_{l−1}→y_l}, giving

    F(Y, S) = Σ_{l=1}^{N} ( h_l^{y_l} + t_{y_{l−1}→y_l} ).

Token Classification Model: Since the advent of strong contextual modeling using Transformer-based models, sequence labeling can also be treated as token classification (Devlin et al., 2019), a simplification over MEMM estimation (McCallum et al., 2000), under the assumption that the current tag is conditionally independent of the previous tag:

    P(Y | S) = Π_{l=1}^{N} P(y_l | h_l).
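As a concrete illustration of the linear-chain CRF score above, the following toy sketch computes F(Y, S) as the sum of per-position emission scores h_l^{y_l} and transition scores t_{y_{l−1}→y_l}. All labels, emission values, and transition values here are illustrative, not from the paper:

```python
# Toy sketch of the linear-chain CRF global score F(Y, S): per-position
# emission scores h_l[y_l] plus transition scores t[(y_{l-1}, y_l)].
LABELS = ["O", "PER", "LOC"]

def global_score(emissions, transitions, tags, start):
    """emissions: list of {label: score} per position (the encoder's h_l);
    transitions: dict (prev_label, label) -> score; start: scores for l = 1."""
    score = start[tags[0]] + emissions[0][tags[0]]
    for l in range(1, len(tags)):
        score += transitions[(tags[l - 1], tags[l])] + emissions[l][tags[l]]
    return score

emissions = [{"O": 0.1, "PER": 2.0, "LOC": 0.3},
             {"O": 1.5, "PER": 0.2, "LOC": 0.1}]
transitions = {(a, b): (0.5 if a == b == "PER" else 0.0)
               for a in LABELS for b in LABELS}
start = {"O": 0.0, "PER": 0.0, "LOC": 0.0}

print(global_score(emissions, transitions, ["PER", "O"], start))  # 2.0 + 0.0 + 1.5 = 3.5
```

Normalizing exp(F(Y, S)) over all Y in L^N gives the CRF likelihood; the token classification model instead normalizes each position locally.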
These models are still effective, as h_l is able to model the full context S for every word w_l. In cases like NER, where an entity can span multiple words, these problems are modeled using BIO tags (Ramshaw and Marcus, 1995), where begin (B) and inside (I) tags are added for entities and an outside (O) tag for non-entity words, extending the tag set vocabulary from L to L′ = {l_B ⊕ l_I | l ∈ L} ∪ {O}. When modeling with sub-word tokens, the tags can be aligned to the first sub-word token of each word, and the remaining sub-tokens can be marked with a special token ∅, giving L′′ = L′ ∪ {∅}.
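The sub-word alignment described above can be sketched as follows; the toy `bpe` function is a hypothetical stand-in for a real BPE tokenizer:

```python
# Sketch of aligning word-level BIO tags to sub-word tokens: the tag goes on
# a word's first sub-token and the remaining sub-tokens get the placeholder ∅.
def bpe(word):
    # hypothetical tokenizer: split words into 3-character pieces
    return [word[i:i + 3] for i in range(0, len(word), 3)]

def align_tags(words, tags, pad="∅"):
    subtokens, subtags = [], []
    for word, tag in zip(words, tags):
        pieces = bpe(word)
        subtokens.extend(pieces)
        subtags.extend([tag] + [pad] * (len(pieces) - 1))
    return subtokens, subtags

words = ["play", "madonna"]
tags = ["O", "B-artist"]          # BIO tags over L' = {B-l, I-l | l in L} ∪ {O}
toks, aligned = align_tags(words, tags)
print(toks)     # ['pla', 'y', 'mad', 'onn', 'a']
print(aligned)  # ['O', '∅', 'B-artist', '∅', '∅']
```

The label sequence stays in one-to-one correspondence with the sub-token sequence, so the token classification or CRF layer can operate unchanged over the extended tag set L′′.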

Sequence Labeling in SLU
Sequence labeling in SLU introduces the added complexity of recognizing mentions on top of text-based SL tasks (§2), as these systems aim to predict the tags and their mentions directly from a spoken sequence. Given a sequence of d-dimensional speech features of length T frames, X = {x_t ∈ R^d | t = 1, . . . , T}, these systems seek to estimate the label sequence

    Ŷ = argmax_Y P(Y | X),

where P(Y | X) has been modeled as follows:

Cascaded SLU (Béchet et al., 2004; Parada et al., 2011; Zhou et al., 2015) models P(Y | X) from P(Y | S) using an NLU framework (§2) and P(S | X) using an ASR model (Povey et al., 2011; Chan et al., 2016; Graves, 2012), assuming conditional independence of Y and X given S:

    Ŝ = argmax_S P(S | X).

Once Ŝ is estimated, Ŷ can be estimated using Eq 1. Although this enables realizing Ŷ using two well-studied frameworks, the independence assumption does not allow recovery from errors in estimating Ŝ.
Direct End-to-End SLU (Arora et al., 2022; Shon et al., 2022; Ghannay et al., 2018) systems avoid cascading errors by directly modeling P(Y | X) in a single monolithic model. To achieve this while being able to recognize the spoken mentions, these systems enrich Y with the transcript S, giving Y^e = {y^e_i ∈ V ∪ L | i = 1, . . . , N′}, where N′ is the length of Y^e. This can be modeled using an autoregressive decoder as:

    P(Y^e | X) = Π_{i=1}^{N′} P(y^e_i | y^e_{<i}, X).

However, this new formulation cannot utilize the well-studied sequence labeling framework (§2). Additionally, it places the extra burden of labeling along with alignment on the decoder and makes understanding the errors made by these systems particularly difficult. For example, the formulation above gives non-zero likelihood to a corrupt sequence with only labels and no words, as y^e ∈ V ∪ L.
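To make the enriched target Y^e concrete, the sketch below interleaves entity tags with transcript words. The bracketing format and the helper names are assumptions for illustration; actual systems vary in how they serialize tags into the output sequence:

```python
# Illustration of the enriched target sequence Y^e used by direct E2E SLU:
# the output vocabulary mixes words (V) and tags (L), so the decoder must
# jointly transcribe, align, and label.
def enrich(words, spans):
    """spans: list of (start, end, label) over word indices, end exclusive."""
    out = []
    for i, w in enumerate(words):
        for s, e, label in spans:
            if i == s:
                out.append(f"<{label}>")
        out.append(w)
        for s, e, label in spans:
            if i == e - 1:
                out.append(f"</{label}>")
    return out

words = "change the bedroom lights to green".split()
spans = [(2, 3, "house_place"), (5, 6, "color")]
print(enrich(words, spans))
# ['change', 'the', '<house_place>', 'bedroom', '</house_place>',
#  'lights', 'to', '<color>', 'green', '</color>']
```

Note that the target is longer than the transcript (N′ > N) and that nothing in the output vocabulary prevents a decoder from emitting tags without any words, which is exactly the corrupt-sequence failure mode discussed above.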

Compositional End-to-End SLU
We propose to bring the two paradigms together in a compositional end-to-end system, by extending the cascaded SLU formulation using the searchable intermediates framework (Dalmia et al., 2021):

    P(Y | X) = Σ_S P(Y | S, X) P(S | X) ≈ max_S P(Y | S, X) P(S | X).

This system can be realized with two sub-networks, as shown in Figure 1: an ASR sub-net that models P(S | X) and an NLU sub-net that models P(Y | S, X).

Figure 1: Schematics of our compositional E2E SLU architecture with ASR and NLU sub-nets. The ASR sub-net consists of an encoder and a decoder. The NLU sub-net consists of an encoder that conditions on both speech information via encoder_ASR and text information via decoder_ASR's hidden representations h^ASR, followed by a token classification or CRF layer.
The end-to-end differentiability is maintained by using the ASR decoder's hidden representations h^ASR_{1:N} as the input to the NLU sub-net. During inference, we approximate the Viterbi max over S using beam search to give ĥ^ASR_{1:N}. Then Ŷ can be found using Viterbi search with no approximation, as the output length is known and the solution is tractable.
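The exact Viterbi search over tags, once the token representations (and hence the output length N) are fixed, can be sketched as below. All scores are illustrative toy values, not from a trained model:

```python
# Sketch of exact Viterbi decoding in the NLU sub-net: dynamic programming
# over tag sequences under emission + transition scores.
def viterbi(emissions, transitions, labels):
    """emissions: per-position {label: score}; transitions: (prev, cur) -> score."""
    best = {y: emissions[0][y] for y in labels}  # best score ending in y at pos 0
    backptrs = []
    for em in emissions[1:]:
        scores, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p] + transitions[(p, y)])
            scores[y] = best[prev] + transitions[(prev, y)] + em[y]
            ptr[y] = prev
        backptrs.append(ptr)
        best = scores
    y = max(labels, key=lambda l: best[l])       # best final label
    path = [y]
    for ptr in reversed(backptrs):               # follow back-pointers
        y = ptr[y]
        path.append(y)
    return list(reversed(path))

labels = ["O", "B-date", "I-date"]
emissions = [{"O": 0.0, "B-date": 2.0, "I-date": 0.0},
             {"O": 1.0, "B-date": 0.0, "I-date": 1.5}]
transitions = {(p, y): 0.0 for p in labels for y in labels}
transitions[("O", "I-date")] = -10.0   # discourage I-date directly after O
transitions[("B-date", "I-date")] = 1.0
print(viterbi(emissions, transitions, labels))  # ['B-date', 'I-date']
```

The contrast with the ASR side is the point: the transcript length is unknown, so Ŝ needs approximate beam search, while Ŷ over a fixed-length tag lattice is exactly decodable.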
This composition allows incorporating both the ASR modeling and the text-based sequence labeling framework (§2). It also brings transparency to end-to-end modeling, as we can monitor the performance of individual sub-nets in isolation. Further, encoder_NLU can attend to the speech representations h^E_{1:T} using cross attention (Dalmia et al., 2021), enabling the direct use of speech cues for NLU. This speech attention mechanism can allow the model to recover from intermediate errors made during the ASR stage.
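A minimal single-head sketch of this speech attention is shown below: NLU token states act as queries over the speech encoder frames h^E_{1:T}. Real models use learned projections and multi-head attention; identity projections and toy 2-dimensional states are assumed here purely for illustration:

```python
import math

# Minimal single-head cross-attention: NLU token states (queries) attend over
# speech encoder frames h^E (keys/values), mixing speech cues into each token.
def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def cross_attend(queries, frames):
    """queries: N x d token states; frames: T x d speech states (h^E)."""
    d = len(frames[0])
    out = []
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in frames]
        weights = softmax(logits)            # attention over the T frames
        out.append([sum(w * v[j] for w, v in zip(weights, frames))
                    for j in range(d)])
    return out

tokens = [[1.0, 0.0]]                 # one token state (d = 2)
speech = [[5.0, 0.0], [0.0, 5.0]]     # two speech frames
ctx = cross_attend(tokens, speech)    # context pulled mostly from frame 1
```

The resulting context vector is dominated by the frame most similar to the token state, which is how the NLU sub-net can consult the audio even when the intermediate transcript is wrong.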
Recently, there have been some works (Rao et al., 2020; Saxon et al., 2021) that explore compositional SLU models which utilize the ASR and NLU formulations. Saxon et al. (2021) use discrete outputs from the ASR module that are made differentiable using approaches like Gumbel-softmax (Jang et al., 2017). Rao et al. (2020) also use the ASR decoder hidden representations in the NLU module, concatenating them with token embeddings of the discrete ASR output. However, this approach requires the ASR and NLU sub-modules to have a shared vocabulary space, limiting the use of pre-trained ASR and LM models in this architecture. Moreover, the benefits of our proposed compositional framework are not explored in these works.

Spoken Named Entity Recognition
To show the effectiveness of our compositional E2E SLU model, we build spoken NER systems on two publicly available SLU datasets, SLUE (Shon et al., 2022) and SLURP (Bastianelli et al., 2020) (dataset and preparation details in §A.2). We compare our compositional E2E system with cascaded and direct E2E systems. We also compare with another compositional E2E system that predicts the enriched transcript (§3) using a decoder, as in Dalmia et al. (2021), instead of predicting the label sequence (i.e., Y^e instead of Y) with a token-level classification sub-network. We refer to this baseline model as "Compositional E2E SLU with Direct E2E formulation".
SLURP is evaluated using SLU-F1 (Bastianelli et al., 2020), which weights the entity labels by the word and character error rates of the predicted mentions, and SLUE using F1 (Shon et al., 2022), which requires getting both the mention and the entity label exactly right. We also compute Label-F1 for both datasets, which considers only the entity label. We report micro-averaged F1 for all results.
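The distinction between the two evaluation views can be sketched as follows: F1 matches exact (mention, label) pairs, while Label-F1 matches labels only, with counts pooled over the corpus (micro-averaging). The toy predictions are illustrative, and this simplified matcher omits the error-rate weighting that SLU-F1 applies:

```python
from collections import Counter

# Micro-averaged F1 over per-utterance item multisets.
def micro_f1(refs, hyps):
    """refs/hyps: per-utterance lists of items, matched as multisets."""
    tp = fp = fn = 0
    for ref, hyp in zip(refs, hyps):
        overlap = sum((Counter(ref) & Counter(hyp)).values())
        tp += overlap
        fp += len(hyp) - overlap
        fn += len(ref) - overlap
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

refs = [[("edinburgh", "place"), ("today", "date")]]
hyps = [[("edinburg", "place"), ("today", "date")]]   # one misspelled mention
f1 = micro_f1(refs, hyps)                             # pair-level F1 = 0.5
label_f1 = micro_f1([[l for _, l in u] for u in refs],
                    [[l for _, l in u] for u in hyps])  # label-only F1 = 1.0
```

The example shows why Label-F1 is always at least as forgiving as F1: a misrecognized mention with the correct tag loses pair credit but keeps label credit.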

Model Configurations
We build all our systems using ESPnet-SLU (Arora et al., 2022), an open-source SLU toolkit built on ESPnet (Watanabe et al., 2018), a flagship toolkit for speech processing. We use an encoder-decoder architecture for our baseline E2E system, with Conformer encoder blocks (Gulati et al., 2020) and Transformer decoder blocks (Vaswani et al., 2017) trained with CTC multi-tasking (Arora et al., 2022). The baseline compositional model with the direct E2E SLU formulation consists of a Conformer encoder and a Transformer decoder in its ASR component, and a Transformer encoder and a Transformer decoder in its NLU component. Our proposed compositional model with the NLU formulation is shown in Figure 1 (architecture details in §A.3.3).

Main Results

Table 1 shows that our proposed compositional E2E models with the token-level NLU formulation outperform both cascaded and direct E2E models on all benchmarks, using both CRF and token classification. To understand the gains of our proposed model, we examine the performance of our compositional system with the direct E2E formulation (§3). While comparable to direct E2E models, it still lags behind our proposed models, showing the efficacy of modeling SL tasks as token-level tagging (§2) in an E2E SLU framework. We further analyze our compositional systems that do not attend to speech representations. We observe a performance drop, as these models are not able to recover from errors made while "recognizing" entity mentions. For example, in an utterance that says "change the bedroom lights to green", even though the ASR component incorrectly predicts the transcript as "change the color of lights to green", the NLU component with speech attention is able to recover the entity type HOUSE_PLACE.

Utilizing External Sub-Net models
The components of our compositional E2E SLU model have functions similar to an ASR and an NLU model. This allows fine-tuning our models using sub-systems pre-trained on large amounts of available sub-task data. Table 2 shows that our compositional model has better compatibility with ASR and NLU fine-tuning than direct E2E systems, thereby increasing their performance gap, particularly for SLUE, an under-resourced SLU dataset. Further, our models can use transcripts from a strong external model (S_ext) directly during inference, by instantiating our models with these transcripts to produce h^ASR and then evaluating P(Y | S_ext, X). Table 2 shows that using transcripts from an external ASR with no fine-tuning steps can achieve performance similar to ASR fine-tuning.

Transparency in Compositional E2E SLU
Following the compositional factorization in §4, we can estimate ASR performance by calculating Ŝ using beam search, and NLU performance by estimating Ŷ from P(Y | S_GT, X), where S_GT is the ground-truth transcript. Table 3 shows the performance of the individual components of our model along with that of ASR-only and NLU-only models, suggesting that we can effectively monitor the performance of these components, helping practitioners analyze and debug them. For instance, while our models with and without speech attention have comparable ASR performance, using speech attention improves NLU performance. Further, the one-to-one alignment of transcripts and sequence labels provides finer categorization of errors, as shown in §A.4.

CRF vs Token Classification
For practical SLU, the likelihood P(Y | S, X) of our compositional model should be correlated with errors in the label sequence Y. We found that on SLURP, our compositional E2E SLU model shows no correlation when using locally normalized token classification (Corr=0.13, p=0), while using a CRF exhibits moderate correlation (Corr=0.43, p=0). This makes globally normalized models attractive for real-world scenarios like automated data auditing and human-in-the-loop ML (Mitchell et al., 2018), despite their marginal additional computation cost.
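One way to compute such a correlation is Pearson's r between per-utterance negative log-likelihoods and label-error counts; the sketch below shows the computation on synthetic toy values (not the paper's data):

```python
import math

# Pearson correlation between model confidence and label errors, the kind of
# analysis behind the reported Corr values. Paired values below are synthetic.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

neg_log_lik = [0.2, 0.9, 1.4, 2.1, 3.0]   # -log P(Y|S, X) per utterance
num_errors = [0, 1, 1, 2, 3]              # label errors per utterance
r = pearson(neg_log_lik, num_errors)      # strongly positive for this toy data
```

A high r means low-confidence predictions can be routed to humans or auditing queues, which is what makes the globally normalized CRF scores practically useful.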

Conclusion
We propose to combine the text-based sequence labeling framework with the speech recognition framework to build a compositional end-to-end model for SLU. Our compositional E2E models not only show superior performance over cascaded and direct end-to-end SLU systems, but also bring the strengths of both these systems into a single framework. These models can utilize pre-trained sub-task components and exhibit transparency like cascaded systems, while avoiding error propagation like direct end-to-end systems.

Limitations
Our compositional model relies on the availability of transcripts for training. Although this is a limitation, it is a safe assumption for sequence labeling tasks in spoken language understanding. As §3 shows, sequence labeling in SLU also requires the model to recognize the words being spoken along with the sequence labels, implying the need for at least a partial transcript even when training direct end-to-end SLU systems.

Broader Impact
With our compositional end-to-end SLU model, we strive to bring research from text-based sequence labeling directly into speech-based spoken language understanding. Our aim is to avoid reinventing the wheel and instead find innovative ways to build end-to-end models by converting a complex problem into simpler ones that have seen substantial research in the past. Additionally, we believe the increased capacity for error analysis in our compositional end-to-end system can help build better practical systems for deployment. Our compositional end-to-end systems can effectively utilize pre-trained ASR and NLU systems, thereby avoiding the need to collect large labeled datasets for SLU. This framework also saves compute by utilizing pre-trained ASR systems directly during inference to improve downstream performance with no fine-tuning.

A.1 Applications of SLU
SLU is an essential component of many commercial devices like voice assistants, home assistants (Yu et al., 2019;Coucke et al., 2018) and spoken dialog systems (Nguyen and Yu, 2021) that map speech to executable commands on a daily basis. One of the key applications of SLU is to extract key mentions like entities from a user command to take appropriate actions. As a result, several datasets (Bastianelli et al., 2020;Shon et al., 2022;Del Rio et al., 2021) have been proposed to build understanding systems for spoken utterances.

A.2 Dataset Description
We evaluated our proposed approach on the publicly available SLUE (Shon et al., 2022) and SLURP (Bastianelli et al., 2020) datasets, on the task of Named Entity Recognition (NER) from natural speech. SLURP is a linguistically diverse and challenging spoken language understanding benchmark consisting of single-turn user conversations with a home assistant, annotated with both intents and entities. Following prior work (Bastianelli et al., 2020; Arora et al., 2022), we bootstrap our train set with 43 hours of synthetic data for all our experiments. We evaluate our approach using SLU-F1 (Bastianelli et al., 2020), a metric for spoken entity prediction, and Label-F1, which considers only entity-tag predictions. SLUE is a recently released SLU benchmark that focuses on spoken language understanding from limited labeled training data; specifically, we use the SLUE-VoxPopuli dataset, which can be used to build systems for ASR and NER. Similar to Shon et al. (2022), we evaluate our systems using two micro-averaged F1 scores: F1, which evaluates named entity and tag pairs, and Label-F1, which evaluates only the entity tags. Note that the released test sets are blind, without ground-truth labels, and hence we compare different methods using the development set.

A.3 Experimental Setup
Our models are implemented in PyTorch (Paszke et al., 2019), and the experiments are conducted using the ESPnet-SLU toolkit (Arora et al., 2022).

A.3.1 Speech Preprocessing
Speech inputs are 80-dimensional log-mel filterbanks with global mean-variance normalization, computed at a 16 kHz sampling rate with a window size of 512 samples and a hop length of 128 samples. For the under-resourced SLUE dataset, we apply speed perturbation with factors of 0.9 and 1.1 to increase the number of samples. We also apply SpecAugment (Park et al., 2019) on both datasets, and remove all examples shorter than 0.1 seconds or longer than 20 seconds from the training data.
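Two of these steps, duration filtering and global mean-variance normalization, can be sketched as below. The feature values are toy numbers rather than real filterbanks, and for brevity the statistics are computed over all values at once, whereas toolkits typically accumulate per-dimension statistics over the whole training corpus:

```python
# Sketch of preprocessing: drop utterances outside the 0.1-20 s range, and
# apply global mean-variance normalization to the features.
def keep(duration_sec, lo=0.1, hi=20.0):
    return lo <= duration_sec <= hi

def global_mvn(features, eps=1e-8):
    """features: list of frames, each a list of filterbank values."""
    flat = [v for frame in features for v in frame]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    std = (var + eps) ** 0.5
    return [[(v - mean) / std for v in frame] for frame in features]

norm = global_mvn([[1.0, 2.0], [3.0, 4.0]])  # zero-mean, unit-variance output
```

Normalizing to zero mean and unit variance keeps the encoder's inputs in a consistent range regardless of recording loudness.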

A.3.2 Text Processing
For the cascaded system, we process the ASR transcripts S using BPE tokenization (Kudo and Richardson, 2018) and train ASR models to generate BPE sub-tokens. We use a BPE size of 500 for SLURP and 1000 for SLUE. For the direct E2E models, we predict the enriched label sequence Y^e using the same BPE size as the ASR models in the cascaded system. Similarly, the compositional models also use the same BPE size to generate the ASR transcripts.
To create the BIO tags, we modify the data preparation to take the entities for each utterance and create a "label utterance". This consists of a one-to-one mapping of label tags to words, with Begin (B), Inside (I), and Outside (O) marked for each label. After performing BPE tokenization, we add ∅ for every subsequent sub-token of a word. We have attached the data preparation code.

A.3.3 Model and Training Hyperparameters
We run parameter search for both direct end-to-end and our compositional end-to-end systems using the same model search space (Table 5). In this section, we will describe our best architecture for both direct and compositional E2E systems.
Direct E2E SLU systems: After searching through the hyperparameter space, our direct E2E SLU system for the SLURP dataset consists of a 12-layer Conformer (Gulati et al., 2020) encoder and a 6-layer Transformer (Vaswani et al., 2017) decoder with 8 attention heads. We use a dropout of 0.1, an output dimension of 512, and a feedforward dimension of 2048, giving a total parameter size of 109.3 M.
For the SLUE dataset, we found that a 12-layer Conformer encoder with 4 attention heads and a 6-layer Transformer decoder with 4 attention heads gave the best validation performance. We use a dropout of 0.1, an output dimension of 256, and a feedforward dimension of 1024 in the encoder and 2048 in the decoder, giving a total parameter size of 31.2 M.
Compositional E2E SLU systems: Our compositional model using the direct E2E SLU formulation consists of a 12-layer Conformer encoder and a 6-layer Transformer decoder in its ASR component, and a 4-layer Transformer encoder and a 6-layer Transformer decoder in its NLU component. Each of these attention blocks has 8 attention heads, a dropout of 0.1, an output dimension of 512, and a feedforward dimension of 2048, giving a total of 153.9 M parameters for the SLURP dataset. For the SLUE dataset, each attention block has 4 attention heads, a dropout of 0.1, an output dimension of 256, and a feedforward dimension of 1024 in the encoder and 2048 in the decoder, giving a total parameter size of 46.8 M.
Our compositional model with the proposed NLU formulation replaces the NLU component of the direct E2E formulation with an 8-layer Transformer encoder followed by a linear layer. All these attention blocks have 8 attention heads, a dropout of 0.1, an output dimension of 512, and a feedforward dimension of 2048, giving a total of 142.9 M parameters for the SLURP dataset. For the SLUE dataset, each of these attention blocks has 4 attention heads, a dropout of 0.1, an output dimension of 256, and a feedforward dimension of 1024 in the encoder and 2048 in the decoder, giving a total parameter size of 43.8 M. Our NLU component can further attend to speech representations using cross attention (Dalmia et al., 2021). We implement the CRF loss using a publicly available Python library.

A.3.4 Decoding Hyperparameters
We keep the same decoding parameters (beam size and penalty) as Arora et al. (2022). For direct E2E systems and our models, a CTC weight of 0.1 worked best; we searched over CTC weights of [0, 0.1, 0.3, 0.5].

A.3.5 Development Results
We use F1 scores on the validation data to select the best hyperparameters. Table 6 presents the validation performances for our models.

A.3.6 Compute Infrastructure
Our models were trained using mixed-precision training on A100, V100, or A6000 GPUs, depending on their availability on our compute infrastructure. Depending on the GPU and file I/O latency, training took 4-7 hours for SLUE and 12-18 hours for SLURP.

Figure 2: Qualitative examples of our compositional E2E SLU model for various error categories. We can observe that in the first case, the model correctly predicts both entity types and mentions even when the name "mona" is not a common name for an event. In the second case, even though it predicts the correct ASR transcript, it mislabels "Edinburgh" as a news topic, since the phrase "is there anything happening" usually occurs with news topics. In the third case, even though it makes a mistake in the person name, the model correctly tags it as a person. Finally, the model incorrectly generates the word "ninety," and this error propagates to the NLU component through the token representations, which then predicts the entity type "date". This analysis shows that the alignment between ASR and NLU outputs can help us gain better insights into model performance.

A.3.7 External ASR and NLU components
For the experiments in Table 2, we used ASR and NLU models trained on external data. For ASR fine-tuning, we used an ESPnet model trained on the GigaSpeech dataset (Chen et al., 2021a). This model has the same architecture as the baseline direct E2E model on SLURP. We initialize both the encoder and decoder for direct E2E SLU, and the ASR sub-net for the compositional E2E SLU model. For NLU fine-tuning, we used CANINE (Clark et al., 2022), a character-based BERT language model, which exhibits strong performance on named entity recognition while being able to model token sizes comparable to our SLU systems. We initialize our NLU sub-network without speech attention with CANINE and keep its parameters fixed during training. To find the best parameters, we tuned only the learning rate and LR schedule from Table 5, and report the best numbers among the CRF and token classification losses. For using external ASR transcripts, we trained ASR systems initialized from GigaSpeech and WavLM (Chen et al., 2021b) respectively, which were then fine-tuned on the respective datasets. These systems achieve 10.0% WER on SLURP and 9.2% WER on SLUE.

A.4 Error Categorization
The predictions made by our compositional E2E SLU model can be categorized into different buckets based on the errors made by the ASR or NER component. We compare these error categories for models trained with and without speech attention. Most of the performance differences between the compositional E2E SLU models with and without speech attention are caused by the kind of error where the ASR prediction is inaccurate but the NLU module is nevertheless able to recover the correct entity type from the utterance. This confirms our intuition that cross attention on speech representations can help the NLU module recover from mistakes made while "recognizing" spoken mentions. We also present anecdotes for each of these error categories in Figure 2. This further emphasizes the transparency of our compositional E2E SLU models. Due to the lack of one-to-one alignment between ASR and sequence labeling, such analysis is not possible in direct E2E SLU systems, making it particularly difficult to categorize errors when the entity prediction is wrong.