ASR Adaptation for E-commerce Chatbots using Cross-Utterance Context and Multi-Task Language Modeling

Automatic Speech Recognition (ASR) robustness toward slot entities is critical in e-commerce voice assistants that involve monetary transactions and purchases. Along with effective domain adaptation, it is intuitive that cross-utterance contextual cues play an important role in disambiguating domain-specific content words from speech. In this paper, we investigate various techniques to improve contextualization, content-word robustness and domain adaptation of a Transformer-XL neural language model (NLM) used to rescore ASR N-best hypotheses. To improve contextualization, we utilize turn-level dialogue acts along with cross-utterance context carry-over. Additionally, to adapt our domain-general NLM towards e-commerce on-the-fly, we use embeddings derived from a masked LM finetuned on in-domain data. Finally, to improve robustness towards in-domain content words, we propose a multi-task model that jointly performs content word detection and language modeling. Compared to a non-contextual LSTM LM baseline, our best performing NLM rescorer yields a content WER reduction of 19.2% on an e-commerce audio test set and a slot labeling F1 improvement of 6.4%.


Introduction
Task-oriented conversations in voice chatbots deployed for e-commerce use cases such as shopping (Maarek, 2018), browsing a catalog, scheduling deliveries or ordering food are predominantly short-form audios. Moreover, these dialogues are restricted to a narrow range of multi-turn interactions that involve accomplishing a specific task (Mari et al., 2020). The back and forth between a user and the chatbot is key to reliably capturing the user intent and the slot entities referenced in the spoken utterances. As shown in previous work (Irie et al., 2019; Parthasarathy et al., 2019; Sun et al., 2021), rather than decoding each utterance independently, there can be benefit in decoding these utterances based on context from previous turns. In grocery shopping, for example, knowing that the context is "what kind of laundry detergent?" should help in disambiguating "pods" from "pause". Another common aspect of e-commerce chatbots is that speech patterns differ among sub-categories of use cases (e.g., shopping for clothes vs. ordering fast food). Hence, some chatbot systems allow users to provide pre-defined grammars or sample utterances that are specific to their use case. These user-provided grammars are then predominantly used to perform domain adaptation of an n-gram language model. Recently, Shenoy et al. (2021) showed that they can also be leveraged to bias a Transformer-XL (TXL) LM rescorer on-the-fly.
While there has been extensive previous work on improving contextualization of TXL LMs using historical context, none of the approaches utilize signals from a natural language understanding (NLU) component such as turn-level dialogue acts. This paper investigates how to utilize dialogue acts along with user-provided speech patterns to adapt a domain-general TXL LM towards different e-commerce use cases on-the-fly. We also propose a novel multi-task architecture for TXL, where the model jointly learns to perform domain-specific slot detection and LM tasks. We use perplexity (PPL) and word error rate (WER) as our evaluation metrics. We also evaluate on downstream NLU metrics such as intent classification (IC) F1 and slot labeling (SL) F1 to capture the success of these conversations. The overall contributions of this work can be summarized as follows:
• We show that a TXL model that utilizes turn-level dialogue act information along with long-span context helps with contextualization and improves WER and IC F1 in e-commerce chatbots.
• To improve robustness towards e-commerce domain-specific slot entities, we propose a novel TXL architecture that is jointly trained on slot detection and LM tasks, which significantly improves content WERR and SL F1.
• We show that adapting the NLM towards user-provided speech patterns, using BERT embeddings derived from domain-specific text, is an efficient and effective method to perform on-the-fly adaptation of a domain-general NLM towards e-commerce utterances.

Related Work
Incorporating cross-utterance context has been well explored with both recurrent and non-recurrent NLMs. With LSTM NLMs, long-span context is usually propagated by not resetting hidden states across sentences or by using longer sequence lengths (Xiong et al., 2018a; Irie et al., 2019; Khandelwal et al., 2018; Parthasarathy et al., 2019). In (Xiong et al., 2018b), along with longer history, information about turn taking and speaker overlap is used to improve contextualization in human-to-human conversations. Building on the transformer architecture based on self-attention (Vaswani et al., 2017), (Dai et al., 2019) showed that by utilizing segment-wise recurrence, Transformer-XL (TXL) is able to effectively leverage long-span context while decoding. More recently, improving contextualization of TXL models has included adding an LSTM fusion layer to complement the advantages of recurrent models with non-recurrent ones (Sun et al., 2021). (Shenoy et al., 2021) incorporated a non-finetuned masked LM fusion to make the domain adaptation of TXL models quick and on-the-fly using embeddings derived from customer-provided data, and incorporated dialogue acts, but only with an LSTM based LM. (Sunkara et al., 2020) fused multi-modal features into a seq-to-seq LSTM based network. In (Sharma, 2020), cross-utterance context was effectively used to perform better intent classification with e-commerce voice assistants.
For domain adaptation, previously explored techniques include using an explicit topic vector as classified by a separate domain classifier and incorporating a neural cache (Mikolov and Zweig, 2019; Li et al., 2018; Raju et al., 2018; Chen et al., 2015). (Irie et al., 2018) used a mixture of domain experts which are dynamically interpolated. It has also been shown that using a hybrid pointer network over contextual metadata can help in transcribing long-form social media audio. Jointly learning NLU tasks such as intent detection and slot filling has been explored with RNN based LMs in (Liu and Lane, 2016) and more recently in (Rao et al., 2020), who show that a jointly trained model consisting of both ASR and NLU tasks, connected through a neural network based interface, helps incorporate semantic information from NLU and improves an ASR system comprising an LSTM based NLM. Other work has incorporated joint slot and intent detection into an LSTM based rescorer with the goal of improving accuracy on rare words in an end-to-end ASR system.
However, none of the previous works utilize dialogue acts with a non-recurrent LM such as Transformer-XL, nor do they optimize towards improving robustness on in-domain slot entities. In this paper, we study the impact of utilizing dialogue acts along with a masked language model fusion to improve contextualization and domain adaptation. Additionally, we propose a novel multi-task architecture with a TXL LM that improves robustness of in-domain slot entity detection.

Approach
A standard language model in an ASR system computes a probability distribution over a sequence of words W = w_0, ..., w_N auto-regressively as:

P(W) = \prod_{i=1}^{N} P(w_i | w_0, ..., w_{i-1})    (1)

In our experiments, along with historical context, we condition the LM on additional contextual metadata such as dialogue acts:

P(W | C) = \prod_{i=1}^{N} P(w_i | w_0, ..., w_{i-1}, c_1, ..., c_k)    (2)

where c_1, c_2, ..., c_k are the turn-based lexical representations of the contextual metadata. As a baseline, we use a standard LSTM LM, summarized below:

embed_i = E^T w_i
h_i = LSTM(embed_i, h_{i-1})
o_i = W_{ho}^T h_i

where embed_i is a fixed-size, lower-dimensional word embedding and the LSTM outputs are projected to word-level outputs using W_{ho}^T. A Softmax layer converts the word-level outputs into final word-level probabilities.
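This auto-regressive scoring, with and without prepended contextual metadata tokens, can be sketched as follows. The toy conditional distribution below is purely hypothetical; it only illustrates how conditioning on a metadata token can raise the probability of a domain content word.

```python
import math

def score_sequence(cond_prob, words, context=()):
    """Chain-rule log-probability of a word sequence, optionally
    conditioned on prepended contextual metadata tokens."""
    history = list(context)
    logp = 0.0
    for w in words:
        logp += math.log(cond_prob(w, tuple(history)))
        history.append(w)
    return logp

# Hypothetical toy model: a dialogue-act token boosts "pods".
def toy_model(word, history):
    if "DA=inform_intent" in history and word == "pods":
        return 0.5
    return 0.1

with_ctx = score_sequence(toy_model, ["pods"], context=["DA=inform_intent"])
no_ctx = score_sequence(toy_model, ["pods"])
assert with_ctx > no_ctx  # context raises the score of the content word
```

In the actual model the conditioning is learned by the network rather than hand-coded, but the scoring decomposition is the same.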

Transformer-XL based NLM
Although recurrent language models help in modeling long-range dependencies to a certain extent, they still suffer from the fuzzy far-away problem (Khandelwal et al., 2018). Vanilla transformer LMs, on the other hand, use fixed segment lengths, which leads to context fragmentation. To address these limitations and model long-range dependencies, TXL models add segment-level recurrence and use a relative positional encoding scheme (Dai et al., 2019). Hence, we choose to use a TXL LM directly. The cached hidden representations from previous segments help contextual information flow across segment boundaries. If s_k = [w_{k,1}, ..., w_{k,T}] and s_{k+1} = [w_{k+1,1}, ..., w_{k+1,T}] are two consecutive segments of length T and h_k^n is the n-th layer hidden state produced for the k-th segment s_k, then the n-th layer hidden state for segment s_{k+1} is produced as follows:

\tilde{h}_{k+1}^{n-1} = [SG(h_k^{n-1}) ; h_{k+1}^{n-1}]
h_{k+1}^n = TL(\tilde{h}_{k+1}^{n-1})

where SG(.) stands for stop-gradient and TL stands for Transformer Layer. To carry over context from previous turns, we train and evaluate the model by concatenating all the turns, including the bot responses, in a single conversation session. The model is trained with a cross-entropy objective:

L_{LM} = - \sum_{t} \log P(w_t | w_{<t})

During inference, we cache a fixed-length hidden representation from previous segments. We also use the generated bot responses to perform a forward pass and carry over their context to the next user turn.
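The segment-level recurrence and fixed-length cache can be sketched with plain lists standing in for hidden-state tensors. This is a simplified sketch: it ignores the per-layer structure and the stop-gradient, which only matters during training.

```python
def extend_with_memory(mem, seg_hidden, mem_len):
    """Segment-level recurrence: the cached states from previous segments
    are prepended to the current segment's hidden states (what attention
    sees), and the cache carried to the next segment is truncated to the
    most recent `mem_len` positions."""
    extended = mem + seg_hidden
    new_mem = (mem + seg_hidden)[-mem_len:]
    return extended, new_mem

# Three consecutive segments of a conversation; mem_len=3 as a toy value
# (the paper uses a segment and memory length of 25).
mem = []
for seg in [["h1", "h2"], ["h3", "h4"], ["h5", "h6"]]:
    extended, mem = extend_with_memory(mem, seg, mem_len=3)

assert mem == ["h4", "h5", "h6"]           # cache for the next segment
assert extended == ["h2", "h3", "h4", "h5", "h6"]  # context seen last step
```

Because the cache rolls forward across segments, context from earlier turns (including bot responses) remains visible to attention beyond a single segment boundary.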

Slot detection and language modeling multi-task learning
To make our domain-general model robust to e-commerce specific slot entities, we propose a multi-task learning approach to training the TXL LM. We train our models on both LM and slot detection tasks. Similar to slot filling, slot detection is a sequence classification task that involves predicting whether a word w_i at time step i is a domain-specific slot entity. We use a separate slot detection network, consisting of a simple multi-layer perceptron, and feed it the final-layer hidden representation from the TXL network as input. Figure 2 shows an example utterance with the slot annotations. Formally, let s = (s_0, s_1, ..., s_T) be the slot label sequence corresponding to a word sequence w = w_0, w_1, ..., w_T in the k-th segment. We model the slot label output s_t as a conditional distribution over the input word sequence up to time step t, w_{<=t}, similar to (Liu and Lane, 2016):

P(s_t | w_{<=t}) = Softmax(MLP(h_t))

We use a cross-entropy training objective for the slot detection task:

L_{SD} = - \sum_{t} \log P(s_t | w_{<=t})

To incorporate this semantic information about the word from the previous time step into the NLM, we use the logits from the slot detection network to condition the probability distribution of the next word in the sequence, as shown in Figure 1.
The total loss is then computed as a linear combination of the LM and slot detection losses:

L = L_{LM} + α_{SD} · L_{SD}

where α_{SD} is the weight for the slot detection loss.
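A minimal sketch of the combined objective, with scalar logits in place of real network outputs (the α_SD = 0.8 value matches the weight reported in the experimental setup):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target_idx):
    """Negative log-likelihood of the target class under a softmax."""
    return -math.log(softmax(logits)[target_idx])

def multitask_loss(lm_logits, lm_target, sd_logits, sd_target, alpha_sd=0.8):
    """L = L_LM + alpha_SD * L_SD: LM cross-entropy plus weighted
    slot-detection cross-entropy, for one time step."""
    return (cross_entropy(lm_logits, lm_target)
            + alpha_sd * cross_entropy(sd_logits, sd_target))

# Uniform two-way logits for both tasks: each CE term is log(2).
loss = multitask_loss([0.0, 0.0], 0, [0.0, 0.0], 0)
assert abs(loss - 1.8 * math.log(2)) < 1e-9
```

In training, both terms would be summed over all time steps of a segment before backpropagation.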

Transformer-XL LM conditioning on dialogue acts
Dialogue acts (DA) in a conversation represent the intention of an utterance and capture the action that an agent is trying to accomplish (Austin, 1975). An example conversation snippet with DAs is shown in Table 1. DA classification is typically performed in a separate component that is part of a downstream NLU system and consumes the outputs generated by ASR. The classified DA is an important contextual signal that provides hints about the type of speech pattern that can be expected in the next turn. We utilize these signals to train our TXL models. Specifically, we augment the training data with the dialogue act information prefixed to the user turns and surround it with explicit <dialogue_act> tags. The expectation is that the TXL LM learns the usage patterns associated with different dialogue acts, and this information should help narrow down the model's search space to content words relevant to the current dialogue context.
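The augmentation step might look like the following sketch. The exact tag layout and the dialogue act label are illustrative assumptions; the paper only specifies that user turns are prefixed with the DA inside explicit <dialogue_act> tags.

```python
def add_dialogue_act(utterance, dialogue_act):
    """Prefix a user turn with its classified dialogue act, wrapped in
    explicit tags, when preparing TXL LM training data."""
    return f"<dialogue_act> {dialogue_act} </dialogue_act> {utterance}"

line = add_dialogue_act("i want laundry detergent pods", "inform_intent")
assert line == "<dialogue_act> inform_intent </dialogue_act> i want laundry detergent pods"
```

At inference time, the DA predicted for the previous turn by the NLU component would be injected the same way before rescoring the next turn.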

Domain adaptation using contextual semantic embeddings
In production chatbots, it is common for bot developers to provide example speech patterns, in the form of sample sentences or explicit grammars, which can then be used to bias the n-gram language models in an ASR system. This pre-defined set of speech patterns is a useful source of contextual information that can be used to bias NLMs as well. As demonstrated in (Shenoy et al., 2021), embeddings from pretrained language models (Radford et al., 2016; Brown et al., 2020), including masked LMs (MLM), can be used for this purpose. However, the sentence or document embeddings derived from such an MLM without finetuning on in-domain data have been shown to be inferior in terms of their ability to capture semantic information usable in similarity-related tasks (Reimers and Gurevych, 2019). Instead of using the [CLS] vector to obtain sentence embeddings, in this paper we take the average of the context embeddings from the last two layers, as these have been shown to be consistently better than the [CLS] vector (Reimers and Gurevych, 2019; Li et al., 2020).
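The embedding derivation can be sketched as follows, with nested lists standing in for the encoder's hidden-state tensors. In practice the layer states would come from a finetuned BERT model; the toy values here are arbitrary.

```python
def sentence_embedding(layer_states):
    """Average the per-token context embeddings of the last two encoder
    layers (rather than taking the [CLS] vector). `layer_states` is a
    list of layers, each a list of per-token vectors."""
    last_two = layer_states[-2:]
    tokens = [tok for layer in last_two for tok in layer]
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]

# Three layers, two tokens each, 2-dim vectors (toy values):
states = [
    [[0.0, 0.0], [0.0, 0.0]],   # lower layer (ignored)
    [[1.0, 2.0], [3.0, 4.0]],   # second-to-last layer
    [[5.0, 6.0], [7.0, 8.0]],   # last layer
]
assert sentence_embedding(states) == [4.0, 5.0]
```

Embeddings of the individual user-provided sample utterances could then be pooled into a single domain vector before being fused with the TXL hidden state.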
We use a simple fusion method following (Shenoy et al., 2021), where the hidden state from the last layer of the TXL decoder is concatenated with the BERT-derived embedding. This is followed by a single projection layer with a nonlinear activation function σ, such as sigmoid.
h_t = σ(W [h_t^{TXL} ; e_{MLM}] + b)

where h_t^{TXL} is the hidden state from the last transformer decoder layer and e_{MLM} is the BERT-derived embedding from in-domain sample utterances. The intuition here is that the model learns to associate the domain-specific BERT-derived embedding with occurrences of jargon specific to that domain. Thus, providing BERT vectors derived from different domain texts should allow the model to adapt towards those domains on-the-fly.
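A minimal sketch of this fusion layer, with plain lists for vectors and nested lists for the projection matrix (the weight and bias values below are arbitrary toy numbers):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse(h_txl, e_mlm, weights, bias):
    """Concatenate the TXL last-layer hidden state with the BERT-derived
    domain embedding, then apply one projection layer with a sigmoid:
    h = sigmoid(W [h_txl ; e_mlm] + b).
    `weights` is an (out_dim x in_dim) matrix as nested lists."""
    x = h_txl + e_mlm  # vector concatenation
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

# 2-dim TXL state fused with a 1-dim "domain" embedding:
h = fuse([1.0, -1.0], [0.5], weights=[[1.0, 1.0, 2.0]], bias=[0.0])
assert abs(h[0] - sigmoid(1.0)) < 1e-9  # pre-activation: 1 - 1 + 2*0.5 = 1.0
```

Swapping in an embedding derived from a different domain's text changes e_MLM, and hence the fused representation, without retraining the TXL weights.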

Dataset
We required task-oriented dialogue datasets with actor, dialogue acts and slot entities annotated. Since no single dataset was large enough to train a reliable language model, we used a combination of the Schema-Guided Dialogue dataset (Rastogi et al., 2019), MultiWOZ 2.1 (Eric et al., 2019; Budzianowski et al., 2018) and MultiDoGo (Peskov et al., 2019), along with anonymized in-house datasets belonging to two e-commerce use cases: retail and fastfood delivery. The final LM training data consisted of 260k training samples, 56k validation and evaluation samples, and around 9.9 million running words. We used a vocabulary of size 25k. We evaluated our models on anonymized in-house 8kHz close-talk audio. These audios comprised task-oriented conversations with multiple speakers and acoustic conditions representative of real-world usage, and belonged to the same two use cases mentioned above. The average number of turns in the audio dataset was 5.

ASR setup and NLM setup
We used a hybrid ASR model comprising a regular-frame-rate (RFR) model trained with cross-entropy loss, followed by sMBR (Ghoshal and Povey, 2013). The first-pass LM was a domain-general Kneser-Ney (KN) (Kneser and Ney, 1995) smoothed 4-gram model estimated on a weighted mix of datasets spanning multiple domains. The final vocabulary size of the n-gram LM was 500k words. All our NLM rescorers used a 4-layer Transformer-XL decoder, each layer of size 512 with 4 attention heads. The input word embedding size was 512. We used a segment and memory length of 25. During model training we applied a dropout rate of 0.3 to both the slot detection network and the TXL. For the slot detection network we used a 3-layer MLP, with the final-layer hidden representation from the TXL as its input. To obtain the BERT embedding from in-domain speech patterns, we finetuned a huggingface pretrained BERT model on the retail and fastfood text corpora. The derived BERT embedding size was 768. During inference, we extract n-best hypotheses with n <= 50 from the lattice generated by the first-pass ASR model. We rescore the n-best hypotheses by multiplying the acoustic score by the acoustic scale and adding it to the score obtained from the TXL rescorer. We used a fixed α_SD of 0.8 for the slot detection loss. Table 3 summarizes the relative perplexity reductions (PPLR). Since we are optimizing our models to improve on e-commerce domain-specific content words, we directly report the relative content word error rate reductions (CWERR) in Table 2, along with the relative impact on the downstream NLU tasks of IC and SL. For computing CWERR, we remove all stop words, comprising commonly used function words such as conjunctions and prepositions, from the transcriptions and evaluate only on content words. We also report statistical significance of our CWER improvements using the matched pairs sentence segment word error test (MPSSWE).
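The second-pass rescoring step described above can be sketched as follows. Scores are assumed to be log-domain with higher values better; the sign convention and the toy hypotheses are assumptions for illustration.

```python
def rescore_nbest(nbest, acoustic_scale):
    """Second-pass rescoring: scale each hypothesis's acoustic score and
    add the TXL rescorer's LM score, then return the best hypothesis."""
    def total(hyp):
        return hyp["acoustic"] * acoustic_scale + hyp["lm"]
    return max(nbest, key=total)

# The contextual rescorer assigns "pods" a much better LM score than the
# first pass did, flipping the 1-best despite a slightly worse acoustic score.
nbest = [
    {"text": "laundry pause", "acoustic": -10.0, "lm": -8.0},
    {"text": "laundry pods",  "acoustic": -10.5, "lm": -3.0},
]
best = rescore_nbest(nbest, acoustic_scale=1.0)
assert best["text"] == "laundry pods"
```

In the real system the LM score would come from a forward pass of the TXL rescorer over each hypothesis, with the conversation's cached context in memory.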
All the WER numbers are relative to a non-contextual LSTM baseline. The gap in the performance between the two domains we tested on is reflective of the underlying training corpus distribution, which has more text belonging to the fastfood domain.

Results and Discussion
Perplexity gains indicate effective domain adaptation: We report both general-domain and e-commerce domain PPLR. Overall, the contextualization and domain adaptation techniques help, with PPL dropping in both cases. The model jointly trained on in-domain slot detection, however, clearly helps more in the e-commerce case. Moreover, since we used BERT finetuned on e-commerce text, we again see larger gains on the domain-specific test set than on the general-domain test set (23.3% vs 16.3%).
Using system dialogue acts improves intent detection: From our experiments training the TXL LMs with dialogue act information, it is clear that dialogue acts help, with relatively marginal gains in PPL (3.4% on generic and 9.9% on e-commerce) and WER (1.2% Retail, 14.4% Fastfood). Compared to the other techniques we explored, the impact on intent classification was higher in proportion to the gain in WER, which indicates that dialogue acts are valuable contextual signals for recognizing intent-conveying phrases.
Slot detection loss yields improvements on domain-specific content words: Rows 4 and 5 of Table 2 report the content WERR, IC and SL F1s obtained by incorporating the joint LM and slot detection (SD) loss. As expected, the multi-task model improves significantly on content words (1.2% to 4.3% on Retail, 12.3% to 16.3% on Fastfood). This WER improvement also carries over to a higher SL F1 improvement, but a relatively small IC F1 improvement. This again indicates that the improvements are mainly in the recognition of in-domain slot entities, and that the auxiliary function words important for recognizing intents do not benefit as much.
Domain adaptation using BERT fusion provides the maximum gains: Rows 6 and 7 in Table 2 illustrate the performance of the TXL LM that incorporates the BERT embedding fusion layer. Compared to the model trained with the joint slot detection loss, the BERT fusion model performs better on all ASR and NLU metrics. It is evident from the results that BERT embeddings derived from different user-provided texts help the model effectively adapt to the domain the embedding was derived from. The gains are amplified when complemented with the dialogue acts' ability to improve on intent-carrying words and with the joint slot detection model, with WERR improving from 12.3% to 19.2% on the fastfood domain and from 1% to 11.8% on the retail domain. This also carries over to improvements in IC and SL F1 of 3.8% and 4.3% on retail and 2.1% and 6.4% on fastfood.

Conclusion
In this paper, we explored different ways to robustly adapt a domain-general Transformer-XL NLM to rescore N-best hypotheses from a hybrid ASR system for task-oriented e-commerce speech conversations. We demonstrated that a Transformer-XL LM trained with turn-level dialogue acts benefits intent classification by improving the recognition of content words. Additionally, we showed that semantic embeddings derived from a masked language model finetuned on e-commerce text can be used to effectively adapt a domain-general TXL LM for the e-commerce utterance rescoring task. Finally, we introduced a new TXL training loss to jointly predict content words along with the language modeling task; when combined with BERT fusion and dialogue acts, this amplifies the WER, IC F1 and SL F1 gains. We have also shown these improvements to be statistically significant. Future work can look at integrating these methods into an end-to-end ASR system, for both the rescoring task and first-pass LM fusion.