Building Accurate Low Latency ASR for Streaming Voice Search in E-commerce

Automatic Speech Recognition (ASR) is essential for any voice-based application. Streaming capability becomes necessary to provide immediate feedback to the user in applications like Voice Search. LSTM/RNN and CTC based ASR systems are simple to train and deploy for low-latency streaming applications but have lower accuracy than state-of-the-art models. In this work, we build accurate LSTM, attention and CTC based streaming ASR models for large-scale Hinglish (a blend of Hindi and English) Voice Search. We evaluate how various modifications to vanilla LSTM training improve the system's accuracy while preserving its streaming capabilities. We also discuss a simple integration of end-of-speech (EOS) detection with CTC models, which helps reduce the overall search latency. Our model achieves a word error rate (WER) of 3.69% without EOS and 4.78% with EOS, with a ~1300 ms (~46.64%) reduction in latency.


Introduction
As an e-commerce platform in India, we need to cater to a variety of user bases, and a big part of that consists of users who cannot or do not want to type while interacting with the app, e.g., while searching for a product. For such users, interaction via a voice-based interface becomes an essential feature requiring an accurate and efficient Automatic Speech Recognition (ASR) system.
Recent years have witnessed the popularity of end-to-end ASR models, which have achieved state-of-the-art results (Li et al., 2022). These models offer simplified training and inference processes and have demonstrated higher accuracy compared to traditional pipelines with separate acoustic, pronunciation, and language models. Common approaches for end-to-end ASR models include CTC (Connectionist Temporal Classification), AED (Attention-based Encoder-Decoder), and RNNT (RNN-Transducer) (Graves et al., 2006; Chan et al., 2016; Graves et al., 2013).
However, streaming capability plays a pivotal role in choosing the most suitable ASR model. While non-streaming models can leverage the entire audio for text inference, streaming models have access only to past context, which can result in reduced accuracy. Nevertheless, streaming models provide immediate feedback, a critical requirement for consumer-facing applications like Voice Search. Additionally, low inference latency is essential to ensure a user-friendly experience, as delayed feedback can adversely impact usability.
Another challenge in streaming ASR applications is accurately detecting the end of speech (EOS). Conventional methods rely on standalone Voice Activity Detection (VAD) models, which operate independently from the ASR system and may not offer optimal accuracy.
In this work, we focus on developing a streaming ASR system for large-scale Hinglish Voice Search. Our objective is to enhance accuracy and reduce latency while preserving streaming capabilities. Specifically, we propose modifications to an LSTM and CTC based ASR system, aiming to bridge the gap between streaming and non-streaming ASR models. We also present a simple training and inference strategy that enables joint ASR and EOS detection within end-to-end CTC models, effectively reducing user-perceived latency in voice search. The contributions of this research can be summarized as follows:
• Development of an accurate and efficient streaming ASR model based on LSTM, MHA (Multi-Head Attention), and CTC for Hinglish Voice Search;
• Introduction of a straightforward training and inference strategy to enable joint ASR and EOS detection within end-to-end CTC models, addressing the need for accurate EOS detection in streaming applications;
• Analysis of the impact of model modifications on reducing the performance gap between streaming and non-streaming ASR models.
Next, we discuss some related work in Section 2. Section 3 describes the model architecture we use, the EOS integration and the inference method. We talk about the dataset and experimental setup in Section 4. Finally, we conclude with a discussion on results and limitations in Section 5.

Related Work
CTC, the first E2E approach developed for ASR (Graves et al., 2006), has been widely used over the last few years (Soltau et al., 2016). Although it provides simplicity, it makes a conditional independence assumption, namely that the output token at any time step does not depend on past output tokens, which can make it sub-optimal. AED and RNNT models relax this assumption by leveraging past output tokens. While AED models like LAS (Listen, Attend and Spell) (Chan et al., 2016) work very well for non-streaming tasks, they require complex training strategies for streaming scenarios (Chiu and Raffel, 2017). RNNT (Graves et al., 2013) provides a natural alternative in streaming scenarios but has high training complexity and inference latency, rendering it difficult to use in a real-world setting without complex optimizations/modifications (Li et al., 2019; Mahadeokar et al., 2021).
There have been many attempts to improve the accuracy of CTC models while preserving their training and inference simplicity. Fernández et al. (2007) leverage hierarchical structure in the speech by adding auxiliary losses to train a CTC-based acoustic-to-subword model. Their hierarchical CTC (HCTC) model predicts different text segmentations in a fine-to-coarse fashion. Recent studies have explored the use of attention in CTC models to implicitly relax the conditional independence assumption by enriching the features using other time frames. One approach uses component attention and an implicit language model to enrich the context, while Salazar et al. (2019) evaluate a fully self-attention-based network with CTC. In this work, we explore how augmenting an LSTM-based network with windowed self-attention can help improve the transcription while preserving streaming capability.
Another line of work in improving the output of streaming models is second-pass rescoring, which uses an additional (usually non-streaming) component to re-rank the streaming model's hypotheses. While we also rescore the candidate hypotheses at the last step, our system doesn't employ any external acoustic model to do so and instead leverages the hierarchical losses that are part of the model itself.
For addressing EOS detection, conventional approaches use VAD models with a threshold on the amount of silence, which may lead to early termination of user speech. Shannon et al. (2017) address this by training an EOQ (End-of-Query) classifier, which performs better than VAD but is still optimized independently of the ASR system. VAD based on output CTC labels has also been explored to detect EOS based on the length of the non-speech (blank) region (Yoshimura et al., 2020). Prior work jointly trains an RNNT model for EOS detection by using an extra </s> token with early and late penalties; prediction of the </s> token by the model during inference serves as the signal for EOS. We follow a similar approach where we train the model with early and late penalties. During inference, we use a dynamic threshold on the </s> probability to detect the endpoint before decoding the text.

Model Architecture
Inspired by Fernández et al. (2007), we build a 3-level HCTC architecture based on LSTM and attention as shown in Fig. 1. Going in a fine-to-coarse fashion, the model predicts characters (73 tokens), short subwords (300 tokens) and long subwords (5000 tokens) at the respective levels. Each level consists of an N-layer LSTM-attention block (N being 5, 5 and 2) followed by a linear softmax layer. A time convolution layer with a kernel size of 5 and a stride of 3 after the second level reduces the number of time steps to one-third. This helps emit longer subwords at the third level by increasing the context and receptive field of a time frame. Along with the HCTC loss, we use label smoothing (Szegedy et al., 2016) by adding a negative entropy term to it, which mitigates overconfidence in output distributions, leading to improved transcription. Mathematically, the loss for a given training sample, (x, y) = (x, {y_char, y_s300, y_s5k}), is:

L(x, y) = Σ_{l ∈ {char, s300, s5k}} L_CTC(x, y_l) − λ Σ_t H(P_t)

where L_CTC(x, y_l) is the CTC loss at level l, H(P_t) is the entropy of the output distribution at time step t, and λ weights the label-smoothing term.

For an N-layer LSTM-attention block (Fig. 2), we stack N LSTM layers with 700 hidden dimensions, followed by a dot-product based multi-headed self-attention (MHA) layer (Vaswani et al., 2017). We use 8 attention heads and project the input to 64-dimensional key, query and value vectors for each head. We project the 512 (8x64) dimensional output back to 700 dimensions and pass it through a linear layer with ReLU activation. To retain the model's streaming capabilities, we restrict the attention to a 5-frame window (t±2) instead of the complete input, i.e., for input features f_t, we use Q(f_t) as the query vector and K(f_{t−2:t+2}), V(f_{t−2:t+2}) as key-value vectors, where Q, K and V are linear projections. To improve the gradient flow, we add a skip connection and layer normalization after each layer.
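As a concrete sketch of the windowed attention described above, the following NumPy snippet implements a minimal single-head version that restricts each query frame to a 5-frame (t±2) key-value window; the actual model uses 8 heads with learned projections, so treat this as illustrative only.

```python
import numpy as np

def windowed_self_attention(f, W_q, W_k, W_v, window=2):
    """Single-head windowed dot-product self-attention (hedged sketch).

    f: (T, d) input frames. Attention for frame t is restricted to
    frames t-window .. t+window, mirroring the paper's 5-frame window.
    """
    T, _ = f.shape
    Q, K, V = f @ W_q, f @ W_k, f @ W_v
    d_k = W_k.shape[1]
    out = np.zeros((T, W_v.shape[1]))
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        # scaled dot-product scores over the local window only
        scores = Q[t] @ K[lo:hi].T / np.sqrt(d_k)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[t] = w @ V[lo:hi]
    return out
```

Because keys and values are drawn only from a fixed local window, each output frame needs at most two future frames, which is what preserves the model's streaming operation.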
We use 80 filterbanks from a standard log-mel spectrogram as inputs, computed with a window of 20ms, a stride of 10ms, and an FFT size of 512. To prevent overfitting, we use time-frequency masking (Park et al., 2019) during training. We also stack five adjacent frames with a stride of three, giving an input feature vector of 400 dimensions with a receptive field of 60ms and a stride of 30ms for each time step. Windowed MHA and time convolution increase the overall receptive field and stride to 780ms and 90ms respectively. Consequently, our model has a forward lookahead of 390ms when deployed in streaming mode.
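The quoted receptive-field and stride numbers can be checked with standard kernel/stride composition. The sketch below is a back-of-the-envelope verification, under the assumption that frame stacking, each of the three windowed MHA layers (a 5-frame window acts like a kernel-5, stride-1 layer), and the time convolution are the only operations that widen the receptive field (the LSTM layers are causal and add no symmetric lookahead):

```python
def compose(layers, r0=20.0, j0=10.0):
    """Receptive field r (ms) and stride j (ms) after composing layers.

    Each layer is (kernel, stride): r_out = r_in + (k-1)*j_in, j_out = j_in*s.
    Initial values come from the log-mel frontend: 20 ms window, 10 ms hop.
    """
    r, j = r0, j0
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r, j

layers = [
    (5, 3),  # frame stacking: 5 adjacent frames, stride 3
    (5, 1),  # level-1 windowed MHA (t±2)
    (5, 1),  # level-2 windowed MHA
    (5, 3),  # time convolution: kernel 5, stride 3
    (5, 1),  # level-3 windowed MHA
]
r, j = compose(layers)  # r = 780.0 ms, j = 90.0 ms
```

With symmetric windows, half of the 780ms receptive field lies in the future, consistent with the 390ms forward lookahead stated above.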

Speech End-pointing
Once we have a trained ASR model, we augment the vocabulary with an additional </s> token and use forced alignment to get the ground-truth speech endpoints. We use the output from the 1st (character) level of the ASR model for alignment as it has the least lookahead and empirically works better than the output from other blocks. We append the extra </s> token at the end of each transcript and add early-late (EL) penalties to the training loss to fine-tune the model for a few more iterations. EL penalties penalize the model for predicting </s> too early or too late. During online inference, we determine if the current time step t is the speech endpoint by evaluating the following conditions:
• There is at least one word in the output text, to avoid termination before the user starts speaking;
• </s> is the most probable token among all vocab items, i.e., P_t(</s>) ≥ P_t(v) for every vocab item v; we call this an EOS peak;
• P_t(</s>) ≥ threshold_t = α / (1 + n_t/β), where n_t is the number of EOS peaks before time t.
Thus, the earliest time step satisfying the above conditions is the EOS. Here, α controls the aggressiveness of EOS detection: decreasing α lowers the EOS threshold for all time steps, resulting in an earlier EOS signal. Empirically, we observe that the model assigns a lower probability to the </s> token after each EOS peak. To address this, we add the n_t/β term, which gradually reduces the threshold whenever an EOS peak appears, giving an additional (but marginal) reduction in latency. For audios where the above conditions are never satisfied, a combination of a small independent VAD model and a maximum time limit works as a backup.
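The endpointing rule above can be sketched as follows. The function name and the α, β defaults are illustrative only; the paper tunes α and β via a parameter sweep:

```python
def detect_eos(eos_probs, max_probs, has_word, alpha=0.6, beta=10.0):
    """Hedged sketch of the dynamic-threshold EOS rule.

    eos_probs[t] : P_t(</s>)
    max_probs[t] : max over the vocab of P_t(token)
    has_word[t]  : True once at least one word has been decoded
    Returns the first time step flagged as EOS, or None.
    """
    n_peaks = 0  # n_t: number of EOS peaks strictly before time t
    for t, (p, p_max, word_ok) in enumerate(zip(eos_probs, max_probs, has_word)):
        is_peak = word_ok and p >= p_max      # </s> is the arg-max token
        if is_peak:
            threshold = alpha / (1.0 + n_peaks / beta)
            if p >= threshold:
                return t
            n_peaks += 1  # peak seen but below threshold; lowers future thresholds
    return None
```

Each sub-threshold peak increments n_t, so the effective threshold decays toward earlier termination on later peaks, matching the behaviour described above.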

Decoding and Re-scoring
For each chunk of input audio stream, we use prefix beam search, with a beam size of 1000 hypotheses, to decode the text from probability distribution given by the last (subword 5000) level. We use the same probability distribution to detect EOS as well.
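The paper decodes with prefix beam search over 1000 hypotheses; as a much simpler illustration of decoding text from the frame-level CTC distribution, a greedy decoder (take the arg-max per frame, collapse repeats, drop blanks) can be sketched as:

```python
def ctc_greedy_decode(frame_probs, blank=0):
    """Greedy CTC decoding: arg-max per frame, collapse repeats, drop blanks.

    A simplified stand-in for the prefix beam search used in the paper;
    frame_probs is a T x V list of per-frame token distributions.
    """
    out, prev = [], blank
    for probs in frame_probs:
        tok = max(range(len(probs)), key=probs.__getitem__)
        if tok != blank and tok != prev:
            out.append(tok)
        prev = tok  # a blank between repeats allows the same token twice
    return out
```

Prefix beam search generalizes this by keeping many candidate prefixes and summing probability over all alignments of each prefix, which is what produces the top-100 candidates used for rescoring.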
When we observe an EOS or the stream ends, a 5-gram KenLM and the HCTC loss (sum of CTC losses from all levels) are used to re-rank and select the best hypothesis from the top 100 candidates. We use grid search to find the weights of the scores.
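A minimal sketch of this rescoring step is shown below; the scoring callables for the KenLM log-probability and the per-level CTC losses are hypothetical stand-ins, and the weight values would come from grid search as described above:

```python
def rescore(hypotheses, lm_score, hctc_score, weights):
    """Re-rank beam hypotheses with a weighted sum of scores (hedged sketch).

    lm_score(h)         : language-model log-probability of hypothesis h
    hctc_score(h, lvl)  : CTC log-likelihood of h at level lvl
    weights             : {"lm": w, "ctc": {level: w, ...}}
    Returns the highest-scoring hypothesis.
    """
    def score(h):
        s = weights["lm"] * lm_score(h)
        for level, w in weights["ctc"].items():
            s += w * hctc_score(h, level)
        return s
    return max(hypotheses, key=score)
```

Because each level's CTC loss scores the hypothesis independently, the weighted combination behaves like a small ensemble of rankers, which is the intuition given for why the hierarchical loss helps rescoring.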

Dataset and Training Setup
Queries from E-commerce Voice Search are our primary source of data. We also collect speech from other sources like on-call customer support, crowdsourced read speech, etc., to augment the training data. We transcribe all the utterances, except read speech, using an existing ASR system and manually correct them. The ASR system that generates reference transcripts progressively improves as part of model iterations. Collectively, the training datasets amount to ∼14M audio-text pairs (8M from the target domain and 6M from other sources) or roughly 22.5k hours of audio. For evaluation, we randomly sample ∼19k audios from e-commerce voice search queries, transcribe them manually (without any reference text) and reduce the human error by using multiple iterations of verification. We categorize the test set into clean and noisy subsets, containing ∼16k and ∼3k samples respectively. Clean utterances are audios where only one speaker's speech is intelligible. Noisy utterances are those where more than one speaker has intelligible speech (overlapping or non-overlapping). In noisy utterances, the primary speaker is the user whose utterance is more relevant for the e-commerce voice search application. Note that clean utterances may also have non-intelligible secondary speakers. We train and evaluate the model to transcribe only the primary speaker's speech while ignoring the rest.
For training KenLM and Sentencepiece models, we use a large corpus comprising text from various sources like transcribed voice search queries and on-call customer support queries, customer support chatbot queries, and product catalogues.
We use a cyclical learning rate (LR) (Smith, 2017) with the Adam optimizer to train the ASR model for 200k iterations with a batch size of ∼55 minutes of audio. Training the model on two A100 (40 GB) GPUs takes ∼50 hours. For EOS detection, we fine-tune the model with EL penalties for an additional 48k iterations (∼12 hours). We report WER and mean EOS latency on the test set for evaluating the performance of our model in Table 1. We get the best results when the model is first pre-trained on all the data and then fine-tuned on the target domain, followed by fine-tuning with EL penalties. To see how α and β affect the results, we sweep over both parameters and plot mean EOS latency vs %WER in Fig. 3.

To understand how modifications in the architecture contribute to improving the accuracy of the vanilla LSTM CTC model, we conduct an ablation study and report the WER in Table 2. We train these models for 200k iterations on a reduced dataset of ∼5500 hours sampled from the target domain. As seen from the table, windowed MHA improves the WER by 9.6%. Intuitively, the improvement comes from an increased receptive field (780ms with vs 180ms without attention) and the ability to extract better context from neighbouring frames using self-attention. HCTC loss forces the model to learn hierarchical structure in the speech at multiple levels, from characters to short subwords and then long subwords. The model can then utilize this structure to achieve more accurate predictions. Adding auxiliary losses at intermediate levels helps the convergence as well. The hierarchical loss also facilitates the rescoring since the combination of losses acts like an ensemble of ranking models. Together, HCTC loss and rescoring give a relative improvement of 10.28%. Finally, skip connections improve the gradient flow in training, which further helps the convergence, improving the WER by 13.80%. These modifications, when combined, result in a significant total relative improvement of ∼30% in WER over the baseline.

Comparison with other models
In addition to the baseline LSTM CTC (Table 2), we also compare our model with a non-streaming BiLSTM version, and a streaming Conformer CTC inspired by Li et al. (2021). For Conformer CTC, we use the causal encoder-only network and train it using CTC loss. As evident from the results in Table 4, the discussed modifications help bridge the gap between streaming LSTM and non-streaming BiLSTM CTC models. The streaming Conformer CTC also performs only marginally better than our LSTM-attention HCTC model while having much higher training complexity and inference latency.
We evaluate a bidirectional version of our model to analyse the consistency of these improvements. Observe that the same modifications improve the BiLSTM CTC model by a relative 13.4%, vs 30% for the LSTM CTC model, because BiLSTM already has access to the full future context, limiting the scope of improvement. Even then, it performs significantly better than a vanilla BiLSTM CTC model and only slightly worse than a Transformer AED+CTC model (Nakatani, 2019). Thus, these modifications also reduce the gap between LSTM and transformer-based ASR models for voice search in both streaming and non-streaming settings. One explanation could be that transformers usually have an advantage in capturing long-term dependencies, which does not help as much for speech recognition on short utterances as in our dataset, where audios are usually 4-6 seconds long with an average of 3.34 spoken words. For a fair comparison, we ensure all models are similar in size and use the same KenLM for rescoring.

Error Analysis
To understand the errors better, we analyze 50 random utterances each from the clean and noisy subsets where the model makes mistakes. The most common reasons for errors in the clean subset are wrong pronunciation and background noise. For noisy utterances, multiple speakers with similar voices, overlapping speakers, and more than one eligible primary speaker contribute to additional errors. Table 3 lists some examples demonstrating these reasons. We also observe that around 62% of the mistakes in the evaluation set have no negative impact on search. In these cases, the errors are usually in stop words or produce a usable variant of the reference word, like singular vs plural or the same word with a different spelling.
When using EOS detection, there are additional errors due to early termination in 2.24% of the utterances. In all such cases, EOS is detected prematurely because of a pause in the speech. Usually, after this pause, the user repeats their query, adds more information, or corrects it. In around 47% of the cases, not capturing this additional speech has no negative impact on search. In the remaining 53% of cases, i.e., 1.19% of the total samples, the missed utterance usually has more information about the query, added by the user, that could have helped in refining the search results.

Conclusions
This work focuses on developing a robust and efficient streaming ASR model for Hinglish Voice Search. We achieve this by utilizing an LSTM-attention architecture and employing the HCTC loss. We explore architectural modifications that help bridge the accuracy gap between streaming and non-streaming LSTM-based ASR models.
Our proposed model performs on par with a streaming conformer-based system but offers the advantage of lower latency. Additionally, we present a straightforward method to integrate Endof-Speech (EOS) detection with CTC-based models, requiring only a small number of additional training iterations and utilizing simple thresholding during inference.
The simplicity and low latency of our model contribute to a fast and accurate voice search experience, making it an appealing solution for practical applications.

Limitations and Future Work
In our study, we focused on a high-resource setting with access to approximately 22.5k hours of labeled speech data. While we compared our models with conformer and transformer-based AED and CTC models, we did not include RNNT models due to their higher compute resource requirements. To accommodate deployment constraints, we employed a smaller model with approximately 60 million parameters, which limited its performance.
Moving forward, our future work aims to explore the potential benefits of leveraging large unsupervised datasets and larger models to further enhance our system and extend its applicability to other Indian languages, which typically have less available data compared to Hinglish. Building upon our previous success in adapting a non-streaming model for end-to-end speech-to-intent detection in customer support voicebots (Goyal et al., 2022), we are motivated to investigate the feasibility of developing a single joint model for Automatic Speech Recognition (ASR), End-of-Speech (EOS) detection, and Spoken Language Understanding (SLU). Additionally, we are keen on exploring the development of multilingual ASR models.