Voice Query Auto Completion

Query auto completion (QAC) is the task of predicting a search engine user’s final query from their intermediate, incomplete query. In this paper, we extend QAC to the streaming voice search setting, where automatic speech recognition systems produce intermediate transcriptions as users speak. Naively applying existing methods fails because the intermediate transcriptions often don’t form prefixes or even substrings of the final transcription. To address this issue, we propose to condition QAC approaches on intermediate transcriptions to complete voice queries. We evaluate our models on a speech-enabled smart television with real-life voice search traffic, finding that this ASR-aware conditioning improves the completion quality. Our best method obtains an 18% relative improvement in mean reciprocal rank over previous methods.


Introduction
Query auto completion (QAC) is the task of predicting a user's complete query given the present, incomplete prefix of the query. For example, suppose a user types "COVID vaccine" into Google. Then, a QAC system proposes the most likely completions for that prefix, e.g., "COVID vaccine near me," saving the user time when the prediction is correct. Existing state-of-the-art approaches generate completions using language models conditioned on the prefix (Park and Chiba, 2017), with simple prefix trees serving as a strong baseline.
In the streaming voice search setting, such as on the Google Voice Assistant, naïvely adapting these existing approaches fails because the key assumptions differ. Most glaringly, incomplete queries comprise partial speech, not text. In place of human users, an automatic speech recognition (ASR) system produces the textual transcripts, resulting in intermediate queries that often don't form prefixes or even substrings of the final query. Consider the voice query "Hulu" as an example. Passing it through a streaming ASR system, we observe the transcripts "who," then "Hulu." Traditional QAC approaches fail to complete "Hulu" from "who," since they use orthographic prefixes and substrings (Cai and de Rijke, 2016) instead of the true phonetic prefix, which is generally unavailable at training time.
Nevertheless, the intermediate transcript "who" is still informative for predicting "Hulu" because it frequently precedes "Hulu" in our data. Based on this observation, we hypothesize that, for improved QAC quality, we must additionally model the dynamics between the intermediate and the final transcripts from the ASR system.
In this paper, we design and evaluate precisely such ASR-aware QAC models. The main contributions of our work are as follows: First, we are the first to describe the task of QAC for streaming, bidirectional ASR systems in voice search. Second, we propose and evaluate novel, ASR system-aware QAC models for the task, showing that incorporating context from intermediate transcripts helps. On the Xfinity X1, a voice-enabled smart TV serving more than twenty million American customers, our best approach attains an 18% relative improvement in mean reciprocal rank over the previous best.

Voice Query Auto Completion
Our novel task is to predict the final voice queries that users issue, given mid-utterance, intermediate transcripts of their incomplete speech from the ASR system. As is typical, these systems are streaming, with the speech being transcribed to text in real time. Concretely, for some utterance, we are given a $k$-tuple of string transcripts $X := (x^{(1)}, x^{(2)}, \ldots, x^{(k)})$ representing the streaming outputs of the ASR system across the utterance, where each $x^{(i)}$ is a string of words. We index $X$ in chronological order, e.g., ("Who", "Hulu", "Hulu now") for the utterance "Hulu now." We wish to predict $x^{(k)}$ from $x^{(j)}$ for each $1 \leq j \leq k$; that is, we model
$$p\left(x^{(k)} \mid x^{(j)}\right), \quad 1 \leq j \leq k. \quad (1)$$

In keyboard-input QAC, there is only a single transcript, the submitted query, which researchers model with prefix trees (Mitra and Craswell, 2015) and autoregressive language models (Park and Chiba, 2017) to generate completions from prefixes. A tacit assumption is that partial queries are prefixes (or within a small edit distance of an observed prefix; Chaudhuri and Kaushik, 2009) of the final query. Since this property does not hold for us, we propose voice query-oriented flavors of two state-of-the-art QAC approaches, most popular completion (MPC) and neural query language models (NQLMs), representing a statistical and a neural approach, respectively.

Given the context size $c$, we construct a training set of strings $D_c$ from the training corpus of $N$ transcript tuples $D := (X_1, X_2, \ldots, X_N)$ as the set of all $c$-concatenations of each $X \in D$; Table 1 illustrates example inputs and outputs for various context sizes $c$.

Concatenated Sequence Transformation
We differentiate the motivation of our method from that of Tsunematsu et al. (2020), who treat speech completion the same as typographical QAC. Typing is far more linear than automatic speech recognition: intermediate keystrokes (or queries) remain within a short edit distance of the final query. In other words, typographical sequence completion already carries the full history of the query inherently, in the final query text itself. Our motivation is therefore to provide the same information to voice QAC by including the full transcript history captured from the ASR system.
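As a concrete illustration, below is a minimal sketch of one plausible reading of the $c$-concatenation transformation, consistent with the inference procedure in Section 2.2; the [SEP] and [EOS] markers come from that description, while the function and variable names are our own and assume the markers are treated as ordinary tokens.

```python
from typing import List, Sequence

SEP, EOS = "[SEP]", "[EOS]"

def c_concatenations(transcripts: Sequence[str], c: int) -> List[str]:
    """Build the c-concatenations of one utterance's streaming ASR transcripts.

    For every position j, the last c transcripts up to and including x^(j)
    are joined with [SEP], followed by the [EOS] sentinel and the final
    transcript x^(k), which the model learns to produce.
    """
    final = transcripts[-1]
    strings = []
    for j in range(len(transcripts)):
        context = transcripts[max(0, j - c + 1) : j + 1]
        strings.append(f" {SEP} ".join(context) + f" {EOS} " + final)
    return strings

# Example from the paper: streaming transcripts for the utterance "Hulu now".
X = ("Who", "Hulu", "Hulu now")
for s in c_concatenations(X, c=2):
    print(s)
# Who [EOS] Hulu now
# Who [SEP] Hulu [EOS] Hulu now
# Hulu [SEP] Hulu now [EOS] Hulu now
```

The training set $D_c$ would then be the union of these strings over all $N$ utterances in $D$.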

Our Models
Most popular completion. In keyboard-input most popular completion, researchers construct a trie over the characters of each query in the training corpus, keeping track of query frequencies. At inference time, given some prefix of the query, the top-K completions from the trie are returned. In our case, given the context size $c$, we build the trie from $D_c$, naming this method "concatenated MPC," or "CAT-MPC" for short.

Neural query language models. The clear drawback of MPC is that it fails to complete unseen prefixes. The current state-of-the-art workaround is to apply neural language models (NLMs; Park and Chiba, 2017) rather than relying on observed statistics, thus allowing unseen suffixes to be generated. We propose to model $D_c$ using lightweight transformers (Vaswani et al., 2017), which represent the state-of-the-art architecture in language modeling (Brown et al., 2020). To learn the statistical distribution $p(W_1, \ldots, W_n)$ over the word sequence $W_1, \ldots, W_n$, NLMs typically use the negative log-likelihood (NLL) objective
$$\mathcal{L}_{\text{NLL}}(\theta) := -\sum_{x \in D_c} \sum_{i=1}^{|x|} \log p_\theta(W_i = x_i \mid x_1, \ldots, x_{i-1}), \quad (2)$$
where $D_c$ is a training corpus as previously defined, $p_\theta$ is an NLM, and $x$ is a tokenized string. We call this approach "concatenated neural query language model," or "CAT-NQLM."

Neural trie objective. NLL is pointwise in the sense that the likelihood of a single word (or outcome) is maximized at each iteration, ignoring the full distribution across the vocabulary. On a long corpus, this is the best we can do: the data is too sparse to provide an estimate beyond the next token conditioned on all its previous ones. Query logs, in contrast, are short, dense, and well modeled by tries (as the high quality of MPC shows), enabling us to estimate the distribution across the vocabulary for each new token. We construct a trie $p_{\text{trie}}(W_1, \ldots, W_n)$ over the dataset $D_c$ and introduce the objective
$$\mathcal{L}_{\text{NT}}(\theta) := \sum_{x \in D_c} \sum_{i=1}^{|x|} \mathrm{KL}\big(p_{\text{trie}}(W_i \mid x_1, \ldots, x_{i-1}) \,\big\|\, p_\theta(W_i \mid x_1, \ldots, x_{i-1})\big), \quad (3)$$
where $\mathrm{KL}(\cdot \| \cdot)$ denotes the Kullback-Leibler divergence. Unlike Eqn. (2), this loss uses the entire distribution across the vocabulary. In the data-sparse case, this objective degenerates to the negative log-likelihood loss since $p_{\text{trie}}(W_i = x_i \mid x_1, \ldots, x_{i-1}) = 1$. We name Eqn. (3) the neural trie (NT) objective.
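To make the neural trie objective concrete, here is a rough sketch that builds a token-level count trie over $D_c$ and uses its conditional next-token distribution as the target of a KL term (Eqn. 3); the class and function names are ours, not the paper's implementation, and the loop over positions is written for clarity rather than speed.

```python
import torch
import torch.nn.functional as F
from collections import defaultdict

class CountTrie:
    """Token-level trie storing next-token counts for every observed prefix."""

    def __init__(self, vocab_size: int):
        self.vocab_size = vocab_size
        self.counts = defaultdict(lambda: defaultdict(int))

    def add(self, token_ids: list):
        for i, tok in enumerate(token_ids):
            self.counts[tuple(token_ids[:i])][tok] += 1

    def next_token_dist(self, prefix: list) -> torch.Tensor:
        """Empirical distribution p_trie(W_i | prefix) over the vocabulary.

        Prefixes are assumed to have been observed in D_c during training.
        """
        dist = torch.zeros(self.vocab_size)
        for tok, count in self.counts[tuple(prefix)].items():
            dist[tok] = count
        return dist / dist.sum()

def neural_trie_loss(logits: torch.Tensor, token_ids: list, trie: CountTrie):
    """KL(p_trie || p_theta) summed over the positions of one tokenized string.

    `logits` has shape (sequence_length, vocab_size); row i scores W_i.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    loss = torch.tensor(0.0)
    for i in range(len(token_ids)):
        target = trie.next_token_dist(token_ids[:i])   # p_trie(. | x_<i)
        support = target > 0
        loss = loss + torch.sum(
            target[support] * (target[support].log() - log_probs[i][support])
        )
    return loss
```

When the trie puts all of its mass on the single observed token, the KL term reduces to the usual negative log-likelihood, matching the degenerate case described above.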
Model inference. At inference time, given some intermediate transcripts, we join the final c transcripts with [SEP] and append the [EOS] sentinel. For CAT-MPC, we return the top-K completions following the sentinel; for CAT-NQLM, following Park and Chiba (2017), we feed the transformed string into the NLM, run beam search with a width of K, and return the generated tokens.
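A minimal sketch of the CAT-NQLM inference step, assuming a trained Hugging Face causal language model and tokenizer in which [SEP] and [EOS] are known tokens; the helper name and the `max_new_tokens=32` cap are our own choices, not the paper's.

```python
def complete_voice_query(model, tokenizer, transcripts, c=2, k=10):
    """Return the top-K completions for the current intermediate transcripts."""
    # Join the final c transcripts with [SEP] and append the [EOS] sentinel.
    prompt = " [SEP] ".join(transcripts[-c:]) + " [EOS]"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=k,               # beam search with width K
        num_return_sequences=k,    # keep all K beams as candidate completions
        max_new_tokens=32,
        early_stopping=True,
    )
    # Decode only the tokens generated after the prompt, i.e., the completions.
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.batch_decode(outputs[:, prompt_length:],
                                  skip_special_tokens=True)
```

The analogous CAT-MPC lookup simply walks the trie to the node reached by the same prompt string and returns its K most frequent continuations.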

Experimental Setup
We run experiments using PyTorch and Transformers (Wolf et al., 2019) on machines with Titan RTX GPUs. For CAT-NQLM, we use the same architecture as GPT-2-base (Radford et al., 2019) but with an embedding and hidden size of 256, 8 attention heads, 4 layers, and a vocabulary of 8,000 tokens. This model runs in real time (well under 100 ms, the threshold for perceiving a response as instantaneous) and totals 4.7 million parameters, which is slightly larger than the small 3.8M-parameter model from Park and Chiba (2017) and much smaller than their large 30M variant. We train this model using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of $5 \times 10^{-4}$, a batch size of 128, and 5 epochs. We denote the models trained using the neural trie objective with the "NT" subscript. For tokenization, following the current state of the art, we apply byte-pair encoding (Kudo and Richardson, 2018) to solve the out-of-vocabulary problem. We tune the context size $c$ from Section 2.1 over $c = 1, \ldots, 5$ for all CAT-* approaches and pick a beam search width of 10. For more specific training details, refer to the appendix.
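For reference, the CAT-NQLM configuration above could be instantiated roughly as follows with the Transformers library; the hyperparameters mirror the numbers stated in the text, while anything else (e.g., the maximum sequence length) is our assumption.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=8_000,   # byte-pair-encoded vocabulary of 8,000 tokens
    n_embd=256,         # embedding and hidden size
    n_head=8,           # attention heads
    n_layer=4,          # transformer layers
    n_positions=128,    # maximum sequence length (our assumption; not stated)
)
model = GPT2LMHeadModel(config)  # the paper reports 4.7M parameters at this size

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# Trained for 5 epochs with a batch size of 128, per the text above.
```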
Dataset. We curate a proprietary dataset from real-life voice queries to our smart TV, the X1 entertainment system, which users interact with using a voice remote. Example queries include navigating to a specific channel ("Channel 5"), searching YouTube ("YouTube funny videos"), and program lookups ("Cowboy Bebop").
In Figure 1, we present the empirical cumulative distribution functions (ECDFs) of important statistics. From left to right, we plot the frequency rank of the query, the lexical length of the query, the number of ASR outputs, and the duration between the first and the last output. Note that this output activity duration is zero seconds for single-output utterances. We observe that the top-10 queries make up 25% of the traffic, which is roughly equal to the long tail past a rank of 1000 (see the leftmost figure). As the middle two figures show, the final queries are mostly short with few intermediate ASR outputs: 80% of them have fewer than 15 characters and yield no more than 5 intermediate transcripts. Thus, the ASR system actively outputs for less than a second on most queries, as plotted in the rightmost figure. For a detailed analysis of the queries, see our previous work (Li and Ture, 2020; Tang et al., 2019; Rao et al., 2018).

For the training and the development sets, we collect the streaming transcripts (provided by a third-party ASR system) of one million voice queries sampled uniformly at random from April 14th, 2021, setting aside 10% for development and the rest for training. This set represents roughly 3% of our daily traffic and contains 163K unique final transcripts. For the test set, we sample 100K queries from April 15th, 2021, a different date from the training set's, as is common in QAC datasets for testing generalizability (Adar, 2007). We call this dataset "EntSys1M."

To evaluate the models, we compute the mean reciprocal rank (MRR), defined as
$$\mathrm{MRR} := \frac{1}{|\hat{D}_c|} \sum_{u_i \in \hat{D}_c} \mathrm{RR}_{u_i},$$
where $\hat{D}_c$ is the test set transformed using Section 2.1, and $\mathrm{RR}_{u_i}$ is the reciprocal of the rank (index) of the first correct prediction among the top-K completions, as produced following Section 2.2. If no correct prediction exists, then $\mathrm{RR}_{u_i} = 0$. We further split the test set into two distinct subsets: final transcripts seen at training time and unseen ones.
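The metric can be computed as in the short sketch below, where each test example pairs the gold final transcript with its ranked list of top-K predicted completions; the function and variable names are illustrative only.

```python
def mean_reciprocal_rank(examples):
    """examples: iterable of (gold_query, ranked_predictions) pairs."""
    total, count = 0.0, 0
    for gold, predictions in examples:
        rr = 0.0  # RR stays 0 when no prediction matches the gold query
        for rank, prediction in enumerate(predictions, start=1):
            if prediction == gold:
                rr = 1.0 / rank  # reciprocal rank of the first correct prediction
                break
        total += rr
        count += 1
    return total / count

# Toy example: the correct completion appears at rank 2 in the first list only.
examples = [("hulu now", ["hulu", "hulu now", "who"]),
            ("cnn", ["espn", "nbc", "abc"])]
print(mean_reciprocal_rank(examples))  # (1/2 + 0) / 2 = 0.25
```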
Baseline models. We implement MPC and NQLMs (Park and Chiba, 2017) trained on the final transcript of each utterance, with the NLM architecture and training procedure matching our CAT-NQLM's for a fair comparison. We also implement and train the error-tolerant version of MPC (Chaudhuri and Kaushik, 2009), named MPC-KExt, which allows for up to an edit distance of $\hat{k}$ between the lookup prefix and the observed prefixes. We tune $\hat{k}$ on the development set over $\hat{k} = 1, \ldots, 10$.
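To illustrate the error-tolerant baseline, here is a rough, brute-force sketch of prefix lookup that tolerates up to $\hat{k}$ character edits between the lookup prefix and observed prefixes; this is our simplified reading of the idea, not the indexing algorithm of Chaudhuri and Kaushik (2009), and all names are hypothetical.

```python
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming with a rolling row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1,           # delete ca
                                     dp[j - 1] + 1,       # insert cb
                                     prev + (ca != cb))   # substitute ca -> cb
    return dp[-1]

def error_tolerant_completions(prefix: str, query_counts: Counter,
                               k_hat: int, top_k: int = 10):
    """Return the most frequent queries whose same-length observed prefix lies
    within edit distance k_hat of the lookup prefix (brute-force MPC-KExt)."""
    candidates = Counter()
    for query, count in query_counts.items():
        observed_prefix = query[: len(prefix)]
        if edit_distance(prefix, observed_prefix) <= k_hat:
            candidates[query] = count
    return [query for query, _ in candidates.most_common(top_k)]
```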

Results and Discussion
We present the overall quality of the models in Table 2. Our proposed approaches (rows 4-6) outperform the existing ones (rows 1-3) by 0.01-0.1 points in MRR on the full set. In absolute terms, our best model, CAT-NQLM$_{\text{NT}}$, improves over the previous best, NQLM, by about a third of a rank of the correct prediction, on average. MPC-KExt, which is tolerant to some changes in the prefix relative to the final transcript, still underperforms our proposed methods, likely because the intermediate transcript isn't necessarily close in edit distance to the final transcript (e.g., "Who" and "Hulu"). These results highlight the importance of conditioning the model on the intermediate transcripts.
We confirm that the neural trie objective (Eqn. 3; row 6) improves over the negative log-likelihood loss (row 5), showing that training against the entire vocabulary distribution for each word helps. While not shown due to space constraints, these gains are consistent for each $c = 1, \ldots, 5$ and require no extra machinery (and hence no extra latency) at inference time, thus making CAT-NQLM$_{\text{NT}}$ Pareto-better. We note that picking a larger context size $c$ does not always result in a better model: CAT-MPC is best when $c = 1$ because the statistics are sparser for $c \geq 2$, yielding less robust predictions for count-based methods.
Subgroup analysis. We further study how the model quality changes with different characteristics of the transcripts. In Figure 2, we choose CAT-NQLM$_{\text{NT}}$, the best model, and plot the MRR against the portion of the audio clip remaining (left subfigure) and the length of the final transcript (right subfigure). Note that the seen set cuts off after 7 words due to a natural mismatch between the statistics of the training set and the test set. We find that the MRR falls off as less audio is available (i.e., the first few transcripts are generally uninformative), with the unseen set's quality decreasing the fastest. We note a sharp uptick on the seen set and the all set when the full audio clip is remaining, mainly because many simple, short queries have a single intermediate transcript. Similarly, due to increased query diversity and audio clip length, the MRR worsens with increasing final transcript length (see the right subfigure).
Finally, in Figure 3, we graph the MRR split across different prefix deletion distances, defined as the number of characters to delete from the end of the intermediate transcript for it to become a prefix of the final one. The NQLM and the MPC approaches fail when the intermediate transcript is not a prefix of the final transcript, and our best model surprisingly outperforms them even when the deletion distance is zero (see the leftmost bucket). These results suggest that our proposed approach is robust across all prefix deletion distances, including zero.

Related Work
Tsunematsu et al. (2020) study speech transcript completion for unidirectional ASR systems on non-query data, while our focus is QAC on real-life voice queries with a typical ASR system where intermediate transcripts don't necessarily form prefixes of the final one. Park and Chiba (2017) are the first to apply neural language models to QAC, representing the state of the art; Fiorini and Lu (2018) extend this work with user personalization. Other more restricted examinations include improving QAC for rare prefixes (Mitra and Craswell, 2015), QAC in the presence of typographical errors (Chaudhuri and Kaushik, 2009), efficient QAC (Wang et al., 2020), and the effects of conversations on voice QAC (Vuong et al., 2021).

Conclusions and Future Work
We study the task of QAC for voice queries on bidirectional ASR systems. Along with an improved language modeling objective for query logs, we propose several novel methods which relate the mid-utterance transcripts to the final one, attaining relative gains of 18% over the previous best.
Voice QAC lends itself to a variety of end applications: For one, on voice-controlled smart televisions, it can guide viewers toward final queries in real time, much like the now-retired Google Instant feature. For another, in general voice query processing pipelines, it can serve as part of a latency reduction method: if we know what the user is going to say before they finish speaking, we can speculatively send the predicted final queries to the rest of the information retrieval system and precompute their responses while the user is still speaking. We plan to explore these lines of research in future work.