CGF: Constrained Generation Framework for Query Rewriting in Conversational AI

In conversational AI agents, Query Rewriting (QR) plays a crucial role in reducing user friction and fulfilling users' everyday requests. User friction arises for various reasons, such as errors in the conversational AI system, users' accents, or their abridged language. In this work, we present a novel Constrained Generation Framework (CGF) for query rewriting at both the global and personalized levels. It is based on an encoder-decoder architecture: the encoder takes the query and its previous dialogue turns as input to form a context-enhanced representation, and the decoder uses constrained decoding to generate rewrites within a pre-defined global or personalized constrained decoding space. Extensive offline experiments and online A/B tests show that the proposed CGF significantly boosts query rewriting performance.


Introduction
Large-scale conversational AI agents such as Alexa, Siri and Google Assistant help millions of users perform everyday tasks such as playing music and controlling light devices at home. In general, such conversational AI agents have multiple components, including automatic speech recognition (ASR) and natural language understanding (NLU). ASR is responsible for converting the speech signal of the user's query (e.g., "play Michael Jackson music") to a text transcript. Following this, NLU provides domain/intent classification (e.g., domain: Music, intent: PlayMusic) and entity labelling (e.g., Artist-Name: Michael Jackson), which are used to fulfill the user's request.
However, frictions sometimes arise due to speech recognition or NLU errors. For example, an ASR error may produce the erroneous transcript "play alien bridges" when the user actually meant "play leon bridges". Such errors propagate to the downstream NLU system, which then captures the wrong entity "alien bridges" for the slot "ArtistName". This leads to a bad user experience, and the user has to rephrase their query. Additionally, current NLU technology has limitations and cannot handle all user requests. For example, "tv to input three" cannot be properly handled by NLU (the user's intended request is "turn tv to h.d.m.i. three"). To reduce friction and make the dialogue system more robust, query rewriting (QR) (Ponnusamy et al., 2019) has become an increasingly important technique in conversational AI agents. In production conversational AI agents, the QR component is often triggered when the system cannot process user requests with good confidence. For example, if the ASR or named entity recognition (NER) confidence is low, the QR component can be triggered to automatically map a user query to another form, so that the dialogue system can successfully take the right action.

* Work done while working at Amazon.
Many existing QR systems use search-based pipelines for either global or personalized query rewriting (Cho et al., 2021). These systems typically have two steps: retrieval and ranking. Users' historical defect-free interactions with conversational agents are used to construct the global or personalized index. When a new request arrives, the system compares it to the utterances in the index using a retrieval model such as a dual encoder with billion-scale similarity search (e.g., FAISS) (Johnson et al., 2017) and retrieves the top-N candidates from the index. A ranking model then ranks these candidates, taking both neural semantic and IR features as input. The system picks the top-ranked candidate as the final rewrite. Such search-based systems are widely used in large-scale conversational AI agents because the index allows them to effectively control the output and thus reduce risky rewrites.
However, such retrieval-based systems also have limitations. First, the affinity between the query and a rewrite candidate is mainly captured through a vector dot product and lacks token-level modeling. Second, a large memory footprint is needed to store the dense representations when a large index is used in the retrieval step.
In this work, we propose to leverage generation-based models under the Constrained Generation Framework (CGF) for the query rewriting task. Since little prior work has incorporated previous context information in query rewriting, although its importance is recognized (Wu et al., 2018), we feed both the previous dialogue context and the user's current request to the encoder. The decoder uses constrained decoding during inference to force the generated rewrite to lie in a predefined candidate set. The proposed CGF mitigates the aforementioned shortcomings of search-based systems, since the autoregressive formulation allows the model to directly capture relations between the contextual input and target rewrites and thus effectively cross-encode both. Moreover, the memory footprint is greatly reduced because the parameters of our encoder-decoder architecture scale with the vocabulary size, not the index count. Though neural language generation approaches are known to hallucinate content, our proposed constrained decoding approach with a predefined candidate set keeps the generation model faithful to the model input and avoids potential hallucinations or bad rewrites. We conducted extensive offline experiments for both global and personalized query rewriting to show the effectiveness of the proposed approach. Our online experimental results also demonstrate that the proposed CGF indeed generates rewrites of better quality.

Query Rewriting
In dialogue systems, query rewriting benefits dialogue state tracking, especially coreference resolution (Rastogi et al., 2019; Vakulenko et al., 2020), and in general can seamlessly replace the user's utterance in order to remove friction and unsatisfactory experiences for users (Ponnusamy et al., 2019). To this end, Ponnusamy et al. (2019) proposed to reformulate queries with a Markov Chain. Subsequent work proposed a retrieval-based model with a pre-training method, and Cho et al. (2021), among others, leveraged multi-stage search-based systems to perform global and personalized query rewriting. In this work, we propose CGF based on Seq2Seq models to generate a rewrite of the initial user query.
Another thread of work related to query rewriting is the Grammatical Error Correction (GEC) task. GEC is the task of correcting different kinds of grammatical errors in text, such as spelling, punctuation, and word choice errors. Recently, Seq2Seq-based models have become the state-of-the-art approach for GEC (e.g., Kaneko et al., 2020). The main difference between GEC and our query rewriting is that GEC is more concerned with grammatical corrections, whereas we focus on errors from users, ASR, or NLU systems to reduce friction.

Constrained Generation
Constrained generation has been applied in many tasks such as machine translation and web search. Hokamp and Liu (2017) introduced grid beam search to allow the inclusion of pre-specified lexical constraints. Mohankumar et al. (2021) applied constrained decoding with a diverse sibling search algorithm for search advertising. To the best of our knowledge, ours is the first work that introduces the constrained decoding into query rewriting for conversational AI agents. Moreover, we extend the approach to personalized rewriting to take full advantage of the constrained generation.

CGF for Query Rewriting
As shown in Figure 1, we introduce a sequence-to-sequence (Seq2Seq) model to generate the rewrite, where a bidirectional encoder takes the context and current request as input, and an autoregressive decoder relies on the pre-defined index to perform constrained decoding in order to generate the target rewrite.

Context-enhanced Modeling
We adopt the pre-trained Seq2Seq model BART. It has the same model architecture as the widely-used Transformer model (Vaswani et al., 2017) and is pre-trained in a denoising fashion (Devlin et al., 2019). In this work, we flatten the previous dialogue turns (including both user requests and agent responses) and the current user request into a single sequence as the encoder input, as shown in Figure 1, and fine-tune BART. Formally, let Q = {q_1, ..., q_M} be the context-enhanced request sequence, where q_i denotes a token in the sequence, and let R = {r_1, ..., r_N} be the corresponding rewrite. The encoder is responsible for reading the input request and its previous dialogue turns, and the decoder autoregressively generates the rewrite. Given the hidden representations of the context-enhanced request and the rewrite, the conditional probability of the n-th target token r_n is calculated as follows:

P(r_n | r_<n, Q) = Softmax(Proj(h_n)),

where h_n is the n-th hidden representation of H_Dec, and Proj() and Softmax() are two transformation functions in the output layer of the decoder (Vaswani et al., 2017).
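The context-enhanced input construction can be sketched as follows. This is a minimal illustration of flattening dialogue turns into one encoder sequence; the separator token and role tags are our own assumptions, not the paper's exact serialization format.

```python
# Sketch of the context-enhanced encoder input: previous dialogue turns
# (user requests and agent responses) plus the current request are
# flattened into a single sequence. "[SEP]" and the role tags are
# illustrative assumptions.

def build_encoder_input(history, current_request,
                        turn_sep=" [SEP] ", role_tags=("user:", "agent:")):
    """Flatten (role, utterance) history turns and the current request."""
    parts = []
    for role, utterance in history:
        tag = role_tags[0] if role == "user" else role_tags[1]
        parts.append(f"{tag} {utterance}")
    parts.append(f"{role_tags[0]} {current_request}")
    return turn_sep.join(parts)

history = [("user", "play alien bridges"),
           ("agent", "sorry, I couldn't find that")]
print(build_encoder_input(history, "play leon bridges"))
```

The resulting string would then be tokenized and passed to the fine-tuned BART encoder.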

Constrained Decoding
Neural language generation approaches are known to hallucinate content, producing generated text that conveys information not present in the input. For example, for the user request "play broadway girls", a model with free-style generation can generate the rewrite "play broadway girls by morgan wade". This is factually wrong, since "morgan wade" never sings the song "broadway girls". The reason is that general generative models run beam search over the entire vocabulary, so there is a chance of generating fluent but factually incorrect sentences. This inability to effectively control the generated text has become one of the biggest obstacles to adopting generative models for query rewriting in conversational AI. In this work, we propose to use constrained decoding in the generative models to reduce potential bad rewrites.
Beam search is widely used in Seq2Seq models during inference to improve search quality. Standard beam search selects the top B hypotheses with the highest log probability

S(r_t, r_<t | Q) = S(r_<t | Q) + log P(r_t | r_<t, Q)

at each time step t, where r_t denotes a token in the generated hypothesis. Allowing the model to generate any token from the vocabulary at every decoding step might lead it to produce output strings that are not valid (i.e., bad rewrites). Hence, we resort to constrained beam search, forcing the model to decode only valid rewrites from a predefined candidate set. We define our constraint in terms of a prefix tree (trie) T, where nodes are tokens from the vocabulary. For each node t ∈ T, its children indicate all the allowed continuations of the prefix defined by traversing the trie from the root to t. More formally, when decoding the token r_t at time step t, the constrained probability distribution is calculated as:

P'(r_t = r | r_<t, Q) = P(r_t = r | r_<t, Q) if r ∈ suffix_T(r_<t), and 0 otherwise,

where we zero out all tokens r that are not a valid continuation of the already generated sequence r_<t in the trie. In this way, we ensure that the model can only generate rewrites from the predefined candidate set. In the trie shown in Figure 2, each path from the root node to a leaf node (e.g., [BOS] → play → staring → at → it → [EOS]) represents an utterance that the model is allowed to generate. "[BOS]" is the special token indicating the beginning of a sequence; similarly, "[EOS]" denotes the end of a sequence.

Figure 2: A snapshot of the utterance trie we construct based on the global index. When the model has generated the sequence "[BOS] play staring at" during decoding, the pre-defined trie only allows it to generate either "the" or "it" next. If the model then generates "the", it is only allowed to generate one of the three words "sun", "moon" or "sky" in the following step.
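The trie lookup behind constrained decoding can be sketched as below. The example mirrors the Figure 2 candidates; in a real system, the allowed-token set would be used to mask the decoder's next-token distribution at each beam search step, and nodes would hold subword ids rather than words.

```python
# Minimal sketch of the prefix tree used for constrained decoding.
# Word-level tokens are used here for readability; a production system
# would build the trie over the tokenizer's subword ids.

class Trie:
    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def allowed_next(self, prefix):
        """Tokens the decoder may generate after `prefix`, i.e. suffix_T(r_<t)."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return set()  # prefix not in trie: no valid continuation
            node = node[tok]
        return set(node.keys())

# Candidate rewrites from Figure 2, with [BOS]/[EOS] markers.
candidates = [
    ["[BOS]", "play", "staring", "at", "the", "sun", "[EOS]"],
    ["[BOS]", "play", "staring", "at", "the", "moon", "[EOS]"],
    ["[BOS]", "play", "staring", "at", "the", "sky", "[EOS]"],
    ["[BOS]", "play", "staring", "at", "it", "[EOS]"],
]
trie = Trie(candidates)
print(trie.allowed_next(["[BOS]", "play", "staring", "at"]))         # the / it
print(trie.allowed_next(["[BOS]", "play", "staring", "at", "the"]))  # sun / moon / sky
```

During beam search, any token outside `allowed_next(prefix)` would have its probability set to zero, as in the equation above.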

Global and Personalized Query Rewriting
Constrained generation with the predefined decoding space can not only reduce the risks, but also offer flexibility to conduct rewrite with utterance sets predefined at different granularities. In this section, we introduce how to conduct the global and personalized query rewriting with CGF.
Global Query Rewriting Global query rewriting means that the rewrite for a request is applicable to all users. For example, for the query "tv to input three", the ideal rewrite is "turn tv to h. d. m. i. three", which applies to every user who might issue this request. In the proposed CGF, we pre-define the global constrained decoding space to include all rewrite candidates that the model is allowed to generate. To achieve this, inspired by prior approaches to constructing a global index, we build a global trie that provides rewrite candidates extracted from all users' interactions. The global trie is generated from the aggregated, anonymized historical interactions between the users and the agent within a period of time (e.g., 30 days). In addition, after collecting the historical interactions, we rely on a defect detection model to filter out defective utterances. Note that since constrained decoding with the trie does not need to store dense vectors of the index, we can greatly reduce the memory footprint and thus potentially enlarge the trie compared to the index of search-based models in real online systems.
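The global-trie construction described above can be sketched as follows. The `is_defective` predicate is a hypothetical stand-in for the paper's trained defect detection model, and word-level tokens are used for readability.

```python
# Illustrative sketch of global-trie construction: aggregate historical
# utterances, drop those flagged by a defect detector, and insert the
# rest into a token-level prefix tree (nested dicts).

def is_defective(utterance):
    # placeholder rule; a real system would query a trained defect model
    return utterance.endswith("?")

def build_global_trie(utterances):
    root = {}
    for utt in utterances:
        if is_defective(utt):
            continue  # filter out defective utterances
        node = root
        for tok in ["[BOS]"] + utt.split() + ["[EOS]"]:
            node = node.setdefault(tok, {})
    return root

trie = build_global_trie(["play leon bridges", "what?"])
print("play" in trie["[BOS]"])  # True: the non-defective utterance was kept
```

A nested-dict trie like this stores only tokens, which is why its footprint is far smaller than a dense-vector index over the same utterances.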
Personalized Query Rewriting A crucial property of query rewriting is that it often needs to reflect personal preferences or personalized error types to recover from a defect (Cho et al., 2021). For example, for the same defective request "turn on the moon", the intended request for user A may be "turn on the moonlight sonata", whereas user B might want to "turn on the moon lamp". The global query rewriting described above cannot handle such cases, so a personalized query rewriting system is necessary to fill this gap. Vanilla Seq2Seq models cannot naturally perform personalized generation. In contrast, our proposed CGF allows generation-based models to perform personalized query rewriting by using a personalized constrained decoding space for each user. For a request coming from a specific user, the model is only allowed to generate a rewrite from that user's pre-defined personalized decoding space. We follow Cho et al. (2021) to build the constrained decoding space for each user, leveraging their individual interaction history. The utterances included in the constrained decoding space (i.e., trie) reflect satisfied experiences for each user within the past 30 days. In this work, we utilize the model trained with the global training data and apply the personalized trie to it for personalized rewriting.
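The key mechanism is that the trained model stays fixed and only the decoding space is swapped per user, which can be sketched as below. User ids and histories are hypothetical; it shows how the same defective prefix resolves differently under two users' tries.

```python
# Sketch: one globally trained model, one trie per user built from that
# user's non-defective interactions over the past 30 days. Only the
# constrained decoding space changes at inference time.

def build_trie(utterances):
    root = {}
    for utt in utterances:
        node = root
        for tok in utt.split():
            node = node.setdefault(tok, {})
    return root

user_history = {  # hypothetical per-user satisfied interactions
    "user_a": ["turn on the moonlight sonata"],
    "user_b": ["turn on the moon lamp"],
}
personalized_tries = {u: build_trie(h) for u, h in user_history.items()}

def allowed_next(trie, prefix_tokens):
    node = trie
    for tok in prefix_tokens:
        node = node.get(tok, {})
    return set(node)

# The same prefix "turn on the" continues differently for each user:
print(allowed_next(personalized_tries["user_a"], "turn on the".split()))  # {'moonlight'}
print(allowed_next(personalized_tries["user_b"], "turn on the".split()))  # {'moon'}
```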

Data
We train our proposed method with weakly labeled data annotated by a model (Machine-Annotated). Specifically, we first leverage a defect detection model to find pairs of consecutive de-identified user utterances where the first turn was defective but the second turn was successful. We then filter out consecutive utterances with a time gap larger than 35 seconds or an edit distance larger than 5. For evaluation, we curated human-annotated test data (Human-Annotated). For both the global and personalized test sets, we make sure the target rewrites are in the global/personalized constrained decoding space. Table 1 gives the statistics of the data sets. Note that all the data has been de-identified.
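The pair-filtering rules can be sketched as follows. The field names are illustrative, and the edit-distance granularity (word-level vs. character-level) is our assumption, since the text does not specify it.

```python
# Sketch of the weak-label filtering: keep a (defective turn, successful
# turn) pair only if the time gap is at most 35 seconds and the edit
# distance is at most 5. Word-level edit distance is an assumption.

def edit_distance(a, b):
    """Standard Levenshtein distance over two token lists (one-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (x != y))
    return dp[-1]

def keep_pair(turn1, turn2, max_gap=35.0, max_dist=5):
    gap = turn2["timestamp"] - turn1["timestamp"]
    dist = edit_distance(turn1["text"].split(), turn2["text"].split())
    return gap <= max_gap and dist <= max_dist

pair = ({"text": "play alien bridges", "timestamp": 0.0},
        {"text": "play leon bridges", "timestamp": 12.0})
print(keep_pair(*pair))  # True: 12s gap, edit distance 1
```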

Data Type        | Machine: Train | Machine: Valid | Human: Test
Global QR        | 6.5m           | 0.4m           | 6k
Personalized QR  | 6.5m           | 0.4m           | 5k

Table 1: Statistics of the query rewriting data sets. "Machine" denotes the Machine-Annotated data; "Human" denotes the Human-Annotated data.

Model Setup
In this work, we fine-tune the pre-trained BART model. We compare our proposed model with several baselines. For the global query rewriting task, we have two baselines: 1) DPR (Karpukhin et al., 2020): we follow the recent retrieval model DPR to train a dual BERT model; 2) UFS-QR: we implement the search-based approach UFS-QR, which contains a retrieval layer and a ranking layer. For personalized query rewriting, we use Personalized UFS-QR (Cho et al., 2021) and DPR as the baselines. Personalized UFS-QR extends UFS-QR by incorporating personalized features into the ranking model and index construction. In addition, we also compare with the CGF model that uses the global trie. More details on training CGF and the baselines can be found in Appendix A.1. We build the global trie, which contains 27M unique utterances, following the global index construction, and the personalized trie following Cho et al. (2021). In terms of memory (disk space) footprint, the global trie we built is 856M; in contrast, the FAISS index built from the same utterances is 36G for UFS-QR and 89G for DPR.

Table 2: Global query rewriting evaluation. We compare our proposed CGF with the existing search-based query rewriting systems on human-annotated test sets. "CE" denotes context-enhanced encoding. "CD" denotes constrained decoding. All the numbers are relative differences with respect to the baseline: "DPR".

Evaluation Metrics
For evaluation, we use utterance-level precision and trigger rate. Precision denotes how often the triggered rewrite matches the correct rewrite. The trigger rate is the fraction of instances for which the model makes a prediction whose final beam score is above a predefined threshold. We set the threshold to -0.2 for our proposed CGF models.

Global Query Rewriting Results
Table 2 shows the main CGF results, with ablations, on the two human-annotated test sets. CGF with context-enhanced encoding and constrained decoding achieves the best precision and trigger rate on both test sets. Our approach outperforms the search-based UFS-QR system and the retrieval system DPR by more than 14% and 21% on precision, respectively. Moreover, the proposed approach can confidently trigger more cases. Table 2 also lists the ablation study results for the global query rewriting task. "w/o both" denotes CGF without context-enhanced encoding and constrained decoding, in which the model takes only the query as the encoder input and performs unconstrained generation. Although the overall performance of the "w/o CD" model is not bad, it still suffers from hallucinations; examples of factually incorrect generation can be found in Appendix A.3. Both context-enhanced encoding and constrained decoding proved useful on their own, and combining them is better still, yielding higher precision and, at the same time, a higher trigger rate.

Table 3: Personalized query rewriting evaluation. All the numbers are relative differences with respect to the baseline: "CGF (Global trie)".
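The precision and trigger-rate computation described in the evaluation metrics can be sketched as below; the data and names are illustrative, and the -0.2 beam-score threshold follows the text.

```python
# Sketch of the two offline metrics. `score` is the final beam score of
# the top hypothesis; the example predictions are made up.

THRESHOLD = -0.2  # trigger threshold used for the CGF models

def evaluate(predictions):
    """predictions: list of (score, predicted_rewrite, gold_rewrite)."""
    triggered = [(p, g) for s, p, g in predictions if s > THRESHOLD]
    trigger_rate = len(triggered) / len(predictions)
    precision = (sum(p == g for p, g in triggered) / len(triggered)
                 if triggered else 0.0)
    return precision, trigger_rate

preds = [(-0.10, "play leon bridges", "play leon bridges"),
         (-0.05, "turn tv to hdmi three", "turn tv to hdmi three"),
         (-0.90, "play x", "play y"),  # below threshold: not triggered
         (-0.15, "turn on moon lamp", "turn on the moon lamp")]
print(evaluate(preds))  # precision 2/3, trigger rate 3/4
```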

Personalized Query Rewriting Results
Results for personalized query rewriting on the Human-Annotated test set using our proposed CGF are shown in Table 3. We use the same trained model as in the global query rewriting task. The only difference is that during inference, the constrained decoding space is switched to the personalized one built from each user's historical interactions, and thus varies across users. As can be seen, personalized CGF outperforms the CGF global model (i.e., with the global trie). It also outperforms the search-based Personalized UFS-QR and DPR by 2.7% and 3.3% on precision, respectively, with a higher trigger rate.

Deployment
In the online system, we run the global and personalized CGF models in parallel. When both the global and personalized components return a rewrite candidate, we prioritize the result from the personalized model over the global one to support any possible personalization of QR. If neither model manages to generate a rewrite, the system outputs no rewrite.
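This arbitration logic is simple enough to sketch directly; the function name is hypothetical.

```python
# Sketch of the online arbitration: personalized rewrites take priority
# over global ones; None means the component did not trigger.

def select_rewrite(personalized, global_rewrite):
    if personalized is not None:
        return personalized
    return global_rewrite  # may be None: no rewrite is output

print(select_rewrite(None, "play leon bridges"))  # play leon bridges
print(select_rewrite("turn on the moon lamp", "turn on the moonlight sonata"))
```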

Online Results
To investigate the effectiveness of the introduced techniques, we deploy the proposed CGF model to generate rewrites in the online environment and compare it with the no-CGF rewrites for English-speaking users. The data was collected for more than one week over a significant percentage of traffic via the A/B testing framework. We use one primary metric to evaluate the performance of our proposed CGF approach during the A/B test: Defect Rate, defined as the total number of rewritten utterances that are defective divided by the total number of rewritten utterances. We leverage a defect detection model to measure whether an utterance is defective. From the A/B results, we observed a significant relative reduction in defect rate of 28.97%, along with 1 million new rewrites generated by the proposed approach per week. Table 4 shows cases where the original requests received unsatisfying responses from the agent and, after the rewrite, the friction was removed with satisfying responses. For example, due to an ASR error, the agent's response to the original request "how old is tommy in it" could not fulfill the user's need. Even without context information, i.e., when the request is the first turn, CGF can successfully rewrite it, yielding the right response from the agent. More online examples can be found in Table 4.
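The Defect Rate metric is a straightforward ratio, sketched below; `is_defective` again stands in for the trained defect detection model, and the data is made up.

```python
# Sketch of the A/B metric: defect rate = defective rewritten utterances
# divided by all rewritten utterances.

def defect_rate(rewrites, is_defective):
    flagged = sum(1 for r in rewrites if is_defective(r))
    return flagged / len(rewrites)

rewrites = ["play leon bridges", "play alien bridges", "turn tv to hdmi three"]
print(defect_rate(rewrites, lambda r: "alien" in r))  # 1/3
```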

Limitations
Trie Coverage Although we use 27M unique utterances in the global trie and 30 days' worth of non-defective turns per user for constrained decoding, the proposed system cannot handle cold-start cases (e.g., a recently released song) or out-of-trie rewrite cases (where the rewrite for the request is not in the trie). To mitigate this, we plan to update the global trie weekly and the personalized trie daily. We will also work on constraining only part of the utterance generation (i.e., entities) instead of the entire utterance to enlarge the decoding space.
Latency Generation-based models suffer from latency issues due to their autoregressive generation process. In the CGF deployment, we converted the model to an ONNX version and sped up inference by 30.6%. However, CGF is still 1.5 times slower than the search-based system. Considering this, we will explore non-autoregressive approaches and related model optimization techniques such as distillation and pruning.

Conclusion
In this work, we propose CGF, a novel paradigm for query rewriting: generate the target rewrite autoregressively with context-enhanced encoding and constrained decoding. CGF is a general framework for different query rewriting purposes, in which one can freely define the decoding space (e.g., global, personalized, or domain-specific). Both offline and online experiments show that our approach consistently and significantly improves query rewriting performance.