A Scalable Framework for Learning From Implicit User Feedback to Improve Natural Language Understanding in Large-Scale Conversational AI Systems

Natural Language Understanding (NLU) is an established component within a conversational AI or digital assistant system, and it is responsible for producing semantic understanding of a user request. We propose a scalable and automatic approach for improving NLU in a large-scale conversational AI system by leveraging implicit user feedback, with an insight that user interaction data and dialog context have rich information embedded from which user satisfaction and intention can be inferred. In particular, we propose a domain-agnostic framework for curating new supervision data for improving NLU from live production traffic. With an extensive set of experiments, we show the results of applying the framework and improving NLU for a large-scale production system across 10 domains.


Introduction
For a conversational AI or digital assistant system (Kepuska and Bohouta, 2018), Natural Language Understanding (NLU) is an established component that produces semantic interpretations of a user request, which typically involves analysis in terms of domain, intent, and slot (El-Kahky et al., 2014). For instance, the request "Play a song by Taylor Swift" can be interpreted as falling within the scope of Music domain with Play Song intent and Taylor Swift identified for Artist slot.
Without an accurate semantic understanding of the user request, a conversational AI system cannot fulfill the request with a satisfactory response or action. As one of the most upstream components in the runtime workflow (Sarikaya, 2017), NLU's errors also have a wider blast radius that propagates to all subsequent downstream components, such as dialog management, routing logic to back-end applications, and language generation.

* Equal contribution.

Figure 1: An example of implicit user feedback, specifically an indication of user dissatisfaction and user rephrase behavior (e.g., the system responding "I didn't find the item called 'lights' on your shopping list. Did you want me to …"), that can be used to create new supervision data to correct NLU errors. The left side shows the dialog history and the right side shows the ranked NLU interpretations for each user request.
A straightforward way to improve NLU is through human annotations, but they are labor-intensive and expensive. Such annotations require multiple tiers of annotation (e.g., end user experience, error attribution, and semantic interpretation), and it is hard to consider all relevant contextual conditions. They are also limited by existing annotation guidelines, which may be outdated or may not accurately reflect user expectations. Due to these limitations, leveraging user feedback, both implicit and explicit, from real production systems is emerging as a new area of research.
Our work makes three main contributions. First, this work is the first in the literature to introduce a scalable, automatic and domain-agnostic approach for leveraging implicit user feedback to continuously and directly improve the NLU component of a large-scale conversational AI system in production. This approach can be applied week over week to continuously and automatically improve NLU towards better end-to-end user experience, and given that no human annotation is required, the approach also raises minimal user privacy concerns. Our approach of using implicit feedback is based on our insight that user interaction data and dialog context have rich information embedded from which user satisfaction and intention can be inferred (see Figure 1). Second, we propose a general framework for curating supervision data for improving NLU from live traffic that can be leveraged for various subtasks within NLU (e.g., domain/intent classification, slot tagging, or cross-domain ranking). Last, we show with an extensive set of experiments on live traffic the impact of the proposed framework on improving NLU in the production system across 10 widely used domains.

Background and Problem Definition
The NLU component typically has three main types of underlying models: domain classifiers, intent classifiers, and slot taggers (El-Kahky et al., 2014). The three modeling tasks can be treated independently (Gao et al., 2018) or as a joint optimization task (Liu and Lane, 2016; Hakkani-Tür et al., 2016), and some systems have a model to rank across all domains, intents and slots on a certain unit of semantic interpretation (Su et al., 2018).
Leveraging implicit feedback from users has been widely studied in the context of recommendation systems (Hu et al., 2008; Liu et al., 2010; Loni et al., 2018; Rendle et al., 2012; He and McAuley, 2016; Wang et al., 2019) and search engines (Joachims, 2002; Sugiyama et al., 2004; Shen et al., 2005; Bi et al., 2019). In such systems, common types of implicit user feedback explored include a history of browsing, purchase, and click-through behavior, as well as negative feedback. Leveraging implicit feedback in the context of conversational AI systems is relatively unexplored, but it has been applied for rewriting the request text internally within or after the Automatic Speech Recognition (ASR) component (Ponnusamy et al., 2019), improving the Natural Language Generation component (Zhang et al., 2018), and using user engagement signals for improving the entity labeling task specifically focused on the Music domain (Muralidharan et al., 2019). We note that compared to explicit feedback (Petrushkov et al., 2018; Iyer et al., 2017), using implicit feedback is more scalable and does not introduce friction in the user experience. But it comes with the challenge of the feedback being noisy, and leveraging the feedback is more difficult when there is insufficient data, such as for improving tail cases (Wang et al., 2021a,b).
In this paper, we specifically focus on two types of implicit user feedback: dissatisfaction of experience (to understand what to fix, e.g., users prematurely interrupting a system's response) and clarification of intention through rephrase (to understand how to fix, e.g., users clarifying their requests by rephrasing the previous request in simpler terms). In this work, we assume that there are mechanisms already in place to automatically (1) infer user dissatisfaction (f_defect in Section 2.3) and (2) detect whether a given request is a rephrase of a previous request (f_rephrase in Section 3). There are many ways to build these two mechanisms, either rule-based or model-based. Due to space limitations, we leave the details of the two mechanisms outside the scope of this paper. For completeness and better context for the reader, however, we briefly describe various ways to build them, which would be straightforward to adapt and implement.

User Dissatisfaction Detection
Unless we specifically solicit users' feedback on satisfaction after an experience, user feedback is mostly implicit. There are many implicit user behavior signals that can help with detecting user dissatisfaction while interacting with a conversational AI system. They include termination (stopping or cancelling a conversation or experience), interruption (barging in while the system is still giving its response), abandonment (leaving a conversation without completing it), error-correcting language (preceding the follow-up turn with "no, ..." or "I said, ..."), negative sentiment language showing frustration, rephrase or request reformulation, and confirmation to execute on an action (Beaver and Mueen, 2020;Sarikaya, 2017).
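As a minimal illustration of how a rule-based detector could combine a few of the behavioral signals above, consider the following sketch. The signal names and patterns are illustrative assumptions, not the production implementation, which would combine many more signals:

```python
import re

# Illustrative patterns for two of the signals described above:
# termination (stop/cancel) and error-correcting language ("no, ...", "I said, ...").
TERMINATION = re.compile(r"^(stop|cancel|never mind)\b", re.IGNORECASE)
CORRECTION = re.compile(r"^(no|i said)\b", re.IGNORECASE)

def detect_dissatisfaction(followup_utterance: str, user_interrupted: bool) -> bool:
    """Tag a turn as a likely defect from simple implicit behavioral signals."""
    if user_interrupted:  # barge-in while the system was still responding
        return True
    if TERMINATION.match(followup_utterance):  # terminating the experience
        return True
    if CORRECTION.match(followup_utterance):  # error-correcting language
        return True
    return False
```

A production f_defect would be model-based and incorporate system-side signals as well, but even simple heuristics like these can surface candidate defects.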
Although not strictly from the user behavior, there are other signals from the system action and response that are also useful. They include generic error-handling system responses ("I don't know that one."), the templates executed for generating natural language error-handling responses (the song entity is not found for playing music), and the absence of a response (Beaver and Mueen, 2020;Sarikaya, 2017). There are also component-level signals such as latency or low confidence scores for the underlying models within each component such as ASR or NLU.
For more advanced approaches, we can combine the signals from the user behavior and the system together, try to model user interaction patterns, and use additional context from past interaction history beyond immediate turns (Jiang et al., 2015; Ultes and Minker, 2014; Bodigutla et al., 2020). Furthermore, user satisfaction can depend on usage scenarios (Kiseleva et al., 2016), and for specific experiences like listening to music, we can adapt related concepts such as dwell time from the search and information retrieval fields to further fine-tune satisfaction detection.

User Rephrase Detection
There are many lines of work in the literature that are closely related to this task under the topics of text/sentence semantic similarity detection and paraphrase detection. The approaches generally fall into lexical matching methods (Manning and Schutze, 1999), leveraging word meaning or concepts with a knowledge base such as WordNet (Mihalcea et al., 2006), latent semantic analysis methods (Landauer et al., 1998), and those based on word embeddings (Camacho-Collados and Pilehvar, 2018) and sentence embeddings (Reimers and Gurevych, 2019). In terms of modeling architecture, Siamese network is common and has been applied with CNN (Hu et al., 2014), LSTM (Mueller and Thyagarajan, 2016), and BERT (Reimers and Gurevych, 2019). The task is also related to the problems in community question-answering systems for finding semantically similar questions and answers (Srba and Bielikova, 2016).
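To make the simplest family of approaches concrete, a lexical-matching baseline for f_rephrase can be sketched with token-set Jaccard similarity. The threshold is an illustrative assumption; production systems would instead use embedding-based or Siamese-network models as cited above:

```python
def jaccard_rephrase(utt_a: str, utt_b: str, threshold: float = 0.5) -> bool:
    """Lexical-matching baseline for rephrase detection:
    Jaccard similarity over lowercased token sets."""
    a, b = set(utt_a.lower().split()), set(utt_b.lower().split())
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold
```

Lexical overlap misses paraphrases that share meaning but not words ("turn on the lights" vs. "make it bright"), which is exactly what the embedding-based methods address.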

Problem Definition
Denote T = (Σ, Π, N, A) to be the space of all user interactions with a conversational AI system, with each request or turn t_i = (u_i, p_i, c_i, a_i) ∈ T consisting of four parts: u_i ∈ Σ is the user request utterance, p_i ∈ Π is the semantic interpretation for u_i from NLU, c_i ∈ N is the contextual metadata (e.g., whether the device has a screen), and a_i ∈ A is the system action or response. Here, we are proposing a general framework that allows a scalable and automatic curation of supervision data to improve NLU, and we keep the unit of the semantic interpretation abstract for generalizability; it can be for one or a combination of the NLU subtasks of domain classification, intent classification, and slot tagging. For instance, one possible interpretation unit would be the domain-intent-slots tuple, which is what we use in our experiments described in Section 4. Although we only focus on NLU in this paper, the approach here can be extended to improve other components in a conversational AI system such as skill routing (Li et al., 2021).
We define a session of user interaction s = {t_1, t_2, ..., t_q} ⊆ T, which is a list of time-consecutive turns by the same user. Denote m_t to be the NLU component at timestamp t. We collect the interaction session data S_live = {s_1, s_2, ..., s_n} from live traffic for a certain period of time ∆ (e.g., one week) starting at time t, from which we curate new supervision data to produce m_{t+∆} with improved performance. Specifically, given a tool f_defect for automatic analysis of user dissatisfaction for each turn, we process S_live to identify all turns that indicate user dissatisfaction, t_i ∈ D_defect, which we call defective turns or simply defects. The key challenges then are how to (1) identify target defects, which are high-confidence defects that can be targeted by NLU (i.e., there is sufficient disambiguation power within NLU that it can learn to produce different results if given specific supervision) and that are likely causing repeated and systematic dissatisfaction of user experience, and (2) find a likely better interpretation for the target defects that changes the system action or response in a way that leads to user satisfaction.

Solution Framework
The framework involves two deep learning models: the Defect Identification Model (DIM) for addressing the first challenge of identifying target defects, and the Defect Correction Model (DCM) for the second challenge of correcting them by automatically labeling them with a likely better semantic interpretation (see Figure 2). It is straightforward to apply DIM and DCM on the production traffic log to curate new supervision data for improving NLU.
Data Preparation: We collect the user interaction session data S_live from the production log for an arbitrary period of time (e.g., the past week). Given a user dissatisfaction analysis tool f_defect and a rephrase analysis tool f_rephrase, we tag t_j ∈ s_i as a defect if f_defect detects user dissatisfaction for the turn, and we tag t_j ∈ s_i as a rephrase if there exists t_i ∈ s_i where j > i (i.e., temporally t_j occurred after t_i) and f_rephrase detects t_j to be a rephrase of t_i. We then extract each turn in S_live to create turn-level data D_live = {t_j ∈ s_i | s_i ∈ S_live}, with each t_j carrying two binary labels: defect e_d and rephrase e_r.
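The data preparation step above can be sketched as follows, assuming f_defect and f_rephrase are provided as callables (the Turn fields mirror the tuple (u_i, p_i, c_i, a_i) from Section 2.3; the structure is illustrative, not the production schema):

```python
from dataclasses import dataclass

@dataclass
class Turn:
    utterance: str        # u_i
    interpretation: dict  # p_i, e.g. {"domain": ..., "intent": ..., "slots": ...}
    context: dict         # c_i
    action: str           # a_i
    e_d: bool = False     # defect label from f_defect
    e_r: bool = False     # rephrase label from f_rephrase

def build_turn_level_data(sessions, f_defect, f_rephrase):
    """Tag each turn with defect/rephrase labels and flatten sessions into D_live."""
    d_live = []
    for session in sessions:
        for j, turn in enumerate(session):
            turn.e_d = f_defect(turn)
            # t_j is a rephrase if it rephrases any earlier turn t_i (i < j)
            turn.e_r = any(f_rephrase(session[i], turn) for i in range(j))
            d_live.append(turn)
    return d_live
```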

Defect Identification Model (DIM)
We define DIM as f_dim : T → {0, 1}, which takes as input each turn t_i ∈ D_live and outputs whether t_i is a target defect or not. It uses the same contextual features (and architecture) as the underlying individual NLU model we wish to improve and uses the results of f_defect, i.e., e_d, as the ground-truth labels for training. This allows us to filter down the defects into those that can be targeted by the NLU model of interest (since the same features could predict the defects, suggesting enough disambiguation capacity). By tuning the probability threshold used for binary model prediction, we can further reduce noise in defects and focus on high-confidence defects that are repeated and systematic failures impacting the general user population.

Figure 2: For DIM, the prediction is for target defect probability, and for DCM, it is for correction probability (i.e., whether the alternate domain and intent is a good alternate ground-truth label).

Figure 3 shows an example DIM architecture for a cross-domain interpretation re-ranking model (more detail in 4.1). The model architecture consists of three main modules: embedding, aggregation, and classification. Given each feature f_j extracted from t_i, the embedding module H_emb converts f_j into an embedding: for each sequential or categorical feature f_j, denoting w_{f_j,t_i} as the value of f_j with m tokens (where m = 1 for categorical features), H_emb produces the embedding v_{f_j,t_i}; numeric features are passed through directly, as each is already represented by numeric values. The aggregation module H_agg then converts v_{f_j,t_i} of each feature f_j to an aggregation vector u_{f_j,t_i} that summarizes the information of v_{f_j,t_i}. Based on the feature type, H_agg applies different aggregation operations. For example, we apply a Bi-LSTM (Schuster and Paliwal, 1997) to the utterance text embeddings v_{f_1,t_i} to capture the word context information. Finally, the classification module H_cls takes as input all aggregation vectors to make a prediction whether t_i is a target defect or not. Specifically, we first concatenate all aggregation vectors to get a summarization vector u_{t_i} = ⊕_{f_j} u_{f_j,t_i}. Then, a two-layer highway network (Srivastava et al., 2015) is applied to u_{t_i} to make a binary prediction. The model is trained using binary cross-entropy loss.
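Assuming a minimal two-feature setup (utterance tokens plus one categorical context feature), the embedding, aggregation, and classification modules described above might be sketched in PyTorch as follows. All dimensions, vocabulary sizes, and feature choices are illustrative placeholders, not the production configuration:

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """One highway layer: y = g * H(x) + (1 - g) * x (Srivastava et al., 2015)."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))
        return g * torch.relu(self.transform(x)) + (1 - g) * x

class DIM(nn.Module):
    """Illustrative DIM: utterance tokens -> Bi-LSTM aggregation; categorical
    context feature -> embedding; concatenated summary vector -> two-layer
    highway network -> target-defect probability."""
    def __init__(self, vocab_size=1000, n_categories=10, emb_dim=32, hidden=32):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)    # H_emb (sequential)
        self.cat_emb = nn.Embedding(n_categories, emb_dim)  # H_emb (categorical)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        dim = 2 * hidden + emb_dim                          # size of u_{t_i}
        self.highway = nn.Sequential(Highway(dim), Highway(dim))  # H_cls
        self.out = nn.Linear(dim, 1)

    def forward(self, tokens, category):
        _, (h, _) = self.lstm(self.tok_emb(tokens))     # H_agg for utterance text
        u_text = torch.cat([h[0], h[1]], dim=-1)        # forward + backward states
        u = torch.cat([u_text, self.cat_emb(category)], dim=-1)  # u_{t_i}
        return torch.sigmoid(self.out(self.highway(u))).squeeze(-1)
```

Training would pair these probabilities with the e_d labels under binary cross-entropy loss, as described above.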
When developing DIM, we split D_live into a training set D_train and a validation set D_valid with a ratio of 9:1. Once we have DIM trained on D_train, we use D_valid to further tune the prediction probability threshold used to extract target defects from all defects tagged by f_defect. Specifically, for each turn t_i ∈ D_defect, we pass it to f_dim to get the confidence score o_i = f_dim(t_i) of being a defect. Then, we generate the target defect set D_target = {t_i | o_i > τ}, i.e., we collect all turns whose defect prediction confidence is greater than a threshold τ. To select the value for τ, we perform a binary search on D_valid as shown in Algorithm 1, which takes as inputs two additional parameters: λ (to set the minimum prediction accuracy we want) and ε (the search stopping tolerance).
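A sketch of the threshold search follows, under the assumption (sufficient for binary search to be well-behaved) that the precision of the selected set is non-decreasing in τ; the exact details of Algorithm 1 are not reproduced in this paper:

```python
def select_threshold(scores, labels, min_precision=0.9, eps=1e-3):
    """Binary-search the smallest tau whose precision on the validation
    defects meets min_precision (lambda); eps is the stopping tolerance."""
    lo, hi = 0.0, 1.0
    while hi - lo > eps:
        tau = (lo + hi) / 2
        selected = [y for s, y in zip(scores, labels) if s > tau]
        precision = sum(selected) / len(selected) if selected else 1.0
        if precision >= min_precision:
            hi = tau  # tau is feasible; try a smaller threshold
        else:
            lo = tau
    return hi
```

A lower feasible τ keeps more target defects, so searching for the smallest threshold that still meets the precision bar maximizes recall subject to λ.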

Defect Correction Model (DCM)
We define DCM as f_dcm : T × Π → {0, 1}, which takes as input a pair (t_i, p_j) with t_i ∈ D_live and p_j ∈ Π, and makes a prediction whether p_j is a proper semantic interpretation for t_i. As the space of semantic interpretations Π is too large, we make the process more efficient by restricting the search for a better interpretation to the k-best predictions P^k_i ⊆ Π (i.e., the k interpretations with the highest prediction confidence) by the NLU model of interest. Note that it is not difficult to force more diversity into the k-best predictions by only allowing top predictions from each domain or intent. For training, we leverage rephrase information from the logged data to automatically assign a corrected semantic interpretation as the new ground-truth label for the defects, with the following assumption: given a pair of turns t_i and t_j, if (a) the utterance of t_j rephrases the utterance of t_i in the same session and (b) t_j is non-defective, then the semantic interpretation of t_j is also the correct interpretation for t_i.
Following the example DIM architecture for the cross-domain interpretation re-ranking model in Figure 3, the DCM architecture extends that of DIM, with the main difference that we can generate additional features based on the domain, intent, and slot information from p_j. To obtain the training data, we first examine all turns in D_live to generate the high-value set D_h ⊆ T × T. Each instance (t_i, r_i) ∈ D_h is a pair of turns satisfying (a) t_i ∈ D_live is a defect and (b) r_i ∈ D_live is a non-defective rephrase of t_i in the same session (defects and rephrases are described in Section 2.3 and Section 3: Data Preparation). We then generate the training data D_train using the high-value set D_h. Specifically, for each pair (t_i, r_i) ∈ D_h, we generate k training instances as follows. First, we get the k-best interpretations P^k_{r_i} of r_i. Then, we pair t_i with each candidate p_j ∈ P^k_{r_i} to get a list of tuples (t_i, p_1), (t_i, p_2), ..., (t_i, p_k). Next, we expand each tuple (t_i, p_j) by assigning a label c indicating whether p_j can be a proper interpretation for t_i. Denote p* ∈ P^k_{r_i} as the correct interpretation for r_i, assumed correct since it was executed without a defect (note that the top-1 interpretation is not necessarily the executed and correct one, although it is most of the time). We generate one positive instance (t_i, p*, c = 1) and k − 1 negative instances {(t_i, p_j, c = 0) | p_j ∈ P^k_{r_i} ∧ p_j ≠ p*}. Only using the k-best interpretations from r_i to generate D_train may not be sufficient, as in practice the value k is small and many interpretations observed in real traffic do not appear in the training data. To make the model generalize better, we augment the training data by injecting random noise.
Table 1: Overall side-by-side win-loss evaluation results across 10 domains, comparing the top interpretation prediction between the baseline NLU and the updated NLU improved with our framework. "W," "L," "T" and "O" represent "Win," "Loss," "Tie" and "Others" respectively. A win means that the updated NLU produced a better top interpretation than the baseline (* denotes statistical significance at p<.05).

For each pair (t_i, r_i) ∈ D_h, in addition to the k − 1 generated negative instances, we randomly draw q interpretations P^q_noise = {p_{n_1}, p_{n_2}, ..., p_{n_q}} ⊆ Π that are not in P^k_{r_i}, and we generate q new negative instances {(t_i, p_{n_j}, c = 0) | p_{n_j} ∈ P^q_noise}. In short, DCM's role is to find the most promising alternate interpretation in t_i's k-best interpretation list, given that t_i is a defect.
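The DCM instance generation described above (one positive, k − 1 hard negatives, q noise negatives) can be sketched as follows; interpretations are represented as strings, and the fixed seed is only for reproducibility of the sketch:

```python
import random

def make_dcm_instances(t_i, kbest_rephrase, p_star, interpretation_pool, q=3, seed=0):
    """Generate DCM training tuples (t_i, p_j, c) from a (defect, rephrase) pair.
    kbest_rephrase is P^k_{r_i}, p_star the executed (correct) interpretation,
    interpretation_pool a stand-in for Pi used for noise injection."""
    instances = [(t_i, p_star, 1)]                                       # 1 positive
    instances += [(t_i, p, 0) for p in kbest_rephrase if p != p_star]    # k-1 negatives
    rng = random.Random(seed)
    candidates = [p for p in interpretation_pool if p not in kbest_rephrase]
    instances += [(t_i, p, 0) for p in rng.sample(candidates, q)]        # q noise negatives
    return instances
```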

New Supervision Data Curation:
Once we have f_dcm trained, the last step of the framework is to curate new supervision data by applying f_dcm to each turn t_i ∈ D_target identified by f_dim and automatically assigning a better semantic interpretation for correction. Specifically, we pair each turn t_i ∈ D_target with every interpretation candidate p_j ∈ P^k_i as the input to f_dcm. The interpretation with the highest score, p* = argmax_{p_j ∈ P^k_i} f_dcm(t_i, p_j), is used as the corrected interpretation for t_i.
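This curation step amounts to an argmax over each target defect's own k-best list, assuming f_dcm returns a correction score:

```python
def curate_supervision(target_defects, kbest, f_dcm):
    """For each target defect t_i, pick the interpretation in its k-best list
    P^k_i with the highest DCM correction score as the new ground-truth label."""
    curated = {}
    for t_i in target_defects:
        curated[t_i] = max(kbest[t_i], key=lambda p: f_dcm(t_i, p))
    return curated
```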

Experiment Methodology
Dataset and Experiment Settings: Given a baseline NLU in production, m_base, which produces a ranked list of interpretations with each interpretation comprising a domain-intent-slots tuple, we inject a re-ranking subtask at the very last layer of the NLU workflow to build an improved NLU, m_new. We call the subtask re-ranking because it takes in an already ranked list (i.e., the output of m_base) and makes a final adjustment. We leverage the new supervision data obtained through our framework to train the re-ranking model for improving the overall NLU performance. Figure 4 shows the model architecture of the re-ranker, which is a simple extension of the DIM architecture; it learns from the new supervision data when to top-rank a better interpretation that is not at the top of the list (trained with sigmoid activation functions at the output layer and binary cross-entropy loss). We note here that the specific model architecture is not as important as the new supervision data obtained through our framework, which is the key to bringing NLU improvements. This experiment setup is appealing in that it is straightforward and simple, especially in the production setting. First, NLU consists of many domain-specific models spread out across multiple teams, making it difficult to coordinate leveraging the new supervision data for improvement across multiple domains. Second, working with the final re-ranking model allows us to improve NLU performance domain-agnostically without needing to know the implementation details of each domain. Third, it is easier to control the influence of the new supervision data since we need to manage only one re-ranking component. Given sampled and de-identified production traffic data from one time period, D_period1, which have been analyzed by f_defect and f_rephrase, we first train DIM according to Section 3.1, with over 100MM training instances from D_period1 and over 10MM defects identified by f_defect.
Then, we extract over 8MM high-value rephrase pairs (a defective turn and a non-defective rephrase in the same session) from D_period1 to train DCM according to Section 3.2. To train the re-ranker, we randomly sample over 10MM instances D_s ⊆ D_period1 with over 1MM defects identified by f_defect (in today's production system, f_defect and f_rephrase show F1 scores over 0.70). We apply the trained DIM to the sampled defects F_def, filtering them down from over 1MM defects to over 300K target defects F_dim that the NLU re-ranker has sufficient features to target and produce different results for. Then, all target defects F_dim are assigned a new ground-truth interpretation label by the trained DCM (note that not all defects have corresponding non-defect rephrases, hence the value of DCM for finding the most promising alternate interpretation from the ranked list), which serves as the newly curated supervision for building m_new, while the rest of the non-defective instances keep the top-ranked interpretation as the ground-truth label. In other words, most of the instances in D_s are used to replicate the m_base results (a pass-through where the same input ranked list is output without any change), except for over 300K (over 3% of the total training data) that are used to revise the ranking and put a better interpretation at the top.
Overall Side-by-Side Evaluation: The overall performance of m_base and m_new was compared on another sampled production traffic set from a non-overlapping time period, D_period2, in a shadow evaluation setting, in which the traffic flowing through m_base was duplicated and simultaneously sent to m_new, deployed in the same production setting as m_base but without end-user impact. Both m_base and m_new produced the same ranked list of interpretations over 99% of the time. Note that this is by design, since incremental improvements are preferred in production systems without drastically changing the system behavior, and our approach can be applied continuously, week over week (changing the proportion of the new supervision data will have an impact on the replication rate). Furthermore, even a 1% change in the overall system behavior has a huge impact at the scale of tens of millions of requests per week in a large-scale production system. We performed win-loss annotations on the deltas (when m_base and m_new produced different results) with in-house expert annotators who follow an established NLU annotation guideline to make a side-by-side evaluation of whether m_new produced a better interpretation (i.e., a win) at the top compared to m_base or not (N = 12, agreement = 80.3%, Cohen's kappa = 0.60 indicating moderate agreement; note that the annotators are trained to reach an agreement level that is practical given the high complexity of the NLU ontology). We randomly sampled 200 such requests per domain that produced different results.
DIM Analysis: We randomly sampled 100 defects per domain from F_def and F_dim respectively and performed error attribution annotations (e.g., ASR Error for mis-transcribing "play old town road" as "put hotel road"; NLU Error for mis-interpreting "how do I find a good Italian restaurant around here" as a Question Answering intent instead of a Find Restaurant intent; Bad Response for having a correct interpretation that still failed to deliver a satisfactory response or action; and Others for those that the annotators could not determine due to lack of context or additional information; N = 12, agreement = 71.3%, Cohen's kappa = 0.63 indicating substantial agreement).
DCM Analysis: We perform the same win-loss annotations as described in the overall shadow evaluation on 100 random samples per domain, specifically on the curated supervision data F_dim with new ground truth assigned by DCM.
Training Setup: All the models were implemented in PyTorch (Paszke et al., 2019) and trained and evaluated on AWS p3.8xlarge instances with Intel Xeon E5-2686 CPUs, 244GB memory, and 4 NVIDIA Tesla V100 GPUs. We used Adam (Kingma and Ba, 2014) for training optimization, and all the models were trained for 10 epochs with a 4096 batch size. All three models have around 12MM trainable parameters and took around 5 hours to train.

Results and Discussions
Overall Side-by-Side Evaluation: Table 1 shows the overall shadow evaluation results, making an NLU-level comparison between m_base and m_new. The column Total shows the number of requests annotated per domain. The columns Win, Loss, and Tie show the number of requests where m_new produced better, worse, and comparable NLU interpretations than m_base, respectively. The column Others shows the number of requests where the annotators could not make a decision due to lack of context. The column ∆_1 shows the difference between the number of win and loss cases, and ∆_2 shows the relative improvement (i.e., ∆_1 / Total in percentage). First, we note that m_new overall produced a better NLU interpretation in 367 cases while making 196 losses, resulting in 171 absolute gains or an 8.5% relative improvement over m_base. This indicates that applying our framework can bring a net overall improvement to existing NLU. Second, analyzing per-domain results shows that m_new outperforms m_base (7.5-26.0% relative improvements) on 5 domains, while making marginal improvements (0.5-3.5%) on the other 5 domains.
Analysis on DIM: Table 2(a) summarizes the results of error attribution annotations between the defects in the production traffic (denoted as DEF) and the target defects identified by DIM (denoted as DIM). The results show that the target defects identified by DIM help us focus more on the defects that are caused by ASR or NLU (the ones that can be targeted and potentially fixed, specifically NLU Error, which is at 39.0% of total for DIM compared to 14.3% for DEF) and filter out others (Bad Response and Others). Per-domain results show that the target defects identified by DIM consistently have a higher NLU error ratio than the original defects across all domains.
Table 3: Qualitative analysis comparing m_base and m_new in the overall side-by-side evaluation. For each example, the user request in bold is the turn for which the evaluation was performed. We show the subsequent interaction dialog for context (U* for user requests, A* for system answers). The first two examples are "wins" (i.e., m_new better than m_base), followed by two "losses" (i.e., m_new worse than m_base), and a "tie" (i.e., m_new comparable to m_base).

Analysis on DCM: We report the win-loss annotation results on the target defects assigned new interpretation labels for correction with DCM.
The results show that overall DCM correctly assigns a better, corrected NLU interpretation in 399 cases and fails in 77 cases, resulting in 322 absolute gains or a 32.2% relative improvement. Per-domain results show that DCM consistently assigns a comparable or better interpretation to the target defects in almost all domains by a large margin (8.0%-79.0% relative improvements on 9 domains).

Qualitative Analysis
The first two examples in Table 3 are wins where m_new produced a better top interpretation than m_base. In Win 1, m_base produced an interpretation related to playing a title for a specific type of multimedia, while the user wanted to play the corresponding title in another multimedia type (e.g., music, video, or audio book). The updated NLU model m_new produced the correct interpretation, most likely having learned to favor a multimedia type depending on the context, such as device status (e.g., music or video currently playing, or the screen being on). Similarly, in Win 2, m_base mis-interpreted the request as a general question due to not understanding the location "Mission Beach," which is corrected by m_new. The next two examples are losses where m_new top-ranked incorrect interpretations, producing worse results than m_base. In Loss 1, the user is in the middle of trying out a free content experience for a specific multimedia type, and we suspect the reason m_new produced the incorrect interpretation is that there are similar requests in live traffic to "Play Wings of Fire" with another multimedia type, such that the model learns to aggressively top-rank the interpretations associated with the more dominant multimedia type. In Loss 2, the request is a general event query for the area, and although the Q&A response still failed to answer it correctly, it was determined that it would be worse to fail in the Calendar domain.
The last example is a "tie" where m_new and m_base both produced incorrect top interpretations that are equally bad in terms of user experience. Specifically, m_base mis-interpreted the request as a Q&A, while m_new mis-interpreted the meaning of "play" as playing multimedia instead of sports. As in Loss 1, we suspect that many live utterances with the word "play" tend to be multimedia-related, biasing DCM towards selecting multimedia-related interpretations.
From the qualitative analysis, especially the losses, we observe that we could make our framework and new supervision data more precise by considering interaction history context spanning a longer period of time when training DCM, and by using more signals such as personalization or subscription signals (for multimedia content types such as music or audio books). Furthermore, for truly ambiguous requests, instead of aggressively trying to correct through a new interpretation, we could offer a better experience by asking a clarifying question.

Conclusion
We proposed a domain-agnostic and scalable framework for leveraging implicit user feedback, particularly user dissatisfaction and rephrase behavior, to automatically curate new supervision data to continuously improve NLU in a large-scale conversational AI system. We showed how the framework can be applied to improve NLU and analyzed its performance across 10 popular domains on a real production system, with component-level and qualitative analysis of our framework for more in-depth validation of its performance.