HERALD: An Annotation Efficient Method to Detect User Disengagement in Social Conversations

Open-domain dialog systems have a user-centric goal: to provide humans with an engaging conversation experience. User engagement is one of the most important metrics for evaluating open-domain dialog systems, and could also be used as real-time feedback to benefit dialog policy learning. Existing work on detecting user disengagement typically requires hand-labeling many dialog samples. We propose HERALD, an efficient annotation framework that reframes the training data annotation process as a denoising problem. Specifically, instead of manually labeling training samples, we first use a set of labeling heuristics to label training samples automatically. We then denoise the weakly labeled data using the Shapley algorithm. Finally, we use the denoised data to train a user engagement detector. Our experiments show that HERALD improves annotation efficiency significantly and achieves 86% user disengagement detection accuracy in two dialog corpora.


Introduction
Evaluation metrics heavily influence a field's research direction. The ultimate goal of open-domain dialog systems is to provide an enjoyable experience to users. Previous research mainly focuses on optimizing automatic dialog evaluation metrics such as BLEU, which models the distance between the system responses and a limited number of references available. However, it has been shown that these metrics correlate poorly with human judgments (Liu et al., 2016).
Open-domain dialog system evaluation has long been one of the most difficult challenges in the dialog community for several reasons: (1) The goal of dialog evaluation should be to evaluate users' conversational experience. Existing automatic evaluation metrics such as BLEU are mostly constrained to a static corpus, and do not capture the user experience in a realistic interactive setting. (2) Currently, self-reported user ratings are widely used to evaluate open-domain dialogs. However, self-reported ratings suffer from bias and variance among different users (Liang et al., 2020e). Although we could tell which dialog system is better by running statistical tests on a large number of noisy ratings, it is challenging to reliably locate dialogs with bad performance. Only by identifying these bad dialogs effectively can we correct errors in these samples to improve dialog system quality.
User engagement has been recognized as one of the essential metrics for open-domain dialog evaluation. Previous research also confirms that incorporating user engagement as real-time feedback benefits dialog policy learning (Yu et al., 2016). One of the most costly bottlenecks of learning to detect user disengagement is annotating many turn-level user engagement labels (Ghazarian et al., 2020). In addition, the data annotation process becomes more expensive and challenging for privacy-sensitive dialog corpora, due to privacy concerns in crowdsourcing (Xia and McKernan, 2020).
To improve annotation efficiency, we reframe the training data annotation process as a denoising problem. Specifically, instead of manually labeling each training datum, we automatically label the training samples with a set of labeling heuristics. The heuristic functions primarily consist of regular expressions (Regexes) and incorporate open-sourced natural language understanding (NLU) services. Since the automatically generated labels might contain noise, we then denoise the labeled data using the Shapley algorithm (Jia et al., 2019a,b). We use the Shapley algorithm to quantify the contribution of each training datum, so that we can identify the noisy data points with negative contribution and then correct their labels. Our experiments show that HERALD achieves 86% accuracy in user disengagement detection in two dialog corpora.
Our proposed framework HERALD is conceptually simple and suitable for a wide range of application scenarios. First, since our model detects user engagement in real time (i.e., after each user utterance), it could be plugged into existing dialog systems as a real-time user experience monitor module. In this way, dialog systems could detect and react to user disengagement in both open-domain dialogs (Yu et al., 2016) and task-oriented dialogs (Yu et al., 2017). During training, our model could also provide real-time feedback to benefit dialog policy learning (Yi et al., 2019). Second, HERALD could quantify user engagement and serve as an automatic dialog evaluation metric, reliably locating dialogs with poor user experience to improve dialog system quality (Ghazarian et al., 2020; Choi et al., 2019). Third, user engagement is an essential objective of dialog systems, but few dialog datasets with user engagement ratings are available. Our heuristic functions, combined with the proposed workflow, can be readily deployed to annotate new dialog datasets.

Open-Domain Dialog System Evaluation
Open-domain dialog system evaluation is a long-lasting challenge. It has been shown that existing automatic dialog evaluation metrics correlate poorly with human judgments (Liu et al., 2016; Lowe et al., 2017; Novikova et al., 2017). A well-known reason is that these metrics rely on modeling the distance between the generated response and a limited number of available references. The fundamental gap between the open-ended nature of conversations and the limited references is not addressed in methods that are lexical-level based (Papineni et al., 2002; Lin, 2004; Banerjee and Lavie, 2005), embedding based (Rus and Lintean, 2012; Forgues et al., 2014), perplexity based (Adiwardana et al., 2020), or learning based (Tao et al., 2018; Lowe et al., 2017). Mehri and Eskénazi (2020) simulate user responses using DialoGPT and evaluate the probability of user complaints. Given the limitations above, self-reported user ratings are widely used to evaluate open-domain dialogs. However, self-reported ratings suffer from bias and variance among different users. Denoising human ratings is still an open research problem (Liang et al., 2020e).

User Engagement in Dialogs
User engagement is commonly defined as the user's willingness to continue conversing with the dialog system (Yu et al., 2016, 2017). Existing work on measuring user engagement primarily resorts to human ratings (Yi et al., 2019; Hancock et al., 2019) or proxy metrics. Example proxy metrics include conversation length, such as the number of dialog turns, and conversational breadth, such as topical diversity. Sporadic attempts have been made at detecting user disengagement in dialogs (Yu et al., 2004; Ghazarian et al., 2020; Choi et al., 2019). A major bottleneck of these methods is that they require hand-labeling many dialog samples for individual datasets. Although Liang et al. (2020e) denoise user self-reported ratings with the Shapley algorithm for dialog system evaluation, their method cannot be directly applied to dialogs without user ratings, as in our setting. Our work focuses on the problem that user ratings are expensive and difficult to obtain. The core insight of our work is to reframe the training data annotation process as one of denoising labels created by pre-defined heuristic functions. To the best of our knowledge, we are the first to combine automatic data labeling with the Shapley algorithm to perform dialog evaluation. Our method could potentially generalize to other classification tasks if different weak labelers are provided.

Learning from Weak Supervision
Learning from weak supervision reduces annotation costs by utilizing noisy but cost-efficient labels (Ratner et al., 2016, 2020; Liang et al., 2020e). One of the most popular forms of weak supervision is distant supervision, in which the records of an external knowledge base are heuristically aligned with data points to produce noisy labels for relationship extraction tasks (Bunescu and Mooney, 2007; Mintz et al., 2009; Hancock et al., 2018). Other applications of weak supervision to scene graph prediction (Krishna et al., 2019), intent classification (Mallinar et al., 2019), and medical imaging (Varma et al., 2017) have observed similar benefits in annotation efficiency. Unlike the existing work, we leverage weak supervision to improve annotation efficiency for detecting user disengagement in social conversations.

Figure 1: Schematic of the HERALD two-stage workflow. Stage 1: Auto-label training data with heuristic functions. We first design heuristic rules for detecting user disengagement by investigating multiple dialog corpora. The rules are implemented as heuristic functions based on regular expressions and dialog acts. We then use the heuristic functions to label the training set automatically. Stage 2: Denoise weakly labeled training data with the Shapley algorithm. We calculate the Shapley value for each data point and correct the noisy data points with negative Shapley values by flipping their labels. Finally, we fine-tune the model on the denoised training data.

Problem Formulation
We define engagement as the degree to which users are willing to continue conversing with the dialog system (Yu et al., 2016, 2017). We focus on identifying the dialog turns with "disengaged" user responses, since they usually indicate a poor conversation experience. We formulate user engagement prediction as a binary classification problem: our goal is to learn a parameterized user engagement predictor M_θ that, given a dialog turn (along with its dialog context) x ∈ X, predicts the turn-level user engagement label y ∈ Y = {0, 1}, where y = 1 means "disengaged" and y = 0 means "engaged". We start from an unlabeled training set D_train = {x_i}_{i=1}^{N_train}, together with a labeled test set D_test and development set D_dev, each of whose samples contains the ground-truth label y_i. The development set D_dev has a similar structure to the test set D_test, but can be much smaller than the training set (i.e., N_dev ≪ N_train), making it economical to obtain. Following the general architecture of neural classifiers, we formulate our model as M_θ = M(φ, f) = f(φ(x)), where φ is a BERT-based (Devlin et al., 2019) text encoder that maps each dialog turn x to a feature space φ(x) ∈ R^d, and f is the final linear layer with softmax activation.
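The formulation above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the generic `encoder` argument stands in for the BERT encoder φ, and the class and parameter names are assumptions.

```python
import torch
import torch.nn as nn

class EngagementClassifier(nn.Module):
    """M_theta = M(phi, f) = f(phi(x)): an encoder phi followed by a
    final linear layer f with softmax activation."""
    def __init__(self, encoder: nn.Module, feat_dim: int, num_labels: int = 2):
        super().__init__()
        self.encoder = encoder                       # phi: dialog turn -> R^d
        self.head = nn.Linear(feat_dim, num_labels)  # f: final linear layer

    def features(self, x):
        # phi(x); these features are reused later for the KNN-based Shapley step.
        return self.encoder(x)

    def forward(self, x):
        return torch.softmax(self.head(self.features(x)), dim=-1)
```

In the paper, φ is a BERT encoder; any module that maps a tokenized dialog turn to a d-dimensional feature vector fits this interface.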
Datasets

Gunrock Movie Dataset

The Gunrock Movie dataset consists of dialog data collected from Gunrock, an ASR-based open-domain social chatbot originally designed for the Amazon Alexa Prize (Liang et al., 2020a). The dataset comes from a user study in which in-lab users were recruited to carry on conversations. We have consent to use the data, and we removed any sensitive information from the conversations. Two dialog experts (co-authors of this paper) annotated 134 randomly sampled dialogs and split them evenly into the test set and development set. In total, the experts labeled 519 turn-level disengaging user responses and 2,312 engaging user responses, reaching a high inter-annotator agreement (Cohen, 1968) of κ = 0.78. The training set contains 276 unlabeled dialogs, with 5,644 dialog turns. In addition, we ensure that the data annotation is independent of the labeling heuristics collection, so there is no data leakage problem. A full example dialog can be found in Appendix A.4.

ConvAI2 Dataset
The ConvAI2 dataset contains text-based dialogs collected from the second Conversational Intelligence (ConvAI) Challenge (Dinan et al., 2019). We select dialogs from the eight main participating chatbots (Bot 1, 2, 3, 4, 6, 9, 11) and exclude dialogs that are one-sided or shorter than three turns. The dialog experts annotated 207 dialogs in total. The dialogs are evenly distributed over all eight bots to ensure system diversity, and are randomly sampled within each bot. The annotated data consist of 209 disengaging turns and 1,684 non-disengaging turns. The experts reached a high inter-annotator agreement (Cohen, 1968) of κ = 0.76. We split the annotated dialogs evenly into the test set and development set. The training set contains 2,226 dialogs, with 18,306 dialog turns.
Google Meena Dataset

Meena (Adiwardana et al., 2020) is the largest end-to-end neural chatbot to date, trained on 867M public-domain social media conversations. We study the 93 example Human-Meena conversations released by Google.
Facebook Blender Dataset

The Blender bot (Roller et al., 2020) is an open-domain chatbot with several conversational skills: providing engaging talking points, listening to its partners, and displaying knowledge, empathy, and personality appropriately while maintaining a consistent persona. We study the 108 example Human-Blender conversations released by Facebook.

Method
Our goal is to train a user engagement detector with minimal data annotation effort. Traditional supervised learning paradigms require annotating many training samples, and extending the model to a new dialog corpus requires additional annotation. To reduce annotation work, we propose HERALD, a two-stage pipeline that annotates large-scale training data efficiently and accurately (Figure 1). Instead of hand-labeling training data points, we use heuristic functions, built upon a set of user disengagement heuristic rules, to label each training datum automatically. Since the training data are automatically labeled, their labels are noisy. We then clean the noisy training data with the Shapley algorithm (Ghorbani and Zou, 2019) to improve labeling accuracy: the algorithm identifies data points with wrong labels and flips their labels. Finally, we fine-tune a BERT-based model on the cleaned training data to obtain the final user disengagement detection model.

Stage 1: Auto-label Training Data with Heuristic Functions
Since labeling large-scale training data is time-consuming, we propose heuristic labeling functions to label training data automatically. The heuristic functions focus on detecting disengagement in user responses, as it directly indicates poor user experience. To build the heuristic functions, we first summarize the heuristic rules shared among users. We investigate the disengaged dialog turns in the four datasets mentioned above and identify four groups of user disengagement patterns: "complain system responses", "dislike current topics", "terminate or change topics", and "end with non-positive responses" (Table 1). We then discuss the implementation of the heuristic functions.

Disengagement Heuristic Rules
Group 1: Complain system responses. Complaints are an evident sign of user disengagement. We identify six related disengaged intents. The first three intents ("complain system repetition", "complain system ignoring them" and "complain system misunderstanding") usually appear when the bot makes errors like repeating the same content, ignoring, forgetting, and misunderstanding the user's response. In these cases, users express their disengagement by indicating the bot's error (e.g. "You already told me that", "You're not listening"). Another intent "not understanding system" happens when users cannot understand the system's response (e.g. "I don't know what you're talking about."). In the last two intents, users reveal negative emotions by cursing the system (e.g. "you're dumb") or express frustration (e.g. "sigh") about the conversation.
Group 2: Dislike current topics. When discussing a given topic, users might show their disengagement by expressing negative opinions or low interest. For example, given the bot's response "I write romantic novels under a pen name", users who are not interested in reading might say "reading is boring", "I don't like to read", or "I'm not interested in this". We also make sure to handle the corner cases where the user utterance contains negative opinions but should be labeled as engaged. For instance, to respond to the bot's question, "do you want to not work?", a user might say, "Yes. my job is boring. I have to work with mail". Though the user mentions a negative feeling ("boring"), the user agrees with the bot and shares further information.
Group 3: Terminate or change topics. Group 3 considers cases where users express disengagement with the current topic in a more straightforward fashion. For example, if users are not interested in the current topic, instead of just expressing their dislike of it, they may request to switch topics with "Let's talk about something else". In some cases, users might show strong disengagement by requesting to end the conversation altogether.
Group 4: End with non-positive responses. A more subtle but common clue of disengagement is when users end their response with non-positive content. For example, non-positive responses like "I don't know", "No", "Yeah", "uh", or "Probably" imply that users do not have much to say about the current topic. To keep the precision of our heuristics high, we carefully consider the counterexamples. One case is when the user follows up with more responses in the same dialog turn, such as a question (e.g., Bot: "Have you seen any movies lately?", User: "No. Have you?") or an opinion (e.g., Bot: "What's your favorite animation movie?", User: "I don't know, but it might actually be frozen two. My sister loves it."). These turns should not be labeled as disengaged, since the user is still interested in sharing more content or asking follow-up questions. Therefore, we take a conservative approach: we label a dialog turn as disengaged only if no further response follows the non-positive response.

Heuristic Functions Implementation
Next, we discuss how to use heuristic functions to auto-label disengaged user utterances. First, we split user responses into segments, since a user response may consist of multiple units with different semantic meanings. We use the NLTK sentence tokenizer for text-based systems and a segmentation model (Chen et al., 2018) for ASR (Automatic Speech Recognition)-based systems. We then apply the heuristic functions to each segment to detect disengaged intents. For heuristic groups 1 to 3, if any segment contains a disengaged intent, the user response is auto-labeled as disengaged. For heuristic group 4 ("End with non-positive responses"), we assign a disengaged label only if the disengaged intent is detected in the last segment. We detect disengaged intents with Regexes, which have minimal dependencies and are easy to modify. We design Regexes for each intent. Following common Regex complexity metrics (Luo et al., 2018), our Regexes contain 43.9 Regex groups and 87.7 "or" clauses per intent on average.
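The labeling procedure can be sketched as follows. The patterns shown are simplified illustrations, not the paper's actual Regexes (which average ~44 groups and ~88 "or" clauses per intent); only the control flow, including the last-segment rule for group 4, reflects the description above.

```python
import re

# Illustrative (hypothetical) patterns, a tiny subset of each heuristic group.
GROUP_PATTERNS = {
    1: [r"\byou('re| are) not listening\b", r"\byou (already )?told me that\b"],
    2: [r"\b(boring|not interested)\b", r"\bi don'?t like\b"],
    3: [r"\blet'?s talk about something else\b", r"\bstop talking\b"],
    4: [r"^(no|yeah|uh|probably|i don'?t know)\.?$"],  # non-positive closers
}

def heuristic_label(segments):
    """Return 1 (disengaged) if any heuristic group fires, else 0 (engaged).

    `segments` is a user response already split into semantic units (the
    paper uses NLTK's sentence tokenizer for text and a segmentation model
    for ASR systems).
    """
    for group, patterns in GROUP_PATTERNS.items():
        # Group 4 only counts when it fires on the *last* segment.
        targets = segments[-1:] if group == 4 else segments
        for seg in targets:
            if any(re.search(p, seg.strip().lower()) for p in patterns):
                return 1
    return 0
```

For example, `heuristic_label(["No.", "Have you?"])` stays engaged because the non-positive "No" is followed by a question, while `heuristic_label(["i don't know"])` is labeled disengaged.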
Our framework also supports incorporating additional resources to improve intent detection accuracy for automatic training data labeling. For example, we can enhance the recall of Regex intent detection by incorporating existing deep learning-based NLU (Natural Language Understanding) models. Specifically, we re-purpose an open-sourced dialog act classification model (Yu and Yu, 2021) to enhance disengagement intent detection: we select 6 of the 23 supported dialog act labels that are associated with disengaged intents, and map each selected label to a heuristic group. The dialog act "complaint" is mapped to the heuristic group "complain system repetition"; "closing" to the disengaged intent "request termination"; "hold" to "hesitation"; "other_answers" to "unsure answer"; "back-channeling" to "back-channeling"; and "neg_answer" to "negative answer". If a user utterance is detected with a disengaged intent by either the Regexes or the deep learning model, the utterance is auto-labeled as disengaged.
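The dialog-act mapping above amounts to a small lookup table combined with a logical OR over the two weak labelers. The label strings come from the text; the function shape is an illustrative assumption.

```python
# Mapping from selected MIDAS dialog-act labels to disengaged intents,
# as described in the text.
DA_TO_INTENT = {
    "complaint": "complain system repetition",
    "closing": "request termination",
    "hold": "hesitation",
    "other_answers": "unsure answer",
    "back-channeling": "back-channeling",
    "neg_answer": "negative answer",
}

def auto_label(regex_fired: bool, dialog_act: str) -> int:
    """1 = disengaged if either weak labeler detects a disengaged intent."""
    return int(regex_fired or dialog_act in DA_TO_INTENT)
```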

Stage 2: Denoise with Shapley Algorithm & Fine-tune

Overview Next, we denoise the labeled data using the Shapley algorithm (Ghorbani and Zou, 2019). The Shapley algorithm has been studied in cooperative game theory (Dubey, 1975) and economics (Gul, 1989) as a fair distribution method. It computes a Shapley value for each training datum, which quantifies the datum's contribution to the prediction and performance of a deep network. Low Shapley value data capture outliers and corruptions. We can therefore identify and denoise incorrectly labeled data by computing their Shapley values, and then fine-tune the model on the cleaned training set.
Shapley Algorithm The Shapley algorithm originates from cooperative game theory (Dubey, 1975). Consider a cooperative game with n players D = {1, ..., n} and a utility function v: 2^{[n]} → R that assigns a reward to each of the 2^n subsets of players: v(S) is the reward if the players in subset S ⊆ D cooperate. The Shapley value defines a unique scheme to distribute the total gains v(D) generated by the coalition of all players, with a set of appealing mathematical properties. In our setting, we treat the training samples D_train = {(x_i, y_i)}_{i=1}^{N_train} as N_train players, and define the utility function v(S) as the performance on the development set D_dev. The Shapley value for player i is defined as the average marginal contribution of (x_i, y_i) to all possible subsets formed by the other players (Jia et al., 2019a,b):

s_i = (1 / N_train) Σ_{S ⊆ D_train \ {i}} [v(S ∪ {i}) − v(S)] / C(N_train − 1, |S|)

As this definition suggests, computing Shapley values requires enumerating O(2^{N_train}) possible subsets and training the model M_θ on each subset, which is intractable. Inspired by Jia et al. (2019a,b), HERALD tackles this issue by reducing the deep model M_θ to a K-nearest neighbors (KNN) model and applying the closed-form KNN solution of the Shapley value: we first fine-tune our BERT-based classification model M_θ = M(φ, f) = f(φ(x)) on the auto-labeled training samples, then use the feature extractor φ to map each training datum to the feature space {φ(x_i)}_{i=1}^{N_train}, and finally construct a KNN classifier in the feature space to compute the closed-form Shapley value.
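For intuition, the definition can be computed by brute force on a toy game, enumerating all subsets of the other players; this is exactly the O(2^N) computation that the closed-form KNN solution avoids. The code is an illustrative sketch, not part of the HERALD pipeline.

```python
from itertools import combinations
from math import comb

def exact_shapley(players, utility):
    """Brute-force Shapley values: each player's value is its average
    marginal contribution over all subsets S of the other players,
    weighted by 1 / (n * C(n-1, |S|))."""
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        s = 0.0
        for k in range(n):
            for S in combinations(others, k):
                s += (utility(set(S) | {i}) - utility(set(S))) / (n * comb(n - 1, k))
        values[i] = s
    return values
```

With `utility = len`, every player contributes equally and each Shapley value is 1, illustrating the efficiency property: the values sum to v(D).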
Next, we discuss the closed-form solution of the Shapley value. We first consider the special case where the development set contains a single datum, D_dev = {(x_dev, y_dev)}. Given any nonempty subset S ⊆ D_train, we use the KNN classifier to classify x_dev: we sort the training points {x_i}_{i=1}^{N_train} by their Euclidean distance to x_dev in the feature space φ(x), yielding (x_{α_1}, x_{α_2}, ..., x_{α_|S|}) with x_{α_1}, ..., x_{α_K} as the top-K most similar points to x_dev. The KNN classifier outputs the probability of x_dev taking the label y_dev as P[x_dev → y_dev] = (1/K) Σ_{k=1}^{K} 1[y_{α_k} = y_dev], where α_k is the index of the k-th nearest neighbor. We define the utility function as the likelihood of the correct label:

v(S) = (1/K) Σ_{k=1}^{min(K, |S|)} 1[y_{α_k(S)} = y_dev]    (1)

Jia et al. (2019a,b) prove that the Shapley value s_{α_i} of each training point can then be calculated recursively in O(N log N) time, starting from the farthest point (with N = N_train):

s_{α_N} = 1[y_{α_N} = y_dev] / N
s_{α_i} = s_{α_{i+1}} + ((1[y_{α_i} = y_dev] − 1[y_{α_{i+1}} = y_dev]) / K) · (min(K, i) / i)

The above result for a single point in D_dev readily extends to the multiple-point case, in which the utility function averages Equation (1) over all development points, and the Shapley value of each training point is likewise averaged over the development set.

Workflow (1) We first fine-tune M_θ on the auto-labeled training samples. This step injects the knowledge in the labeling heuristics into the model M_θ.
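The recursion and its average over the development set can be sketched as follows, assuming Euclidean distances in the feature space φ(x); variable and function names are illustrative, not the authors' code.

```python
import numpy as np

def knn_shapley_single(X_train, y_train, x_dev, y_dev, K=10):
    """Closed-form KNN-Shapley for a single dev point (Jia et al., 2019a,b).

    Training points are sorted by distance to x_dev; values are filled
    from the farthest point inward using the recursion in the text.
    """
    N = len(y_train)
    order = np.argsort(np.linalg.norm(X_train - x_dev, axis=1))
    s = np.zeros(N)
    # Farthest point: s[alpha_N] = 1[y_{alpha_N} = y_dev] / N
    s[order[N - 1]] = float(y_train[order[N - 1]] == y_dev) / N
    for i in range(N - 2, -1, -1):  # i is the 0-based rank; 1-based rank = i + 1
        a, b = order[i], order[i + 1]
        s[a] = s[b] + (float(y_train[a] == y_dev) - float(y_train[b] == y_dev)) \
               / K * min(K, i + 1) / (i + 1)
    return s

def knn_shapley(X_train, y_train, X_dev, y_dev, K=10):
    """Multi-point case: average the single-point values over the dev set."""
    return np.mean([knn_shapley_single(X_train, y_train, x, y, K)
                    for x, y in zip(X_dev, y_dev)], axis=0)
```

A mislabeled point near the dev point receives a negative value, while a correctly labeled neighbor receives a positive one.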
(2) We then map each auto-labeled training datum to the feature space {φ(x_i)}_{i=1}^{N_train}, since we want to apply the closed-form KNN formula of the Shapley value in the feature space. (3) Next, for the binary classification problem, we duplicate each training datum twice, once with each label in {0, 1}. This generates a large training set D_large with 2 × N_train data points; note that the original training set D_train is a subset of D_large, since D_large enumerates every possible label for each training datum. (4) We then calculate the Shapley value of the 2 × N_train data points in D_large using the closed-form KNN formula. (5) We remove the data points with negative Shapley values from D_large to obtain a cleaned training set D_clean. This duplicate-and-remove procedure "flips" the labels of noisy data points with low Shapley values. (6) Finally, we fine-tune the classification model M_θ on D_clean to obtain the final user disengagement detection model.
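Steps (3)-(5), the duplicate-and-remove procedure, can be sketched as follows. The `value_fn` callback is assumed to return a Shapley value per (x, y) copy (e.g., the closed-form KNN formula against the development set, on features from the fine-tuned encoder); this is an illustrative sketch rather than the authors' code.

```python
import numpy as np

def denoise_by_flipping(X, value_fn, num_labels=2):
    """Duplicate each training point once per candidate label, score every
    copy with a Shapley value function, and keep only copies with
    non-negative value. A point whose noisy label scores negatively while
    the opposite label scores positively effectively has its label flipped.
    """
    N = len(X)
    X_big = np.repeat(X, num_labels, axis=0)      # 2*N points for binary labels
    y_big = np.tile(np.arange(num_labels), N)     # candidate labels 0..C-1
    s = value_fn(X_big, y_big)                    # one Shapley value per copy
    keep = s >= 0
    return X_big[keep], y_big[keep]
```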
To sum up, the Shapley value quantifies the contribution of each training datum. Low Shapley value data capture outliers and corruptions that are not consistent with the distribution of other data points. We identify and correct these outliers and corruptions to provide a clean training set.

Experiments
Model Setup We use K = 10 for the KNN classifier. We use BERT (Devlin et al., 2019) as the text encoder φ of our classification model M_θ = M(φ, f) = f(φ(x)). Additional implementation details are included in the Appendix.

Model Comparisons and Ablations
We compare HERALD to several of its ablations (Table 2) and evaluate performance on the test set. We report balanced accuracy (bACC) and the F_β score with β = 2 (Baeza-Yates et al., 1999). (1) Heuristics uses the labeling heuristic functions with both Regexes and dialog acts to predict on the test set. (2) Heuristics (Regex only) uses the labeling heuristic functions with Regexes only.

Results Our first takeaway is that our labeling heuristics produce decent predictions and generalize across datasets. As shown in Table 2, the heuristics' predictions (Heuristics, 78.32%, 76.58%) are better than those of the BERT-based model trained on limited samples (BERT(dev), 73.98%, 74.94%) on both datasets, which shows that our labeling heuristics generalize to different corpora.
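Both reported metrics can be computed with scikit-learn; the toy labels below are illustrative, not taken from the paper.

```python
from sklearn.metrics import balanced_accuracy_score, fbeta_score

# Toy predictions: 1 = disengaged, 0 = engaged.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

# bACC: mean of per-class recall, robust to the class imbalance between
# disengaged and engaged turns.
bacc = balanced_accuracy_score(y_true, y_pred)
# F_beta with beta = 2 weighs recall of the disengaged class more heavily
# than precision.
f2 = fbeta_score(y_true, y_pred, beta=2)
```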
Our second takeaway is that learning from a large number of noisy labels works better than learning from a limited number of clean labels. As shown in Table 2, BERT fine-tuned on the auto-labeled training set (BERT(Auto), 80.55%, 78.76%) outperforms BERT fine-tuned on the clean but small development set (BERT(dev), 73.98%, 74.94%) by a large margin. In addition, we observe that the BERT model fine-tuned on the auto-labeled training data (BERT(Auto), 80.55%, 78.76%) generalizes beyond the labeling heuristics (Heuristics, 78.32%, 76.58%).
Our third takeaway is that using the expert-annotated development set for denoising is more efficient than using it as additional training data. After fine-tuning BERT on the weakly labeled training data (BERT(Auto), 80.55%, 78.76%), an additional fine-tuning step on the development set slightly improves the model's performance (BERT(Auto+dev), 80.73%, 80.46%). In contrast, using the development set for the Shapley denoising algorithm gives a significant performance gain (HERALD, 86.17%, 86.22%).

Annotation Cost The cost of annotating the development set for the Shapley algorithm is small. For the Gunrock Movie dataset, we used 67 annotated dialogs as the development set; for ConvAI2, we used 52. The annotation takes less than one hour in both cases, which is negligible compared to the cost of annotating all training data.
Heuristics Group Analysis We perform ablation studies to analyze the importance of each of the four heuristics groups in Table 1. As shown in Table 2, excluding heuristics group 4 leads to the most significant performance drop in both datasets (Heuristics w/o Group 4, 58.34%, 68.32%), indicating that "end with non-positive response" is the most prevalent form of user disengagement.
In addition, each heuristic group has different importance in different datasets. For example, dropping heuristic group 1 ("complain system responses") leads to only a marginal performance drop on the Gunrock Movie dataset but a significant drop on the ConvAI2 dataset. We also notice that heuristic group 4 ("End with non-positive responses") plays a more critical role in the Gunrock Movie dataset than in ConvAI2. This is likely due to the difference between ASR-based (Gunrock Movie) and text-based (ConvAI2) systems. When asked an open-ended question in an ASR-based system, users have less time to think and are more likely to reply with responses such as "I'm not sure" or "let me think", while in text-based systems (ConvAI2), users have more time to think and formulate their responses. Hence, the responses covered by heuristic group 4 occur more often in Gunrock Movie than in ConvAI2.

Generalizability of Heuristic Functions
The results show that our heuristic functions generalize to both ASR-based and text-based systems. As indicated in Table 2, our Regexes reach a decent accuracy of 62.81% and 72.04% on the expert-annotated test sets of the Gunrock Movie and ConvAI2 datasets, respectively, and thus can serve as a relatively reliable source for auto-labeling. In addition, although the dialog act model (MIDAS) was initially designed for ASR-based systems and thus performs better on the Gunrock Movie data, it should generalize to other ASR-based systems, as the six selected dialog acts are general and independent of topics. Therefore, the combination of dialog acts and Regexes should be sufficient to be applied to various corpora.

Figure 3: An example dialog turn from the Gunrock Movie dataset with an incorrect auto label "non-disengaged" identified by data Shapley. In this case, the user actually says "I don't wanna talk about movies anymore," but an ASR error occurs, and thus the labeling heuristics fail to capture this dialog turn.

Figure 4: An example dialog turn from the Gunrock Movie dataset that is incorrectly auto-labeled as "disengaged" because the labeling heuristics see the negative word "disagree". This data point is also identified and corrected by data Shapley.

Shapley Value Analysis
We also present an analysis of how Shapley denoising works, as shown in Figure 2. We examine the Shapley value of each training datum in Stage 2. We first show two example dialog turns from the Gunrock Movie dataset with negative Shapley values, in Figure 3 and Figure 4. In Figure 3, the dialog turn is incorrectly auto-labeled as "non-disengaged" because an ASR error occurs: the user utterance "I don't wanna talk about movies anymore" is transcribed as "I wanna talk about movies anymore". In Figure 4, the user says, "Oh I disagree. I think the movie was fantastic!"; the labeling heuristics see the negative word "disagree" and auto-label the turn as "disengaged". Both data points receive negative Shapley values and are corrected in Stage 2.
Next, we present a quantitative analysis of the Shapley value. We remove data points one by one, starting from the least valuable (lowest Shapley values) to the most valuable (highest Shapley values). After each removal, we build new KNN classifier models on the remaining dialog turns and labels and evaluate them on the expert-annotated test set. As shown in Figure 2, removing training data with low Shapley values increases performance up to a certain point before convergence, for all choices of K. We observe a similar trend when re-training a model on the remaining data. In contrast, removing data randomly or starting from high Shapley values decreases performance on the test set ("Random" and "Retain-Hurtful" in Figure 2). This shows that low Shapley value data effectively capture outliers and corruptions, which further justifies our design choice of denoising with the Shapley value.
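The removal experiment can be sketched as follows, using a simple majority-vote KNN in feature space; the function names and the synthetic setup in the usage note are assumptions, not the paper's code.

```python
import numpy as np
from collections import Counter

def knn_accuracy(X_tr, y_tr, X_te, y_te, K=3):
    """Accuracy of a majority-vote KNN classifier in feature space."""
    correct = 0
    for x, y in zip(X_te, y_te):
        idx = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:K]
        pred = Counter(y_tr[idx]).most_common(1)[0][0]
        correct += int(pred == y)
    return correct / len(y_te)

def removal_curve(X_tr, y_tr, shapley, X_te, y_te, K=3):
    """Remove points from lowest to highest Shapley value and record the
    test accuracy of a KNN re-fit on the remainder after each removal."""
    order = np.argsort(shapley)              # ascending: least valuable first
    accs = []
    for n_removed in range(len(y_tr) - K):   # keep at least K points
        keep = order[n_removed:]
        accs.append(knn_accuracy(X_tr[keep], y_tr[keep], X_te, y_te, K))
    return accs
```

On synthetic data where mislabeled points are assigned the lowest Shapley values, the curve rises as those points are removed, mirroring the trend in Figure 2.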

Alternative Data Valuation Methods
We also explored alternatives to data Shapley, such as influence functions (Koh and Liang, 2017) and TracIn (Pruthi et al., 2020): on Gunrock Movie, influence functions and TracIn achieve 82.96% and 83.15% accuracy, respectively. Both methods outperform BERT(Auto+dev) (80.73%) significantly but perform slightly worse than HERALD (86.17%). Overall, the results show that our data annotation workflow also works well with other data valuation methods.

Error Analysis

Figure 5 shows an error example of HERALD, where both the labeling heuristics and the Shapley algorithm fail to identify the turn as low engagement. In this example, the chatbot asks whether the user is interested in movies, but the user does not directly answer the question. Instead, the user says "I have a question for you social bot", indicating that the user does not like the current topic and wants to talk about something else. HERALD fails to identify this dialog turn as low engagement, partly because the Regexes in the "request topic change" heuristic rule do not cover this example. One way to fix this error is to upgrade the Regexes. A more general solution is to consider the chatbot's expectations about user responses conditioned on its question: if the chatbot receives an "unexpected" user response, the user is probably not interested in discussing the current topic.

Conclusion
The ultimate chatbot evaluation metric should be user-centric, as chatbots exist to provide humans with enjoyable experiences. Previously, detecting user disengagement typically required annotating many dialog samples for each individual dataset. We propose HERALD, a two-stage pipeline that automatically labels and denoises training data while building a user disengagement detector. Our experiments show that HERALD significantly reduces the annotation cost for a new corpus. HERALD's disengagement detection results correlate highly with expert judgments of user disengagement in both datasets (86.17% bACC on Gunrock Movie, 86.22% on ConvAI2).

Additional Shapley Value Analysis
We present additional analysis of how Shapley denoising works, as shown in Figure 6, with experiments on both the Gunrock Movie and ConvAI2 datasets. Figure 6 presents a quantitative analysis of the Shapley value. According to the Shapley value, we remove data points one by one, starting from the least valuable to the most valuable. After each removal, we build new KNN classifier models on the remaining dialog turns and labels and evaluate them on the expert-annotated test set. As shown in Figure 6, removing training data with low Shapley values increases performance up to a certain point before convergence, for all choices of K. We observe a similar trend when re-training a model on the remaining data. In contrast, removing data randomly, or from the most valuable to the least valuable, decreases performance on the test set. This shows that low Shapley value data effectively capture outliers and corruptions, which further justifies our design choice of denoising with the Shapley value.

A.3 Additional Dialog Examples
We show additional dialog examples. Figure 7 shows a full dialog example from the ConvAI2 dataset. Figure 8 shows a full dialog example from the Gunrock Movie dataset.