EnDex: Evaluation of Dialogue Engagingness at Scale

We propose EnDex, the first human-reaction-based model to evaluate dialogue engagingness. EnDex is trained on the 80k-sample Reddit-based Engagement Dataset (RED), curated using a novel distant-supervision framework. Engagingness is a key measure that captures the high-level quality of AI dialogue systems and closely reflects actual user experience. However, data shortage, together with the abstract and broad definition of engagingness, makes it challenging to develop an automatic metric. Our work departs from mainstream approaches that use synthetic negative examples to train binary classifiers and instead proposes a solution using distant supervision from human-reaction feedback. To support the soundness of our EnDex metric, we offer a theoretical foundation for engagement, an extensive ablation study, and empirical evidence of high correlation on five engagingness-related datasets. We will release code, an off-the-shelf EnDex model, and a large-scale dataset upon paper publication to facilitate future research.


Introduction
Many modern generative language models are trained to maximize a likelihood objective, but this paradigm tends to assign high probability to generic responses (Li et al., 2016), such as "I don't know." Prior research has established that people prefer to converse with interesting, creative, and informative agents (See et al., 2019), all concepts broadly related to the notion of engagingness. Furthermore, engagingness is recognized as a key evaluation metric for the quality of dialogue systems (Zhang et al., 2018; Ghazarian et al., 2020). For example, FAIR's ParlAI (Miller et al., 2017) incorporated engagingness as the default testing metric in the Blenderbot system (Roller et al., 2021); dialogue data challenges like ConvAI2 (Dinan et al., 2019), the Amazon Alexa Prize, and ensemble metrics like FED (Mehri and Eskenazi, 2020) all measure engagingness to benchmark dialogue quality.
Figure 1: Example of an online post with scores for emotional engagement (EE), attentional engagement (AE), and behavioral engagement (BE) in blue to represent the three dimensions of human engagement; reply engagement (RE) in red; and the aggregated EnDex score in green. We apply a z-score to the EnDex score and pick a hyper-parameter threshold to cluster posts into positive and negative samples.

However, the current evaluation of engagingness still primarily relies on expensive human annotation rather than off-the-shelf automatic tools, due to several theoretical and technical challenges. Firstly, unlike more well-characterized properties such as fluency, the definition of engagingness is significantly more abstract and multi-dimensional (See et al., 2019), requiring well-tuned quality metrics for each sub-dimension to aggregate into a final score. Secondly, what qualifies as engaging is open-ended, and many different answers may embody the concept (Ghazarian et al., 2020); therefore, reference-based metrics requiring a unique ground truth, such as BLEURT (Sellam et al., 2020) and BERTScore (Zhang et al., 2020), do not apply. Thirdly, there is an acute shortage of large-scale, high-quality data annotated for engagingness. Ghazarian et al. (2020) jump-started efforts to automatically measure dialogue engagement by fine-tuning a BERT-based model (Devlin et al., 2019) on the ConvAI2 and DailyDialog (Li et al., 2017) datasets to predict an engagingness score. However, fine-tuning on small supervised datasets can easily lead to overfitting and generalization problems. Another high-performing metric on engagingness, USL-H (Phy et al., 2020), assumes a positive set and generates synthetic negative samples to train the model. However, credible positive samples are not always available, and synthetic negative samples may not be challenging enough to further advance classifier performance.
In light of the above challenges, we propose EnDex, a novel metric trained with distantly supervised data to predict turn-level dialogue engagingness (Figure 1). EnDex requires neither human annotations nor direct disentanglement of engagingness. Instead, we leverage observed user reactions to posts as distant signals to model engagingness, which marks a departure from the mainstream approach of training on synthetic negative samples (Lan et al., 2020; Ghazarian et al., 2022; Tao et al., 2018; Sato et al., 2020). EnDex trains on real conversations sourced from Reddit that are automatically annotated as positive and negative examples with our framework. The novel dataset, named RED (Reddit Engagement Dataset), contains over 80k labelled samples. The EnDex framework derives its theoretical underpinning from relevant HCI work and shows superior performance on five benchmark datasets.

EnDex Metric
Engagingness is not only a linguistic concept useful for dialogue systems; it also manifests in multiple modalities and is extensively leveraged to benchmark gaming and online-learning experiences (Silpasuwanchai et al., 2016; Chen et al., 2005; Mcmahan, 2003; Schoenau-Fog, 2011). Our work is inspired by the HCI study of Human Engagement (Ma, 2018), which decomposes engagingness into three major dimensions: attentional engagement (e.g., clicks and scrolls), behavioral engagement (e.g., facial expressions), and emotional engagement (e.g., heart rate).
The EnDex metric follows the same intuition: we can infer the engagingness of a text by analyzing human reactions to it, for which there is abundant data on social media. The EnDex metric learns from our distantly supervised RED dataset, which measures dialogue engagement along four dimensions as shown in Figure 1; three dimensions correspond to the original Human Engagement definition, and one distinct Reply Engagement dimension is added for the dialogue-specific task.

Reddit Engagement Dataset (RED)
We curate the Reddit Engagement Dataset (RED), a distant-supervision set with 80k single-turn conversations. We source RED from Reddit, sampling from 43 popular subreddits and processing a total of 5 million posts, filtering out posts that were non-conversational, toxic, or whose popularity could not be ascertained; the resulting data distribution of RED is shown in Table 1. The following sections explain the procedure to automatically annotate EnDex scores and cluster samples into positive and negative sets.
We also curated a RED test set with 150 human-annotated samples obtained from a split disjoint from RED. The inter-annotator agreement is 0.34 Fleiss' kappa, indicating fair agreement, which reflects the challenge of determining engagingness.

Distantly-Supervised Engagingness Scores
We use distant supervision to assign each sample in RED an EnDex score, which is the aggregate of four engagingness dimensions. Section 2.2 discusses the intuition for each engagingness dimension; Section 2.3 explains how to adjust raw scores by thread popularity; Section 2.4 lays out the formula to normalize and aggregate sub-dimensions into the overall engagingness score; Section 2.5 explains sampling with z-scores to convert the task into binary classification.
• Emotional Engagement (EE): Emotional connection is a key sign of human engagement (Savin-Baden et al., 2014); we model EE using a multi-class emotion classifier (Demszky et al., 2020) on post replies. If a post receives positive, emotional replies, it is engaging; negative or neutral replies indicate non-engagement.
• Attentional Engagement (AE): More user time spent indicates higher engagement (Attfield et al., 2011). We model the AE of a post by examining whether it has edited replies and the information specificity of the replies.
• Behavioral Engagement (BE): Human behavioral features closely correlate with engagement state (Attfield et al., 2011), and we model BE by examining Reddit post scores, adjusted for popularity.
• Reply Engagement (RE): Following the definition of Ghazarian et al. (2020), if a post is very likely to be continued by following threads, it is considered engaging; reply counts are also popularity-adjusted.
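As a minimal sketch of the EE signal above, assuming a classifier that returns per-reply emotion probabilities (the positive-category list here is illustrative, not the exact GoEmotions subset used):

```python
# Sketch of the Emotional Engagement (EE) signal: aggregate the probability
# mass an emotion classifier assigns to positive categories across a post's
# replies. The category names below are illustrative placeholders.
POSITIVE_EMOTIONS = {"joy", "admiration", "gratitude", "excitement", "love"}

def emotional_engagement(reply_emotion_probs):
    """reply_emotion_probs: list of {emotion: probability} dicts, one per reply."""
    if not reply_emotion_probs:
        return 0.0
    per_reply = [
        sum(p for emo, p in probs.items() if emo in POSITIVE_EMOTIONS)
        for probs in reply_emotion_probs
    ]
    # Average positive-emotion mass over all replies to the post.
    return sum(per_reply) / len(per_reply)

replies = [
    {"joy": 0.6, "neutral": 0.3, "anger": 0.1},
    {"neutral": 0.9, "sadness": 0.1},
]
score = emotional_engagement(replies)  # (0.6 + 0.0) / 2 = 0.3
```

Posts whose replies carry mostly neutral or negative emotion thus receive a low EE score.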

Adjustment for Popularity
Raw scores for Behavioral Engagement (upvotes) and Reply Engagement (reply counts) are heavily influenced by the popularity of the thread in which the post appears. A non-engaging post may receive high user interaction simply because it receives a lot of exposure; on the flip side, a very engaging post may receive zero user interaction simply because it is rarely seen. To mitigate this imbalanced-exposure problem, we calculate a popularity value for each thread and adjust post scores by the popularity value of the thread each post resides in.
Popularity Value (PV): The PV of a post is given by the amount of exposure its parent post attracts. Let the target post be θ and its parent σ; Reply obtains the reply count of a post, and Upvote obtains its upvotes. The PV is defined in Equation (1), where the coefficient 2 is adopted to give equal weight to replies and upvotes; the popularity-value-adjusted RE score is given by PVRE in Equation (6), where M_pv and M_re are the medians of popularity value and reply counts in the entire dataset. Only popularity-adjusted scores are used for calculating the EnDex score.
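Equations (1) and (6) do not survive in this copy of the text, so the following sketch shows one plausible reading of the adjustment; the exact coefficients and functional form are assumptions, not the paper's formulas:

```python
from statistics import median

def popularity_value(parent_reply_count, parent_upvotes):
    # One plausible reading of Eq. (1): the parent post's exposure, with
    # the coefficient 2 on replies so that replies and (typically more
    # numerous) upvotes carry comparable weight. Assumed form.
    return 2 * parent_reply_count + parent_upvotes

def pv_adjusted_re(reply_count, pv, m_pv, m_re):
    # One plausible reading of Eq. (6): rescale a post's reply count by how
    # its thread's popularity compares to the dataset median (m_pv),
    # relative to the median reply count (m_re). Assumed form.
    return (reply_count / m_re) * (m_pv / pv)

# Toy dataset: (reply_count, parent_reply_count, parent_upvotes) per post.
posts = [(3, 10, 100), (3, 100, 1000), (1, 5, 50)]
pvs = [popularity_value(pr, pu) for _, pr, pu in posts]
m_pv = median(pvs)                        # median popularity value
m_re = median(r for r, _, _ in posts)     # median reply count
scores = [pv_adjusted_re(r, pv, m_pv, m_re)
          for (r, _, _), pv in zip(posts, pvs)]
# The same reply count earned in a less-exposed thread scores higher.
```

Under this reading, three replies in a quiet thread outscore three replies in a heavily trafficked one, which is exactly the correction the text motivates.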

Monotone Submodular Normalization
The final EnDex score is essentially a weighted sum of the four sub-dimension scores; an important nuance is the use of submodular normalization (shown in Eq. 8) for three dimensions to bring raw scores to the 0-1 scale. We observe that a unit increase in raw score leads to a diminishing positive effect on engagingness. For example, a sentence with 100 replies should be more engaging than one with one reply, but not 99 times more; thus, we normalize engagingness scores with the monotone submodular function f(x) = x / (x + α).
N is the normalized score for sample x, x_i is x's raw score on dimension i, where i ∈ {RE, BE, AE}; α_i is the median of the i-th dimension; w_i is the weight for dimension i; and w_EE is the weight for the EE dimension. The weights can be tuned for your own usage of RED.
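A minimal sketch of the normalization and weighted sum; only f(x) = x/(x + α) is taken from the text, while the weight and median values below are illustrative placeholders:

```python
def submodular_norm(x, alpha):
    # Monotone submodular f(x) = x / (x + alpha): maps raw counts into
    # (0, 1) with diminishing returns; alpha is the dataset median of the
    # dimension, so a median-valued sample normalizes to 0.5.
    return x / (x + alpha)

def endex_score(raw, alphas, weights, w_ee, ee_score):
    # raw/alphas/weights are dicts keyed by dimension in {RE, BE, AE};
    # EE is already a probability in [0, 1], so it skips normalization.
    total = sum(weights[d] * submodular_norm(raw[d], alphas[d]) for d in raw)
    return total + w_ee * ee_score

raw = {"RE": 100, "BE": 40, "AE": 12}
alphas = {"RE": 5, "BE": 10, "AE": 8}        # toy dataset medians
weights = {"RE": 0.3, "BE": 0.3, "AE": 0.2}  # illustrative weights
score = endex_score(raw, alphas, weights, w_ee=0.2, ee_score=0.5)

# Diminishing returns: 100 replies scores higher than 1 reply,
# but nowhere near 99x higher.
ratio = submodular_norm(100, 5) / submodular_norm(1, 5)
```

With α = 5, the 100-reply sample is only about 5.7 times the normalized score of the 1-reply sample, matching the diminishing-returns intuition above.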

Clustering with z-score
Essentially, engagingness prediction is a classification task, and we want to prepare the dataset for binary classification. We use z-scores on the EnDex score to easily sample and cluster the data according to standard deviations from the mean. A confidence threshold κ (ours is 1) needs to be picked: we regard samples that fall within κ standard deviations of the mean as uncertain, and they are thus discarded. We then cluster positive and negative samples using Equation (9).
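A minimal sketch of this clustering step, using the population statistics of the score distribution:

```python
from statistics import mean, pstdev

def cluster_by_zscore(scores, kappa=1.0):
    # Eq. (9): z-score each EnDex score; samples more than kappa standard
    # deviations above the mean become positives (engaging), those more
    # than kappa below become negatives, and everything in between is
    # discarded as uncertain.
    mu, sigma = mean(scores), pstdev(scores)
    positives, negatives = [], []
    for i, s in enumerate(scores):
        z = (s - mu) / sigma
        if z > kappa:
            positives.append(i)
        elif z < -kappa:
            negatives.append(i)
    return positives, negatives

scores = [0.1, 0.2, 0.5, 0.5, 0.5, 0.8, 0.9]
pos, neg = cluster_by_zscore(scores, kappa=1.0)
# The three middle samples near the mean are dropped as uncertain.
```

Raising κ trades dataset size for label confidence: a higher threshold keeps only samples far from the mean.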
The EnDex metric is then trained as a binary classification task by finetuning a RoBERTa-large model (Liu et al., 2019) on turn-level RED data.

Experiment Set-up
We test the performance of the EnDex metric on five gold evaluation sets that have turn-level labels. Among them, BETTER (Ghazarian et al., 2019) and PREDENG-600 (Ghazarian et al., 2020) are annotated specifically for engagingness with high annotator agreement. BETTER samples are taken from human conversation, while half of PREDENG-600 consists of chatbot generations. FED (Mehri and Eskenazi, 2020) annotates dialogue along nine different dimensions, and we use its engagingness scores as target labels. GRADE (Huang et al., 2020) contains quality annotations for dialogue coherence, and we include it to test whether our model also has good zero-shot performance on related tasks. Lastly, our own RED-TEST is sourced from Reddit and contains discussions on various topics. A table of evaluation set statistics is provided in the Appendix (Table 3).

Ablation Study on Engaging Dimensions
To test the robustness of the four engagingness dimensions (Section 2.2) of EnDex, we conducted an ablation study, training models using signals from only one of the four dimensions at a time. We hypothesize that a dimension with a high positive contribution to the final results should yield successful clustering of engaging and non-engaging samples by itself; so, if we train a model on data clustered by that dimension alone, we should still obtain a well-performing model.
We train five different models on different subsets of RED. All datasets include the same 40k negative (i.e., non-engaging) samples drawn according to our overall engagement score. However, the other 40k positive (i.e., engaging) samples are selected according to a particular dimension score (EE, AE, BE, or RE), except for EnDex, which is our aggregate-score model. Figure 2 shows that all four dimensions correlate with engagingness to some degree, but RE, AE, and EE are especially effective. We also observe a synergistic effect of training on a composite score rather than on any one dimension individually. The experiment highlights and corroborates the multi-dimensionality of engagingness previously reported in the literature (See et al., 2019). Overall, having an aggregate score is crucial for successful distantly supervised annotation of negative and positive examples.

Comparison with Related Works
We compare our EnDex metric, and the heuristics-augmented EnDex+NS metric, with five baselines. Three baselines are rule-based: Random; Information Specificity (See et al., 2019), which counts the number of non-stopword tokens; and Inquisitiveness (Ghandeharioun et al., 2019), which examines question-asking ability. We include them because on some datasets, rule-based systems can work surprisingly well (Yeh et al., 2021).
We selected USL-H (Phy et al., 2020) as a baseline because it is the top-performing metric on the PREDENG-600 and FED engagingness evaluation sets (Yeh et al., 2021). USL-H is designed to measure high-level dialogue quality, including understandability, sensibleness, and likability; it trains three BERT-based (Devlin et al., 2019) classifiers, one per component, and uses a composite score named USL-H for overall assessment. PRED_EN (Ghazarian et al., 2020) uses BERT embeddings plus an MLP layer, trained on the ConvAI2 dataset (Dinan et al., 2019), to predict engagement scores. PRED_EN (FT+DD) further fine-tunes the original PRED_EN metric on the DailyDialog dataset for better results.
Our model has two versions: EnDex is trained solely on human-reaction-based data; +NS means the non-engaging sample set is mixed with some rule-based negative samples, created by random insertion, random deletion, copying, and generic replies. The experiments in Table 2 demonstrate that our model achieves strong performance on four engagingness-related datasets and good correlation with one coherence dataset (GRADE). EnDex surpasses PRED_EN and USL-H by a large margin on the two real human-conversation datasets, BETTER and RED-TEST. USL-H still leads on PREDENG-600, with EnDex+NS's best model a close second. Yeh et al. (2021) show that achieving a high score on FED-ENG is challenging, with none of the 12 tested metrics surpassing 0.3 Spearman correlation. A strong rule-based question-detection algorithm surprisingly claims the highest result, with EnDex a close second.
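The four rule-based perturbations behind +NS can be sketched as follows; the generic-reply list and perturbation details here are illustrative, not the paper's exact implementation:

```python
import random

# Illustrative generic replies of the "I don't know." flavor.
GENERIC_REPLIES = ["I don't know.", "Yeah.", "That's nice.", "Ok."]

def make_synthetic_negative(reply, context, rng):
    # Apply one of four perturbations to an engaging reply: random token
    # insertion, random token deletion, copying the context verbatim,
    # or substituting a generic reply.
    op = rng.choice(["insert", "delete", "copy", "generic"])
    tokens = reply.split()
    if op == "insert" and tokens:
        pos = rng.randrange(len(tokens) + 1)
        tokens.insert(pos, rng.choice(tokens))  # duplicate a random token
        return " ".join(tokens)
    if op == "delete" and len(tokens) > 1:
        tokens.pop(rng.randrange(len(tokens)))
        return " ".join(tokens)
    if op == "copy":
        return context
    return rng.choice(GENERIC_REPLIES)

rng = random.Random(0)  # seeded for reproducibility
negatives = [
    make_synthetic_negative("I loved the soundtrack of that film",
                            "What did you think of the movie?", rng)
    for _ in range(4)
]
```

Mixing such perturbed samples into the negative set gives the classifier easy contrastive examples, complementing the harder human-reaction negatives.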
We find that training solely with human-reaction distant-supervision signals suffices to build competitive models on par with, or even surpassing, mainstream metrics, and it shows better generalization capability in new domains, which echoes recent success in modeling human preferences via upvotes on Reddit (Gao et al., 2020).

Conclusion
This paper proposes the first human-reaction-based model, EnDex, to evaluate dialogue engagingness, and curates the 80k-sample Reddit Engagement Dataset (RED) using a novel distant-supervision framework. The success of EnDex demonstrates the validity of training automatic metrics with human-reaction signals, offering a strong complement to the synthetic negative sampling approach. We also release an off-the-shelf EnDex model and a large-scale dataset to facilitate future research.

Limitation
One limitation is that we only curated data for turn-level dialogue. Multi-turn dialogues could also be useful, but it was computationally infeasible to interactively query Reddit for entire threads of conversation. Future work can explore this direction to produce dialogue-level and system-level engagingness metrics.
We also have not fully explored our model's performance on non-dialogue domains, such as story or creative generation. The training data distribution from the Reddit corpus is diverse enough that it could potentially achieve good performance in non-dialogue settings. A valuable direction for future work is to adapt our method to more general engagingness, or to other evaluation metrics for open-domain generation.

Ethics
A caveat of framing our approach around human attention is that not all texts attracting high attention are good and ethical. Since being engaging often carries a positive connotation, we made a deliberate design decision to mitigate forms of negative engagement in our metric. For example, we assign lower scores to samples flagged by Reddit as controversial, and our behavioral engagement dimension subtracts downvotes from upvotes to punish negative, biased (Liu et al., 2021), and aggressive comments. Moreover, we implemented our emotional engagement algorithm to reward posts with positive emotional replies and punish posts that prompt negative emotions. Future work may account for the darker aspects of engagingness in our framework and improve the EnDex metric to differentiate between positive and negative engagement.
Human annotations for RED-TEST were obtained via Amazon Mechanical Turk. We filtered out toxic samples to reduce the likelihood of offensive content and paid $0.30 USD per instance, for an expected hourly wage of $20 USD.

We apply a cutoff to sample high-probability engaging and non-engaging samples, and arrive at the RED dataset.

A.2 Model training details
Our RED-TEST set contains 300 human-labeled examples. The train/validation split during training is 0.8/0.2.
We used 4 Nvidia A6000 GPUs for training and 1 Nvidia A6000 GPU for inference. The average training runtime is 2 minutes per epoch, and inference takes seconds, negligible for the test set. Assuming a per-GPU power draw of 245 W, the estimated energy cost per model is 245 W × 4 GPUs × 60 s = 58,800 J.
We trained our model for 2 epochs, saving only the best checkpoint, with a learning rate of 5e-5 and no extensive hyperparameter search.
We used specificity and question examination inspired by See et al. (2019); USL-H (Phy et al., 2020) and PRED_EN are taken from a GitHub repository and modified to use a local bert-base-uncased model, since the original 'bert-as-service' code no longer functions.
The formula for calculating each dimension is given in the following:
• Attentional Engagement: AE is calculated from the maximum information specificity, i.e., the maximum number of non-stopword tokens among a post's replies, and from whether its child posts are edited; t is the maximum reply specificity, and e is the number of edited replies:
AE(x) = t + 10e (7)
• Emotional Engagement: The EE score is the aggregate probability over all positive emotion categories, produced by the GoEmotions classifier (Demszky et al., 2020).
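A minimal sketch of Eq. (7), with a tiny illustrative stopword list standing in for a full one:

```python
# Illustrative stopword list; a real implementation would use a full one.
STOPWORDS = {"the", "a", "an", "is", "it", "to", "and", "of", "i"}

def attentional_engagement(replies, num_edited):
    # Eq. (7): AE(x) = t + 10e, where t is the maximum reply specificity
    # (non-stopword token count over all replies) and e is the number of
    # edited replies.
    t = max((sum(tok.lower() not in STOPWORDS for tok in r.split())
             for r in replies), default=0)
    return t + 10 * num_edited

replies = ["the plot twist in act two completely changed my reading",
           "it is good"]
ae = attentional_engagement(replies, num_edited=1)  # t = 9, so 9 + 10 = 19
```

The factor of 10 makes a single edited reply count as heavily as a fairly specific reply, reflecting that editing a reply is a strong signal of invested attention.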

A.3 Submodular Normalization and z-score Clustering
After obtaining the sub-dimension scores, we aggregate them into a single normalized EnDex score, and lastly cluster samples into positive and negative sets to train a binary classifier. The formulas are listed in the following. N is the normalized score for sample x, x_i is x's raw score on dimension i, where i ∈ {RE, BE, AE}; α_i is the median of the i-th dimension; w_i is the weight for dimension i; and w_EE is the weight for the EE dimension. The weights can be tuned for your own usage of RED. A confidence threshold κ (ours is 1) needs to be picked: we regard samples that fall within κ standard deviations of the mean as uncertain, and they are thus discarded. We then cluster positive and negative samples using Equation (9):

Polarity(x) = 1 if z_score(x) > κ; 0 if z_score(x) < −κ. (9)

A.4 Annotation data and test data
We performed annotation on Amazon Mechanical Turk and selected annotators based in the United States, with a restriction to annotators with a 98% approval rate. We give four examples and clear instructions for the task. A screenshot of our annotation interface is provided below. Table 3 gives a summary of the evaluation datasets we used.

Figure 2 :
Figure 2: Ablation study of our four engagement dimensions. The EnDex model was trained on our aggregate engagingness score, while RE, EE, AE, and BE indicate models trained only on scores reflecting that particular dimension.

Figure 3 :
Figure 3: A screenshot of the task description of our Amazon MTurk questionnaire. We prepared instructions, demonstrations, and a proper warning about offensive content.

Figure 4 :
Figure 4: A screenshot of the labeling area of our Amazon MTurk questionnaire. Each pair is labelled by three annotators.

Table 2 :
The correlation between engagement scores and ground-truth human judgment. Best scores are emboldened and second-best are underlined. We train EnDex and EnDex+NS 10 times and report the mean, with * and ** indicating a stdev < 0.05 and < 0.03, respectively. ENDEX-BEST is the best score observed over the 10 runs. Compared to existing metrics, the EnDex framework achieves SOTA correlation with human judgment on engagingness, leading by far on our newly proposed RED-TEST dataset, which contains more complex and longer texts than chitchat.