Proxy Indicators for the Quality of Open-domain Dialogues

The automatic evaluation of open-domain dialogues remains a largely unsolved challenge. Despite the abundance of work done in the field, human judges still have to evaluate dialogue quality. As a consequence, performing such evaluations at scale is usually expensive. This work investigates using a deep-learning model trained on the General Language Understanding Evaluation (GLUE) benchmark to serve as a quality indicator for open-domain dialogues. The aim is to use the various GLUE tasks as different perspectives for judging the quality of a conversation, thus reducing the need for additional training data or responses that serve as quality references. By design, the method can infer various quality metrics and derive a component-based overall score. We achieve statistically significant correlation coefficients of up to 0.7.


Introduction
Recently, dialogue systems powered by machine learning have gathered much attention from industry and academia alike (Chen et al., 2017). These systems have various applications, such as personal speech assistants, customer service, technical support, and training and education. In most cases, these systems are task-specific and help with tasks like booking a restaurant. Nevertheless, they can still benefit from open-domain conversational skills, such as the ability to chit-chat to enable natural dialogues, rather than repeating the input utterance like a parrot.
Nowadays, researchers in this field have to rely on human annotators to evaluate the quality of a conversation (Dinan et al., 2019; Logacheva et al., 2018; Yoshino et al., 2019), which can be very costly in terms of resources. Thus, the research and development of these systems could benefit significantly from an automated approach to evaluating conversations.
Research in the related fields of text summarization and machine translation has developed automated measures for evaluation. Some notable examples are, for the former, ROUGE (Lin, 2004) and, for the latter, BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005). These are also adopted by works researching dialogue systems (Ritter et al., 2011; Serban et al., 2016; Yoshino et al., 2019). However, Liu et al. (2016) demonstrated that these metrics are not suitable for replacing human evaluators. Also, Sai et al. (2020) reported that an overlap-based approach still underperforms even when using multiple instead of single references. Thus, more advanced techniques are needed that consider the context and semantics of a dialogue.
Human annotators can distinguish bad from good dialogues intuitively, not because they have necessarily been taught to do so. Instead, people have a notion of what a fluent text is, or of when a response is relevant (or not) to a previous utterance. Thus, this work's primary goal is to investigate a similar approach that mimics component-wise human intuition 1 .
We investigate whether natural language processing (NLP) tasks can serve as proxy indicators for a conversation's quality. For that purpose, we use fine-tuned BERT (Devlin et al., 2019) models trained on the GLUE benchmark (Wang et al., 2019). GLUE provides a comprehensive evaluation of general language understanding. We demonstrate that a few of the tasks exhibit a limited potential to serve as proxy indicators; the rest show negative results.


Related Work

Lowe et al. (2017) propose an approach that approximates human judgment using scored dialogues together with the context, a reference response, and the utterance generated by a dialogue system. However, the approach is hard to scale since reference responses and human annotation scores are still necessary. In another work, Tao et al. (2018) propose a method consisting of two parts. The first measures similarity to a reference response using word embedding vector pooling. The second is a neural network that evaluates the relatedness of a reply given the context. The first component also uses reference responses, which are hard to acquire. Moreover, both approaches lack interpretability: their output scores cannot be broken down into different dialogue quality features, such as coherence or fluency.
More recently, Ghandeharioun et al. (2019) propose a framework that uses self-play and two NLP tasks as an additional source of knowledge to evaluate dialogues in a multi-turn scenario. They perform an ablation study using sentiment and natural language inference as proxy supervision to see whether their system can better approximate human judgment. Their work shows that dialogue systems can benefit from using them. Also, Welleck et al. (2019) frame the dialogue consistency issue as a natural language inference problem and propose the DialogueNLI dataset. Its purpose is to benchmark a model's ability to select relevant utterances relative to a given context. Finally, Nedelchev et al. (2020b) propose treating dialogue evaluation as an anomaly detection problem. Their results were negative and suggested that the approach suffers from insufficient training data.
Most recently, Nedelchev et al. (2020a), Sai et al. (2020), and Mehri et al. (2020) propose using language models as indicators of quality. None of their approaches requires references or supervision. However, the proposed methods do not separate the different quality aspects and only indicate a dialogue's overall quality.

General Language Understanding Evaluation
This section briefly introduces the General Language Understanding Evaluation benchmark (Wang et al., 2019), its sub-tasks, and their relevance to this work. GLUE comprises two categories of tasks: single- and pairwise-sentence tasks. Each provides annotated data for training models to solve a particular natural language understanding problem. Since these tasks were not designed with dialogue evaluation in mind, the section also discusses how each of them could relate to it. The tasks are presented as follows:

Single-Sentence Tasks
Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2018) comprises English sentences annotated for grammatical acceptability as acceptable (one) or unacceptable (zero). Intuitively, linguistic acceptability relates to how fluent and understandable a single utterance is.
Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) contains text excerpts from movie reviews whose sentiment has been annotated by humans as positive (one) or negative (zero). Common sense would suggest that sentiment has no apparent relation to dialogue quality. Nonetheless, Ghandeharioun et al. (2019) perform an ablation study as part of their work to see whether knowledge distillation based on sentiment offers any benefit to evaluating a conversation. Their research shows that there can be an improvement depending on the neural network model and the target dataset. So, we investigate how it relates to annotator scoring in dialogue evaluation.

Pairwise-Sentence Tasks
The pairwise-sentence tasks consider a pair of utterances that appear sequentially in a dialogue.
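As an illustration, the sequential utterance pairs fed to the pairwise-sentence tasks can be formed as follows (a minimal sketch; the function name is ours):

```python
def sequential_pairs(dialogue):
    """Yield (utterance, next_utterance) pairs from a list of dialogue turns."""
    return list(zip(dialogue[:-1], dialogue[1:]))

dialogue = ["How was your day?", "Pretty good, thanks.", "Glad to hear it."]
print(sequential_pairs(dialogue))  # two consecutive (context, response) pairs
```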

Microsoft Research Paraphrase Corpus (MRPC)
(Dolan and Brockett, 2005) is a dataset of sentence pairs extracted from news media, where each pair is annotated as having the same meaning or not. Formally, it is a binary classification problem: a paraphrase is labeled positive, and non-equivalence negative. In the context of dialogues, a positive prediction for this task could imply that a response merely repeats the preceding utterance, while a partial score could suggest some relevance. The negative case has no straightforward interpretation.
Quora Question Pairs (QQP) 2 is a corpus of question pairs extracted from the community question-answering platform Quora. Similar to MRPC, the task is to flag a pair of questions as having the same semantics or not.
Semantic Textual Similarity Benchmark (STS-B) (Cer et al., 2017) is a dataset of paired media captions, news headlines, and other natural language sentences, with each pair given a similarity score from one to five by human annotators. Formally, this is a regression problem whose output ranges between one and five. Like the previous two tasks, it can provide insights into the relevance and coherence of a response to its preceding utterance by assessing their semantic similarity.
Question Natural Language Inference (QNLI) (Wang et al., 2019) is a re-adapted version of the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016). The original dataset contains question-paragraph pairs, where an excerpt of the paragraph answers the question. Wang et al. (2019) convert it such that the question is paired with each sentence of the context paragraph. Only the sentence containing the answer is labeled as textual entailment; the rest are not. The question acts as a hypothesis that may or may not entail the sentence. The task is treated as a relevance-ranking problem, where a question can be more relevant to one sentence than to others. Regarding dialogue quality, this task can assess a response's relevance more directly than MRPC, QQP, and STS-B.
Recognizing Textual Entailment (RTE) datasets (Wang et al., 2019) consist of a series of challenges: RTE1 (Dagan et al., 2005), RTE2 (Bar-Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009). Pairs of sentences sampled from news and Wikipedia articles are marked, similarly to QNLI, as textual entailment or no textual entailment 3 , a binary classification problem. Like QNLI, RTE can be used to determine the relevance of a response to an utterance. Unlike QNLI, however, it does so for general statements rather than just questions.
Multi-Genre Natural Language Inference Corpus (MNLI) (Williams et al., 2018) is a compilation of sentence pairs collected via crowdsourcing that have been annotated for textual entailment, similarly to QNLI and RTE. However, MNLI frames this as a three-class classification problem: textual entailment, contradiction, and neutrality. The task is not used in this paper due to the lack of a straightforward mapping from those three classes to an ordinal/continuous variable such as a dialogue quality score.

Winograd Schema Challenge (WNLI) (Levesque et al., 2012) targets reading comprehension: a system must understand a sentence containing a pronoun and then choose the pronoun's referent from a list of choices. Due to its nature, this task is not relevant to and not used in this work.

Dialogue Datasets
To evaluate the ability of a deep-learning model trained on GLUE to indicate the quality of dialogues, we use the English datasets (TopicalChat, PersonaChat) provided by Mehri et al. (2020). They train several dialogue system models and use different sequence generation techniques to generate responses for given dialogue contexts. The researchers then evaluate a total of 660 dialogue contexts and responses according to six criteria: Understandable, Natural, Maintains Context, Interesting, Uses Knowledge, and Overall Quality. For a complete description of these metrics and further details about the dataset, we refer the reader to the original work of Mehri et al. (2020).

BERT as a Proxy Indicator for Dialogue Quality
Since the GLUE benchmark targets general language understanding, we are interested in whether a model trained on it can indicate the quality of a dialogue. To conduct the investigation, we use BERT (Devlin et al., 2019) and its fine-tuned models on the GLUE benchmark (Wolf et al., 2019; Morris et al., 2020). We use the version with 110M parameters. For each investigated GLUE task, a separate copy of the whole model is trained to solve that specific problem. We did not train the models ourselves; inference is far less demanding and takes about 30 minutes on a laptop with an eighth-generation Intel i7 CPU. For encoding the text sequence, we use BERT, a pre-trained bidirectional transformer encoder language model. Its pre-training uses two unsupervised tasks: masked language modeling and next-sentence prediction. This way, it learns a contextualized semantic representation of the input text usable for downstream tasks. BERT can create a vector encoding for a whole sequence by always inserting a control token, [CLS], at the beginning. For pairwise-sentence tasks, e.g., next-sentence prediction, it uses an additional control token, [SEP], between the two sentences to distinguish them.
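The input assembly for the two cases can be sketched as follows (whitespace splitting stands in for BERT's WordPiece tokenizer, and the function names are ours):

```python
def build_single_input(utterance: str) -> list:
    # Single-sentence tasks: [CLS] + utterance tokens + closing [SEP]
    return ["[CLS]"] + utterance.split() + ["[SEP]"]

def build_pair_input(context: str, response: str) -> list:
    # Pairwise-sentence tasks: [SEP] separates the two segments
    return ["[CLS]"] + context.split() + ["[SEP]"] + response.split() + ["[SEP]"]

print(build_pair_input("how are you", "fine thanks"))
```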
When fine-tuned for a specific task, the pretrained language model weights are reused. In addition, a layer is added to act as a transformation from BERT's semantic representation to the space of the target variable, e.g., the classes of RTE or CoLA.

Scoring
For obtaining model predictions, the dialogue data is provided as input in three possible ways: 1. a single utterance, 2. a dialogue context and a response, or 3. facts related to a conversation and a response. Depending on the GLUE task, the model gives four different types of output scores:
Single-sentence classification provides softmax output for CoLA and SST-2. Given the contextualized semantic representation U of a single utterance from the dialogue, the probability that it is linguistically acceptable or carries positive sentiment is
p(c | U) = softmax(W U),
where W are the task-specific weights and c is the output class for the target task.
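A minimal numpy sketch of this scoring step, with a random vector standing in for BERT's [CLS] encoding and a random matrix standing in for the trained task head (neither is a real trained model):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
U = rng.normal(size=768)        # stand-in for BERT's 768-dim [CLS] encoding
W = rng.normal(size=(2, 768))   # stand-in task head: 2 classes (e.g., CoLA)

p = softmax(W @ U)              # p(c | U): class probabilities summing to 1
print(p)
```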
Pairwise text similarity outputs a similarity score, for the STS-B task, between a dialogue context C (or a fact F) and a target response R from the same dialogue, concatenated and jointly encoded by BERT as U:
s = W U,
where W are the weights specific to STS-B and U is the joint encoding of the context (or fact) and the target response.
Pairwise text classification is used for the three relevant tasks RTE, QQP, and MRPC. It functions in the same manner as single-sentence classification, with one difference: two sequences instead of one serve as input to the model. The dialogue context or fact and the target response are concatenated, with a special token inserted between them to signify that the input sequence has two components:
p(c | U) = softmax(W U), where U = BERT([CLS] C [SEP] R).
Pairwise ranking finds its application in the QNLI task. As in pairwise text similarity, the dialogue context C (or fact F) and the target response R from the same dialogue are encoded as one sequence U to calculate a relevance score. After model predictions are made on all utterances and all sequential pairs across all tasks, the outputs are rescaled between 0 and 1 for each GLUE task independently, as are the scores given by the human annotators.
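The per-task rescaling to [0, 1] is a plain min-max normalization; a sketch (array contents are illustrative):

```python
import numpy as np

def minmax_rescale(scores):
    """Rescale one task's raw outputs to [0, 1], independently of other tasks."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo)

print(minmax_rescale([1.0, 2.0, 5.0]))  # elements become 0.0, 0.25, 1.0
```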
Finally, similarly to Mehri et al. (2020), we train a linear regression that combines all the scores into one overall score:
q = w^T s + b,
where s is the vector of rescaled proxy-indicator scores, and w and b are the learned weights and bias.
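Fitting such a combination amounts to ordinary least squares; a numpy-only sketch on synthetic scores (the data and weights below are made up for illustration, not results from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.uniform(size=(100, 5))                 # 100 samples x 5 proxy-indicator scores
true_w = np.array([0.5, 0.1, 0.0, 0.3, 0.1])
y = S @ true_w + 0.2                           # synthetic "overall" annotator scores

X = np.hstack([S, np.ones((len(S), 1))])       # append a bias column
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares fit
w, b = coef[:-1], coef[-1]

pred = S @ w + b                               # combined overall score q
print(np.abs(pred - y).max())                  # near zero: y is noise-free here
```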

Evaluation
Here, we analyze the dialogue datasets (Mehri and Eskénazi, 2020) for possible relations between the GLUE task predictions and the annotator scores.

Baseline: UnSupervised and Reference-free (USR) evaluation metric
To bring the results into context, we compare our results to the work of Mehri et al. (2020). Their approach is reference-free and unsupervised, so it acts as a baseline against which we compare the method proposed in this work. The algorithm has three components. The first component, RoBERTa (Liu et al., 2019b), is fine-tuned on either PersonaChat (Zhang et al., 2018) or Topical-Chat (Gopalakrishnan et al., 2019). A concatenation of the input dialogue context and the target response is provided to its masked language modelling (MLM) objective, and the tokens in the response part are iteratively masked. In the end, the approach provides a probability score for the whole target sequence that indicates its fluency given the dialogue context. It is referred to as USR-MLM.
The second component again uses RoBERTa as its foundation. However, this time it is fine-tuned on the Ubuntu Corpus (Lowe et al., 2015) to perform dialogue retrieval using negative sampling: it is trained to distinguish between the proper response to a given context and a randomly sampled one. Mehri et al. (2020) report that this metric is appropriate for evaluating Maintains Context, Interesting, and Uses Knowledge. They refer to it as USR-DR (x = c) or USR-DR (x = f) for calculating it against the dialogue context or the dialogue facts, respectively.
Finally, the third component is a combination of the other two. Mehri et al. (2020) propose using a regression model to obtain one single score based on two separate metrics. This enables measuring the overall quality of a conversation. It is referred to as only USR.
While Mehri et al. (2020) report both turn- and system-level correlation scores, we benchmark only against the turn-level ones due to a lack of detail on how the system-level scores are calculated.

Quantitative Assessment
In Tables 1 and 2, we present the correlation analysis between the automated quality metrics and human annotator scores.
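The correlation analysis relies on the standard Pearson and Spearman coefficients; a numpy-only sketch (Spearman computed as Pearson on rank-transformed data, ignoring tie handling for brevity):

```python
import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def spearman(x, y):
    # Pearson correlation of the ranks (double argsort yields 0-based ranks)
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(x), rank(y))

metric_scores = [0.1, 0.35, 0.4, 0.8]   # illustrative proxy-indicator outputs
annotator_scores = [1, 2, 2.5, 4]       # illustrative human scores
print(spearman(metric_scores, annotator_scores))  # 1.0: perfect rank agreement
```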
In almost all of the criteria, the proxy indicators combined via linear regression outperform the combined USR metric and its best-performing components. In the few cases where USR performs better, the relative difference is minor.
Looking at the Understandable and Natural criteria, we see that CoLA as a single proxy indicator can weakly infer both measures on the TopicalChat dataset. However, it is outperformed by STSB and MRPC on PersonaChat, which suggests that those dialogues have a different nature involving context more strongly. This difference is also visible in the weaker performance of USR-MLM for Understandable and the shift to the context-based USR-DR for Natural.
Maintains Context is the only criterion where USR outperforms the proxy indicators. Among the proxy indicators, the Semantic Textual Similarity Benchmark (STSB) performs best, suggesting that some partial semantic overlap between context and response is necessary to model a dialogue's cohesiveness, even though a reply clearly does not need a high degree of semantic overlap with its context. Ultimately, the context-based USR-DR is the best-performing measure. We attribute its performance to the fact that it has been trained on dialogue data to distinguish between a correct and a randomly sampled response.
We turn our attention to the Interesting quality measure, where USR struggles on the PersonaChat dataset. The linear regression of the proxy indicators outperforms the rest by a considerable margin. It is curious that STSB calculated against the conversation facts has a relatively higher correlation score. This suggests that responses using the facts from the dialogue were also considered engaging, i.e., there is an overlap between the criteria Interesting and Uses Knowledge. Aside from that, we recommend using Recognizing Textual Entailment (RTE) to indicate the interestingness of a dialogue using only its context. Our results show a weak correlation, with Pearson's and Spearman's coefficients ranging from 0.11 to 0.21.
The last-mentioned metric is also the best performer for the latter criterion, Uses Knowledge. Furthermore, the fact-based STSB compared against Uses Knowledge delivers the highest correlation score among all metrics. Thus, a semantic similarity measure can be very indicative of whether a knowledge base is referenced in a conversation or not.
The linear regression of all proxy indicators appears as the most consistent performer, delivering the highest scores for several specific criteria and for the Overall one, with the exception of Maintains Context, where the context-based USR-DR prevails.

Ablation Study
We investigate four configurations, each using a different subset of the proxy indicators to calculate a combined score via linear regression, and check the correlation coefficients against the various dialogue criteria:
• single-sentence tasks only;
• pairwise-sentence tasks applied to the dialogue context and target response;
• pairwise-sentence tasks applied to the dialogue facts and target response;
• all proxy indicators combined.
The combination of single-sentence tasks shows signs of capability only on the criteria that can be evaluated utterance-wise: Understandable, Natural, and Interesting. On the others, there is a drop in correlation coefficients and statistical significance, which agrees with intuition: single-sentence tasks cannot model dialogue quality metrics that require a view beyond individual utterances.
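Such an ablation is a loop over feature subsets; a sketch reusing least squares and Pearson correlation on synthetic data (subset names and the data-generating weights are ours, chosen only to illustrate the mechanics):

```python
import numpy as np

rng = np.random.default_rng(2)
S = rng.uniform(size=(200, 4))   # columns stand in for: CoLA, SST-2, STSB-ctx, STSB-fact
# synthetic annotator score that depends only on the "pairwise" columns
y = 0.6 * S[:, 2] + 0.4 * S[:, 3] + rng.normal(scale=0.05, size=200)

def fit_and_correlate(cols):
    """Fit a linear regression on a column subset; return Pearson r of fit vs. y."""
    X = np.hstack([S[:, cols], np.ones((len(S), 1))])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    pred = X @ w
    pc, yc = pred - pred.mean(), y - y.mean()
    return (pc @ yc) / np.sqrt((pc @ pc) * (yc @ yc))

subsets = {"single": [0, 1], "pairwise": [2, 3], "all": [0, 1, 2, 3]}
for name, cols in subsets.items():
    print(name, round(fit_and_correlate(cols), 3))
```

On this toy data the "single" subset correlates near zero while "pairwise" and "all" correlate strongly, mirroring the kind of gap the ablation looks for.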
Turning to Maintains Context, we see the inverse. The pairwise-sentence proxy indicators applied to the dialogue context and target response demonstrate the best ability, while the single-sentence ones are the worst. The observation is partially supported by the pairwise tasks applied to the dialogue facts.
Regarding Interesting, the pairwise tasks clearly outperform the single-sentence ones, since what is engaging in a conversation is dictated by context rather than by individual utterances.
Moreover, the fact-based pair-wise proxy indicators demonstrate their strong ability to model the Uses Knowledge criterion since these are the only automatic metrics that have access to the fact information. In comparison, the others underperform since they are not evaluated against the relevant data.
Finally, it is evident that calculating an Overall score requires all of the proxy indicators: every subset combination performs worse than the linear regression combining all of the metrics. Moreover, the combined score also improves the correlation for specific criteria like Maintains Context and Interesting.

GLUE Predictor Feature Importance
In Figure 1, we present the inferred weights of the single GLUE predictors via linear regression.
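Reading off such feature importances amounts to ranking the fitted regression weights by absolute value; a sketch (the weight values below are made up for illustration and are not the paper's results):

```python
import numpy as np

names = ["CoLA", "SST-2", "STSB", "MRPC", "QQP", "QNLI", "RTE"]
weights = np.array([0.02, -0.01, 0.35, 0.30, 0.12, 0.15, 0.07])  # illustrative only

order = np.argsort(-np.abs(weights))  # rank predictors by absolute influence
for i in order:
    print(f"{names[i]:6s} {weights[i]:+.2f}")
```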
It is immediately evident that in both datasets, the single-sentence tasks, CoLA and SST-2, have an insignificant influence on the prediction of the overall quality score.
Semantic overlap between the utterances, via STS-B and MRPC, plays a significant role in both cases. However, in TopicalChat, the latter of the two has an even more substantial part, which the trivia-like nature of those conversations explains. The significant weights of QQP and QNLI between facts and conversation utterances support this observation.
Looking at the influence of knowledge-base-related predictors, we see that in PersonaChat it is essential to have semantic similarity (STSB) with the knowledge-base facts, i.e., that the dialogue systems use the personal traits in the conversation.

Error Analysis
In Appendix B, we present regression plots with 95% confidence intervals in order to inspect for errors. We draw the following conclusions:
• The linear regression on all scores has decent general performance. Its weakness is the lower end of the human-annotated overall quality spectrum, where the score variance, i.e., the disagreement between annotators, is higher.
• STS-B performs well on the "clear-cut" samples where knowledge is used or not. However, on borderline cases, where annotators disagree, i.e., some say knowledge is used and others not, it performs worse.
• CoLA performs excellently on the samples marked as Understandable by all annotators. As the scores for understandability decrease, so does the inter-annotator agreement, and hence also the performance of CoLA.
Overall, it appears that the approach suffers most when there is high disagreement between the annotators, which occurs on the lower end of the human annotator scoring.
The USR dataset includes information about the annotators in the form of nicknames. Based on those, one can assume that they were non-native English speakers with various backgrounds, which would explain the low inter-annotator agreement on Understandable and Natural. For example, native speakers of a Romance and a Slavic language are more likely to disagree on these two criteria. This is further confirmed by the higher variance in annotator scores on the lower spectrum of CoLA predictions: annotators agree on what understandable language is, but not on what it is not.

Conclusion
This work considered a model trained on GLUE as a proxy indicator for the quality of knowledge-grounded dialogues, offering different perspectives on dialogue quality criteria. It needs no references or supervision and can outperform competing approaches like USR (Mehri and Eskénazi, 2020). Pearson's and Spearman's correlation coefficients suggest that single proxy indicators and their various combinations via linear regression can infer dialogue quality either on specific criteria or in general. This composable nature can be used to tune the approach to focus more on particular criteria than on others.
While one might be concerned that the approach offers an advantage to dialogue systems incorporating BERT, we think it poses little to no risk. BERT is an encoder-only model and is uncommon in sequence generation applications, so the risk of bias is reasonably low. In addition, one could use any other base model architecture for training the GLUE predictors.
The model has no training or fine-tuning that is specifically geared towards dialogues. However, we showed that this lack of exposure to conversational data can be problematic for metrics like Maintains Context. Hence, we set as future work to investigate additional pre-training on dialogue data, similarly to Mehri et al. (2020), but also considering other proxy indicators like DialogueNLI (Welleck et al., 2019), which frames the natural language inference task in a conversational setting.
Finally, while we used separately trained instances of BERT for each of the GLUE tasks, one could also consider a multi-tasking method. For example, Liu et al. (2019a) present Multi-Task Deep Neural Networks (MT-DNN), which employ a single instance of BERT for all GLUE tasks. We believe combining multi-tasking and BERT would make the approach much easier to apply in a production environment, since model weights are to a greater extent shared between the tasks.

B Regression plots between predictions and human annotator scores
We provide regression plots with 95% confidence intervals between predictions and human annotator scores. Figures 2 and 3 show the correlation between the single GLUE predictions and the human annotator scores for TopicalChat and PersonaChat, respectively, while Figures 4 and 5 show the correlation between the various combinations via linear regression and the human annotator scores for TopicalChat and PersonaChat, respectively. The vertical lines represent the prediction distribution for the given averaged annotator score within a 95% confidence interval; the dot signifies the mean value. For example, looking at Figure 5, subplot "lin-reg_fact | Uses Knowledge," the line overlaps well with the lowest (0) and the highest (1) score, meaning that the prediction can distinguish well between a dialogue using knowledge or not. However, in the cases where the annotators could not agree, the predictor tends to overestimate the use of knowledge, since the intervals lie below the regression line.