Towards Quantifiable Dialogue Coherence Evaluation

Automatic dialogue coherence evaluation has attracted increasing attention and is crucial for developing promising dialogue systems. However, existing metrics have two major limitations: (a) they are mostly trained in a simplified two-level setting (coherent vs. incoherent), while humans give Likert-type multi-level coherence scores, dubbed “quantifiable”; (b) their predicted coherence scores cannot align with the actual human rating standards due to the absence of human guidance during training. To address these limitations, we propose Quantifiable Dialogue Coherence Evaluation (QuantiDCE), a novel framework aiming to train a quantifiable dialogue coherence metric that can reflect the actual human rating standards. Specifically, QuantiDCE includes two training stages, Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning. During MLR pre-training, a new MLR loss is proposed to enable the model to learn a coarse judgement of coherence degrees. Then, during KD fine-tuning, the pretrained model is further fine-tuned to learn the actual human rating standards with only a small amount of human-annotated data. To preserve generalizability even with limited fine-tuning data, a novel KD regularization is introduced to retain the knowledge learned at the pre-training stage. Experimental results show that the model trained by QuantiDCE presents stronger correlations with human judgements than the other state-of-the-art metrics.


Introduction
Dialogue coherence, which requires a response to be fluent, consistent and context-related, is an essential property for developing promising dialogue systems (Cervone et al., 2018). However, it is still challenging to evaluate the coherence of a response generated by a dialogue system. Although human evaluation is generally considered the most accurate way to evaluate coherence, it is expensive and slow, and therefore cannot keep up with the evaluation demands of frequent dialogue system development. Automatic evaluation metrics are thus developed to serve as human proxies that can rapidly compute dialogue coherence and return relatively accurate results. The most widely used metrics, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), measure the lexical word-overlap between generated responses and reference responses. However, they have been demonstrated to be biased and to correlate poorly with human judgements since no semantic information is considered (Liu et al., 2016; Novikova et al., 2017). To overcome this issue, researchers turned to developing learnable metrics based on neural networks that incorporate semantic information, such as RUBER (Tao et al., 2018), BERT-RUBER (Ghazarian et al., 2019) and GRADE (Huang et al., 2020). However, these metrics deviate from the actual human rating due to two limitations. First, they simplify the coherence evaluation task to a two-level setting, i.e., coherent or incoherent, by maximizing the differences between the positive coherent dialogues and the negative incoherent ones obtained by some negative sampling strategies. In contrast, humans usually adopt Likert scaling and give coherence scores on multiple levels, such as 1 to 5, as shown in Figure 1. Second, to avoid relying on large-scale human-annotated data, they are mostly trained in a purely unsupervised manner and cannot align with the human rating because the actual human rating standards are never introduced during training.

Figure 1: Likert-type multi-level human rating vs. two-level automatic evaluation. Human rating always considers multiple coherence degrees, while most of the existing automatic metrics only learn to distinguish coherent dialogues from incoherent ones and give relatively extreme coherence scores. (Example response shown in the figure: “I like cats more too and I work in a pet store.”)
To address the above limitations, we propose a novel dialogue coherence metric training framework, named Quantifiable Dialogue Coherence Evaluation (QuantiDCE). This framework consists of two training stages: Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning. At the MLR pre-training stage, a new multi-level ranking (MLR) loss is proposed for learning a coarse judgement of coherence degrees. Specifically, the MLR loss separates the context-response pairs with different coherence levels and compacts the pairs within the same level in a one-dimensional score space. As a result, the pretrained model is able to distinguish dialogue responses of different coherence levels for a given context and predicts more accurate coherence scores. At the KD fine-tuning stage, the pretrained model is further fine-tuned to learn the actual human rating standards with only very few human-annotated coherence scores. To mitigate overfitting to the scarce annotated data during fine-tuning, a novel knowledge distillation regularization loss is introduced to retain the knowledge learned at the pre-training stage, where the pretrained model (teacher) provides the soft targets for the model during fine-tuning (student). Experimental results show that the metric trained by our QuantiDCE clearly outperforms the other state-of-the-art metrics in terms of the Pearson, Spearman and Kendall correlations with human judgements, by around 5 percentage points on average. To summarize our contributions:
1) We propose QuantiDCE, a novel quantifiable training framework for dialogue coherence evaluation, which aims to align the automatic scores with the actual human rating standards via MLR pre-training and KD fine-tuning. To the best of our knowledge, it is the first attempt to consider the quantifiable problem for dialogue coherence evaluation.
2) Extensive experiments demonstrate the effectiveness of our QuantiDCE, which enables the trained metric to have clearly stronger correlations with human judgements than the other state-of-the-art metrics.

Related Work
Automatic Coherence Evaluation. The widely used automatic metrics, such as BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and ROUGE (Lin, 2004), use statistical rules to measure the degree of lexical word-overlap between generated responses and reference responses. However, these metrics have been demonstrated to correlate poorly with human judgements due to the absence of semantic information (Liu et al., 2016; Novikova et al., 2017). Subsequent metrics therefore incorporate semantic information. For instance, BERTScore (Zhang et al., 2020) measures soft semantic word-overlap rather than the hard lexical word-overlap used by BLEU. Moreover, learnable metrics that encode semantic information have recently attracted interest. They are trained either in a supervised manner with large-scale human-annotated data, such as ADEM (Lowe et al., 2017), or in an unsupervised manner with automatically constructed data, such as RUBER (Tao et al., 2018) and BERT-RUBER (Ghazarian et al., 2019). Furthermore, the recently proposed coherence metric, GRADE (Huang et al., 2020), introduces the graph information of dialogue topic transitions and achieves the current state-of-the-art results. Note that these learnable metrics are trained with a two-level training objective to separate the coherent dialogues from the incoherent ones, while our QuantiDCE models the task in a multi-level setting which is closer to the actual human rating.
Knowledge Distillation. Knowledge distillation (KD) is a method that transfers the knowledge from a large trained teacher model to a smaller student model by using the soft targets provided by the teacher (Hinton et al., 2015). In recent years, KD has been applied to many specific tasks (Sun et al., 2020; Kim and Rush, 2016; Sourty et al., 2020). Unlike these previous works, we use KD to retain knowledge learned at the pre-training stage during fine-tuning and do not compress the model size of the student model.

Figure 2: The overall pipeline of our QuantiDCE, consisting of two training stages which are marked by the blue and the black one-way arrows. Each input dialogue example contains one context with three-level candidate responses and five responses for each level, shown as red, orange and green rectangles respectively. The solid circle represents the centroid score for each level of the i-th dialogue. At the MLR pre-training stage, the context-response pairs are encoded with BERT and transformed into the coherence scores through the MLP prediction network, and then the MLR loss is applied to optimize the network. The dotted two-way arrows indicate that both ends should be separated, while the solid two-way arrows indicate that both ends should be compact. At the KD fine-tuning stage, the student model is first initialized with the teacher model and then optimized by the KD-MSE loss.

QuantiDCE Framework
In this section, we present QuantiDCE, a two-stage framework for dialogue coherence metric learning, consisting of Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning. As illustrated in Figure 2, given a metric model M (Section 3.1), QuantiDCE enables M to learn multi-level representations for context-response pairs with different levels of coherence during the pre-training stage (Section 3.2), and further to learn the rating standards of humans with only a fraction of data during the fine-tuning stage (Section 3.3). After these two training stages, the quantifiable gap between automatic metrics and humans can be clearly reduced.

Model Architecture
In our QuantiDCE framework, the metric model M is composed of: (1) an encoder network for encoding the input context-response pairs into features and (2) a predictor network for transforming the encoded features into coherence scores. Specifically, we adopt BERT (Devlin et al., 2019) as the encoder network and a multi-layer perceptron (MLP) as the predictor network. Given a context $c = \{c_1, \cdots, c_m\}$ and a response $r = \{r_1, \cdots, r_n\}$, where $c_i$ and $r_i$ are tokens of the context and the response respectively, $c$ and $r$ are concatenated as $\{\text{[CLS]}, c_1, \cdots, c_m, \text{[SEP]}, r_1, \cdots, r_n, \text{[SEP]}\}$, denoted as $[c; r]$. Then the coherence score $\hat{s}$ of the response $r$ w.r.t. the context $c$ is predicted by:

$$\hat{s} = \mathrm{MLP}(\mathrm{BERT}([c; r])), \tag{1}$$

where MLP is a three-layer fully-connected network in which the activation functions of the three layers are two exponential linear units (Clevert et al., 2016) and a sigmoid function, respectively.
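The input construction and the predictor head described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the BERT encoder is replaced by an arbitrary placeholder feature vector, and the helper names (`build_input`, `mlp_score`) and layer sizes are hypothetical choices made only for illustration.

```python
import math

CLS, SEP = "[CLS]", "[SEP]"

def build_input(context_tokens, response_tokens):
    """Concatenate context and response as [CLS] c [SEP] r [SEP]."""
    return [CLS, *context_tokens, SEP, *response_tokens, SEP]

def elu(x, alpha=1.0):
    """Exponential linear unit."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(weights, bias, inputs):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def mlp_score(features, layers):
    """Three-layer predictor: ELU, ELU, then sigmoid, giving a score in (0, 1).

    `features` stands in for the encoder output (assumption: any fixed-size
    vector); `layers` is a list of three (weights, bias) pairs.
    """
    (w1, b1), (w2, b2), (w3, b3) = layers
    hidden = [elu(v) for v in linear(w1, b1, features)]
    hidden = [elu(v) for v in linear(w2, b2, hidden)]
    (logit,) = linear(w3, b3, hidden)
    return sigmoid(logit)

# Toy usage with hand-picked weights (hypothetical values).
tokens = build_input(["how", "are", "you"], ["fine", "thanks"])
toy_layers = [
    ([[0.5, -0.2], [0.1, 0.3]], [0.0, 0.1]),
    ([[0.4, 0.4], [-0.3, 0.2]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
score = mlp_score([0.7, -1.2], toy_layers)
```

Because the last activation is a sigmoid, every predicted score lies strictly between 0 and 1, which is what allows the ranking losses to operate in a bounded one-dimensional score space.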

MLR Pre-Training
To learn a coarse judgement of coherence degrees without the direct supervision of score annotations, the model M is first pretrained by minimizing a new multi-level ranking (MLR) loss on a large-scale dialogue dataset. Concretely, the MLR loss is composed of a separation loss, a compactness loss and an ordering loss. Formally, given a training dataset $D_{pt} = \{(c_i, R_i)\}_{i=1}^{N_1}$, where $c_i$ is a dialogue context and $R_i = \{(r_{i,1}^{j}, \cdots, r_{i,K}^{j})\}_{j=1}^{L}$ is a response set with L coherence levels and K responses for each level, the model M is trained by minimizing the following MLR loss:

$$\mathcal{L}_{mlr} = \frac{1}{N_1} \sum_{i=1}^{N_1} \left( \ell_i^{sep} + \ell_i^{com} + \ell_i^{ord} \right), \tag{2}$$

where $\ell_i^{sep}$, $\ell_i^{com}$ and $\ell_i^{ord}$ refer to the separation loss, the compactness loss and the ordering loss of the i-th example, respectively.
The separation loss aims to separate the features of context-response pairs with different coherence levels by separating the coherence scores of the different pairs. Moreover, to compute the loss efficiently, we first compute the centroids of the context-response pairs belonging to the same coherence level for the i-th dialogue example, i.e., $e_i^{j} = \frac{1}{K} \sum_{k=1}^{K} \hat{s}_{i,k}^{j}$ for $j \in [1, L]$, and the separation loss between the centroids is then computed as follows:

$$\ell_i^{sep} = \sum_{j=1}^{L-1} \sum_{l=j+1}^{L} \max\left(0,\; w \cdot \lambda - d(e_i^{j}, e_i^{l})\right), \tag{3}$$

where $d(\cdot)$ is the L1 distance, $\lambda$ is the lower bound for the distance between two centroids, and $w = l - j$ is the distance weight used for amplifying the lower bound w.r.t. the coherence-level gap.
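As a hedged illustration of the separation term, the sketch below computes level centroids and the hinge on their pairwise L1 distances for a single dialogue example. The function names and the list-of-lists input layout are hypothetical; the default λ = 0.3 is the value reported in the implementation details.

```python
def level_centroids(level_scores):
    """level_scores[j] holds the K predicted scores of coherence level j + 1."""
    return [sum(scores) / len(scores) for scores in level_scores]

def separation_loss(level_scores, lam=0.3):
    """Hinge on centroid distances; the bound grows with the level gap w = l - j."""
    centroids = level_centroids(level_scores)
    loss = 0.0
    for j in range(len(centroids)):
        for l in range(j + 1, len(centroids)):
            w = l - j  # larger level gap -> larger required distance
            distance = abs(centroids[j] - centroids[l])  # L1 distance in 1-D
            loss += max(0.0, w * lam - distance)
    return loss

# Well-separated levels incur no loss; collapsed levels are penalized.
well_separated = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
collapsed = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
```

With three levels, the level-1 and level-3 centroids must be at least 2λ apart while adjacent levels need only λ, which is how the weight w encodes the relative gap between levels.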
The compactness loss aims to compact the pairs within the same level, which serves as a regularizer to avoid the occurrence of outliers for each coherence level. Specifically, the coherence score $\hat{s}_{i,k}^{j}$ is forced to be closer to the corresponding centroid $e_i^{j}$ as follows:

$$\ell_i^{com} = \sum_{j=1}^{L} \sum_{k=1}^{K} \max\left(0,\; d(e_i^{j}, \hat{s}_{i,k}^{j}) - \mu\right), \tag{4}$$

where $\mu$ is the upper bound for the distance between the centroid of a certain coherence level and the score within this level.
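The compactness term can be sketched in the same style. This is an illustrative reading of the description above rather than the authors' code; the function name is hypothetical and µ = 0.1 is the value reported in the implementation details.

```python
def compactness_loss(level_scores, mu=0.1):
    """Penalize scores lying farther than mu from their own level centroid."""
    loss = 0.0
    for scores in level_scores:
        centroid = sum(scores) / len(scores)
        loss += sum(max(0.0, abs(centroid - s) - mu) for s in scores)
    return loss

# A tight level is free; a level with an outlier is penalized.
tight = [[0.50, 0.55]]
with_outlier = [[0.2, 0.2, 0.8]]
```

Scores within µ of their centroid contribute nothing, so the term only acts on outliers and leaves some spread within each level.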
The ordering loss is finally introduced to ensure that the rank order of the predicted scores satisfies the pre-defined order of coherence degrees, i.e., $\hat{s}_{i,k}^{j} < \hat{s}_{i,k}^{j+1}$ for $j \in [1, L-1]$. It is critical since the separation loss only restricts the scores of the pairs from different coherence levels to be separated, and this restriction is also satisfied when the scores of the highest level are lower than the scores of the lowest level. Similar to the separation loss, the ordering loss is also computed between each pair of centroids as follows:

$$\ell_i^{ord} = \sum_{j=1}^{L-1} \sum_{l=j+1}^{L} \max\left(0,\; e_i^{j} - e_i^{l}\right). \tag{5}$$
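The point that separation alone does not fix the rank order can be checked numerically. The following sketch, with hypothetical function names and toy scores, shows a reversed ordering that has zero separation loss but a positive ordering loss.

```python
def level_centroids(level_scores):
    """level_scores[j] holds the predicted scores of coherence level j + 1."""
    return [sum(scores) / len(scores) for scores in level_scores]

def separation_loss(level_scores, lam=0.3):
    """Hinge on pairwise centroid distances (direction-agnostic)."""
    e = level_centroids(level_scores)
    return sum(max(0.0, (l - j) * lam - abs(e[j] - e[l]))
               for j in range(len(e)) for l in range(j + 1, len(e)))

def ordering_loss(level_scores):
    """Penalize any lower-level centroid that exceeds a higher-level centroid."""
    e = level_centroids(level_scores)
    return sum(max(0.0, e[j] - e[l])
               for j in range(len(e)) for l in range(j + 1, len(e)))

correct_order = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]   # level 1 < level 2 < level 3
reversed_order = [[1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]  # level 1 > level 2 > level 3
```

Both toy configurations are equally well separated, so only the ordering term distinguishes them.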

KD Fine-Tuning
The model M pretrained by the MLR loss is further trained at the KD fine-tuning stage to directly learn the actual human rating standards with only a fraction of annotated data. Formally, given a training dataset $D_{ft} = \{(c_i, r_i, s_i)\}_{i=1}^{N_2}$, where $c_i$, $r_i$ and $s_i$ are the dialogue context, the corresponding response and the human-annotated coherence score of $r_i$ w.r.t. $c_i$ respectively, the conventional fine-tuning approach for the scoring task optimizes the model M with an MSE loss between the predicted score $\hat{s}_i$ and the human score $s_i$:

$$\ell_i^{mse} = (s_i - \hat{s}_i)^2. \tag{6}$$

However, by minimizing $\ell_i^{mse}$ for each example, the model M will easily overfit the very few annotated examples, and thus the model generalizability will be dramatically reduced. To overcome this issue, a novel knowledge distillation (KD) regularization is introduced for retaining the knowledge learned at the MLR pre-training stage. Concretely, the pretrained model M is treated as the teacher model that provides the soft targets for the student model $\hat{M}$, which is entirely copied from M. We adopt the distillation objectives of TinyBERT (Jiao et al., 2020), including the distillations of the embedding layer, the Transformer layers and the prediction layer. The KD loss is then formulated as:

$$\ell_i^{kd} = \sum_{t=0}^{T+1} \left\| O_i^{t} - \hat{O}_i^{t} \right\|_2^2 + \sum_{t=1}^{T} \left\| A_i^{t} - \hat{A}_i^{t} \right\|_2^2, \tag{7}$$

where $\| \cdot \|_2^2$ indicates the squared L2 norm, T is the number of the Transformer layers, $O_i^{t}$ and $\hat{O}_i^{t}$ are the t-th layer outputs of M and $\hat{M}$ respectively, and $A_i^{t}$ and $\hat{A}_i^{t}$ are the attention matrices of the t-th Transformer layer. Note that layer 0 and layer T+1 refer to the embedding layer and the prediction layer respectively.
Overall, the loss function for KD fine-tuning, named the KD-MSE loss, is the weighted sum of $\ell_i^{mse}$ and $\ell_i^{kd}$ across the whole training dataset $D_{ft}$:

$$\mathcal{L}_{kd\_mse} = \frac{1}{N_2} \sum_{i=1}^{N_2} \left( \alpha\, \ell_i^{mse} + \beta\, \ell_i^{kd} \right), \tag{8}$$

where $\alpha$ and $\beta$ are hyperparameters, and we empirically found that $\alpha = 1$ and $\beta = 5$ perform well. The overall training procedure is summarized in Algorithm 1.

Algorithm 1 (only the final steps are recoverable): 17: update $\hat{M}$ to minimize $\mathcal{L}_{kd\_mse}$; 18: end for; 19: return student model $\hat{M}$.
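To make the weighting of the two terms concrete, here is a small pure-Python sketch of the KD-MSE objective. It is an illustration under stated assumptions, not the authors' training code: layer outputs and attention matrices are represented as flat lists of floats, the dictionary keys are hypothetical, and the loss is averaged over examples.

```python
def squared_l2(a, b):
    """Squared L2 distance between two flattened tensors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kd_loss(teacher_outputs, student_outputs, teacher_attn, student_attn):
    """Layer-wise distillation: outputs cover layers 0..T+1, attentions layers 1..T."""
    output_term = sum(squared_l2(o, o_hat)
                      for o, o_hat in zip(teacher_outputs, student_outputs))
    attention_term = sum(squared_l2(a, a_hat)
                         for a, a_hat in zip(teacher_attn, student_attn))
    return output_term + attention_term

def kd_mse_loss(examples, alpha=1.0, beta=5.0):
    """Weighted sum of the MSE and KD terms, averaged over the examples."""
    total = 0.0
    for ex in examples:
        mse = (ex["human_score"] - ex["student_score"]) ** 2
        kd = kd_loss(ex["teacher_outputs"], ex["student_outputs"],
                     ex["teacher_attn"], ex["student_attn"])
        total += alpha * mse + beta * kd
    return total / len(examples)

# Student identical to teacher: only the MSE term remains.
unchanged = {
    "human_score": 0.75, "student_score": 0.25,
    "teacher_outputs": [[1.0, 0.0]], "student_outputs": [[1.0, 0.0]],
    "teacher_attn": [[0.5, 0.5]], "student_attn": [[0.5, 0.5]],
}
# Student drifted away from the teacher in one layer output.
drifted = dict(unchanged, student_outputs=[[0.0, 0.0]])
```

With β = 5, a unit of drift from the teacher's internal representations costs five times as much as a unit of squared score error, which is the mechanism that discourages the student from forgetting the pre-trained ranking behaviour.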
Evaluation. Our QuantiDCE and the baselines are evaluated by computing the correlations between the model-predicted scores and the human-rated scores. Specifically, we adopt Pearson, Spearman and Kendall as the correlation measures and a large-scale human judgement benchmark (Huang et al., 2020) to provide the human-rated scores. This benchmark contains 1,200 unique (context, response, human-rated score) triplets for metric evaluation, where the contexts were randomly selected from the test sets of three chit-chat datasets, DailyDialog (Li et al., 2017), ConvAI2 (Dinan et al., 2019) and EmpatheticDialogues (Rashkin et al., 2019), and the responses were produced by both retrieval-based and generation-based dialogue models to ensure response diversity.
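For readers unfamiliar with the three correlation measures, the sketch below implements them in pure Python on toy data. The Kendall variant shown is tau-b (with tie correction), which is an assumption since the text does not state the variant; in practice one would more likely call `scipy.stats.pearsonr`, `spearmanr` and `kendalltau`. The toy scores are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall tau-b from concordant and discordant pairs, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / math.sqrt(
        (total_pairs - ties_x) * (total_pairs - ties_y))

# Toy data: metric scores in (0, 1) vs. 1-5 Likert ratings (made-up values).
predicted = [0.10, 0.40, 0.35, 0.80]
human = [1, 3, 2, 5]
```

The rank-based measures reward a metric for ordering responses as humans do, while Pearson additionally rewards a linear relationship between the predicted and human score scales.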
Training Datasets. We use two datasets, DailyDialog++ and DailyDialogEVAL, to support the pre-training and fine-tuning of QuantiDCE, respectively. The DailyDialog++ dataset (Sai et al., 2020) contains over 11K conversations and augments the original DailyDialog dataset with multiple responses of different quality levels, including five golden reference responses, five adversarial irrelevant responses and five randomly selected responses for each context. Therefore, in this work, we set the number of coherence levels L = 3, where the pairs containing the random responses, the adversarial responses and the reference responses belong to levels 1, 2 and 3, respectively. As for the fine-tuning data, we use the DailyDialog human judgement dataset, denoted as DailyDialogEVAL, which is a subset of the adopted evaluation benchmark (Huang et al., 2020) with 300 human-rated examples in total, and randomly split the data into training (90%) and validation (10%) sets.
Implementation Details. We use BERT$_{\text{BASE}}$ to initialize the encoder network, which is in line with the current SOTA metric, GRADE. For the MLR pre-training, we pretrain our model for 5 epochs with batch size 3 and learning rate 2e-5, where the lower bound for the separation loss is λ = 0.3 and the upper bound for the compactness loss is µ = 0.1. For the KD fine-tuning, we further fine-tune the pretrained model for 20 epochs with batch size 10 and learning rate 5e-6. For all the training, BERTAdam is used as the optimizer with β1 = 0.9 and β2 = 0.999. For the Transformer-layer distillation, we distill all the Transformer layers since the model architectures of the teacher and the student are exactly the same.
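The data preparation described above can be sketched as follows. This is only an illustration: the response-type labels, the function names and the random seed are hypothetical, and the actual DailyDialog++ file format is not shown in the text.

```python
import random

# Level assignment stated in the text: random -> 1, adversarial -> 2, reference -> 3.
LEVEL_OF = {"random": 1, "adversarial": 2, "reference": 3}

def group_by_level(responses):
    """Group (response_text, response_type) pairs of one context by coherence level."""
    levels = {1: [], 2: [], 3: []}
    for text, kind in responses:
        levels[LEVEL_OF[kind]].append(text)
    return levels

def split_train_valid(items, train_fraction=0.9, seed=0):
    """Random 90/10 split; the seed is an arbitrary choice for reproducibility."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 300 human-rated examples -> 270 for training, 30 for validation.
train_ids, valid_ids = split_train_valid(range(300))
```

With the 300 DailyDialogEVAL examples, a 90/10 split leaves 270 examples for fine-tuning, which is the data-scarce regime that motivates the KD regularization.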

Experimental Results
Metric Performance. The correlation results of QuantiDCE and the other baseline metrics on the large-scale human judgement benchmark are presented in Table 1, including the ConvAI2 and the EmpatheticDialogues datasets (footnote 6). For a fair comparison, the learnable baseline metrics, ADEM, BERT-RUBER and GRADE, are trained on the training dataset we adopted, i.e., DailyDialog++ (footnote 7). Generally, QuantiDCE achieves an absolute averaged correlation improvement of around 5 percentage points over the current SOTA, GRADE. Besides, all the results of QuantiDCE are statistically significant with p-value < 0.01.

Footnote 6: The DailyDialogEVAL dataset was not used for evaluation since we used it for fine-tuning.
Footnote 7: BLEURT was not trained on DailyDialog++ since this dataset is not suitable for the BLEURT pre-training strategy. Instead, we trained BLEURT with the fine-tuning data we used. The training details of these baseline metrics are provided in Appendix A.

Pre-Training Objective. To verify the superiority of our pre-training objective, namely the MLR loss, we investigated the performance of several existing loss functions for pre-training compared with ours. Specifically, two categories of loss functions used for metric training are adopted: (a) the two-level setting and (b) the multi-level setting. The binary cross entropy (BCE) loss and the margin ranking loss are adopted for the two-level setting, while another three loss functions are adopted for the multi-level setting: the supervised contrastive (SupCon) loss (Khosla et al., 2020), the fast-approximated triplet (FAT) loss (Yuan et al., 2019) and the vanilla MLR loss. As shown in Table 2, the performance of our MLR loss is the best among all the pre-training objectives. We also found that the multi-level setting losses perform better than the two-level ones, especially on the ConvAI2 dataset.
Moreover, to analyze the performance of these pre-training objectives more intuitively, we also visualize the encoded features and the predicted scores of the model M after being pretrained by the above loss functions on the DailyDialog++ dataset without fine-tuning. As shown in Figure 3, (a) the BCE loss cannot separate the level-1 scores from the level-2 ones and the corresponding features are also mixed; (b) the FAT loss, on the other hand, separates the features of different levels well, but does not consider the relative gaps, in that the distances between the level-1 and level-3 features are not larger than those between level-1 and level-2; (c) in contrast, our MLR loss separates both the features and the scores well and also considers the relative gaps between different levels.

Table 3: Correlations between human judgements and the metric model M further trained with different fine-tuning losses after MLR pre-training.
Fine-Tuning Objective. Furthermore, we also verified the effectiveness of our KD-MSE loss during fine-tuning by comparing it with other fine-tuning losses: the pure MSE loss without KD regularization, as shown in Equation 6, and the same MSE loss with the encoder network frozen so that only the predictor network, i.e., the MLP, is fine-tuned, denoted as MSE (fix encoder). As shown in Table 3, compared with the other two losses, the model fine-tuned by our KD-MSE loss has the highest correlation results on both ConvAI2 and EmpatheticDialogues. Moreover, by comparing the results of MSE and KD-MSE, we can find that introducing KD regularization leads to clear averaged correlation improvements of 20.2 percentage points on ConvAI2 and 11.3 percentage points on EmpatheticDialogues, which verifies the effectiveness of the KD loss. Besides, we also reported the last-epoch correlation results on the training dataset, DailyDialogEVAL. The results of MSE and MSE (fix encoder) indicate overfitting and underfitting on DailyDialogEVAL respectively, which explains their low performance on the two evaluation datasets. In contrast, our KD-MSE loss enables the model to learn the actual human rating standards from the scarce annotated data while avoiding overfitting to it. Finally, in Figure 4, we present the visualization of the scores predicted by our QuantiDCE after KD fine-tuning. Compared with the score distributions before fine-tuning in Figure 3(c), the fine-tuned score distributions of level-1 and level-3 are wider and partly overlap with the level-2 distribution. This is expected, as judgements of coherence are always subjective and humans tend to give vague, middle scores instead of extremely high or low scores.

Ablation Studies
Component Analysis. To verify the contributions of the core components in our QuantiDCE, we further conducted ablation studies on the ConvAI2 dataset. As shown in Table 4, both the MLR pre-training and the KD fine-tuning contribute to the better performance of QuantiDCE. Besides, we also conducted ablations by removing one of the secondary losses during MLR pre-training, i.e., the separation loss, the compactness loss or the ordering loss. The results show that the performance benefits from all of these losses, among which the separation loss and the ordering loss are crucial for training a metric with strong and positive human correlations.
Number of Data for Fine-Tuning. Moreover, we also investigated how the scale of the fine-tuning data affects the model performance by increasing the amount of fine-tuning data by 5% each time, starting from zero. The trend of the model performance is presented in Figure 5. We observed that minimizing our KD-MSE loss made the correlation results increase gradually after an initial decrease. More specifically, the result recovered to its level before fine-tuning at around the 70% data scale and continued increasing until 100%, with a final improvement of around 2 percentage points. For comparison, the performance trends of MSE and MSE (fix encoder) are also provided. These results present overall decreasing trends of the model performance, which indicates that the model trained by MSE or MSE (fix encoder) cannot benefit from the increase in data scale, due to severe overfitting or underfitting. Therefore, to effectively utilize the limited data, it is important to enable the update of the entire network and add some constraints to avoid overfitting, such as our proposed KD regularization.

Case Study
To illustrate the performance of QuantiDCE, two representative examples are shown in Table 5. The first example shows the strength of QuantiDCE, where the coherence score given by our metric is closer to the human rating score than the extremely high score given by GRADE. However, in the second example, both our QuantiDCE and GRADE deviate from the human score, possibly because the number of coherence levels we adopted in this work (L = 3) is insufficient, as humans usually consider more levels of dialogue coherence.

Conclusion
In this paper, we propose QuantiDCE, a novel training framework aiming to bridge the gap between the training objective and the actual human rating and to train a quantifiable dialogue coherence metric. In general, QuantiDCE includes two training stages, MLR pre-training for learning the coarse human judgements of dialogue coherence degrees, and KD fine-tuning for learning the actual human rating standards. Experimental results show that the metric trained by QuantiDCE presents strong correlations with human judgements. For future work, it would be interesting to investigate a more efficient way to obtain multi-level data and to extend the multi-level setting to general evaluation for natural language generation.

Margin Ranking Loss. Similarly, the margin ranking loss simplifies the evaluation task to a two-level setting and maximizes the differences between the positive coherent dialogues and the negative incoherent ones. As the name suggests, the focus of the margin ranking loss is ranking, which aims at ranking the scores of positive coherent dialogues ahead of the negative incoherent ones.
SupCon Loss. The supervised contrastive (SupCon) loss (Khosla et al., 2020), which pulls the positive anchors closer and pushes the negatives farther away in representation space, can be adopted for the multi-level setting. Here, for our multi-level setting, we consider the dialogues of level-1, level-2, and level-3 as positive anchors successively, and the remaining two levels as corresponding negatives.
FAT Loss. The fast-approximated triplet (FAT) loss (Yuan et al., 2019) replaces the traditional point-to-point distances of the triplet loss with point-to-cluster distances through an upper bound relaxation of the triplet form. It was first applied to the classification task and significantly reduces the computation cost. To use the FAT loss in our evaluation task, we consider the different coherence levels as different classes and apply the FAT loss to separate the context-response pairs with different coherence levels.
Vanilla MLR Loss. The vanilla MLR loss is the extension of the margin ranking loss to a multi-level version by repeatedly applying the original margin ranking loss between different levels, which can be directly applied to our evaluation task.

Footnote 11: https://github.com/google-research/bleurt
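As a rough sketch of how a margin ranking loss extends to multiple levels, the code below applies a pairwise hinge between the scores of every lower-level and higher-level pair. The exact formulation of the cited vanilla MLR loss is not given in the text, so the margin value, the use of all level pairs (rather than only adjacent ones) and the function names are assumptions made for illustration.

```python
def margin_ranking(score_high, score_low, margin=0.1):
    """Two-level hinge: the higher-level score should exceed the lower by `margin`."""
    return max(0.0, margin - (score_high - score_low))

def multi_level_margin_ranking(level_scores, margin=0.1):
    """Apply the two-level hinge between every pair of distinct levels.

    level_scores[j] holds the predicted scores of coherence level j + 1.
    """
    loss = 0.0
    for low in range(len(level_scores)):
        for high in range(low + 1, len(level_scores)):
            for s_low in level_scores[low]:
                for s_high in level_scores[high]:
                    loss += margin_ranking(s_high, s_low, margin)
    return loss

ordered = [[0.0], [0.5], [1.0]]    # level 1 < level 2 < level 3
inverted = [[1.0], [0.5], [0.0]]   # level 1 > level 2 > level 3
```

A loss of this pairwise form enforces the rank order between levels but has no term that pulls scores of the same level together.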

C Visualizations of the Pre-Training Losses
We have already compared the visualization results of the BCE loss and the FAT loss. As a supplement, here we describe the visualizations of the margin ranking loss, the SupCon loss and the vanilla MLR loss in detail.
As we can see in Figure 6, (a) the margin ranking loss cannot separate the level-1 scores from the level-2 ones and the corresponding features are also mixed, which is similar to the BCE loss; (b) the SupCon loss, on the other hand, can distinguish the features of the three levels to some extent, and the scores of different levels are also separated but do not follow the real rank order, i.e., level-1 < level-2 < level-3; (c) finally, the vanilla MLR loss can separate the context-response pairs with different coherence levels in feature space, and the predicted scores also follow the actual rank order. However, its score distributions are not compact enough for level-1 and level-3.
Figure 6: Visualizations of features (the scatter plots in the upper row) and scores (the violin plots in the lower row) on the DailyDialog++ dataset. The features and scores in each of the three columns are obtained from the metric model M pretrained only with (a) the margin ranking loss, (b) the SupCon loss and (c) the vanilla MLR loss, respectively.