FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation

Recent model-based reference-free metrics for open-domain dialogue evaluation exhibit promising correlations with human judgment. However, they either perform turn-level evaluation or look at a single dialogue quality dimension. One would expect a good evaluation metric to assess multiple quality dimensions at the dialogue level. To this end, we are motivated to propose a multi-dimensional dialogue-level metric, which consists of three sub-metrics with each targeting a specific dimension. The sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions. Moreover, we explore two approaches to combine the sub-metrics: metric ensemble and multitask learning. Both approaches yield a holistic metric that significantly outperforms individual sub-metrics. Compared to the existing state-of-the-art metric, the combined metrics achieve around 16% relative improvement on average across three high-quality dialogue-level evaluation benchmarks.


Introduction
In the study of generative dialogue systems, we heavily rely on reference-based static metrics, such as BLEU (Papineni et al., 2002), to measure improvements during system development and to compare various model variants. These metrics remain unsatisfactory due to their poor correlations with human judgment (Liu et al., 2016) and poor interpretability (Mehri and Eskenazi, 2020b).
Recently, model-based reference-free metrics (Yeh et al., 2021) have emerged as one way to address the limitations of static reference-based metrics. Although such metrics exhibit promising correlations with human evaluation, most of them (Tao et al., 2018; Ghazarian et al., 2019; Huang et al., 2020; Sinha et al., 2020; Mehri and Eskenazi, 2020b; Phy et al., 2020; Pang et al., 2020; Zhang et al., 2021c) target turn-level evaluation, i.e., they focus on single-response quality, such as contextual relevance and naturalness. When evaluating a multi-turn human-chatbot dialogue, turn-level metrics do not model the dialogue in totality, but frame it as a set of context-response pairs. They assign a score to every chatbot response in the dialogue. Hence, an aggregation strategy is required to derive a single dialogue-level metric score, such as taking the average of all the response-level scores. Both prior works (Zhang et al., 2021a; Yeh et al., 2021) and our experimental results in §5 suggest that such an approach yields sub-optimal dialogue-level evaluation. The reason may be that turn-level metrics do not model the dependency among utterances within multi-turn interactions, making it difficult for them to spot errors that only become obvious after observing the entire conversation (Ghandeharioun et al., 2019; Ghazarian et al., 2022).
Some metrics do perform multi-turn evaluation. However, they focus only on a single dimension, such as coherence or overall impression (Mesgar et al., 2020; Zhang et al., 2021a; Li et al., 2021; Ghazarian et al., 2022). When evaluating a dialogue, they assign a single score that quantifies one aspect of dialogue quality. As pointed out in Mehri et al. (2022), dialogue quality is inherently multi-faceted. By breaking down the quality of the dialogue into multiple fine-grained dimensions, we may provide a more interpretable and descriptive dialogue evaluation. With such an interpretable metric, dialogue researchers know exactly which aspect of the dialogue system to improve.
To this end, we propose a multi-dimensional metric, dubbed FineD-Eval, which consists of specialized sub-metrics. Each sub-metric targets a specific fine-grained dimension, and all sub-metrics are trained in a self-supervised manner without reliance on any human annotations.
To develop FineD-Eval, our first step is to identify the dimensions for metric design. It is a well-known phenomenon that human judges do not provide completely independent assessments for various fine-grained dimensions. For instance, Sai et al. (2021) analyze the human ratings with respect to (w.r.t.) different fine-grained dimensions on four text generation tasks and observe moderate correlations for most dimension pairs. Intuitively, we want to select dimensions that are less correlated such that our metric can holistically capture dialogue quality from different perspectives. The selection process is guided by an analysis of fine-grained human ratings in dialogue-level evaluation data (§2). Through the analysis, we cluster the dimensions into relatively independent dimension groups and then select representative dimensions from different groups.
Next, we propose dimension-specific strategies for training the sub-metrics (§3.3). The sub-metrics, which target the representative dimensions, can also be applied to evaluate other dimensions in their respective dimension groups. Furthermore, both Yeh et al. (2021) and Zhang et al. (2021d) highlight that the combination of different metrics leads to better correlations with human evaluation than individual specialized metrics. We are thus motivated to explore how to combine the sub-metrics into a unified one. Specifically, both metric ensemble and multitask learning (Caruana, 1997) are examined (§3.4).
Finally, in the experiments (§5), we demonstrate that (1) the sub-metrics correlate highly with human judgment for their target dimensions; (2) the scores assigned by FineD-Eval are more interpretable than those of existing metrics; and (3) with either metric ensemble or multitask learning, FineD-Eval significantly outperforms existing state-of-the-art metrics as well as the individual sub-metrics on three high-quality dialogue-level evaluation benchmarks.

Grouping of the Dimensions
In this section, we analyze the human ratings of FED (Mehri and Eskenazi, 2020a), a high-quality dialogue-level evaluation benchmark. Each dialogue in FED is annotated by five human judges for 11 different quality dimensions, as shown in the axis labels of Figure 1. We choose FED for our analysis because the dataset covers the most comprehensive list of dialogue quality dimensions. In addition, the human annotation quality of FED is high, as evidenced by the strong inter-annotator agreements w.r.t. different dimensions.
Figure 1 presents the Spearman correlations of different dimension pairs on FED. We can observe that all dimensions are interdependent, with correlations ranging from 0.38 to 0.88. Based on their extent of interdependence, we cluster the 10 dimensions (excluding the "Overall" category) into six groups, as shown in Table 1. We adopt the first three letters of the representative dimension within each group as the corresponding group name. The representative dimension in each group is chosen based on criteria discussed in §2.2.
A dimension is treated as an independent group if it does not correlate strongly (correlation ≥ 0.75) with any of the other dimensions. Hence, consistency, inquisitiveness, and error recovery can be perceived as three independent dimension groups: Con, Inq, and Err respectively. The remaining dimensions are more or less correlated with each other. Based on the following four observations: (1) coherence strongly correlates with understanding (0.83); (2) the likability-flexibility and likability-informativeness correlations are both 0.82; (3) the correlation between topic depth and informativeness is as high as 0.84; and (4) diversity strongly correlates only with topic depth (0.8), the remaining seven dimensions can be clustered into three groups: Coh, Lik, and Top.
The categorization may not be perfect, as Coh, Lik, and Top are not completely independent from each other. For example, informativeness can be found in both group Lik and group Top. A possible explanation is that humans generally like knowledgeable chatbots that can discuss different topics in depth rather than those that generate dull responses (See et al., 2019; Roller et al., 2021). To improve the categorization, future work may conduct a similar analysis on large-scale dialogue-level human annotations.

Dimension Selection
As mentioned in §1, we want to identify fine-grained dimensions that are less similar. Hence, we select only one dimension from each group and avoid those that are shared between two different groups. In addition, to further reduce the complexity of FineD-Eval, we implement the following rules to narrow down the selection to only three fine-grained dimensions.
First, only dimensions that highly correlate with the "Overall" category (> 0.75) are considered. The intuition is that a high correlation with "Overall" indicates more influence of the fine-grained dimension on human annotators' overall impression of a dialogue. Second, we filter out dimensions with low inter-annotator agreement (< 0.6), because low inter-annotator agreement may suggest that the dimension is complex to evaluate and that human annotators have different understandings of it (Mehri et al., 2022). Lastly, we choose dimensions based on how often they are marked as "N/A" (not applicable) by the human judges. A high frequency indicates that the dimension is not generally applicable in different contexts. Most dimensions do not contain an "N/A" rating except "Error recovery", which has been marked "N/A" 25% of the time.
Based on these rules, we choose the following three dimensions: coherence, likability, and topic depth. Beyond the rules, we choose these dimensions because they are also widely studied in open-domain dialogue systems. Researchers spend a significant amount of effort on developing coherent, engaging, and knowledgeable chatbots (Adiwardana et al., 2020; Hedayatnia et al., 2020; Shuster et al., 2021). Designing meaningful metrics along these three dimensions can benefit current open-domain dialogue research. Though other dimensions, such as consistency (Nie et al., 2021), inquisitiveness (See et al., 2019), and long-term memory (Xu et al., 2022), are equally important, their evaluation deserves a thorough study on its own. Hence, we leave them for future work.

Problem Formulation
We formally define the dialogue-level evaluation task. Suppose that we have a dialogue evaluation dataset, D, which contains n human-chatbot dialogues, D = {d_1, d_2, ..., d_j, ..., d_n}. Each d_j is annotated by several human judges for a set of quality dimensions, Q. Each human judge provides a rating to d_j for each individual dimension q ∈ Q. We use r^q_{d_j} to denote the average Likert rating provided by all human annotators to d_j for q.
Our goal is to learn dimension-specific metrics, M_q(d_j) → s^q_{d_j}, where s^q_{d_j} is the metric score reflecting how good d_j is for dimension q as perceived by M_q. To assess the performance of M_q on D, the correlation, denoted as ρ^q, between S^q = {s^q_{d_1}, ..., s^q_{d_j}, ..., s^q_{d_n}} and R^q = {r^q_{d_1}, ..., r^q_{d_j}, ..., r^q_{d_n}} is calculated. A higher ρ^q indicates better performance of M_q on D.
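For concreteness, this meta-evaluation amounts to computing Spearman's rank correlation between the metric scores and the averaged human ratings. A minimal sketch in Python (the variable names and toy numbers are illustrative, not from our implementation):

```python
from scipy.stats import spearmanr

def evaluate_metric(metric_scores, human_ratings):
    """Spearman's rho between S^q (metric scores) and R^q (human ratings)."""
    rho, p_value = spearmanr(metric_scores, human_ratings)
    return rho, p_value

# Toy example: four dialogues scored by a metric and rated by humans.
rho, p = evaluate_metric([0.61, 0.19, 0.83, 0.40], [2.5, 1.0, 3.0, 2.0])
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```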

General Framework
We propose a multi-dimensional dialogue-level metric, FineD-Eval, which is a combination of three specialized sub-metrics, M_q, where q ∈ {coherence, likability, topic depth}. We explore two approaches for combining the sub-metrics: metric ensemble and multitask learning. Metric ensemble is a late fusion approach whereby the predictions made by the sub-metrics are combined. Multitask learning, on the other hand, is an early fusion approach whereby the sub-metrics share a common text encoder while having different output layers. Details of both approaches are discussed in §3.4. Here, we focus on the details of M_q. To train M_q, we formulate a preference learning approach (Fürnkranz and Hüllermeier, 2011). Given a pair of dimension-specific positive and negative training dialogue samples, (d^+_tr, d^-_tr), the following margin ranking loss is adopted to train the model:

L(x^q_1, x^q_2, y) = max(0, -y · (x^q_1 - x^q_2) + m),

where (x^q_1, x^q_2, y) can be either (s^q_+, s^q_-, 1) or (s^q_-, s^q_+, -1), s^q_+ and s^q_- denote the metric scores assigned to d^+_tr and d^-_tr respectively, and m is the margin. The pairwise ranking formulation is motivated by previous works on dialogue evaluation (Mesgar et al., 2020; Huang et al., 2020; Gao et al., 2020; Zhang et al., 2021a). Compared to direct assessment approaches (Zhang et al., 2021c; Ghazarian et al., 2022), the main advantage of pairwise ranking is that the model can implicitly learn the features that distinguish good dialogues from bad ones based on a large quantity of dialogue pairs for a specific quality dimension.
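This objective corresponds directly to PyTorch's built-in margin ranking loss. The sketch below is illustrative; the margin value 0.1 is an assumption rather than the hyperparameter used in our experiments.

```python
import torch
import torch.nn as nn

ranking_loss = nn.MarginRankingLoss(margin=0.1)  # computes max(0, -y*(x1 - x2) + m)

s_pos = torch.tensor([0.8, 0.6])  # scores assigned to positive dialogues d+_tr
s_neg = torch.tensor([0.3, 0.7])  # scores assigned to negative dialogues d-_tr
y = torch.ones_like(s_pos)        # y = 1: the first argument should rank higher

loss = ranking_loss(s_pos, s_neg, y)  # penalized whenever s_pos < s_neg + margin
```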
The network architecture of M_q is straightforward. RoBERTa-base (Liu et al., 2019) is adopted as the text encoder, T. Each dialogue is represented as a token sequence, with the special token "</UTT>" delimiting different utterances. A linear layer on top of T maps the encoded dialogue to a scalar value; during inference, given d_j ∈ D, the scalar value s^q_{d_j} output by M_q is the corresponding metric score.
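A minimal sketch of this architecture is given below. Pooling via the first (<s>) token is our assumption, as the pooling detail is not specified here; the code also registers "</UTT>" as an additional special token.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SubMetric(nn.Module):
    def __init__(self, encoder_name="roberta-base", num_extra_tokens=0):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        if num_extra_tokens:  # make room for the "</UTT>" delimiter token
            self.encoder.resize_token_embeddings(
                self.encoder.config.vocab_size + num_extra_tokens)
        self.score_head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]        # <s> token representation
        return self.score_head(pooled).squeeze(-1)  # scalar score s^q

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
n_added = tokenizer.add_special_tokens({"additional_special_tokens": ["</UTT>"]})
model = SubMetric(num_extra_tokens=n_added)
batch = tokenizer("Hi!</UTT>Hi there.</UTT>What's your favorite food?",
                  return_tensors="pt", truncation=True)
score = model(batch["input_ids"], batch["attention_mask"])
```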

Dimension-Specific Sampling Strategies
In this section, we discuss different strategies for obtaining dimension-specific training dialogue pairs. All (d^+_tr, d^-_tr) samples are automatically constructed from human-human dialogue datasets without reliance on human annotations.

Coherence (Coh) We consider two strategies for coherence. The first is utterance order shuffling, whereby dialogues from existing human-human dialogue corpora (Li et al., 2017; Dinan et al., 2020) are treated as d^+_tr. To obtain d^-_tr, we randomly permute the order of utterances in d^+_tr. This strategy has been widely adopted in previous dialogue coherence studies (Cervone et al., 2018; Mesgar et al., 2020; Zhang et al., 2021a).
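The shuffling strategy reduces to a few lines of code; the sketch below assumes dialogues with at least two utterances.

```python
import random

def make_coherence_pair(dialogue, rng=random):
    """dialogue: list of utterance strings from a human-human corpus."""
    d_pos = list(dialogue)                 # the original dialogue is d+_tr
    d_neg = rng.sample(d_pos, len(d_pos))  # a random permutation is d-_tr
    while d_neg == d_pos:                  # make sure the order actually changed
        d_neg = rng.sample(d_pos, len(d_pos))
    return d_pos, d_neg

pos, neg = make_coherence_pair(["Hi!", "Hi there.", "How are you?", "Good, thanks."])
```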
The second strategy, question-answer (QA) relevance scoring, is motivated by the Gricean maxims (Grice, 1975), whereby effective communication involves being relevant, i.e., one should provide information that is relevant to the current exchange. A natural and logical flow of conversation often involves asking and answering questions, which is a form of information exchange. Humans usually prefer answers that are straight to the point rather than those that are vague and off-topic. Concretely, we select dialogues in existing dialogue corpora that have more than 4 utterances and contain at least one question-answer pair. Next, we use a pre-trained BERT-based QA evaluator from HuggingFace to score each QA pair within a dialogue. The evaluator provides a relevance score between 0 and 1 (the higher the better). Then, we average the relevance scores of all QA pairs within the dialogue to derive the dialogue-level QA relevance score. Finally, two thresholds, (τ^rel_low, τ^rel_high), are heuristically determined to ensure sufficient data in both the positive and negative classes.
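The sampling logic can be sketched as follows. `score_qa_pair` is a stand-in for the pre-trained QA evaluator (assumed to return a relevance score in [0, 1]); the threshold values are those given later in the implementation details, and mapping high scores to positives is our reading of the strategy.

```python
TAU_REL_LOW, TAU_REL_HIGH = 0.85, 0.99   # (tau_rel_low, tau_rel_high)

def label_dialogue_by_qa_relevance(qa_pairs, score_qa_pair):
    """qa_pairs: list of (question, answer) tuples extracted from one dialogue."""
    scores = [score_qa_pair(q, a) for q, a in qa_pairs]
    dialogue_score = sum(scores) / len(scores)   # average over all QA pairs
    if dialogue_score >= TAU_REL_HIGH:
        return "positive"                        # coherent: d+_tr candidate
    if dialogue_score <= TAU_REL_LOW:
        return "negative"                        # incoherent: d-_tr candidate
    return None                                  # in-between dialogues are unused
```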
Likability (Lik) Two strategies are applied to construct d^+_tr and d^-_tr for likability. The first strategy, contradiction scoring, is motivated by the similarity-attraction effect (Byrne et al., 1968; Nass and Lee, 2001). During human-human interaction, people tend to favour others who share similar opinions or preferences with them. On the contrary, conveying contradictory opinions or information may lead to disagreement and user dissatisfaction.

To implement this strategy, we adopt a pre-trained natural language inference (NLI) model to provide contradiction scores (between 0 and 1) to adjacent utterance pairs within human-human dialogues. For a dialogue containing k utterances, we have k − 1 adjacency pairs, and thus k − 1 contradiction scores. The dialogue-level contradiction score is derived by computing the average of the k − 1 scores. Finally, two thresholds, (τ^contra_low, τ^contra_high), are heuristically determined to ensure sufficient data in both the positive and negative classes.

The second strategy is based on the number of utterances that carry positive emotions within a dialogue, which we hypothesize can serve as a proxy indicator of how much the interlocutors enjoy conversing with each other. Intuitively, if a user feels a dialogue system is likeable, they tend to produce more engaging responses. To implement the strategy, we adopt a pre-trained sentiment classification model and apply it to classify the sentiments w.r.t. all utterances within a dialogue. We treat dialogues of which all utterances are classified into the positive class as d^+_tr and those containing fewer than two positive utterances as d^-_tr.
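The second strategy can be sketched with a generic sentiment pipeline; the default HuggingFace sentiment model below is our stand-in for the classifier actually used.

```python
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # stand-in sentiment classifier

def label_dialogue_by_sentiment(utterances):
    results = sentiment(utterances)
    n_positive = sum(r["label"] == "POSITIVE" for r in results)
    if n_positive == len(utterances):
        return "positive"   # every utterance is positive -> d+_tr
    if n_positive < 2:
        return "negative"   # fewer than two positive utterances -> d-_tr
    return None             # dialogues in between are not used
```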
Topic Depth (Top) Discussing topics in depth is an important attribute of engaging conversations. During human-human interaction, when the interlocutors dive deeply into a topic, they tend to produce semantically diverse utterances, which convey a large amount of information. On the other hand, if an interlocutor is not interested in the topic, they tend to produce dull responses, such as "Ok", "Good to know", and "I don't know". Even though such responses can be appropriate in a wide range of contexts, they often do not convey much information (See et al., 2019). As most human-human dialogues are topically coherent, we can directly link topic depth to how semantically different the utterances are within a dialogue. Hence, we propose an entailment scoring strategy.

More specifically, given a dialogue of k utterances, a pre-trained NLI model is used to provide an entailment score for each utterance pair in the dialogue. In total, there are k(k − 1)/2 utterance pairs. The dialogue-level entailment score is the average of all utterance-pair entailment scores in the dialogue. Similarly, two thresholds, (τ^entail_low, τ^entail_high), are heuristically determined to ensure sufficient data in both the positive and negative classes.
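A sketch of the dialogue-level entailment scoring follows, with `entail_score` as a stand-in for the pre-trained NLI model (assumed to return the entailment probability); treating low mutual entailment as the positive, topically deep class is our reading of the strategy.

```python
from itertools import combinations

TAU_ENTAIL_LOW, TAU_ENTAIL_HIGH = 0.01, 0.10   # (tau_entail_low, tau_entail_high)

def label_dialogue_by_entailment(utterances, entail_score):
    """utterances: list of k >= 2 utterance strings; k(k-1)/2 pairs are scored."""
    pairs = list(combinations(utterances, 2))
    dialogue_score = sum(entail_score(u, v) for u, v in pairs) / len(pairs)
    if dialogue_score <= TAU_ENTAIL_LOW:
        return "positive"   # semantically diverse utterances -> in-depth dialogue
    if dialogue_score >= TAU_ENTAIL_HIGH:
        return "negative"   # mutually entailing, repetitive utterances -> dull
    return None
```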

Combining Dimension-Specific Metrics
Our analysis in §2 suggests that human evaluations across different quality dimensions are positively correlated. Therefore, a sub-metric that is specialized in evaluating one dimension can contribute to the evaluation of other dimensions as well. By combining different sub-metrics into a holistic one, we can achieve better correlations with human evaluation across different dimensions. We implement two FineD-Eval variants, FineD-Eval_en (metric ensemble) and FineD-Eval_mu (multitask learning).
Metric Ensemble Ensemble is a common technique adopted in machine learning to achieve better predictive performance than individual predictive models. In addition, it also helps improve model robustness by reducing the spread or dispersion of the predictions (Zhang and Ma, 2012).
In our case, FineD-Eval_en is expected to achieve a better ρ^q than M_q on D. Given d_j ∈ D, the three sub-metrics, M_coh, M_lik, and M_top, output three scores, s^coh_{d_j}, s^lik_{d_j}, and s^top_{d_j} respectively. The metric score of FineD-Eval_en, s^en_{d_j}, is obtained by computing the arithmetic mean of (s^coh_{d_j}, s^lik_{d_j}, s^top_{d_j}).
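The ensemble itself is just an arithmetic mean over the three sub-metric outputs, as in the sketch below (the function names are illustrative):

```python
def fined_eval_en(d_j, m_coh, m_lik, m_top):
    """m_*: callables mapping a dialogue to a scalar sub-metric score."""
    scores = {"coherence": m_coh(d_j),
              "likability": m_lik(d_j),
              "topic_depth": m_top(d_j)}
    s_en = sum(scores.values()) / len(scores)   # overall score s^en
    return scores, s_en                         # fine-grained scores + overall
```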
Multitask Learning In multitask learning, a model is trained simultaneously on multiple tasks, and a shared representation is learned to capture the commonalities among the related tasks (Crawshaw, 2020; Gao et al., 2022; Chen et al., 2021). Compared to FineD-Eval_en, the multitask model, FineD-Eval_mu, requires far fewer model parameters but can achieve similar performance.

Similarly, FineD-Eval_mu is also expected to achieve a better ρ^q than M_q on D. To implement FineD-Eval_mu, we first need to identify the related tasks for joint training. As described in §3.2, we have the preference learning tasks for M_coh, M_lik, and M_top respectively. Since the input and output of the three tasks are the same, we can adopt a hard-parameter-sharing network to simultaneously learn the three tasks. More specifically, the text encoder, T, is shared among the three tasks. On top of T, there are three independent linear layers with output size 1, which serve as the sub-metrics for coherence, likability, and topic depth respectively.
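A sketch of this hard-parameter-sharing design, reusing the encoder and pooling assumptions from the single-metric sketch above:

```python
import torch.nn as nn
from transformers import AutoModel

class FineDEvalMu(nn.Module):
    def __init__(self, encoder_name="roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # shared encoder T
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleDict({                            # task-specific heads
            dim: nn.Linear(hidden, 1)
            for dim in ("coherence", "likability", "topic_depth")
        })

    def forward(self, input_ids, attention_mask):
        pooled = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        # One scalar per dimension; their mean gives s^mu at inference time.
        # During training, each head receives its own ranking loss and the
        # three losses are summed: loss = loss_coh + loss_lik + loss_top.
        return {dim: head(pooled).squeeze(-1) for dim, head in self.heads.items()}
```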
During training, a mini-batch consists of data uniformly drawn from the three training data sources described in §3.3. The parameter update of T depends on all data instances in the mini-batch, while that of the three linear layers depends only on their corresponding task-specific input data. The losses of the three tasks are summed together. During inference, given d_j ∈ D, FineD-Eval_mu outputs three scalar values, s^coh_{d_j}, s^lik_{d_j}, and s^top_{d_j}, from the three linear layers respectively. Similar to metric ensemble, the final metric score, s^mu_{d_j}, is derived by taking the arithmetic mean of the three scores.

Experimental Setup

Training & Evaluation Datasets

For training, we prepare two datasets leveraging DailyDialog (DD) (Li et al., 2017) and ConvAI2 (CA) (Dinan et al., 2020). DailyDialog covers general day-to-day topics, such as school, work, and relationships. ConvAI2 is an extended version of PersonaChat (Zhang et al., 2018), which contains dialogues grounded by persona profiles. Detailed descriptions of DailyDialog and ConvAI2 are included in Appendix A. We choose DailyDialog and ConvAI2 because they cover a diverse set of topics and our baseline metrics (§4.2) are mainly trained on these two datasets.

Three benchmarks are adopted to assess the strength of the metrics: FED (Mehri and Eskenazi, 2020a), DSTC9-Interactive (Gunasekara et al., 2020), and Persona-Eval (See et al., 2019). The benchmarks' statistics are shown in Table 2 and their descriptions are presented in Appendix B. The definitions of the various quality dimensions of the benchmarks are listed in Table 11 and Table 12. All metrics are assessed with dialogue-level Spearman correlations w.r.t. each fine-grained dimension on the three benchmarks. Note that we do not consider inquisitiveness, consistency, and error recovery in the main analysis, because none of the FineD-Eval sub-metrics target these dimensions. Nevertheless, we show the metrics' performance for these three dimensions in the Limitations section.

Baselines
Two groups of metrics are adopted. The first comprises state-of-the-art turn-level metrics, including USL-H (Phy et al., 2020), MAUDE (Sinha et al., 2020), MDD-Eval (Zhang et al., 2021b), and D-score (Zhang et al., 2021c). Turn-level metrics need to rely on aggregation strategies when evaluating multi-turn dialogues. In this paper, we adopt mean aggregation, whereby the metric scores w.r.t. all chatbot turns in a dialogue are averaged to derive the single dialogue-level metric score. The second group includes DynaEval (Zhang et al., 2021a) and DEAM (Ghazarian et al., 2022), two state-of-the-art dialogue-level metrics. Detailed metric descriptions are outlined in Appendix C.
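For clarity, the mean aggregation applied to the turn-level baselines is sketched below; `turn_metric` stands in for any of USL-H, MAUDE, MDD-Eval, or D-score.

```python
def aggregate_turn_level(dialogue_turns, turn_metric):
    """dialogue_turns: list of (context, chatbot_response) pairs of one dialogue."""
    scores = [turn_metric(context, response) for context, response in dialogue_turns]
    return sum(scores) / len(scores)   # single dialogue-level score
```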

Implementation Details
The thresholds for the QA relevance strategy (τ^rel_low, τ^rel_high), the contradiction scoring strategy (τ^contra_low, τ^contra_high), and the entailment scoring strategy (τ^entail_low, τ^entail_high) are heuristically set to (0.85, 0.99), (0.20, 0.40), and (0.01, 0.10) respectively. These thresholds ensure that there are enough data instances within both the positive and negative classes.
Each experiment is repeated five times with different random seeds. Since we have prepared two training datasets, there are 5×2=10 variants for each of M_q, FineD-Eval_en, and FineD-Eval_mu. In §5, we report the average Spearman correlation scores across the 10 variants. Additional details of the training process, such as hyperparameters and model selection criteria, are included in Appendix D.

Experiments & Analysis
In this section, we conduct the main analysis based on the following research questions (RQ): (1) Are dialogue-level metrics better than turn-level metrics for multi-turn dialogue evaluation? (2) Do our proposed sub-metrics correlate well with human evaluation for their target dimensions? (3) Does combining different sub-metrics help achieve better correlations for different dimensions? (4) Does FineD-Eval offer more interpretable results? (5) How reliable are negative samples constructed with the sampling strategies in §3.3? Additional analyses are presented in Appendix E.

RQ 1. First, we can observe in Table 3 that all dialogue-level metrics perform significantly better than the turn-level metrics across different quality dimensions; this observation is in line with conclusions from previous works (Yeh et al., 2021; Zhang et al., 2021a). However, USL-H and D-score outperform DEAM on DSTC9-Interactive (Table 4) and Persona-Eval (Table 5) respectively. The good performance of USL-H and D-score may be attributed to the fact that both metrics are ensembles of multiple sub-metrics, whereas DEAM is a single-model metric. This supports our claim that combining fine-grained metrics yields a holistic one that achieves better correlation with human judgment. Nevertheless, FineD-Eval_en and FineD-Eval_mu, two dialogue-level metrics, outperform the turn-level metrics across all the dialogue-level benchmarks. We can conclude that, in general, dialogue-level metrics perform better than turn-level metrics for multi-turn dialogue evaluation.
RQ 2. In the sub-metrics section of Table 3, we present the result of each dimension-specific sub-metric on FED. We can observe that for coherence, understanding, and flexibility, M_Coh achieves the best performance in the sub-metrics group, with Spearman correlations of 52.86%, 52.35%, and 47.71% respectively. M_Lik achieves the best performance for likability and informativeness, with Spearman correlations of 52.23% and 49.89% respectively. For topic depth and diversity, M_Top performs the best among the three sub-metrics. The empirical results meet our expectation that the three sub-metrics target dimension groups 1, 2, and 3 in Table 1 respectively. For the coherence dimension, M_Coh outperforms DynaEval and DEAM, which are also designed for evaluating dialogue-level coherence. Moreover, M_Coh performs exceptionally well on DSTC9-Interactive (Table 4) and Persona-Eval (Table 5) and significantly outperforms the turn-level and dialogue-level baselines on both benchmarks. This showcases the advantage of our utterance shuffling and QA relevance scoring strategies for coherence modeling.
RQ 3. We can observe in Table 3 that combining different sub-metrics generally performs better than individual sub-metrics for various fine-grained dimensions. For example, M_Coh + M_Lik outperforms M_Coh for the coherence, understanding, and flexibility dimensions. It also outperforms M_Lik for the informativeness and likability dimensions. Furthermore, metrics in the combined group significantly outperform the sub-metrics as well as the various baselines for the overall impression dimension. These observations support our claim in the introduction that combining sub-metrics helps achieve better correlations for different quality dimensions.
In addition, FineD-Eval_en and FineD-Eval_mu achieve remarkable Spearman correlations of 65.47% and 66.15% respectively for the overall dimension on FED. Both outperform state-of-the-art metrics as well as the individual sub-metrics by a large margin. This observation showcases the effectiveness of metric ensemble and multitask learning for holistic automatic dialogue-level evaluation.
Similar observations can be made w.r.t. DSTC9-Interactive (Table 4 and Table 17). Overall, FineD-Eval_en and FineD-Eval_mu achieve state-of-the-art performance on FED, DSTC9-Interactive, and Persona-Eval. Compared to the existing state-of-the-art metrics on the three benchmarks (DEAM, USL-H, and D-score), FineD-Eval_en achieves around 16% relative improvement on average, while FineD-Eval_mu attains around 14% relative improvement on average. However, on Persona-Eval, the single metric M_Coh performs better than FineD-Eval_en and FineD-Eval_mu for all fine-grained dimensions except interestingness and enjoyment. A possible reason is that all the dimensions except interestingness and enjoyment are mostly related to coherence; sub-metrics targeting likability and topic depth may be irrelevant for evaluating those dimensions. Incorporating scores from irrelevant sub-metrics may reduce the correlation with the corresponding human scores.

Table 6: Human-Bot Dialogue for Analysis
Human: Hi!
Bot: Hi there.
Human: What's your favorite food?
Bot: I require only kebabs.
Human: Why's that?
Bot: Because my robot machinery breaks down, the meat in kebabs to give me energy, in a similar way that you need food to survive.
Human: Why are kebabs special?
Bot: That's just the way it is.
Human: Who are you?
Bot: I am System. I want to be your friend.
Human: Why?
Bot: Because it's good to have many friends.
Human: How many is too many, though
Bot: Oh. More than a few.

RQ 4. Table 6 is a case study demonstrating that FineD-Eval provides more interpretable results than existing dialogue-level metrics. DEAM and DynaEval assign metric scores of 0.9935 and 0.2325 to the dialogue respectively. Both metrics only partially capture the dialogue quality: the DEAM score reflects the degree of coherence, while the DynaEval score reflects the overall quality. However, even though the dialogue is coherent, human judges do not like the chatbot (0.3 likability rating), and the topics discussed in the dialogue are also not in depth (0.2 topic depth rating). These aspects are captured by neither DEAM nor DynaEval. On the contrary, both FineD-Eval_en and FineD-Eval_mu can assign fine-grained scores that capture these aspects. The FineD-Eval_en metric scores for coherence, likability, topic depth, and overall impression are 0.6123, 0.1865, 0.0632, and 0.2874 respectively. In this sense, the FineD-Eval variants are more interpretable than existing metrics, because they help dialogue researchers know exactly which dialogue aspect they should improve.

QA Relevance (Incoherence)
A: Oh, they're both so beautiful. Let me have this one, I think.
B: That one truly is a beautiful piece of work, isn't it?
A: One last question.
B: Oh, no. Everything we sell here is 'as is'.

Contradiction Scoring (Dislikability)
A: We have a special on these skirts this week. Would you like to try one on?
B: No, thank you. I don't need any skirts.
A: How about a blouse? This one here is the latest fashion.
B: No, thank you.

Entailment Scoring (Dullness)
A: All right, so I'll see you then.
B: I'll call you later.
A: Okay, I'll talk to you later then.
B: See you later.
A: Bye.

Number of Positive Utterances (Dislikability)
A: Is it okay to have a day off next week?
B: Why? What's the problem?
A: I need to go to the dentist.
B: Okay, I'll get Bob to cover you.

Related Work
Evaluation is a long-standing problem in dialogue system research (Deriu et al., 2021; Yeh et al., 2021; Mehri et al., 2022; Smith et al., 2022). In open-domain dialogue evaluation, Liu et al. (2016) show that commonly-adopted metrics, such as BLEU (Papineni et al., 2002), can be misleading due to their poor correlations with human judgment. Recently, interest in the automatic evaluation of open-domain dialogue systems has intensified with the introduction of the reference-free model-based evaluation paradigm. Most such metrics focus on turn-level response quality (Tao et al., 2018; Ghazarian et al., 2019; Huang et al., 2020; Sai et al., 2020; Sinha et al., 2020; Zhang et al., 2021b). Despite their promising correlations with human evaluation, such metrics are insufficient for dialogue-level assessment. Our FineD-Eval targets dialogue-level evaluation specifically.
In addition, existing works on model-based dialogue-level metrics (Li et al., 2021; Zhang et al., 2021a; Ghazarian et al., 2022; Zhao et al., 2022) focus very much on a single quality dimension. On the contrary, FineD-Eval is capable of multi-dimensional evaluation and can provide more fine-grained and interpretable scores.
The idea of decomposing overall dialogue quality into fine-grained dimensions has been explored in prior works (Mehri and Eskenazi, 2020b; Phy et al., 2020; Pang et al., 2020; Zhang et al., 2021c) for turn-level evaluation. However, its application to dialogue-level evaluation is under-explored; our work serves to bridge this gap.

Conclusion
In this paper, we propose FineD-Eval, a multi-dimensional dialogue-level evaluation metric. FineD-Eval consists of three specialized sub-metrics, which target three fine-grained dialogue quality dimensions: coherence, likability, and topic depth. Each specialized sub-metric is trained with a pairwise ranking objective on dialogue pairs that are curated according to the corresponding dimension-specific strategies. Two variants of FineD-Eval are proposed to combine the sub-metrics into a holistic metric: one based on metric ensemble and the other on multitask learning. We have empirically demonstrated that FineD-Eval correlates strongly with human evaluation for different dialogue quality dimensions and exhibits strong generalization across different evaluation datasets.

Limitations
We have identified two limitations that need to be addressed in future work.
First, we can observe in Table 4 that the correlation scores of all the dialogue-level metrics, including FineD-Eval_en and FineD-Eval_mu, are much lower than those in Table 3. There are two major reasons. The first is the longer dialogues in DSTC9-Interactive compared to FED (28.13 vs 12.72 utterances per dialogue). Existing metrics do not have effective mechanisms to handle long dialogues. They often adopt BERT-based language models (Devlin et al., 2019; Liu et al., 2019) as text encoders. As a result, longer dialogues are truncated to satisfy the input length and GPU memory constraints. Some information that is beneficial for dialogue-level evaluation may be lost due to truncation. In the future, we should explore more sophisticated text encoders to model long dialogues. In addition, FineD-Eval should incorporate mechanisms to pinpoint the information most relevant or important to evaluation within long dialogues, such as a dialogue breakdown detection module. The second reason is that dialogues in DSTC9-Interactive contain much more noise than those in FED. Human judges find it difficult to evaluate the dialogues, resulting in low inter-annotator agreements w.r.t. different fine-grained dimensions. The inter-annotator agreements for different dimensions range between 0.56 and 0.58 in terms of Spearman correlations. On the contrary, the quality of FED dialogues is better, and the inter-annotator agreements of most dimensions are above 0.8. Besides designing more robust metrics, future work should also explore developing more high-quality dialogue-level evaluation benchmarks.
Second, as stated in §2, fine-grained quality dimensions such as consistency, error recovery, and inquisitiveness are not covered by FineD-Eval. Hence, we do not report the performance of FineD-Eval on these dimensions in the main analysis. For completeness, we present the performance of FineD-Eval for the missing dimensions on the three benchmarks in Table 8. We can observe that the correlations of both FineD-Eval_en and FineD-Eval_mu for these three dimensions are not as high as those for the other dimensions, such as likability, topic depth, and coherence. This observation is expected, as we do not have dedicated sub-metrics to model consistency, error recovery, and inquisitiveness. Hence, the dimensions missing from FineD-Eval are worth a thorough future study on their definitions, application scenarios, and metric designs.

Table 8: Additional results on the three benchmarks. "Con", "Inq", and "Err" denote "consistency", "inquisitiveness", and "error recovery" respectively. DEAM is the best dialogue-level baseline on all datasets. USL-H is the best turn-level baseline on DSTC9-Interactive, while D-score is the best turn-level baseline on FED and Persona-Eval.

A Dialogue Corpora
The two dialogue corpora for constructing our training/validation datasets are outlined below. Their detailed statistics are presented in Table 9. Table 10 shows the number of positive and negative dialogues constructed with each strategy (described in §3.3) for different data splits.

ConvAI2 (Dinan et al., 2020) is an extended dataset of Persona-Chat (Zhang et al., 2018). Dialogues in ConvAI2 are grounded by the personas of the interlocutors. The two interlocutors in a dialogue play the roles described by the corresponding personas. Each persona contains at least 5 role description sentences. Throughout the dialogue, the two interlocutors try to be engaging, to know each other, and to find their mutual interests. In total, there are 1155 possible personas for training. Topic shifts are common in ConvAI2 dialogues, as the interlocutors continually introduce new information about themselves during their interaction.

B Evaluation Benchmarks
FED (Mehri and Eskenazi, 2020a) consists of 125 dialogues, among which 40 are collected between a human and the Meena chatbot (Adiwardana et al., 2020), 44 between a human and the Mitsuku chatbot, and the remaining 41 are human-human dialogues. Each dialogue is annotated by five human judges for 11 different quality dimensions: coherence, error recovery, consistency, diversity, topic depth, likability, understanding, flexibility, informativeness, inquisitiveness, and overall impression. The definition of each dimension is outlined in Table 11. The ratings of all the dimensions are based on a 1-3 Likert scale, except that consistency scores range from 0 to 1 and overall scores range from 1 to 5. The inter-annotator agreements for all dimensions are above 0.8 in terms of Spearman correlations, except consistency (0.562), diversity (0.789), and inquisitiveness (0.769).

Coherence: Throughout the dialogue, is the system maintaining a good conversation flow?
Error Recovery: Throughout the dialogue, is the system able to recover from errors that it makes?
Consistency: Throughout the dialogue, is the system consistent in the information it provides?
Diversity: Throughout the dialogue, does the system provide a diverse range of responses?
Topic Depth: Throughout the dialogue, does the system discuss topics in depth?
Likability: Throughout the dialogue, does the system display a likeable personality?
Understanding: Throughout the dialogue, does the system understand the user?
Informativeness: Throughout the dialogue, does the system provide unique and non-generic information?
Flexibility: Throughout the dialogue, is the system flexible and adaptable to the user and their interests?
Inquisitiveness: Throughout the dialogue, does the system actively ask the user questions?
Overall Impression: The overall quality and user satisfaction of the dialogue.
Table 11: Definitions of the eleven dialogue quality dimensions of FED (Mehri and Eskenazi, 2020a) and DSTC9-Interactive (Gunasekara et al., 2020). The definitions are adapted from Mehri et al. (2022).

100 human judges. The definitions of the eight dimensions are listed in Table 12.

C Metrics
USL-H (Phy et al., 2020) stands for Understandability, Sensibleness, and Likability in Hierarchy. It measures the overall quality of a dialogue response based on a configurable composite function of three scores, which correspond to the three quality dimensions respectively. Understandability refers to the naturalness of a response; a BERT-base valid utterance prediction model (BERT-VUP) is trained to predict whether a response is syntactically well-formed or not. Sensibleness denotes the contextual relevance of a response; a BERT-base next utterance prediction model (BERT-NUP) is trained to assess sensibleness. Likability quantifies how likeable a response is for a particular task and can be configured to adapt to the end evaluation task. In Phy et al. (2020), specificity is applied as the proxy of likability, which is measured with a BERT-base masked language model (BERT-MLM). The USL-H metric is trained on DailyDialog.

MAUDE (Sinha et al., 2020) is a reference-free metric tailored for online dialogue evaluation. MAUDE leverages DistilBERT (Sanh et al., 2019) to extract latent representations of utterances and captures the temporal transitions that exist between them. The authors propose different data augmentation techniques to augment both the positive and negative responses. For positive response augmentation, back-translation and a sequence-to-sequence generative model are used to generate positive response variants. For negative response augmentation, word drop, word repeat, and word order shuffle are proposed to create syntactically negative responses. Random utterance selection is adopted to generate semantically negative responses. MAUDE is trained in a contrastive manner with a noise contrastive estimation (NCE) loss on the ConvAI2 dataset.

MDD-Eval (Zhang et al., 2021b) is a reference-free metric for evaluating response appropriateness. MDD-Eval specifically targets multi-domain turn-level evaluation. It relies on data augmentation techniques and a self-training setup to improve generalization across different dialogue domains.

Long dialogues are split into sub-dialogues such that each sub-dialogue contains fewer than 10 utterances. This splitting procedure avoids too much padding in a mini-batch when a long dialogue is present, so the GPU memory can be better utilized during training. Note that we only apply this splitting procedure during model training, not during the dialogue evaluation process.
at the 119th position out of 125 dialogues. FineD-Eval_en, FineD-Eval_mu, DynaEval, and DEAM rank it at the 113th, 120th, 70th, and 78th positions respectively. We can observe that in both cases, FineD-Eval_en and FineD-Eval_mu correlate strongly with human evaluation compared to existing state-of-the-art metrics, including DynaEval and DEAM. The examples also demonstrate that dialogues receiving high overall impression scores are generally good in terms of coherence, likability, and topic depth, whereas those perceived as low-quality dialogues by human judges also receive low scores for coherence, likability, and topic depth.

Table 15: A low-quality dialogue example from FED. Human judges score it at 1.5/3.0, 0.0/3.0, 0.9/3.0, and 1.5/5.0 for coherence, likability, topic depth, and overall impression respectively.

E.3 Ablation Study
As described in §3.3, we have five different sampling strategies for the three fine-grained quality dimensions. In this section, we show the impact of each strategy on metric performance in Table 16. It can be observed that all the sampling strategies work as expected. Metrics that adopt the "utterance shuffling" or "QA relevance" strategies exhibit better correlations for coherence and understanding than for other fine-grained dimensions. The metric using the "entailment scoring" strategy performs better for topic depth and diversity. The "contradiction scoring" and "#utterances with positive emotions" strategies contribute the most to the likability and informativeness dimensions. M_Coh, which leverages both "utterance shuffling" and "QA relevance", outperforms metrics that rely on only one of the two strategies. Similarly, M_Lik, which combines the strengths of both the "contradiction scoring" and "#utterances with positive emotions" strategies, performs the best for likability and informativeness. However, the metric that leverages only the "contradiction scoring" strategy outperforms M_Lik for other fine-grained dimensions, such as coherence and topic depth. This showcases that the "contradiction scoring" strategy can also contribute to the evaluation of these dimensions.

E.4 Additional Results
In Table 17 and Table 18, we show the full results of different metrics on the DSTC9-Interactive and Persona-Eval benchmarks respectively. We can observe that most of the baselines perform poorly, except USL-H, D-score, and DEAM. A possible reason is that these three metrics capture dialogue features from different perspectives rather than focusing on only a single aspect. USL-H and D-score are ensembles of multiple sub-metrics, while DEAM relies on four different AMR-based dialogue-level perturbation strategies that help the model spot semantic errors, including contradiction, irrelevancy, decreased engagement, and coreference inconsistency. Further, M_Coh performs substantially better than M_Lik and M_Top across all the fine-grained dimensions on both DSTC9-Interactive and Persona-Eval. The reason may be that the annotations on these two datasets are biased. For FED, there are five annotators for each dialogue and the inter-annotator agreements are strong across different dimensions; hence, the annotation quality is very high. On the contrary, for DSTC9-Interactive, there are only three annotators per dialogue and the inter-annotator agreements across different dimensions are moderate. For Persona-Eval, there is only one annotator per dialogue. Hence, the annotations on DSTC9-Interactive and Persona-Eval may be biased towards dialogue features that are associated with coherence. The QA relevance and utterance shuffling strategies used by M_Coh better capture such features than the other strategies. Moreover, on DSTC9-Interactive, the combined metric M_Coh + M_Lik performs the best.
FineD-Eval_en and FineD-Eval_mu perform gener-

Figure 1: Spearman correlations of dimension pairs on FED.

Table 1: Grouping of the dimensions. We adopt the first three letters of the representative dimension within each group as the corresponding group name.

Table 2: Statistics of the three evaluation benchmarks. "P-Eval" refers to the Persona-Eval benchmark.

Table 3: Spearman correlations (%) of different metrics on FED. Coh, Und, Fle, Lik, Inf, Top, Div, and Ove denote coherence, understanding, flexibility, likability, informativeness, topic depth, diversity, and overall impression respectively. The scores w.r.t. the best performing metric for each quality dimension are highlighted in bold. Statistically insignificant scores (p > 0.05) are italicized.

Table 5: Spearman correlations (%) on Persona-Eval. "AvoidRep" denotes avoid repetition. D-score and DEAM are the best turn-level and dialogue-level baselines respectively. M_Coh is the best performing sub-metric. Full results can be found in Table 18.

RQ 5. Table 7 presents example human-human dialogues that are considered as negative samples. We examine them one by one to validate the reliability of our sampling strategies. First, in the incoherent dialogue obtained by the QA relevance strategy, we can observe that when speaker B asks "isn't it" and expects speaker A to acknowledge, A instead replies with "One last question.", which disrupts the coherent flow of the dialogue to a certain extent.

Table 7: Negative human-human dialogue examples obtained with the QA relevance, contradiction scoring, entailment scoring, and sentiment strategies.

Table 9: Human-Human Dialogue Corpora Statistics.

Table 10: Training Data Statistics. "/" is used to separate the numbers of positive and negative dialogues. For each data split and each strategy, the maximum number of dialogue pairs is the number of positive dialogues multiplied by the number of negative dialogues. Due to the large number of combinations, we only sample 100K and 10K pairs for training and validation respectively.

Table 13: Comparison of the four proxy scores to the proposed model-based sub-metrics of FineD-Eval. The results are Spearman correlations (%) on FED. The best score in each column is highlighted in bold. Statistically insignificant (p > 0.05) scores are italicized.

Bot: Hello. How's your week coming along?
Human: It's great, thanks. I'm trying to learn how to make croissants.
Bot: Wow that's interesting. I have baked cookies, but croissants seem much more sophisticated. Did you make any progress?
Human: I've done them once or twice so far, but they haven't been flakey enough. I'm trying to figure out why. What kind of cookies have you made?
Bot: Mint chocolate chips. I think your croissants not being flakey could have something to do with your oven's temperature.
Human: Ah, good thought, thanks!
Bot: Have you thought about melting some chocolate into your croissants? They don't have to be something unhealthy. For example, melted dark chocolate is good for the heart, and makes the resulting croissants taste much better.

Table 14: A high-quality dialogue example from FED.