Assessing Dialogue Systems with Distribution Distances

An important aspect of developing dialogue systems is how to evaluate and compare the performance of different systems. Existing automatic evaluation metrics are based on turn-level quality evaluation and use average scores for system-level comparison. In this paper, we propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations. Specifically, two distribution-wise metrics, FBD and PRD, are developed and evaluated. Experiments on several dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics.


Introduction
Dialogue generation is a special text generation task that has drawn increasing attention in the natural language processing community. It is widely agreed that a single input query is often associated with multiple valid responses, which is termed the 1-to-n relationship between a query and its responses (Vinyals and Le, 2015; Zhou et al., 2017; Zhao et al., 2017; Liu et al., 2018; Chen et al., 2021; Chan et al., 2021; Gao et al., 2021). This relationship increases the difficulty of automatically evaluating the performance of dialogue systems.
In general, previous evaluation metrics mainly focus on turn-level quality. For example, unsupervised word-overlapping or embedding-based metrics (Papineni et al., 2002; Lin, 2004; Mitchell and Lapata, 2008; Zhang et al., 2020) calculate the similarity or alignment between generated responses and reference responses, which is not well-suited for open-ended dialogue tasks. Learned classification or regression systems (Lowe et al., 2017; Tao et al., 2018; Sellam et al., 2020; Ghazarian et al., 2019) are corpus-dependent because they require additional task-specific training or tuning, and thus run the risk of assigning a lower score to a better model in overfitting or underfitting cases.
* Equal contribution. Work was done during internship at Tencent AI Lab.
In this paper, we provide a new perspective: the distribution distance between generated conversations and real conversations can be used to measure the performance of dialogue systems. There are three main contributions: (1) We are the first to propose unsupervised distribution-wise metrics (i.e., FBD and PRD) to address the evaluation issue in this field. (2) The experimental results show that the proposed distribution-wise metrics perform well. In particular, FBD achieves compelling performance on most evaluation corpora, which suggests a promising direction for designing evaluation metrics. (3) We collect the typical evaluation corpora and existing evaluation metrics in order to better assess the performance of dialogue systems, which could be useful for researchers in this community.1

Related Work
In this section, we focus on unsupervised automatic evaluation metrics for dialogue system evaluation. In general, existing unsupervised metrics mainly measure turn-level quality and can be categorized into two main classes: word-overlapping metrics and embedding-based metrics.
Word-overlapping Metrics Such metrics quantify the amount of word overlap between a generated response and reference responses. For example, BLEU (Papineni et al., 2002) calculates the geometric mean of n-gram precision. ROUGE (Lin, 2004) is a recall-oriented metric. METEOR (Banerjee and Lavie, 2005) computes the harmonic mean of precision and recall with stemming and synonyms.
Embedding-based Metrics Embedding-based metrics align the generated response and the reference in a latent semantic space. Some adopt the vector similarity of sentence embeddings as a quality measure. For example, Embedding Average (Foltz et al., 1998; Mitchell and Lapata, 2008) calculates sentence-level embeddings by averaging word representations. Vector Extrema (Forgues et al., 2014) computes sentence-level embeddings by taking the most extreme value for each dimension over all word vectors. Others adopt more fine-grained semantic matching. For example, Greedy Matching (Rus and Lintean, 2012) greedily matches each word in a generated response to a word in the reference response, and the final score is defined as the average of the word-level similarity scores. Zhang et al. (2020) introduced a better embedding-based metric, BERTScore, which computes word similarity using contextual embeddings from pre-trained language models.
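The pooling and matching strategies above can be sketched in a few lines of NumPy; this is a minimal illustration of the three classical embedding-based metrics, with function names of our own choosing:

```python
import numpy as np

def embedding_average(word_vecs):
    """Sentence embedding as the mean of word vectors (Embedding Average)."""
    return word_vecs.mean(axis=0)

def vector_extrema(word_vecs):
    """Per dimension, keep the value with the largest magnitude (Vector Extrema)."""
    idx = np.abs(word_vecs).argmax(axis=0)
    return word_vecs[idx, np.arange(word_vecs.shape[1])]

def greedy_match(gen_vecs, ref_vecs):
    """Each generated word greedily matches its most similar reference word;
    the score is the average of the best cosine similarities (Greedy Matching)."""
    gen = gen_vecs / np.linalg.norm(gen_vecs, axis=1, keepdims=True)
    ref = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    return (gen @ ref.T).max(axis=1).mean()
```

In practice the symmetric variant of Greedy Matching averages the score in both directions (generated-to-reference and reference-to-generated).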
Our proposed methods are best placed in the literature of embedding-based metrics. However, there are two main differences from previous metrics in this field: (1) We compute the distribution distance between embedding sets as the system-level performance of a dialogue system, which does not require task-specific training/tuning; (2) We propose to extract sentence-level semantic representations directly from pre-trained language models (Devlin et al., 2019), with no operation of converting word-level embeddings to sentence-level embeddings.

Proposed Methods
Given a collection of sentence pairs {(x_i, y_i)}_N, we assume that the corresponding semantic representations {v_i}_N can be extracted as v_i = LM([x_i, y_i]), where LM(·) refers to a pre-trained language model (e.g., Devlin et al., 2019; Yang et al., 2019; Clark et al., 2020) and [·, ·] refers to the concatenation operation. Intuitively, the distance d(R, G) between the distribution R of real samples and the distribution G of generated samples can be used to measure the performance of a dialogue system. We therefore propose two distribution-based methods to automatically evaluate dialogue systems, which are presented in detail in this section.
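A minimal sketch of this extraction step, assuming the HuggingFace transformers library and a BERT checkpoint (the function name is ours; encoding the two sentences as a pair corresponds to the concatenation [x_i, y_i]):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def pair_representation(query: str, response: str) -> torch.Tensor:
    # Encoding the pair jointly concatenates the two sentences with [SEP];
    # the final hidden state at the [CLS] position serves as the pair's vector v_i.
    inputs = tokenizer(query, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]  # [CLS] position
```

No fine-tuning is involved: the language model is used purely as a frozen feature extractor.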

Fréchet Bert Distance
Semantic representations {v_i}_N are extracted by a pre-trained language model, which encodes the contextual information of the sentences. The main intuition is that, for a successful system, the distribution of semantic representations of generated sentences should be as close as possible to the distribution of semantic representations of real sentences. To measure this, we assume that such semantic representations follow a multi-dimensional Gaussian, which can be characterized by its mean and covariance. The difference between the two Gaussians (from generated and real sentence pairs) is measured by the Fréchet distance (Dowson and Landau, 1982). We call the Fréchet distance between the distribution R with mean and covariance (µ_r, Σ_r) obtained from real sentence pairs and the distribution G with mean and covariance (µ_g, Σ_g) obtained from generated sentence pairs the "Fréchet Bert Distance" (FBD), which is formulated as:

FBD(R, G) = ||µ_r − µ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2)).

The closer the distribution of generated data is to the distribution of real data, the lower the FBD score. Such a distance (Heusel et al., 2017) has been widely verified for Generative Adversarial Networks (GANs) in computer vision tasks (Karras et al., 2017; Zhang et al., 2018a; Park et al., 2019), where it is consistent with increasing disturbances and with human judgment. We observe that FBD also works well for evaluating open-ended dialogue systems.
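Concretely, the FBD can be computed from the two sets of extracted feature vectors with NumPy and SciPy, mirroring the standard FID computation of Heusel et al. (2017); the function name is ours:

```python
import numpy as np
from scipy import linalg

def fbd(real_feats, gen_feats):
    """Fréchet distance between Gaussians fit to two (n, d) feature matrices."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the product of the two covariance matrices.
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical noise
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

Identical feature sets yield a distance of (numerically) zero, and the score grows as the generated distribution drifts away from the real one.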

Precision-Recall Distance
We note that FBD relies on the estimated Gaussian parameters (µ, Σ). An alternative strategy avoids estimating these parameters. Inspired by Sajjadi et al. (2018), we apply a precision-recall-based method, named Precision-Recall Distance (PRD), to evaluate the distance between two distributions.
The key intuition is that precision should measure how much of G can be generated by a "part" of R, while recall should measure how much of R can be generated by a "part" of G. In general: (a) if R is bimodal and G only captures one of the modes, we should have perfect precision but only limited recall; (b) in the opposite case, we should have perfect recall but only limited precision; (c) if R = G, we should have perfect precision and recall; (d) if the supports of R and G are disjoint, we should have zero precision and recall. Following Sajjadi et al. (2018), the PRD curve is evaluated at a set of ratios Λ = {tan(iπ / (2(m + 1))) | i = 1, ..., m}, where m ∈ N is a given angular resolution. Better dialogue systems therefore achieve higher PRD scores.
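A minimal sketch of the PRD curve computation, following the discrete formulation of Sajjadi et al. (2018). In practice the continuous representations are first quantized into histograms over a shared set of bins (e.g., via k-means clustering), which we take as given here:

```python
import numpy as np

def prd_curve(ref_hist, gen_hist, num_angles=1001):
    """Attainable (precision, recall) pairs between two discrete distributions.

    ref_hist and gen_hist are histograms over the same bins, each summing to one.
    """
    # The angular resolution sweeps the ratio lambda over (0, infinity).
    angles = np.linspace(1e-10, np.pi / 2 - 1e-10, num_angles)
    lambdas = np.tan(angles)
    # alpha(lambda) = sum_w min(lambda * p_ref(w), q_gen(w));  beta = alpha / lambda
    precision = np.minimum(lambdas[:, None] * ref_hist[None, :],
                           gen_hist[None, :]).sum(axis=1)
    recall = precision / lambdas
    return np.clip(precision, 0, 1), np.clip(recall, 0, 1)
```

A scalar summary can then be taken from the curve, e.g., the maximum F1 over all (precision, recall) pairs.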

Datasets & Systems
To verify the two proposed metrics, we conduct experiments on six public dialogue corpora.
Baseline Metrics. We mainly compare with several widely-used metrics in the text generation field: a) three word-overlapping metrics: BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Denkowski and Lavie, 2014); b) four embedding-based metrics: Greedy Matching (Rus and Lintean, 2012), Embedding Average (Wieting et al., 2015), Vector Extrema (Forgues et al., 2014), and BERTScore (Zhang et al., 2020). None of these metrics require task-specific training.
Datasets. We collect three recently released evaluation corpora, which consist of dialogue query and response samples from different systems together with the corresponding human annotations:
• Persona(M): USR (Mehri and Eskénazi, 2020) built an evaluation corpus based on PersonaChat (Zhang et al., 2018b), in which the outputs of four systems and the corresponding human evaluation scores were collected.
For each sentence pair (x_i, y_i), we use the last hidden output of [CLS] as its semantic representation, without tuning or training the language models. To assess the system-level performance of dialogue systems, we calculate the Spearman and Pearson correlations between the rankings of human evaluation and the rankings of evaluation metrics. If an evaluation metric is designed for turn-level evaluation, we average all turn-level scores as the performance of the corresponding dialogue system.
Public Resources. All the compared evaluation corpora and evaluation metrics are listed in Table 1. When official implementations are not available, we use the repositories with the most stars on GitHub. The details of each evaluation corpus, including the number of samples and the dialogue systems compared in each corpus, are presented in Table 2.
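The system-level correlation step can be computed with SciPy; the numbers below are hypothetical scores for four dialogue systems, not values from the paper. Note that for distance metrics such as FBD, lower is better, so scores are negated before correlating with human ratings:

```python
from scipy import stats

# Hypothetical system-level scores for four dialogue systems.
human_ratings = [3.9, 3.4, 2.8, 2.1]   # average human quality judgments
fbd_scores = [1.2, 1.5, 1.9, 2.6]      # FBD: lower means closer to real data

# Negate the distances so that higher is better, then correlate.
spearman, _ = stats.spearmanr(human_ratings, [-s for s in fbd_scores])
pearson, _ = stats.pearsonr(human_ratings, [-s for s in fbd_scores])
```

Spearman compares only the rankings of the systems, while Pearson is sensitive to the linearity of the relationship between the two score lists.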

Results
We compute the system-level correlation between all automatic metrics and the quality ratings using Spearman and Pearson correlation coefficients. The performances of various evaluation metrics on different public corpora are reported in Table 3. Our two proposed metrics (i.e., FBD and PRD) show comparable performance to the baseline metrics. In particular, FBD_R achieves compelling performance on five corpora, which indicates good generalization ability and robustness across corpora. In addition, most evaluation metrics are sensitive to the evaluation corpus. For example, BLEU performs well on Convai2 but fails on Empathetic. Similarly, BERTScore_B performs well on Convai2 and Persona(Z) but fails on Daily(H) and Empathetic. This indicates that the selection of evaluation corpora has a great influence on the assessment of evaluation metrics; hence, it is better to compare metrics on multiple corpora. In terms of robustness, our proposed FBD_R clearly outperforms the existing evaluation metrics.
In Figure 1, we compare the evaluation metrics from the perspective of the individual evaluation corpora, where the results of the BERT and RoBERTa language models are averaged. The comparison confirms the superiority of the FBD metric. Compared to USR (Mehri and Eskénazi, 2020) (1.000/.820 on Persona(M)), a reference-free metric that relies on task-specific training/tuning with task-specific data, our proposed methods are comparable without any training/tuning. We therefore believe that distribution-wise metrics are a promising direction for assessing dialogue systems in this field.
As shown in Figure 2, we average the performance of each metric over all evaluation corpora. Our proposed FBD has a higher expected performance, outperforming BERTScore with different language models. The large models do not show improvements in average performance over the base models. In general, the FBD metric achieves better Spearman and Pearson correlations than PRD. Surprisingly, RoBERTa-based metrics, including BERTScore and the proposed FBD and PRD, perform better than the corresponding BERT-based ones. Given that our FBD metric relies on the assumption of a multivariate Gaussian distribution, we hypothesize that the semantic representations extracted by the RoBERTa model fit a Gaussian distribution better than those from the BERT model. To verify this point, we conduct a normality test, as shown in Table 4, where smaller values lead to the rejection of normality whereas a value of one indicates normality of the data.
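Such a normality check can be reproduced with, for instance, the Shapiro-Wilk test from SciPy, whose W statistic approaches one for Gaussian data; this is a sketch on synthetic data and we do not claim it is the exact test behind Table 4:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
gaussian_feats = rng.normal(size=500)      # stand-in for well-behaved representations
skewed_feats = rng.exponential(size=500)   # clearly non-Gaussian features

# The Shapiro-Wilk W statistic is close to one for normal data and drops
# for distributions that deviate from normality.
w_gauss, _ = stats.shapiro(gaussian_feats)
w_skew, _ = stats.shapiro(skewed_feats)
```

In the multivariate case, such a univariate test is typically applied per dimension (or to projections of the features) and the resulting statistics are aggregated.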

Fine-Grained Performances
The evaluation corpus Daily(Z) (Zhao et al., 2020) provides four fine-grained human evaluation scores (relevance, grammar, content, and overall), which can be used to gain more insight into the different evaluation metrics. As shown in Table 5, our proposed metric FBD_R achieves the best performance on most aspects in the fine-grained comparison. This indicates that the distribution-wise metric correlates better with human judgments on various aspects.

Conclusions
In this paper, we propose to measure the performance of a dialogue system by computing the distribution-wise difference between its generated conversations and real-world conversations. Specifically, two distribution-wise metrics, FBD and PRD, are developed on pre-trained language models. Experiments on six public dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics.