Enhancing the Open-Domain Dialogue Evaluation in Latent Space

The notorious one-to-many nature of open-domain dialogues poses huge challenges for automatic evaluation methods. Recent studies attempt to mitigate this issue by considering the similarity of the generated response with the conversational context and designing discriminative models that learn from multiple positive responses. Despite the promising results, these methods cannot be applied to general scenarios where training data with multiple responses is unavailable. To this end, in this paper, we propose a self-supervised setting to obtain a smooth latent space that can both capture discourse-level context information and implicitly model more references in the latent space. Specifically, we present EMS, an Enhanced dialogue evaluation Metric in latent Space. Experimental results on two real-world dialogue datasets confirm the superiority of our method for open-domain dialogue evaluation, where both Pearson and Spearman correlations with human judgments outperform all baselines.


Introduction
With the surge of deep learning techniques, generation-based open-domain dialogue systems have witnessed significant improvement in recent years. Plenty of novel and effective models (Sutskever et al., 2014; Li et al., 2015; Zhao et al., 2017; Gu et al., 2018; Qiu et al., 2019; Chan et al., 2019b; Wolf et al., 2019; Hu et al., 2019) have been proposed and have greatly promoted the development of open-domain dialogue generation. Unlike the endless emergence of novel methods, however, there is still no meaningful and widely accepted automatic evaluation metric for dialogue generation. Automatic evaluation allows quick and effective comparison between different systems and is crucial for the development of natural language generation (NLG) tasks (Dathathri et al., 2019; Gu et al., 2019; Chan et al., 2019a, 2020). The lack of meaningful automatic evaluation metrics has become a significant impediment to open-domain dialogue generation research. Over the past decade, many automatic evaluation metrics have been proposed to evaluate open-domain dialogue systems. Among them, the word overlap-based automatic evaluation metrics from NLG tasks, such as BLEU (Papineni et al., 2002) in machine translation and ROUGE (Lin, 2004) in text summarization, are popular. In addition, Embedding Metrics (Mitchell and Lapata, 2008; Forgues et al., 2014; Rus and Lintean, 2012) have been utilized to evaluate open-domain dialogue systems (Gu et al., 2018; Chan et al., 2019b; Shen et al., 2018). Recently, with the rapid development of large-scale pre-trained models (Devlin et al., 2018; Radford et al., 2019), researchers have proposed to enhance the embedding metrics by converting the dialogue sentences to a hidden space via pre-trained models (Sellam et al., 2020; Xiang et al., 2021). The common idea behind these metrics is that they measure the semantic similarity between a reference response and a generated response, independent of the conversational context.
However, due to the notorious one-to-many nature (Li et al., 2015;Zhao et al., 2017;Qiu et al., 2019;Gu et al., 2018) of open-domain dialogue, a good response should be related well to its context yet may be largely different from a reference response in semantics.
Some other works (Tao et al., 2018; Ghazarian et al., 2019; Sinha et al., 2020) thereby proposed to build automatic dialogue evaluation metrics by considering the similarity of the generated responses with the conversational context. Specifically, these works design discriminative models that judge whether a generated response matches the conversational context well, learning from {conversational context, response reference, negative sample} tuples in an unsupervised manner. Later work further proposed to enhance such discriminative evaluation metrics by fine-tuning on a few human-annotated data to improve robustness. These discriminative metrics are trained using a single relevant response and multiple negative samples. However, Sai et al. (2020) argued that such discriminative metrics should be trained on multiple relevant responses (i.e., positive samples) and multiple negative samples to accommodate the one-to-many nature of open-domain dialogues. Therefore, they collected a new dataset that contains multiple relevant and irrelevant responses for each conversational context to train their discriminative evaluation model, and the model trained with multiple relevant responses shows impressive performance. However, most existing datasets do not provide multiple organized relevant responses, and collecting a new dataset is expensive and time-consuming. Thus, we aim to learn multiple-reference information with limited data.
Inspired by the impressive effectiveness of Variational Auto-encoders (VAEs) and Conditional Variational Auto-encoders (CVAEs) in representation learning and dialogue modeling, we propose to learn dialogue representations via VAEs/CVAEs for better evaluation. Equipped with such dialogue representations, we obtain an Enhanced dialogue evaluation Metric in latent Space (EMS). EMS is a self-supervised evaluation metric with a two-stage training procedure. It represents dialogue sentences in a smooth latent space to both capture discourse-level context information and model more feasible latent references. Specifically, in the first stage, we build a VAE-based model to map the dialogue sentences into a latent (or semantic) space. Li et al. (2019) showed that VAEs can be viewed as a regularized version of the auto-encoder and learn a smooth latent space through the regularization from the Gaussian prior. Then, we train our model by optimizing the CVAEs' objective, which forces the prior distribution to capture the feasible latent reference information (details in Section 3.3). In the second stage, we combine the dialogue representations and the captured feasible latent reference information to train a discriminative model. Meanwhile, we give a potential explanation of why using feasible latent reference information can lead to better evaluation (details in Section 3.1). Experimental results on two real-world dialogue datasets confirm the superiority of our method for open-domain dialogue evaluation, where both Pearson and Spearman correlations with human judgments outperform all baseline methods.
In a nutshell, our contributions can be summarized as follows:
• We propose a novel automatic evaluation metric, i.e., EMS, for open-domain dialogue systems;
• We propose a pre-training variational model to capture the feasible latent references;
• Experiments performed on two large datasets demonstrate the effectiveness of our proposed model, which outperforms all baseline methods.

Related Work
Word overlap-based Metrics. Several word overlap-based automatic evaluation metrics, such as BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE (Lin, 2004), have been widely used to evaluate the quality of generated responses. These metrics measure how many words in a generated response overlap with a reference response. Liu et al. (2016) and Tao et al. (2018) argued that these word overlap-based scores correlate weakly with human judgment because they ignore the notorious one-to-many nature of open-domain dialogues. Therefore, Yuma et al. (2020) proposed an improved BLEU, which compares the generated response with multiple diverse references.
Embedding-based Metrics. Unlike word overlap-based metrics that compare two raw sentences, Embedding Metrics (Mitchell and Lapata, 2008; Forgues et al., 2014; Rus and Lintean, 2012) map sentences to a high-dimensional space and calculate similarity based on the high-dimensional representations. Embedding Metrics have recently been popular for evaluating generation tasks, such as text summarization (Chen et al., 2021), question answering, and text generation (Hashimoto et al., 2019; Chan et al., 2020). Meanwhile, several works (Qiu et al., 2019; Gao et al., 2021) have shown their effectiveness in open-domain dialogue systems. With the development of large-scale pre-trained models (Devlin et al., 2018; Radford et al., 2019), some studies, e.g., BERTScore and MoverScore, further enhance the quality of representations via a large-scale pre-trained model. These embedding-based metrics compare sentences better than word overlap-based metrics but still ignore the one-to-many nature of open-domain dialogues.
Learning-based Metrics. Recent studies (Tao et al., 2018; Sinha et al., 2020) attempt to mitigate the one-to-many issue by considering the similarity of the generated response with the conversational context. The similarity is calculated by a designed discriminative model that learns to evaluate whether a response matches the conversational context well. The discriminative model is learned from tuples of data, {conversational context, response reference, negative sample}, in an unsupervised learning manner. However, these learning-based metrics rely on a sophisticated sampling technique. Lan et al. (2020) proposed a sampling strategy to collect valuable negative samples for discriminative training. Bak and Oh (2020) conduct speaker-sensitive response evaluation by performing negative sampling at several levels. To further improve robustness, later work proposed to enhance the discriminative model by fine-tuning on a few human-annotated data. Sai et al. (2020) argued that these discriminative metrics should be trained on multiple relevant responses and multiple irrelevant samples for any given context. Therefore, they collected such a dataset and greatly improved evaluation performance. However, collecting a new dataset is expensive and time-consuming. In this work, we propose a method to improve the effectiveness of discriminative metrics based on VAEs/CVAEs.

Methodology
In this paper, we propose an Enhanced dialogue evaluation Metric in latent Space (EMS), which contains two training stages (illustrated in Fig. 2). In this section, we first discuss our motivation in Section 3.1. Then, we introduce the overall architecture in Section 3.2. The two training stages are described in Section 3.3 and Section 3.4, respectively. Finally, we describe the inference process in Section 3.5.
Figure 1: Distributions in latent space. Each circle represents a Gaussian distribution; the three small circles are the posterior Gaussian distributions for three candidate responses to the context "What is your hobby?" (e.g., "I like to play tennis.", "It is a secret.", "Tell me your hobby first."), while the biggest circle indicates the prior Gaussian distribution. We use the prior distribution to approximate all the response conditional distributions. Dotted lines indicate latent responses.

Discussion about Motivation
We discuss our motivation from an information-theoretic perspective. Let r_i denote a feasible response from {r_k}_{k=1}^N, the set of N feasible latent references, and let a binary label l ∈ {0, 1} indicate whether a response matches its context well. Existing works (Tao et al., 2018; Ghazarian et al., 2019; Sinha et al., 2020) that train with a single relevant response are in effect maximizing I(l; c, r_i). Recently, Sai et al. (2020) proposed training with multiple relevant responses, which in effect maximizes I(l; c, {r_k}_{k=1}^N). An intuitive explanation for the impressive improvement in Sai et al. (2020) is that I(l; c, {r_k}_{k=1}^N) ≥ I(l; c, r_i) (see Appendix A). However, there are no organized multiple relevant responses in existing datasets, and collecting a new dataset is expensive and time-consuming. Therefore, we aim to capture the feasible latent reference information with limited data. Inspired by previous works that model multiple responses for dialogue (Zhao et al., 2017; Qiu et al., 2019; Chan et al., 2019b), we utilize CVAEs (details in Section 3.2), which build a prior distribution P(z|c) to capture the feasible latent reference information in the latent space. Specifically, when training CVAEs, P(z|c) is forced to be close to the posterior distribution Q(z|c, r_i) for every reference response r_i, as illustrated in Fig. 1. In this sense, if z is sampled from P(z|c), z may contain some information about any r_i to some extent, and z can be used as a surrogate for {r_k}_{k=1}^N. Therefore, we can expect I(l; c, {r_k}_{k=1}^N) ≥ I(l; c, r_i, z) ≥ I(l; c, r_i).

Overall Architecture
Previous works (Gururangan et al., 2019; Li et al., 2020) concluded that VAEs can learn a smooth latent space through the regularization from the Gaussian prior. Inspired by Li et al. (2020), we propose a novel architecture that can be regarded as a large-scale pre-trained language model (PLM) based on VAEs/CVAEs. Encoder. Li et al. (2019) argue that VAEs might benefit from initialization with a non-collapsed encoder, because the encoder provides useful information from the beginning of training. We use Masked PLMs (Devlin et al., 2018) as the text encoder because of their impressive effectiveness in natural language understanding tasks. We describe the encoding process as follows: h_q = Encoder([c; r]), h_p = Encoder(c), (1) where c and r indicate the conversational context and the response reference, respectively. Latent Variable Modeling. For modeling the latent variable, we hypothesize that the approximated variational prior and posterior follow an isotropic multivariate Gaussian distribution N(µ, σ²I), where σ²I is the diagonal covariance. We use a recognition network q_φ(z|h_q) and a prior network p_θ(z|h_p) to approximate the posterior Q(z|c, r) and the prior P(z|c), respectively.
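The recognition and prior networks above can be sketched as follows. This is a minimal numpy sketch under our own simplifying assumptions (single random linear heads standing in for the learned MLPs, a 768-dimensional latent); it only illustrates how each network maps an encoder feature to Gaussian parameters and how z is drawn with the reparameterization trick.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 768  # matches the paper's latent dimension

def linear_head(dim_in, dim_out):
    """A single random linear layer standing in for a learned MLP head."""
    W = rng.normal(scale=0.02, size=(dim_in, dim_out))
    b = np.zeros(dim_out)
    return lambda h: h @ W + b

# Recognition network q_phi(z | h_q) and prior network p_theta(z | h_p):
# each produces the mean and log-variance of a diagonal Gaussian.
recog_head = linear_head(768, 2 * latent_dim)
prior_head = linear_head(768, 2 * latent_dim)

def gaussian_params(head, h):
    out = head(h)
    mu, logvar = out[:latent_dim], out[latent_dim:]
    return mu, logvar

def sample_z(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

h_q = rng.standard_normal(768)  # stands in for the PLM encoding of (c, r)
mu_q, logvar_q = gaussian_params(recog_head, h_q)
z_q = sample_z(mu_q, logvar_q)
assert z_q.shape == (latent_dim,)
```

In the actual model the heads are trained jointly with the encoder; here the random weights only make the shapes and sampling step concrete.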
Decoder. The reconstruction process forces the latent variable to contain useful posterior information, which is a crucial step in variational training. We use another PLM as the decoder to reconstruct the original input texts. To transport the latent variable to the PLM decoder, we use the memory mechanism of Li et al. (2020), where the latent variable plays the role of an additional memory vector for the PLM decoder to attend to. Specifically, the latent variable z is transformed by a Multilayer Perceptron (MLP) and separated into several vectors, each of which is fed to the PLM decoder via the attention mechanism.
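The memory mechanism can be sketched as follows. The shapes (one memory vector per decoder layer) and the single-head, single-query attention are our own simplifications for illustration, not the exact Optimus implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
latent_dim, n_layers, d_model = 768, 12, 768

# Project z into one "memory" vector per decoder layer (hypothetical shapes).
W_mem = rng.normal(scale=0.02, size=(latent_dim, n_layers * d_model))

def latent_to_memory(z):
    mem = z @ W_mem                        # (n_layers * d_model,)
    return mem.reshape(n_layers, d_model)  # one extra key/value per layer

def attend(query, keys, values):
    """Scaled dot-product attention over keys that include the memory vector."""
    scores = keys @ query / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

z = rng.standard_normal(latent_dim)
memory = latent_to_memory(z)
# In layer 0, the decoder's attention keys/values are augmented with memory[0],
# so every generated token can read information from the latent variable.
token_states = rng.standard_normal((5, d_model))
keys = np.vstack([memory[0:1], token_states])
out = attend(token_states[0], keys, keys)
assert out.shape == (d_model,)
```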

Stage 1: Representation in Latent Space
Our first stage learns the latent representation of the dialogues and captures the feasible latent reference information. Specifically, we first optimize our model via the VAEs' objective to model a smooth latent space. Then, we train our model with the CVAEs' objective to capture the feasible latent reference information. We describe the details as follows.
A smooth latent space. Following Li et al. (2020), we first train the posterior module by optimizing the VAEs' objective. Li et al. (2019) showed that VAEs can be viewed as a regularized version of the autoencoder and can learn a smooth latent space. Based on this, we map sentences into a universal smooth latent space, in which the latent representations of similar sentences are close to each other and vice versa. Therefore, it provides a good starting point for our model. To train this model, the log-likelihood objective is maximized by pushing up its variational lower bound: L_VAE = E_{q_φ(z|h_q)}[log p_θ(x|z)] − KL(q_φ(z|h_q) ‖ p(z)), (2) where KL(·‖·) represents the KL-divergence term, which serves as the regularization that encourages q_φ(z|h_q) to approach p(z), i.e., a standard Gaussian distribution; E[·] is the reconstruction term, reflecting how well the decoder performs.
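For diagonal Gaussians, the KL regularizer in the VAE objective has a well-known closed form. The sketch below computes it and assembles the negative ELBO; the reconstruction log-likelihood is taken as a given scalar, since the decoder itself is a full PLM.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def neg_elbo(recon_log_likelihood, mu, logvar):
    """Negative variational lower bound: reconstruction term plus KL regularizer."""
    return -recon_log_likelihood + kl_to_standard_normal(mu, logvar)

# The KL term vanishes exactly when the posterior equals the prior N(0, I),
# which is the "smoothing" pressure the Gaussian prior exerts on the space.
assert kl_to_standard_normal(np.zeros(4), np.zeros(4)) == 0.0
```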
Feasible latent reference information. The one-to-many nature of open-domain dialogues implies that there can be many reasonable responses for the same conversational context. Therefore, we handle this one-to-many nature with a CVAE, as in previous works (Zhao et al., 2017; Gu et al., 2018; Chan et al., 2019b), to capture the feasible latent reference information. As shown in Fig. 1, the CVAE builds a Gaussian posterior distribution for each feasible reference and forces the prior distribution to approach these posterior distributions. Ideally, a well-learned prior distribution will cover all the feasible latent reference information. We train our model by optimizing the following variational lower bound: L_CVAE = E_{q_φ(z|h_q)}[log p_θ(r|z, c)] − KL(q_φ(z|h_q) ‖ p_θ(z|h_p)), (3) where the KL-divergence term serves as the regularization that encourages the prior p_θ(z|h_p) to approach the approximated posterior q_φ(z|h_q). To alleviate the KL-vanishing problem, we adopt the cyclical annealing schedule (Fu et al., 2019). Specifically, we add a hyperparameter α to control the weight of the KL-divergence in Eq. 3. We set α close to zero in the first half of each cycle, linearly anneal α to 1 in the next quarter of the cycle, and keep α = 1 for the remainder of the cycle.
Moreover, Free Bits (Bowman et al., 2015) is also crucial for training. It replaces the KL-divergence in Eq. 3 with a hinge loss: KL_FB = Σ_i max(γ, KL(q_φ(z_i|h_q) ‖ p_θ(z_i|h_p))), (4) where γ is a hyperparameter that controls the information capacity for each dimension of the latent variable. Finally, an extra bag-of-words loss (Zhao et al., 2017) is also used during training.
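The two training tricks above can be sketched directly. The cyclical α schedule follows the description in the text (zero for the first half of each cycle, linear ramp over the next quarter, then held at 1); the free-bits hinge clamps each dimension's KL from below so dimensions carrying fewer than γ nats stop being penalized.

```python
import numpy as np

def cyclical_alpha(step, cycle_len):
    """KL weight: ~0 for the first half of a cycle, linearly annealed to 1
    over the next quarter, then kept at 1 for the rest of the cycle."""
    pos = (step % cycle_len) / cycle_len
    if pos < 0.5:
        return 0.0
    if pos < 0.75:
        return (pos - 0.5) / 0.25
    return 1.0

def free_bits_kl(kl_per_dim, gamma):
    """Free Bits hinge: dimensions whose KL is below gamma contribute a
    constant gamma, so they receive no gradient pressure toward collapse."""
    return np.sum(np.maximum(kl_per_dim, gamma))

assert cyclical_alpha(25, 100) == 0.0   # first half of the cycle
assert cyclical_alpha(99, 100) == 1.0   # final quarter of the cycle
kl = np.array([0.01, 0.5, 2.0])
assert abs(free_bits_kl(kl, 0.1) - (0.1 + 0.5 + 2.0)) < 1e-9
```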

Stage 2: Matching Training
In the second stage, we learn to judge the similarity between the conversational context and the response using the learned representations. Li et al. (2020) argue that the KL regularization applied on z has a large impact on the preceding layer feature; thus, the preceding layer feature also contains the information of z. Therefore, we combine h_q and z into the final representation: rep_q = h_q + τ · z_q, (5) where τ is a hyperparameter and z_q indicates the latent representation from the posterior network. Meanwhile, we use the feasible latent reference information, captured by our prior network, to enhance the matching. We combine these two representations as follows: e = σ(W_g [z_p; h_p] + b_g), rep_p = e ⊙ z_p + (1 − e) ⊙ h_p, (6) where W_g and b_g are trainable parameters, e is the gate that controls the fusion of z_p and h_p, z_p indicates the latent representation from the prior network, and the activation function σ is the sigmoid. Finally, we infer the matching score between the conversational context and the generated response as follows: g_c = σ(W_s [rep_q; rep_p] + b_s), (7) where W_s and b_s are trainable parameters and the activation function σ is the sigmoid. Finally, we optimize our model with positive and negative sampling (Lan et al., 2020) based on the discriminative training scheme.
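The matching head described above can be sketched as follows. The exact form of the posterior combination in Eq. 5 is our own reading (an additive mix weighted by τ), and the parameter shapes are hypothetical; the gated fusion and sigmoid scoring follow the text's description.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 768

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical (random) parameters for the gate and scoring layers.
W_g = rng.normal(scale=0.02, size=(2 * d, d)); b_g = np.zeros(d)
W_s = rng.normal(scale=0.02, size=(2 * d,));   b_s = 0.0

def fuse_posterior(h_q, z_q, tau=0.01):
    """Eq. 5 (as we read it): mix the encoder feature with the latent sample."""
    return h_q + tau * z_q

def fuse_prior(h_p, z_p):
    """Eq. 6: a learned gate e controls the fusion of z_p and h_p."""
    e = sigmoid(np.concatenate([z_p, h_p]) @ W_g + b_g)
    return e * z_p + (1.0 - e) * h_p

def match_score(rep_q, rep_p):
    """Eq. 7: a sigmoid-bounded matching degree in (0, 1)."""
    return sigmoid(np.concatenate([rep_q, rep_p]) @ W_s + b_s)

g_c = match_score(
    fuse_posterior(rng.standard_normal(d), rng.standard_normal(d)),
    fuse_prior(rng.standard_normal(d), rng.standard_normal(d)),
)
assert 0.0 < g_c < 1.0
```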

Inference
In the inference process, we feed the conversational context and a response candidate as c and r in Eq. 1, and apply Eq. 5, Eq. 6, and Eq. 7 to obtain the score g_c. We use g_c as the matching degree of the response candidate.

Dataset
To evaluate the effectiveness of our proposed automatic evaluation metric EMS, we conduct experiments on the following two open-access datasets.
The persona-chat dataset (Zhang et al., 2018) is a large persona-conditioned chit-chat style dialogue dataset which consists of 10,907 multi-turn dialogue sessions.
The dailydialog dataset (Li et al., 2017) is another widely-used large collection of human-human dialogues which consists of 13,118 multi-turn dialogue sessions.
Human-annotated Dataset. We collect human annotations from Amazon Mechanical Turk and obtain two human-annotated datasets, which consist of 750 context-response pairs for the persona-chat dataset and 800 for the dailydialog dataset, respectively. Following prior work, the generated responses come from several classical dialogue models, i.e., Seq2Seq (Sutskever et al., 2014), Seq2Seq with Attention, HRED, VHRED, and GPT-2 (Wolf et al., 2019).

Baselines
We compare our proposed method with the following highly related and strong baselines.
BLEU. We utilize the BLEU score (Papineni et al., 2002) to measure n-gram overlap between the response reference and the generated response. Specifically, we follow the conventional setting in Sinha et al. (2020) and use the multi-bleu script.
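The core quantity behind BLEU is clipped n-gram precision, which BLEU combines over several n with a brevity penalty. A minimal single-reference sketch (not the multi-bleu script itself) makes concrete why higher-order n-grams overlap so rarely in open-domain dialogue:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not cand_ngrams:
        return 0.0
    # Each candidate n-gram is credited at most as often as it appears in the reference.
    clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return clipped / sum(cand_ngrams.values())

# One changed word halves the bigram precision while unigram precision stays high,
# illustrating how quickly higher-order overlap decays.
assert ngram_precision("i like to play tennis", "i love to play tennis", 1) == 0.8
assert ngram_precision("i like to play tennis", "i love to play tennis", 2) == 0.5
```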
METEOR. The METEOR (Banerjee and Lavie, 2005) is designed as an improvement on BLEU using a harmonic mean of precision and recall, as well as stemming and synonyms.
Embedding Metrics. Embedding Metrics compute the similarity between the embedding representations of the generated result and the reference. The embeddings come from GloVe. In particular, we calculate three metrics: 1) Average, the cosine similarity between the averaged word embeddings of the two sentences (Mitchell and Lapata, 2008); 2) Extrema, the cosine similarity between the largest extreme values among the word embeddings of the two sentences (Forgues et al., 2014); 3) Greedy, which greedily matches words in the two sentences based on cosine similarity and then averages the scores across all words (Rus and Lintean, 2012).
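Two of these metrics can be sketched directly from their definitions; the random vectors below merely stand in for GloVe embeddings of two sentences. The symmetrized form of Greedy is one common convention, which we assume here.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def average_metric(emb_a, emb_b):
    """Average: cosine similarity of the mean word embeddings."""
    return cosine(emb_a.mean(axis=0), emb_b.mean(axis=0))

def greedy_metric(emb_a, emb_b):
    """Greedy: each word in one sentence matches its most similar word in the
    other; per-word scores are averaged, then symmetrized over both directions."""
    def one_way(x, y):
        return np.mean([max(cosine(xi, yj) for yj in y) for xi in x])
    return 0.5 * (one_way(emb_a, emb_b) + one_way(emb_b, emb_a))

rng = np.random.default_rng(3)
# Toy stand-ins for GloVe vectors of two sentences (4 and 3 words, dim 50).
sent_a = rng.standard_normal((4, 50))
sent_b = rng.standard_normal((3, 50))
assert -1.0 <= average_metric(sent_a, sent_b) <= 1.0
assert -1.0 <= greedy_metric(sent_a, sent_b) <= 1.0
```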
BERTScore. BERTScore uses a strong PLM to greedily match each word in a reference response with one word in the generated response. By doing so, it computes the recall of the generated sequence. BERTScore was shown to have strong system-level and segment-level correlation with human judgment on several machine translation tasks.
BLEURT. BLEURT (Sellam et al., 2020) is based on the BERTScore and finetuned on human judgments after pretraining on large-scale synthetic data with multiple automatic metrics as supervision signals. BLEURT has shown its strong correlation with human judgment on machine translation tasks.
RUBER. RUBER (Tao et al., 2018) is an unsupervised automatic evaluation metric that considers the similarity of the generated response with both the conversational context and the response reference.
MAUDE. MAUDE (Sinha et al., 2020) is an unreferenced automatic evaluation metric that uses large-scale PLMs to extract hidden representations of dialogue sentences and leverages the temporal transitions between them.

Settings
The dimension of the latent variable z is set to 768 to widen the information bottleneck. As mentioned before, the encoder and the decoder in our model are BERT (Devlin et al., 2018) and GPT-2 (Radford et al., 2019), respectively. We use the BERT tokenizer for BERT and the GPT-2 tokenizer for GPT-2 in all experiments. We initialize from Optimus (Li et al., 2020) to save the time cost of VAE training. The recognition network and prior network each consist of a 3-layer MLP with Dropout layers and the GELU activation. During training, we set the mini-batch size to 16. The AdamW optimizer is used to train the VAE module with an initial learning rate of 5e-5, along with a learning rate warmup and decay strategy. The value of τ in Eq. 5 is set to 0.01. For the matching training, we change the initial learning rate to 3e-6, again with a learning rate warmup and decay strategy.

Overall Performance
We examine the performance of our model compared with the baselines on the two open-access datasets and report the results in Table 1 and Table 2.
The word-overlap metrics based on n-grams perform worst. As shown in Table 1 and Table 2, the word-overlap evaluation metrics, i.e., BLEU, ROUGE, and METEOR, obtain the worst performance on both datasets. Among them, BLEU (hybrid) scores on the two datasets are both less than 0.1, even though BLEU is the most widely used metric in machine translation. Intuitively, the information from an n-gram is more accurate with a larger n (the most accurate information comes from the whole sentence). However, as shown in Table 1 and Table 2, the correlation score decays as n increases. The same phenomenon is observed with ROUGE. It seems that using n-grams as the representation of a dialogue sentence is not a good choice.
PLM is an effective representation extractor for dialogue sentences. From Table 1 and Table 2, we can see that most embedding-based metrics, i.e., Average, Extrema, Greedy, BERTScore, and BLEURT, which use pre-trained embeddings to represent the sentences, perform better than word-overlap metrics that use n-grams as representations. Furthermore, the traditional embedding-based metrics with GloVe-based embeddings, i.e., Average, Extrema, and Greedy, perform worse than the embedding-based metrics with PLM-based embeddings, i.e., BERTScore and BLEURT. Thus, using a PLM to represent dialogue sentences is more effective for evaluation.
Learning-based discriminative metrics outperform the reference-based metrics above. Our proposed EMS metric performs the best. Our EMS metric achieves the best performance with 0.5856 and 0.5921 in Pearson and Spearman correlation with human judgments on the persona-chat dataset, respectively. Meanwhile, on the dailydialog dataset, EMS obtains 0.5331 and 0.5253 in Pearson and Spearman scores. These experimental results show that our method outperforms all existing baselines, indicating its superiority.

Analysis
Our model aims to enhance dialogue evaluation via variational training. Hence, in this subsection, we examine whether variational training improves the performance through an ablation study. First, we replace the hidden representation in Eq. 5 ("w/o q") with the one from pure BERT, i.e., the [CLS] vector. From the performance in Table 3, the KL regularization enhances the performance of the EMS metric on both datasets, which shows that a smooth latent space (obtained via VAE training) is important. Second, as shown in Table 3, without z_p in Eq. 6 ("w/o p"), which captures the feasible latent reference information, EMS suffers a performance drop. Therefore, the feasible latent reference information captured by the prior network is also beneficial.

Table 4 (excerpt).
Context: what a beautiful home! eos you'll notice that the window treatments, carpeting, and drapes are all new. eos i like the way the blinds give you privacy from the street. eos follow me into the kitchen. you will love it. eos i love that they put a wine storage area in the kitchen.
Reference: the best part is the bedroom and attached bathroom.
Generated: i'm sure you will.
Case Study
To illustrate more intuitively, we show two cases from our experiments in Table 4. In the first case, the golden score from the human is 3.00, yet MAUDE predicts 4.99. We find that MAUDE gives such a high score because a keyword in the generated response ("what kind of buses are they on?") also exists in the conversational context. In the second case, MAUDE gives an extremely low score, i.e., 1.17, since there are no repeated words between the generated response and the context. However, our EMS gives scores similar to the human scores, 3.32 and 2.29 in the first and second cases, respectively. This suggests that EMS behaves more like human evaluation.

Conclusion and Future Work
In this study, we propose a two-stage automatic evaluation metric, i.e., EMS, which obtains a smooth latent space that can both capture discourse-level context information and model more feasible latent references for evaluating open-domain dialogues. Experimental results on two dialogue datasets confirm the superiority of our method for open-domain dialogue evaluation, where both Pearson and Spearman correlations with human judgments outperform all baseline methods.
Owing to the promising performance of the variational training, we plan to design training procedures for better representation in latent space. Besides, we will explore more efficient methods to obtain more useful feasible reference information.

Acknowledgement
We would like to thank the anonymous reviewers for their constructive comments. This work is supported by the National Key Research and Development Program of China (No. 2020YFB1406702), the National Science Foundation of China (NSFC No. 61876196) and Beijing Outstanding Young Scientist Program No.BJJWZYJH012019100020098. Rui Yan is sponsored by Tencent Collaborative Research Fund. Rui Yan is the corresponding author, and is supported as a Young Fellow of Beijing Institute of Artificial Intelligence (BAAI).

Ethics Impact
In this paper, we propose a two-stage automatic evaluation metric, EMS, for open-domain dialogue systems. The positive impact lies in that it allows quick and effective comparison between different dialogue systems, which is crucial for the development of open-domain dialogue tasks. The negative impact may be that, in some extreme cases, the system may give high scores to rude or offensive responses. Hence, in such situations, the training dataset is crucial and should be examined before being employed in practice.

A Proof
As we know, the mutual information (MI) of X and Y is defined as

I(X; Y) = H(X) − H(X|Y), (8)

where H(·) denotes the entropy. By the chain rule of mutual information,

I(l; c, r_i, z) = I(l; c, r_i) + I(l; z | c, r_i). (9)

Then, we can compare the MI with and without the feasible latent reference information as follows:

I(l; c, r_i, z) − I(l; c, r_i) = I(l; z | c, r_i) ≥ 0, (10)

where we can observe that the MI is enhanced by including z.