RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue

Evaluating open-domain dialogue systems is challenging because of, e.g., the one-to-many problem: many responses other than the single golden response may be appropriate. Existing automatic evaluation methods still lack consistency with human judgment, while reliable human evaluation is time- and cost-intensive. To this end, we propose the Reference-Assisted Dialogue Evaluation (RADE) approach under a multi-task learning framework, which leverages a pre-created utterance as a reference rather than a golden response to relieve the one-to-many problem. Specifically, RADE explicitly compares the reference and the candidate response to predict their overall scores. Moreover, an auxiliary response generation task enhances the prediction via a shared encoder. To support RADE, we extend three datasets with additional rated responses beyond the single golden response by human annotation. Experiments on our three datasets and two existing benchmarks demonstrate the effectiveness of our method: its Pearson, Spearman, and Kendall correlations with human evaluation outperform state-of-the-art baselines.


Introduction
Open-domain dialogue systems, which focus on non-goal-oriented chitchat, may converse on a broad range of arbitrary topics. Recent years have witnessed rapid advances in natural language generation (Zhang et al., 2020c; Roller et al., 2021; Zhao et al., 2023), boosting the development of open-domain dialogue systems. Conversations with such systems resemble human-human interactions, as various responses might fit the context, given that users often do not have a specific goal beyond enjoying the conversation. Evaluating these conversations is thus challenging because of the so-called one-to-many problem (Chan et al., 2021; Ji et al., 2022); see Figure 1, where three candidate responses with different semantics fit the context while there is only one golden response.
The most common practice in dialogue evaluation is to use reference-based metrics, which compare the generated response with a pre-created response, commonly referred to as the golden standard (Ji et al., 2022). Reference-based metrics calculate the similarity between the generated and golden responses at either the lexical level (e.g., ROUGE (Lin, 2004), BLEU (Papineni et al., 2002)) or the semantic level (e.g., BERTScore (Zhang et al., 2020b), ADEM (Lowe et al., 2017)). However, these metrics ignore the one-to-many nature of open-domain dialogues. As illustrated at the bottom of Figure 1, the generated response "Amazon is good but expensive ..." expresses the opposite semantics to the golden response "I shop online..." and is therefore considered a poor response by reference-based metrics. Consequently, these metrics often lack consistency with human judgment. Recently, multi-reference methods and reference-free methods have been proposed to address this drawback. The former explicitly annotates multiple references per dialogue (Eric et al., 2021), whereas the latter discards the golden response during evaluation and achieves high correlations with human judgments (Mehri and Eskenazi, 2020c; Huang et al., 2020). However, drawbacks remain in both classes of methods: multi-reference methods are costly and hard to generalize to different datasets, while reference-free methods are often unstable and vulnerable to data-induced biases.
To overcome the weaknesses of existing evaluation methods and further resolve the one-to-many problem, we propose a new technique, namely Reference-Assisted Dialogue Evaluation (RADE). RADE considers the pre-created response as a reference instead of the golden standard.
To support RADE, we design a new human annotation task to extend existing datasets, which includes metric decomposition and pairwise annotation: a pre-scored golden response is paired with generated responses, which are rated on a unified scale. The final scores are obtained by aggregating the ratings of different sub-metrics with a weighted sum. This annotation effort yields three high-quality labeled datasets with 10,112 dialogues, corresponding to three downstream open-domain dialogue tasks: chitchat, empathetic dialogue, and personal chat. These multi-domain datasets make RADE more robust when generalizing to cross-domain evaluation scenarios while retaining strong task-specific performance.
Based on the newly collected datasets, we propose a RADE model under the multi-task learning framework for automatic evaluation. Specifically, RADE first explicitly encodes the relation between the dialogue context and the generated response with the assistance of the reference. RADE then discriminates whether the reference or the response fits the context better and predicts a score for each utterance. To relieve the one-to-many problem, we augment RADE with a joint response generation task, in which RADE learns to generate the reference responses so as to better perceive the range of candidate responses.
Extensive experiments on our three benchmarks demonstrate that RADE achieves the best correlation with human judgment. We also examine two existing USR benchmarks (Mehri and Eskenazi, 2020c), where RADE outperforms state-of-the-art methods, e.g., pushing the Pearson correlation coefficient to 48% (6.8% absolute improvement) and the Spearman correlation coefficient to 46.6% (4.3% absolute improvement). Experiments also verify the generalizability of our proposed method.
Our contributions can be summarized as follows: (1) We propose the reference-assisted evaluation method, i.e., RADE, for open-domain dialogue evaluation; (2) We design a new human annotation task and collect three new dialogue evaluation datasets; (3) Experiments on our benchmarks and two existing benchmarks verify the effectiveness and robustness of the proposed methods; (4) We release three new benchmarks and the pre-trained evaluation model to facilitate future research on dialogue evaluation.

Reference-based dialogue evaluation
Previous reference-based methods compare the generated response with the pre-created response at the lexical or semantic level. Lexical-level metrics, e.g., ROUGE (Lin, 2004), BLEU (Papineni et al., 2002), and METEOR (Banerjee and Lavie, 2005), count the n-gram overlap between the candidate response and the reference response. These methods usually correlate poorly with human evaluation results due to the lexical mismatch problem (Liu et al., 2016). Semantic-level metrics address the lexical mismatch problem by calculating similarity with high-dimensional embeddings. For example, Sharma et al. (2017) measure the embedding distance between the golden and generated responses. Ghazarian et al. (2019) and Zhang et al. (2020b) enhance the text representation using large pre-trained models, which have shown exemplary performance in capturing semantic similarity. However, these metrics suffer from the one-to-many problem when evaluating open-domain dialogues, since responses with various semantics may fit the dialogue context.
Recent works relieve this drawback by annotating multiple references per dialogue, commonly referred to as multi-reference methods (Li et al., 2017; Sai et al., 2020), which are costly and hard to generalize to agnostic scenarios. The proposed RADE instead considers the pre-created response as a candidate rather than the golden standard to address the one-to-many problem of dialogue evaluation.

Reference-free dialogue evaluation
Reference-free methods are gaining attention as they correlate well with human judgment given only the dialogue context and the response. For example, MAUDE predicts the score of a dialogue using pre-trained language models, GRADE (Huang et al., 2020) evaluates the coherence of dialogues with the augmentation of a commonsense graph, and EMS (Chan et al., 2021) enhances dialogue evaluation by capturing representations of the context and response in latent space. Some methods further decompose the evaluation of responses into multiple perspectives (Mehri and Eskenazi, 2020a,c; Phy et al., 2020), such as relevance, fluency, and engagingness, and then aggregate the overall score from the different sub-metrics with a weighted average. However, recent studies (Khalid and Lee, 2022; Deutsch et al., 2022) reveal that reference-free methods are vulnerable to data-induced biases and inherently biased toward models similar to their own. In contrast, this paper proposes a reference-assisted approach, which enhances robustness by using reference responses as a benchmark.

Task Formulation
In this work, we propose two tasks: (1) extending the existing datasets by human annotation, and (2) leveraging the rated references collected in (1) to enhance automatic evaluation.
Human annotation Human annotation aims to extend existing datasets with multiple rated responses to facilitate automatic evaluation. Given a dialogue context c, which is always paired with a golden response (denoted as the reference) r_h, we employ generation models, e.g., BlenderBot (Roller et al., 2021), to generate an additional response r_a. We then assign to the reference a fixed overall score, or one derived from existing datasets, denoted s_h. The annotators are instructed to rate r_a as s_a on the same scale, taking the reference as a benchmark. The annotators are also asked to revise the reference score s_h if it is inappropriate.
Automatic evaluation Given a dialogue context c, the proposed RADE learns to evaluate the response r_a with the assistance of the reference r_h under the multi-task learning framework. The first task explicitly models the relation between reference and response and discriminates which fits the context better; the scores of the reference and the response are predicted simultaneously. The second task enhances score prediction by implicitly estimating the distribution of candidate responses.

Relevance†: Whether the response matches the dialogue context semantically.
Engagingness†: Whether the response is engaging or interesting rather than a rigid template.
Fluency†: Whether the response is fluent and natural throughout the conversation.
Understandability‡: Whether any external knowledge is contained in the response.
Emotional-awareness‡: Whether the agent captures the emotion of the user and provides empathetic support.
Personality-awareness‡: Whether the response conforms to the given personality.

Table 1: Criteria in human annotation. Metrics marked † are general metrics for all dialogue tasks, while metrics marked ‡ are for specific dialogue tasks (i.e., understandability for chitchat, emotional-awareness for emotional dialogue, and personality-awareness for personal chat).

Human Annotation
Our human annotation task rates candidate responses against a pre-scored reference as a benchmark. Since a response can be assessed from multiple perspectives, we simplify by sorting the possible aspects into two categories: the general view and the task-specific view. As listed in Table 1, the former contains relevance, engagingness, and fluency, which apply to all dialogue agents. The task-specific criteria consist of understandability, emotional awareness, and personality awareness, which correspond to chitchat dialogue, emotional dialogue, and persona dialogue, respectively. We annotate ratings for each metric and calculate the overall rating score by weighting these sub-metrics. Specifically, the weights are obtained from user preferences (see Section A.1.3 for details).
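Concretely, the aggregation can be sketched as follows. The sub-metric ratings and approval rates below are made-up illustrations, not values from our user study; see Section A.1.3 for how the weights are actually derived:

```python
import math

def softmax(xs):
    # Normalize raw approval rates into weights that sum to 1.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def overall_score(sub_ratings, approval_rates):
    # Overall score = weighted sum of sub-metric ratings (five-point scale).
    weights = softmax(approval_rates)
    return sum(w * r for w, r in zip(weights, sub_ratings))

# Hypothetical example: relevance, engagingness, fluency, task-specific metric.
ratings = [4.0, 3.0, 5.0, 4.0]
approval = [0.9, 0.6, 0.7, 0.8]
score = overall_score(ratings, approval)
```

Because the softmax weights sum to one, the overall score always stays within the range of the sub-metric ratings.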

Data preparation
We consider three datasets to extend:
• DSTC-ChitChat (ChitChat) (Hori and Hori, 2017), a chitchat dataset collected from Twitter, where each example is derived from a conversation between a customer and an agent.
• Empathetic Dialogues (EmpaDial) (Rashkin et al., 2019), which consists of 25k dialogues grounded in emotional situations.
• PersonaChat (Zhang et al., 2018), a real-world dataset consisting of 10k dialogues where each participant plays the part of an assigned persona.

Human annotation details
We hire 40 annotators for data annotation. Following a five-point scale, they are asked to label the sub-metrics listed in Table 1. The five-point scale allows the annotators to factor in their subjective interpretation of the extent to which a system's response succeeds or fails to satisfy a user's request. The dialogue context, the rated reference response, and its score are provided in each example. At least three annotators are required for each example. We annotated about 10k dialogues across the three datasets; the statistics of the collected datasets are listed in Table 2. The ratings achieve reasonable inter-annotator agreement, with Fleiss' kappa scores of 0.540, 0.554, and 0.533 on the three datasets, respectively. More details about the annotation guidelines are provided in Appendix A.1.2.
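For reference, Fleiss' kappa over a rating matrix can be computed as follows; this is a self-contained sketch, and the toy counts in the test are illustrative rather than our annotation data:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a matrix of shape [n_items][n_categories],
    where each cell counts how many annotators chose that category.
    Assumes the same number of raters for every item."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Per-item observed agreement P_i.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items
    # Chance agreement P_e from the category marginals.
    totals = [sum(row[j] for row in ratings) for j in range(len(ratings[0]))]
    grand = n_items * n_raters
    p_e = sum((t / grand) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

Values around 0.53-0.55, as reported above, indicate moderate agreement.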

Reference-Assisted Automatic Evaluation
We propose RADE, a Reference-Assisted Automatic Dialogue Evaluation method under the framework of multi-task learning. Unlike reference-based methods, which evaluate based on the distance between the golden and generated responses, RADE explicitly discriminates whether the reference or the candidate response fits the dialogue context better. To relieve the one-to-many problem, we augment RADE with a joint response generation task, which aims to perceive the range of feasible candidate responses.

Posterior encoder. The posterior encoder encodes the dialogue context c, the reference response r_h, and the model-generated response r_a into a hidden representation. In particular, we first concatenate c, r_h, and r_a into X with a special token [SEP]:

X = [c; [SEP]; r_h; [SEP]; r_a].    (1)

The concatenated sequence is then fed into a Transformer-based encoder to get the representation H ∈ R^{|X|×d}:

H = Encoder(X),    (2)

where d is the hidden size of the encoder and |X| is the length of the sequence X.
Regression layer. The regression layer aggregates the representation H and predicts the scores of the reference and the candidate response simultaneously. Specifically, a pooling layer aggregates the token-level representations into a sequence-level representation h ∈ R^{d×1}:

h = Pooling(H).    (3)

Then, a feedforward network takes h as input to predict the scores of both the reference and the candidate response:

[ŝ_h, ŝ_a] = FFN(h),    (4)

where ŝ_h and ŝ_a denote the predicted scores of r_h and r_a, respectively.

Candidate response generator. To relieve the one-to-many problem, we devise a candidate response generator to perceive the range of feasible candidate responses (Chan et al., 2021). Specifically, a Transformer-based generator learns to generate reference responses autoregressively for a given context. We first encode the dialogue context c:

H_c = Encoder(c),    (5)

where the Encoder shares its parameters with the posterior encoder in Eq. (2). Then, we apply a Transformer-based decoder to model the generation probability of the reference response r_h:

P(r_h | c) = ∏_{t=1}^{T} Decoder(r_{h,t} | r_{h,<t}, H_c),    (6)

where T denotes the length of r_h.

Compared with previous reference-free methods, which estimate the relation between context and response only with knowledge acquired from their training data, RADE explicitly takes the pre-created response as a benchmark to reduce data-induced bias when generalizing to agnostic scenarios. Moreover, unlike existing reference-based methods, which use the pre-created response as the golden standard without considering the semantic diversity of responses, we relieve the one-to-many problem via the auxiliary response generation task. The shared encoder enhances the capability of context representation, which improves the score-prediction task through multi-task learning.
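The overall shape of the architecture can be illustrated with a toy PyTorch module; the module sizes and layer choices here are illustrative stand-ins (the paper initializes from BART), and the causal decoder mask is omitted for brevity:

```python
import torch
import torch.nn as nn

class RADESketch(nn.Module):
    """Toy stand-in for RADE's multi-task design: a shared encoder feeds
    (1) a regression head scoring reference and candidate jointly, and
    (2) a decoder generating the reference response."""
    def __init__(self, vocab=1000, d=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        enc_layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)  # shared
        self.score_head = nn.Linear(d, 2)  # -> (s_hat_h, s_hat_a), Eq. (4)
        dec_layer = nn.TransformerDecoderLayer(d, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, x_ids, ctx_ids, ref_ids):
        # Score prediction: encode [c; SEP; r_h; SEP; r_a], mean-pool, regress.
        h = self.encoder(self.embed(x_ids)).mean(dim=1)
        scores = self.score_head(h)                    # [B, 2]
        # Generation: decode the reference against the encoded context.
        mem = self.encoder(self.embed(ctx_ids))        # shared encoder, Eq. (5)
        dec = self.decoder(self.embed(ref_ids), mem)
        logits = self.lm_head(dec)                     # [B, T, vocab]
        return scores, logits
```

The two heads share the encoder, so gradients from the generation loss also shape the representation used for score prediction.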

Two-stage training
Neural models have been shown to be prone to data-induced bias, but it is costly to annotate a large dataset for every specific task. Therefore, we propose a two-stage strategy: (1) cross-domain pre-training and (2) task-specific fine-tuning, keeping a trade-off between in-domain and cross-domain performance. As shown in Figure 2 (right), we pre-train our model on existing human-annotated datasets from different downstream tasks of open-domain dialogue to improve generalizability (Ye et al., 2021a). Since the cross-domain datasets suffer from domain gaps and lack pairwise scores, in the next stage we fine-tune our model with the newly collected task-specific datasets.
Cross-domain pre-training. The pre-training datasets contain 54,438 dialogue-level examples collected from different downstream tasks, covering a wide range of domains (see Table 7 for details). To learn a coarse-grained judgment of generated responses without human-annotated reference scores, our model is first pre-trained by minimizing a cross-domain pre-training loss L_Cross, composed of a score-prediction loss and a generation loss:

L_Cross = L_MSE(ŝ_a, s_a) + L_GEN,    (7)

where s_a and ŝ_a denote the human-annotated and predicted scores of the candidate response, and L_MSE(ŝ_a, s_a) = (ŝ_a − s_a)². L_GEN is the response generation loss, defined as:

L_GEN = −log P(r_h | c),    (8)

where P(r_h | c) is the generation probability of r_h defined in Eq. (6).
Task-specific fine-tuning. We next fine-tune our model with the newly annotated datasets to enhance performance when evaluating task-specific dialogue agents. The optimization objective L_In is composed of a score-prediction loss, a generation loss, and a pairwise ranking loss:

L_In = L_MSE(ŝ_a, s_a) + L_MSE(ŝ_h, s_h) + L_GEN + L_PR,    (9)

where L_MSE(ŝ_a, s_a) and L_MSE(ŝ_h, s_h) are the MSE score-prediction losses of the candidate and reference responses, respectively, and L_GEN is the generation loss defined in Eq. (8). L_PR is the pairwise ranking loss:

L_PR = max(0, −g(s_h, s_a) · (ŝ_h − ŝ_a)),    (10)

in which g(s_h, s_a) is a labeling function defined as:

g(s_h, s_a) = 1 if s_h ≥ s_a, and −1 otherwise.    (11)

L_PR is introduced to ensure that the rank order of the predicted scores matches the pre-annotated order. Compared to reference-free models, which inherently favor outputs from their underlying models or those trained on similar datasets, RADE is explicitly optimized to align with human intentions and effectively alleviates this bias.
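The fine-tuning objective can be sketched in a few lines; the hinge form of the ranking loss below is our reading of L_PR (with a configurable margin defaulting to zero), not an exact transcription from the paper:

```python
def mse(pred, gold):
    # Squared-error score-prediction loss, L_MSE.
    return (pred - gold) ** 2

def g(s_h, s_a):
    # Labeling function: +1 if the reference is rated at least as
    # high as the candidate, -1 otherwise.
    return 1.0 if s_h >= s_a else -1.0

def pairwise_ranking_loss(pred_h, pred_a, s_h, s_a, margin=0.0):
    # Penalize predicted scores whose order contradicts the annotated order.
    return max(0.0, margin - g(s_h, s_a) * (pred_h - pred_a))

def finetune_loss(pred_h, pred_a, s_h, s_a, gen_nll):
    # L_In: two MSE terms + generation loss + pairwise ranking loss.
    return (mse(pred_a, s_a) + mse(pred_h, s_h)
            + gen_nll
            + pairwise_ranking_loss(pred_h, pred_a, s_h, s_a))
```

When the predicted order agrees with the annotated order, the ranking term vanishes and only the regression and generation terms drive learning.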
Experimental Setup

Dataset and evaluation metrics
We mainly conduct experiments on the three datasets annotated in Section 4. We further evaluate on two existing benchmarks, USR-TopicalChat and USR-PersonaChat (Mehri and Eskenazi, 2020c), to examine the generalizability of our method. The evaluation metrics include the Pearson (r), Spearman (ρ), and Kendall (τ) correlations, which measure the linear relationship, the monotonic relationship, and the ordinal association between automatic and human evaluation, respectively. We abbreviate the Pearson, Spearman, and Kendall correlations as r, ρ, and τ for simplicity.
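In practice one would compute these with scipy.stats (pearsonr, spearmanr, kendalltau); a self-contained sketch of the three statistics (Kendall in its tau-a form, i.e., without a tie correction) is:

```python
def pearson(x, y):
    # Linear correlation of two equal-length score lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    # Average ranks, handling ties.
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Monotonic correlation: Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    # Tau-a: (concordant - discordant) / total pairs.
    n = len(x)
    s = sum(
        (1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else
         -1 if (x[i] - x[j]) * (y[i] - y[j]) < 0 else 0)
        for i in range(n) for j in range(i + 1, n)
    )
    return s / (n * (n - 1) / 2)
```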

Implementation details
We initialize the parameters of the encoder and decoder with BART (Lewis et al., 2020), a Transformer-based pre-trained model. BART is well-suited to our proposed model because it handles both text representation and text generation tasks. We optimize the model using the Adam optimizer with β_1 = 0.98, β_2 = 0.97, and a learning rate of 5e−5. The model is trained for up to 10 epochs; we tune the hyper-parameters and pick the checkpoint on the development set.
Training can be completed within 5 hours on two 2080Ti GPUs. We denote the RADE model pre-trained on cross-domain datasets as RADE (PT), and the model further fine-tuned on task-specific data as RADE (TS).

Baselines
We compare our method with two types of baselines: reference-based and reference-free methods.
The reference-free baselines include: DialoRPT (Gao et al., 2020b), which is trained on large-scale social media feedback data to predict ranking-based scores; GRADE (Huang et al., 2020), which enhances contextualized representations via topic-level commonsense graphs and predicts the score with a regression module; FED (Mehri and Eskenazi, 2020a), an unsupervised dialogue evaluation model based on DialoGPT; UniEval (Zhong et al., 2022), which evaluates the response from multiple perspectives; and QuestEval (Scialom et al., 2021), which evaluates fact-based text using summarization tasks.
The reference-based baselines include: RUBER (Tao et al., 2018), an unsupervised evaluation metric considering the similarity of the response to the dialogue context and the reference; BERTScore (Zhang et al., 2020b), which employs BERT to greedily match the response and the ground truth at the token level; BLEURT (Sellam et al., 2020), a BERT-based model pre-trained with millions of synthetic examples; and BARTScore (De Bruyn et al., 2020), which weights the log-likelihood of the generated response as the score. We also test three reference-based lexical-level metrics: ROUGE-L, BLEU-2, and METEOR. Moreover, we implement two reference-based baselines, BERT-MLP and BART-MLP, which are trained on the same human-annotated datasets as RADE and provide a fair comparison with our proposed model. Specifically, we obtain the text representation of the dialogue using BERT or BART and then feed it into a multi-layer perceptron to calculate the scores. For a more comprehensive analysis, we also fine-tune the two strongest baselines, QuantiDCE and GRADE, on our cross-domain datasets and our self-collected datasets, respectively.

Experimental results
Overall performance. Table 3 shows the experimental performance of all methods. Overall, RADE achieves the best performance on the three benchmarks in terms of all metrics. Concretely, the pre-trained model RADE (PT) achieves a better or comparable correlation with human judgment than the best baseline method on the three dialogue tasks.
The task-specific model RADE (TS), fine-tuned with the newly collected reference-assisted data, establishes a new state of the art, improving performance by about 30% on average compared to RADE (PT). For example, RADE (TS) achieves r = 0.601 and ρ = 0.569 in the ChitChat domain, and pushes r to 0.863 (0.314 absolute improvement) and τ to 0.685 (0.287 absolute improvement) in the EmpaDial domain. This result suggests that training with in-domain datasets is critical to enhancing the task-specific evaluation capability of RADE. For a more comprehensive comparison, we also train the two strongest baselines (QuantiDCE and GRADE) on our cross-domain and self-collected datasets, respectively; the results and analysis are provided in Appendix A.2.3.
Generalizability. We find that the performance of reference-free methods varies dramatically across domains. For example, GRADE and QuantiDCE, trained in the chitchat domain, achieve high correlations with human judgment on ChitChat and EmpaDial but perform poorly on PersonaChat. This indicates that the contextual representation capabilities of unsupervised methods are limited by their training data and are therefore prone to data-induced bias, which decreases their performance in agnostic scenarios. In contrast, the gap for the proposed RADE (PT) across different domains is relatively small. These results indicate that RADE generalizes better than reference-free methods, owing to the assistance of the reference and the proposed cross-domain training strategy.
Results on USR benchmarks. We further examine our methods on the two USR datasets (Mehri and Eskenazi, 2020c) to verify the efficiency and robustness of RADE when generalizing to existing dialogue evaluation benchmarks. The results are listed in

Ablation study
We perform an ablation study to investigate the influence of different components of our method. We examine two ablative variants: (1) w/o L_PR: we remove the ranking-based loss L_PR to verify its effectiveness; (2) w/o L_GEN: we remove L_GEN to verify that jointly training with the response generation task improves the correlation with human judgment. Table 3 presents the results. Overall, the variants show decreased performance compared to the full model. For example, the Pearson correlation drops by 0.10, 0.09, and 0.07 on the three benchmarks, respectively, after L_PR is removed. This indicates that the ranking-based loss enhances performance by explicitly building the relation between response and reference. After removing L_GEN, the correlation on all benchmarks decreases prominently, e.g., the Spearman correlation drops by 0.15, 0.10, and 0.09, respectively. These results suggest that the auxiliary response generation task improves the representation capability of our method and relieves the one-to-many problem.

Case study
Our case studies demonstrate that RADE is more consistent with human judgment than baselines.Details about our case studies are available in Appendix A.2.5.

Qualitative analysis
To provide a more intuitive picture, Figure 3 shows scatter plots against human judgments for different automatic evaluation methods (i.e., RADE, GRADE, BERTScore, METEOR) on the EmpaDial dataset. As shown in Figure 3 (a), RADE achieves a stronger correlation with human judgment than the other methods. Figure 3 (b) shows that GRADE also correlates reasonably well with human judgments; however, its predicted scores are concentrated in the high-scoring band, resulting in low discrimination between responses. RADE uses the reference as a benchmark and thus yields a more balanced distribution of predicted scores.

Discussions
The impact of the training data scale. To explore the minimum data scale required by our method, we train RADE with different amounts of randomly sampled annotated data. We observe only minor degradation in RADE's performance as the amount of data decreases.

The difference between golden and candidate responses. A golden response implies a scenario in which there is only one correct response, and any different response receives a low score; for example, BERTScore calculates the cosine similarity between the golden and model-generated responses. Candidate responses, by contrast, imply that there can be multiple correct answers, which is more flexible and human-intuitive, and RADE is optimized to align with this human intuition via the generative and pairwise ranking losses. If more references were available, RADE could consider multiple valid responses to make more reliable evaluations; to achieve this, we could concatenate model-generated responses with several different references. However, given the limitation of our datasets, we concatenate one reference with the model-generated response before feeding them to the encoder.
Employing RADE when the reference response is not available. Since a reference is not always available in real-world scenarios, we design two alternatives to enable RADE, i.e., constructing a pseudo-reference via a retrieval or a generative method. We verify both solutions on the FED dataset; details can be found in Appendix A.3.

Conclusion
We have presented a new reference-assisted dialogue evaluation (RADE) method to address the one-to-many problem in evaluating open-domain dialogue systems. RADE evaluates responses generated by open-domain dialogue agents with the assistance of a reference response. In addition, we have curated reference-assisted dialogue evaluation datasets by expanding three existing datasets via pairwise human annotation; the extended datasets contain over 10K dialogues. Extensive experiments on the three extended datasets and two existing benchmarks have verified the effectiveness, robustness, and generalizability of the proposed method.

Limitations
The main limitation of this paper is the need for human-labeled reference responses.We will explore automated or human-machine collaboration methods to reduce the cost of annotation in the next stage.Another limitation is that we need to explore whether other auxiliary tasks can also enhance the performance of score prediction.In the future, we also plan to reproduce the proposed method for other, less resource-rich languages.

Ethics Statement
The paper proposes a dialogue evaluation method, which is intended to evaluate open-ended dialogue on topics such as books and movies.A new dataset is developed using some existing dialogue systems, such as DialoGPT, which are trained on large-scale web data that is known to contain biased or discriminatory content.The datasets that we trained on may also include subjective knowledge (comments on movies) that may express the bias of the writers.

A Appendix
A.1 Human Evaluation Details

A.1.2 Annotation Guideline
Table 6 provides detailed instructions that help the annotators understand the setting of our annotation task.
Annotation Guideline Instruction You need to read the context of each conversation to understand its specific situation. Afterward, compare the two responses and determine which is better on the given metric. Since we have given a score to the reference response, you should take it as the benchmark when rating the generated response.
Dataset (1) context: The historical interaction between the two partners.
(2) (reference, s_h): The reference response and its corresponding score.
(3) response: The response generated by the agent, which you need to rate.
Rating Details (1) If the generated response is better, the score you give should be higher than s_h.
(2) If the generated response is worse, the score you give should be lower than s_h.
(3) If there is no significant difference between the two responses, you can give the same score as s_h.
Table 6: The guideline used for our human annotation.

A.1.3 User Study
A dialogue can be evaluated from multiple perspectives. Some perspectives are universal for assessing all dialogue agents, e.g., fluency and relevance, while other metrics apply only to task-specific dialogue agents. For example, emotional awareness is a critical property for empathetic dialogue but is less important for persona dialogue. Therefore, we first simplify by sorting the possible aspects into two categories, i.e., the general view and the task-specific view. The former contains relevance, engagingness, and fluency, while the latter consists of understandability, emotional awareness, and personality awareness, which correspond to chitchat dialogue, emotional dialogue, and persona dialogue, respectively. To understand the relation between the sub-metrics and overall quality, we conduct a user study of preferences for the different sub-metrics. Specifically, we invite 20 experts and 80 users, each of whom is asked to select the four most important sub-metrics. The results are shown in Figure 4. The approval rates reflect user preference for the different sub-metrics and are used as weights to calculate the overall score. Moreover, we apply the softmax function to these weights to make them more interpretable.

A.2.1 Datasets for Pre-train Stage
Our training process includes two stages, i.e., cross-domain pre-training and task-specific fine-tuning. We first pre-train the model on the diverse open-domain dialogue datasets listed in Table 7 with the objective L_Cross. The next stage relies on task-specific datasets with the objective L_In (see Section 5).
These datasets are collected from https://github.com/e0397123/dstc10_metric_track and contain a variety of open-domain dialogues, including emotional dialogue, personalized dialogue, knowledge-grounded dialogue, and chitchat. Every example contains the dialogue context, the response generated by a dialogue agent, the pre-created reference response, and the score of the generated response, annotated by at least three people from several perspectives. We use the cross-domain datasets for pre-training to improve the robustness and generalizability of the model across different evaluation scenarios. The experimental results show that RADE outperforms the state-of-the-art reference-free and reference-based methods on the USR-TopicalChat dataset. For example, we push the Pearson correlation to 48.0% (7% absolute improvement) and the Spearman correlation to 46.6% (4% absolute improvement). Moreover, RADE shows a stronger correlation with human judgment than existing reference-based methods on the second dataset, achieving comparable or even better results than the reference-free methods except for USL-H. The results demonstrate that our pre-trained model is robust even in agnostic scenarios.
We also compare existing methods on the two benchmarks, and the results suggest a similar phenomenon to Table 3. First, the reference-free methods achieve better consistency with humans than the reference-based methods, i.e., the former reach at best r = 41.2% and ρ = 42.3% while the latter reach r = 34.2% and ρ = 34.8% on the USR-TopicalChat dataset. However, the reference-free methods suffer from larger variance. For example, MAUDE gets r = 0.345 and ρ = 0.298 on the USR-PersonaChat dataset but only r = 0.044 and ρ = 0.083 on the USR-TopicalChat dataset. This indicates that reference-free methods are more vulnerable and prone to data-induced bias.
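The r, ρ, and τ reported throughout are the standard Pearson, Spearman, and Kendall correlations between metric scores and human ratings. For concreteness, a self-contained sketch of how these are computed (in practice one would use a statistics library such as scipy.stats); the tie-handling and the τ-a variant here are simplifying assumptions:

```python
def pearson(x, y):
    """Pearson's r: linear correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    """Rank values (average ranks for ties), 1-based."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    """Kendall's tau (tau-a): (concordant - discordant) / all pairs."""
    n = len(x)
    num = sum(
        (1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else
         -1 if (x[i] - x[j]) * (y[i] - y[j]) < 0 else 0)
        for i in range(n) for j in range(i + 1, n)
    )
    return num / (n * (n - 1) / 2)
```

Spearman and Kendall depend only on rank order, which is why a metric can have a low Pearson r (non-linear score scale) but still rank responses consistently with humans.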

A.2.5 Case Study
To explain more intuitively, we show examples of automatic evaluation and compare them with human judgment in Tables 10, 11, and 12, suggesting that the scores of our methods are closer to human ratings.

A.3 Pseudo Reference
Since the original FED dataset does not provide reference responses, we construct a pseudo-reference via a retrieval or a generative method. The former retrieves a reference from a curated response corpus built from our cross-domain datasets via BM25, with the dialogue context as the query. The latter generates a reference with a large language model, GPT-3, based on the dialogue context. The results show that RADE(PT) obtains Pearson's r = 0.381 and Spearman's ρ = 0.368 with the retrieved reference, and Pearson's r = 0.343 and Spearman's ρ = 0.347 with the generative reference, outperforming the state-of-the-art baseline (QuantiDCE, Pearson's r = 0.319, Spearman's ρ = 0.323).
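The retrieval route can be sketched as follows: score every candidate response in the corpus against the dialogue context with BM25 and take the best match as the pseudo-reference. This is a minimal self-contained BM25 implementation for illustration (a real system would use an existing retrieval library and a much larger corpus); the corpus and query below are made up:

```python
import math
from collections import Counter

def bm25_retrieve(query, corpus, k1=1.5, b=0.75):
    """Return the corpus entry that best matches `query` under BM25,
    to serve as a pseudo-reference for the dialogue context."""
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency for each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    def score(d):
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        return s

    best = max(range(n), key=lambda i: score(docs[i]))
    return corpus[best]

# Toy corpus of candidate responses (hypothetical).
corpus = [
    "I love online shopping",
    "the weather is nice today",
    "libraries keep old books",
]
retrieved = bm25_retrieve("do you shop online", corpus)
```

Note this sketch does no stemming or stop-word removal, so only exact token overlap with the context contributes to the score.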
To further validate the generalizability of our method, we evaluate the proposed RADE(PT) on another challenging benchmark, GRADE-DailyDialogue. RADE(PT) achieves Pearson's r = 0.356 and Spearman's ρ = 0.370, with 5% and 2% relative improvements over the state-of-the-art baseline, indicating that our method can generalize to more challenging benchmarks.

Context
User1: The library of Alexandria had a unique way of gathering books by making all ships visiting give their books for copy.
User2: They must have had an impressive collection!
User1: How unfair, they would copy their books, give them back the copy, and keep the originals.

Reference
I guess that is true. Do you think we will ever have a centralized digital library of all our works and knowledge?

Response
That is exactly why they didn't stick around and stay put. I wish I could work somewhere where I could share the workload.

Emotion: Excited

Context
User1: I am looking forward to my best friend's surprise birthday party next week!
User2: That sounds like so much fun! I love parties!
User1: I am really happy about it. She is a great friend and she is turning 40, so it is a big one!

Reference
Hey, I just had that one! What do you have planned?

Response
That is great. Do you have any other day?

Personality
(1): I like to donate time and money to charity.
(2): I work as a computer programmer.

Reference
Would you like to marry one of my four attractive daughters?I will sell one.
Response
Wow! That's a lot of money. Do you have any hobbies?
(2): I work at a grocery store.

Response
What do you do for fun?My girlfriend and I go to the lake a lot.

Figure 1: An example to explain the one-to-many nature of open-domain dialogues.

Figure 2: Left: An overview of our model, which consists of an encoder, a regression layer, and a response generator. Right: Our two-stage training process with cross-domain pre-training (PT) and task-specific fine-tuning (TS).

Figure 3: Score correlation between automatic evaluation and human evaluation on the EmpaDial domain. The horizontal axis indicates the different automatic evaluation methods, and the vertical axis indicates the human rating.
Figure 3(d) illustrates that METEOR scores are zero or extremely low for most responses. This results from the one-to-many nature of open-domain dialogue, where word overlap occurs only occasionally. Figure 3(c) suggests that the BERTScore values are mainly concentrated in the range 0.3-0.6, indicating no significant differentiation between the different responses.

Figure 4: Result of the two-role user study.

Context
User1: Hi, how are you? Tell me something about yourself!
User2: Well, I love going fishing with my four attractive daughters every day.
User1: Sounds fun! I enjoy volunteering and donating to charities.
User2: Cool, maybe you'd like to run a charity at my new race track. I race cars!
User1: Sounds exciting! I am a computer programmer, which pays over 200k a year.

Context
User1: Hi! What are you up to? I am doing what I like best, eating cupcakes!
User2: Hi, I am winding down from a day at work.
User1: So am I. The local grocery store is where I work. What about you?
User2: I also work in the retail industry at the local department store!
User1: Other than eating cupcakes, reading is also what I like to do to wind down.

Reference
I like to read also and play with my dog. Do you have a pet?

Table 2: The statistics of the collected datasets. For each example, the overall score of the response is the mean of all sub-metric scores.

Table 3: Results on three benchmarks. The metrics r, ρ, and τ indicate Pearson's r, Spearman's ρ, and Kendall's τ. All values are statistically significant with p-value < 0.05 unless marked by *. Methods with † are implemented by ourselves. We underline the best results of each group of baseline methods and bold the best results of all methods. The bottom of the table shows the ablation study, where the proposed RADE is compared with several variants (-w/o: without). See Section 7.2 for details.

Table 4: Experiments show that RADE, which has not been explicitly trained on these datasets, achieves better or comparable results to previous supervised methods. See Appendix A.2.4 for more results and details.
As shown in Table 5, we extend the DSTC dataset with Blenderbot and DialoGPT, the Empathetic Dialogue dataset with KEMP, MoEL, MIME, and EmpDG, and the Persona-Chat dataset with Blenderbot and PersonaGPT. Since Roller et al. point out that the length of the utterances is crucial to human judgments, i.e., too-short responses are seen as dull, we only sample examples with at least two turns of interaction and an average utterance length of no more than 25 words.

Table 5: The data distribution of seven well-performing dialogue models, which are used to extend the corresponding datasets.

Table 7: Datasets used for the pre-training stage. AVG.Utts: the average number of utterances per dialogue; AVG.Words: the average number of words per dialogue.
We show the details of our automatic evaluation experiments in Table 9. The BERTScore and BLEURT are computed based on the large version of RoBERTa. As in Section 6, we implement two reference-based baselines, BERT MLP and BART MLP, using the same human-annotated datasets as RADE for training, to provide a reasonable comparison with our proposed model. Specifically, BERT MLP is built on the base version of BERT (Devlin et al., 2019), while BART MLP is built on the base version of BART (Lewis et al., 2020).
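The BERT MLP and BART MLP baselines share one architecture: a pre-trained encoder whose pooled output feeds a small MLP that regresses a scalar quality score. The paper does not give the head's exact shape, so the sketch below assumes a single tanh hidden layer; all dimensions and weights are toy illustrations (a real implementation would use the encoder and a learned head, e.g., in PyTorch):

```python
import math

def mlp_regression_head(pooled, w1, b1, w2, b2):
    """Map a pooled encoder embedding to a scalar quality score.

    hidden = tanh(W1 @ pooled + b1); score = w2 . hidden + b2.
    This is the regression head assumed for the BERT-MLP / BART-MLP
    baselines; the encoder producing `pooled` is omitted.
    """
    hidden = [math.tanh(sum(w * x for w, x in zip(row, pooled)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

# Toy 4-dim "pooled embedding" and fixed illustrative weights.
pooled = [0.1, -0.2, 0.3, 0.05]
w1 = [[0.5, -0.1, 0.2, 0.0], [0.0, 0.3, -0.4, 0.1]]
b1 = [0.0, 0.1]
w2 = [1.0, -0.5]
b2 = 3.0  # bias near the middle of a 1-5 rating scale
score = mlp_regression_head(pooled, w1, b1, w2, b2)
```

Training would fit these weights by minimizing a regression loss (e.g., MSE) against the human-annotated scores, which is what "using the same human-annotated datasets as RADE" refers to.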

Table 8: Results on USR-TopicalChat, USR-PersonaChat, and GRADE-DailyDialogue. We divide the methods into reference-free, reference-based, and RADE, where the reference-based methods include semantic-level and lexical-level metrics. The metrics r, ρ, and τ indicate Pearson's r, Spearman's ρ, and Kendall's τ. All values are statistically significant with p-value < 0.05 unless marked by *. We underline the best results of each group of baseline methods and bold the best results of all methods.
love when I need to show someone how to do something and everything goes off without a hitch.
User2: Oh yeah, that is always nice. Sometimes they just do not get it, though.
User1: Fortunately, it is usually just not having the best answer for what they ask. I have learned and prepared a lot more this time.

Reference
Wow, you sound pretty confident! You must really know what you are talking about.