DynaEval: Unifying Turn and Dialogue Level Evaluation

A dialogue is essentially a multi-turn interaction among interlocutors. Effective evaluation metrics should reflect the dynamics of such interaction. Existing automatic metrics focus heavily on turn-level quality while ignoring such dynamics. To this end, we propose DynaEval, a unified automatic evaluation framework which is not only capable of performing turn-level evaluation, but also holistically considers the quality of the entire dialogue. In DynaEval, a graph convolutional network (GCN) is adopted to model a dialogue in its totality, where the graph nodes denote individual utterances and the edges represent the dependencies between pairs of utterances. A contrastive loss is then applied to distinguish well-formed dialogues from carefully constructed negative samples. Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model, and correlates strongly with human judgements across multiple dialogue evaluation aspects at both the turn and dialogue level.


Introduction
Modern dialogue systems (Adiwardana et al., 2020) leveraging large-scale language model pre-training (Devlin et al., 2019; Radford et al., 2019) are capable of generating fluent and contextually relevant utterances. Yet, they still face difficulties in mimicking human conversations in the sense that they lack certain conversation-level attributes, such as coherence (Cervone et al., 2018), consistency (Welleck et al., 2019; Nie et al., 2020), diversity (Li et al., 2016; Wu et al., 2020) and engagement (Ghandeharioun et al., 2019; Ghazarian et al., 2020). One of the main reasons is the dearth of effective dialogue-level evaluation mechanisms to guide the studies and to monitor progress. Our code is available at https://github.com/e0397123/DynaEval. Commonly used static metrics, such as BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and ROUGE (Lin, 2004), correlate poorly with human judgements (Liu et al., 2016), rendering them unsuitable for dialogue evaluation. While some recent automatic dialogue evaluation metrics (Ghazarian et al., 2019; Mehri and Eskenazi, 2020b; Zhang et al., 2021b) demonstrate strong correlations with human judgement at the turn level, they only focus on context-response pairs without explicitly modeling the interaction over an entire dialogue. To perform dialogue-level evaluation, we need to rely on the aggregation of turn-level scores over the dialogue as a proxy for a dialogue-level score.
Furthermore, a recent study by Mehri and Eskenazi (2020a) found that even though state-of-the-art chatbots outperform humans across multiple turn-level evaluation criteria, such as interestingness, engagement and specificity, their dialogue-level ratings, such as coherence, likability and diversity, are still far below the human level. This further reinforces the idea that turn-level quality evaluation may be insufficient to assess the performance of open-domain dialogue systems.
In this work, we address the problem of automatic open-domain dialogue evaluation by focusing on the quality of an entire dialogue. This is a departure from the common practice of framing the problem as a weakly supervised next sentence prediction task (Mehri and Eskenazi, 2020b; Sato et al., 2020) or a language modeling task (Nedelchev et al., 2020; Pang et al., 2020) over context-response pairs. To this end, we need to answer two important questions: (1) How to effectively represent the entire dialogue? (2) How to incorporate this dialogue-level knowledge into our evaluation framework? We propose DynaEval to provide a meaningful dialogue-level representation with explicit modeling of the interactive dynamics among interlocutors, for a unified turn and dialogue level quality assessment.
The main contributions of this work include: (1) The unified turn and dialogue level evaluation represents a departure from the turn-level evaluation scheme; (2) DynaEval is one of the first few metrics where dialogue-level dynamics are considered with a structured graph representation; (3) Empirical results show that DynaEval outperforms the state-of-the-art dialogue coherence model and strongly correlates with human judgements at both the turn and dialogue level.
Related Work

Open-ended Dialogue Evaluation
Turn-Level Evaluation The current trend for automatic dialogue evaluation is shifting towards the reference-free paradigm. Lately, the research community has witnessed a surge of automatic metrics along these lines. Many of them focus on evaluating the naturalness of generated responses. Typical examples include perplexity (Adiwardana et al., 2020), USR-MLM (Mehri and Eskenazi, 2020b) and GPT-2 (Radford et al., 2019) based fluency metrics (Nedelchev et al., 2020; Pang et al., 2020).
Another group of metrics evaluates the contextual relevance of responses. For example, RUBER (Tao et al., 2018), BERT-RUBER (Ghazarian et al., 2019) and USR-DR (Mehri and Eskenazi, 2020b) predict the relatedness between generated responses and the corresponding context by training a discriminative network to distinguish the original response from negative samples bootstrapped from the training set. Sato et al. (2020), among others, provide better sampling strategies for bootstrapping negative samples.
Even though all these automatic metrics demonstrate strong correlation with human judgements, each is laser-focused on a single aspect of evaluation. In addition, they do not explicitly model the speaker-level and utterance-level interactions, which we believe are essential for dialogue-level representation and ultimately benefit the dialogue evaluation task.
Interactive Evaluation A popular human evaluation method is the interactive evaluation whereby human judges converse with dialogue systems and make the assessment at the end of the conversations (See et al., 2019;Finch and Choi, 2020;Deriu et al., 2020). It has been shown to be more reliable than turn-level static evaluation (Mehri and Eskenazi, 2020a).
There are few studies on fully automating this process. Ghandeharioun et al. (2019) propose a self-play scenario where the dialogue system chats with itself, and a combination of three metrics, measuring sentiment, semantic coherence and engagement respectively along the conversation trajectory, is computed to approximate dialogue-level quality estimation. Mehri and Eskenazi (2020a) propose the FED metric, which evaluates the quality of a system utterance in an interactive setting by computing the likelihood of a particular follow-up utterance produced by DialoGPT. Moreover, Sinha et al. (2020) propose MaUde, a reference-free metric tailored for online dialogue evaluation, which leverages a pre-trained DistilBERT (Sanh et al., 2019) model to extract the semantic representation of dialogue turns and uses a bidirectional LSTM to explicitly model the discourse structure.
While the interactive evaluation is more reliable than the turn-level static evaluation, it still relies on the aggregation of turn-level scores. An ideal approximation of the human evaluation process is a top-down approach whereby we examine the quality of the entire dialogue at macro level before zooming into the dialogue turns. Hence, a unified framework, which holistically models the entire dialogue, is highly sought after.

Dialogue Coherence
Examining a dialogue at the macro level is related to discourse coherence (Halliday and Hasan, 2014; Grosz et al., 1995; Barzilay and Lapata, 2008), which considers whether a piece of text is organized in a consistent and logical manner, as opposed to being a random collection of sentences. Dialogue is a special kind of discourse structure, for which coherence assessment is an essential part of quality evaluation.
Many studies have followed the standard discourse coherence evaluation protocol (Cervone and Riccardi, 2020; Zhou et al., 2019; Mesgar et al., 2020). Very few have considered customizing their dialogue coherence models for evaluating the performance of dialogue systems. It is common to leverage supervised approaches (Higashinaka et al., 2014; Gandhe and Traum, 2016; Cervone et al., 2018; Yi et al., 2019) that are closely linked to modeling with entities and dialogue acts (Cervone and Riccardi, 2020; Zhou et al., 2019; Mesgar et al., 2020).
Hence, we are motivated to study the application of dialogue coherence modeling for automatic dialogue evaluation by designing a self-supervised framework, without dependence on any human annotations for coherence features.
GNN is useful for dialogue modeling because the relative positions of target and context utterances determine how past utterances influence future utterances and vice versa (Ghosal et al., 2019). The interaction of utterances can be effectively captured with a graph structure as long as they are connected by relation-aware edges. However, GNN has not been well studied for dialogue evaluation. The recently proposed GRADE metric leverages graph modeling for turn-level coherence evaluation. Our use of GNN differs from GRADE in two ways: GRADE focuses on turn-level coherence evaluation, while we are interested in joint turn- and dialogue-level evaluation; furthermore, GRADE considers the keywords in context-response pairs, whereas we explicitly use the graph structure to model the speaker- and utterance-level interaction within a dialogue.

DynaEval Framework
DynaEval represents an integration of several ideas: it takes advantage of the structured graph representation of dialogues, it exploits useful information on utterance- and speaker-level interaction, and it is motivated by dialogue coherence modeling.
In this paper, we only consider dyadic dialogues, but the formulation can be easily generalized to multi-party conversations. Formally, let A and B denote the two speakers participating in the dialogue. A dialogue, D, consists of a sequence of n utterances, [u_1^A, u_2^B, ..., u_{n-1}^A, u_n^B]. Let D̃ represent the negative dialogue sample obtained via the sampling strategies described in Section 3.5. Figure 1 illustrates the learning process of DynaEval in four steps: (1) Deriving the contextualized representation, e_i, for each utterance within D (Section 3.1). (2) Constructing the directed dialogue graph, whose nodes are initialized with e_i and whose edges between node pairs represent the speaker and temporal dependencies (Section 3.2). (3) Generating the utterance-level graph representation, h_i, via feature transformation, which aggregates useful contextual information from all connected neighbours into the current node (Section 3.3). (4) Producing a dialogue-level score, which indicates whether D is preferred over D̃ (Section 3.4).

Dialogue Utterance Representation
A sentence encoder is needed to map the individual utterances within D onto a vector space. Firstly, we fine-tune a RoBERTa-base pre-trained language model (Liu et al., 2019) on training data of the target dialogue domain, because task-adaptive fine-tuning of the pre-trained language model on target domain data benefits the final performance (Gururangan et al., 2020; Lee and Li, 2020). Next, a mean pooling operation is performed on the token embeddings within each utterance of D to derive the respective utterance-level representations. Formally, let SRoBERTa denote the sentence encoder; each u_i^* in D is mapped into a vector representation, u_i ∈ R^d, whereby

u_i = SRoBERTa(u_i^*)

Note that * can be either speaker A or speaker B. Then, to capture a more fine-grained temporal dependency among the utterances, a bidirectional LSTM is adopted to model the sequential flow of information within D. The context-aware utterance representation, e_i, is then obtained via:

e_i = BiLSTM(u_1, ..., u_n)_i
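The two encoding steps above can be sketched in PyTorch; this is a minimal illustration rather than the authors' implementation, and it assumes the token embeddings produced by the fine-tuned SRoBERTa encoder are already available (all dimensions and class names here are invented for the sketch):

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Mean-pool token embeddings per utterance, then run a
    bidirectional LSTM over the utterance sequence (Section 3.1)."""
    def __init__(self, token_dim=768, hidden_dim=384):
        super().__init__()
        self.bilstm = nn.LSTM(token_dim, hidden_dim,
                              bidirectional=True, batch_first=True)

    def forward(self, token_embs):
        # token_embs: (n_utts, n_tokens, token_dim) token embeddings per utterance
        u = token_embs.mean(dim=1)          # u_i: utterance-level representation
        e, _ = self.bilstm(u.unsqueeze(0))  # e_i: context-aware representation
        return e.squeeze(0)                 # (n_utts, 2 * hidden_dim)

enc = UtteranceEncoder()
e = enc(torch.randn(6, 20, 768))  # a dialogue of 6 utterances, 20 tokens each
```

With hidden_dim = 384 the BiLSTM output dimension matches RoBERTa-base's 768, but this is a free design choice rather than a requirement.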

Dialogue Graph Construction
D is represented with a directed graph, G = (V, E), where V is the set of graph nodes and E is the set of edges, which reflect the contextual dependencies among utterance pairs.

Figure 1: The architecture of DynaEval. The input is a pair of contrasting dialogues, D and D̃. The output is a unified score indicating whether D is preferred over D̃. Utterance-level representations derived from the SRoBERTa model are used for dialogue graph node initialization. Different types of arrows in the relation edge connections represent different types of relations: (1) solid lines denote intra-speaker dependency; (2) dashed lines denote inter-speaker dependency; (3) red denotes self-connection; (4) purple denotes connections from future utterances to previous utterances; (5) yellow denotes connections from previous utterances to future utterances. Since there are two speakers, A and B, there are a total of 2 × 2 × 2 + 1 = 9 distinct relation types.
Graph Nodes Each graph node corresponds to an utterance within D. Hence, for a dialogue with n utterances, V contains n nodes. All the graph nodes are initialized with the utterance-level contextualized embeddings: v_i = e_i.
Edges For short conversations, G is a fully-connected graph whereby all graph nodes are connected to each other, including self-connections. The intuition is that short conversations tend to focus on a single topic and thus, each utterance is contextually dependent on all the other utterances in the dialogue. For long conversations, there may be frequent topic shifts. Distant utterances within the same dialogue may not be contextually relevant to the current utterance, and sometimes adding more context leads to diminishing performance gain or even negative impact (Zhong et al., 2019). Therefore, a context window length, M, is set, which means that v_i is only connected to the nodes v_j with |i − j| ≤ M. Let v_ij ∈ E denote the edge from v_j to v_i. Each edge is associated with an edge weight, a_ij, and a relation type, θ_ij, which are defined as follows:

Edge Weights The edge weight determines the relative importance of the neighbour nodes w.r.t the current node. A similarity-based attention module is applied to determine the edge weights. For a graph node, v_i, the set of weights, a_i, w.r.t all its incoming edges should sum up to 1. The attention weights are formulated as:

a_ij = softmax_j(e_i^T W_e e_j), for all v_ij ∈ E

In this way, more importance is placed upon neighbouring utterances on the same topic, and little attention is paid to irrelevant utterances.

Edge Relations Following (Ghosal et al., 2019), there are two aspects to take into account when defining the relation types. One aspect is to capture speaker dependencies, because we want to model the interaction between the interlocutors in a dialogue. The other aspect is to consider the temporal dependencies, which pertain to the relative position of one utterance w.r.t another. The explicit modeling of such dependencies is important since the ordering of utterances within a dialogue is an essential feature for learning dialogue coherence. With these considerations, the total number of distinct relation types is 2 (u_i is uttered by speaker A or B) × 2 (u_j is uttered by speaker A or B) × 2 (u_i occurs before or after u_j) + 1 (self-connection) = 9.
These relations are depicted with different arrows connecting the graph nodes in Figure 1. We define this set of 9 relation types as Θ, with θ_ij ∈ Θ.
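The node and edge construction described above can be sketched in plain Python; the relation encoding below is illustrative (speaker of u_i × speaker of u_j × temporal order, plus a self-connection type), and the function name is invented for the sketch:

```python
def build_edges(speakers, M=None):
    """Sketch of the Section 3.2 edge construction.
    speakers: list like ['A', 'B', 'A', ...] giving the speaker of each utterance
    M:        context window length (None means fully connected)
    Returns {(i, j): relation_type} for an edge from node j to node i."""
    n = len(speakers)
    edges = {}
    for i in range(n):
        for j in range(n):
            if M is not None and abs(i - j) > M:
                continue  # only connect utterances within the window
            if i == j:
                edges[(i, j)] = "self"  # self-connection: the 9th relation type
            else:
                # 2 (speaker of u_i) x 2 (speaker of u_j) x 2 (temporal order) = 8
                order = "past" if j < i else "future"
                edges[(i, j)] = (speakers[i], speakers[j], order)
    return edges

edges = build_edges(["A", "B", "A", "B"])  # short dialogue: fully connected
relations = set(edges.values())            # 8 directed relations + self = 9
windowed = build_edges(["A", "B", "A", "B"], M=1)
```

For an alternating dyadic dialogue of at least four utterances, all 9 relation types appear, matching the count given in the figure caption.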

Feature Transformation
This section describes the process of transforming the initial node representation, e_i, into a speaker- and context-aware vector representation, h_i, which captures the dynamics of interaction w.r.t u_i^*. Basically, the whole process is a two-stage graph convolution.
The first stage aggregates information from neighbourhood nodes into the current node v_i based on the relation-aware transformation motivated by (Schlichtkrull et al., 2018), whereby edges of different relation types are associated with different transformation matrices, W_θ:

h'_i = σ( Σ_{θ∈Θ} Σ_{j∈S_i^θ} (a_ij / c_{i,θ}) W_θ e_j + a_ii W_0 e_i )    (4)

In Equation 4, h'_i is the intermediate node representation and σ denotes an activation function, such as ReLU. S_i^θ represents the set of indices of the nodes connected to v_i whose edges v_ij have the relation type θ ∈ Θ. a_ij and a_ii are the edge weights of v_ij and v_ii respectively. W_θ ∈ R^{d'×d} and W_0 ∈ R^{d'×d} are learnable parameters of the feature transformation. c_{i,θ} is a problem-specific normalization constant, which can be set as a learnable parameter or fixed in advance.
The second stage applies another graph convolution operation to the intermediate node representations, and the final node representation, h_i, is obtained via:

h_i = σ( Σ_{j∈S_i} W h'_j + W_0 h'_i )    (5)

where W ∈ R^{d'×d'} and W_0 ∈ R^{d'×d'} are the two learnable parameters in the second stage of feature transformation, and S_i is the set of indices of all nodes connected to v_i.
Through Equations 4 and 5, relevant contextual information from neighbouring nodes is effectively accumulated into the current node while irrelevant information is filtered out.
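The two-stage convolution can be sketched in NumPy under simplifying assumptions: σ is ReLU, the normalization constant c_{i,θ} is fixed to |S_i^θ|, and all function and parameter names are invented for this sketch:

```python
import numpy as np

def two_stage_gcn(e, edges, a, W_rel, W0_1, W_2, W0_2):
    """Sketch of Equations 4 and 5 (relation-aware transformation after
    Schlichtkrull et al., 2018).
    e:     (n, d) initial node features e_i
    edges: list of (i, j, rel) -- an edge from v_j to v_i with relation rel
    a:     {(i, j): edge weight a_ij}, including self-weights a_ii"""
    n, d = e.shape
    h1 = np.zeros((n, d))
    for i in range(n):
        acc = a[(i, i)] * e[i] @ W0_1                   # a_ii * W_0 * e_i
        for (ti, j, rel) in edges:
            if ti != i or j == i:
                continue
            c = sum(1 for (x, y, r) in edges
                    if x == i and r == rel and y != x)  # c = |S_i^theta|
            acc += (a[(i, j)] / c) * e[j] @ W_rel[rel]
        h1[i] = np.maximum(0, acc)                      # intermediate h'_i
    h2 = np.zeros((n, d))
    for i in range(n):
        acc = h1[i] @ W0_2
        for (ti, j, _) in edges:
            if ti == i and j != i:
                acc += h1[j] @ W_2                      # second-stage convolution
        h2[i] = np.maximum(0, acc)                      # final h_i
    return h2

I = np.eye(2)
edges = [(i, j, "future" if j > i else "past")
         for i in range(3) for j in range(3) if i != j]
a = {(i, j): 0.5 for i in range(3) for j in range(3)}
h = two_stage_gcn(np.ones((3, 2)), edges, a, {"past": I, "future": I}, I, I, I)
```

A real implementation would vectorize the relation-wise sums; the loops here only make the correspondence with the equations explicit.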

The Scoring Process
In the scoring step, h_i is first concatenated with e_i to obtain the final utterance representation, g_i. Next, a mean pooling layer is applied to all the utterance representations in a conversation to derive the dialogue-level representation, o:

o = (1/n) Σ_{i=1}^{n} g_i

ō, which corresponds to D̃, is obtained in the same way. A unified score, s_dial or s̃_dial, is derived by passing o or ō through a fully-connected layer.
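The scoring step admits a very small sketch (illustrative names and random values; the fully-connected layer is reduced to a single linear map):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                           # 6 utterances, feature size 4 (illustrative)
e = rng.normal(size=(n, d))           # initial utterance representations e_i
h = rng.normal(size=(n, d))           # graph-transformed representations h_i
g = np.concatenate([h, e], axis=1)    # g_i = [h_i; e_i]
o = g.mean(axis=0)                    # dialogue-level representation (mean pooling)
w, b = rng.normal(size=2 * d), 0.0    # fully-connected scoring layer, reduced
s_dial = float(o @ w + b)             # unified dialogue-level score
```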

Training Setup
Learning Objective Inspired by preference learning approaches, the label, y, for the (D, D̃) pair is defined such that y = 1 when D, the original dialogue, should be scored higher than D̃. The margin ranking loss function,

L = max(0, −y (s_dial − s̃_dial) + margin),

is adopted to train DynaEval.
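Assuming the standard margin ranking formulation (the margin value below is invented for illustration), the objective can be sketched in a few lines:

```python
def margin_ranking_loss(s_dial, s_neg, y=1.0, margin=0.1):
    """Margin ranking loss: pushes the score of the original dialogue D
    above that of the perturbed dialogue D~ by at least `margin`.
    The margin value is illustrative, not the paper's setting."""
    return max(0.0, -y * (s_dial - s_neg) + margin)

loss_easy = margin_ranking_loss(0.9, 0.2)   # D already well separated from D~
loss_hard = margin_ranking_loss(0.9, 0.85)  # within the margin: non-zero loss
```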
Sampling Strategy Two negative sampling strategies are explored in this paper to construct D̃: Utterance Replacement (UR) and Speaker Level Utterance Shuffling (SS).

Utterance Replacement (UR) An utterance randomly selected from a dialogue is replaced with another utterance randomly chosen from a different dialogue. This sampling strategy perturbs a dialogue at the semantic level: an utterance from a different dialogue is considered topically incongruent w.r.t the current dialogue context, and suddenly injecting such irrelevant information breaks the flow of the current dialogue.
Speaker Level Utterance Shuffling (SS) With this strategy, the order of utterances from one speaker in a dialogue is kept the same while that of the other speaker is shuffled. SS changes the coherence structure of a dialogue w.r.t a specific speaker. This strategy is motivated by (Healey et al., 2014), which adopts a "Chance Other" method to measure how much syntactic and lexical repetition of a speaker happens by chance. We do not randomly permute the order of all utterances in the dialogue because that would yield a very simple discrimination task.
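Both sampling strategies can be sketched in a few lines of Python; the function names are invented for illustration:

```python
import random

def utterance_replacement(dialogue, other_dialogue, rng=random):
    """UR: replace one randomly chosen utterance with an utterance
    drawn from a different dialogue (topically incongruent)."""
    neg = list(dialogue)
    neg[rng.randrange(len(neg))] = rng.choice(other_dialogue)
    return neg

def speaker_shuffle(dialogue, speakers, target="B", rng=random):
    """SS: keep one speaker's utterance order intact and shuffle
    the positions of the other speaker's utterances."""
    idx = [i for i, s in enumerate(speakers) if s == target]
    shuffled = idx[:]
    rng.shuffle(shuffled)
    neg = list(dialogue)
    for src, dst in zip(idx, shuffled):
        neg[dst] = dialogue[src]
    return neg

dialogue = ["a1", "b1", "a2", "b2"]
speakers = ["A", "B", "A", "B"]
neg_ur = utterance_replacement(dialogue, ["x1", "x2"], rng=random.Random(0))
neg_ss = speaker_shuffle(dialogue, speakers, rng=random.Random(0))
```

Note that SS preserves the multiset of utterances and speaker A's ordering, while UR changes exactly one utterance.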

Experiments
In this work, we consider two experiment settings to assess the effectiveness of DynaEval. The first setting (Section 4.2) is similar to the studies on dialogue coherence (Cervone et al., 2018; Mesgar et al., 2020), where an accuracy score is applied to evaluate its discrimination capability in distinguishing original dialogues from negative samples. The second setting (Section 4.3) evaluates its dialogue-level and turn-level judgement capability via correlation analysis on human-chatbot conversational datasets. The domain of the evaluation set is different from that of the human-human conversation datasets that DynaEval is trained on.

Dialogue Datasets
Three benchmark open-domain dialogue datasets are included in our experiments: Empathetic Dialogue (Rashkin et al., 2019), ConvAI2 PERSONACHAT (Zhang et al., 2018b; Dinan et al., 2020) and DailyDialog (Li et al., 2017). For training, we remove dialogues containing fewer than 4 utterances or more than 30 utterances. Statistics of the three human-human dialogue corpora after filtering are presented in Table 1.
Empathetic Dialogue is designed for mimicking the real-life human conversation scenario whereby the interlocutors need to recognize and acknowledge the others' feelings in the conversation. This dataset pertains to the short conversation scenario where interlocutors stick to a single topic.
ConvAI2 PERSONACHAT is a crowdsourced dataset where each pair of interlocutors try to get to know each other by conditioning their conversations on their respective persona profiles provided beforehand. The dataset contains more turns per dialogue than Empathetic Dialogue. Hence, topic shift is more likely to occur within a dialogue, and this simulates the long conversation scenario mentioned in Section 3.2.
DailyDialog is a high-quality human-human conversation dataset, which reflects our day-to-day communications and covers different topics about our daily life, such as relationships and health. The average dialogue length of DailyDialog lies between that of Empathetic Dialogue and ConvAI2. Topic shift in the conversations of DailyDialog occurs less frequently than in those of ConvAI2.

The Dialogue-level Discrimination Task
Similar to previous works (Cervone and Riccardi, 2020; Mesgar et al., 2020), we compare DynaEval against three baselines: RANDOM, CoSim (Xu et al., 2018) and S-DiCoh (Mesgar et al., 2020). The RANDOM baseline arbitrarily assigns a label to the input dialogue pairs; it suggests the performance lower bound. CoSim is a common method for dialogue coherence assessment (Xu et al., 2018; Zhang et al., 2018a). It obtains a dialogue-level score by averaging the cosine similarities between sentence embeddings of all adjacent utterance pairs within the dialogue. For a fair comparison, we apply the same procedure described in Section 3.1 to derive the sentence embedding of an utterance in CoSim. S-DiCoh (Mesgar et al., 2020) is a recent state-of-the-art dialogue coherence model. It models a dialogue with a neural network framework consisting of two bidirectional LSTM layers with attention mechanisms at both the token and utterance level.
Results and Analysis It can be observed in Table 2 that on all benchmark dialogue datasets, DynaEval outperforms the baselines in both the UR and SS categories. Even though the dialogue datasets possess different characteristics, as indicated in Section 4.1, DynaEval exhibits robust performance across all the datasets. This confirms our hypothesis that DynaEval provides useful dialogue-level representations for distinguishing the original dialogues from the corresponding negative samples. Especially when compared to S-DiCoh, which models a dialogue sequentially with a bidirectional LSTM and does not explicitly incorporate speaker-level interaction, the structured graph modeling of a dialogue in DynaEval is more effective for capturing both the interaction between the interlocutors and the contextual information within a dialogue. Based on the experimental results, it can be deduced that the discrimination task with the UR strategy is more challenging than that with the SS strategy. The accuracy scores achieved by S-DiCoh in the SS category are much higher than those in the UR category on both datasets. A similar observation can be made w.r.t CoSim and DynaEval on the ConvAI2 dataset. DynaEval performs remarkably well in this task, as it outperforms S-DiCoh by significant margins of 13.97, 18.43 and 8.22 on Empathetic Dialogue, ConvAI2 and DailyDialog respectively. Given these observations, we further hypothesize that a DynaEval model trained with the UR strategy offers more useful dialogue representations for the dialogue evaluation task.

Dialogue Evaluation Task
To validate the above hypothesis, we assess the usefulness of DynaEval in both the dialogue-level and turn-level evaluation tasks. In both settings, Spearman correlations between the scores generated by DynaEval and the corresponding human evaluation scores are computed. The performance of DynaEval is compared against several recently proposed dialogue evaluators.
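Spearman correlation compares the rankings induced by metric scores and human ratings; a small self-contained sketch (no tie handling, with hypothetical scores):

```python
def spearman(x, y):
    """Spearman rank correlation between metric scores and human ratings
    (simple version without tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

rho = spearman([0.2, 0.5, 0.1, 0.9, 0.7], [1, 3, 2, 5, 4])  # hypothetical scores
```

Because only ranks matter, a metric need not match human scores in scale to correlate well, only in ordering.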
Evaluation Dataset FED (Mehri and Eskenazi, 2020a) is a benchmark dataset useful for both dialogue-level and turn-level evaluation. It contains both human-human conversations and human-chatbot conversations, which were collected by the authors of the Meena chatbot (Adiwardana et al., 2020) in an interactive setup. In total, 124 conversations were collected, of which 40 come from interacting with the Meena chatbot, 44 come from interacting with the Mitsuku chatbot and 40 are drawn from human-human conversations. The average number of utterances per conversation is 13.72 and the average number of words per utterance is 9.23. Human quality annotations of these conversations are performed at both the dialogue and turn level. There are 9 quality aspects for turn-level annotations and 11 for dialogue-level annotations, outlined in the first column of Table 3. FED includes 3348 turn-level and 1364 dialogue-level annotations, for a total of 4712. The inter-annotator agreements for all the quality aspects, which indicate the metric performance upper bound, are shown in the last column of Table 3.
Metrics to Compare The recently proposed reference-free state-of-the-art dialogue metrics, including USR (Mehri and Eskenazi, 2020b), BERT-RUBER (Ghazarian et al., 2019) (BERT-R), the GPT-2 based coherence metric (Pang et al., 2020) (GPT-2) and FED (Mehri and Eskenazi, 2020a), serve as the baseline dialogue evaluators. Since USR, BERT-R and GPT-2 are turn-level metrics, aggregation of all the turn-level scores in a dialogue is required for dialogue-level evaluation. For these three metrics, the best dialogue-level correlation scores among all the aggregation strategies are reported in Table 3. For completeness, we report their correlation scores w.r.t different aggregation strategies in Appendix A.2. Similar to DynaEval, S-DiCoh provides a unified score for each dialogue. Based on the insights from Section 4.2, the best performing model in the UR category is chosen to score the dialogues for both S-DiCoh and DynaEval.
Dialogue-level Evaluation DynaEval achieves the highest correlation scores in 8 out of 11 dialogue aspects, including the overall category. For the other three categories, DynaEval attains the second highest correlation scores. We can see that DynaEval significantly outperforms S-DiCoh. These results showcase that the structured graph modeling of a dialogue, with explicit incorporation of speaker- and utterance-level dependencies, provides meaningful dialogue-level representations. Such representations capture information on various dialogue attributes that is beneficial for the dialogue-level evaluation task.
Moreover, BERT-R, GPT-2 and USR are state-of-the-art turn-level evaluation metrics. They evaluate a dialogue based on the aggregation of scores of all the context-response pairs within the dialogue. It can be observed that their correlation scores across individual dialogue aspects are not as high as those of DynaEval. This supports our hypothesis in Section 1 that turn-level quality evaluation may be insufficient to assess the performance of open-domain dialogue systems.
In addition, dialogue aspects including coherence, likability, informativeness and inquisitiveness are highly dependent on the interaction of the interlocutors. Amongst all the dialogue aspects, DynaEval achieves significantly higher scores in these four categories. This is attributable to its incorporation of speaker-level dependency.
Turn-level Evaluation Furthermore, it can be observed that DynaEval achieves the highest correlation in 5 out of 9 categories, including the overall category. This demonstrates that DynaEval is not only useful for holistic evaluation of a dialogue, but also for turn-level evaluation. In this sense, DynaEval serves as a better proxy for the human evaluation process, whereby humans mainly evaluate conversations in a holistic manner and then focus on the problematic turns.
Specifically, DynaEval performs well in turn-level aspects such as relevance, semantic appropriateness and correctness. These aspects correlate highly with dialogue-level attributes such as coherence and understanding, suggesting that the evaluation of these turn-level attributes also benefits from the explicit modeling of the speaker- and utterance-level interaction in a unified framework.
Error Analysis An interesting finding is that DynaEval and FED actually complement each other at both the dialogue and turn level. For example, at the dialogue level, FED performs well in diversity and topic depth, but struggles with coherence and consistency. DynaEval performs well in coherence and consistency, but its performance in diversity is much lower in comparison to FED. This may be because DialoGPT, the backbone of FED, was trained on a large amount of Reddit data, which contains a diverse range of topics and variations of expression, while DynaEval is trained on a single dialogue domain. Moreover, DialoGPT does not explicitly model speaker-level interaction, but DynaEval does. Hence, DynaEval is more useful for evaluating the coherence and consistency aspects of a dialogue. One way to improve DynaEval for evaluating topic depth and diversity is to pre-train it on a large amount of dialogue data covering a variety of topics and then fine-tune it on the target domain.
Another observation is that DynaEval performs significantly worse on the fluency aspect at the turn level than on other turn-level aspects. Additionally, GPT-2, USR and FED, which leverage pre-trained language models, perform significantly better than DynaEval in this category. This may be because DynaEval directly models a dialogue at the utterance level instead of at the token level, while the other metrics adopt the language modeling objective, which focuses more on token-level dependencies, rendering them effective for evaluating the naturalness of a response. A remedy to this problematic aspect of DynaEval is to introduce perturbation strategies targeting the token level, such as word drop, word shuffling and word replacement (Sinha et al., 2020; Park et al., 2021). Such strategies provide negative samples mimicking the nonsensical or ungrammatical responses produced by certain seq2seq generative models. Another simple solution is to combine DynaEval with turn-level metrics specifically designed for evaluating the naturalness of dialogue responses.
Besides the fluency aspect, DynaEval's performance in interestingness, engagement and specificity at the turn level is not as pronounced as that of FED. This may be because purely modeling the dialogue itself is not enough for all the aspects. The model may need to incorporate external knowledge concerning a diverse range of topics to be able to reflect these attributes. The same conclusion can also be drawn from DynaEval's relatively weaker performance in the diversity category at the dialogue level.
Lastly, DynaEval primarily targets open-domain dialogues, where there is no clear or predefined task to perform. When evaluating task-oriented dialogues, task completion takes a more central role. Meta-information such as intents and request types is important for determining task completion, and therefore the evaluation framework will require further adaptation to account for such information when evaluating task-oriented dialogues.

Conclusion & Future Work
DynaEval serves as a unified framework for both turn and dialogue level evaluation in open-domain dialogue. It provides meaningful representations that incorporate information reflecting various important dialogue attributes. Its explicit modeling of speaker and utterance level interaction leveraging GCN has been proven beneficial for the evaluation task. Lastly, the error analysis in Section 4.3 sheds light on how DynaEval can be further improved. DynaEval can also be combined with the specialized turn-level metrics, such as those targeting fluency and engagement, to fully approximate the interactive human evaluation process.

A.1 Utterance-level Pooling Techniques
To derive the dialogue-level representation, we adopted the mean pooling method in DynaEval. In this section, we examine the effects of different pooling methods on the dialogue-level discrimination task. Specifically, we compare the performance of mean pooling against max pooling, and against the concatenation of sentence vectors derived with both mean and max pooling. The performance comparison is presented in Table 4. It can be observed that the performance difference across the various pooling strategies is not statistically significant.

A.2 Aggregation Strategies for Turn-Level Metrics

Four strategies are considered for aggregating turn-level scores into a dialogue-level score: (1) Mean; (2) Sum; (3) Max; (4) Multiplication. The dialogue-level correlation coefficients of USR, BERT-RUBER and the GPT-2 based coherence metric are reported in Table 5, Table 6 and Table 7 respectively. Note that for turn-level metrics leveraging the language model objective, we do not consider token-level aggregation variants. Instead, we follow the same formulations as in the original papers. For example, the GPT-2 based coherence metric (Pang et al., 2020) computes a turn-level score by averaging the token-wise conditional log probabilities in the corresponding response.
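The four aggregation strategies for turning turn-level scores into a single dialogue-level score can be sketched directly (the scores below are hypothetical):

```python
import numpy as np

turn_scores = np.array([0.2, 0.5, 0.4, 0.9])  # hypothetical turn-level scores
dialogue_score = {
    "mean": turn_scores.mean(),  # (1) Mean
    "sum": turn_scores.sum(),    # (2) Sum
    "max": turn_scores.max(),    # (3) Max
    "prod": turn_scores.prod(),  # (4) Multiplication
}
```

Note that sum is sensitive to dialogue length and product shrinks rapidly with more turns, which is one reason the best-performing aggregation can differ across metrics.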
It can be observed that all three metrics do not perform well at dialogue-level evaluation. This further validates our statement in Section 1 that turn-level quality evaluation may be insufficient to assess the performance of open-domain dialogue systems, as these metrics do not specifically model the interaction over an entire dialogue.

The learning rate is decayed by a factor of 0.5 per epoch. A dropout of 0.5 is also applied. For Empathetic Dialogue and DailyDialog, the context window length, M, is set to 4, because these two datasets contain relatively short conversations (4.31 and 7.90 average utterances per dialogue respectively). A context window size of 4 ensures that each utterance is connected to all the remaining utterances in most of the dialogues; the utterances may provide important contextual information to each other within a dialogue. For ConvAI2, M is set to 2 to avoid introducing too much irrelevant context information. This is because most of the conversations in ConvAI2 are about two people getting to know each other and there are frequent topic changes in the conversations. M serves as an important hyperparameter to control the influence of an utterance on the rest of the dialogue.
For training DynaEval, we filter out dialogues in which the number of utterances is fewer than 4 or more than 30. We hypothesize that dialogues with fewer than 4 utterances contain little information for modeling speaker- and utterance-level interaction. Moreover, there are very few dialogues with more than 30 utterances in the datasets. Including them leads to large graphs and unnecessary padding, which slows down the training process.