C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation

Existing reference-free turn-level evaluation metrics for chatbots inadequately capture the interaction between the user and the system. Consequently, they often correlate poorly with human evaluations. To address this issue, we propose a novel model-agnostic approach that leverages Conditional Pointwise Mutual Information (C-PMI) to measure the turn-level interaction between the system and the user based on a given evaluation dimension. Experimental results on the widely used FED dialogue evaluation dataset demonstrate that our approach significantly improves the correlation with human judgment compared with existing evaluation systems. By replacing the negative log-likelihood-based scorer with our proposed C-PMI scorer, we achieve a relative 60.5% higher Spearman correlation on average for the FED evaluation metric. Our code is publicly available at https://github.com/renll/C-PMI.


Introduction
Evaluating dialogues is a multi-faceted task that demands consideration of diverse dimensions, which distinguishes it from the evaluation of task-oriented dialogue systems. Traditional n-gram-based evaluation metrics, such as ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002), demonstrate weak correlation with human-annotated judgments due to the broad spectrum of potential responses in dialogues. As a result, researchers often resort to human evaluations to ascertain the quality and effectiveness of their generated system responses, especially for knowledge-guided dialog systems (Li et al., 2022; Fung et al., 2023; Lai et al., 2023).
Substantial research has been conducted on automatic evaluation metrics for dialogue (Yeh et al., 2021). These metrics can be classified into reference-based and reference-free categories. Reference-based metrics, which depend on comparing the system response to a human-written reference response, are generally inadequate for dialogue evaluation due to the inherent one-to-many nature of dialogues. Reference-free metrics instead use a computational model to generate a score for the system response given a context.
Early models predominantly focus on a limited set of general features of dialogue generation quality, such as context coherency and fluency. Subsequent evaluation metrics investigated additional dimensions, such as USL-H (Phy et al., 2020), which combines relevance evaluation with fact-to-response selection. Holistic-eval (Pang et al., 2020) assesses content coherence, language fluency, self-consistency, and semantic appropriateness. D-Score (Zhang et al., 2021b) and Predictive Engage (Ghazarian et al., 2020) introduce response diversity and engagement scores. The recent FED (Mehri and Eskenazi, 2020a) metric encompasses 18 turn-level and dialogue-level metrics, including interestingness, likeability, and response flexibility. However, none of these methods models the interaction between the turn-level response and the dialogue history; they all regard the two as an integrated context for score calculation.
In this paper, we focus on directly modeling user-system interactions through the lens of Mutual Information (Shannon, 1948; Ghassami and Kiyavash, 2017) and propose a novel scorer based on Conditional Pointwise Mutual Information (C-PMI), which effectively captures the turn-level interactions between the system and user with respect to a given hypothesis. We demonstrate that our approach results in a reference-free, training-free, automatic turn-level dialogue evaluation that significantly outperforms state-of-the-art methods with a comparable number of model parameters. Our contributions in this work are three-fold:
• A novel dialogue evaluation metric based on Conditional Pointwise Mutual Information (C-PMI) that effectively captures turn-level interactions between the system and user with respect to a given hypothesis.
• An unreferenced, training-free, automatic turn-level dialogue evaluation that significantly outperforms state-of-the-art methods with a comparable number of model parameters.
• A model-agnostic approach that can serve as a generalized alternative to the Negative Log-Likelihood (NLL) based evaluation metrics when interactions between previous turns need to be considered.

Related Work
Developing automatic evaluation metrics for dialog is challenging for several reasons: 1) Dialogues often have a one-to-many nature, rendering word-overlap metrics ineffective. To address this issue, metrics should be designed to be reference-free. 2) Given the limitless nature of conversation topics in open-domain dialogues, dialogue evaluation metrics are expected to understand the semantic meaning of both the dialogue context and the generated responses. This necessitates a metric that can leverage pre-trained large language models and self-supervised training objectives. 3) Training dialogue evaluation metrics solely on labeled data can significantly restrict the metric's range, risking over-fitting to the training data in terms of conversation topics and response generation models. As such, recent metrics have started to incorporate self-supervised training objectives designed to capture various aspects of a dialogue, such as relevance, fluency, and interestingness, among others.
Given the aforementioned challenges, large language models have become an integral part of dialogue evaluation. DialogRPT (Gao et al., 2020) employs an extended GPT-2 model trained on 147 million conversation-like interactions from Reddit. USR (Mehri and Eskenazi, 2020b) is an unsupervised, reference-free tool that takes advantage of the RoBERTa (Liu et al., 2019) model. USR employs a dialogue retrieval metric for assessing dialogue, where the metric is trained to differentiate between a ground truth response and a randomly sampled response. The FED metric (Mehri and Eskenazi, 2020a) utilizes DialoGPT (Zhang et al., 2020) due to its capacity for capturing knowledge, specifically within the context of conversations. However, FED ignores the interaction between the user and the system and considers the dialogue history and the system response as an integral context, while our method explicitly captures such interaction through conditional mutual information.
Background
FED (Mehri and Eskenazi, 2020a) measures eighteen fine-grained qualities of dialogue without requiring comparison to a reference response or training data with ground-truth human ratings. The method leverages DialoGPT and uses follow-up hypotheses as a means of evaluation, based on the assumption that the language model has learned to accurately measure the likelihood of the input sequence. Given a dialog context $c$, a system response $r$, and a scorer $\mathcal{L}$ that computes the average Negative Log-Likelihood (NLL) of a sequence with a language model $\theta$, the predicted score for a pair of positive and negative hypotheses is
$$\sum_{i=1}^{|n|} \mathcal{L}(\{c, r, n_i\}) - \sum_{i=1}^{|p|} \mathcal{L}(\{c, r, p_i\}),$$
where $\{a, b\}$ means that text $b$ is appended to text $a$, and $p_i$ and $n_i$ are positive and negative hypothetical sentences. For each of the evaluation dimensions, $|p|$ positive and $|n|$ negative hypothetical sentences are pre-defined and used for reducing evaluation variance. For example, given a combined history $\{c, r\}$, the response is regarded as more interesting if the probability of DialoGPT generating a positive hypothesis (e.g., "That's really interesting!") is greater than the probability of it generating a negative one (e.g., "That's really boring.").
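As a minimal sketch of this hypothesis-based scoring, the snippet below contrasts the NLL of negative follow-ups with the NLL of positive follow-ups. The function name `fed_style_score`, the separator string, the toy scorer `toy_nll`, and the use of plain sums within each group are assumptions made for illustration; they are not taken from the FED implementation.

```python
from typing import Callable, Sequence

SEP = " <|endoftext|> "  # assumed turn separator, for illustration only


def fed_style_score(
    context: str,
    response: str,
    positives: Sequence[str],
    negatives: Sequence[str],
    nll: Callable[[str], float],
) -> float:
    """Summed NLL of negative follow-ups minus summed NLL of positive ones.

    A higher value means the positive hypotheses are more likely than the
    negative ones after the combined history {c, r}.
    """
    history = context + SEP + response
    pos = sum(nll(history + SEP + p) for p in positives)
    neg = sum(nll(history + SEP + n) for n in negatives)
    return neg - pos


def toy_nll(text: str) -> float:
    """Hypothetical stand-in for a language-model scorer (made-up values)."""
    return 1.0 if text.endswith("interesting!") else 3.0


score = fed_style_score(
    "Do you like jazz?",
    "I love it, especially live recordings from the 1950s.",
    positives=["That's really interesting!"],
    negatives=["That's really boring."],
    nll=toy_nll,
)
print(score)
```

With a real language model, `toy_nll` would be replaced by a function that returns the average token-level NLL of the full sequence.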

Conditional Pointwise Mutual Information based Turn-level Metric
For each dialogue turn $t$, our Pointwise Mutual Information (PMI) based metric considers the dependencies between the following three random variables: the full dialogue history $r_t = \{u_0, x_0, u_1, x_1, \dots, u_t\} \sim R$ (where $u_t$ is the user utterance), the system response $x_t \sim X$, and a hypothesis $h \sim H$. Ideally, we want to know how much the correlation between the dialogue history and the system response causes the hypothesis to be a plausible entailment of the combined history, $\{r_t, x_t\}$. We measure such correlation by calculating the Conditional Mutual Information (CMI) between the response and the history with a given hypothesis, i.e.,
$$I(R, X \mid H) = \mathbb{E}_{r, x, h}\left[\log \frac{P(r, x \mid h)}{P(r \mid h)\, P(x \mid h)}\right].$$
Intuitively, if $I(R, X \mid H)$ is large, the hypothesis is less likely to be caused by the interaction (i.e., the shared information) between $R$ and $X$.
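To make the quantity concrete, the following sketch computes conditional mutual information on a small, made-up joint distribution over three binary variables standing in for history, response, and hypothesis. The probability table and all names are hypothetical; the only thing used is the textbook definition of CMI as the expectation of the pointwise log-ratio.

```python
import math

# Hypothetical joint distribution P(r, x, h) over binary variables.
joint = {
    (0, 0, 0): 0.20, (0, 1, 0): 0.05, (1, 0, 0): 0.05, (1, 1, 0): 0.20,
    (0, 0, 1): 0.125, (0, 1, 1): 0.125, (1, 0, 1): 0.125, (1, 1, 1): 0.125,
}


def marginal(keep):
    """Sum the joint over every variable whose index in (r, x, h) is not in `keep`."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out


p_h = marginal([2])
p_rh = marginal([0, 2])
p_xh = marginal([1, 2])


def pointwise_cmi(r, x, h):
    """log[P(r, x | h) / (P(r | h) P(x | h))], expressed with joint probabilities."""
    return math.log(joint[(r, x, h)] * p_h[(h,)] / (p_rh[(r, h)] * p_xh[(x, h)]))


# CMI is the expectation of the pointwise values under the joint distribution.
cmi = sum(p * pointwise_cmi(*key) for key, p in joint.items())
print(round(cmi, 4))
```

In this toy table, `r` and `x` are independent once `h = 1` is known, so every pointwise value with `h = 1` is zero; all of the conditional mutual information comes from the `h = 0` cases.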
Since sampling the history on a turn-by-turn basis requires exponentially increasing computation, an accurate estimation of the CMI between these random variables is intractable. Therefore, we propose to approximate the CMI by calculating the pointwise mutual information between the observed dialogue history and the system response when the hypothesis is appended to the combined history. Formally, we define our Conditional PMI (C-PMI) score between the observed dialogue history, the system response, and the hypothesis as the pointwise form of the conditional mutual information,
$$\text{C-PMI}(r_t, x_t \mid h) = \log \frac{P(r_t, x_t \mid h)}{P(r_t \mid h)\, P(x_t \mid h)}.$$
In practice, we estimate the log-probability of each sequence using the averaged Log-Likelihood (LL) obtained from a language model $P_\theta$ and compute our score from these estimates, which can be efficiently implemented in modern deep learning frameworks. To retain the symmetric property of mutual information, we also define a symmetric version of our score, C-PMI-SYM, by interchanging the response and the dialogue history. To integrate our scorer with an existing evaluation system such as FED, we simply replace its NLL scoring function with our C-PMI scorer and follow the original pipeline to get the final score for each data sample.
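One way to read the idea above is as the pointwise conditional mutual information log[P(r, x | h) / (P(r | h) P(x | h))], which expands by the definition of conditional probability into log P(r, x, h) + log P(h) - log P(r, h) - log P(x, h). The sketch below implements that expansion with a pluggable log-likelihood function. It is an illustration under stated assumptions, not the authors' released scorer: the four-term form, the separator, and the names `join`, `pointwise_cmi_score`, `unigram_ll`, and `table_ll` are all hypothetical.

```python
from typing import Callable

SEP = " <|endoftext|> "  # assumed turn separator, for illustration only


def join(*parts: str) -> str:
    """Append texts in order, i.e. {a, b, ...} in the notation of the text."""
    return SEP.join(parts)


def pointwise_cmi_score(
    history: str, response: str, hypothesis: str, ll: Callable[[str], float]
) -> float:
    """Expansion of log[P(r, x | h) / (P(r | h) P(x | h))] using sequence log-likelihoods."""
    return (
        ll(join(history, response, hypothesis))
        + ll(hypothesis)
        - ll(join(history, hypothesis))
        - ll(join(response, hypothesis))
    )


def unigram_ll(text: str) -> float:
    """Toy scorer with a constant cost per token, so no two texts interact."""
    return -float(len(text.split()))


history, response, hypothesis = "Do you like jazz?", "I love it.", "That's really interesting!"
no_interaction = pointwise_cmi_score(history, response, hypothesis, unigram_ll)

# Toy scorer with made-up values in which the full sequence is unusually likely.
table = {
    join(history, response, hypothesis): -2.0,
    hypothesis: -1.0,
    join(history, hypothesis): -3.0,
    join(response, hypothesis): -3.0,
}


def table_ll(text: str) -> float:
    return table[text]


with_interaction = pointwise_cmi_score(history, response, hypothesis, table_ll)
print(no_interaction, with_interaction)
```

The unigram scorer yields exactly zero because its log-likelihoods are additive across segments, which is the behavior one expects from a mutual-information-style quantity when there is no shared information.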

Dataset
We evaluate our model on the turn-level annotated subset of the FED (Mehri and Eskenazi, 2020a) dataset. This subset consists of 455 data samples, each of which includes a dialog context, a system response, and eight human-annotated turn-level labels: Interesting, Fluent, Engaging, Specific, Relevant, Correct, Appropriate, and Understandable. The annotations are obtained through a survey with the options of No, Somewhat, Yes, or N/A. An additional overall impression label is measured using a five-point Likert scale. The FED dataset was proposed for evaluating metrics: it contains conversations with the Meena and Mitsuku bots (Adiwardana et al., 2020), annotated with human quality judgments.

Baseline Metrics
We primarily compare our proposed reference-free and unsupervised metric with FED, but other baselines are also included as follows.
BARTScore (Yuan et al., 2021) is a text-scoring model based on BART (Lewis et al., 2020) and does not require any fine-tuning. BARTScore calculates the weighted log probability of text $y$ given text $x$:
$$\text{BARTScore} = \sum_{t=1}^{m} \omega_t \log p(y_t \mid y_{<t}, x, \theta),$$
where the weighted sum of the token-level log probabilities of one text $y = (y_1, \dots, y_m)$ given the other text $x$ is used for scoring.
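The weighted log probability described above reduces to a weighted sum over per-token log probabilities. The sketch below shows only that final aggregation step; the function name `weighted_log_prob` and the example numbers are hypothetical, and obtaining the per-token log probabilities from a sequence-to-sequence model is outside its scope.

```python
from typing import Optional, Sequence


def weighted_log_prob(
    token_log_probs: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of per-token log probabilities of y given x.

    With no weights supplied, every token counts equally (weight 1.0).
    """
    if weights is None:
        weights = [1.0] * len(token_log_probs)
    if len(weights) != len(token_log_probs):
        raise ValueError("weights and token_log_probs must have the same length")
    return sum(w * lp for w, lp in zip(weights, token_log_probs))


# Made-up per-token log probabilities for a three-token text y.
uniform = weighted_log_prob([-0.5, -1.5, -2.0])
reweighted = weighted_log_prob([-0.5, -1.5, -2.0], weights=[1.0, 0.0, 2.0])
print(uniform, reweighted)
```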
DynaEval (Zhang et al., 2021a) is an automatic evaluation framework for dialogue response generation tasks, designed to evaluate at both the turn level and the dialogue level. The framework utilizes structured graph representations of dialogues and is trained on datasets that contain ground-truth human ratings.

Implementation Details
We follow the data pre-processing procedure used by Yeh et al. (2021) for the FED dataset, and modify the scorer function as in the original FED repository. Following Yeh et al. (2021), we use a special "<|endoftext|>" token to connect each turn of the system responses and the user utterances for constructing a full sequence. The sequence is then fed into the DialoGPT-large language model to obtain the log-likelihood for calculating both the FED score and our C-PMI score.
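The sequence construction described above can be sketched as a simple join over turns. The helper name `build_sequence`, the whitespace around the separator, and the optional hypothesis argument are assumptions for illustration; the exact pre-processing of Yeh et al. (2021) may differ in such details.

```python
from typing import Optional, Sequence


def build_sequence(
    turns: Sequence[str],
    hypothesis: Optional[str] = None,
    sep: str = "<|endoftext|>",
) -> str:
    """Connect alternating user and system turns with the separator token.

    If a hypothesis is given, it is appended as one more turn.
    """
    parts = [turn.strip() for turn in turns]
    if hypothesis is not None:
        parts.append(hypothesis.strip())
    return f" {sep} ".join(parts)


sequence = build_sequence(
    ["Hi!", "Hello, how are you?"],
    hypothesis="That's really interesting!",
)
print(sequence)
```

The resulting string is what would be tokenized and passed to the language model to obtain a log-likelihood.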

Results & Analysis
Table 1 shows that our proposed metrics, FED+C-PMI-SYM and FED+C-PMI, outperform other methods in most of the evaluation dimensions and are comparable to DynaEval, which requires training on the evaluation dataset. Both FED+C-PMI-SYM and FED+C-PMI show substantial improvements in the Interesting, Engaging, Specific, Semantically Appropriate, and Understandable dimensions compared to our re-implemented FED metric. Notably, our metric even substantially outperforms DynaEval on the Interesting and Engaging dimensions, which conceptually need an accurate measure of the interaction between the user and the system. This demonstrates the effectiveness of our approach in capturing turn-level interactions.
The performance of FED+C-PMI-SYM and FED+C-PMI is quite similar across most dimensions. However, FED+C-PMI shows slightly better performance in the Relevant, Correct, and Understandable dimensions, suggesting that the asymmetrical variant of the C-PMI calculation might provide more accurate evaluation scores in certain cases. We suspect that this is because interchanging the positions of the response and the dialogue history results in unnatural dialogue, which leads to worse probability estimation from the language models.
The results indicate that the proposed C-PMI-based turn-level metrics are capable of providing a more accurate evaluation of dialogue system responses compared to existing state-of-the-art methods. Moreover, our metric is unreferenced and training-free, which makes it particularly suitable for practical applications, such as response selection and re-ranking.
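The agreement with human judgment reported in Table 1 is measured with Spearman correlation, which is the Pearson correlation of the rank-transformed values. The sketch below is a dependency-free illustration of that computation. The example scores and the mapping of the survey options No, Somewhat, and Yes to 0, 1, and 2 are assumptions made up for the example, not data from the paper.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on average ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_a * var_b)


metric_scores = [0.1, 0.4, 0.35, 0.8, 0.7]  # hypothetical metric outputs
human_labels = [0, 1, 1, 2, 2]              # hypothetical No/Somewhat/Yes as 0/1/2
rho = spearman(metric_scores, human_labels)
print(round(rho, 4))
```

Because only ranks matter, the metric does not need to be on the same scale as the human labels; it only needs to order responses similarly.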

Conclusion
In this paper, we introduce a novel dialogue evaluation metric based on Conditional Pointwise Mutual Information (C-PMI) that captures turn-level interactions between the system and user across various evaluation dimensions. The proposed metric is reference-free and training-free, outperforming state-of-the-art methods with a comparable number of model parameters. For turn-level dialogue evaluations, our experimental results demonstrate that this metric can serve as a generalized alternative to the Negative Log-Likelihood scorer for multi-dimensional evaluation metrics. We plan to extend our approach to other dialogue evaluation methods and explore its applicability to general text generation problems. We are also interested to see if our measure can improve the factual consistency evaluation for document-grounded dialogue or conversational question answering. Additionally, we will investigate incorporating our C-PMI-based metric into the fine-tuning process of LLMs.

Limitations
While our proposed method demonstrates promising results and outperforms several state-of-the-art techniques, it is important to acknowledge certain limitations.
• Dependence on pre-trained LLMs: Our method relies heavily on the pre-trained LLM's quality and the knowledge it has captured. As a result, any biases, inaccuracies, or limitations present in the LLM may directly impact the performance of our evaluation metric.
• Lack of diversity in the dataset: The FED dataset, which we use for evaluation, is primarily derived from conversations with the Meena and Mitsuku chatbots. Consequently, it is possible that our metric might not correlate better with human ratings for other dialogue systems or more diverse conversational contexts.
• Adaptability to new evaluation dimensions: Our method currently focuses on eight turn-level metrics. Extending the method to incorporate additional or novel evaluation dimensions might require further investigation and calibration.
• Computational cost: The current implementation of our approach is around twice as slow as the baseline NLL-based method because it requires multiple inference passes of the language model. The efficiency of the implementation can be improved in the future by re-using the log-likelihood of the dialogue history.
• Subjectivity in human judgments: Our evaluation metric's correlation with human judgments serves as a key performance indicator. However, human judgments are inherently subjective, which could lead to inconsistencies or discrepancies in the evaluation results.
Despite these limitations, our proposed method presents a significant step forward in dialogue evaluation, offering a model-agnostic, unreferenced, and training-free approach that captures the interaction between the human and the system. Future work could address these limitations and explore additional dimensions of evaluation, further refining the method and its applicability across a broader range of dialogue systems and text evaluation systems.
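The efficiency remark in the computational-cost limitation can be illustrated with simple memoization: when the same dialogue history is scored against several hypotheses, its log-likelihood only needs to be computed once. This is a sketch of the general idea, not the authors' planned optimization. The scorer `slow_log_likelihood`, the call counter, and the difference computed in the loop are hypothetical stand-ins.

```python
from functools import lru_cache

call_count = 0


def slow_log_likelihood(text: str) -> float:
    """Hypothetical stand-in for one language-model forward pass."""
    global call_count
    call_count += 1
    return -float(len(text.split()))


@lru_cache(maxsize=None)
def cached_log_likelihood(text: str) -> float:
    """Memoized wrapper: repeated requests for the same text cost no extra pass."""
    return slow_log_likelihood(text)


history = "Do you like jazz? <|endoftext|> I love live recordings."
hypotheses = ["That's really interesting!", "That's really boring.", "Tell me more."]

scores = []
for hyp in hypotheses:
    with_hyp = cached_log_likelihood(history + " <|endoftext|> " + hyp)
    history_only = cached_log_likelihood(history)  # computed once, then reused
    scores.append(with_hyp - history_only)

print(call_count)  # four forward passes instead of six
```

String-level caching only helps when exactly the same sequence recurs; sharing computation across sequences with a common prefix would need model-level support and is not shown here.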

Ethics Statement
In this study, we recognize the importance of ethical considerations in natural language processing and dialogue systems research. Acknowledging the potential biases in pre-trained LLMs and human judgments, we advocate for future research to investigate and mitigate these biases in evaluation metrics. We strive for fairness and inclusivity by designing our method to be generalizable and adaptable to various settings. As researchers, we are committed to responsible AI development and contribute to the ongoing discourse on evaluating dialogue systems, enabling the creation of more effective and ethical AI-powered conversational agents. We encourage the research community to continue discussing ethical considerations and promoting transparency in the field.

Table 1: The Spearman correlations with human judgment on the FED turn-level dataset. Columns: Interesting, Fluent, Engaging, Specific, Relevant, Correct, Appro., Und., Avg. Italicized values indicate that they are not statistically significant (p > 0.05). We include the results from the supervised metric to showcase the power of our method. For the unsupervised metrics, the highest correlation is shown in bold and the second highest is underlined. * indicates our reimplementation. The results for DynaEval, BARTScore, and FED are from Fu et al. (2023). Appro. and Und. are respectively the abbreviations of the evaluation dimensions Semantically Appropriate and Understandable.