Timothy Leffel

2024

In task-oriented conversational AI evaluation, unsupervised methods poorly correlate with human judgments, and supervised approaches lack generalization. Recent advances in large language models (LLMs) show robust zero- and few-shot capabilities across NLP tasks. Our paper explores using LLMs for automated dialogue quality evaluation, experimenting with various configurations on public and proprietary datasets. Manipulating factors such as model size, in-context examples, and selection techniques, we examine “chain-of-thought” (CoT) reasoning and label extraction procedures. Our results show that (1) larger models yield more accurate dialogue labels; (2) algorithmic selection of in-context examples outperforms random selection,; (3) CoT reasoning where an LLM is asked to provide justifications before outputting final labels improves performance; and (4) fine-tuned LLMs outperform out-of-the-box ones. In addition, we find that suitably tuned LLMs exhibit high accuracy in dialogue evaluation compared to human judgments.

2023

pdf bib abs

Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs
Abishek Komma | Nagesh Panyam Chandrasekarasastry | Timothy Leffel | Anuj Goyal | Angeliki Metallinou | Spyros Matsoukas | Aram Galstyan
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)

Measurement of interaction quality is a critical task for the improvement of large-scale spoken dialog systems. Existing approaches to dialog quality estimation either focus on evaluating the quality of individual turns, or collect dialog-level quality measurements from end users immediately following an interaction. In contrast to these approaches, we introduce a new dialog-level annotation workflow called Dialog Quality Annotation (DQA). DQA expert annotators evaluate the quality of dialogs as a whole, and also label dialogs for attributes such as goal completion and user sentiment. In this contribution, we show that: (i) while dialog quality cannot be completely decomposed into dialog-level attributes, there is a strong relationship between some objective dialog attributes and judgments of dialog quality; (ii) for the task of dialog-level quality estimation, a supervised model trained on dialog-level annotations outperforms methods based purely on aggregating turn-level features; and (iii) the proposed evaluation model shows better domain generalization ability compared to the baselines. On the basis of these results, we argue that having high-quality human-annotated data is an important component of evaluating interaction quality for large industrial-scale voice assistant platforms.

Co-authors

Anoop Kumar 1

Spyros Matsoukas 1

Angeliki Metallinou 1

Ajay Nagesh 1

Nagesh Panyam Chandrasekarasastry 1

Xujun Peng 1

Tamer Soliman 1

Venues

ACL1
NAACL1

Fix author