On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation

John Mendonça, Alon Lavie, Isabel Trancoso


Abstract
Large Language Models (LLMs) have showcased remarkable capabilities in various Natural Language Processing tasks. For automatic open-domain dialogue evaluation in particular, LLMs have been seamlessly integrated into evaluation frameworks and, together with human evaluation, form the backbone of most evaluations. However, existing evaluation benchmarks often rely on outdated datasets and evaluate aspects such as Fluency and Relevance, which fail to adequately capture the capabilities and limitations of state-of-the-art chatbot models. This paper critically examines current evaluation benchmarks, highlighting that the use of older response generators and quality aspects fails to accurately reflect modern chatbot capabilities. A small annotation experiment on a recent LLM-generated dataset (SODA) reveals that LLM evaluators such as GPT-4 struggle to detect actual deficiencies in dialogues generated by current LLM chatbots.
Anthology ID:
2024.nlp4convai-1.1
Volume:
Proceedings of the 6th Workshop on NLP for Conversational AI (NLP4ConvAI 2024)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Elnaz Nouri, Abhinav Rastogi, Georgios Spithourakis, Bing Liu, Yun-Nung Chen, Yu Li, Alon Albalak, Hiromi Wakaki, Alexandros Papangelis
Venues:
NLP4ConvAI | WS
Publisher:
Association for Computational Linguistics
Pages:
1–12
URL:
https://aclanthology.org/2024.nlp4convai-1.1
Cite (ACL):
John Mendonça, Alon Lavie, and Isabel Trancoso. 2024. On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation. In Proceedings of the 6th Workshop on NLP for Conversational AI (NLP4ConvAI 2024), pages 1–12, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation (Mendonça et al., NLP4ConvAI-WS 2024)
PDF:
https://aclanthology.org/2024.nlp4convai-1.1.pdf