BotChat: Evaluating LLMs’ Capabilities of Having Multi-Turn Dialogues

Haodong Duan, Jueqi Wei, Chonghua Wang, Hongwei Liu, Yixiao Fang, Songyang Zhang, Dahua Lin, Kai Chen


Abstract
In the realm of modern Large Language Models (LLMs), facilitating high-quality, multi-turn dialogues with humans represents a cornerstone feature. However, human-based evaluation of such a capability involves substantial manual effort. This study offers a formative assessment of current LLMs’ proficiency in emulating human-like, multi-turn conversations using an LLM-centric approach. The evaluation encompasses three key elements in the evaluation pipeline: utterance generation, evaluation protocol, and judgement, and we delve deeply into each aspect. GPT-4, both as an utterance generator and as a judge, exhibits exceptional performance. As a generator, GPT-4 crafts dialogues indistinguishable from human interactions in terms of style and flow. When judging, it shows a heightened alignment with human evaluative standards and consistency. Conversely, other LLMs face challenges in producing quality multi-turn dialogues, hindered by inadequate instruction-following abilities, a propensity for prolix utterances, and overall limited capabilities. Notably, generating extensive dialogues (e.g., spanning tens of turns) remains a formidable task for most LLMs, particularly in Chinese contexts. We hope that our work can serve as a valuable resource for evaluating the multi-turn chatting capabilities of LLMs. Related resources are available at https://github.com/open-compass/BotChat.
Anthology ID:
2024.findings-naacl.201
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
3184–3200
Language:
URL:
https://aclanthology.org/2024.findings-naacl.201
DOI:
10.18653/v1/2024.findings-naacl.201
Bibkey:
Cite (ACL):
Haodong Duan, Jueqi Wei, Chonghua Wang, Hongwei Liu, Yixiao Fang, Songyang Zhang, Dahua Lin, and Kai Chen. 2024. BotChat: Evaluating LLMs’ Capabilities of Having Multi-Turn Dialogues. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3184–3200, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
BotChat: Evaluating LLMs’ Capabilities of Having Multi-Turn Dialogues (Duan et al., Findings 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.findings-naacl.201.pdf