LUCID: LLM-Generated Utterances for Complex and Interesting Dialogues

Joe Stacey, Jianpeng Cheng, John Torr, Tristan Guigue, Joris Driesen, Alexandru Coca, Mark Gaynor, Anders Johannsen


Abstract
Spurred by recent advances in Large Language Models (LLMs), virtual assistants are poised to take a leap forward in terms of their dialogue capabilities. Yet a major bottleneck to achieving genuinely transformative task-oriented dialogue capabilities remains the scarcity of high quality data. Existing datasets, while impressive in scale, have limited domain coverage and contain few genuinely challenging conversational phenomena; those which are present are typically unlabelled, making it difficult to assess the strengths and weaknesses of models without time-consuming and costly human evaluation. Moreover, creating high quality dialogue data has until now required considerable human input, limiting both the scale of these datasets and the ability to rapidly bootstrap data for a new target domain. We aim to overcome these issues with LUCID, a modularised and highly automated LLM-driven data generation system that produces realistic, diverse and challenging dialogues. We use LUCID to generate a seed dataset of 4,277 conversations across 100 intents to demonstrate its capabilities, with a human review finding consistently high quality labels in the generated data.
Anthology ID:
2024.naacl-srw.8
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Yang (Trista) Cao, Isabel Papadimitriou, Anaelia Ovalle, Marcos Zampieri, Francis Ferraro, Swabha Swayamdipta
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
56–74
Language:
URL:
https://aclanthology.org/2024.naacl-srw.8
DOI:
10.18653/v1/2024.naacl-srw.8
Bibkey:
Cite (ACL):
Joe Stacey, Jianpeng Cheng, John Torr, Tristan Guigue, Joris Driesen, Alexandru Coca, Mark Gaynor, and Anders Johannsen. 2024. LUCID: LLM-Generated Utterances for Complex and Interesting Dialogues. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 56–74, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
LUCID: LLM-Generated Utterances for Complex and Interesting Dialogues (Stacey et al., NAACL 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.naacl-srw.8.pdf