Dialogue Scaffolding: Producing a Realistic Corpus of Human-Computer Open-Domain Dialogues Using a Spoken Dialogue System and ChatGPT

Kevin Bowden; Marilyn Walker

Abstract

Researchers in dialogue interaction have had a long-term interest in multi-domain human-computer conversations and how they differ from human-human conversations. Recently, research on dialogue has begun to rely more and more on corpus-based training of neural conversational models and conversational LLMs such as ChatGPT. However, existing large open-domain dialogue corpora do not accurately capture the characteristics of social human-computer dialogue. This paper addresses this gap by synthesizing a new corpus of 4000 long social dialogues on 200 user-model based topics that we call User-Centric SocialChat (UCSC). We create UCSC with a novel method called Dialogue Scaffolding, in which a real dialogue system that competed successfully in the Alexa Prize interacts with ChatGPT to generate conversations. The Dialogue Scaffolding method ensures that the dialogues closely resemble the social chat genre of human-computer dialogues. We evaluate UCSC to ensure quality and safety, and we measure lexical diversity and topic consistency to show that the conversations are not repetitive and stay on topic. We evaluate the utility of UCSC by fine-tuning a compact dialogue-level model, PerQy-DLM, and showing that it outperforms competitive fine-tuned models like COSMO, Vicuna, and RedPajama-Chat-3B.

Anthology ID: 2025.sigdial-1.44
Volume: Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Month: August
Year: 2025
Address: Avignon, France
Editors: Frédéric Béchet, Fabrice Lefèvre, Nicholas Asher, Seokhwan Kim, Teva Merlin
Venue: SIGDIAL
SIG: SIGDIAL
Publisher: Association for Computational Linguistics
Pages: 538–560
URL: https://aclanthology.org/2025.sigdial-1.44/
Cite (ACL): Kevin Bowden and Marilyn Walker. 2025. Dialogue Scaffolding: Producing a Realistic Corpus of Human-Computer Open-Domain Dialogues Using a Spoken Dialogue System and ChatGPT. In Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 538–560, Avignon, France. Association for Computational Linguistics.
Cite (Informal): Dialogue Scaffolding: Producing a Realistic Corpus of Human-Computer Open-Domain Dialogues Using a Spoken Dialogue System and ChatGPT (Bowden & Walker, SIGDIAL 2025)
PDF: https://aclanthology.org/2025.sigdial-1.44.pdf
