Faithful Persona-based Conversational Dataset Generation with Large Language Models

Pegah Jandaghi; Xianghai Sheng; Xinyi Bai; Jay Pujara; Hakim Sidahmed

doi:10.18653/v1/2024.findings-acl.904

Abstract

High-quality conversational datasets are essential for developing AI models that can communicate with users. One way to foster deeper interactions between a chatbot and its user is through *personas*, aspects of the user’s character that provide insights into their personality, motivations, and behaviors. Training Natural Language Processing (NLP) models on a diverse and comprehensive persona-based dataset can lead to conversational models that create a deeper connection with the user, and maintain their engagement. In this paper, we leverage the power of Large Language Models (LLMs) to create a large, high-quality conversational dataset from a seed dataset. We propose a Generator-Critic architecture framework to expand the initial dataset, while improving the quality of its conversations. The Generator is an LLM prompted to output conversations. The Critic consists of a mixture of expert LLMs that control the quality of the generated conversations. These experts select the best generated conversations, which we then use to improve the Generator. We release Synthetic-Persona-Chat, consisting of 20k conversations seeded from Persona-Chat. We evaluate the quality of Synthetic-Persona-Chat and our generation framework on different dimensions through extensive experiments, and observe that the losing rate of Synthetic-Persona-Chat against Persona-Chat during an AI detection test decreases from 17.2% to 8.8% over three iterations.
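The Generator-Critic loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how expert judgments are combined or how the selected conversations improve the Generator, so the sketch assumes the experts' scores are averaged and the selected conversations are fed back as examples for the next round. All names (`critic_select`, `generator_critic_loop`, `toy_generator`, `length_expert`, `not_shouting_expert`) are hypothetical, and the toy functions stand in for LLM calls.

```python
from typing import Callable, List

Conversation = str
# Hypothetical interfaces: a generator maps (examples, n) to n candidates,
# and an expert maps a conversation to a quality score.
Generator = Callable[[List[Conversation], int], List[Conversation]]
Expert = Callable[[Conversation], float]


def critic_select(candidates: List[Conversation],
                  experts: List[Expert],
                  keep: int) -> List[Conversation]:
    """Rank candidates by mean expert score (an assumption) and keep the top `keep`."""
    def mean_score(conv: Conversation) -> float:
        return sum(expert(conv) for expert in experts) / len(experts)
    return sorted(candidates, key=mean_score, reverse=True)[:keep]


def generator_critic_loop(seed: List[Conversation],
                          generator: Generator,
                          experts: List[Expert],
                          iterations: int = 3,
                          per_round: int = 4,
                          keep: int = 2) -> List[Conversation]:
    """Grow a dataset from `seed` by alternating generation and selection."""
    pool = list(seed)
    for _ in range(iterations):
        candidates = generator(pool, per_round)
        best = critic_select(candidates, experts, keep)
        # Assumption: selected conversations become examples for the next round.
        pool.extend(best)
    return pool


def toy_generator(examples: List[Conversation], n: int) -> List[Conversation]:
    """Toy stand-in for an LLM prompted with `examples`."""
    base = len(examples)
    return [f"toy conversation {base + i}" + " hi" * i for i in range(n)]


def length_expert(conv: Conversation) -> float:
    """Toy expert: prefers longer conversations."""
    return float(len(conv))


def not_shouting_expert(conv: Conversation) -> float:
    """Toy expert: penalizes all-uppercase text."""
    return 0.0 if conv.isupper() else 1.0


if __name__ == "__main__":
    dataset = generator_critic_loop(
        ["seed conversation a", "seed conversation b"],
        toy_generator,
        [length_expert, not_shouting_expert],
    )
    print(len(dataset))
```

With the default settings, each of the three rounds adds `keep` conversations to the pool, so the seed data is preserved and the dataset grows by a fixed amount per iteration.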

Anthology ID: 2024.findings-acl.904
Volume: Findings of the Association for Computational Linguistics: ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 15245–15270
URL: https://aclanthology.org/2024.findings-acl.904/
DOI: 10.18653/v1/2024.findings-acl.904
Cite (ACL): Pegah Jandaghi, Xianghai Sheng, Xinyi Bai, Jay Pujara, and Hakim Sidahmed. 2024. Faithful Persona-based Conversational Dataset Generation with Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 15245–15270, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Faithful Persona-based Conversational Dataset Generation with Large Language Models (Jandaghi et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-acl.904.pdf
