PLACES: Prompting Language Models for Social Conversation Synthesis

Collecting high quality conversational data can be very expensive for most applications and infeasible for others due to privacy, ethical, or similar concerns. A promising direction to tackle this problem is to generate synthetic dialogues by prompting large language models. In this work, we use a small set of expert-written conversations as in-context examples to synthesize a social conversation dataset using prompting. We perform several thorough evaluations of our synthetic conversations compared to human-collected conversations. This includes various dimensions of conversation quality with human evaluation directly on the synthesized conversations, and interactive human evaluation of chatbots fine-tuned on the synthetically generated dataset. We additionally demonstrate that this prompting approach is generalizable to multi-party conversations, providing potential to create new synthetic data for multi-party tasks. Our synthetic multi-party conversations were rated more favorably across all measured dimensions compared to conversation excerpts sampled from a human-collected multi-party dataset.


Introduction
Training dialogue models typically requires an abundance of data, as with any machine learning task. However, collecting high-quality data is difficult and expensive, especially for dialogue tasks, where there is often no "right answer" when developing the trajectory of a conversation. Dialogue data are typically sourced from crowdworkers, and the quality of annotations, evaluations, and conversations can vary considerably (Zhao and Zhu, 2014), often necessitating guardrails such as credential-based worker selection or defensive task design for quality control (Allahbakhsh et al., 2013).
To accommodate data scarcity when training dialogue models, low-resource methods have become a topic of growing interest and importance (Zhao et al., 2019; Mi et al., 2019; Qian and Yu, 2019; Li et al., 2019). One idea that has gained particular attention is transfer learning: specifically, finding ways to leverage knowledge learned by pre-trained large language models (PLMs) for new tasks. PLMs have demonstrated impressive emergent conversational capabilities, enabling large performance improvements in various dialogue tasks (Brown et al., 2020; Shuster et al., 2022; Peng et al., 2022; Kulhánek et al., 2021). In particular, PLMs have been prompted to augment existing conversational data (Chen et al., 2022; Mehri et al., 2022; Sahu et al., 2022).

* Work done during internship at Amazon Alexa AI.

Figure 1: Pair of dyadic conversation excerpts about hometowns (upper) and pair of triadic conversation excerpts about Ithaca, NY (lower). In both pairings, one conversation is synthetically generated and the other is collected from humans. The answer is in Section 4.
Given some in-distribution seed examples, augmentation techniques attempt to generate data that are faithful to some task distribution (Kim et al., 2021b). Although powerful, one caveat common to all augmentation techniques is that the quality of synthetic data relies heavily on the seed examples. But what if crowdworkers do not possess the necessary background or skill set to complete a task en masse? How can we still obtain adequate high-quality synthetic data to learn a task?
In this work, we explore a novel application of Prompting LAnguage models for social ConvErsation Synthesis (PLACES). Synthesizing conversational datasets allows for the construction of training instances for nonexistent tasks. We specifically conduct open-domain, topic-conditioned conversation generation using few-shot in-context learning with expert-written conversations. We conjecture that expert end-users know exactly the types of conversations that they need: rather than relying on existing datasets, they can simply write a small set of high-quality example conversations according to the structure of their desired conversational outputs. We reason that, given structure through high-quality in-context demonstrations, large PLMs are able to draw on their expansive pre-training data (e.g., Gao et al. (2020)) to synthesize realistic social conversations, implicitly creating personalities and backgrounds for hypothetical speakers. This conversation-writing process would otherwise require human creativity and effort.
Our paper makes four core contributions.
(1) PLACES involves synthesizing an entire conversational dataset from a few targeted, expert-written examples. These conversations match the quality of two widely adopted social dialogue datasets, DailyDialog (Li et al., 2017) and Topical Chat (Gopalakrishnan et al., 2019), in terms of human evaluation and automatic metrics.
(2) We demonstrate that our synthetic conversations can be used as a fine-tuning dataset that matches the performance of its human-curated counterparts, as measured by interactive human evaluation and automatic metrics. (3) We apply PLACES to synthesize data for an under-studied subfield of dialogue research: multi-party conversations. We evaluate a set of synthetic triadic conversations against two human-collected multi-party conversational datasets (Shaikh et al., 2010; Poria et al., 2019).
To our knowledge, our work is the first to synthesize multi-party conversations, adding to the still-growing body of work on multi-party social dialogue. (4) Lastly, we conduct an error analysis on both dyadic and triadic synthetic conversations. We discuss the implications of our findings, as well as potential solutions to address the generation "errors."
Related Work

In-context learning, where few-shot examples are provided in the input prompt of a PLM, has been found to provide valuable guidance for generation output (Min et al., 2022; Brown et al., 2020; Min et al., 2021; Lu et al., 2021b). As a result, many recent efforts in prompting PLMs have sought to augment various natural language processing datasets (Chen et al., 2022; Wang et al., 2022; Sahu et al., 2022; Mehri et al., 2022; Rosenbaum et al., 2022a). Prompting has become a viable "solution" for augmentation in dialogue tasks, which have traditionally been considered challenging due to the difficulty of augmenting dialogue context (Chen et al., 2022).
However, prompt-based augmentation strategies are uncontrolled forms of generation, which may result in generation mistakes for labeled datasets (Sahu et al., 2022; Chen et al., 2022; Meng et al., 2022). In contrast, other recent studies have proposed language augmentation strategies that use complex, highly controlled frameworks, often involving fine-tuned generators (Papangelis et al., 2021; Zhang et al., 2020b; Kulhánek et al., 2021; Zhang et al., 2020a). Such complex augmentation frameworks require larger amounts of seed data to maintain a ground-truth language distribution (Rosenbaum et al., 2022b; Kim et al., 2021b), and are more costly than prompting PLMs (Chen et al., 2022). However, in the context of dataset synthesis, seed data and label correctness are less important considerations: there is no task distribution from which seed data is drawn that PLMs must remain faithful to, and, similarly, the ground-truth knowledge required of the language model depends on the task being synthesized.

Figure 2: Example of the components of a prompt (left) used by OPT 30B to generate a synthetic conversation about pets (right). Conversations in the prompt are prefixed by recipes. Blue text: topic labels. Red text: seed background information metadata.
Our work differs from existing applications of prompting for conversations along several dimensions. Many studies examine utterance-level generation (Chen et al., 2022; Sahu et al., 2022; Aher et al., 2022; Rosenbaum et al., 2022b), whereas our work concerns the synthesis of full conversations. Bae et al. (2022) generated conversations for a narrow task and provided evaluations across their synthesis conditions. Recent concurrent work by Kim et al. (2022) sought to distill conversations from InstructGPT 175B using a commonsense knowledge graph. In our work, we synthesize conversations using an open-source PLM and demonstrate that they are comparable to human-collected datasets, in terms of both conversation quality and usability as a dataset. Moreover, all of these studies concern only dyadic conversations, as the vast majority of conversational tasks are dyadic. Our work is the first study to synthesize multi-party conversations.

Conversation Generation
In this section, we discuss our methods for conversation generation. We first detail the construction of our example conversations, then describe their application to prompting PLMs.

Writing Conversation Examples
We simply wrote a pool of ten conversations between two speakers, representing everyday dialogue written with proper grammar. Along with each conversation, we wrote a brief conversation "recipe," which includes a topic as well as background information for the two speakers.

The background information represents finer-grained information about the two speakers, relevant to that particular topic. For example, Figure 2 depicts an example prompt with three in-context conversation demonstrations. Each conversation is prefixed by a recipe and is structured in the same manner: "The following is a conversation between Alice and Bob about topic" (e.g., "pets"), followed by detailed background information (e.g., "Alice loves cats. Bob is more of a dog person.").
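As a concrete illustration, the recipe format from Figure 2 can be sketched as a small helper. The function name and data layout here are our own illustrative assumptions, not the authors' code:

```python
def make_recipe(topic, background, speakers=("Alice", "Bob")):
    """Build the 'recipe' line that prefixes each example conversation:
    a topic statement followed by per-speaker background information."""
    return (f"The following is a conversation between {speakers[0]} and "
            f"{speakers[1]} about {topic}. {background}")

recipe = make_recipe("pets", "Alice loves cats. Bob is more of a dog person.")
# recipe == "The following is a conversation between Alice and Bob about
# pets. Alice loves cats. Bob is more of a dog person."
```

The same helper could be reused with a different speaker tuple for the triadic conversations discussed later.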

Creating Conversations via Prompting
Each prompt consists of three conversations randomly sampled from the aforementioned pool, along with their accompanying recipes. After experimenting with PLMs of three different sizes (GPT-J 6B, GPT-NeoX 20B, OPT 30B), we primarily use OPT 30B and generate with nucleus sampling with p = 0.92. Inspired by the format of DailyDialog, our handwritten and synthetically generated conversations fall into three categories: start-to-finish conversations, excerpts from the start to the middle of a conversation, and excerpts from the middle of a conversation. We draw conversation topics from FITS (Xu et al., 2022), a human-chatbot dataset designed to determine desirable human-chatbot tasks/conversations. FITS contains 5592 conversations which span 52 conversational topics (e.g., "nutrition," "philosophy") with 315 subtopics (e.g., "Italian food," "Soren Kierkegaard"). We wrote background information for each of the 315 subtopics in the form given in Figure 2.
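The assembly of each prompt might be sketched as follows, assuming a simple list-of-dicts pool; the helper name and data layout are hypothetical, and the actual decoding (nucleus sampling with p = 0.92) is performed by the PLM's generation library:

```python
import random

def build_prompt(pool, target_recipe, k=3, rng=random):
    """Sample k example conversations from the hand-written pool, prefix
    each with its recipe, then append the target recipe, which is left
    open for the PLM to complete with a new synthetic conversation."""
    demos = rng.sample(pool, k)
    parts = [f"{ex['recipe']}\n{ex['conversation']}" for ex in demos]
    parts.append(target_recipe)  # no conversation follows: the model continues here
    return "\n\n".join(parts)

# Toy stand-in for the pool of ten hand-written conversations.
pool = [{"recipe": f"The following is a conversation about topic {i}.",
         "conversation": "Alice: Hi!\nBob: Hello!"} for i in range(10)]

prompt = build_prompt(pool,
    "The following is a conversation between Alice and Bob about pets. "
    "Alice loves cats. Bob is more of a dog person.\nAlice:")
```

Passing the resulting string to the PLM with `top_p=0.92` then yields the synthetic conversation.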
Running this process once yields a new synthetic dataset of 5592 conversations using the same topic-subtopic pairings as FITS. The average conversation length is 9.29 turns, with 12.84 words per turn, which is comparable to the dataset statistics of DailyDialog and Topical Chat.

Synthetic Conversation Evaluation
In Figure 1, the top-left is taken from DailyDialog, whereas the top-right is generated synthetically. The bottom-left is generated synthetically and the bottom-right is taken from MPC.

Evaluation of Conversation Quality
Table 2 provides a crowdworker evaluation of our synthetic dataset compared against DailyDialog and Topical Chat. We expect Topical Chat to be rated as the most interesting, due to the knowledge-grounding process used during its collection. We randomly sampled 200 conversations from each conversation source and asked a pre-qualified pool of 28 crowdworkers on Amazon Mechanical Turk (AMT) to rate each conversation. The instructions and details of our human evaluation setup are explained in Appendix A. As these conversations are generated via prompting, we first checked whether each conversation followed the prescribed prompt. Crowdworkers identified 95% of the conversations generated by OPT 30B as matching the topic stated in the prompt, indicating this prompting strategy's effectiveness for topic-grounded conversation generation. Overall, Table 2 indicates that synthetic conversations generated by OPT 30B are rated as the most coherent, and as more interesting and consistent than DailyDialog. The synthetic conversations are almost as natural as DailyDialog, but are rated as less interesting and natural than Topical Chat. Given our results, we also hypothesize that larger models would likely produce higher-quality conversations; we provide several examples of conversations generated by OPT 175B using an online web interface in the Appendix.
A concern one might have is that, since in-context examples heavily influence prompting (Min et al., 2022; Lu et al., 2021b), our small pool of in-context examples may limit the lexical diversity of our synthetic conversations. Following earlier work on evaluating text generation, we use Distinct-N to measure lexical diversity (Wu et al., 2021; Li et al., 2016). Figure 3 shows that our synthetically generated conversations are slightly more diverse than both DailyDialog and Topical Chat in terms of distinct bigrams and trigrams, and slightly less diverse than Topical Chat in terms of 4-grams.
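Distinct-N is the ratio of unique n-grams to total n-grams. A minimal corpus-level implementation might look like the following; note that details such as tokenization and per-conversation versus corpus-level pooling vary across papers, so this is a sketch rather than the paper's exact procedure:

```python
def distinct_n(texts, n):
    """Distinct-N: unique n-grams divided by total n-grams, computed
    over whitespace-tokenized texts pooled across the corpus."""
    unique, total = set(), 0
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```

Higher values indicate more lexically diverse generations; a fully repetitive corpus drives the score toward zero.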
We then sought to examine the impact of using expert handwritten examples by comparing against synthetic conversations generated using conversations from DailyDialog and Topical Chat as in-context examples. Table 3 shows that synthetic conversations generated conditioned on handwritten in-context examples are the most coherent, natural, and on-topic. In terms of interestingness and consistency, their ratings slightly trail those of the conversations conditioned on Topical Chat.

Fine-Tuning with Synthetic Conversations
After establishing that our synthetic conversations are of rather high quality on their own, we attempted to use the synthetic dataset as training data for dialogue models. We fine-tuned distilled BlenderBot 400M (Roller et al., 2021) on DailyDialog, Topical Chat, and our synthetic conversations; for a fair comparison, we fine-tune on the same number of training instances via downsampling. Rather than directly prompting OPT as a response generator, we select BlenderBot as a lightweight, effective dialogue model. This allows for comparisons between the three data sources as training sets, because fine-tuning OPT is prohibitively expensive. Moreover, while prompting larger PLMs can yield coherent responses, doing so is generally impractical for an end-to-end dialogue system hosted on typically available hardware: for long inputs (e.g., with multiple dialogues in-context), generation typically takes several minutes using OPT 30B. All experiments are conducted on one p3dn.24xlarge AWS EC2 instance.

We first performed an interactive human evaluation of the three dialogue models as end-to-end social chatbots using the LegoEval platform (Li et al., 2021); details can be found in Appendix A. Table 4 shows that dialogue models fine-tuned on our synthetic conversations are rated comparably to dialogue models fine-tuned on real human-human data: the chatbot fine-tuned on synthetic data appeared to be the most natural and non-repetitive, and was rated as the second-most coherent. It was rated as the least intelligent, engaging, consistent, and interesting. However, two-sided t-tests at α = 0.05 revealed no statistically significant difference in ratings between the models fine-tuned on the three datasets across all dimensions except interestingness: the Topical Chat model was rated as significantly more interesting, as expected.

Table 4: Interactive human evaluation yields comparable ratings for chatbots fine-tuned on conversations from DailyDialog (DD), Topical Chat (TC), and our Synthetic Data (Syn).
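The significance check can be sketched as a Welch's (unequal-variance) t statistic over two rating samples. This is an illustrative stdlib-only sketch rather than the paper's exact implementation; for roughly 200 ratings per condition, |t| greater than about 2 corresponds to significance at α = 0.05 (two-sided):

```python
import math
import statistics

def welch_t(a, b):
    """Two-sample Welch's t statistic for independent rating samples.
    Compare |t| against the two-sided critical value for the given
    degrees of freedom (about 1.97 for ~200 ratings per condition)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / na + vb / nb)
```

In practice one would use a statistics library that also reports the p-value, but the statistic itself is this simple.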

In terms of automatic evaluation, we applied these dialogue models to out-of-distribution test sets to prevent an unfair comparison: models fine-tuned on DailyDialog and our synthetic data were evaluated on Topical Chat, and models fine-tuned on Topical Chat and our synthetic data were evaluated on DailyDialog. Table 5 indicates that, in terms of perplexity and ROUGE, models fine-tuned on our synthetic data generalize to out-of-distribution conversational data as well as models trained on real human-human data.

Triadic and Multi-Party Conversations
The vast majority of dialogue tasks and conversational datasets focus on dyadic conversations (e.g., Li et al. (2017); Gopalakrishnan et al. (2019)), following the traditional speaker-listener paradigm (Engelhardt et al., 2006). In contrast, the literature on multi-party social conversation is rather scarce, not only in terms of conversation generation but as a task altogether. Yet, while it is an understudied research area, it is incredibly important: dyadic conversations capture neither the full reality of in-person, human-human social conversation nor the full potential of dialogue agents.
To name a few applications, dialogue agents have the potential to supplement classroom learning with multiple parties, to serve as a third, mediating party in a debate or discussion between two people, or to provide companionship and support in virtual group settings. A major reason these lines of work remain unsolved is that there are few large-scale multi-party dialogue datasets. Many existing multi-party datasets are scripted corpora, such as MELD (Poria et al., 2019), MPDD (Chen et al., 2020), or HLA-Chat (Ju et al., 2022; Li et al., 2020). Other multi-party corpora are collected for highly domain-specific purposes, such as multi-party empathetic dialogue (Zhu et al., 2022). Such corpora are also typically collected through asynchronous online platforms rather than natural conversation; these platforms exist in the form of forums and online chat platforms such as Ubuntu IRC (Lowe et al., 2015) or Reddit (Baumgartner et al., 2020). Other, more natural multi-party conversational datasets are license-protected speech datasets (e.g., CHIME (Christensen et al., 2010)) constructed for tasks such as speaker attribution.
We find that we can apply our prompting approach to generate synthetic, open-domain, multi-party social conversations following the same structure as our synthetic dyadic conversations. As in the dyadic case, we generate triadic conversations using optional background information for each speaker. While we effectively use Alice, Bob, and Claire instead of Speaker 1, Speaker 2, and Speaker 3, respectively, the order of speakers does not necessarily follow the speaker order in the in-context examples (e.g., Appendix Table S10). We consider the "Multi-Party Chat" corpus (MPC) (Shaikh et al., 2010), a text-based, open-domain conversation dataset collected in real-time online sessions at the University at Albany, and MELD, which contains scripted multi-party dialogues from the popular sitcom "Friends." We directly compare our synthetically generated conversations against MPC and MELD. Table 6 includes our evaluation of these conversations using the same pool of pre-qualified AMT workers, again with 200 randomly sampled conversations per source. MPC consists of massive conversation sessions, on the scale of 500 turns per session, so for each conversation evaluation we randomly sample 8 to 12 continuous turns (with the length chosen uniformly) to more closely match the structure of our synthetic conversations; we sample rather than taking the first 8-12 turns in order to avoid overrepresenting greetings. We present examples of MPC and MELD in Appendix Tables S20 and S21. We inform the AMT workers that they will read conversation excerpts. In addition to the questions in Table 2, we add two questions specific to multi-party conversations: whether the conversation excerpt is comprehensible (in terms of the reader being able to determine who each speaker is addressing), and whether all parties of the conversation are participating equally and actively.
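The excerpt-sampling procedure for MPC can be sketched directly (the function name is illustrative):

```python
import random

def sample_excerpt(turns, min_len=8, max_len=12, rng=random):
    """Sample a continuous excerpt of 8-12 turns from a long session,
    with a uniformly chosen length and a uniformly chosen start index
    (rather than always taking the opening turns, which would
    overrepresent greetings)."""
    length = min(rng.randint(min_len, max_len), len(turns))
    start = rng.randint(0, len(turns) - length)
    return turns[start:start + length]

session = list(range(500))     # stand-in for a ~500-turn MPC session
excerpt = sample_excerpt(session)
```

Sessions shorter than the minimum length are simply returned in full.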
In Table 6, we find that the synthetic conversations are rated statistically significantly more favorably than MPC and MELD across all dimensions. Beyond conversation quality, it is possible that the ratings for MPC are comparatively low because each conversation typically has more than three speakers, which may be more difficult for human raters to follow. Our results for MELD also indicate that, while the corpus is high quality, it may be better suited to comedy and accompanying visual context than to standalone dialogue.
Additionally, we checked the linguistic diversity of each speaker. In terms of Distinct-N, each speaker's lexical diversity is comparable (Figure 4), as is the number of words per turn (12.2, 12.2, and 13.5 for Speakers 1, 2, and 3, respectively). The triadic conversations tended to be slightly longer than the average dyadic conversation (11.5 versus 9.29 turns per conversation).

Discussion
Overall, we find that prompting PLMs to generate synthetic conversations is promising.

Considerations for Dyadic Dialogue
The synthetically generated conversations appear comparable to conversations from human-collected datasets. The individual conversations appear interesting, coherent, natural, and consistent, with average ratings for each category lying between 4.0 and 5.0. The Appendix includes multiple examples of conversations generated using the strongest-performing PLM (OPT 30B, e.g., Table S7), as well as several conversations generated using OPT 175B (e.g., Table S8). Tables 4 and 5 also indicate that fine-tuning on synthetically generated examples can produce dialogue models of comparable quality, with the potential for further improvements by simply generating more synthetic conversations.
Future work may consider applying this generation approach to dyadic contexts beyond social conversations, such as task-oriented dialogue. The clearest difference between social and task-oriented dialogue is the importance of knowledge grounding: in task-oriented dialogue, response generation typically requires retrieval from a knowledge base. An application of PLACES could use database results as a ground-truth reference; rather than using a topic list like FITS, one could form conversation recipes using database search results as background information. Given the apparent semantic control described in Section 4, it is possible that synthetic task-oriented conversations would be able to correctly utilize knowledge.

Considerations for Multi-Party Dialogue
We found that, in comparison to MPC, our synthetic triadic dialogues appear to be of fairly high quality. However, there remain several open questions about multi-party dialogue, even in the triadic case. For instance, there is no set archetype for such conversations: some conversations may be dominated by a single speaker, whereas in others each speaker may contribute equally. Depending on the scenario, one speaker may act as a facilitator; meetings, for example, can be considered (topic-specific) multi-party dialogues that are typically led by designated speakers.
Moreover, there are several questions about how to utilize multi-party dialogues in an interactive dialogue system. There are use cases where it may be appropriate for one dialogue system to interact with multiple users. On the other hand, in scenarios like emotional support dialogue systems, it may make sense for a single user to interact with multiple simulated conversational parties.
Here, we investigated our approach's potential to generate synthetic multi-party conversations, hoping to bridge the gap in data availability in multi-party chat. This opens opportunities for a variety of applications. Synthetic datasets could be used to help discover how to properly model triadic and multi-party conversations. In the future, datasets could also be generated for domain-specific, multi-party applications ranging from language learning to task-oriented spoken dialogue systems.

Error Analysis
We examine the dyadic and triadic conversations which received low scores (1/5) across multiple dimensions.

Dyadic Conversations
Out of the dyadic conversations, two were rated as generic and dull. One conversation (Appendix Table S13) discusses the singer Taylor Swift; however, it is repetitive, repeating utterances such as "What are your thoughts on her?" and "I think she is very nice." The other conversation is about the filmmaker Ken Burns (Appendix Table S14). While the conversation appears coherent and uses correct factual information (e.g., referencing Ken Burns' documentaries on World War II and the Vietnam War), the language could be perceived as dull.
Three conversations were rated as completely unnatural. In one case, the PLM missed the prescribed subtopic (cotton candy) and instead hallucinated a conversation about a sensitive topic, cancer (Appendix Table S15); this is also the only conversation rated as completely incoherent. The other two conversations are both on-topic; however, one is rather short (five turns), whereas the other is overly verbose and somewhat repetitive.
There were also three conversations evaluated as completely inconsistent. In all three, the roles of the two speakers seemingly swap. While such turns are possible in excerpts of real conversations, they assume background information or events that have not been explicitly established when the conversations are considered standalone. An example is given in Appendix Table S16.
While some of the evaluations may be subjective, an issue that has objectively appeared multiple times is the consistency of speakers' utterances. The intents and personas of the speakers appear to get switched, which is also an open problem in dialogue systems research. Future work may look to combine conversation synthesis approaches with strategies for dialogue consistency, such as the generate-delete-rewrite framework (Song et al., 2020).

Triadic Conversations
No conversations were perceived as completely incomprehensible, but human evaluators indicated that two conversations appeared to have imbalanced engagement: in both cases, the third speaker ("Claire") has only one dialogue turn. As discussed in Section 6.2, however, it is not clear whether this is a drawback, as real-life triadic conversations do not follow a set archetype in terms of engagement balance.
There was one conversation rated as completely incoherent. In that conversation, one dialogue turn presents information inconsistent with prior turns, but another issue appears to be an oddly placed transition that moves the conversation from travel to hobbies: "You should definitely go to Paris! What do you like to do for fun?" (Appendix Table S17).
There are two conversations which were perceived as completely unnatural. However, naturalness appears to be a rather subjective evaluation. One conversation is given in Appendix Table S18, and it is debatable whether the language conventions used are unnatural. One could argue that it is overly enthusiastic, but others could argue that it is how some people speak colloquially. Interestingly, the second conversation which received a low naturalness score is also enthusiastic and about the same topic (gardening).
The only conversation which was rated as generic and dull was a 15-turn debate about whether the European Union is a "conspiracy" (Appendix Table S19). The debate is rather shallow and does not make a lot of progress.
As with the dyadic conversation error analysis, we see that there are issues with persona consistency. However, unlike the dyadic scenario, there are fewer existing solutions for dialogue consistency. Multi-party conversation synthesis could potentially be improved by applying ideas from the newly published PersonaTKG dialogue system, which employs a unified graph that encodes personas, utterances, and external knowledge on a scripted dialogue dataset (Ju et al., 2022).
Beyond consistency, in the example from Table S19 we see that there is potential for PLMs to hallucinate misinformation. There are again fewer existing studies on circumventing this obstacle in multi-party dialogue, but future work could look to incorporating external knowledge (Kang et al., 2022) or dialogue safety approaches (Kim et al., 2021a;Dinan et al., 2019). All said, our work motivates further study into multi-party dialogue consistency, safety, and synthesis.

Conclusion
In this work, we presented an application of prompting PLMs to create synthetic conversations. These synthetic conversations are comparable in quality and lexical diversity to actual human-human datasets, and can be used as training data for dialogue models. This opens avenues in generative language work such as collaborative and creative writing, story generation, and the synthesis of new conversational tasks. Here, we presented one example, synthesizing a multi-party conversational dataset, which presents a unique opportunity to further study multi-party dialogue modeling.

Limitations
Controllability. We witness encouraging levels of control through the prompt (95% of the time, the synthetic conversation matches the desired topic), but prompting PLMs is still an uncontrolled form of generation. Future work could seek to add semantic controls beyond the stated topic in the prompt, or explore using weak supervision to provide post-hoc improvements to synthetic data quality, similar to Chen et al. (2022). In this work, we also did not thoroughly explore the effects of different generation approaches; future work may consider applying semantic constraints during the decoding process (Lu et al., 2021a). Further controls are necessary before using this approach in higher-stakes settings such as task-oriented dialogue and other knowledge-grounded tasks.
Cost of Human Effort. While we demonstrate the ability to synthesize large amounts of data, the quality of a synthesized dataset is still dependent on human effort, to an extent. One can use a generic prompt template such as "Alice is interested in [subtopic]" for each subtopic, but we qualitatively see that more detailed background information in a prompt often yields better generation performance.
In this work, we generated 5592 dyadic and triadic conversations, matching the number of topic combinations in FITS. PLACES can be used to generate many more conversations in the future: using the same overall approach, one can continue to form new combinations of topics and subtopics, or simply rerun the generation process, as it is nondeterministic. Moreover, one may consider filling the slots in our conversation recipes from an abundance of external sources, including existing dataset annotations (e.g., PersonaChat; Zhang et al. (2018)).
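The generic template mentioned above could be instantiated as follows; the exact template wording and helper name are hypothetical, and the paper notes that more detailed, hand-written background tends to yield better generations than this minimal form:

```python
# A generic recipe template filled from (topic, subtopic) pairs.
TEMPLATE = ("The following is a conversation between Alice and Bob about "
            "{topic}. Alice is interested in {subtopic}.")

def make_recipes(topic_subtopics):
    """Fill the recipe template for every (topic, subtopic) pair."""
    return [TEMPLATE.format(topic=t, subtopic=s) for t, s in topic_subtopics]

recipes = make_recipes([("nutrition", "Italian food"),
                        ("philosophy", "Soren Kierkegaard")])
```

Swapping `TEMPLATE` for richer, per-subtopic background strings is the manual step the paper describes.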
Computational Costs. Once a dataset is synthesized, small, task-specific models can be used downstream. However, the synthesis method used in this work is still expensive: we prompt PLMs. While we only used freely accessible PLMs such as OPT, we acknowledge that not everyone has access to the number of GPUs necessary to load PLMs, even for inference.
Prompt Design. The idea of prompting large language models is not novel. There is a plethora of work that examines how to apply prompting to a variety of different tasks (e.g. Brown et al. (2020); Min et al. (2021)), along with several studies on how to mine or engineer different prompts (Liu et al., 2021). In this work, we do not claim novelty to our prompt, nor do we claim that our prompt design is the optimal prompt for conversation generation. Our prompt is designed in a conversational manner, drawing inspiration from Chen et al. (2022). We instead emphasize the application of prompting for conversational dataset synthesis. The idea of synthesizing conversational datasets "from scratch" is previously unexplored, and has potential to supplement a lot of areas of dialogue research, such as multi-party conversations.

Ethical Considerations
Human Evaluation and Crowdsourcing. We make use of crowdsourcing through Amazon Mechanical Turk for several experiments. All crowdworkers were paid at a rate higher than the minimum wage in California. In accordance with California State Law, all crowdworkers were also informed they were speaking with chatbots during the data collection for our interactive evaluation. All participants consented to the logging of their responses.
Language Model Biases. Large pre-trained language models are typically pre-trained on massive corpora crawled from the internet such as The Pile (Gao et al., 2020) or Common Crawl. This allows language models to have exposure to a large amount of linguistic diversity, but this also results in exposure to a lot of hateful, biased, or otherwise undesirable content from the internet (Luccioni and Viviano, 2021). Future work should examine combining conversation synthesis with dialogue safety approaches.
Scientific Artifacts. All scientific artifacts are used according to their intended purpose. The FITS dataset is publicly available at https://parl.ai/projects/fits/. OPT is an open-source language model. GPT-J is available for use under the MIT license. We use the HuggingFace Transformers and PyTorch packages for all modeling (Wolf et al., 2020; Paszke et al., 2019). All artifacts used are in English.

A Human Evaluation Setup
Our human evaluation studies on Amazon Mechanical Turk are conducted with 28 pre-qualified crowdworkers, who have previously demonstrated proficiency with natural language processing tasks.

A.1 Conversation Evaluation
The crowdworkers were asked to rate conversations from multiple sources according to the following dimensions and instructions.
• How natural is the overall conversation? Scale: 1 (completely unnatural) to 5 (as natural as two native English speakers)
• How coherent is the overall conversation? Scale: 1 (completely incoherent) to 5 (as coherent as two native English speakers)
• How interesting is the overall conversation? Scale: 1 (generic and dull) to 5 (full of content and very engaging)
• How consistent are each of the speakers' turns? Scale: 1 (completely inconsistent) to 5 (no logical fallacies)
• Does the conversation match the stated topic? Options: Yes (1) or No (0)

Each conversation is rated by three crowdworkers, and the median score is selected, following the idea of a majority vote.
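The median-of-three aggregation described above can be sketched as follows (the ratings are made-up examples, not actual study data):

```python
from statistics import median

# Three crowdworker ratings per conversation on a 1-5 Likert scale.
ratings_per_conversation = [
    [4, 5, 2],  # one outlier low rating
    [3, 3, 4],
    [1, 5, 5],
]

# Taking the median of three ratings acts like a majority vote:
# a single outlier rater cannot move the selected score.
scores = [median(r) for r in ratings_per_conversation]
print(scores)  # [4, 3, 5]
```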
For multi-party conversations, crowdworkers were asked two additional questions regarding comprehensibility and engagement balance.
• Can you tell which speaker is speaking to which? Scale: 1 (completely incomprehensible) to 5 (perfectly comprehensible)
• Is each speaker engaged, or is the conversation primarily dominated by one or two of the speakers? Scale: 1 (totally dominated by one or two speakers) to 5 (all speakers are actively participating in the conversation to an equal degree)

A.2 Interactive Evaluation
For each HIT of the interactive evaluation study, each crowdworker was presented with links to chatbots in a randomized order. Each link connects the crowdworker to a deployment on an instance of LegoEval (Li et al., 2021). Users are presented with a landing page informing them that they are interacting with a chatbot and that they will be asked to evaluate their conversation experience. Immediately after interacting with a chatbot, each crowdworker was presented with a survey asking for their impression of the chatbot. In addition to the dimensions above (other than on-topic), the crowdworkers were asked how engaging, intelligent, and non-repetitive they found the chatbot.

B Model Details
During generation, we use top-p sampling with p = 0.92.

Table S7: Pair of dyadic conversations generated using OPT 30B. The prompt recipe given is: "The following is a conversation between Alice and Bob about their hometowns. Bob is from Austin, Texas, and Alice is from New York City."

Table S19: Synthetic triadic conversation generated by OPT 30B which was rated as generic and dull. "Alice" begins a long debate on whether the EU is a "conspiracy" without making much conversational progress.
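Top-p (nucleus) sampling keeps only the smallest set of tokens whose cumulative probability exceeds p, then samples from that renormalized set. A minimal, library-free sketch (the toy next-token distribution is hypothetical):

```python
import random

def top_p_filter(probs, p=0.92):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize. `probs` maps token -> probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:
            break
    return {tok: prob / total for tok, prob in kept}

# Toy next-token distribution: the low-probability tail is cut off
# at p = 0.92, so "zzz" can never be sampled.
dist = {"the": 0.5, "a": 0.3, "an": 0.15, "zzz": 0.05}
filtered = top_p_filter(dist, p=0.92)
token = random.choices(list(filtered), weights=list(filtered.values()))[0]
print(sorted(filtered))  # ['a', 'an', 'the']
```

In practice this filtering is applied per decoding step by the PLM's generation loop (e.g., via the `top_p` argument to HuggingFace Transformers' `generate`).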
[Table body: excerpt of a human-collected multi-party conversation (speakers include john, meg, mara, nick, george, amy, and jordan); the party/utterance layout was lost in extraction.]

Alice is learning about the planets in school.

Pokemon
Alice likes to play Pokemon. Bob also likes Pokemon.

Table S22: Corresponding background information written for each of the subtopics found in the FITS dataset.
There is a mixture of prompts which mention only one speaker and prompts which mention two speakers. Every synthetic conversation involves both speakers.

Table S25: Triadic conversation recipes written for each of the "generic topics" given in the FITS dataset. These conversation recipes are included after the in-context examples when prompting PLMs to generate synthetic conversations. Unlike Table S22, each of these conversation recipes may include background for up to three people. Continued in Table S26.

Table S26: Triadic conversation recipes written for each of the "generic topics" given in the FITS dataset, continued from Table S25.