Controllable Dialogue Simulation with In-Context Learning

Building dialogue systems requires a large corpus of annotated dialogues. Such datasets are usually created via crowdsourcing, which is expensive and time-consuming. In this paper, we propose \textsc{Dialogic}, a novel dialogue simulation method based on large language model in-context learning to automate dataset creation. Seeded with a few annotated dialogues, \textsc{Dialogic} automatically selects in-context examples for demonstration and prompts GPT-3 to generate new dialogues and annotations in a controllable way. Our method can rapidly expand a small set of dialogue data with minimum or zero \textit{human involvement} and \textit{parameter update} and is thus much more cost-efficient and time-saving than crowdsourcing. Experimental results on the MultiWOZ dataset demonstrate that training a model on the simulated dialogues leads to even better performance than using the same amount of human-generated dialogues under the challenging low-resource settings, with as few as 85 dialogues as a seed. When enough data is available, our method can still serve as an effective data augmentation method. Human evaluation results also show that our simulated dialogues have near-human fluency and annotation accuracy. The code and data are available at \textbf{\url{https://github.com/Leezekun/dialogic}}.


Introduction
Task-oriented dialogue (TOD) systems can assist users in completing tasks such as booking a restaurant or making an appointment. Building such a dialogue system requires a large corpus of annotated dialogues (Wu et al., 2020), which is costly to obtain in terms of money and time.
One popular approach to collecting and annotating task-oriented dialogues is crowdsourcing via a Wizard-of-Oz setup (Mrksic et al., 2017; Eric et al., 2017; Budzianowski et al., 2018), where crowdworkers produce conversations. Significant annotation effort is further needed to label intents, entities, etc. Prior work has sought to minimize the cost and effort of data collection by hiring crowdworkers or leveraging user simulators to interact with existing dialogue systems (Williams et al., 2013; Shah et al., 2018b,a; Papangelis et al., 2019; Zhao et al., 2019; Rastogi et al., 2020; Tseng et al., 2021). However, the dependency on existing dialogue systems leaves developers with a classic chicken-and-egg problem. In addition, developing such user simulators typically requires considerable handcrafting and human involvement.
In recent years, large language models (LLMs) (Brown et al., 2020; Lieber et al., 2021; Rae et al., 2021; Thoppilan et al., 2022; Smith et al., 2022) have demonstrated strong in-context learning capability. Provided with a few in-context examples, LLMs such as GPT-3 (Brown et al., 2020) can generate text with similar patterns without finetuning. This capability has been leveraged to synthesize training data in a few NLP tasks (Wang et al., 2021b; Liu et al., 2022). Although there have been methods that generate training data for a single component of TOD systems (Li et al., 2022b), there has not been a plausible solution for generating whole dialogues with annotations for end-to-end training, due to the complex nature of the task: multi-turn interactions, multiple possible logic flows, and multiple types of annotations.
To address this challenge, we introduce DIALOGIC, a controllable dialogue simulation method for dialogue dataset creation. Seeded with a few annotated dialogues, DIALOGIC automatically selects in-context examples for demonstration and prompts LLMs such as GPT-3 to generate annotated dialogues in a controllable way. DIALOGIC can play the roles of both user and system simulator. Figure 1 illustrates a partial example. For the user side, GPT-3 is prompted first to generate the turn-level user goal (belief state), conditioned on which the user utterance that expresses the goal is generated. Likewise, we prompt GPT-3 to generate the dialog act for the system side and then the corresponding system response. We also propose automatic verification and revision methods to mitigate annotation errors.
This paper has two key insights. First, leveraging the in-context learning ability of LLMs, our method can simulate both the user and system side to generate annotated dialogues by learning from a few examples. Except for the minimal effort of collecting the small seed dataset and training an auxiliary model on it, the simulation process is free of human involvement and parameter updates, making our method much cheaper and faster than crowdsourcing for dataset creation. Specifically, a large-scale, high-quality dataset such as MultiWOZ (Budzianowski et al., 2018) can be created with our method within only several hours. Second, we design controllable dialogue generation strategies to overcome GPT-3's lack of reliability and interpretability. We also investigate effective representations and selection strategies of in-context dialogue examples so that LLMs can better leverage their in-context learning capabilities.
We conduct experiments on the MultiWOZ2.3 (Han et al., 2021) dataset. Remarkably, in the challenging low-resource settings where as few as 85 seed dialogues (1% of the whole training dataset) are given, the dialogues simulated by our method lead to even better model performance than the same amount of human-generated dialogues. DIALOGIC can also serve as an effective data augmentation method when the full training set is provided. Human evaluations indicate that our simulated dialogues have comparable fluency and annotation accuracy, and more diverse dialogue flows, than human-generated dialogues. Our results demonstrate the promise of leveraging large language models to automate complex dialogue dataset creation. We have released the code and simulated data to facilitate future studies.

Related Work

Dialogue Collection and Simulation
Building end-to-end dialogue systems relies heavily on annotated training data. Wizard-of-Oz (Kelley, 1984), a popular approach, is able to produce high-quality conversations but relies entirely on human effort (Mrksic et al., 2017; Eric et al., 2017; Asri et al., 2017; Budzianowski et al., 2018). There are also dialogue corpora of interactions between humans and existing dialogue systems or APIs (Williams et al., 2013, 2014; Raux et al., 2005). To further reduce human effort, user simulators are leveraged to interact with the system via reinforcement learning or self-play (Shah et al., 2018b,a; Papangelis et al., 2019; Zhao et al., 2019; Rastogi et al., 2020; Tseng et al., 2021). However, existing dialogue systems or APIs are still needed, which restricts these solutions to existing domains. To this end, Mohapatra et al. (2020) proposed a method that utilizes GPT-2 (Radford et al., 2019) to simulate both the user and system side. However, this method still needs many dialogues to train the simulators and cannot guarantee simulation quality in low-resource settings.

Task-oriented Dialogue
A task-oriented dialogue system usually consists of three components: natural language understanding (NLU) for dialogue state tracking, dialogue management (DM) for predicting the dialog act based on the dialogue states, and natural language generation (NLG) for mapping the dialog act to a natural language response. Annotated data of belief states, dialog acts, and system responses are needed to train these components, whether separately (Wu et al., 2019; Lee et al., 2019; Heck et al., 2020) or in an end-to-end fashion (Peng et al., 2021; Hosseini-Asl et al., 2020; Lin et al., 2020; Yang et al., 2021; Su et al., 2021). In this paper, we aim to generate dialogues and their complete set of annotations.

In-Context Learning
As an alternative to finetuning, in-context learning with LLMs such as GPT-3 (Brown et al., 2020) can perform a new task by learning from a few in-context examples without updating model parameters. Due to its superior few-shot performance and scalability, in-context learning has been applied to a wide range of NLP tasks. For dialogue, in-context learning has been increasingly deployed in tasks such as intent classification (Yu et al., 2021), semantic parsing (Shin and Van Durme, 2021), and dialogue state tracking (Hu et al., 2022). Madotto et al. (2021) built an end-to-end dialogue system solely based on in-context learning. Despite its success, GPT-3 requires substantial resources to deploy, and its public API is charged based on the length of the input text. Worse still, the input length limit restricts the number of in-context examples and thus the generation performance. Consequently, several methods have been proposed to leverage GPT-3 to synthesize data for training smaller models for inference (Wang et al., 2021a,b; Liu et al., 2022; Li et al., 2022a). Although this is especially desirable for dialogue tasks, as dialogue prompts are usually lengthy, there has not been a plausible solution for generating annotated dialogues for developing TOD systems, due to the complex nature of the task: multi-turn interactions and multiple types of annotations.

Method
In this paper, we introduce DIALOGIC, a novel method to simulate annotated dialogues for building task-oriented dialogue systems based on language model in-context learning. The only requirements are a small seed dataset D_s consisting of a few annotated dialogues and an ontology O that includes all slots and possible slot values for each domain. An auxiliary TOD model M, such as SimpleTOD (Hosseini-Asl et al., 2020) or PPTOD (Su et al., 2021), trained on D_s is used to verify and revise generated annotations. Our goal is to expand D_s by generating new dialogues. For each turn of a dialogue, we need to generate the user utterance U, belief state B, database (DB) query result Q, dialog act A, and system response S (we omit the turn index for brevity). We elaborate the design of our method on the well-studied task-oriented dialogue dataset MultiWOZ (Budzianowski et al., 2018; Eric et al., 2020; Han et al., 2021), which covers 7 domains such as hotel and restaurant, and 24 slots such as hotel-area and restaurant-food (see Appendix A for more details). To simulate low-resource environments, we use 1%, 5%, and 10% of the training dataset as the seed dataset D_s.

Overview
A partial example of a simulated dialogue is shown in Figure 1, and the pipeline of our method is illustrated in Figure 2. For a domain, the goal generator takes the ontology O as input to generate a new user goal G_i. We then select a few seed dialogues with similar user goals from D_s as in-context examples for GPT-3. Given the user goal G_i and the selected in-context examples, we leverage GPT-3 to generate a new dialogue C_i. As the generated data may fail to satisfy our requirements, we design methods for automatic verification and revision.

In-context Example
User Goal. A task-oriented dialogue is a conversation in which the dialogue system helps accomplish the user's goal. For a new dialogue C_i, we first generate its user goal G_i based on the ontology. The user goal and belief state are a set of domain-slot-value triplets: (domain, slot_name, slot_value). For example, when a user wants to book a 4-star hotel for 2 nights and a cheap restaurant that serves Chinese food, the user goal will be {(hotel, stars, 4), (hotel, book stay, 2), (restaurant, pricerange, cheap), (restaurant, food, chinese)}. We investigate several ways to generate the user goal, i.e., to determine the domains, slots, and slot values to be selected, which are discussed below.
Example Selection. Given the target user goal G_i, we select a few seed dialogues as in-context examples, from which GPT-3 can learn to generate the target dialogue C_i. To achieve that, the selected dialogue examples should contain as much of the ontology information needed in the target dialogue (i.e., the mentioned slots) as possible, so that GPT-3 can mimic "in-domain" generation. To measure how two dialogue goals G_i and G_j overlap, we calculate their similarity as:

sim(G_i, G_j) = |D(G_i) ∩ D(G_j)| / |D(G_i) ∪ D(G_j)| + |S(G_i) ∩ S(G_j)| / |S(G_i) ∪ S(G_j)|,   (1)

where D(G_i) and S(G_i) denote the set of domains and slots in the user goal G_i, respectively. The first term is the Jaccard similarity (Niwattanakul et al., 2013) of the domain sets, while the second is that of the slot sets. The probability of a dialogue C_j from the seed dataset D_s being sampled as an in-context example for the target dialogue C_i is:

p(C_j) = exp(sim(G_i, G_j) / τ) / Σ_{C_k ∈ D_s} exp(sim(G_i, G_k) / τ),   (2)

where τ is the temperature. A higher temperature introduces more randomness and diversity into example selection. We investigate several ways to generate user goals and select in-context examples:
• Random Sampling: we randomly select domains, slots, and slot values to form a user goal and sample in-context examples as described in Equation 2.
In this way, we can generate any unseen user goal and thus the corresponding dialogues. However, as the number of seed dialogues is limited, it is hard to guarantee that the sampled dialogue examples cover all the information required for generating the target dialogue.
• Value Substitution: we only substitute the slot values of a seed dialogue's user goal to form a new user goal. This method ensures that all the required slots are mentioned in the in-context examples. However, GPT-3 tends to replicate the in-context examples, so little diversity is introduced.
• Combination: we first select a few dialogues from the seed dataset and then combine their user goals to create a new goal. As the new user goal might involve too many domains and slots, we randomly drop some slots. This method ensures that all the slots mentioned in the target user goal are covered by the examples while encouraging GPT-3 to generate diverse data.
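The goal similarity and temperature-controlled sampling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the default τ value are assumptions, and a goal is represented as a set of (domain, slot, value) triplets as in the running example.

```python
import math
import random

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def goal_similarity(goal_i, goal_j):
    """Equation 1: Jaccard similarity of the domain sets plus that of the
    (domain, slot) sets. A goal is a set of (domain, slot, value) triplets."""
    domains_i = {d for d, _, _ in goal_i}
    domains_j = {d for d, _, _ in goal_j}
    slots_i = {(d, s) for d, s, _ in goal_i}
    slots_j = {(d, s) for d, s, _ in goal_j}
    return jaccard(domains_i, domains_j) + jaccard(slots_i, slots_j)

def sample_examples(target_goal, seed_goals, k=2, tau=0.2):
    """Equation 2: sample k seed dialogues with probability proportional to
    exp(sim / tau). tau=0.2 is an assumed default, not from the paper."""
    weights = [math.exp(goal_similarity(target_goal, g) / tau)
               for g in seed_goals]
    return random.choices(range(len(seed_goals)), weights=weights, k=k)
```

A higher `tau` flattens the weights, so less similar seed dialogues are sampled more often, trading "in-domain" coverage for diversity.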
We experimentally found that the Combination method yields the best performance. More details, comparison, and discussion of the different goal generation methods can be found in Appendix A.2.
Demonstration. To better demonstrate to GPT-3 the desired pattern of the generated data, we design the format for the example dialogues as shown in Figure 3. The user goal and belief state are converted from a sequence of triplets to natural language via a template. For example, the user goal {(hotel, stars, 4), (hotel, book stay, 2), (restaurant, pricerange, cheap), (restaurant, food, chinese)} will be converted to "[hotel] stars is 4 , book stay is 2 [restaurant] pricerange is cheap , food is chinese", where [domain] separates domains and the comma separates slots within each domain.
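The template conversion above can be sketched as a small helper. This is an illustrative reconstruction of the described format, not the authors' code; the domain ordering (insertion order of the triplet sequence) is an assumption.

```python
def goal_to_text(goal):
    """Render (domain, slot, value) triplets in the template format,
    e.g. "[hotel] stars is 4 , book stay is 2 [restaurant] ..."."""
    by_domain = {}
    for domain, slot, value in goal:
        # Group slot descriptions under their domain, keeping input order.
        by_domain.setdefault(domain, []).append(f"{slot} is {value}")
    return " ".join(
        f"[{domain}] " + " , ".join(slots)
        for domain, slots in by_domain.items()
    )
```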
As for the conversation part, the desired annotations are incorporated with the utterances of each turn. For the user side, GPT-3 generates the user utterance and the turn-level belief state, i.e., the part of the user goal mentioned in this turn. Dialog acts and their corresponding system responses are needed for the system side. The dialog act is likewise a set of triplets (domain, action_type, slot_name) and is converted to natural language in the same way as belief states. An actual example of the demonstration can be seen in Table 8.

Prompt Design
Given a prompt consisting of the task description, a few in-context examples, and an incomplete entry, we instruct GPT-3 to generate text to complete the entry. A template of our prompt is shown in Figure 4. The format of the in-context examples is described in Section 3.2: each consists of an instruction (the user goal) and the conversation. As for the incomplete entry, the target user goal description is used as the instruction, and GPT-3 generates the corresponding conversation in a controllable way, as described in the next section.
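The prompt assembly can be sketched as follows, assuming goals have already been rendered as template strings. The section labels ("Instruction:", "Conversation:") are illustrative assumptions standing in for the exact strings in Figure 4.

```python
def build_prompt(task_description, examples, target_goal_text):
    """Assemble the prompt: task description, formatted in-context example
    dialogues, and an incomplete entry ending where GPT-3 should continue."""
    parts = [task_description]
    for goal_text, conversation in examples:
        parts.append(f"Instruction: {goal_text}\nConversation:\n{conversation}")
    # The incomplete entry: only the instruction is given, so the model
    # completes the conversation for the target user goal.
    parts.append(f"Instruction: {target_goal_text}\nConversation:\n")
    return "\n\n".join(parts)
```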

Controllable Dialogue Generation
Considering GPT-3's known lack of reliability and interpretability, we propose methods to control GPT-3 as it generates dialogue data. In addition, we design automatic revision methods to minimize potential annotation errors. Figure 5 illustrates the controllable generation process for a dialogue turn.
For the user side, GPT-3 generates the belief state B̂ and the corresponding user utterance U. The belief state is expected to be consistent with the user utterance. We keep U unchanged as the final user utterance and check for annotation errors in the generated belief state B̂, which fall into two types. Taking the example in Figure 5 for illustration: (hotel, stay, 1), part of the originally generated belief state B̂, does not appear in the user utterance, which is called over-generation. Conversely, the value don't care for the slot hotel-area is mentioned by the user but not included in B̂, which is called de-generation (Li et al., 2020). We utilize an auxiliary generator and a slot-value match filter to mitigate the de-generation and over-generation issues, respectively.
Auxiliary Generator. To tackle the de-generation issue, we try to detect as many mentioned slots in the user utterance as possible. To this end, we utilize an auxiliary TOD model M trained on the seed dataset D_s to generate its predicted belief state B̄, conditioned on the dialogue context of all previous turns and the user utterance U of the current turn. B̄ can be complementary to the B̂ generated by GPT-3. We found that GPT-3 sometimes forgets to generate all, or even any, of the belief state; if not corrected, it continues the error in the following turns. It is therefore worthwhile to use the auxiliary generator, even when it is not well trained on limited seed data, to complement the belief states. With more seed data to train the auxiliary model, we can better detect belief state slots forgotten by GPT-3 and mitigate the de-generation issue.
Slot-value Match Filter. B̂ and B̄ contain the belief states detected by GPT-3 and the auxiliary generator, respectively, and are complementary, so we combine them. When the predictions of GPT-3 and the auxiliary generator have overlapping slots, we take the slot value detected by GPT-3, i.e., GPT-3 has higher priority. We then filter out the over-generated slots whose values cannot be matched in the user utterance, resulting in the final belief state B.
The auxiliary generator and slot-value match filter are used jointly to automatically detect and correct annotation errors, mitigating the de-generation and over-generation issues. Taking the example in Figure 5 for illustration, the auxiliary model detects the correct belief state (hotel, area, don't care), which is missing in B̂. Conversely, the slot value cheap for the slot hotel-pricerange cannot be found in the user utterance and is thus removed from the belief state. The resulting final belief state B is used to automatically retrieve the DB entry Q from a pre-defined database.
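The merge-and-filter step can be sketched as below. This is a simplified illustration, not the paper's implementation: belief states are modeled as dicts mapping (domain, slot) to value, and the filter uses a plain substring match (special values such as "don't care", which do not appear verbatim in the utterance, would need extra handling).

```python
def revise_belief_state(b_gpt3, b_aux, user_utterance):
    """Merge GPT-3's belief state with the auxiliary model's prediction,
    then drop over-generated slots whose value cannot be matched in the
    user utterance. GPT-3's value wins on overlapping slots."""
    merged = dict(b_aux)
    merged.update(b_gpt3)  # GPT-3 has higher priority on overlaps
    utterance = user_utterance.lower()
    return {slot: value for slot, value in merged.items()
            if value.lower() in utterance}
```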
As for the system side, GPT-3 generates the dialog act Â by concatenating the generated user utterance U and belief state B with the prompt. We also utilize the auxiliary TOD model M to generate its prediction Ā conditioned on the dialogue context X, the user utterance U, the revised belief state B, and the DB query result Q. To ensure that the dialogue logic is followed and the database query result is taken into account, we write rules to filter out invalid dialog acts and decide the final dialog act annotation A, which is then concatenated with the prompt to continue generating the system response S.
In most cases, GPT-3's generations are acceptable without revision, and we cannot guarantee that all errors are detected and corrected (we list the revision frequency in Appendix B). However, automatic revision on the fly is still essential, as GPT-3 tends to imitate earlier errors in the following turns; each revision thus not only corrects the current error but also avoids numerous potential mistakes in subsequent turns. In addition, when there is enough data D_s to train the auxiliary TOD model M, DIALOGIC plays a more significant role as a user simulator interacting with the well-trained system M.

Turn-level Generation for DST
For the dialogue state tracking (DST) task, the belief state at each step is an accumulation of previous steps, so any error from an earlier step propagates to later steps. In addition, when focusing on the DST task and belief state annotations, it is not necessary to generate them along with the whole dialogue and all other annotations. To avoid error accumulation and unnecessary cost, we propose a method that generates only user utterances and the corresponding belief states at the turn level.
As shown in Figure 6, for each turn of a seed dialogue, we simulate a new user turn, with turn-level belief states and user utterances, as an alternative to the original turn to form a new dialogue. To preserve the consistency of the dialogue flow, we generate turn-level belief states according to the dialog act of the previous turn:
• Request means that the system is requesting the user's requirements for some attributes (slots), and the user is expected to answer by mentioning values for the requested slots. We thus generate new turn-level belief states by selecting some or all of the requested slots and adding some other unmentioned slots.
• Reqmore means that the system asks the user whether they want service in other domains. In this case, we select an unmentioned domain and randomly select several slots in this domain to form a new turn-level belief state.
• For the other dialog acts, we randomly select unmentioned slots in the current domain.
Examples of these three kinds of dialog acts are provided in the Appendix as Table 10. Given the new belief state, we prompt GPT-3 to generate the corresponding user utterance, which is then verified and revised as described in Section 3.4. As for in-context examples, we only need to sample user turns instead of whole dialogues, which greatly reduces the length of the prompts and thus the generation cost.
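The three rules above can be sketched as a single dispatch function. This is a loose illustration under stated assumptions: slots are (domain, slot) pairs, `ontology` maps a domain name to its slot names, and the function name and argument layout are hypothetical.

```python
import random

def new_turn_goal_slots(prev_act, requested_slots, unmentioned_slots,
                        unmentioned_domains, ontology):
    """Pick the slots of a new turn-level belief state based on the
    previous system dialog act, following the three rules above."""
    if prev_act == "request":
        # Answer some or all requested slots, plus some unmentioned ones.
        chosen = random.sample(requested_slots,
                               random.randint(1, len(requested_slots)))
        if unmentioned_slots:
            chosen += random.sample(unmentioned_slots,
                                    random.randint(0, len(unmentioned_slots)))
        return chosen
    if prev_act == "reqmore":
        # Switch to an unmentioned domain and pick several of its slots.
        domain = random.choice(unmentioned_domains)
        names = ontology[domain]
        k = random.randint(1, min(4, len(names)))
        return [(domain, name) for name in random.sample(names, k)]
    # Other acts: randomly pick unmentioned slots in the current domain.
    return random.sample(unmentioned_slots,
                         random.randint(1, len(unmentioned_slots)))
```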
Relatedly, Li et al. (2020) proposed a method that substitutes only the slot values of turn-level belief states to form new belief states (Value Substitution) and trains GPT-2 to generate the corresponding user utterance. However, it requires a large dataset to train the utterance generator, which is not available in low-resource settings.

Experimental Setup
Seed Dataset. We implement our method on the MultiWOZ (Budzianowski et al., 2018) dataset, which consists of 8,438 training, 1,000 validation, and 1,000 test dialogues across 7 domains. As annotation errors exist in the original dataset, we conduct experiments on a cleaner version, MultiWOZ2.3 (Han et al., 2021). To simulate challenging low-resource scenarios, we use 1% (85/8438), 5% (422/8438), and 10% (843/8438) of the training data as the seed dataset and adopt the standard validation/test sets for evaluation. We also simulate 422/843 dialogues given the full training set to evaluate our method's effectiveness as a data augmentation method in the full-shot setting.
Simulated Dataset. We use the largest version of the GPT-3 API, text-davinci-002, with top-p decoding, where p = 0.7. When generating the user goal, we limit the maximum number of requested domains in a dialogue to 4 and the maximum number of slots in each domain to 6. We stop generating a dialogue if the number of turns exceeds 12. We use two in-context examples for all generations. More details are provided in Appendix A.1. PPTOD trained on the seed dataset is utilized as the auxiliary model for automatic revision.
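The generation settings above can be summarized in a small configuration sketch. The values come from the text; the constant names and the config-dict layout (mirroring OpenAI completion API parameter names) are illustrative assumptions.

```python
# Decoding settings for the GPT-3 API call (keys mirror the OpenAI
# completion API parameters; the actual API call is omitted here).
GENERATION_CONFIG = {
    "model": "text-davinci-002",  # largest GPT-3 version used
    "top_p": 0.7,                 # nucleus (top-p) sampling
}

# Simulation limits when building a user goal and rolling out a dialogue.
MAX_DOMAINS_PER_DIALOGUE = 4   # max requested domains in one dialogue
MAX_SLOTS_PER_DOMAIN = 6       # max slots selected per domain
MAX_TURNS = 12                 # stop a dialogue beyond this many turns
NUM_IN_CONTEXT_EXAMPLES = 2    # examples per prompt
```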

Cost Comparison.
MultiWOZ dataset creation required 1,249 workers and cost around $30k, excluding the additional cost of postprocessing (Budzianowski et al., 2018). Assuming a minimum hourly wage of $8, the whole process would take up to 3,750 work hours. In comparison, our approach requires neither human involvement nor parameter updates, except for the minimal effort of collecting the small seed dataset and training an auxiliary model on it. The cost and time are thus mainly derived from GPT-3 API calls. On average, generating a dialogue with GPT-3 costs only $0.52, while generating a training sample (turn) for DST augmentation costs only $0.006. Using open-sourced LLMs such as OPT-175B (Zhang et al., 2022) can avoid this cost and make our method almost free. Each dialogue can be generated within a few seconds, meaning we can create a large-scale dataset such as MultiWOZ within several hours, which greatly shortens the time for dataset creation.
Evaluation Metric. To assess the quality of the simulated dialogues, we evaluate the performance of models trained on them on two benchmark TOD tasks: (1) end-to-end dialogue modeling (E2E) and (2) dialogue state tracking (DST). For E2E evaluation, we use the metrics defined in MultiWOZ (Budzianowski et al., 2018): Inform, Success, BLEU, and an overall metric, Combined Score: BLEU + 0.5 × (Inform + Success).
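The Combined Score metric is a one-liner; a minimal sketch (the function name is ours):

```python
def combined_score(inform, success, bleu):
    """Overall MultiWOZ E2E metric: BLEU + 0.5 * (Inform + Success)."""
    return bleu + 0.5 * (inform + success)
```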
For DST evaluation, we report joint accuracy.
Baselines. We select three recent end-to-end TOD models as baselines: SimpleTOD (Hosseini-Asl et al., 2020), MinTL (Lin et al., 2020), and PPTOD (Su et al., 2021). All three are based on pre-trained transformers: SimpleTOD is initialized with GPT-2 (small), while MinTL and PPTOD are initialized with T5 (small). PPTOD has also been pretrained on heterogeneous dialogue corpora, making it more powerful in low-resource settings. All three models can perform both end-to-end dialogue modeling and DST tasks. We also experiment with a classic DST model, TRADE (Wu et al., 2019). For a fair comparison, we use delexicalized system responses in the same format and the evaluation script of (Zhang et al., 2020; Lin et al., 2020; Su et al., 2021) for E2E evaluation on all these models. During inference, we do not use any oracle information. For DST evaluation, we use lexicalized utterances. We use the default hyperparameters from the original implementations.

End-to-end Dialogue Modeling
We here investigate a realistic question when building a TOD system for a new task or domain: would our method be a good alternative to crowdsourcing for expanding a small corpus of dialogue data? To answer this question, we combine the seed dataset with (1) dialogues simulated with our method from the seed dataset (Sim.); (2) human-generated dialogues from the original dataset, excluding the seed data (Orig.). We then train several representative TOD models on these two datasets and compare their performance. For a fair comparison, we use the same amount of simulated and original dialogues and the same set of seed data across all setups.
As seen in Table 1, the models trained on simulated dialogues along with the seed dataset perform much better than those trained only on the seed dataset. Remarkably, compared with human-generated dialogues, the same amount of our simulated dialogues leads to even greater performance improvement in most cases. Our simulated dialogues can still improve performance when the full training data are provided. We only show results for a small number of simulated dialogues (422/843) here; one can generate more dialogues to further improve performance. These results suggest the effectiveness of our method for data augmentation. This is expected: with more seed data from which to select in-context examples and train the auxiliary revision model, we can generate more diverse dialogues, more accurate annotations, and thus higher-quality dialogues.
To understand these observations, we further analyze the statistics of simulated and human-generated dialogues, presented in Table 2. Our simulated dialogues have more requested domains and dialogue turns than human-generated dialogues, which are controlled by the generated user goals. The larger number of sub-tasks and sub-conversations in the simulated dialogues improves the model's ability to deal with more complex and challenging multi-domain tasks, leading to higher Inform and Success rates. On the other hand, the system responses generated by our method have a comparable number of unique tokens but fewer 3-grams than the original ones, which explains why the BLEU score is slightly lower than with human-generated dialogues.
Effect of Automatic Revision. Table 3 shows the performance of PPTOD trained on dialogues generated with and without automatic revision under the 1% setting. Without automatic revision, the generated dialogues lead to smaller performance improvements on all metrics, which suggests the importance of our controllable generation strategy and automatic revision methods. We list the frequency of automatic revision in Appendix B.

Dialogue State Tracking
Next, we investigate the effectiveness of the belief state annotations generated by our method as augmented data for DST training in low-resource settings. We simulate the low-resource setting using 1% of the MultiWOZ training set as seed data. Table 4 shows the performance of various models trained on different sizes of augmented data. The augmented data consistently improves performance across models. As the amount of augmented data increases, accuracy keeps increasing, though the upward trend gradually slows.

Human Evaluation
To obtain a more comprehensive measure of the quality of the simulated data compared with the original human-generated data, we conducted a blind human evaluation. Three participants with NLP backgrounds were given 50 dialogues simulated from 1% of the training data and another 50 dialogues from the original dataset, without knowing their source (simulated or original). Following Mohapatra et al. (2020), we asked the participants to check the quality of the conversation and the annotations of each dialogue turn by answering the following questions: (1) "Are the utterances grammatically correct?" (2) "Is the user utterance fluent and natural?" (3) "Is the system response fluent and natural?" (4) "Is the belief state annotation consistent with the user utterance?" (5) "Is the dialog act consistent with the system response?" For each question, the participants answered "yes" or "no". We adopt a majority vote, requiring at least two votes, to decide the final answer, and report the percentage of turns that pass each measure (with the answer "yes"). We find that, leveraging the strong generation ability of GPT-3, our method can generate even more grammatically correct conversations than human crowdworkers. The generated user utterances are comparably fluent and natural to the original human-generated ones. In contrast, the generated system responses are not as fluent. We suspect this is because the system responses are delexicalized, which is more challenging for GPT-3 to understand and imitate. As for annotation quality, although the original dialogues are generated by humans, annotation errors still exist, suggesting the difficulty of task-oriented dialogue annotation. As expected, our generated dialogue data contains more annotation errors than the human-generated data. Considering that as few as 85 dialogues are given and no human involvement is required, we believe the gap is acceptable and can be bridged with more seed data. Overall, the conversations generated by our method have comparable quality to human-generated ones, while the generated annotations are not as accurate as human annotations. Encouragingly, the noisy annotations still bring considerable performance improvement.

Conclusion
In this paper, we propose a dialogue simulation method based on in-context learning with large language models to automate dataset creation. Our method can generate dialogues and annotations given only a few seed dialogues. The simulation process requires zero or minimal human involvement and model training, making our method much more cost-efficient and time-saving than crowdsourcing. Human and automatic evaluations demonstrate that the simulated dialogues have comparable quality to human-generated ones, showing the potential of our method as an alternative to crowdsourcing for dialogue dataset creation.

Limitations
In this paper, we investigate ways to leverage in-context learning with GPT-3 to automatically generate high-quality task-oriented dialogues for building dialogue systems. Although our method can already generate high-quality dialogues without requiring human involvement, some limitations remain in real-world applications. GPT-3 lacks reliability and inevitably generates some unexpected data even with automatic revision. Human review and revision are still necessary to ensure the annotations are completely correct, yet this is challenging because a revision at one step influences later steps. An effective and efficient human-machine collaboration approach is therefore a direction for future work. In addition, as the dialogues along with their annotations are very lengthy, it is essential to reduce their length to lower the generation cost and enable the use of more in-context examples.

Ethics Statement
Our proposed method instructs LLMs to generate dialogues for building dialogue systems. However, LLMs such as GPT-3 (Brown et al., 2020) have been observed to generate toxic or biased text (Brown et al., 2020; Lucy and Bamman, 2021; Chan, 2022). Although a new version of GPT-3 called InstructGPT has been released, trying to reduce such toxic language, the issue has not been sufficiently addressed. Thus, automatic filtering or human review is necessary to exclude parts of the simulated dialogues from the training data, to prevent the models from generating undesirable responses containing toxicity and bias.

A.2 User Goal Generation

We compare three methods to generate user goals for the target dialogues: (1) Random Sampling (RS), (2) Value Substitution (VS), and (3) Combination (Comb.). We show the performance of PPTOD trained on dialogues simulated with the different goal generation methods and varying augmentation sizes in Figure 7. We also show the performance of the model trained on the same amount of original dialogues as a baseline (Ori., dashed line).
We can see that the VS method performs worst, as it can only change the slot values; GPT-3 thus tends to simply replicate the in-context examples. The Combination method achieves the best performance, as it can cover most of the information needed in the target dialogues while encouraging GPT-3 to generate diverse data. However, it is harder to improve further as the augmentation size increases, since the seed dialogues, and thus their combinations, are limited. In contrast, as the RS method can generate arbitrary user goals, the model performance keeps increasing. Nevertheless, there is still an upper bound on performance, which depends on the number of provided seed dialogues.
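The three goal generation strategies can be sketched as follows. This is a minimal illustration under a toy ontology; the function names and the `ONTOLOGY` dictionary are our own stand-ins, not the paper's implementation, and the real ontology comes from MultiWOZ (see Table 6).

```python
import random

# Toy ontology; the real one is the MultiWOZ ontology in Table 6.
ONTOLOGY = {
    "hotel": {
        "pricerange": ["cheap", "moderate", "expensive"],
        "area": ["north", "south", "centre"],
        "parking": ["yes", "no"],
    }
}

def random_sampling(domain):
    """RS: sample slots and values freshly from the ontology,
    so any user goal can be produced."""
    slots = random.sample(list(ONTOLOGY[domain]), k=2)
    return {s: random.choice(ONTOLOGY[domain][s]) for s in slots}

def value_substitution(seed_goal, domain):
    """VS: keep the seed goal's slots but swap in new values."""
    return {s: random.choice(ONTOLOGY[domain][s]) for s in seed_goal}

def combination(seed_goals):
    """Comb.: merge the slot-value pairs of two seed goals."""
    merged = {}
    for goal in random.sample(seed_goals, k=2):
        merged.update(goal)
    return merged
```

The sketch makes the trade-off concrete: VS stays inside the seed goal's slot structure, Comb. recombines seed material, while RS draws from the full ontology.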

A.3 Turn-level Generation
We generate user turns for DST augmentation based on the human-generated seed dialogues. The hyper-parameters for the GPT-3 API calls are the same as for dialogue generation, as described in Appendix A.1. We use 2 in-context examples by default; one can increase this number for better generation quality. For each turn from the original dialogues, we generate augmented user turns and concatenate them with the previous turns in the dialogue to create a new training sample. We generate the user turns based on the dialog acts of the last system turn in the context. Suppose the last system turn contains the request act. In that case, we randomly select at least one slot from the requestable slots and at least two more slots from the unmentioned slots in the current domain to form the belief state of the augmented user turn. If the last system turn contains the reqmore act, we randomly select one unmentioned domain and at least 1, at most 4 slots in that domain. In all other cases, we randomly drop at least one slot from the original belief state and add at least one unmentioned slot.

Table 6: Full ontology for all domains in MultiWOZ 2.3 (Han et al., 2021). The superscript indicates which domains a slot belongs to. *: universal, 1: restaurant, 2: hotel, 3: attraction, 4: taxi, 5: train, 6: hospital, 7: police. The table is adapted from (Budzianowski et al., 2018).
Having selected the slots for the belief state, we randomly select a possible value for each slot to create the concrete belief state of the augmented user turn.
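The slot-selection rules above can be sketched as a single dispatch on the last system dialog act. This is a minimal sketch with our own function name and argument structure, not the authors' code; for simplicity it drops exactly one slot and adds exactly one in the default case, where the paper allows "at least one".

```python
import random

def build_turn_belief_state(last_act, domain_slots, requestable,
                            mentioned, orig_belief, unmentioned_domains):
    """Select the slots of the augmented user turn based on the
    dialog act of the last system turn, following the rules above."""
    unmentioned = [s for s in domain_slots if s not in mentioned]
    if last_act == "request":
        # >=1 requestable slot plus >=2 unmentioned slots.
        return random.sample(requestable, k=1) + random.sample(unmentioned, k=2)
    if last_act == "reqmore":
        # One unmentioned domain, 1 to 4 of its slots.
        domain = random.choice(list(unmentioned_domains))
        cand = unmentioned_domains[domain]
        return random.sample(cand, k=random.randint(1, min(4, len(cand))))
    # Otherwise: drop one slot from the original belief state
    # and add one unmentioned slot.
    keep = random.sample(list(orig_belief), k=len(orig_belief) - 1)
    return keep + random.sample(unmentioned, k=1)

def fill_values(slots, ontology):
    """Pick a random possible value for each selected slot."""
    return {s: random.choice(ontology[s]) for s in slots}
```

A concrete belief state is then obtained by passing the selected slots and the domain ontology to `fill_values`.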

B Automatic Revision Frequency
To investigate how often the data generated by GPT-3 needs to be revised and how many errors can be corrected by our method, we randomly selected 20 dialogues simulated under the 1% low-resource setting and manually checked the annotations. Out of 170 turns in total, GPT-3 generates incorrect belief state annotations in 31 turns (18 de-generations / 13 over-generations). The auxiliary generator corrects GPT-3 in 13 of the 18 de-generated turns, while the slot-value match filter corrects 10 of the 13 over-generated turns. Finally, only 6.47% of the revised belief states are still incorrect (11 out of 170), while the figure for dialog acts is 11.18% (19 out of 170).
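The reported error rates follow directly from the counts; a quick arithmetic check, using only the numbers stated above:

```python
# Sanity-check of the reported revision statistics (20 dialogues,
# 170 turns, counted manually in the paper's study).
total_turns = 170
belief_errors_after_revision = 11
act_errors_after_revision = 19

belief_rate = round(100 * belief_errors_after_revision / total_turns, 2)
act_rate = round(100 * act_errors_after_revision / total_turns, 2)
print(belief_rate, act_rate)  # 6.47 11.18
```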

C Generation Examples

C.1 Dialogue Generation Example
An example of a complete prompt is shown in Table 8; the controllable generation process of the dialogue given this prompt is presented in Table 9. The process is fully automated.

C.2 Turn-level Generation Example
We provide an illustration of the three different types of augmented turns for DST in Table 10, and the corresponding prompt and generation process in Table 11.

Table 9: The controllable generation process of a dialogue given the prompt in Table 8. For each turn, GPT-3 first generates the belief state and the user utterance. We parse the belief state, which is then verified and revised automatically. We replace the originally generated belief state with the revised one in the user turn. The revised user turn is then used to query the database and concatenated with the dialogue context to continue the generation. As for the system turn, we use the revised dialog act, conditioned on which we prompt GPT-3 to generate the system response. Note that the user goal of the target dialogue is allowed to change during generation. We keep the updated user goal instead of the original one in the prompt (instruction 3), which is only used to initiate the generation of the target dialogue.
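The per-turn loop described in the Table 9 caption can be summarized as a short sketch. This is our own structural illustration, not the released code: `gpt3`, the two revisers, and `query_db` are hypothetical stand-ins passed in as callables.

```python
def simulate_turn(gpt3, revise_belief, revise_act, query_db, context):
    """One iteration of the controllable generation loop. The four
    callables stand in for the paper's components: the LLM, the
    belief-state reviser (auxiliary generator + slot-value match
    filter), the dialog-act reviser, and the database."""
    # 1. GPT-3 proposes the belief state and user utterance together.
    belief, user_utt = gpt3("user", context)
    # 2. Verify/revise the belief state, then substitute it back
    #    into the user turn before it enters the context.
    belief = revise_belief(belief, user_utt)
    context = context + [("user", belief, user_utt)]
    # 3. Query the database with the revised belief state.
    db_result = query_db(belief)
    # 4. Revise the dialog act and condition GPT-3 on it to
    #    produce the system response.
    act = revise_act(gpt3("act", context), db_result)
    system_utt = gpt3("system", context + [("act", act)])
    return context + [("system", act, system_utt)]
```

Because the revised belief state replaces the generated one inside the context, every later step (database query, dialog act, system response) is conditioned on the corrected annotation rather than GPT-3's raw output.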

Figure 1: Illustration of a part of an annotated dialogue generated by our method. Left: the conversations and annotations are generated simultaneously by GPT-3, where the user utterances are in blue, the system responses are in green, and the annotations are in red. Right: the structured annotation obtained by parsing GPT-3's generation shown on the left. Best viewed in color. A complete generated dialogue is shown in Appendix C.1 as Table 9.

Figure 2: Overview of the proposed method.

Figure 3: Illustration of an in-context example from the MultiWOZ dataset. The user goal, belief states, and dialog acts are in red. User utterances are in blue, while system responses are in green. Best viewed in color.

Figure 4: Template for the prompt of GPT-3 to generate new dialogues. An actual example of the complete prompt is shown in Appendix C.1 as Table 8.

Figure 5: Illustration of the controllable generation process of a dialogue turn. An example of the generation process of a complete dialogue is shown in Appendix C.1 as Table 9.
User: I am looking for a place to to stay that has cheap price range .
System: Okay , do you have a specific area you want to stay in ?
User: No , but it should have free parking , please
User: I need it to be in center of the town, and in the type of hotel .

Figure 6: Illustration of the turn-level generation for DST augmentation. The turn-level belief state is decided based on the dialog act of the last turn. The user utterance, which is underlined, is generated conditioned on the turn-level belief state. The newly generated user turn is concatenated with previous turns of the original dialogue to form a new dialogue. An example of a generated user turn is shown in Appendix C.2 as Table 11.

Figure 7: The combined score of PPTOD trained with dialogues simulated with different methods and augmentation sizes, with 1% of the training data as the seed data.
You are going to book a hotel, and your requirements for the hotel are ([hotel] type is hotel , pricerange is cheap , parking is yes , stay is 2 , day is tuesday , people is 6). Make sure you get the booking information once booked.
User ([hotel] stay is 3 , day is tuesday , people is 6): yes , please . 6 people 3 nights starting on tuesday .
Assistant ([hotel] [nobook] day [request] stay day): i am sorry but i was n't able to book that for you for [value_day] . is there another day you would like to stay or perhaps a shorter stay ?
User ([hotel] stay is 2): how about only 2 nights .
Instruction:
Conversation:
User ([hotel] type is hotel , pricerange is cheap): I am looking for a place to to stay that has cheap price range it should be in a type of hotel .
Assistant ([hotel] [request] area): okay , do you have a specific area you want to stay in ?
User ([hotel] parking is yes , pricerange is cheap): no , i just need to make sure it is cheap . oh , and i need parking .
Assistant ([hotel] [inform] price choice parking type [offerbook]): i found [value_choice] [value_price] [value_type] for you that include -s parking . do you like me to book it ?
Assistant ([hotel] [offerbooked] reference [general] [reqmore]): booking was successful . reference number is : [value_reference] . anything else i can do for you ?
User ([general]): no , that will be all . goodbye .
Assistant ([general] [bye]): thank you for using our services .

Table 1 :
End-to-end modeling evaluation on MultiWOZ 2.3 (Han et al., 2021), where I, S, B, C stand for the Inform, Success, BLEU, and Combined Score metrics, respectively. Sim. and Orig. stand for simulated and original dialogues. The highest scores are bolded. The average improvements are w.r.t. the combined scores.

Table 2 :
Comparison of the statistics of simulated dialogues (Sim.) with human generated dialogues from the original MultiWOZ dataset (Orig.).

Table 3 :
Ablation study on the automatic revision under 1% low-resource setting.

Table 5 :
Human evaluation on the simulated dialogues and original dialogues w.r.t. the percentage of dialogue turns that satisfy each quality measure.
Original user turn generation: User ([hotel] area is south , stay is 5 , people is 4): i need a hotel in the south side please .
GPT-3 generated belief state: <hotel, area, south> , <hotel, book stay, 5> , <hotel, book people, 4>
Revised belief state: <hotel, area, south> , <hotel, book stay, 5> , <hotel, book people, 4> , <hotel, type, hotel>
Revised user turn generation: User ([hotel] area is south , type is hotel): i need a hotel in the south side please .
Revised dialog act: <hotel, inform, area>, <hotel, inform, name>, <hotel, inform, internet>, <hotel, inform, parking>, <hotel, inform, type>, <hotel, inform, stars>, <hotel, offerbook, none>
Revised system turn generation: Assistant ([hotel] [inform] area name internet price parking type stars [offerbook]): the [value_name] hotel is in the south side and it has [value_internet] . the price is [value_price] per night and it has [value_parking] . it is a [value_stars] star hotel .
Original user turn generation: User ([hotel] stay is 5 , people is 4): i would like to to book it for 4 people and 5 nights .

Original user turn generation: User ([train] destination is birmingham new street , arrive is 13:06): i need a train to birmingham new street station that arrives by 13:06 please .
GPT-3 generated belief state: <train, destination, birmingham new street> , <train, arrive, 13:06>
Revised belief state: <train, destination, birmingham new street> , <train, arrive, 13:06>
Revised user turn generation: User ([train] destination is birmingham new street , arrive is 13:06): i need a train to birmingham new street station that arrives by 13:06 please .
Assistant ([train] [request] day departure): what day will you be leaving and what is your departure station ?
Original user turn generation: User ([train] day is saturday , departure is cambridge): i will be leaving this saturday from cambridge station .
GPT-3 generated belief state: <train, day, saturday> , <train, departure, cambridge>
Revised belief state: <train, day, saturday> , <train, departure, cambridge>
Revised user turn generation: User ([train] day is saturday , departure is cambridge): i will be leaving this saturday from cambridge station .
Revised dialog act: <train, inform, arrive>, <train, inform, id>, <train, inform, leave>, <train, offerbook, none>
Revised system turn generation: Assistant ([train] [inform] arrive id leave [offerbook]): the train arrives at [value_arrive] and the id is [value_id] . would you like me to book it for you ?
Assistant ([general] [bye]): you are welcome , please contact us if you need anything else .