NatCS: Eliciting Natural Customer Support Dialogues

Despite growing interest in applications based on natural customer support conversations, there exist remarkably few publicly available datasets that reflect the expected characteristics of conversations in these settings. Existing task-oriented dialogue datasets, which were collected to benchmark dialogue systems mainly in written human-to-bot settings, are not representative of real customer support conversations and do not provide realistic benchmarks for systems applied to natural data. To address this gap, we introduce NatCS, a multi-domain collection of spoken customer service conversations. We describe our process for collecting synthetic conversations between customers and agents based on natural language phenomena observed in real conversations. Compared to previous dialogue datasets, the conversations collected with our approach are more representative of real human-to-human conversations along multiple metrics. Finally, we demonstrate potential applications of NatCS, including dialogue act classification and intent induction from conversations, showing that dialogue act annotations in NatCS provide more effective training data for modeling real conversations than existing synthetic written datasets. We publicly release NatCS to facilitate research in natural dialogue systems.


Introduction
Applications built on human-to-human customer support conversations have become increasingly popular in recent years. For example, assistive tools that aim to support human agents, provide analytics, and automate mundane tasks have become ubiquitous in industry applications (Amazon Contact Lens, 2023; Google Contact Center AI, 2023; Microsoft Digital Contact Center Platform, 2023). Despite this growing interest in data processing for natural customer service conversations, to the best of our knowledge, there exist no public datasets to facilitate open research in this area.
Existing dialogue datasets focusing on development and evaluation of task-oriented dialogue (TOD) systems contain conversations that are representative of human-to-bot (H2B) conversations adhering to restricted domains and schemas (Budzianowski et al., 2018; Rastogi et al., 2020). Realistic live conversations are difficult to simulate due to the training required to convincingly play the role of an expert customer support agent in non-trivial domains (Chen et al., 2021). Existing datasets also consist primarily of written rather than spoken conversations, as written dialogues are cheaper to collect and annotate asynchronously through crowdsourcing platforms.
To address these gaps, we present NATCS, a multi-domain dataset containing English conversations that simulate natural, human-to-human (H2H), two-party customer service interactions. First, we describe a self-collected dataset, NATCS SELF, where we use a strict set of instructions asking participants to write both sides of a conversation as if it had been spoken. Second, we present a spoken dataset, NATCS SPOKE, in which pairs of participants were each given detailed instructions and asked to carry out and record conversations, which were subsequently transcribed. We observe that the resulting conversations in NATCS share more characteristics with real customer service conversations than pre-existing dialogue datasets in terms of diversity and modeling difficulty.
We annotate a subset of the conversations in two ways: (1) we collect task-oriented dialogue act annotations, which label utterances that are important for moving the customer's goal forward, and (2) we categorize and label customer goals and goal-related information with an open intent and slot schema, mimicking the process for building a TOD system based on natural conversations. We find that classifiers trained with the resulting dialogue act annotations achieve higher accuracy on real data than models trained with pre-existing TOD data.
Our main contributions are threefold: • We present NATCS, a multi-domain dialogue dataset containing conversations that mimic spoken H2H customer service interactions, addressing a major gap in existing datasets.
• We show that NATCS is more representative of conversations from real customer support centers than pre-existing dialogue datasets by measuring multiple characteristics related to realism, diversity and modeling difficulty.
• We provide TOD dialogue act and intent/slot annotations on a subset of the conversations to facilitate evaluation and development of systems that aim to learn from real conversations (such as intent and slot induction), empirically demonstrating the efficacy of the dialogue annotations on real data.
Our paper is structured as follows: In Section 2, we review other dataset collection methods and approaches for evaluating dialogue quality. In Section 3, we describe how NATCS conversations and annotations were collected. In Section 4, we compare NATCS with pre-existing dialogue datasets as well as real H2H customer service conversations. Finally, in Section 5, we further motivate the dataset through two potential downstream applications, task-oriented dialogue act classification and intent induction evaluation.

Related Work
Dialogue Dataset Collection The goal of NATCS, to produce conversations that emulate real spoken customer service interactions, differs substantially from previous synthetic dialogue dataset collections. Previous synthetic, goal-oriented datasets have used the Wizard of Oz framework (Kelley, 1984). This framework calls for one person to interact with what they think is a computer but is actually controlled by another person, thus encouraging a human-to-bot (H2B) style. Wen et al. (2016) define a specific version of this approach to be used with crowdsourced workers to produce synthetic, task-oriented datasets. This approach has since been adopted as a standard method to collect goal-oriented conversations (Peskov et al., 2019; Byrne et al., 2019; Budzianowski et al., 2018; El Asri et al., 2017; Eric et al., 2017). MultiDoGO (Peskov et al., 2019) (MDGO), SGD (Rastogi et al., 2020), and TaskMaster (Byrne et al., 2019) are particularly relevant comparisons to our work. MDGO explicitly encourages dialogue complexities, which serve as inspiration for our complexity-driven methodologies. SGD is notable for the scale of its collection, the relevance of its task-oriented dialogue act and intent/slot annotations, and its divergence from the common Wizard-of-Oz practice through the use of dialogue templates and paraphrasing. TaskMaster presents a methodology for self-collected dialogues, which is analogous to our NATCS SELF collection.
Despite these similarities, the methodology for NATCS differs significantly from any of these pre-existing datasets, because of both the target modality (spoken, or spoken-like conversations) and the setting (H2H instead of H2B).
Analysis of Dialogue Quality An important component of our work is the comparison of synthetic dialogue datasets with real data. We adopt multiple metrics for comparing NATCS with both real and previous TOD datasets. Byrne et al. (2019) use perplexity and BLEU score as stand-ins for "naturalness", with the logic that a more realistic dataset should be harder for a model to learn, because realistic data tends to be more diverse than synthetic data. Previous collections of dialogue datasets have also made comparisons based on surface statistics such as the number of dialogues, turns, and unique tokens. Casanueva et al. (2022) compare intent classification datasets based on both lexical diversity metrics and a semantic diversity metric computed using a sentence embedding model.
There are a number of metrics more common in other sub-domains that may be useful for measuring naturalness in dialogues. The measure of textual lexical diversity (MTLD), discussed in depth by McCarthy and Jarvis (2010), provides a lexical diversity measure less biased by document length. Liao et al. (2017) introduce a dialog complexity metric, intended for analyzing real customer service conversations, which is computed by assigning importance measures to different terms, computing utterance-level complexity, and weighting the contribution of utterances based on their dialogue act tags. Hewitt and Beaver (2020) performed a thorough comparison of the style of human-to-human vs. human-to-bot conversations, using lexical diversity measures, syntactic complexity, and other dimensions like gratitude, sentiment, and amount of profanity, though the data was not released.
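As a concrete illustration, MTLD can be sketched as follows. This is a minimal reimplementation of the McCarthy and Jarvis procedure, not the exact code used in any of the cited works; the 0.72 TTR threshold is their commonly used default, and details such as the handling of the final partial factor vary across implementations.

```python
def mtld_forward(tokens, threshold=0.72):
    """One directional pass of MTLD: count 'factors', i.e. stretches of
    text that keep the running type-token ratio above the threshold."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        types.add(tok.lower())
        count += 1
        if len(types) / count <= threshold:
            # TTR fell to the threshold: close this factor and reset
            factors += 1.0
            types, count = set(), 0
    if count > 0:
        # leftover stretch contributes a partial factor
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - threshold)
    return len(tokens) / factors if factors > 0 else float(len(tokens))


def mtld(tokens, threshold=0.72):
    # McCarthy and Jarvis average a forward and a reversed pass
    return 0.5 * (mtld_forward(tokens, threshold)
                  + mtld_forward(tokens[::-1], threshold))
```

Higher MTLD means the text sustains lexical variety over longer stretches, which is why it is less sensitive to document length than a raw type-token ratio.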

Collection Methods
We propose two collection methods as part of NATCS. We have three goals for the resulting conversations: (1) they should exhibit the spoken modality, (2) all conversations from each domain should appear to be from the same company, and (3) they should appear to be real, human-to-human conversations between a customer and an agent. We explore two methodologies, resulting in the NATCS SPOKE and NATCS SELF datasets, to weigh collection cost and complexity against dataset effectiveness.
To support goal (3), we propose a set of discourse complexity types. The motivation for providing specific discourse complexities is to encourage some of the noise and non-linearity present in real human-to-human conversations. Based on manual inspection of 10 transcribed conversations from a single commercial call center dataset, we identify a combination of human expressions (social niceties, emotionally-charged utterances), phenomena mimicking imperfect, non-linear thought processes (change of mind, forgetfulness/unknown terminology, unplanned conversational flows), reflections of the wider context surrounding the conversation (continuing from a previous conversation, pausing to find information), distinctions between speakers' knowledge bases, and the use of multiple requests in single utterances (stating multiple intents, providing multiple slot values). A list of these complexities along with estimated target percentages (the minimum percentage of conversations in which each phenomenon should be present) is provided in Figure 1. Descriptions and examples are provided in Appendix Table 15.
To achieve cross-dataset consistency, supporting goal (2), collectors are provided with mock company profiles, including the name of the mock company as well as mock product or service names with associated prices. Collectors are also provided with a schema of intents and associated slots. Some flexibility is allowed in the slot schema to reflect real-world situations where customers may not have all requested information on hand. Examples of company profiles and intent schemas are provided in Appendix Tables 13 and 14. For each conversation, we sample a set of minimum discourse complexity types. For example, one conversation could be assigned the target complexities of ChitChat, FollowUpQuestion, MultiElicit, and SlotLookup. Scenarios eliciting each of these complexity types are generated and provided to the participants.
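The sampling step can be sketched as follows. Treating each target percentage as an independent inclusion probability is our assumption (the paper does not specify the sampling procedure), and the target values shown are illustrative only; the real values appear in Figure 1.

```python
import random

# Illustrative targets only: complexity type -> minimum fraction of
# conversations that should exhibit it (real values are in Figure 1).
TARGETS = {"ChitChat": 0.5, "FollowUpQuestion": 0.4,
           "MultiElicit": 0.3, "SlotLookup": 0.3}


def sample_complexities(targets, rng):
    """Independently include each complexity type with probability equal
    to its minimum target coverage, so that in expectation at least the
    target fraction of conversations exhibits each phenomenon."""
    return sorted(c for c, p in targets.items() if rng.random() < p)
```

Each sampled set is then turned into concrete scenarios on the participants' instruction sheets.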
For NATCS SPOKE, one participant plays the part of the customer service representative ("agent"), and one participant plays the part of the customer ("customer"). The participants, seated in the same room, are recorded as they act out the scenarios described on their instruction sheets. These audio recordings are then transcribed and annotated for the complexities that actually occurred.
Given the time, cost, and complexity involved in the creation of the NATCS SPOKE dataset, we also apply an alternative approach, the NATCS SELF method. For NATCS SELF, participants write self-dialogues as if they were spoken out loud. This method has the benefit of (1) not requiring transcription, and (2) requiring only one participant per conversation, and therefore no scheduling to match participants together. However, it relies on the participants understanding the distinction between spoken and written modality data. The NATCS SELF method follows a similar set-up to the NATCS SPOKE method, except that in addition to being provided with a set of target discourse complexities, participants are also provided with a set of spoken form complexities. While discourse complexities target discourse-related phenomena, spoken form complexities consist of phenomena specifically observed in spoken speech. For this complexity type, we include phenomena such as hesitations or fillers ('um'), rambling, spelling, and backchanneling ('uh huh go on'). A list of these complexities along with target percentages is provided in Figure 4, and further examples are provided in Appendix Table 16.

Annotations
One goal of collecting realistic dialogues is to facilitate the development and evaluation of tools for building task-oriented dialogue systems from H2H conversations. To this end, we perform two types of annotations on a subset of NATCS: Dialogue Act (DA) annotations and Intent Classification and Slot Labeling (IC/SL) annotations.
IC/SL annotations are intended to label intents and slots, two key elements of many TOD systems. An intent is broadly a customer goal, and a slot is a smaller piece of information related to that goal. We use an open labelset, asking annotators to come up with specific labels for each intent and slot, such as "BookFlight" and "PreferredAirline", as opposed to simply "Intent" and "Slot". Annotators are instructed to label the same intent no more than once per conversation. For slots, we use the principle of labeling the smallest complete grammatical constituent that communicates the necessary information.
Our DA annotations are intended to identify utterances that move the dialog towards the customer's goal. TOD systems often support only a small set of dialogue acts that capture supported user and agent actions. For the agent, these may include eliciting the user's intent or asking for slot values associated with that intent (ElicitIntent and ElicitSlot, respectively). For the user, such acts may include informing the agent of a new intent or providing relevant details for resolving their request (InformIntent and InformSlot, respectively). Such acts provide a limited view of the actions taken by speakers in natural conversations, but do provide a way to identify and categorize automatable interactions in natural conversations. Table 1 provides an example conversation annotated with intents, slots, and dialogue acts.

Dataset Analysis
To better motivate NATCS as a proxy for natural, spoken form customer service conversations from multiple domains with a diverse set of intents, we compare with real conversations from commercial datasets comprising 5 call centers for retail and finance-related businesses (henceforth REAL). All datasets in REAL consist of manually-transcribed conversations between human agents and customers in live phone conversations, where all personally-identifiable information has been pre-redacted. We restrict our analysis to datasets with primarily customer-initiated two-party dialogues.
Table 2 compares NATCS with pre-existing TOD datasets, including MWOZ (Budzianowski et al., 2018), SGD (Rastogi et al., 2020), and TM1 Self (Byrne et al., 2019), alongside datasets collected from our two methodologies (NATCS SELF and NATCS SPOKE) and the 5 call center datasets (REAL). MTLD is a lexical diversity measure (McCarthy and Jarvis, 2010) computed at the conversation level. NATCS conversations contain more turns per conversation on average than existing TOD datasets (vs. 22 for TM1 Self). Furthermore, each turn contains more words, suggesting increased complexity in spoken H2H dialogues. NATCS closely matches REAL in terms of conversation and turn lengths.

Intents and Slots
Table 3 provides a comparison of intent and slot annotations between existing synthetic datasets, NATCS, and REAL. Datasets in REAL contain considerably more intents and slots for a particular domain than existing TOD datasets like MDGO. Turns containing intents are longer for both NATCS and REAL than for SGD and MDGO.
Figure 2 compares the intent and slot distributions between Retail A and NATCS SELF Insurance, indicating that both have skewed, long-tailed distributions of intents/slots, a product of the open intent/slot schemas used in NATCS.

Diversity Metrics
As we expect conversations in REAL to be less homogeneous than synthetic dialogue datasets, we compute automatic metrics to measure multiple aspects of diversity and compare with NATCS.

Conversational Diversity
In Table 3, we examine the diversity of conversation flows as measured by the ratio of unique sequences of slots informed by the customer to the total number of sequences (e.g., slot bigrams or trigrams). In SGD, which constructs dialogue templates using a simulator, we observe a much lower percentage of unique n-grams than in REAL, despite SGD containing dialogues spanning multiple domains and services. On the other hand, while NATCS has lower slot n-gram diversity than REAL, both collection types have substantially higher slot n-gram diversity scores than either MDGO or SGD.
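The slot n-gram diversity ratio described above can be sketched as follows; this is a minimal implementation under our reading of the metric, where each conversation contributes the ordered sequence of slot names the customer informed.

```python
def slot_ngram_diversity(slot_sequences, n=2):
    """Ratio of unique slot n-grams to total slot n-grams.

    slot_sequences: list of per-conversation lists of slot names, in the
    order the customer informed them.
    """
    ngrams = []
    for seq in slot_sequences:
        ngrams.extend(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)
```

A templated collection that repeats the same slot order in every dialogue drives this ratio toward zero, while varied, natural flows keep it high.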
The average perplexity of a language model provides one indication of the difficulty of modeling the dialogue in a given dataset (Byrne et al., 2019). High perplexity indicates higher difficulty, while low perplexity can indicate more uniform or predictable datasets. We compare fine-tuning and zero-shot language modeling settings using GPT-NEO (Black et al., 2021). Zero-shot evaluation gives an indication of how compatible the dataset is with the pre-trained model without any fine-tuning. Section A.1 provides details on the fine-tuning procedure.
As shown in Table 4, we observe high perplexity on real data and low perplexity on synthetic datasets like SGD and MDGO. NATCS has lower perplexity than REAL, but considerably higher perplexity than MDGO, MWOZ, and SGD. Interestingly, there is a wider range of perplexities across real datasets, while most existing TOD datasets have a perplexity of 10 or less.

Intent Diversity
We also investigate the semantic diversity of intent turns (SemDiv intent) following Casanueva et al. (2022). Details of this calculation are provided in Section A.1. As shown in Table 3, we observe the highest SemDiv intent for REAL, but NATCS has considerably higher SemDiv intent than pre-existing synthetic datasets, indicating greater potential modeling challenges. We also compare the semantic diversity of NATCS with other datasets for specific aligned intents, like CheckBalance, in Appendix Table 9, again observing higher semantic diversity as compared to pre-existing intent classification benchmarks. Further investigation into the lexical diversity of intents is provided in Section A.1.

NATCS Dialogue Acts Distribution
To better understand the characteristics of typical H2H customer service dialogues in call centers, we annotate a subset of real conversations with the "task-oriented" dialogue acts described in Section 3.2. Because NATCS dialogue acts consist of a small set of intent and slot-related functions commonly employed in automated TOD systems, we do not expect the labels to have high coverage in real conversations that do not revolve around a fixed set of intents and slots. However, they aim to provide a mechanism for aligning turns from natural conversations onto these automatable TOD constructs.
Table 5 compares the percentage of turns labeled with NATCS dialogue acts across multiple datasets, along with the average counts of each label per dialogue. As expected, compared to synthetic data, dialogues in REAL have fewer turns labeled with NATCS dialogue acts (27.7%). NATCS SPOKE and NATCS SELF are more reflective of REAL in this regard: more than half of the turns in each conversation are not labeled, despite higher total counts of dialogue acts per conversation as compared to pre-existing datasets.

Human Evaluation
We also perform a human evaluation comparing the MDGO, SGD, and REAL datasets against both NATCS SELF and NATCS SPOKE. Because of the large disparity in conversation lengths, rather than compare complete dialogues, we restrict evaluation to snippets comprising the first 5 turns after an intent is stated (including the turn containing the intent). Conversation snippets are graded on a scale of 1 to 5 along multiple dimensions, including realism (believability of the dialogue), concision (conciseness of the customer, lack of verbosity), and spoken-likeness (possibility of being part of a spoken conversation). See Section A.2 and Figure 3 for the explicit definitions provided to graders. The evaluation is conducted by 6 dialogue systems researchers, with each grader rating 50 randomly-selected conversations. As indicated by the results in Table 6, conversations from REAL receive high values for both realism and spoken-likeness (4.71 and 4.95 respectively), with lower values for concision, indicating greater customer verbosity in real dialogues. The results also indicate that although expert human graders can still differentiate NATCS from real conversations, NATCS is graded as significantly more realistic and indicative of spoken modality than SGD and MDGO (two-tailed t-test with p < 0.005).

Applications
In this section, we investigate two potential applications of NATCS as a resource for building and evaluating systems related to human-to-human (H2H) customer service interactions. One goal of NATCS is to encourage research in the under-explored space of H2H task-oriented interactions, so these are intended to serve as motivating examples rather than prescribed uses.

NATCS as Training Data
One goal of this work is to accelerate the development of dialogue systems based on H2H conversations.While most existing work in intent induction assumes that customer turns corresponding to requests have already been identified, NATCS dialogue acts provide a mechanism to map turns onto TOD constructs like intents.
To validate the usefulness of dialogue act annotations in NATCS, we compare the cross-dataset generalization of dialogue act classifiers trained on annotations in NATCS against that of SGD, a large multi-domain corpus of task-oriented dialogues, evaluating on real conversations between human agents and customers.
We fine-tune ROBERTA-BASE using per-label binary cross entropy losses to support multiple labels per sentence. Fine-tuning details are provided in Section A.3. We compare dialogue act classification performance on real data when training on SGD, NATCS, and in-domain vs. out-of-domain real data. As shown in Table 7, a DA classifier trained on NATCS performs significantly better on real data than a classifier trained on SGD. Performance still lags behind that of training on real data, but with NATCS, the gap is closed considerably. In Section A.4, we also show that the dialogue act annotations in NATCS are able to generalize to new domains.

Intent Clustering with Noise
Recent work indicates growing interest in applications that can accelerate the development of TOD systems by automatically inducing TOD constructs such as intents and slots from customer support interactions (Yu et al., 2022; Shen et al., 2021; Kumar et al., 2022; Perkins and Yang, 2019; Chatterjee and Sengupta, 2020). To further motivate NATCS as a realistic test bed for applications that learn from natural conversations, we demonstrate how it can serve as a benchmark for unsupervised intent clustering tasks.
In a realistic setting, the turns in conversations containing intents will not be provided in advance. We thus compare three settings: (1) using the first customer turn in each conversation, (2) using turns predicted as having intents by a dialogue act classifier, and (3) using turns labeled with intents (gold dialogue acts).
Utterances are encoded using a sentence embedding model from the SENTENCETRANSFORMERS library (Reimers and Gurevych, 2019), ALL-MPNET-BASE-V2. We use k-means clustering with the number of clusters set to the number of reference intents. To assign cluster labels to turns not selected for clustering, we train a logistic regression classifier using ALL-MPNET-BASE-V2 embeddings as static features on the inputs assigned cluster labels (such as first turns), then apply the classifier to the remaining turns to obtain predicted cluster labels.
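A sketch of this clustering-and-propagation pipeline, assuming the turn embeddings have already been computed (e.g., with ALL-MPNET-BASE-V2); the scikit-learn components here stand in for the exact implementation, which is not specified in full.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression


def cluster_and_propagate(clustered_emb, missing_emb, n_clusters, seed=0):
    """Cluster the embeddings of the turns selected for clustering, then
    train a logistic regression classifier on (embedding, cluster id)
    pairs and use it to assign cluster labels to the remaining turns."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    cluster_ids = km.fit_predict(clustered_emb)
    clf = LogisticRegression(max_iter=1000).fit(clustered_emb, cluster_ids)
    return cluster_ids, clf.predict(missing_emb)
```

Using the classifier as a static-feature probe keeps propagation cheap while letting every turn receive a cluster label for evaluation.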
The results, shown in Table 8, demonstrate that using automatically predicted turns leads to a drop in purity and NMI.The drop in purity is attributable to irrelevant, non-intentful turns being clustered together with relevant intents, a potentially costly error in real-world settings that is not typically reflected in intent clustering evaluation.
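Purity and NMI can be computed as follows; this is the standard formulation (NMI comes directly from scikit-learn), shown for clarity rather than as the paper's exact evaluation code.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score


def purity(true_labels, cluster_labels):
    """Fraction of points whose cluster's majority reference label
    matches their own reference label."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    correct = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        _, counts = np.unique(members, return_counts=True)
        correct += counts.max()  # majority label count in this cluster
    return correct / len(true_labels)
```

When non-intentful turns are mixed into the clustering input, they dilute the majority label of each cluster, which is exactly the purity drop reported in Table 8.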

Conclusions
We present NATCS, a corpus of realistic spoken human-to-human customer service conversations. The collection of NATCS is complexity-driven and domain-restricted, resulting in a dataset that better approximates real conversations than pre-existing task-oriented dialogue datasets along a variety of both automated and human-rated metrics. We demonstrate two potential downstream applications of NATCS, showing that training on NATCS results in better performance on real test data compared to training on other publicly-available goal-oriented datasets, and that NATCS can provide a new, challenging benchmark for realistic evaluation of intent induction.
We hope that NATCS will help facilitate open research in applications based on customer support conversations, previously accessible mainly in industry settings, by providing a more realistic annotated dataset. In future work, we hope to expand the annotations in NATCS to support more tasks such as call summarization, response selection, and generation.

Limitations
NATCS is partially annotated with dialogue acts, intents, and slots, which are annotated independently from the initial collection of the conversations. While decoupling annotation from collection was intended to facilitate natural and diverse dialogues, the methodology is more time-consuming and expensive than previous approaches that use pre-structured conversation templates to avoid the need for manual annotation. In particular, NATCS SPOKE requires multiple participants engaging in synchronous conversations, followed by independent manual transcription and annotation, making the approach particularly time-consuming and difficult to apply to large collections. Furthermore, this decoupling of annotation from collection has greater potential for annotator disagreement.
While the complexity types and annotations are mostly language-agnostic, NATCS is restricted to EN-US customer-initiated customer service conversations between a single agent and customer in a limited number of domains (multi-party conversations beyond two participants or agent-initiated conversations are not included).The annotations included are primarily intended for applications related to task-oriented dialogue systems.
Further, we note that while NATCS closes the gap to real conversations along many metrics, it still falls short along some dimensions. We find that real conversations are more verbose, more believable, and less predictable. We also note that the comparisons in our paper focused on a limited number of task-oriented dialogue datasets with different collection approaches, and did not exhaustively include all pre-existing dialogue datasets.

Ethics Statement
In this paper, we present a new partially-annotated dataset. In adherence with the ACL code of conduct and the recommendations laid out in Bender and Friedman (2018), we include a data statement. Our dataset is completely novel, and was collected specifically to support the development of natural language systems. Workers proficient in the EN-US variant of English were hired through a vendor at an hourly rate competitive with the industry standard for language consultants. For NATCS SPOKE, these workers spoke to each other and then transcribed the data. For NATCS SELF, these workers wrote the conversations.
To annotate the data, we used two pools of annotators.Both had formal training in linguistics and were proficient in the EN-US variant of English.One pool was hired through a vendor with a competitive hourly rate.The other pool consisted of full-time employees.
Curation Rationale Our dataset includes all of the data that was produced by the consultants we hired. Quality assurance was performed on a subset of this data; we hope that any concerns would have surfaced in this sample. We annotated a random subset of the full dataset.

Language Variety
The dataset is EN-US. The speakers (or writers) were all fluent speakers of EN-US. We did not target a particular sub-type of the EN-US language variety.
Speaker Demographics We do not have detailed speaker demographics; however, we do have male and female speakers from a variety of age ranges.
Annotator Demographics We do not have detailed annotator demographics; however, we do have male and female annotators from a variety of age ranges. All annotators had at least some formal linguistics training (ranging from a B.A. to a Ph.D.).
Speech Situation For NATCS SPOKE, speakers were talking in real time on the phone to one another. The conversations were semi-scripted: speakers were not told exactly what to say, but were given some constraints.

A.1 Experimental Setting Details
In this section, we provide details on the experimental settings used for evaluation of NATCS and other dialogue datasets.
Perplexity Evaluation To evaluate language modeling (LM) perplexity on each dataset, we compare both a fine-tuning setting and a zero-shot setting using GPT-NEO (Black et al., 2021) as the pre-trained LM. We fine-tune GPT-NEO on each dataset, sampling 4096 blocks of 128 tokens as training data and evaluating on held-out test splits. Fine-tuning is performed for 6 epochs with a batch size of 64 and a learning rate of 5e-5. Perplexity is computed at the level of bytes using a sliding window of 128 tokens.

Semantic Diversity Evaluation Following Casanueva et al. (2022), to compute semantic diversity for a single intent, we (1) compute intent centroids as the average of embeddings for the turns labeled with the intent, using the Sentence-BERT (Reimers and Gurevych, 2019) library with the pre-trained model ALL-MPNET-BASE-V2, then (2) find the average cosine distance between each individual turn and the resulting centroid. Finally, (3) overall semantic diversity scores (SemDiv intent) in Table 3 are computed as a frequency-weighted average over intent-level scores.
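The three steps above can be sketched as follows, operating on precomputed sentence embeddings (e.g., from ALL-MPNET-BASE-V2) so the sketch stays model-independent.

```python
import numpy as np


def semdiv(embeddings_by_intent):
    """Frequency-weighted semantic diversity.

    embeddings_by_intent: dict mapping intent name -> (n_i, d) array of
    sentence embeddings for the turns labeled with that intent.
    """
    total, weighted = 0, 0.0
    for intent, emb in embeddings_by_intent.items():
        emb = np.asarray(emb, dtype=float)
        centroid = emb.mean(axis=0)  # step (1): intent centroid
        # step (2): cosine distance of each turn embedding to the centroid
        sims = (emb @ centroid) / (
            np.linalg.norm(emb, axis=1) * np.linalg.norm(centroid) + 1e-12)
        intent_div = float(np.mean(1.0 - sims))
        # step (3): weight each intent by its turn count
        weighted += len(emb) * intent_div
        total += len(emb)
    return weighted / total if total else 0.0
```

Intents whose turns are phrased nearly identically score near zero, while varied phrasings push the score up.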
Table 9 shows the semantic diversity scores (Casanueva et al., 2022) for different intents aligned across several datasets.

Lexical Diversity of Intents As an additional indicator of intent-level diversity, we measure the frequency-weighted average of type-token ratios for utterances within each intent (TTR intent). To account for the redaction of names and numbers in real data, we perform a similar redaction step on all datasets, automatically converting names and numbers to a single PII token with regular expressions and a named entity tagger before computing TTR. As shown in Table 10, we observe similar TTR intent between NATCS and REAL, while the pre-existing synthetic datasets lag behind considerably.
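The redaction-then-TTR procedure can be sketched as follows. The regular expression is a simplified stand-in for number redaction only; the full setup also uses a named entity tagger to redact names, which we omit here.

```python
import re

PII = "PII"


def redact(text):
    """Rough stand-in for the redaction step: collapse numeric spans
    (account numbers, amounts, phone numbers) to a single PII token."""
    return re.sub(r"\d[\d,.\-]*", PII, text)


def ttr(utterances, n=1):
    """Type-token ratio over word n-grams of redacted utterances,
    computed within each utterance so n-grams never cross boundaries."""
    grams = []
    for u in utterances:
        toks = redact(u.lower()).split()
        grams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0
```

Without the redaction step, real data's unique account numbers and amounts would inflate its TTR, making the comparison to synthetic datasets unfair.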

A.2 Human Evaluation Details
Guidelines provided to human graders are shown in Figure 3. Realism measures the overall believability that the conversation could have taken place in a real scenario (penalizing unlikely, silly utterances from the customer or unprofessional behavior from the agent). Concision measures how concise the customer responses are, with lower scores for lengthy utterances containing details that are unnecessary for resolving the request. Spoken-like measures the likelihood that the conversation was originally spoken in a phone conversation (as opposed to being written in a chatroom or messaging platform). 50 of the 300 conversation snippets evaluated were graded by pairs of annotators (5 pairs total).

Table 10: TTR intent provides intent-level type-token ratios after removing names and numbers. Type-token ratios are computed over 1-grams, 2-grams, and 3-grams, for which we observe consistently higher diversity for NATCS and REAL than for MDGO and SGD.

For these, we observed Krippendorff's alpha values of 0.52, 0.59, and 0.37 for Spoken-like, Realism, and Concision, respectively.

A.3 Dialogue Act Classifier Training Details
The dialogue act classifier is implemented by fine-tuning ROBERTA-BASE (Liu et al., 2019) using per-label binary cross entropy losses to support multiple labels per sentence. To encode dialogue context, we append the three previous sentences to the current turn with a separator token ([SEP]), adding a speaker label to each sentence and using padding tokens at the beginning of the conversation. Fine-tuning is performed for 6 epochs with a batch size of 16, using AdamW with a learning rate of 2e-5. Dataset-level dialogue act classifier performance is provided in Table 11.
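The context encoding can be illustrated with a small serialization helper (a sketch only; the exact token layout and speaker-label format are assumptions, not the released implementation):

```python
def build_classifier_input(sentences, speakers, idx,
                           context=3, sep="[SEP]", pad="[PAD]"):
    """Serialize the current sentence plus its previous `context`
    sentences, each prefixed with a speaker label; positions before
    the start of the conversation are filled with padding tokens.
    """
    parts = [f"{speakers[idx]}: {sentences[idx]}"]
    for j in range(idx - 1, idx - 1 - context, -1):
        parts.append(f"{speakers[j]}: {sentences[j]}" if j >= 0 else pad)
    return f" {sep} ".join(parts)
```

For the second sentence of a conversation, two of the three context slots are filled with padding.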

A.4 Dialogue Act Cross-Domain Generalization
For models trained on NATCS to be useful in practice, they must be able to generalize beyond the limited number of domains present in NATCS. To measure cross-domain generalization, we perform cross-validation, training separate models in which we hold out a single domain (e.g., train on Banking, evaluate on Travel). As shown in Table 12, performance on the held-out domains is lower, particularly for recall, but does not drop substantially.
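The held-out-domain splits can be sketched as follows (names are ours for illustration; the training and evaluation loops are assumed to exist elsewhere):

```python
from collections import defaultdict

def leave_one_domain_out(examples):
    """Yield (heldout_domain, train_examples, eval_examples) triples,
    holding out each domain in turn (e.g., train on Banking and
    Insurance, evaluate on Travel).

    examples: list of (domain, example) pairs.
    """
    by_domain = defaultdict(list)
    for domain, example in examples:
        by_domain[domain].append(example)
    for heldout in sorted(by_domain):
        # train on every domain except the held-out one
        train = [ex for d, exs in by_domain.items() if d != heldout
                 for ex in exs]
        yield heldout, train, by_domain[heldout]
```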

A.5 Collection Methodology
Table 13 provides an example company profile used to assist in achieving cross-dataset consistency. Table 14 provides an example intent/slot schema for a ResetPassword intent in the Insurance domain.
Table 15 provides example utterances for each discourse complexity. Table 16 provides examples of spoken complexities used to simulate spoken-form conversations in the Insurance domain.
Figure 4 provides target percentages for complexities used to simulate spoken conversations in NATCS SELF .

A.6 Conversation Examples
Tables 17, 18, and 19 provide examples of conversations collected in NATCS SPOKE. Tables 20, 21, and 22 provide examples of conversations collected in NATCS SELF.

Read each conversation excerpt, then fill in values for each of the dimensions below with ratings between 1 and 5.
Realism - Is the customer making a realistic request? Is it believable that the conversation took place in a real setting? Is the customer's request or problem something that you imagine customers can encounter, or is it something incredibly unlikely or silly? Does the agent respond in a professional manner?
• 5 (realistic) means that the conversation is realistic and could have taken place in a real life setting.
• 1 (unrealistic) means that it is highly unlikely that the conversation could have taken place in a real life setting.
Concision - Does the customer provide only the necessary details for resolving their request? Does the customer avoid providing long responses or extra details that aren't 100% relevant to the conversation or for resolving their request?
• 5 (concise) means that the customer was consistently concise and clear with their responses to the agent.
• 1 (not concise) means that the customer provides multiple extra unnecessary details to the agent and provides long responses.
Spoken-like - Does the excerpt appear to be from a spoken (as opposed to written) conversation? Are there signals that indicate that the conversation was originally spoken and then transcribed (as opposed to occurring over a chatroom or messaging platform)?
• 5 (spoken) means that it is extremely likely that the conversation was originally spoken.
• 1 (written) means that it is unlikely that the conversation was originally spoken (and was probably written as a chat conversation instead).

PauseForLookup: Customer or agent asks the other party to hold or wait while they look up some information.
"Can you hold on while I find that number?" / "A: All right. Just one second."

SlotLookup: The agent already has information about the customer and only needs to verify details.
"Is johndoe@gmail.com still a good email address for you?"

Overfill: The customer provides more information than asked.
A: What's your first name? / C: It's John Doe, email is johndoe@gmail.com

SlotCorrection: The agent corrects a customer regarding a product or service offered by the company.
C: I ordered the Derek pot set / A: The Darin set, yes I see your order here.

BackgroundDetail, ChitChat:
C: Yeah. I really hope get help to find my card. We are renovating our house at the moment right now. Started redoing our walls not too long ago. It's a bunch of wallpaper, so we just need help finishing removing it. And then, my wife is gonna head off to the store to get some paint to start that project.

ChitChat:
A: Oh, that's awesome. Do you guys have like a color scheme or color palette that you're working with? ...

Figure 2: Intent and slot counts (logarithmic scale) for Retail A and NATCS Insurance. We observe that intents and slots in real data follow a Zipfian distribution in both REAL and NATCS.

Figure 3: Guidelines used in human evaluation.

Table 1: Example conversation from NATCS Banking. Conversations are annotated with dialogue act annotations such as InformIntent and ElicitIntent, intents such as SetUpOnlineBanking, and slots such as AccountNumber.
A: Thank you for calling Intellibank. What can I do for you today? [ElicitIntent]
C: Hi, I was trying to figure out how to create an account online, but I'm having some trouble.
A: I'm sorry to hear that. I'd be happy to help you create an online account.
C: Well, actually, the reason I called is to check the balance on my account.

Shown in Table 2, one surface-level distinction from publicly available TOD datasets is the average number of turns per conversation. Compared to MDGO, SGD, MWOZ, and TM1 Self, REAL has considerably longer conversations (over 70 turns ...).

Table 2: Comparison of dialogue datasets and corresponding high-level data characteristics. Task-oriented dialogue (TOD) datasets include MDGO (Peskov et al., 2019), MWOZ ...

Table 3: Diversity statistics for REAL, NATCS, MDGO, and SGD. Sem. Diversity is a sentence embedding-based metric for measuring intent-level diversity reported in Casanueva et al. (2022). Slot n-gram % indicates the ratio of unique sequences of slot annotations in conversations to the total number of such sequences.
Table 4: Perplexity of GPT-Neo 125M on datasets in both fine-tuning (PPL) and zero-shot (ZS PPL) settings.

Table 5: Percentage of turns containing NATCS dialogue acts in synthetic and real dialogue datasets. * indicates distribution estimated from automatic predictions. A lower percentage of task-oriented turns is observed in NATCS and REAL than in previous task-oriented dialogue datasets.
To assign cluster labels to all gold input turns, we use label propagation by training ...

Table 11: Comparison of dialogue act classifier performance on real datasets trained on SGD, NATCS, and in-domain (Real ID) vs. out-of-domain (Real OOD) real data. Training on NATCS achieves comparable precision to Real OOD.

Table 15: Discourse complexities with examples.

Table 16: Spoken complexities with examples.

Table 17: Example conversation from NATCS Banking. The initial customer utterance is labeled with multiple intents (ExternalWireTransfer, ReportLostStolenCard, and RequestNewCard).
C: Hi Rose, my name is Kim FIRSTNAME. I have a couple of questions for you actually I need to make a wire transfer and I also have an issue with my card. I may have lost it. So, but I'd like to to do the wire transfer first and then well maybe we can re-issue me a new card.

Table 18: Example conversation from NATCS Banking with discourse complexities (e.g. BackgroundDetail, ChitChat, and FollowUpQuestion). In this conversation, the customer provides considerable background information related to, but not necessary for, understanding their intent.
A: Thank you for calling Intellibank. My name is Izumi. Who UNSPECIFIED am I speaking with today? [ElicitSlot]
C: Hey, Izumi. My names Edward FIRSTNAME Elric LASTNAME. just calling in cuz I have a big problem today. I was at the I was at couple of stores with my wife buying supplies for a house and I think I lost my card. I can't remember when I lost it since we were at one of the stores. My wife actually used one of her cards to get the store points that they offer. don't know if it was before then or after. I think it was after because we went to go eat somewhere there. just calling in to get some help.
A: Oh, no. I'm so sorry to hear that Mr. Elric. I know that it's difficult when you lose your your bank card. but yeah. Please rest assured, I will do everything I can to make sure that your account is secure and then we'll get you a new card as well as soon as possible, OK? [ChitChat]
C: Oh, sweet. Never lost my card before. Kind of worried. So not sure what I have to do. Is there anything that I need to give you first so I can get my card?
A: Yes, Mr. Elric. So first of all, I'll just need to ask for your personal information so that I can pull up your account. Can I get your account number ACCOUNTNUMBER, please?
A: OK, perfect. Thank you so much. Let me just pull up your account. Give me one moment. [PauseForLookup]
C: Oh, OK.

Table 19: Example conversation from NATCS Banking with discourse complexities (e.g. MultiIntent and MissingInfoWorkAround). In this conversation, the customer is unable to provide the necessary information for identity verification, despite the agent offering multiple possible workarounds.
A: Hello and thank you for calling Intellibank. This is Mark speaking. How could I help you today? [ElicitIntent]
C: Hi Mark. My name is Dorothy FIRSTNAME Lee LASTNAME. I would like to check my savings TYPEOFACCOUNT account balance.
A: OK Dorothy. I can help you with that if you give me one second. Could you in the meantime give me your date of birth DATEOFBIRTH please?
C: April the fifteenth nineteen ninety-nine DATEOFBIRTH. [InformSlot]
A: OK perfect. Thank you so much. Now also could you give me your account number ACCOUNTNUMBER? [ElicitSlot]
C: Oh no I don't have it with me. Is that a problem?
A: That shouldn't be a problem. do you happen to have your credit card number CREDITCARDNUMBER on you as well?
C: I don't have that either.
A: OK well I need some form of identifying you. do you happen to have a driver's license OTHERIDNUMBER or another state ID issued number? [ConfirmSlot, ElicitSlot, MissingInfoWorkaround]
C: No sorry I I didn't bring anything with me today. Is there any other way we can do it? [InformIntent, Disfluencies]
A: unfortunately I'm gonna need some of that information to to process your request. so unfortunately because there's a lot of of theft going on I I it's could be fraud. I'm not sure that you are who you say who you are and if you can't give me that information. We use those as security checkpoints then I won't be able to complete your request for you. I apologize for that Mrs. Dorothy Lee.
C: Well I'm kind of disappointed because I always get like a terrible service customer here but OK. I want you to help me with another thing. Is that possible?

Table 20: Example conversation from NATCS SELF demonstrating primarily discourse complexities.
A: Thank you for calling Rivertown Insurance. How may I help you today? [ElicitIntent]
C: Yes. I need to do something about lowering my premiums. I recently lost my job and just can't afford the payments anymore. I don't want to change companies. I've been with you guys for years.
A: I'm sorry to hear you lost your job. Let's see what we can do.
C: Okay. Thank you.
A: May I have your first FIRSTNAME and last name LASTNAME? [ElicitSlot, MultiElicit]
C: It's Maria FIRSTNAME Sanchez LASTNAME. [InformSlot]
A: Thank you, Maria. Do you happen to have your customer number CUSTOMERID? [ConfirmSlot, ElicitSlot]
C: I think so. Let me check my purse. [PauseForLookup]
A: Okay. Take your time.
C: It's one two three four five six seven eight CUSTOMERID. It's seven twenty six nineteen eighty nine DATEOFBIRTH. [InformSlot]
A: Thank you, Maria. I have you pulled up here. Which PLANTYPE policy were you looking at reducing the payment on? Life or Auto?
C: Yes it was the highest one.
A: Okay. We do have two options with lower payments how much lower did you need to go? [ElicitSlot]
...