Sources of Noise in Dialogue and How to Deal with Them

Training dialogue systems often entails dealing with noisy training examples and unexpected user inputs. Despite their prevalence, there is currently no accurate survey of dialogue noise, nor a clear sense of the impact of each noise type on task performance. This paper addresses this gap by first constructing a taxonomy of noise encountered by dialogue systems. In addition, we run a series of experiments to show how different models behave when subjected to varying levels and types of noise. Our results reveal that models are quite robust to label errors commonly tackled by existing denoising algorithms, but that performance suffers from dialogue-specific noise. Driven by these observations, we design a data cleaning algorithm specialized for conversational settings and apply it as a proof-of-concept for targeted dialogue denoising.


Introduction
High-quality labeled data is a necessity for properly training deep neural networks. More data often leads to better performance, and dialogue tasks are no exception (Qian and Yu, 2019). However, in the quest for more data, practitioners increasingly rely on crowdsourcing or forms of weak supervision to meet scaling requirements. Even when acting in good faith, crowdworkers are not trained experts, which understandably leads to mistakes. This ultimately results in noisy training inputs for our conversational agents. Moreover, when dialogue systems are deployed into the real world, they must also deal with noisy user inputs. For example, a user might make an ambiguous request or mention an unknown entity. All these sources of noise eventually take their toll on model performance.
Before building noise-robust dialogue systems or denoising dialogue datasets, it would be helpful to know what types of noise exist in the first place. Then our efforts can be spent more wisely tackling the sources of noise that actually make a difference. Prior works have looked into counteracting noisy user interactions (Peng et al., 2021; Liu et al., 2021), but did not study the impact of noisy training data. Moreover, they lack analysis on how noise influences performance across different model types or conversational styles. Other works claim that dialogue agents can be easily biased by offensive language found in noisy training data (Ung et al., 2022; Dinan et al., 2020). Given such a danger, we wonder "How much toxic data actually exists in annotated dialogue data?"

To investigate these concerns, we survey a wide range of popular dialogue datasets and outline the different types of naturally occurring noise. Building on this exercise, we also study the patterns of annotation errors to determine the prevalence of each noise type and identify the most likely causes of noise. Next, we put transformer models through their paces to find out how well they handle the different types of noise documented in the previous step. In total, we test 3 model types on 7 categories of noise across 10 diverse datasets spanning 5 dialogue tasks. We discover that most models are quite robust to the label errors commonly targeted by denoising algorithms (Natarajan et al., 2013; Reed et al., 2015), but perform poorly when subjected to dialogue-specific noise. Finally, to verify we have indeed identified meaningful noise types, we apply our findings to denoise a dataset containing real dialogue noise. As a result, we are able to raise joint goal accuracy on MultiWOZ 2.0 by 42.5% in relative improvement.

Table 1: Breakdown of ten dialogue datasets used in constructing the noise taxonomy. The datasets were chosen to span a wide variety of annotation schemes, task specifications and conversation lengths. KB/Document refers to a dataset containing an external knowledge base or document to ground the conversation. (See Appendix A)

In total, our contributions are as follows: (a) Construct a comprehensive taxonomy of dialogue noise to guide future data collection efforts. (b) Measure the impact of noise on multiple tasks and neural models to aid the development of denoising algorithms. (c) Establish a strong baseline for dealing with noise by resolving dialogue-specific concerns, and verify its effectiveness in practice.

Dialogue Datasets
A data-driven taxonomy of dialogue noise was designed by manually reviewing thousands of conversations across ten diverse datasets. The datasets were chosen from non-overlapping domains to exhaustively represent all commonly considered dialogue tasks. At a high level, they are divided into six task-oriented dialogue datasets and four open-domain chit-chat datasets (Yu and Yu, 2021). In addition to dialogue style, the datasets also span a variety of data collection methodologies, which have a close connection to the types of noise produced. We also consider whether the interlocutors engage in real-time vs. non-synchronous chat. Details of each dataset can be found in Table 1 and Appendix A.
The taxonomy creation process starts by uniformly sampling 1% of conversations from each corpus, rounding up as needed to include at least 100 dialogues per dataset. Expert annotators then conducted three rounds of review per conversation to tally noise counts. The annotators also cross-referenced each other's work to merge duplicate categories and resolve disagreements. Notably, the final taxonomy purposely excludes sources of noise that occur less than 0.1% of the time. This active curation supports future denoising research by focusing attention on the most prominent sources of noise.

Sources of Noise
Through careful review of the data, we discover that dialogue systems encounter issues either from noisy training inputs during model development or from noisy user inputs during model inference.

Training Noise
Noisy training data impacts model learning, before any user interaction with the system. The sources of noise are derived from labeling errors, ontology inconsistencies or undesirable discourse attributes.

Labeling Errors
For a given dataset of (x, y) pairs, a labeling error is any occasion when the target label y is incorrect.
Class Level When noise occurs due to confusion between two classes, this is considered a class-level labeling error. This can be further sub-divided into Uniform Label Swapping or Structured Label Swapping. In the former, symmetric noise implies all classes are equally likely to be confused with any other class, whereas in the latter, certain classes are more likely to be confused with other related classes. For example, "anger" is more likely to be confused with "frustration" than "joy".
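To make the distinction concrete, the following minimal sketch (ours, not taken from any released codebase) simulates the two swapping schemes; the `confusion` distribution stands in for a noise transition matrix and its values are purely illustrative:

```python
import random

def swap_label_uniform(label: str, classes: list[str]) -> str:
    """Uniform (symmetric) swapping: every other class is equally likely."""
    return random.choice([c for c in classes if c != label])

def swap_label_structured(label: str, confusion: dict[str, dict[str, float]]) -> str:
    """Structured (asymmetric) swapping: sample a replacement from a
    per-class confusion distribution, e.g.
    confusion["anger"] = {"frustration": 0.8, "joy": 0.2}."""
    candidates, weights = zip(*confusion[label].items())
    return random.choices(candidates, weights=weights, k=1)[0]
```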
Instance Level Noise comes from the example itself due to the complexity of interpreting natural language, which is especially common within dialogues (Zhang et al., 2021). For example, annotators may carry over the dialogue act from the previous turn, even though it is no longer relevant, resulting in Over Labeling. Conversely, Under Labeling is when a label is missed. Partial Labels occur when some labels are correct, while others are not. This is common in dialogue due to the prevalence of multi-label examples, such as an utterance with two slot-values to fill. (See Figure 1)

Figure 2: Diagram of the main sources of noise that might affect training. Our taxonomy also includes inference noise, which describes noise that occurs when users interact with the dialogue agent (see Fig. 4).
Annotation Level Noise arises due to the labeler or the data collection process (Snow et al., 2004). Applying heuristics on a gazetteer to label named entities in NER produces Distant Supervision noise. Human annotators are also a source of noise, either purposely as Adversarial Actors or inadvertently through Formatting Mistakes. (See Table 2)

Ontology Inconsistency

Another source of noise comes from inconsistent formatting when constructing the ontology. The only entity types which actually contained issues are (a) Dates: tomorrow, Jan 3rd, 1/3/2022, January 3; (b) Times: 14:15, 2:15 PM, quarter past 2, 215pm; (c) Locations: NYC, New York, ny, the big apple; (d) Numbers: three, 'wife daughter & I', 3, 'Me and my two buddies'. The effect is so pronounced in certain datasets that classifying labels becomes untenable, leaving generation or copying as the only viable method of predicting slot-values.

Discourse Attributes
Dialogue agents developed for response generation often mimic the behavior found in the training examples, so one hopes they contain positive discourse attributes while avoiding negative ones. We identify six such attributes by following qualitative metrics commonly used for dialogue evaluation and through our own review of the conversations.
(1) Fluent utterances flow well, obey proper grammar, and are syntactically valid.
(2) Coherent dialogues are semantically valid and make sense, such that they are interpretable and understandable by a general audience.
(3) Consistent models do not contradict what was stated earlier in the conversation, or haphazardly change their stance on a subject.
(4) Sensible models follow common sense principles and understand basic natural laws (e.g., gravity).
(5) Polite dialogue models avoid toxic language or offensive speech. They should not exhibit overt bias towards certain groups or minorities.
(6) Natural dialogues reflect how people generally talk in real life. In addition, the speakers should not break the fourth wall by directly or indirectly referring to the data collection process.

Inference Noise
Inference noise refers to issues that occur during user interaction with the system. This aligns nicely with the concept of out-of-scope errors (Chen and Yu, 2021), which are made up of two categories: out-of-distribution cases and dialogue breakdowns.

Out-of-Distribution (OOD)
Novel queries The user asks the model to do something it was not trained to do. Example: the customer asks about frequent flyer miles, but the agent is only capable of making flight reservations.
Unseen entities Facing new entities or values not seen during training. Although difficult, we could still expect a model to understand a portion of such queries by generalizing from the context.
Domain shift The dialogue system must make predictions in a new domain (taxi vs. flight). Commonly tackled in zero-shot settings, we can expect models to occasionally generalize because there may be shared slots across domains (e.g., departure time is shared by both taxi and flight queries).

Dialogue Breakdowns

Ambiguous Meaning A query or statement that the model should be able to handle, but that causes confusion, possibly by failing to consider dialogue context. Alternatively, a co-reference issue makes it difficult to interpret what the user wanted. For example, "Yea, let's go with that one" is unclear.
Paraphrasing The text is rephrased in one of several ways: (a) Simplification: the request is simplified or shortened so much that it becomes unclear what the user wants. (b) Non Sequitur: the response is plausibly in-distribution, but does not reasonably answer the question. (c) Verbosity: the request is so verbose that the underlying intent is lost.
Text Perturbations Notable instances include (a) ASR Errors that fail to "wreck a nice beach" (recognize speech) (b) Typos and other syntax errors on the user input. This is distinct from formatting mistakes by annotators, which are errors on the target output.
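As a toy illustration of user-side text perturbation, the sketch below (our own, not part of any benchmark) randomly transposes adjacent letters; simulating realistic ASR errors would instead require a phonetic confusion model and is out of scope here:

```python
import random

def inject_typos(utterance: str, rate: float = 0.05) -> str:
    """Toy user-side perturbation: randomly transpose adjacent letters."""
    chars = list(utterance)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```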

Noise Patterns
Beyond categorization, manually reviewing 10K+ utterances also provides unique insights.
How often does noise appear? The precise answer is that the median amount of noise is 10.6%, with an average of 11.2% and a std dev. of 3.7%. However, given the approximate nature of sampling, the extra digits may not be significant. Instead, we assert the rate of noise in curated dialogue datasets is usually over 5%, rarely above 20% and typically around 10%. Since these rates are relatively low, denoising techniques aiming to combat extremely high levels of noise may be impractical.
What noise types are most common? Class-level noise is predicated on the assumption that a latent noise transition matrix stochastically switches labels from one class to another. While most existing denoising algorithms are designed to resolve class confusion (Sukhbaatar et al., 2015; Patrini et al., 2017), our analysis reveals that instance-level noise is actually much more common, showing up in nearly 10% of cases compared to just 5% for class-level errors. The prevalence of instance-level issues implies that the more likely explanation of noise is that some examples are simply more confusing than others due to the genuinely ambiguous nature of dialogue (Pavlick and Kwiatkowski, 2019; Nie et al., 2020; Ferracane et al., 2021). From an algorithmic perspective, the upshot is that developing denoising methods that target individual examples rather than class errors is likely to be most effective. Furthermore, we discovered that noise is clustered rather than evenly distributed, so filtering out or relabeling these particularly noisy instances should have an outsized impact.
Why is X source of noise missing? Building out the taxonomy shows the most likely sources of noise, but equally notable is uncovering the least likely noise types, especially those with exaggerated expectations of influence. Concretely, the threat of adversarial actors is largely overblown (Dinan et al., 2019a), as spam-like activity appears less than 2% of the time. Offensive speech is the subject of numerous dialogue studies (Khatri et al., 2018; Xu et al., 2021; Sun et al., 2022), but is practically non-existent in reality (<0.5% of cases). While hate speech may be a problem when training on raw web text (Schmidt and Wiegand, 2017), our empirical review reveals that toxic language is exceedingly rare in curated datasets. Instead, unnatural utterances generated by crowdworkers role-playing as real users occur much more often (4% of cases). (Full breakdown in Appendix D) Other types of noise occur so infrequently that they are missing from the taxonomy completely! Noteworthy examples include inconsistent names or titles within the ontology (see Appendix C), as well as improper reference texts for dialogue generation tasks. While these noise types are possible, they did not occur in practice. We intentionally exclude all such candidates from the taxonomy since the aim is not to be comprehensive, but rather to highlight where researchers should spend their efforts.
Where does noise come from? Our survey found that each data collection method has a propensity to produce certain kinds of noise. This suggests noise arises as a result of how examples are annotated, rather than other factors such as conversation length (number of utterances) or dialogue style (open-domain vs. task-oriented). For example, positive discourse attributes are most common with Post-conversation Annotation and Live Chat, which involve two human speakers engaging in real dialogue. Wizard-of-Oz datasets are less time-consuming to produce, but contain more label noise. In contrast, dialogues from Machine-to-Machine or Dialog Self-play (i.e., starting with the labels to generate the dialogue) contain fewer label errors, but also sound less natural. Separately, annotator and ontology issues can be mitigated with well-written agent guidelines and proactive crowdworker screening. Thus, practitioners should consider these noise trade-offs when collecting dialogue data.

Experiments and Results
This section explores to what degree various models and dialogue tasks are impacted by each of the seven categories of noise outlined in Section 3. To study this, a model is trained on a clean version of the dataset and on a corrupted version with either natural or injected noise. The level of corruption for all trials is held constant at 10% to allow for comparison across noise types. Four datasets are selected to measure the effect of each noise type, where each group of datasets always includes MultiWOZ 2.3 to aid comparison. Intuitively, sources of noise that induce a larger gap between models trained on clean versus corrupted data are more significant, and consequently deserve more attention as targets to denoise.

Task Setup
All trials are conducted with GPT2-medium as a base model (Radford et al., 2019). The chosen tasks are:
(1) Conversation Level Classification (CLC): choose from a finite list of labels for each conversation.
(2) Turn Level Classification (TLC): make a prediction for each turn that contains a label.
(3) Dialogue State Tracking (DST): predict the overall dialogue state, which may contain multiple slot-values or no new slot-values at all. Individual values may come from a fixed or non-enumerable ontology.
(4) Response Generation (RG): produce the agent response given the dialogue context so far.
(5) Information Retrieval (IR): find and rank the appropriate information from an external data source, such as a knowledge base (KB) or separate document.
Metrics were chosen to adhere to the evaluation procedure introduced with the original dataset or from related follow-up work.

Noise Injection
For each noise category, we start by independently sampling 10% of the data, adding the corresponding noise and training a model to convergence. For example, consider instance-level label errors applied to MultiWOZ. This dataset contains 113,556 total utterances, so 11,356 of them are selected for corruption. Next, one of the three sub-categories of instance noise is chosen uniformly at random. Over-labeling occurs when a label that has recently appeared in previous turns is no longer valid. To match this behavior, we keep a running tally of recent labels and occasionally insert an extra one from this pool into the training example. Partial-labeling is achieved by replacing a label with a randomly selected one from the recent pool, and under-labeling is achieved by simply dropping a label from the example. Finally, a model is trained with the noisy data applying the same hyper-parameters as the ones used for training the standard, original model. This process is repeated for every other noise type. Due to space constraints, details on how other noise is injected can be found in Appendix F.
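For concreteness, the following sketch shows one way the three instance-level corruptions could be implemented; the function and the `recent_pool` argument are illustrative simplifications rather than our exact injection code:

```python
import random

def corrupt_instance_labels(labels: list[str], recent_pool: list[str]) -> list[str]:
    """Apply one of the three instance-level corruptions described above.
    `recent_pool` holds labels seen in recent turns of the same dialogue."""
    labels = list(labels)
    kind = random.choice(["over", "under", "partial"])
    if kind == "over" and recent_pool:
        # Over-labeling: insert a stale label carried over from earlier turns.
        labels.append(random.choice(recent_pool))
    elif kind == "under" and labels:
        # Under-labeling: drop a label the annotator "missed".
        labels.pop(random.randrange(len(labels)))
    elif kind == "partial" and labels and recent_pool:
        # Partial labeling: one label is replaced, the rest stay correct.
        labels[random.randrange(len(labels))] = random.choice(recent_pool)
    return labels
```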

Main Results
Denoising methods targeting class-level noise may have minimal impact, since it turns out such label errors are not all that damaging, with just a 0.89% drop in performance. On the other hand, annotator noise is quite harmful, causing a 9.7% drop, and should be mitigated whenever possible. Luckily, our manual review showed that spamming behavior occurs infrequently in reality. Negative discourse attributes can also cause major harm, leading to an 8.4% gap. Ontology issues are not only quite common, but also have a meaningful impact on performance, causing a 7.9% drop. Moving on to inference noise, dialogue breakdowns cause noticeable degradation, but the impact of OOD is the most prominent among all noise types. Neural networks are powerful enough to learn from any training signal, even completely random noise (Zhang et al., 2017). However, OOD cases are by definition areas the network has not seen, leading to poor performance.

Table 3: Performance across various datasets when injected with 10% noise. Scores in parentheses are the percent improvement when compared to the clean version of the data. Datasets 2-4 contain a superscript representing the dataset name as described in Table 1. Please see Table 5 for the exact task and dataset mapping for each item.

Task Breakdown
In order to study tasks across noise types, we look at the percentage change between models, rather than the absolute difference. Furthermore, to minimize the influence of outliers, we emphasize the median change, rather than the average. The results in Table 4b show that RG and IR observe the largest drops when noise is added. Somewhat surprisingly, CLC has a larger performance shift than TLC despite being an easier task. We hypothesize this is because CLC examples only occur once per conversation, whereas TLC examples occur at every turn, leaving CLC with an order of magnitude less data.
Successfully training in the presence of noisy data thus depends both on the rate of noise and on having a minimum number of clean examples.

Model Robustness
Prior work has suggested that models behave differently when faced with distinct types of noise (Belinkov and Bisk, 2018). In addition to GPT2-medium (345M parameters), we also consider a masked language model, RoBERTa-large (355M parameters) (Liu et al., 2019), and a sequence-to-sequence model, BART-large (406M parameters) (Lewis et al., 2020). These are selected because they have a comparable number of parameters. Based on the results in Table 4a, RoBERTa is the weakest performer of the group. We hypothesize this is because many dialogue tasks are generation based, whereas BERT-based models typically perform well on classification. Conversely, BART deals quite well with noise and should be a reasonable starting point for any dialogue project.

Amount of Noise
We

Dialogue Denoising
Informed by our understanding of the sources of dialogue noise, we now design a preliminary denoising algorithm for learning in the presence of noisy labels. We select MultiWOZ to serve as our testbed not only because it is one of the most popular dialogue datasets, but also because it is representative of how noise affects most datasets in general (see Figure 6). While our method produces promising results, our aim is not to declare the noise issue solved, but rather to showcase the noticeable benefit of cleaning the data and thereby motivate others to pursue this line of work.

Algorithm
Our general process for denoising dialogue data is to identify the most prominent sources of noise and resolve them accordingly. Based on our review in Section 3, MultiWOZ 2.0 is most plagued by ontology inconsistencies, instance-level label errors and out-of-distribution issues. (a) To clean up the ontology, we drop values that do not conform to the correct format for their given slots, and remove the associated examples from training. (b) To deal with possible label errors, we filter out individual instances where the predicted label from a pre-trained model disagrees with the annotator label (Cuendet et al., 2007). (c) Lastly, we augment the training data. To do so, we begin by pseudo-labeling the datapoints that were stripped out in the first two steps. However, the pre-trained model's predictions are unlikely to all be correct, so rather than keep every new label, we only keep the examples where the probability of the max value crosses the 0.5 threshold. Then, since neural networks are often over-confident, we perform calibration with temperature scaling using a λ parameter (Guo et al., 2017). Next, pseudo-labeling with the same model that is used to perform filtering causes errors to propagate, which hinders performance gains. As a result, inspired by co-teaching (Han et al., 2018), we instead use a different model to force divergence of model parameters and avoid the existing biases. In more detail, we rely on a BART-base model rather than the original GPT2-medium. This works even though BART-base has far fewer parameters. Lastly, we combine all three parts together to form the final dataset used for training our strongest model.
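A minimal sketch of the confidence-thresholding and temperature-scaling portion of step (c) appears below; the function name and tensor layout are our own illustration under stated assumptions, not the exact implementation:

```python
import torch

def calibrated_pseudo_labels(logits: torch.Tensor,
                             temperature: float = 1.5,
                             threshold: float = 0.5):
    """Temperature-scale the auxiliary model's logits (Guo et al., 2017),
    then keep only pseudo-labels whose max probability clears the
    threshold. Assumes `logits` has shape (num_examples, num_classes)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    confidence, predictions = probs.max(dim=-1)
    keep = confidence > threshold
    kept_indices = keep.nonzero(as_tuple=True)[0]
    return kept_indices, predictions[keep]
```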

Denoising Results
We once again evaluate with MultiWOZ 2.4, since this is the cleanest version of the test data. As seen in Figure 3, we are able to outperform the MultiWOZ 2.0 baseline (39.8) by 16.9 points in absolute accuracy and 42.5% in relative terms. Ontology Clean (43.2), Filter Disagree (53.7) and Co-teaching (46.7) all show marked improvement over the original baseline, but Combined (58.6) does the best overall, reaching a score that even surpasses MultiWOZ 2.1 (56.5).
Despite the great results, many avenues still exist for improvement. For example, a model trained on NLI can be used to screen out inconsistent discourse examples from the training data (Welleck et al., 2019). In addition, methods such as core-set selection (Mirzasoleiman et al., 2020) or the Shapley algorithm (Liang et al., 2021) can be used to identify important datapoints and thereby filter out noisy ones. Overall, our curated noise taxonomy and proposed methods establish promising groundwork for building denoising techniques to deal with real, rather than imagined, dialogue noise.

Related Works
Our work is related to efforts to categorize noise within speech and dialog. Clark (1996) proposed a theory of miscommunication consisting of four levels, channel, signal, intention and conversation, where each level serves as a potential vector for noise. Others have also studied noise in spoken dialogue systems, where they found that the main culprit stems from errors in speech transcription (Paek, 2003; Bohus, 2007). Rather than a high-level framework of general communication, our hierarchical taxonomy focuses on understanding the multiple layers of noise found in written text.
More recent works on dialogue noise discuss robustness to noisy user inputs, whereas we expand this view to also analyze noisy training inputs. Peng et al. (2021) introduce RADDLE as a platform which covers OOD due to paraphrasing, verbosity, simplification, and unseen entities, as well as general typos and speech errors. Liu et al. (2021) likewise focus on robustness to noisy user inputs.

Most prior works exploring learning with noisy labels were originally developed for the computer vision domain (Smyth et al., 1994; Mnih and Hinton, 2012; Sukhbaatar et al., 2015). Some methods model the noise within a dataset in order to remove it, often through the use of a noise transition matrix (Dawid and Skene, 1979; Goldberger and Ben-Reuven, 2017). Others have designed noise-insensitive training schemes by modifying the loss function (van Rooyen et al., 2015; Ghosh et al., 2017; Patrini et al., 2017), while a final set of options manipulate noisy examples by either reweighting or relabeling them (Reed et al., 2015; Jiang et al., 2018; Li et al., 2020). While denoising work certainly exists for NLP (Snow et al., 2008; Raykar et al., 2009; Wang et al., 2019), none of it specifically touches upon the dialogue scenario.

Conclusion
This paper categorizes the different sources of noise found in dialogue data and studies how models react to them. We find that dialogue noise is divided into issues that occur during training and during inference. We also find that conversations pose unique challenges not found in other NLP corpora, such as discourse naturalness and dialogue breakdowns. Our study further reveals that the most common sources of noise are actually based on the ambiguity of individual instances, rather than systematic noise across classes or adversarial annotators. Despite being surprisingly resilient, dialogue models nonetheless experience a notable drop in performance when exposed to high levels of noise. To combat this, we design a proof-of-concept denoising algorithm to serve as a strong foundation for other researchers. We hope our survey informs the collection of cleaner dialogue datasets and the development of advanced denoising algorithms targeting the true sources of dialogue noise.
In terms of the noise taxonomy, the main limitation is that we only consider natural language text within dialogue. It could be useful to conduct a detailed breakdown of speech noise or multi-modal noise that occurs when collecting conversations grounded by images. Furthermore, we could only make a best effort attempt at covering dialogue noise by reviewing ten datasets. Doubling the number of datasets reviewed or sampling more than 1% of the data would certainly lead to more precise error rates and may also lead to greater insights.
The main limitation of our proposed denoising method is that it has only been applied to the MultiWOZ dataset. Although we have strong reason to believe in its generalizability to other settings, this has not been proven. Furthermore, we would have liked to also test out exactly how our method works on MultiWOZ 2.1 and 2.2. However, the differing data formats make this task non-trivial.

A Dataset Descriptions
In no particular order, the datasets we study are:

B Label Error Details
Class Level Examples are labeled incorrectly due to confusion with another class.
• Uniform Label Swapping: symmetric noise where all classes have equal likelihood to be confused with any other class. The assumption is that noise is injected through a randomly initialized noise transition matrix.
• Structured Label Swapping: asymmetric noise where certain classes are more likely to be confused with other related classes. For example, a cheetah is more likely to be confused with a leopard than a refrigerator when performing image recognition. Alternatively, dogs and wolves are likely to be confused for each other much more often than with horses, since those animals are similar to each other.
Instance Level Noise comes from the example itself due to the complexity of interpreting natural language. This is the realization that even when annotators act in good faith, mistakes are still made since the instances themselves are difficult to label. Here, errors must be determined on a case-by-case basis.
• Over Labeling: the annotator added a label that should be removed since it is unnecessary. Example: carrying over a slot-value from the previous turn to the current dialogue state when it is not warranted because the user changed their mind.
• Under Labeling: the annotator missed a label that most people would include. Example: failing to notice a newly mentioned criterion in the dialogue state. This also includes cases where a better label could have been used, but the option is missing from the ontology and consequently prevents the example from being properly labeled.
• Partial Labeling: part of the label is correct, but other parts are not. For multi-intent utterances, the annotator may have captured one intent, but not the other. For slot-filling tasks, the annotator may have selected the appropriate value, but assigned it to the wrong slot.
Annotation Level Noise arises due to the labeler or the data collection process (Snow et al., 2004).
• Distant Supervision: the noise results from the fact that the label is not from a human, but rather weakly labeled through distant supervision (Sun et al., 2017). For example, using a gazetteer for labeling named entities in NER. As another example, using SQL results to train a semantic parser, rather than an annotated SQL query.
• Adversarial Actors: meant to mimic spammers, this is characterized by repeating patterns or irrational behavior. For example, the annotator selects the "greeting" dialogue act as the label for every single utterance regardless of the underlying text (Raykar et al., 2009; Hovy et al., 2013; Khetan et al., 2018). Other examples include bad actors on social media who provoke chatbots into producing unsafe content, or labelers who mark every review as possessing positive sentiment without actually reading the review.
• Formatting Mistakes: caused by non-experts making human mistakes, which are independent of the dialogue context. For example, typos or off-by-one errors, such as when the labeler failed to highlight the entire phrase during span selection. (See Table 2)

C Ontology Inconsistency Details
Another source of noise comes from inconsistent formatting when constructing the ontology. More specifically, the creators of the dataset did not set a canonical format for each type of slot being tracked. While we can imagine many other slot-types causing issues, the types of errors which actually occurred in practice include:
• Dates: tomorrow, Jan 3rd, 1/3/2022, Monday, January 3, mon
• Times: 14:15, 2:15 PM, quarter past 2, 215pm
• Locations: NYC, New York, ny, the big apple
• Numbers: three, 'wife daughter & I', 3, 'Me and my two buddies'
For some datasets, the variety in slot-values was so wide that it becomes impractical to pose slot-filling as a classification task. Instead, a model must perform generation or otherwise copy from the input in order to properly match the slot-value given in the label. To minimize the amount of noise from ontology inconsistency, a recommendation is to declare the allowable slot-values upfront before data collection begins, and to canonicalize values at annotation time (see the sketch below).
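As a sketch of the canonicalization recommendation above (using a time slot as an example; the accepted formats are assumptions, not the paper's actual implementation), values that fail to normalize would be flagged and their examples removed:

```python
from datetime import datetime

def normalize_time(value: str) -> str | None:
    """Map surface forms like '2:15 PM' or '14:15' to canonical HH:MM.
    Returning None flags the value (and its example) for removal."""
    for fmt in ("%H:%M", "%I:%M %p", "%I %p"):
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%H:%M")
        except ValueError:
            continue
    return None  # e.g. 'quarter past 2' cannot be parsed

assert normalize_time("2:15 PM") == "14:15"
assert normalize_time("quarter past 2") is None
```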

D Results Breakdown
2. Non Sequitur - the response is plausibly in-distribution, but does not reasonably answer the question.

Agent: What part of town would you like to eat?
User: I would like Italian food.
Note that the user's response is still in distribution since it could have been a reasonable answer to "What cuisine do you prefer?". However, in this instance, this type of response is very noisy because it fails to answer the agent's question.
3. Verbosity - the request contains extra words or entities, which make it unclear exactly what the answer should be.
Agent: What part of town would you like to eat?
User: I prefer food in the East, but I live in the South right now.
In this case, the user's response is not necessarily long, but it is verbose enough to make it unclear whether the user wants food in the east side of town or the south side of town.
True paraphrasing noise should alter the text without altering the user's underlying intent. If the text has changed so much that the user's intent has also shifted, then it should be considered adversarial behavior beyond the scope of typical dialogue noise.

Agent: What part of town would you like to eat?
User: The Northern Lights are beautiful this time of year.
The example above displays positive sentiment, but the user has completely ignored the agent's request. This case borders on being incoherent and fails to move the dialogue forward.

F Noise Injection Methods
Class-level Label Errors We create a noise transition matrix to mimic structured confusion. Specifically, given a certain class label, we want to determine what is likely to be confused with it so we can substitute the current label for that other class. To fill the noise transition matrix, we embed all class labels into bag-of-words GloVe embeddings and measure their similarity to other classes by cosine distance. Then, for 10% of examples, we sample an incorrect label given the original class according to the likelihood in the transition matrix.

Table 5: Parentheses also include the target of the task. For example, 'CLC on topics' means that the task is to classify the associated topic label at a conversation level, while 'TLC on intents' means that the task is to classify the intent of each user turn.
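As a sketch of the transition-matrix construction described above (assuming `label_vecs` stacks the averaged GloVe vectors of each class name; the function names and the non-negativity clipping are our own simplifications):

```python
import numpy as np

def build_transition_matrix(label_vecs: np.ndarray) -> np.ndarray:
    """Row i gives the probability of confusing class i with each other
    class, proportional to the cosine similarity of the bag-of-words
    GloVe vectors of the class names."""
    norms = np.linalg.norm(label_vecs, axis=1, keepdims=True)
    sim = (label_vecs @ label_vecs.T) / (norms @ norms.T)
    np.fill_diagonal(sim, 0.0)     # a class is never "confused" with itself
    sim = np.clip(sim, 0.0, None)  # ignore negative similarities
    return sim / sim.sum(axis=1, keepdims=True)

def sample_noisy_label(true_class: int, T: np.ndarray) -> int:
    """Draw a substitute label, applied to 10% of examples at injection time."""
    return int(np.random.choice(len(T), p=T[true_class]))
```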
Instance-level Label Errors Over-labeling occurs when a label that has recently appeared in previous turns is no longer valid. To match this behavior, we keep a running tally of recent labels and occasionally insert an extra one from this pool into the example. Partial-labeling is achieved by replacing a label with a randomly selected one from the recent pool, and under-labeling is achieved by simply dropping a random label from the example.

Annotator-level Label Errors
We mimic spammers who apply preset answers to every occasion without considering the actual dialogue. For the classification tasks, we assume a spammer randomly picks from one of the three most common labels for that task as the noisy target label. For response generation tasks, we assume a spammer randomly responds with one of three generic phrases.
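A sketch of this spammer simulation follows; the generic phrases shown are placeholders, not the exact canned responses used in our experiments:

```python
import collections
import random

def spammer_label(all_labels: list[str]) -> str:
    """A simulated spammer ignores the input entirely and picks one of
    the three most frequent labels for the task."""
    top3 = [lbl for lbl, _ in collections.Counter(all_labels).most_common(3)]
    return random.choice(top3)

# For generation tasks, the spammer instead emits a canned reply.
# (Placeholder phrases; not the exact ones used in our experiments.)
GENERIC_REPLIES = ["Sounds good.", "Sure, no problem.", "Thank you!"]

def spammer_response() -> str:
    return random.choice(GENERIC_REPLIES)
```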

Undesirable Discourse Attributes
We replace a subset of the utterances with noisy versions 10% of the time.
Incoherent utterances are randomly selected sentences from other dialogues within the dataset. Disfluent utterances are generated by shuffling the tokens within the current utterance. Unnatural utterances are generated by selecting from a list of awkward sentences referencing the task.
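For example, the disfluent and incoherent corruptions amount to a few lines each; this sketch assumes whitespace tokenization and is illustrative rather than our exact code:

```python
import random

def make_disfluent(utterance: str) -> str:
    """Disfluent variant: shuffle the tokens of the current utterance."""
    tokens = utterance.split()
    random.shuffle(tokens)
    return " ".join(tokens)

def make_incoherent(other_dialogues: list[list[str]]) -> str:
    """Incoherent variant: a sentence lifted from a random other dialogue."""
    return random.choice(random.choice(other_dialogues))
```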

G MultiWoz 2.0 Noise Analysis
MultiWOZ was one of the largest, most well-designed datasets upon release and still remains popular as a standard dialogue benchmark. Despite these strengths, it is also undeniable that the original dataset contained a noticeable amount of errors, which prompted the release of MultiWOZ 2.1, 2.2, 2.3 and 2.4. Some ideas for improving the data collection quality:
1. Annotator quality filter: Screening for annotators who reach a minimum accept rate, together with language filters, is a widely used best practice. Additionally, the use of qualifications (i.e., quals) on Amazon Mechanical Turk and follow-up agent training are common for maintaining high-quality annotators. Following modern best practices will greatly improve data quality.
2. Ontology created pre-conversation: Defining the allowed entities in the ontology up front and enforcing this (by having checks upon submission) can be quite impactful. Alternatively, cleaning the ontology after the fact and then removing the related entries can also be quite helpful.
3. Larger dataset size: Most datasets could always use more examples to increase the diversity and coverage of the solution space and thereby limit OOD errors. MultiWOZ was the largest dialogue dataset when it was released, but we have since seen that the ability of dialogue systems continues to improve as ever larger corpora are fed in for training.
Analyzing how specific sources of noise impact MultiWOZ, Figure 6 shows that MultiWOZ is most heavily impacted by OOD and annotation-level issues. Luckily, we found annotation-level issues to be relatively rare (~3% of conversations) compared to instance-level (~34%) and class-level (~1%) labeling errors. Each type of noise is injected into the MultiWOZ 2.3 dataset at a 10% noise rate. As usual, evaluation for all models is conducted on the MultiWOZ 2.4 test set.

H Noise as Uncertainty
An interesting way to view the impact of noise is through the lens of Bayesian uncertainty. In particular, aleatoric and epistemic uncertainty can be seen as caused by different types of noise. Kendall and Gal (2017) describe aleatoric uncertainty as uncertainty which "captures noise inherent in the observations." In contrast, "epistemic uncertainty accounts for uncertainty in the model parameters which can be explained away given enough data." Roughly speaking, labeling errors cause epistemic uncertainty since these errors produce uncertainty in the model parameters. If given enough clean data to train a model, the issues caused by the noisy labels should largely be erased. In other words, epistemic uncertainty describes what the model does not know because the training data was not appropriate, so by resolving the labeling errors, the training data becomes appropriate and the dialogue system can be trained successfully.

Figure 6: Impact of the different noise types on the MultiWOZ 2.3 dataset. DST is dialogue state tracking, RG is response generation and TLC is turn level classification.
On the other hand, ontology inconsistencies cause aleatoric uncertainty since they can lead to situations where it is impossible to fix the problem by altering the training data alone. Suppose we want the dialogue model to predict the desired time for a restaurant reservation (such as 11 AM, 6 PM or 8 PM), but options such as 'Sunday' or 'afternoon' keep appearing, which are never correct. This would make it harder for a classifier to choose the correct time. In the degenerate case, suppose the ontology only consisted of days of the week, such as 'Monday', 'Wednesday' or 'Friday', so that the classifier could only choose from seven incorrect options. In this case, adding any amount of extra data (even data labeled in the correct format) would do nothing to resolve the issue since the problem itself has been modeled incorrectly. Accordingly, a model developer should focus on eliminating certain types of noise based on the type of uncertainty they are seeing in their dialogue system. If the model is consistently making a handful of random mistakes, then relabeling some data or collecting new data may resolve the issue. Alternatively, if the model is making systematic errors, then looking into the ontology or data collection procedure might be a better route.

I Experiment Hyper-parameters
Learning rates were tested among [1e-5, 3e-5, 1e-4, 3e-4]. Batch sizes were held constant at 72 examples per batch. Early stopping was employed when a model failed to improve on the development set for 5 epochs in a row. The temperature parameter for calibrating model confidence was tested in the range of λ = [1.3, 1.5, 1.7, 1.9]. NLTK is used for calculation of BLEU score.
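For reference, the search space above can be summarized as follows (a restatement of the text, not a config file from any repository):

```python
# Hypothetical summary of the hyper-parameter search described above.
SEARCH_SPACE = {
    "learning_rate": [1e-5, 3e-5, 1e-4, 3e-4],
    "batch_size": [72],
    "early_stop_patience": [5],                    # epochs without dev improvement
    "calibration_temperature": [1.3, 1.5, 1.7, 1.9],  # the λ parameter
}
```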

J Additional Noise Examples
Two examples were chosen for each of the ten datasets, giving a total of 20 examples. Since four examples can be found in the main portion of the paper, this section contains the remaining 16 examples. The examples were carefully selected to give good coverage of the different types of noise that occurred frequently within the data.

Noise: Labeling Error → Instance level → Over-labeling
Customer: wondering how that effects international shipping
Customer: If it'll still be free coming from out of the country
Agent: Sure, I'll be happy to look into that for you.
Action: Searching the faq pages ... [FAQ Search]
Action: Membership level of 'gold' has been noted. [Membership Level]