Robustness Testing of Language Understanding in Task-Oriented Dialog

Most language understanding models in task-oriented dialog systems are trained on a small amount of annotated data and evaluated on a small test set from the same distribution. However, these models can lead to system failure or undesirable output when exposed to natural language perturbation and variation in practice. In this paper, we conduct a comprehensive evaluation and analysis of the robustness of natural language understanding models, and introduce three important aspects related to language understanding in real-world dialog systems, namely language variety, speech characteristics, and noise perturbation. We propose a model-agnostic toolkit, LAUG, to approximate natural language perturbations for testing robustness issues in task-oriented dialog. Four data augmentation approaches covering the three aspects are assembled in LAUG, which reveals critical robustness issues in state-of-the-art models. The dataset augmented through LAUG can be used to facilitate future research on robustness testing of language understanding in task-oriented dialog.


Introduction
Recently, task-oriented dialog systems have been attracting more and more research effort (Gao et al., 2019; Zhang et al., 2020b), where understanding user utterances is a critical precursor to the success of such systems. While modern neural networks have achieved state-of-the-art results on language understanding (LU) (Wang et al., 2018; Zhao and Feng, 2018; Goo et al., 2018; Liu et al., 2019; Shah et al., 2019), their robustness to changes in the input distribution remains one of the biggest challenges in practical use.

* Equal contribution. † Corresponding author.
Real dialogs between human participants involve language phenomena that contribute little to the intent of communication. As shown in Fig. 1, user expressions can exhibit high lexical and syntactic diversity once a system is deployed to users; typed texts may differ significantly from those recognized from voice input; and interaction environments may be chaotic, with users themselves introducing irrelevant noise, so that the system can hardly obtain clean user input.
Unfortunately, neural LU models are vulnerable to these natural perturbations, which are legitimate inputs but unobserved in training data. For example, Bickmore et al. (2018) found that popular conversational assistants frequently failed to understand real health-related scenarios and were unable to deliver adequate responses on time. Although many studies have discussed LU robustness (Ray et al., 2018; Zhu et al., 2018; Iyyer et al., 2018; Yoo et al., 2019; Ren et al., 2019; Jin et al., 2020; He et al., 2020), there is still a lack of systematic study of real-life robustness issues and of corresponding benchmarks for evaluating task-oriented dialog systems.
To study real-world robustness issues, we define LU robustness from three aspects: language variety, speech characteristics, and noise perturbation. While collecting dialogs from deployed systems would capture the realistic data distribution, it is costly and not scalable, since a large number of conversational interactions with real users would be required. Therefore, we propose LAUG (Language understanding AUGmentation), an automatic method that approximates natural perturbations of existing data. LAUG is a black-box toolkit for testing LU robustness, composed of four data augmentation methods: word perturbation, text paraphrasing, speech recognition, and speech disfluency.
We instantiate LAUG on two dialog corpora, Frames (El Asri et al., 2017) and MultiWOZ (Budzianowski et al., 2018), to demonstrate the toolkit's effectiveness. Quality evaluation by annotators indicates that the utterances augmented by LAUG are reasonable and appropriate with regard to each augmentation approach's target. A number of LU models of different categories and training paradigms are tested as base models and analyzed in depth. Experiments show a sharp performance decline for most baselines on each robustness aspect. Real user evaluation further verifies that LAUG reflects real-world robustness issues well. Since our toolkit is model-agnostic and requires neither model parameters nor gradients, augmented data can easily be obtained for both training and testing to build a robust dialog system. Our contributions can be summarized as follows: (1) we systematically classify LU robustness into three aspects that occur in real-world dialog: language variety, speech characteristics, and noise perturbation; (2) we propose a general and model-agnostic toolkit, LAUG, which integrates four data augmentation methods on LU covering the three aspects; (3) we conduct an in-depth analysis of LU robustness on two dialog corpora with a variety of baselines and standardized evaluation measures; (4) quality and user evaluation results demonstrate that the augmented data are representative of real-world noisy data, and can therefore be used in future research to test LU robustness in task-oriented dialog 1 .

Robustness Type
We summarize several common, interleaved challenges in language understanding from three aspects, as shown in Fig. 1b:
Language Variety A modern dialog system in text form has to interact with a large variety of real users. User utterances can be characterized by a series of linguistic phenomena with a long tail of variations in spelling, vocabulary, and lexical/syntactic/pragmatic choice (Ray et al., 2018; Jin et al., 2020; He et al., 2020; Zhao et al., 2019; Ganhotra et al., 2020).
Speech Characteristics The dialog system can take voice input or typed text, but the two differ in many ways. For example, written language tends to be more complex and intricate, with longer sentences and many subordinate clauses, whereas spoken language can contain repetitions, incomplete sentences, self-corrections, and interruptions (Wang et al., 2020a; Park et al., 2019; Wang et al., 2020b; Honal and Schultz, 2003; Zhu et al., 2018).

1 The data, toolkit, and code are available at https://github.com/thu-coai/LAUG, and will be merged into https://github.com/thu-coai/ConvLab-2.
Noise Perturbation Most dialog systems are trained only on noise-free interactions. However, there are various noises in the real world, including background noise, channel noise, misspelling, and grammar mistakes (Xu and Sarikaya, 2014;Li and Qiu, 2020;Yoo et al., 2019;Henderson et al., 2012;Ren et al., 2019).

LAUG: Language Understanding Augmentation
This section introduces commonly observed out-of-distribution data from real-world dialog into existing corpora. We approximate natural perturbations automatically instead of collecting real data by asking users to converse with a dialog system. To achieve this, we propose LAUG, a toolkit for black-box evaluation of LU robustness. It is an ensemble of four data augmentation approaches: Word Perturbation (WP), Text Paraphrasing (TP), Speech Recognition (SR), and Speech Disfluency (SD). Note that LAUG is model-agnostic and can in principle be applied to any LU dataset. Each augmentation approach tests one or two of the proposed robustness aspects, as Table 1 shows. Intrinsic evaluation of the chosen approaches is given in Sec. 4.
Task Formulation Given the dialog context X_t = (x_{2t-2m}, ..., x_{2t-1}, x_{2t}), where each x is an utterance and m is the size of the sliding window that controls how much dialog history is used, the model should recognize y_t, the dialog act (DA) of x_{2t}. Empirically, we set m = 2 in the experiments. Let U and S denote the sets of user and system utterances, respectively; then x_{2t-2i} ∈ U and x_{2t-2i-1} ∈ S. The task in this paper is to examine whether different LU models can predict y_t correctly given a perturbed input X̃_t. The perturbation is performed only on user utterances.
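The sliding-window input above can be sketched as follows; this is a minimal illustration of the formulation, not the toolkit's actual code:

```python
def build_input(turns, t, m=2):
    """Return the window X_t = (x_{2t-2m}, ..., x_{2t}) used to predict y_t.

    turns: utterances x_1 ... x_{2T}, alternating system/user, where even
    positions (x_2, x_4, ...) are user turns and odd positions (x_1, x_3, ...)
    are system turns, matching the paper's indexing.
    """
    end = 2 * t                  # 1-based index of the current user turn x_2t
    start = max(1, end - 2 * m)  # clip at the beginning of the dialog
    return turns[start - 1:end]  # convert to 0-based slicing

turns = ["S: Hi, how can I help?",    # x_1
         "U: I need a train.",        # x_2
         "S: Where to?",              # x_3
         "U: To Cambridge, please."]  # x_4
print(build_input(turns, t=2, m=2))   # window covers the whole short dialog
```

With m = 0 only the current user utterance x_{2t} is kept, which corresponds to a context-free LU setting.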
Word Perturbation Inspired by EDA (Easy Data Augmentation) (Wei and Zou, 2019), we propose its semantically conditioned version, SC-EDA, which considers task-specific augmentation operations in LU. SC-EDA injects word-level perturbation into each utterance x and updates its corresponding semantic label y. Table 2 shows an example of SC-EDA. Original EDA randomly performs one of four operations: synonym replacement, random insertion, random swap, and random deletion 2 . Note that, to keep the label unchanged, words related to the slot values of dialog acts are not modified by these four operations. Additionally, we design slot value replacement, which changes the utterance and label at the same time to test a model's generalization to unseen entities. Some randomly picked slot values are replaced by unseen values with the same slot name drawn from the database or crawled from web sources. For example, in Table 2, "Cambridge" is replaced by "Liverpool", where both belong to the same slot name "dest" (destination).

2 See the EDA paper for details of each operation.
Synonym replacement and slot value replacement aim at increasing language variety, while random word insertion/deletion/swap test robustness to noise perturbation. From another perspective, the four operations from EDA perform an Invariance test, while slot value replacement conducts a Directional Expectation test, in the terminology of CheckList (Ribeiro et al., 2020).
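A simplified sketch of the SC-EDA idea is shown below. It applies one EDA-style operation (a random swap) only outside slot-value spans, and optionally performs slot value replacement, updating the label span accordingly. The function signature, probabilities, and span bookkeeping are illustrative assumptions, not the toolkit's API:

```python
import random

def sc_eda(tokens, slots, unseen_values, p_swap=0.1, p_value=0.5, seed=0):
    """Sketch of SC-EDA: perturb tokens while protecting slot-value spans.

    tokens:        list of words
    slots:         {slot_name: (start, end)} token spans of values (end exclusive)
    unseen_values: {slot_name: [replacement token lists]} unseen entities
    """
    rng = random.Random(seed)
    protected = {i for s, e in slots.values() for i in range(s, e)}
    free = [i for i in range(len(tokens)) if i not in protected]
    out = list(tokens)

    # Invariance test: random swap of two non-slot tokens (label unchanged).
    if len(free) >= 2 and rng.random() < p_swap:
        i, j = rng.sample(free, 2)
        out[i], out[j] = out[j], out[i]

    # Directional Expectation test: slot value replacement (label updated).
    new_slots = dict(slots)
    for name, (s, e) in slots.items():
        if name in unseen_values and rng.random() < p_value:
            repl = rng.choice(unseen_values[name])
            out[s:e] = repl
            new_slots[name] = (s, s + len(repl))
            break  # replace at most one value so other spans stay valid
    return out, new_slots
```

For example, replacing the "dest" value "cambridge" with the unseen "liverpool" yields a perturbed utterance whose label is updated in step with the text.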
Text Paraphrasing The goal of text paraphrasing is to generate a new utterance x' ≠ x while keeping its dialog act unchanged, i.e., y' = y. We apply SC-GPT, a language model finetuned to condition on dialog acts, to paraphrase sentences as data augmentation. Specifically, it characterizes the conditional probability p_θ(x|y) = ∏_{k=1}^{K} p_θ(x_k | x_{<k}, y), where x_{<k} denotes all tokens before the k-th position and K is the utterance length. The model parameters θ are trained by maximizing the log-likelihood of p_θ. We observe that co-reference and ellipsis frequently occur in user utterances. Therefore, we propose different encoding strategies during paraphrasing to further evaluate each model's capacity for context resolution. In particular, if the user mentions a certain domain for the first time in a dialog, we insert a "*" mark into the sequential dialog act y to indicate that the user tends to express the act without co-reference or ellipsis, as shown in Table 3. SC-GPT is then finetuned on the processed data so that it is aware of dialog context when generating paraphrases. As a result, we find that the average token length of generated utterances with/without "*" is 15.96/12.67, respectively, after finetuning SC-GPT on MultiWOZ.
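The "*"-marked conditioning input can be illustrated as below. The exact serialization format SC-GPT consumes is not fully specified in the text, so the token layout here is a hypothetical example; only the first-mention "*" convention comes from the paper:

```python
def serialize_da(da, first_mention_domains):
    """Flatten a dialog act into a sequential form for SC-GPT conditioning.

    da: list of (domain, intent, slot, value) tuples.
    A '*' is prepended to a domain the user mentions for the first time,
    signalling that the paraphrase should avoid co-reference and ellipsis.
    """
    parts = []
    for domain, intent, slot, value in da:
        d = "*" + domain if domain in first_mention_domains else domain
        parts.append(f"{d}-{intent}({slot}={value})")
    return " ; ".join(parts)

print(serialize_da([("train", "inform", "dest", "Cambridge")], {"train"}))
# → *train-inform(dest=Cambridge)
```

On later turns the domain is serialized without the mark, so the model learns to generate shorter, context-dependent phrasings, consistent with the reported 15.96 vs. 12.67 average token lengths.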
It should be noted that the slot values of an utterance may themselves be paraphrased by the model, resulting in a different semantic meaning y'. To prevent generating irrelevant sentences, we automatically detect the original slot values in paraphrases by fuzzy matching 3 , and replace the detected values in bad paraphrases with the original values. In addition, we filter out paraphrases that have missing or redundant information compared to the original utterance.
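A minimal version of fuzzy value detection can be sketched with the standard library's `difflib`. The matcher, the token-window strategy, and the return format are assumptions for illustration; the thresholds of 0.9 (TP) and 0.7 (SR) are the ones reported in the appendix:

```python
from difflib import SequenceMatcher

def fuzzy_find_value(utterance, value, threshold=0.9):
    """Locate a possibly distorted slot value in an utterance by fuzzy
    matching, returning (start, end, matched_span) over tokens, or None."""
    tokens = utterance.lower().split()
    v = value.lower()
    n = len(v.split())
    best, best_score = None, threshold
    # Slide a window of the same token length as the value.
    for i in range(len(tokens) - n + 1):
        span = " ".join(tokens[i:i + n])
        score = SequenceMatcher(None, span, v).ratio()
        if score >= best_score:
            best, best_score = (i, i + n, span), score
    return best

print(fuzzy_find_value("i want to go to cambrige please", "Cambridge", 0.7))
# → (5, 6, 'cambrige')
```

A detected span below the threshold returns None, which in TP would trigger replacement with the original value and in SR would lead to the label being discarded.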

Speech Recognition
We simulate the speech recognition (SR) process with a TTS-ASR pipeline (Park et al., 2019). First, we transfer a textual user utterance x to its audio form a using gTTS 4 (Oord et al., 2016), a Text-to-Speech system. The audio is then transcribed back into text x' by DeepSpeech2 (Amodei et al., 2016), an Automatic Speech Recognition (ASR) system. We directly use the released models in the DeepSpeech2 repository 5 with the original configuration, where the speech model is trained on the Baidu Internal English Dataset and the language model is trained on CommonCrawl data. Table 4 shows some typical examples of our SR augmentation. ASR sometimes misrecognizes one word as another with similar pronunciation. Liaison frequently occurs between successive words. Expressions with numbers, including times and prices, are written in numerical form but spoken differently.
Since SR may modify the slot values in the transcribed utterances, fuzzy value detection is employed here to handle similar sounds and liaison when extracting slot values to obtain a semantic label y'. However, we do not replace a noisy value with the original value, as we encourage such misrecognition in SR; thus y' ≠ y is allowed. Moreover, numerical terms are normalized to deal with spoken numbers. Most slot values can be relocated by our automatic value detection rules. The remaining slot values, which vary too much to be recognized, are discarded along with their corresponding labels.
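The numerical normalization step can be illustrated as follows: an ASR transcript renders "13:45" as "thirteen forty five", and a small rule maps it back to the numeric form used in the labels. The exact rules in the toolkit are not given, so this is a simplified assumption covering spoken times:

```python
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
         "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
         "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50}

def words_to_number(words):
    """'forty five' -> 45, 'thirteen' -> 13; None if not a number phrase."""
    total = 0
    for w in words:
        if w in TENS:
            total += TENS[w]
        elif w in UNITS:
            total += UNITS[w]
        else:
            return None
    return total

def normalize_spoken_time(tokens):
    """Rewrite e.g. ['thirteen', 'forty', 'five'] as '13:45'."""
    hour = words_to_number(tokens[:1])
    minute = words_to_number(tokens[1:]) if len(tokens) > 1 else 0
    if hour is None or minute is None:
        return None
    return f"{hour:02d}:{minute:02d}"

print(normalize_spoken_time("thirteen forty five".split()))  # → 13:45
```

After normalization, the fuzzy value detection described above can match the recovered "13:45" against the original slot value.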
Speech Disfluency Disfluency is a common feature of spoken language. We follow the categorization of disfluency in previous works (Lickley, 1995;Wang et al., 2020b): filled pauses, repeats, restarts, and repairs.
Original: I want to go to Cambridge.
Pauses: I want to um go to uh Cambridge.
Repeats: I, I want to go to, go to Cambridge.
Restarts: I just I want to go to Cambridge.
Repairs: I want to go to Liverpool, sorry I mean Cambridge.

We present some examples of SD in Table 5. Filler words ("um", "uh") are injected into the sentence to represent pauses. Repeats are inserted by repeating the previous word. To approximate the real distribution of disfluencies, the interruption points of filled pauses and repeats are predicted by a Bi-LSTM+CRF model (Zayats et al., 2016) trained on SwitchBoard (Godfrey et al., 1992), an annotated dataset collected from real human conversations. For restarts, we insert a false-start term ("I just") as a prefix of the utterance to simulate self-correction. In the LU task, we apply repairs to slot values to fool models into predicting wrong labels. We take the original slot value as the repair ("Cambridge") and another value with the same slot name as the reparandum ("Liverpool"). An edit term ("sorry, I mean") is inserted between the reparandum and the repair to construct a correction. The filler words, restart terms, and edit terms, as well as their occurrence frequencies, are all sampled from their distributions in SwitchBoard.
To keep the spans of slot values intact, each span is regarded as a single word, and no insertions are allowed inside a span. Therefore, SD augmentation does not change the original semantics or labels of the utterance, i.e., y' = y.
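The span-preserving insertion constraint can be sketched as below. The paper predicts interruption points with a Bi-LSTM+CRF trained on SwitchBoard; here a random point stands in for that predictor, so this is a structural illustration only:

```python
import random

def inject_disfluency(tokens, slot_spans, fillers=("um", "uh"), seed=0):
    """Insert one filler word at an interruption point OUTSIDE slot-value
    spans, so the utterance's semantics and labels are unchanged.

    slot_spans: list of (start, end) token spans of slot values (end exclusive),
    each treated as an atomic unit that no insertion may split.
    """
    rng = random.Random(seed)
    # Positions strictly inside a span are invalid insertion points.
    protected = {i for s, e in slot_spans for i in range(s + 1, e)}
    points = [i for i in range(len(tokens)) if i not in protected]
    i = rng.choice(points)               # stand-in for the Bi-LSTM+CRF predictor
    return tokens[:i] + [rng.choice(fillers)] + tokens[i:]
```

Removing the filler word from the output recovers the original utterance exactly, which is the invariance property that makes SD labels reusable.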

Data Preparation
In our experiments, we adopt Frames 6 (El Asri et al., 2017) and MultiWOZ (Budzianowski et al., 2018), in which the semantic labels of user utterances are annotated. In particular, MultiWOZ is one of the most challenging datasets due to its multi-domain setting and complex ontology; we conduct our experiments on the latest annotation-enhanced version, MultiWOZ 2.3 (Han et al., 2020), which provides cleaned annotations of user dialog acts (i.e., semantic labels). A dialog act consists of four parts: domain, intent, slot names, and slot values. The statistics of the two datasets are shown in Table 6. The training data are augmented by adding copies processed by all four augmentation types in equal proportion. Other setups are described in each experiment 7 . We compare the augmented utterances with their original counterparts with respect to the three robustness aspects and find that each augmentation method has a distinct effect on the data. For instance, TP rewrites the text without changing the original meaning, so lexical and syntactic representations change dramatically while most slot values remain unchanged. In contrast, SR makes the lowest rate of character- and word-level changes but modifies the most slot values, due to speech misrecognition.

Quality Evaluation
To ensure the quality of our augmented test sets, we conduct human annotation on 1,000 sampled utterances in each augmented test set of MultiWOZ. We ask annotators to check whether the augmented utterances are reasonable and whether the automatically detected value annotations are correct (two true-or-false questions). Different evaluation protocols are used according to the characteristics of each augmentation method. For TP and SD, annotators check whether the meaning of utterances and dialog acts is unchanged. For WP, changing slot values is allowed due to slot value replacement, but the slot name should remain the same. For SR, annotators are asked to judge the similarity of pronunciation rather than semantics. In summary, the uniformly high scores in Table 7 demonstrate that LAUG produces reasonable augmented examples.

Baselines
LU models roughly fall into two categories: classification-based and generation-based. Classification-based models (Hakkani-Tür et al., 2016; Goo et al., 2018) extract semantics by intent detection and slot tagging. Intent detection is commonly regarded as a multi-label classification task, and slot tagging is often treated as a sequence labeling task in BIO format (Ramshaw and Marcus, 1999), as shown in Fig. 2a. Generation-based models (Liu and Lane, 2016; Zhao and Feng, 2018) generate a dialog act containing the intent and slot values: they treat LU as a sequence-to-sequence problem and transform a dialog act into a sequential structure, as shown in Fig. 2b. Five base models of different categories are used in the experiments, as shown in Table 9. To support a multi-intent setting in classification-based models, we decouple the LU process as follows: first perform domain classification and intent detection, then concatenate two special tokens indicating the detected domain and intent (e.g., [restaurant][inform]) at the beginning of the input sequence, and finally encode the new sequence to predict slot tags. In this way, the model can address overlapping slot values when values are shared across different dialog acts.
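The decoupled input construction can be sketched in a few lines. The bracketed special-token format is illustrative (the paper writes [restaurant][inform] without specifying the tokenizer-level details):

```python
def decoupled_input(tokens, domain, intent):
    """Build the slot-tagging input of the decoupled scheme: special tokens
    for the detected domain and intent are prepended to the utterance, so a
    single tagger can disambiguate slot values shared across dialog acts.
    The prefix tokens would simply receive 'O' tags in the BIO scheme."""
    return [f"[{domain}]", f"[{intent}]"] + tokens

print(decoupled_input("i want a cheap restaurant".split(), "restaurant", "inform"))
# → ['[restaurant]', '[inform]', 'i', 'want', 'a', 'cheap', 'restaurant']
```

Running the tagger once per detected domain/intent pair lets the same value span carry different tags under different dialog acts, which a single joint pass cannot express.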

Main Results
We conduct robustness testing on all three capacities for the five base models using the four augmentation methods in LAUG. All baselines are first trained on the original datasets and then finetuned on the augmented datasets. Overall F1-measure performance on Frames and MultiWOZ is shown in Table 8. All experiments are conducted over 5 runs, and averaged results are reported.
Robustness with respect to each capacity can be measured by the performance drop on the corresponding augmented test set. All models recover some performance on the augmented test sets after being trained on the augmented data, while keeping comparable results on the original test set. This indicates the effectiveness of LAUG in improving model robustness.
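For reference, the F1 measure over dialog acts can be computed as below. Micro-averaging over predicted act tuples is a common convention for this metric; the paper does not spell out its exact formulation, so treat this as an assumption:

```python
def da_f1(preds, golds):
    """Micro-averaged F1 over dialog-act tuples.

    preds/golds: one set of (domain, intent, slot, value) tuples per utterance.
    """
    tp = fp = fn = 0
    for p, g in zip(preds, golds):
        tp += len(p & g)   # tuples predicted and present in the gold act
        fp += len(p - g)   # spurious predictions
        fn += len(g - p)   # missed gold tuples
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A model that predicts one extra spurious tuple on a single-tuple utterance scores precision 0.5, recall 1.0, and F1 2/3 under this metric.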
We observe that pre-trained models outperform non-pre-trained ones on both original and augmented test sets. Classification-based models perform better and are more robust than generation-based models. ToD-BERT, the state-of-the-art model that was further pre-trained on task-oriented dialog data, has performance comparable to BERT; with most augmentation methods, ToD-BERT shows slightly better robustness than BERT.

Since the data volume of Frames is far smaller than that of MultiWOZ, the performance improvement of pre-trained models is larger on Frames than on MultiWOZ. For the same reason, augmented training data benefits the non-pre-trained models' performance on the Ori. test set more remarkably on Frames, where data is insufficient.
Among the four augmentation methods, SR has the largest impact on model performance, and SD comes second. The dramatic performance drop on SR and SD test data indicates that robustness to speech characteristics may be the most challenging issue.
Fig. 3 shows how the performance of BERT and GPT-2 on MultiWOZ changes as the ratio of augmented training data to original data varies from 0.1 to 4.0. F1 scores on the augmented test sets increase with more augmented training data. BERT's performance on the augmented test sets improves while the augmentation ratio is below 0.5 but remains almost unchanged thereafter, whereas GPT-2 keeps improving steadily. This result illustrates the different characteristics of classification-based and generation-based models when finetuned with augmented data.

Ablation Study
Between augmentation approaches To study the influence of each augmentation approach in LAUG, we test how performance changes when one augmentation approach is removed from the construction of the augmented training data. On MultiWOZ, a large performance decline is observed on each augmented test set when the corresponding augmentation approach is removed from the training data; the resulting performance is comparable to that of a model trained without any augmented data. Only slight changes are observed when other approaches are removed. These results indicate that our four augmentation approaches are relatively orthogonal.
Within an augmentation approach Our implementations of WP and SD consist of several functional components. Original EDA consists of the four operations described in Table 2; the performance differences (Diff.) in Table 11a reflect the influence of these components. The additional function of our SC-EDA is slot value replacement; we observe an increase in performance when it is removed, especially for MILU, which implies a lack of LU robustness in detecting unseen entities. Table 11b shows the results of the ablation study on SD. Among the four types of disfluencies described in Table 5, repairs have the largest impact on model performance. Performance is also affected by pauses, but to a lesser extent. The influence of repeats and restarts is small, which indicates that neural models are robust to these two phenomena.

User Evaluation
To test whether the data automatically augmented by LAUG can reflect and alleviate practical robustness problems, we conduct a real user evaluation. We collected 240 speech utterances from real humans as follows. First, we sampled 120 combinations of DAs from the test set of MultiWOZ. Given a combination, each user was asked to speak two utterances with different expressions, in their own language habits. The audio signals were then recognized into text using DeepSpeech2, constructing a new test set of real scenarios 8 . Results on this real test set are shown in Table 12.

Table 12: User evaluation results on MultiWOZ. Ori. and Avg. have the same meaning as in Table 8, and Real is the real user evaluation set.
Performance on the real test set is substantially lower than on Ori. and Avg., indicating that real user evaluation is much more challenging. This is because multiple robustness issues may be involved in one real case, while each augmentation method in LAUG evaluates them separately. Despite this difference, model performance on the real data improves remarkably after each model is finetuned on the augmented data, verifying that LAUG effectively enhances real-world robustness. Table 13 investigates which error types the model made on the real test set, based on manually checking all the error outputs of BERT (Ori.). "Others" are error cases not caused by robustness issues, e.g., due to the model's inherent limitations. We observe that the model seriously suffers from LU robustness issues (over 70% of errors), and that almost half of the errors are due to Language Variety; this is because there are more diverse expressions in the real user evaluation than in the original data. After augmented training, the number of error cases for Speech Characteristics and Noise Perturbation decreases relatively, showing that BERT (Aug.) handles these two kinds of problems better. Note that the four percentages sum to over 100%, since 25% of the error cases involve multiple robustness issues. This again demonstrates that real user evaluation is more challenging than the original test set 9 .

Related Work
Robustness in LU has long been a challenge in task-oriented dialog. Several studies have investigated models' sensitivity to the collected data distribution, in order to prevent overfitting to the training data and improve robustness in the real world. Kang et al. (2018) collected dialogs with templates and paraphrased them with crowdsourcing to achieve high coverage and diversity in training data. Dinan et al. (2019) proposed a training schema with a human in the loop to iteratively enhance a dialog model's defense against human attacks. Ganhotra et al. (2020) manually injected natural perturbations into the dialog history to refine over-controlled data generated through crowdsourcing. All these methods require laborious human intervention. This paper aims to provide an automatic way to test LU robustness in task-oriented dialog.
Various textual adversarial attacks (Zhang et al., 2020a) have been proposed in recent years and have received increasing attention as a way to measure the robustness of a victim model. Most attack methods perform white-box attacks (Papernot et al., 2016; Ebrahimi et al., 2018) based on the model's internal structure or gradient signals. Even some black-box attack models are not purely "black-box", as they require the prediction scores (classification probabilities) of the victim model (Jin et al., 2020; Ren et al., 2019; Alzantot et al., 2018). However, all these methods address artificial perturbations and do not consider the linguistic phenomena needed to evaluate the real-life generalization of LU models.
While data augmentation is an efficient method to address data sparsity, it can also improve generalization ability and measure model robustness (Eshghi et al., 2017). Paraphrasing, which rewrites the utterances in a dialog, has been used to obtain diverse representations and thus enhance robustness (Ray et al., 2018; Zhao et al., 2019; Iyyer et al., 2018). Word-level operations (Kolomiyets et al., 2011; Li and Qiu, 2020; Wei and Zou, 2019), including replacement, insertion, and deletion, have also been proposed to increase language variety. Other studies (Shah et al., 2019; Xu and Sarikaya, 2014) address the out-of-vocabulary problem when facing unseen user expressions. Further research focuses on building spoken language understanding that is robust from audio signals beyond text transcripts (Zhu et al., 2018; Henderson et al., 2012; Huang and Chen, 2019). Simulating ASR errors (Schatzmann et al., 2007; Park et al., 2019; Wang et al., 2020a) and speaker disfluency (Wang et al., 2020b; Qader et al., 2018) are promising ways to enhance robustness to voice input when only textual data are available. While most prior work tackles LU robustness from only one perspective, we present a comprehensive study that reveals three critical issues and sheds light on a thorough robustness evaluation of LU in dialog systems.

9 See the appendix for a case study.

Conclusion and Discussion
In this paper, we present a systematic robustness evaluation of language understanding (LU) in task-oriented dialog from three aspects: language variety, speech characteristics, and noise perturbation. Accordingly, we develop four data augmentation methods to approximate these language phenomena. In-depth experiments and analyses are conducted on MultiWOZ and Frames with both classification- and generation-based LU models. The performance drop of all models on augmented test data indicates that these robustness issues are challenging and critical, while pre-trained models are relatively more robust. Ablation studies show the effect and orthogonality of each augmentation approach. We also conduct a real user evaluation and verify that our augmentation methods can reflect and help alleviate real robustness problems.
Existing and future dialog models can be evaluated for robustness with our toolkit and data, as our augmentation methods do not depend on any particular LU model. Moreover, our proposed robustness evaluation scheme is extensible: in addition to the four approaches in LAUG, more methods to evaluate LU robustness can be added in the future.

A Experimental Setup

A.1 Hyperparameters
For hyperparameters in LAUG, we set the ratio of the perturbation count to the text length to α = n/l = 0.1 in EDA. The learning rate used to finetune SC-GPT in TP is 1e-4, the number of training epochs is 5, and the beam size during inference is 5. In SR, the beam size of the language model in DeepSpeech2 is set to 50. The learning rate of the Bi-LSTM+CRF in SD is 1e-3. The threshold of fuzzy matching in automatic value detection is set to 0.9 in TP and 0.7 in SR. As for the hyperparameters of the base models, the learning rate is 1e-4 for BERT, 1e-5 for GPT-2, and 1e-3 for MILU and CopyNet. The beam size of GPT-2 and CopyNet during decoding is 5.

A.2 Real Data Collection
Among the 120 sampled DA combinations, each contains 1 to 3 DAs. Users may organize the DAs in any order, provided they describe the DAs with the correct meaning, so as to imitate diverse user expressions in real scenarios. Users are also asked to keep their intonation and expression natural; communication noise caused by users in speech and language is included during collection. The audio is recorded on users' PCs under their real environmental noise. We use the same DeepSpeech2 settings as in SR to recognize the collected audio. After automatic span detection (also the same as in SR) is applied, we conduct a human check and annotation to ensure the quality of the labels.

Table 14: Robustness of different schemes on MultiWOZ. The coupled scheme predicts dialog acts with a joint tagging scheme; the decoupled scheme first detects domains and intents, then recognizes the slot tags.

B.1 Prediction Schemes
In this section, we study the influence of training/prediction schemes on LU robustness. As described in Sec. 4.3 of the main paper, the process of classification-based LU models is decoupled into two steps to handle multiple labels: one for domain/intent classification and the other for slot tagging. Another strategy is to use the cartesian product of all components of a dialog act, which yields the joint tagging scheme presented in ConvLab. As an intuitive illustration, the slot tag of the token "Los" becomes "Train-Inform-Depart-B" in the example described in Fig. 2 of the main paper. Classification-based models can thus predict dialog acts in a single step. Table 14 shows that MILU and BERT gain from the decoupled scheme on the original test set, indicating that the decoupled scheme decreases model complexity by decomposing the output space. Interestingly, the two models are not consistent in terms of robustness: MILU with the coupled scheme behaves more robustly than its decoupled counterpart (-2.61 vs. -7.05), while BERT with the decoupled scheme outperforms its coupled version in robustness (-6.45 vs. -8.61). Meanwhile, BERT benefits from the decoupled scheme and still achieves 86.95% accuracy, whereas BERT trained with the coupled scheme appears more susceptible. In addition, both MILU and BERT recover more performance with the proposed decoupled scheme. All these results demonstrate the superiority of the decoupled scheme for classification-based LU models.

B.2 Case Study
In Table 15, we present some examples of augmented utterances in MultiWOZ. In terms of model performance, MILU, BERT, and GPT-2 perform well on the WP and TP examples, while CopyNet misses some dialog acts. For the SR utterance, only BERT obtains all the correct labels: MILU and CopyNet both fail to find the changed value spans "lester" and "thirteen forty five"; CopyNet's copy mechanism is thoroughly confused by the recognition errors and even predicts discontinuous slot values; GPT-2 successfully finds the non-numerical time but misses "lester". In the SD utterance, the repair term fools all the models. Overall, in these examples BERT performs quite well, while MILU and CopyNet expose some of their robustness defects.
Table 16 shows some examples from the real user evaluation. In case 1, the user says "seventeen o'clock", while time is always represented in numeric format (e.g., "17:00") in the dataset, which is a typical Speech Characteristics problem. Case 2 can be regarded as a Speech Characteristics or Noise Perturbation case, because "please" is wrongly recognized as "police" by the ASR model. Case 3 is an example of Language Variety: the user requests the ticket price in a way different from the dataset. MILU and BERT failed on most of these cases but fixed some errors after augmented training.