Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation

There is a growing interest in developing goal-oriented dialog systems which serve users in accomplishing complex tasks through multi-turn conversations. Although many methods are devised to evaluate and improve the performance of individual dialog components, there is a lack of comprehensive empirical study on how different components contribute to the overall performance of a dialog system. In this paper, we perform a system-wise evaluation and present an empirical analysis on different types of dialog systems which are composed of different modules in different settings. Our results show that (1) a pipeline dialog system trained using fine-grained supervision signals at different component levels often obtains better performance than the systems that use joint or end-to-end models trained on coarse-grained labels, (2) component-wise, single-turn evaluation results are not always consistent with the overall performance of a dialog system, and (3) despite the discrepancy between simulators and human users, simulated evaluation is still a valid alternative to the costly human evaluation especially in the early stage of development.


Introduction
Many approaches and architectures have been proposed to develop goal-oriented dialog systems to help users accomplish various tasks (Gao et al., 2019a; Zhang et al., 2020b). Unlike open-domain dialog systems, which are designed to mimic human conversations rather than complete specific tasks and are often implemented as end-to-end systems, a goal-oriented dialog system has access to an external database on which to inquire about information to accomplish tasks for users. Goal-oriented dialog systems can be grouped into three classes based on their architectures, as illustrated in Fig. 1.

Figure 1: Different architectures of goal-oriented dialog systems. A system can be constructed as a pipeline or end-to-end system with different granularity.
The first class is the pipeline (or modular) systems, which typically consist of four components: Natural Language Understanding (NLU) (Goo et al., 2018; Pentyala et al., 2019), Dialog State Tracker (DST) (Xie et al., 2015; Lee and Stent, 2016), Dialog Policy (Takanobu et al., 2019), and Natural Language Generation (NLG) (Wen et al., 2015; Balakrishnan et al., 2019). The second class is the end-to-end (or unitary) systems (Williams et al., 2017; Dhingra et al., 2017; Liu et al., 2018; Lei et al., 2018; Mehri et al., 2019), which use a machine-learned neural model to generate a system response directly from a dialog history. The third class lies in between the above two types, where systems use joint models that combine some (but not all) of the four dialog components. For example, a joint word-level DST model combines NLU and DST (Zhong et al., 2018; Wu et al., 2019), and a joint word-level policy model combines dialog policy and NLG (Chen et al., 2019; Zhao et al., 2019; Budzianowski and Vulić, 2019).
It is particularly challenging to properly evaluate and compare the overall performance of goal-oriented dialog systems due to the wide variety of system configurations and evaluation settings. Numerous approaches have been proposed to tackle different components in pipeline systems, yet these modules are usually evaluated separately. Most studies only compare the proposed models with baselines of the same module, assuming that a set of good modules can always be assembled to build a good dialog system, but rarely evaluate the overall performance of a dialog system from the system perspective. A dialog system can be constructed via different combinations of these modules, but few studies have investigated the overall performance of different combinations (Kim et al., 2019). Although end-to-end systems are evaluated in a system-wise manner, none of these systems is compared with its pipeline counterpart. Furthermore, unlike the component-wise assessment, system-wise evaluation requires simulated users or human users to interact with the system under evaluation via multi-turn conversations to complete tasks. To this end, we conduct both simulated and human evaluations on dialog systems with a wide variety of configurations and settings using a standardized dialog system platform, ConvLab (Lee et al., 2019b), on the MultiWOZ corpus. Our work attempts to shed light on evaluating and comparing goal-oriented dialog systems by conducting a system-wise evaluation and a detailed empirical analysis. Specifically, we strive to answer the following research questions: (RQ1) Which configurations lead to better goal-oriented dialog systems? (§3.1); (RQ2) Are component-wise, single-turn metrics consistent with system-wise, multi-turn metrics? (§3.2); (RQ3) How does the performance vary when a system is evaluated using tasks of different complexities, e.g., from single-domain to multi-domain tasks? (§3.3); (RQ4) Does simulated evaluation correlate well with human evaluation? (§3.4).
Our results show that (1) pipeline systems trained using fine-grained supervision signals at different component levels often achieve better overall performance than the joint models and end-to-end systems, (2) the results of component-wise, single-turn evaluation are not always consistent with those of system-wise, multi-turn evaluation, (3) as expected, the performance of dialog systems of all three types drops significantly with the increase of task complexity, and (4) despite the discrepancy between simulators and human users, simulated evaluation correlates moderately with human evaluation, indicating that simulated evaluation is still a valid alternative to the costly human evaluation, especially in the early stage of development.
Experimental Setting

Data
In order to conduct a system-wise evaluation and an in-depth empirical analysis of various dialog systems, we adopt the MultiWOZ corpus in this paper. It is a multi-domain, multi-intent task-oriented dialog corpus that contains 3,406 single-domain dialogs and 7,032 multi-domain dialogs, with 13.18 tokens per turn and 13.68 turns per dialog on average. The dialog states and system dialog acts are fully annotated. The corpus also provides the domain ontology that defines all the entities and attributes in the external databases. We also use the augmented annotation of user dialog acts from Lee et al. (2019b).

User Goal
During evaluation, a dialog system interacts with a simulated or human user to accomplish a task according to a pre-defined user goal. A user goal is the description of the state that a user wants to reach in a conversation, containing indicated constraints (e.g., a restaurant serving Japanese food in the center of the city) and requested information (e.g., the address, phone number of a restaurant).
A user goal is initialized to launch the dialog session during evaluation. To ensure a fair comparison, we apply a fixed set of 1,000 user goals for both simulated and human evaluation. In the goal sampling process, we first obtain the frequency of each slot in the dataset and then sample a user goal from the slot distribution. We also apply additional rules to remove inappropriate combinations, e.g., a user cannot inform and inquire about the arrival time of a train in the same session. In the case where no matching database entry exists based on the sampled goal, we resample a new user goal until there is an entity in the database that satisfies the new constraints. In evaluation, the user first communicates with the system based on the initial constraints, and then can change the constraints if the system informs the user that the requested entity is not available. The detailed distribution of these goals is shown in Fig. 2. Among the 1,000 user goals, the numbers of goals involving 1/2/3 domains are 328/549/123, respectively.
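The sampling-and-resampling loop described above can be sketched as follows; the toy database, slot-frequency table, and function names are illustrative stand-ins for the actual MultiWOZ-derived statistics:

```python
import random

# Toy database and slot-value frequencies (hypothetical; the real goal
# sampler derives these counts from the MultiWOZ annotations).
DATABASE = [
    {"food": "japanese", "area": "centre", "name": "wagamama"},
    {"food": "italian", "area": "north", "name": "da vinci"},
]
SLOT_VALUE_FREQ = {
    "food": {"japanese": 3, "italian": 5},
    "area": {"centre": 4, "north": 2},
}

def sample_goal(rng):
    """Sample one value per slot, weighted by corpus frequency."""
    goal = {}
    for slot, freq in SLOT_VALUE_FREQ.items():
        values, weights = zip(*freq.items())
        goal[slot] = rng.choices(values, weights=weights)[0]
    return goal

def matching_entities(goal):
    """Database entries that satisfy every constraint in the goal."""
    return [e for e in DATABASE if all(e.get(s) == v for s, v in goal.items())]

def sample_valid_goal(rng, max_tries=100):
    """Resample until at least one database entry matches the constraints."""
    for _ in range(max_tries):
        goal = sample_goal(rng)
        if matching_entities(goal):
            return goal
    raise RuntimeError("no satisfiable goal found")
```

The extra rules that forbid inappropriate slot combinations (e.g., informing and requesting the same train arrival time) would be applied as an additional filter inside the loop.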

Platform and Simulator
We use the open-source end-to-end dialog system platform, ConvLab (Lee et al., 2019b), as our experimental platform. ConvLab enables researchers to develop a dialog system using preferred architectures and supports system-wise simulated evaluation. It also provides an integration of crowdsourcing platforms such as Amazon Mechanical Turk for human evaluation.
To automatically evaluate a multi-turn dialog system, ConvLab implements an agenda-based user simulator (Schatzmann et al., 2007). Given a user goal, the simulator's policy uses a stack-like structure with complex hand-crafted heuristics to inform its goal and mimic complex user behaviors during a conversation. Since the system interacts with the simulator in natural language, the user simulator directly takes system utterances as input and outputs a user response. The overall architecture of the user simulator is presented in Fig. 3. It consists of three modules: NLU, policy, and NLG. We use the default configuration of the simulator in ConvLab: an RNN-based model, MILU (Multi-Intent Language Understanding, which extends Hakkani-Tür et al. (2016)), for NLU; a hand-crafted policy; and a retrieval model for NLG.
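A minimal sketch of how an agenda-based user policy might operate; the class, act-tuple format, and heuristics below are hypothetical simplifications, not ConvLab's actual implementation:

```python
class AgendaPolicy:
    """Toy agenda-based user policy (illustrative, not ConvLab's API)."""

    def __init__(self, constraints, requests, acts_per_turn=2):
        self.constraints = dict(constraints)
        # The agenda is a stack: request acts at the bottom, inform acts
        # on top, so the user states constraints before asking questions.
        self.agenda = [("request", slot, None) for slot in requests]
        self.agenda += [("inform", s, v) for s, v in constraints.items()]
        self.acts_per_turn = acts_per_turn

    def next_user_acts(self, system_acts=()):
        """Pop a few pending acts, first reacting to system requests."""
        # Hand-crafted heuristic: if the system requests a slot the user
        # knows, push the answering inform act on top of the agenda.
        for intent, slot, _ in system_acts:
            if intent == "request" and slot in self.constraints:
                self.agenda.append(("inform", slot, self.constraints[slot]))
        n = min(self.acts_per_turn, len(self.agenda))
        return [self.agenda.pop() for _ in range(n)]

    def done(self):
        """The goal is exhausted once the agenda is empty."""
        return not self.agenda
```

In the real simulator, the popped dialog acts are rendered to natural language by the NLG module before being sent to the system.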

Evaluation Metrics
We use the number of dialog turns, averaged over all dialog sessions, to measure the efficiency of accomplishing a task. A user utterance and the subsequent system utterance are regarded as one dialog turn. The system should help each user accomplish his/her goal within 20 turns; otherwise the dialog is regarded as a failure. We use two other metrics, inform F1 and match rate, to estimate task success. Both metrics are calculated based on the dialog act (Stolcke et al., 2000), an abstract representation that captures the semantic information of an utterance. The dialog acts from the input and output of the user simulator's policy are used to calculate the two scores, as shown in Fig. 3. Inform F1 evaluates whether all the information requests are fulfilled, and match rate assesses whether the offered entity meets all the constraints specified in a user goal. The dialog is marked as successful if and only if both inform recall and match rate are 1.
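The three success-related quantities can be sketched at the dialog-act level as follows; the slot and entity representations are simplified assumptions, not ConvLab's exact data structures:

```python
def inform_scores(requested, informed):
    """Recall, precision, and F1 of requested slots actually informed."""
    requested, informed = set(requested), set(informed)
    tp = len(requested & informed)
    recall = tp / len(requested) if requested else 1.0
    precision = tp / len(informed) if informed else 1.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1

def match_rate(offered_entity, constraints):
    """1.0 iff the offered entity satisfies every user constraint."""
    return float(all(offered_entity.get(s) == v
                     for s, v in constraints.items()))

def is_success(requested, informed, offered_entity, constraints):
    """Success requires inform recall == 1 AND match rate == 1."""
    recall, _, _ = inform_scores(requested, informed)
    return recall == 1.0 and match_rate(offered_entity, constraints) == 1.0
```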

System Configurations
To investigate how much system-wise and component-wise evaluations differ, we compare a set of dialog systems that are assembled using different state-of-the-art modules and settings in our experiments. The full list of these systems is shown in Table 1, which includes 4 pipeline systems (SYSTEM-1∼4), 10 joint-model systems (SYSTEM-5∼14), and 2 end-to-end systems (SYSTEM-15∼16). Note that some systems (e.g., SYSTEM-4, SYSTEM-10) generate delexicalized responses where the slot values are replaced with their slot names. We convert these responses to natural language by filling in the slot values based on dialog acts and/or database query results.
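The lexicalization step can be sketched as a simple template fill; the placeholder syntax and function names are illustrative assumptions, not the exact format used by these systems:

```python
import re

def lexicalise(template, dialog_act, db_result):
    """Fill placeholders like [name] from the dialog act's slot values,
    falling back to the database query result for unfilled slots."""
    # Dialog-act values take precedence over database values.
    values = {**db_result, **dialog_act}

    def fill(match):
        slot = match.group(1)
        # Leave unknown placeholders untouched rather than guess.
        return str(values.get(slot, match.group(0)))

    return re.sub(r"\[(\w+)\]", fill, template)
```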
In what follows, we briefly introduce these modules and the corresponding models used in our experiments. The component-wise evaluation results of these modules are shown in Table 2. For published works, we train all the models using the open-source code with the training, validation, and test split provided in MultiWOZ, and replicate the performance reported in the original papers or on the leaderboard.
NLU A natural language understanding module identifies user intents and extracts associated information from users' raw utterances. We consider two approaches that can handle multiple intents: an RNN-based model, MILU, which extends Hakkani-Tür et al. (2016) and is fine-tuned on multiple domains, intents, and slots; and a fine-tuned BERT model (Devlin et al., 2019). Following the joint tagging scheme (Zheng et al., 2017), the labels for intent detection and slot filling are annotated for domain classification during training. Both models use the dialog history up to the last dialog turn as context. Since there can be multiple intents or slots in one sentence, we calculate two F1 scores, one for intents and one for slots.
DST A dialog state tracker encodes the extracted information as a compact dialog state that contains a set of informable slots and their corresponding values (user constraints), and a set of requested slots. We implement a rule-based DST that updates the slot values in the dialog state based on the output of NLU. We then compare four word-level DST models: a multi-domain classifier, MDBT (Ramadan et al., 2018), which enumerates all possible candidate slots and values; SUMBT (Lee et al., 2019a), which uses a BERT encoder and a slot-utterance matching architecture for classification; TRADE (Wu et al., 2019), which shares knowledge among domains to directly generate slot values; and COMER (Ren et al., 2019), which applies a hierarchical encoder-decoder model for state generation. We use two metrics for evaluation. The joint goal accuracy compares the predicted dialog states to the ground truth at each dialog turn; the output is considered correct if and only if all the predicted values exactly match the ground truth. The slot accuracy individually compares each (domain, slot, value) triplet to its ground truth label.
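A rule-based DST of this kind reduces to a small update function over the NLU output; the act-tuple format and state layout below are illustrative assumptions:

```python
def update_state(state, nlu_output):
    """Rule-based DST sketch: overwrite informed slot values and record
    requested slots (structure is illustrative, not ConvLab's exact one)."""
    for intent, domain, slot, value in nlu_output:
        if intent == "inform":
            # Later informs overwrite earlier values for the same slot.
            state["belief"].setdefault(domain, {})[slot] = value
        elif intent == "request":
            state["requested"].add((domain, slot))
    return state
```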
Policy A dialog policy relies on the dialog state provided by DST to select a system action. We compare two dialog policies: a hand-crafted policy, and a reinforcement learning policy, GDPL (Takanobu et al., 2019), that jointly learns a reward function. We also include in our comparison three joint models, known as word-level policies, which combine the policy and the NLG module to produce natural language responses from dialog states. They are MDRG (Wen et al., 2017), where an attention mechanism is conditioned on the dialog states; HDSA (Chen et al., 2019), which decodes responses from predicted hierarchical dialog acts; and LaRL (Zhao et al., 2019), which uses a latent action framework. We use the BLEU score (Papineni et al., 2002), inform rate, and task success rate as evaluation metrics. Note that the inform rate and task success for evaluating policies are computed at the turn level, while those used in system-wise evaluation are computed at the dialog level.
NLG A natural language generation module generates a natural language response from a dialog act representation. We experiment with two models: a retrieval-based model that randomly samples a sentence from the corpus using dialog acts, and a generation-based model, SCLSTM (Wen et al., 2015), which adds a sentence planning cell to an RNN. To evaluate the performance of NLG, we adopt the BLEU score to assess the quality of the generated text, and the slot error rate (SER) to measure whether the generated response contains missing or redundant slot values.
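A simplified sketch of SER; this version counts only missing slot values (the redundant-value half of the metric is omitted for brevity), and the substring check is a rough stand-in for delexicalized matching:

```python
def slot_error_rate(slot_values, response):
    """Fraction of dialog-act slot values the generated response
    fails to realise (missing-value portion of SER only)."""
    if not slot_values:
        return 0.0
    missing = [v for v in slot_values
               if v.lower() not in response.lower()]
    return len(missing) / len(slot_values)
```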
E2E An end-to-end model takes user utterances as input and directly outputs system responses in natural language. We experiment with two models: TSCP (Lei et al., 2018), which uses belief spans to represent dialog states, and DAMD (Zhang et al., 2020a), which further uses action spans to represent dialog acts as additional information. For single-turn evaluation, BLEU, inform rate, and success rate are reported.

Performance under Different Settings (RQ1)
We compare the performance of three types of systems, pipeline, joint-model and end-to-end. Results in Table 1 show that pipeline systems often achieve better overall performance than the joint models and end-to-end systems because using fine-grained labels at the component level can help pipeline systems improve the task success rate.
NLU with DST or joint DST It is essential to predict dialog states in order to determine what a user has expressed and wants to inquire about. The dialog state is used to query the database, predict the system dialog act, and generate a dialog response. Although many studies have focused on word-level DST that directly predicts the state from the user query, we also investigate the cascaded configuration where an NLU model is followed by a rule-based DST. As shown in Table 1, the success rate declines sharply when using word-level DST, compared to using an NLU model followed by a rule-based DST (17.3%∼27.8% in SYSTEM-(5∼8) vs. 80.9% in SYSTEM-1). The main reason is that the dialog act predicted by NLU contains both slot-value pairs and user intents, whereas the dialog state predicted by word-level DST only records the user constraints in the current turn, causing information loss for action selection (via dialog policy), as shown in Fig. 4. For example, a user may want to confirm the booking time of the restaurant, but such an intent cannot be represented in the slot values. However, we can observe that word-level DST achieves better overall performance when combined with word-level policy, e.g., a 40.4% success rate in SYSTEM-13 vs. 27.8% in SYSTEM-6. This is because word-level policy implicitly detects user intents by encoding the user utterance as additional input, as presented in Fig. 5. Nevertheless, all these joint approaches still under-perform traditional pipeline systems.
NLG from dialog act or state We compare two strategies for generating responses. One is based on an ordinary NLG module that generates a response according to the dialog act predicted by the dialog policy. The other uses a word-level policy to directly generate a natural language response based on the dialog state and user query. As Table 1 shows, the performance drops substantially when we replace the retrieval NLG module with a joint model such as MDRG or HDSA. This indicates that the dialog act has encoded sufficient semantic information, so that even a simple retrieval NLG module can give high-quality replies. However, the fact that SYSTEM-11, which uses the word-level policy LaRL, even outperforms SYSTEM-4, which uses the NLG model SCLSTM, in task success (47.7% vs. 43.0%) indicates that response generation can be improved by jointly training the policy and NLG modules.
Database query As part of dialog management, it is crucial to identify the correct entity that satisfies the user goal. MultiWOZ contains a large number of entities across multiple domains, making it impossible to explicitly learn the representations of all the entities in the database as previous work did (Dhingra et al., 2017; Madotto et al., 2018). This requires the designed system to deal with a large-scale external database, which is closer to reality. It can be seen in Table 1 that most joint models have a lower match rate than the pipeline systems. In particular, SYSTEM-15 rarely selects an appropriate entity during the dialog (13.68% match rate), since its belief spans only copy values from utterances without knowing which domain or slot type the values belong to. Due to the poor performance in dialog state prediction, it cannot query the external database selectively, thereby failing to satisfy the user's constraints. In comparison, SYSTEM-16 achieves the highest success rate (48.5%) and the second-highest match rate (59.67%) among all the systems using joint models (SYSTEM-5∼14). This is because DAMD utilizes action spans to predict both user and system dialog acts in addition to belief spans, which makes it behave like a pipeline system. This indicates that explicit dialog act supervision can improve dialog state tracking.

Component-wise vs. System-wise Evaluation (RQ2)
It is important to verify whether component-wise evaluation is consistent with system-wise evaluation. By comparing the results in Table 1 and Table 2, we can observe that sometimes they are consistent (e.g., BERT > MILU in Table 2a, and SYSTEM-1 > SYSTEM-2), but not always (e.g., TRADE > SUMBT in Table 2b, but SYSTEM-6 > SYSTEM-7).
In general, a better NLU model leads to a better multi-turn conversation, and SYSTEM-1 outperforms all other configurations in completing user goals. With respect to DST, though word-level DST models directly predict dialog states without explicitly detecting user intents, most of them perform poorly in terms of joint accuracy, as shown in Table 2b. This severely harms the overall performance because the downstream tasks rely strongly on the predicted dialog states. Interestingly, TRADE has higher accuracy than SUMBT on DST, but TRADE performs worse than SUMBT in system-wise evaluation (22.4% in SYSTEM-7 vs. 27.8% in SYSTEM-6). The observation is similar for COMER vs. TRADE. This indicates that the results of component-wise evaluation in DST are not consistent with those of system-wise evaluation, which may be attributed to the noisy dialog state annotations (Eric et al., 2019). As for word-level policy, HDSA, which uses explicit dialog acts as supervision, achieves a higher BLEU than LaRL, which uses latent dialog acts; but LaRL, fine-tuned with reinforcement learning, has a much higher match rate than HDSA (Table 2c), and the gap widens in system-wise evaluation (19.2% in SYSTEM-9 vs. 34.3% in SYSTEM-10). In addition, even though SCLSTM achieves a higher BLEU score than the retrieval-based model (51.6% vs. 33.1% in Table 2d), it obtains a lower success rate (43.0% in SYSTEM-4 vs. 80.9% in SYSTEM-1) when assembled with other modules. These results again show the discrepancy between component-wise and system-wise evaluation. The superiority of the systems using retrieval models may imply that a lower SER in NLG is more critical than a higher BLEU in goal-oriented dialog systems.
Error in multi-turn interactions Most existing work only evaluates the model with single-turn interactions. For instance, inform rate and task success at each dialog turn are computed given the current user utterance, dialog state, and database query results for context-to-response generation (Budzianowski and Vulić, 2019). A strong assumption is that the model is fed with the ground truth from the upstream modules or the last dialog turn. However, this assumption does not hold, since a goal-oriented dialog consists of a sequence of associated inquiries and responses between the system and its user, and the system may produce erroneous output at any time.
The errors may propagate to the downstream modules and affect the following turns. For instance, end-to-end models get a worse success rate in multi-turn interactions than in single-turn evaluation in Table 2e. A sample dialog from SYSTEM-1 and SYSTEM-6 is provided in Table 6. SYSTEM-6 does not extract the pricerange slot (highlighted in red color) correctly. The incorrect dialog state further harms the performance of the dialog policy, and the conversation gets stuck where the user (simulator) keeps asking for the postcode, thereby failing to complete the task. To summarize, component-wise, single-turn evaluation results do not reflect the real performance of a system well, and it is essential to evaluate a dialog system in an end-to-end, interactive setting.

Performance of Task with Different Complexities (RQ3)
With the increasing demand to address various situations in multi-domain dialog, we choose 9 representative systems across different configurations and approaches to further investigate how their performance varies with the complexity of the tasks. 100 user goals are randomly sampled under each domain setting. Results in Tables 3 and 4 show that the overall performance of all systems varies with different task domains and drops significantly with the increase of task complexity, while pipeline systems are relatively robust to task complexity. Table 3 shows the performance with respect to different single domains. Restaurant is a common domain where users inquire about information on a restaurant and make reservations. Train has more entities, and its domain constraints can be more complex, e.g., the preferred train should arrive before 5 p.m. Attraction is an easier domain where users do not make reservations. There are 7/6/3 informable slots that need to be tracked in Restaurant/Train/Attraction, respectively. Surprisingly, most systems perform better in Restaurant or Train than in Attraction. This may result from the noisy database in Attraction, where pricerange information is sometimes missing, and from the uneven data distribution, where Restaurant and Train appear more frequently in the training set. In general, pipeline systems perform more stably across multiple domains than joint models and end-to-end systems.

Performance with different single domains
Performance with different numbers of domains Table 4 demonstrates how the performance varies with the number of domains in a task. We can observe that most systems fall short in dealing with multi-domain tasks. Though some systems such as SYSTEM-13 and SYSTEM-16 can achieve a relatively high inform F1 or match rate in a single domain, the overall success rate drops substantially on two-domain tasks, and most systems fail to complete three-domain tasks. The number of dialog turns also increases remarkably as the number of domains increases. Among all these configurations, only the pipeline systems SYSTEM-2 and SYSTEM-1 can keep a high success rate when there are three domains in a task. These results show that current dialog systems are still insufficient for dealing with complex tasks, and that pipeline systems outperform joint models and end-to-end systems.

Simulated vs. Human Evaluation (RQ4)
Since the ultimate goal of a task-oriented dialog system is to help users accomplish real-world tasks, it is essential to verify the correlation between simulated and human evaluation. For human evaluation, 100 Amazon Mechanical Turk workers are hired to interact with each system and then give their judgment on task success. In addition, each worker rates two other metrics, language understanding and response appropriateness, on a five-point scale. We compare the 5 systems that achieve the best performance in the simulated evaluation under different settings.

Table 5: System-wise evaluation with human users. The correlation coefficient between simulated and human evaluation is presented in the last column.

Table 5 shows the human evaluation results of the 5 dialog systems. Comparing with the simulated evaluation in Table 1, we can see that Pearson's correlation coefficient lies around 0.5 to 0.6 for most systems, indicating that simulated evaluation correlates moderately well with human evaluation. Similar to simulated evaluation, the pipeline system SYSTEM-1 obtains the highest task success rate in human evaluation. A sample human-machine dialog from SYSTEM-1 and SYSTEM-6 is provided in Table 7. The result is similar to the simulated session in Table 6, but in Table 7 SYSTEM-6 instead fails to respond with the phone number (highlighted in red color). All of this implies the reliability of simulated evaluation in goal-oriented dialog systems, showing that simulated evaluation can be a valid alternative to the costly human evaluation for system developers.
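Pearson's correlation between paired simulated and human metrics can be computed directly; the per-system numbers below are illustrative placeholders, not the paper's actual results:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy paired per-system success rates (simulated vs. human; illustrative).
simulated = [0.81, 0.48, 0.40, 0.28, 0.22]
human = [0.55, 0.30, 0.14, 0.20, 0.15]
r = pearson_r(simulated, human)
```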
However, compared to simulated evaluation, we can observe that humans converse more naturally than the simulator, e.g., the user confirms with SYSTEM-1 whether it has booked 7 seats in Table 7, and most systems perform worse in human evaluation. This indicates that there is still a gap between simulated and human evaluation, which is due to the discrepancy between the corpus and human conversations. The dataset only contains limited human dialog data, on which the user simulator is built; both the system and the simulator are hence limited by the training corpus. As a result, the task success rate of most systems decreases significantly in human evaluation, e.g., from 40.4% to 14% in SYSTEM-13. This indicates that existing dialog systems are vulnerable to the variation of human language (e.g., the sentence highlighted in brown in Table 7), which demonstrates a lack of robustness in dealing with real human conversations.

Related Work
Developers have faced many problems when evaluating a goal-oriented dialog system. A range of well-defined automatic metrics have been designed for different components in the system, e.g., joint goal accuracy in DST and task success rate in policy optimization, as introduced in Tables 2b and 2c. A broadly accepted evaluation scheme for goal-oriented dialog was first proposed by PARADISE (Walker et al., 1997). It estimates user satisfaction by measuring two types of aspects, namely dialog cost and task success. Paek (2001) suggests that a useful dialog metric should provide an estimate of how well the goal is met and allow for a comparative judgment of different systems. Though a model can be optimized against these metrics via supervised learning, each component is trained or evaluated separately, making it difficult to reflect real user satisfaction.
As human evaluation, which asks crowd-sourcing workers to interact with a dialog system, is expensive (Ultes et al., 2013; Su et al., 2016) and prone to be affected by subjective factors (Higashinaka et al., 2010; Schmitt and Ultes, 2015), researchers have tried to realize automatic evaluation of dialog systems. Simulated evaluation (Araki and Doshita, 1996; Eckert et al., 1997) is widely used in recent works (Williams et al., 2017; Takanobu et al., 2019) and platforms (Lee et al., 2019b; Papangelis et al., 2020), where the system interacts with a user simulator that mimics human behaviors. Such evaluation can be conducted at the dialog act or natural language level. The advantages of simulated evaluation are that it can support multi-turn language interaction in a full end-to-end fashion and generate dialogs unseen in the original corpus.

Conclusion and Discussion
In this paper, we have presented system-wise evaluation results and an empirical analysis to estimate the practicality of goal-oriented dialog systems with a number of configurations and approaches. Though our experiments are conducted only on MultiWOZ, we believe the results generalize to other goal-oriented dialog scenarios. We have the following observations: 1) We find that rule-based pipeline systems generally outperform state-of-the-art joint systems and end-to-end systems, in terms of both overall performance and robustness to task complexity. The main reason is that fine-grained supervision on dialog acts remarkably helps the system plan and make decisions, since the system should predict the user intent and take proper actions during the conversation. This suggests that good pragmatic parsing (e.g., dialog acts) is essential to building a dialog system.
2) Results show that component-wise, single-turn evaluation results are not always consistent with the overall performance of dialog systems. In order to accurately assess the effectiveness of each module, system-wise, multi-turn evaluation should be used from the practical perspective. We advocate assembling the proposed model of a specific module into a complete system, and evaluating the system with simulated or human users via a standardized dialog platform, such as Rasa (Bocklisch et al., 2017) or ConvLab. This enables a full assessment of the module's contribution to the overall performance and facilitates fair comparison with other approaches.
3) Simulated evaluation can give a good assessment of goal-oriented dialog systems and shows a moderate correlation with human evaluation, but it remarkably overestimates system performance in human interactions. Thus, there is a need to devise better user simulators that resemble humans more closely. A simulator should be able to generate natural and diverse responses, and may change goals in complex dialogs. In addition, the simulator itself may make mistakes, which leads to a wrong estimate of system performance. Even in human evaluation, a dialog system needs to deal with complicated and uncertain situations; therefore, it is vital to enhance the robustness of dialog systems. Despite the discrepancy between simulators and human users, simulated evaluation is still a valid alternative to the costly human evaluation, especially in the early stage of development.