Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with User Simulator

Task-Oriented Dialogue (TOD) systems are drawing more and more attention in recent studies. Current methods focus on constructing pre-trained models or fine-tuning strategies, while the evaluation of TOD is limited by a policy mismatch problem: during evaluation, the user utterances are taken from the annotated dataset, although they should instead respond to the previous system responses, which can have many valid alternatives besides the annotated texts. Therefore, in this work, we propose an interactive evaluation framework for TOD. We first build a goal-oriented user simulator based on pre-trained models and then use the user simulator to interact with the dialogue system to generate dialogues. Besides, we introduce a sentence-level and a session-level score to measure sentence fluency and session coherence in the interactive evaluation. Experimental results show that RL-based TOD systems trained with our proposed user simulator can achieve nearly 98% inform and success rates in the interactive evaluation on the MultiWOZ dataset, and that the proposed scores measure response quality beyond the inform and success rates. We hope that our work will encourage simulator-based interactive evaluations for the TOD task.


Introduction
Building intelligent dialogue systems has become a trend in natural language processing applications, especially with the help of powerful pre-trained models. Specifically, task-oriented dialogue (TOD) systems (Zhang et al., 2020b) help users with scenarios such as booking hotels or flights. These TOD systems (Wen et al., 2017; Zhong et al., 2018; Chen et al., 2019) usually first recognize the user's intents and then generate corresponding responses based on an external database containing booking information. Therefore, the key factor in TOD is the interaction between users and dialogue systems.
However, the traditional TOD evaluation process uses annotated user utterances in multi-turn dialogue sessions no matter what responses the dialogue system generates, as illustrated in Figure 1. In real-world dialogues, by contrast, the user utterances are coherent with the responses from the other speaker (the service provider). Therefore, in TOD evaluation, using annotated utterances without interaction with the dialogue system causes a policy mismatch, which weakens the soundness of the evaluation results. The mismatch can hurt the evaluation since some responses may be correct and coherent yet follow a different policy from the annotated responses. Also, incoherent dialogue histories affect response generation. With current state-of-the-art models achieving similar performance, it is natural to suspect that the bottleneck of current TOD systems is not model capability but the evaluation strategy. Since incorporating human interaction during evaluation is costly, a feasible alternative is to build an automatic interactive evaluation framework that solves the policy mismatch problem.
In this paper, we propose a complete interactive evaluation framework for TOD systems. We first build a strong dialogue user simulator based on pre-trained models, and we use the proposed simulator to deploy interactive evaluations.
For simulator learning, we introduce a goal-guided user utterance generation model based on sequence-to-sequence pre-trained models. We then use reinforcement learning to train both the user simulator and the dialogue system to boost interaction performance.
In interactive evaluations, we use the simulator to generate user utterances through interaction with the generated responses, instead of using static user utterances annotated in advance. Therefore, during evaluation, the user utterances respond to the generated responses, which avoids the mismatch between stale user utterances and generated responses. Further, in interactive evaluations, the quality of the generated texts cannot be measured by the traditional BLEU score, which cannot be calculated without oracle texts. To better evaluate the performance of dialogue systems, we introduce two automatic scores that evaluate response quality at both the sentence level and the session level. The sentence-level score evaluates sentence fluency, and the session-level score evaluates the coherence between turns in a dialogue session. These proposed scores can also be applied to traditional evaluation methods and to the annotated dataset as a meta-evaluation, to explore the importance of using user simulators to construct interactive evaluations.
We conduct experiments on the MultiWOZ dataset (Budzianowski et al., 2018) based on pre-trained models and use our proposed simulator and scores to run interactive evaluations. Experimental results show that interactive evaluations can achieve over 98% inform and success rates, indicating that the bottleneck of TOD performance is the lack of proper evaluation methods. The proposed scores show that our simulator helps achieve promising evaluation results in the interactive evaluation framework. We also explore the performance of RL-based models and use the proposed scores to find that RL methods might hurt response quality in order to achieve high success rates.
Therefore, we summarize our contributions as follows: (A) We construct an evaluation framework that avoids the policy mismatch problem in TOD.
(B) We build a strong user simulator for TOD systems that can be used in TOD training and evaluation.
(C) Experimental results show the importance of using our proposed simulator and evaluation framework and provide hints for future TOD system development, with publicly available code.

Task-Oriented Dialogue Systems
Task-oriented dialogue systems aim to achieve users' goals such as booking hotels or flights (Wen et al., 2017; Eric et al., 2017). With the widespread use of pre-trained models (Qiu et al., 2020), end-to-end TOD systems based on pre-trained models have become more and more popular: Hosseini-Asl et al. (2020) fine-tune all subtasks of TOD using multi-task learning on a single pre-trained model. Yang et al. (2021) encode the results of intermediate subtasks, such as belief states and system actions, into the dialogue history to boost response generation. Su et al. (2021) and He et al. (2021) use additional dialogue corpora to further pre-train the language model and then fine-tune it on the MultiWOZ dataset. Lee (2021) introduces an auxiliary task based on T5 models (Raffel et al., 2020) and achieves state-of-the-art performance without further pre-training.

Automatic Evaluations
Recent trends leverage neural models to automatically evaluate generated texts from different perspectives. Automatic evaluation methods can help evaluate certain aspects of certain tasks, such as factuality checking in text summarization (Kryscinski et al., 2020), stronger BLEU-style scores in machine translation (Sellam et al., 2020), and coherence in dialogue systems (Tao et al., 2018; Pang et al., 2020). With pre-trained models, the quality of text generation can be measured by evaluation methods such as BERTScore (Zhang et al., 2020a) and BARTScore (Yuan et al., 2021). With properly designed neural model scores, the performance of dialogue systems can be evaluated more accurately.

User Simulators
User simulators are designed to simulate users' behaviors in dialogue interactions and include rule-based simulators (Lee et al., 2019) and model-based simulators (Takanobu et al., 2020; Tseng et al., 2021). Usually, user simulators are introduced together with reinforcement learning strategies to enhance dialogue policy modeling (Li et al., 2016; Shi et al., 2019), which can help the model learn better policies not included in the annotated data. Takanobu et al. (2020) treat the model-based simulator as a dialogue agent like the dialogue system and formulate TOD as a multi-agent policy learning problem. Tseng et al. (2021) focus on using reinforcement learning to jointly train the simulator and the dialogue system to boost the domain adaptation capability of the model.

Interactive Evaluation Framework
In our proposed interactive evaluation framework, we first build a goal-state guided user simulator to model user policies and generate high-quality user utterances. We then construct the interactive evaluation framework and introduce two scores to evaluate the interactive inference results.

User Simulator Construction
A user simulator generates user utterances for interactions with the dialogue system. Similar to the dialogue system, the user simulator considers dialogue histories and generates utterances via a sequence-to-sequence text generation framework. We propose a goal-state guided simulator that controls user utterance generation based on goal-state tracking. Further, we adopt reinforcement learning methods to boost the interaction performance between our goal-state guided simulator and dialogue systems.

Goal-State Guided Simulator
We introduce a goal-state guided simulator that generates user utterances based on sequence-to-sequence pre-trained models. The basic idea is to use the pre-defined user goals as the initial goal states and to track the goal states based on user and system actions, which is similar to belief state tracking. Figure 2 illustrates the interaction process between the user simulator and the dialogue system. We first add the current goal states to the front of the user simulator inputs. The user simulator also encodes the previous dialogue history, including user utterances and dialogue system responses. The user simulator predicts the user actions and then obtains the finished goals by combining both user actions and dialogue system actions. By cutting the finished goals off from the current goal states, we obtain the unfinished goals, on which the user simulator bases its utterance generation at the next turn. When the user simulator has finished all the required goals, the unfinished goal slot is empty and the user simulator ceases to generate utterances. Besides, we add two additional termination signals for the user simulator: the dialogue session is terminated when it exceeds a certain number of turns while the goal states remain unfinished, or when the user simulator or the dialogue system generates a definite stop action such as 'bye' or 'thank'.

Example interaction (from Figure 2):
User Simulator: I am looking for an Indian restaurant in the expensive price range.
System: There are 5 Indian restaurants in the city. Is there a particular area you would like to dine in?
User Simulator: No, I don't care. I would like to book a table for 6 people at 13:15 on friday.
System: I have booked you a table at Nusha. Your reference number is 021... Is there anything else I can help you with?
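The goal-state bookkeeping described above can be sketched in a few lines of set arithmetic. The slot names, the turn cutoff, and the stop-action labels below are illustrative assumptions, not the paper's exact implementation:

```python
# Minimal sketch of goal-state tracking for the user simulator.
# Slot names, action labels, and the turn cutoff are illustrative assumptions.
MAX_TURNS = 20                      # assumed session-length cutoff
STOP_ACTIONS = {"bye", "thank"}     # definite actions that end a session

def update_goal_state(goal_state, user_actions, system_actions):
    """Cut goals finished by either side off from the current goal state."""
    finished = set(user_actions) | set(system_actions)
    return {slot: value for slot, value in goal_state.items()
            if slot not in finished}

def should_terminate(goal_state, turn, user_actions, system_actions):
    """Stop when goals are done, the turn limit is hit, or a stop action appears."""
    if not goal_state:                   # all required goals finished
        return True
    if turn >= MAX_TURNS:                # session exceeds the turn limit
        return True
    actions = set(user_actions) | set(system_actions)
    return bool(actions & STOP_ACTIONS)  # 'bye' / 'thank' terminates the session
```

At each turn, the remaining (unfinished) goal state is what gets prepended to the simulator input for the next utterance.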

Simulator Training
The training process of the user simulator includes sequence-to-sequence supervised learning and reinforcement learning.
In supervised learning, the user simulator encodes the goal states at the front of the input texts and considers the full dialogue history, including the texts of both user utterances and system responses. The generated texts include the current user actions and user utterances. Therefore, the entire training process is a standard sequence-to-sequence generation task optimized by a cross-entropy loss.
In addition, we incorporate a reinforcement learning strategy to further boost the interaction performance between user simulators and dialogue systems. It is intuitive to use the inform and success rates as rewards to improve the interaction policies between simulators and dialogue systems: supervised learning can only learn policies annotated in the training dataset, while reinforcement learning can bring new interaction policies to both simulators and dialogue systems.
Therefore, following the policy gradient algorithm (Sutton et al., 1999), we use the success rate to construct rewards that optimize both the user simulator θ and the dialogue system ϕ. For each turn t, we treat the generation of each token as an action. We therefore calculate turn-level RL gradients based on each generated token i in the user simulator and j in the dialogue system:

∇_θ J_t = Σ_{i=1}^{|A_t|} γ^{|A_t|−i} R_t ∇_θ log π_θ(a_i | a_{<i}),
∇_ϕ J_t = Σ_{j=1}^{|A_t|} γ^{|A_t|−j} R_t ∇_ϕ log π_ϕ(a_j | a_{<j}).

Here, γ is the discounting factor, |A_t| is the policy sequence length of turn t, and π(·) is the forward strategy of the corresponding simulator θ or dialogue system ϕ. The reward R_t is 1 when the session generated by the simulator-system interaction successfully achieves the user's goals and 0 otherwise.
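The turn-level update above is a REINFORCE-style objective over generated tokens. The sketch below computes the corresponding (negated) return-weighted log-likelihood for one sequence; the discounting direction and the per-sequence normalization are our assumptions, and in practice the gradient would be taken through these log-probabilities by an autograd framework:

```python
def reinforce_loss(token_logprobs, success, gamma=0.99):
    """Turn-level REINFORCE objective for one generated token sequence.

    token_logprobs: log pi(a_i | a_<i, history) for each generated token,
    as produced by the simulator or the dialogue system.
    success: 1.0 if the interaction achieved the user's goals, else 0.0.
    Returns the negated, discounted, reward-weighted log-likelihood; its
    gradient w.r.t. the model parameters is the policy-gradient update.
    """
    T = len(token_logprobs)
    loss = 0.0
    for i, lp in enumerate(token_logprobs):
        # tokens closer to the end of the turn are discounted less
        loss -= (gamma ** (T - i - 1)) * success * lp
    return loss / T
```

With a binary success reward, unsuccessful sessions contribute no gradient, which is one reason Monte Carlo training here needs many sampled episodes.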

Evaluation Framework Construction
Current TOD evaluation strategies face a policy mismatch between users and dialogue systems. Therefore, it is necessary to construct a complete evaluation framework based on simulator interactions that avoids the policy mismatch when evaluating TOD systems. Based on the proposed simulator, we evaluate TOD systems interactively. Further, we propose sentence-level and session-level scores for the interactive evaluation process to measure the quality of generated responses. We also apply the introduced scores to RL training to further explore dialogue system performance.

Traditional Evaluation
The traditional evaluation process for TOD systems uses inform and success rates to measure whether the responses achieve the user's goals in a given session. During this evaluation, the responses are generated by the dialogue system, but the user utterances are taken from the annotated dataset, regardless of the interaction between the responses and the following user utterances. Therefore, the whole session can be unnatural due to the policy mismatch problem. Further, the BLEU score is used to measure the quality of the generated responses against the annotated oracle responses, yet the annotated responses may be entirely different from the generated responses when there is a policy mismatch. For instance, on the MultiWOZ dataset, responses generated by the MTTOD (Lee, 2021) model achieve an average BLEU score of 19.47, yet 4,686 of 7,372 responses obtain BLEU scores below 1. This phenomenon, where responses achieve either high or nearly zero BLEU scores, indicates that the evaluation process might suffer from policy mismatch problems. Therefore, we believe that the bottleneck of current TOD systems is the stale evaluation process with policy mismatches.

Interactive Evaluation
Interactive evaluation evaluates dialogue responses based on interactions between users and systems. The user utterances can therefore interact with the dialogue responses, which avoids evaluation errors when the responses are reasonable but do not match the following utterances. During interactive evaluation, we assume that the user simulator can produce high-quality user utterances based on the pre-defined goals the user aims to achieve. The evaluation metrics, inform and success rates, are the same as in traditional evaluation.

Sentence-Level Score
During the interactive evaluation, the quality of the generated response cannot be measured by the BLEU score. To properly evaluate the quality of responses generated during interactions with the user simulator, we use an automatic score instead.
Specifically, similar to text perplexity and BARTScore evaluation (Yuan et al., 2021), we measure sentence fluency with the score

Score_sent(y) = −(1/L) Σ_{i=1}^{L} log p_θ(y_i | y_{<i}).

Here, y_i denotes the i-th token in the generated response; θ represents the language model, which is a fine-tuned GPT-2 (Radford et al., 2019) in our experiments; and L is the sequence length. Lower scores indicate more fluent responses.
With such a score, we can properly measure the response quality when the BLEU score is no longer applicable as the metric for dialogue response quality evaluation.
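The score is an average token-level negative log-likelihood under a language model. The sketch below makes this concrete with a stub `token_prob` callback standing in for the fine-tuned GPT-2; the function name and interface are illustrative assumptions:

```python
import math

def sentence_score(tokens, token_prob):
    """Average negative log-likelihood of a response under a language model.

    token_prob(prefix, token) -> p(token | prefix); in the paper this would
    come from a GPT-2 fine-tuned on MultiWOZ, stubbed out here.
    Lower scores correspond to more fluent (higher-probability) responses.
    """
    nll = 0.0
    for i, tok in enumerate(tokens):
        p = token_prob(tokens[:i], tok)  # conditional probability of token i
        nll -= math.log(p)
    return nll / len(tokens)
```

Unlike BLEU, nothing here requires a reference response, which is what makes the score usable during simulator interaction.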

Session-Level Score
Besides sentence-level response quality, it is also important to evaluate the coherence between the dialogue system and the user simulator. Therefore, we propose a session-level score to explore the coherence between user utterances and dialogue responses.
Inspired by sentence-relation tasks such as natural language inference (MacCartney and Manning, 2008) and next sentence prediction (Devlin et al., 2019), we construct a binary classification model to predict whether the interactions between utterances and responses are coherent and fluent. Naturally, a user utterance and its response are usually coherent, since the dialogue system is trained to answer the user's queries; it is the response and the follow-up user utterance that might be incoherent in the traditional evaluation process. Therefore, we consider a continuous session segment including one user utterance u_t, its corresponding response r_t, and the next user utterance u_{t+1}. We assume that dialogue turn pairs such as [u_t, r_t] and [r_t, u_{t+1}] are coherent in nature, and we select random responses r* to construct negative samples for training the binary classifier that scores session-level coherence. After building the binary classifier, we use the average score (softmaxed confidence) over all [u_t, r_t] and [r_t, u_{t+1}] pairs in a session as the session score.
The session-level score can be used as an evaluator of the fluency of the entire session: during the interaction process, the session-level scorer evaluates the overall quality of both user utterances and dialogue responses. Therefore, this score can be used to evaluate both user simulator performance and dialogue system performance.
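Averaging the classifier confidence over adjacent turn pairs can be sketched as follows, with a stub `coherence_prob` callback standing in for the trained binary classifier; the callback interface is an illustrative assumption:

```python
def session_score(turns, coherence_prob):
    """Average pairwise coherence over a dialogue session.

    turns: alternating utterances [u_1, r_1, u_2, r_2, ...].
    coherence_prob(a, b) -> softmaxed classifier confidence that b is a
    coherent continuation of a; in the paper this would be a BERT-base
    binary classifier trained with randomly swapped responses as
    negatives, stubbed out here.
    """
    # adjacent pairs cover both [u_t, r_t] and [r_t, u_{t+1}]
    pairs = list(zip(turns, turns[1:]))
    scores = [coherence_prob(a, b) for a, b in pairs]
    return sum(scores) / len(scores)
```

Because the pairs include both directions of the exchange, the score penalizes incoherence on either side, whether from the simulator or from the system.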

Datasets
We use the MultiWOZ 2.0 dataset (Budzianowski et al., 2018) for all our experiments. MultiWOZ 2.0 contains 10,438 sessions with 115,434 turns across 7 domains. In MultiWOZ, the initial goal-state information is given; it is the information originally provided to annotators to craft the dataset. Therefore, these goal states are suitable for simulator training.

Dialogue System Implementations
In the interactive evaluation process, we use several state-of-the-art TOD systems, including UBAR (Yang et al., 2021), PPTOD (Su et al., 2021), and MTTOD (Lee, 2021). For a fair comparison, we re-implement all these models and adopt the same data processing and evaluation scripts. For the MTTOD model, we use T5-small as the backbone and remove its auxiliary task and additional decoder for simplicity, which causes no performance degradation. For the PPTOD model, we also use its small version. Our code is based on Huggingface Transformers and the MTTOD implementation.
For the RL-based model, we initialize the model parameters from a supervised-learning-based model and use a Monte Carlo based policy gradient method, which is time-consuming. Therefore, in each RL epoch, we randomly pick 200 user goals from the training set (corresponding to 200 training sessions) to run 400 episodes of reinforcement learning. During training, we first fix one agent and update the other for 200 episodes, and then vice versa.
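The alternating update schedule can be sketched as below; the callback names and the reuse of sampled goals across episodes are assumptions for illustration, not the exact training loop:

```python
import random

def rl_epoch(train_goals, update_simulator, update_system, n_goals=200):
    """One RL epoch with the alternating update schedule described above.

    Randomly picks `n_goals` user goals from the training set, then runs
    200 episodes updating only the dialogue system while the simulator is
    frozen, followed by 200 episodes with the roles reversed (400 episodes
    per epoch in total). `update_system` / `update_simulator` are assumed
    training callbacks that roll out one episode and apply a gradient step.
    """
    goals = random.sample(train_goals, n_goals)
    for episode in range(200):               # simulator fixed, system updated
        update_system(goals[episode % n_goals])
    for episode in range(200):               # system fixed, simulator updated
        update_simulator(goals[episode % n_goals])
```

Freezing one agent at a time keeps each half of the epoch a stationary environment for the agent being updated, which is a common way to stabilize this kind of two-agent policy gradient training.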

Simulator Implementations
We implement the simulator based on T5 (Raffel et al., 2020), specifically the small and base versions. The supervised training process of the simulator is similar to dialogue system training and uses similar hyper-parameters. For the reinforcement learning process, we use different random seeds to construct multiple runs.

Score Implementations
For the sentence-level score, we use a GPT-2 model fine-tuned on the MultiWOZ dataset so that it learns the special tokens. For the session-level score, we use a BERT-base model as the binary classifier and use the average softmax logits.

Inform and Success Rates
As seen in Table 2, for the same model (such as MTTOD), the success rate under traditional evaluation is much lower than under interactive evaluation (85.40% compared to 91.00%). We observe that when the dialogue system and the user simulator are both optimized by the RL algorithm, the inform and success rates reach nearly 98%, indicating that with our proposed user simulator, the user's goals can be well accomplished by the dialogue system. Therefore, with pre-trained models, multi-task learning, and reinforcement learning using a strong user simulator, is MultiWOZ a solved dataset already? We might begin to consider such a possibility. The remarkably high inform and success rates under interactive evaluation, compared with traditional evaluation results, indicate that the bottleneck constraining previous evaluation performance might not be a deficiency of the dialogue system but the policy mismatch problem in traditional evaluation methods.

Different Models
We use both traditional and interactive evaluation methods to compare several state-of-the-art models. As seen in Table 2, the MTTOD model achieves the highest inform and success rates in the traditional evaluation, while the UBAR model achieves the highest in the interactive evaluation, indicating that the improvements among these state-of-the-art models are not large enough to create significant differences. Meanwhile, the session-score results show that the dialogues generated in traditional evaluations are incoherent, which is caused by the policy mismatch problem, whereas the dialogues generated in interactive evaluations are more coherent. We observe that the results of interactive evaluations are better than those of traditional evaluations for all models. This result shows that policy mismatch problems in traditional evaluations do limit model performance.

Sentence-Score Evaluation
In the interactive evaluation framework, the fluency of the generated responses cannot be properly measured, since the traditional BLEU score is not available. With the sentence-score, we can measure sentence fluency without reference texts. As seen in Table 2, responses generated in the interactive evaluation obtain the best performance; their sentence-scores are even better than those of the oracle test set. This is because utterances generated by models trained with maximum likelihood estimation prefer patterns that appear frequently in the training set, so the cross-entropy loss of these utterances is lower. In the oracle test set, by contrast, utterances are more complicated and diverse owing to different annotators, which can lead to higher cross-entropy loss than for generated utterances. Therefore, the sentence-scores of generated utterances are better than those of the oracle utterances from the test set. Besides, we observe that RL-based models obtain worse sentence-scores than supervised-learning-based models (0.87 compared to 0.79). That RL-based models achieve promising inform and success rates yet worse sentence-scores indicates that using reinforcement learning to optimize inform and success rates causes a degeneration of sentence quality. From this observation of the sentence-score in our interactive evaluation framework, we conclude that RL-based training might hurt the quality of the generated utterances.

Session-Score Evaluation
It is also important to consider session-level interaction quality in the interactive evaluation framework. Therefore, we construct the session-score to measure the coherence between user utterances and system responses. As seen, traditional evaluations achieve relatively low session-scores (about 83%), indicating that the user utterances and the system responses within a single session are incoherent. In the interactive evaluation process, session-scores are considerably higher, indicating that sessions generated by interactions are coherent. We can also use the session-score as a meta-evaluation of the test set sessions. As seen, the test set sessions achieve a significantly higher session-score than the sessions generated in the traditional evaluation process. Comparing sessions in the interactive evaluation with the test set, we observe that the sessions from interactions achieve higher session-scores (97.1% compared to 93.1%). We assume this is because the annotated test set contains more variance in session construction, which cannot be completely captured by neural-based scores. Through session-score evaluations, we conclude that when using interactive evaluation to test a TOD system, the dialogue session is more natural than when using annotated user utterances with generated responses in the traditional evaluation process. Besides, we observe that using reinforcement learning slightly hurts session coherence (93.8% compared to 97.1%).

RL Based Model Performance
Since RL methods are unstable, we run each RL-based model 5 times with different random seeds and report the variance in the Appendix.

RL-Method Results
As seen in Table 2, with RL training, the average inform and success rates achieve a significant improvement while the sentence-score and the session-score decrease. These results indicate that although the inform and success rates improve, the text quality might worsen. While RL-based methods are widely used in the joint training of dialogue systems and user simulators, our experimental results raise the concern that using the success rate as the reward function can hurt the quality of generated utterances.
Score Results as Rewards

Further, since we introduce new scores to evaluate the dialogue system, it is intuitive to utilize these scores as rewards to improve the RL training of the dialogue system. Therefore, in Table 3, we conduct experiments to explore whether using these scores as rewards is helpful. We design two new reward functions, RL-Sent and RL-Sess, which add the sentence-score or session-score to the success reward with weights α and β respectively. When α is 0.1 and β is 0, the reward setting is RL-Sent; when α is 0 and β is 0.1, the reward setting is RL-Sess. As seen, adding the sentence-score or session-score to the rewards for training the dialogue system obtains better results on the corresponding score. Still, using these rewards in RL causes more instability: some seeds achieve promising results given these scores as rewards, while others reach frustrating performance. We conclude that using these scores as rewards to improve the quality of the generated responses remains a significant challenge that requires a more careful design of the RL algorithm.
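One plausible reading of the RL-Sent / RL-Sess setup is a weighted sum of the success reward and the two scores. The exact combination below, including the negative sign on the sentence-score (where lower means more fluent), is our assumption rather than a formula quoted from the paper:

```python
def combined_reward(success, sent_score, sess_score, alpha=0.0, beta=0.0):
    """Weighted RL reward, as we read the RL-Sent / RL-Sess settings.

    success: 1.0 or 0.0 task-success reward.
    sent_score: sentence-level fluency score (lower is better, hence the
    negative sign). sess_score: session coherence confidence in [0, 1].
    alpha=0.1, beta=0 gives RL-Sent; alpha=0, beta=0.1 gives RL-Sess;
    alpha=beta=0 recovers the plain success-rate reward.
    """
    return success - alpha * sent_score + beta * sess_score
```

Because the score terms are small relative to the binary success reward, they act as a soft regularizer on generation quality rather than replacing the task objective.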

RL-Enhanced Responses
The responses generated by RL-based models can sometimes be stale in format and focus on key values that help improve the inform and success rates. As seen in the second group of Table 4, the system response includes inarticulate phrases such as 'cote is cb21uf' (Cote is the name of the restaurant and cb21uf is its postcode), since the key values postcode and name are important for calculating the inform and success metrics. The sentence-score also gives this utterance a worse score. This phenomenon indicates that incorporating the RL algorithm into training dialogue systems and simulators requires further attention to maintaining high response quality in addition to inform and success rates.

Session-Level Coherence
As seen in Table 4, the session-score predicts a lower score when evaluating the session from the traditional evaluation. The utterance 'I want to arrive by 12:45.' cannot match the dialogue response querying the departure place. This result indicates that sessions in the traditional evaluation process can be unnatural, which may constrain further improvement of TOD systems. Also, the high session-scores of sessions generated in the interactive evaluation process indicate that such an evaluation process is more natural and can therefore serve as a more appropriate evaluation standard for TOD systems.

Diversity in the Testset
Further, since we observe that the test set sessions achieve rather poor sentence-score results, we assume that the diversity in the test set brings difficulties for the sentence-score and session-score. We therefore examine cases in the test set and find, as seen in Table 4, that human-written system responses contain a large proportion of diversified phrasings such as 'be on the look out for'. These patterns are common in the test set sessions; we did not cherry-pick a system response with a bad sentence-score. Therefore, we conclude that our proposed sentence-score can make fair comparisons between model-generated responses, while neural model based scores can still be improved.

Conclusion
In this paper, we focus on the evaluation of current end-to-end TOD systems. We construct an automatic interactive evaluation framework with a strong user simulator. Besides, we obtain extremely high interactive evaluation performance on the MultiWOZ dataset by jointly training our user simulator and dialogue system. Through the interactive evaluation framework, we draw several hints for future studies on TOD systems: (1) current TOD needs more challenging and complicated datasets and scenarios; (2) an interactive evaluation process should be considered for the proper evaluation of TOD systems; (3) reinforcement learning used in training user simulators and dialogue systems requires more careful design to consider both task success rate and the quality of generated texts.

Limitations
Based on our experiments and conclusions, we note the following limitations of our work: • One limitation is that, given the current datasets used in TOD, we can only conclude that these datasets are well solved under the interactive evaluation setting, without exploring more challenging dataset settings that involve more difficult scenarios and evaluation perspectives.
• The proposed sentence-score and session-score are neural model-based scores trained by maximum likelihood estimation (MLE), which tend to give better scores to patterns that appear more frequently in the training data. This limitation is a general challenge for current neural network based automatic scores and can be explored further, and not only in the dialogue area.
• For the design of the reward function, our method of using a weighted sum of success rates, sentence-scores, and session-scores is a straightforward way to maintain both a high task success rate and generation quality. The design of RL-based methods given multiple rewards from different perspectives can be further improved in order to train a better dialogue system, which could be potential future work.

Figure 1: Illustration of interactions between users and systems. Traditional evaluation might face a policy mismatch between the utterances annotated in red.

Figure 2: Illustration of the goal-state guided simulator interaction process, including goal-state tracking and utterance generation.

Table 1 ,
we first test the simulator quality based on the traditional evaluation, specifically

Table 2: Results of different models under different evaluation frameworks.

Table 3: Results of different RL settings.

Table 4: Case studies in the MultiWOZ dataset with sentence and session scores. Example content:
User: ... am also looking for a restaurant that serves french food and is expensive.
System: There are 3 restaurants that meet your criteria. Would you like me to book you a table?
User: No, I just need the postcode and phone number.
System: The phone number is ... and the postcode is ... Is there anything I can help you with?
System: There are 3 options. Is there a particular area you would like to dine in?
User: No, I don't care. Can you give me the postcode and phone number?
System: Cote is cb21uf. Their phone number is 213-...

Table A1: Results of different random seeds under different RL settings.