Shades of BLEU, Flavours of Success: The Case of MultiWOZ

The MultiWOZ dataset (Budzianowski et al., 2018) is frequently used for benchmarking context-to-response abilities of task-oriented dialogue systems. In this work, we identify inconsistencies in data preprocessing and reporting of three corpus-based metrics used on this dataset, i.e., BLEU score and Inform & Success rates. We point out a few problems of the MultiWOZ benchmark such as unsatisfactory preprocessing, insufficient or under-specified evaluation metrics, or a rigid database. We re-evaluate 7 end-to-end and 6 policy optimization models in as-fair-as-possible setups, and we show that their reported scores cannot be directly compared. To facilitate comparison of future systems, we release our stand-alone standardized evaluation scripts. We also give basic recommendations for corpus-based benchmarking in future works.


Introduction
While human judgements are irreplaceable in dialogue system evaluation, and evaluating isolated responses given ground-truth contexts instead of full dialogues cannot fully measure system performance (Liu et al., 2016), corpus-based evaluation metrics, such as BLEU and corpus-based entity match and success rate (Wen et al., 2017), are still very important for model development and are often used to compare models and establish the state of the art. We show on the MultiWOZ benchmark (Budzianowski et al., 2018), one of the most frequently used and most challenging dialogue system datasets today, that these comparisons do not hold if several basic conditions are not met, and that these conditions are not met for most of the recent works using corpus-based evaluation on this dataset. This means that the assessment of progress in dialogue modeling is obscured by noise coming from differences in preprocessing or metric implementation variants. This paper is not a critique of the MultiWOZ benchmark or of systems evaluated on it. Instead, it is a call for consistency and increased rigor in automatic evaluation. In addition to providing the analysis and identifying problems with the benchmark and current state-of-the-art reporting, we include recommendations for consistency in corpus-based score comparisons. In particular, we advocate for: (1) using standardized implementations of metrics; (2) evaluating either on detokenized surface texts, or using standardized preprocessing and postprocessing; (3) reporting the exact scripts used for evaluation; (4) releasing system outputs. We also show that there is room for additional metrics of output diversity, and we add an observation on the overlap between the dialogue goals and states in training and test sections of the MultiWOZ data.
Our work can be summarized as follows:
• We identify, list, and discuss consistency issues associated with the MultiWOZ benchmark;
• We compare and re-evaluate 13 end-to-end or policy optimization systems, using a single implementation of metrics and preprocessing;
• We release the outputs of all compared systems in a unified format and provide stand-alone standardized evaluation scripts that allow for consistent comparison of future works on this dataset;
• In addition to standard MultiWOZ corpus-based metrics, we evaluate all systems in terms of the diversity of their outputs.

Related Work
Most works on evaluation methods in dialogue response generation (Deriu et al., 2021) focus on human evaluation (Walker et al., 1997), e.g., choosing the best methodology with respect to quality and consistency (Santhanam and Shaikh, 2019) or robustness (Dinan et al., 2019). Recent surveys in natural language generation reflect on divergence and inconsistency in human evaluation practice (Howcroft et al., 2020; Belz et al., 2020), in a similar spirit to our examination, but on a broader scale. Despite the availability of simulator evaluation (Schatzmann et al., 2006; Young et al., 2010), corpus-based metrics have been the go-to evaluation method in end-to-end neural dialogue systems since the first implementations (Wen et al., 2017; Eric and Manning, 2017) and remain a de facto standard today (cf. Section 3.3). There are works showing problems of corpus-based metrics: limited correlation with human judgements (Novikova et al., 2017) and mixed performance depending on the human reference texts used (Freitag et al., 2020) or the evaluated systems (Mathur et al., 2020). Many works aim at creating more reliable metrics (Galley et al., 2015). The recent focus is on trained neural metrics (Dziri et al., 2019; Mehri and Eskenazi, 2020), but they are not yet in wide use.
Our work is probably the closest to Post (2018)'s assessment of inconsistencies in different implementations of BLEU (Papineni et al., 2002, see Section 3.2), calling for comparability and proposing a standard implementation. To our knowledge, we are the first to evaluate the use of corpus-based metrics in dialogue systems in this fashion.

The MultiWOZ dataset
The MultiWOZ 2.0 dataset (Budzianowski et al., 2018) includes about 10k task-oriented dialogues in 7 domains (restaurants, hotels, tourist attractions, trains, taxi, hospital, police) with dialogue state and system action annotation. Larger domains (restaurants, hotels, attractions, trains) have an associated database. The data was collected via human-to-human interaction on a crowdsourcing platform using the Wizard-of-Oz approach (Wen et al., 2017). Crowd workers were instructed with goals such as booking or finding information about a restaurant or train (see Table 1). The dataset authors provided supporting code and baselines for dialogue state tracking (DST), context-to-response (CTR), and action-to-text generation tasks.
MultiWOZ 2.1: Eric et al. (2020) released an update with re-annotated dialogue states and added explicit system action annotation.
MultiWOZ 2.2 (Zang et al., 2020) has more fixes for state annotation in 17.3% of turns, a redefined ontology, and canonical forms for slot values (e.g. "13:00" for "1pm") for better DST evaluation. Additionally, it introduces slot span annotations allowing easy delexicalization, which was previously based only on string matching heuristics.

Corpus-based Metrics on MultiWOZ
All standard CTR metrics on MultiWOZ (BLEU, Inform rate, and Success rate) are calculated on delexicalized texts, i.e., texts where dialogue slot values, such as venue names, are replaced by placeholders. While using delexicalized utterances prevents errors in venue names from affecting the evaluation, it also precludes interactive human evaluation, model-based evaluation metrics known from open-domain dialogue research, and end-to-end evaluation with user simulators such as ConvLab.
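As an illustration of delexicalization by string matching, the following minimal sketch replaces known slot values with placeholders. The slot values, the placeholder names, and the `delexicalize` helper are assumptions made for this example; they are not taken from the MultiWOZ code base or from any of the evaluated systems.

```python
import re

# Hypothetical slot values for a single entity (not from the MultiWOZ database).
SLOT_VALUES = {
    "NAME": ["golden house"],
    "AREA": ["centre"],
    "PHONE": ["01223 000000"],
}


def delexicalize(text, slot_values):
    """Replace known slot values with placeholders using plain string matching."""
    out = text.lower()
    pairs = [(value, slot) for slot, values in slot_values.items() for value in values]
    # Longer values go first so that a short value cannot split a longer one.
    for value, slot in sorted(pairs, key=lambda pair: -len(pair[0])):
        out = re.sub(r"\b" + re.escape(value) + r"\b", slot, out)
    return out


response = "Golden House is in the centre. Their phone number is 01223 000000."
print(delexicalize(response, SLOT_VALUES))
```

Because the sketch only matches exact strings, any value that appears in a different surface form stays lexicalized, which is the kind of imprecision that string-matching heuristics introduce.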
BLEU (Papineni et al., 2002), originally designed for machine translation (MT) evaluation, is based on comparison of n-grams in human-written references and machine-generated hypotheses. Following Wen et al. (2017), BLEU is used to measure fluency of output responses where the human utterances are used as the reference. Using the metric for assessing fluency of the responses is not ideal, because as opposed to the intended use of BLEU, there is only a single reference available. Moreover, the set of valid responses is arguably larger for dialogue than for MT. Liu et al. (2016) show that metrics adopted from MT correlate very weakly with human judgements in dialogue responses.
Inform & Success rates: The Inform rate relates to informable slots, which are attributes that allow the user to constrain database searches, e.g., restaurant location or price range. The Success rate focuses on requestable slots, i.e., those that can be asked for by the user, e.g., phone number. Both are calculated on the level of dialogues. A dialogue is considered successful if the evaluated system provided all of the requested information for an entity satisfying the user's constraints. Following this definition, Wen et al. (2017) set aside the Match rate describing whether the entity found at the end of each dialogue matches the user's goal. However, MultiWOZ dialogues include multiple interleaving domains and calculating the rates only at the end is not sufficient.
Therefore, Budzianowski et al. (2018) mark a dialogue as successful if for each domain in the user's dialogue goal: (1) the last offered entity matches (satisfies the goal constraints), and (2) the system mentioned all requestable slots required by the user. The Inform rate is then the proportion of dialogues complying with (1), and the Success rate is the proportion of fully successful dialogues, i.e., those complying with both (1) and (2).
The offered entities and mentions of requestable slots are tracked over the delexicalized responses for the whole dialogue, making use of slot placeholders. If an utterance contains a slot naming an entity, e.g., restaurant name or train ID, the current dialogue state for the corresponding domain is used to query the database and an entry is sampled from the search results. At the end of a dialogue, the recorded entities and requestable slots are compared to expected values from the dialogue goal (see Appendix A for an example). The dialogue can thus be considered unsuccessful if the system does not mention a venue name or train ID at the right turn, does not track the user's search constraints, or ignores the user's requests. (In practice, the system must hit the single suitable turn, because responses are generated given ground-truth dialogue context.)

Systems Evaluating on MultiWOZ
We discuss the performance of 13 recent systems that use CTR evaluation on MultiWOZ: 7 end-to-end systems and 6 policy optimization systems, the latter of which use ground-truth dialogue states during training and inference. We include models for which we got test set predictions and systems with public code for which we managed to replicate reported results. Out of the 13 compared works, 7 only report BLEU, Inform, and Success with no other evaluation; 4 use human ratings of individual outputs, and only 2 include human evaluation on full dialogues. An important representative of the end-to-end systems is DAMD (Zhang et al., 2020b). It uses multi-action data augmentation and multiple GRU (Cho et al., 2014) decoders. Similarly, LABES (Zhang et al., 2020a) employs a few GRU-based decoders, but it represents the dialogue state as a latent variable. DoTS (Jeon and Lee, 2021) also uses GRUs, but the model makes use of a BERT encoder (Devlin et al., 2019) to get a context representation. MinTL applies a diff-based approach to state updates, with backbones based on the T5 and BART models (Raffel et al., 2020; Lewis et al., 2020). UBAR is based on a fine-tuned GPT-2 model (Radford et al., 2019), similarly to AuGPT (Kulhánek et al., 2021), which uses back-translations for response augmentation, and SOLOIST, which makes use of machine teaching (Shukla et al., 2020). We used author-provided outputs for SOLOIST and AuGPT, author-trained checkpoints for DoTS, LABES, and UBAR, and we trained DAMD and MinTL from scratch using publicly available code. DAMD, MinTL, and SOLOIST use MultiWOZ 2.0; the remaining models were trained on the 2.1 version. DAMD, LABES, MinTL, and UBAR are based on the same code base and use similar evaluation scripts.
We also compared 6 policy optimization models. SFN (Mehri et al., 2019), HDNO (Wang et al., 2021), and LAVA (Lubis et al., 2020) employ reinforcement learning; the other three models are HDSA, MarCo, and UniConv. We use the public predictions for LAVA and the provided pretrained models for the other models. UniConv and HDNO are trained on MultiWOZ 2.1, other systems use the 2.0 version. As opposed to end-to-end models, the version affects the evaluation because the ground-truth state is supplied to the model. The comparison of these systems is thus not completely fair, but we believe that the differences are small in comparison with the differences in evaluation scripts and setups (see Section 5.2).

Benchmark Caveats
While MultiWOZ and the associated metrics described in Section 3 represent the state of the art in corpus-based dialogue evaluation practice, the benchmark has the following limitations that researchers need to be aware of: (1) delexicalization problems, namely imprecise delexicalization based on string matching and varying implementations thereof (Section 4.1), (2) lack of standardized postprocessing (i.e., lexicalization methods, Section 4.2), (3) database problems, i.e., multiple surface forms of database values and no information about booking availability (Section 4.3), (4) atypical metric implementations (Section 4.4), (5) lack of diversity evaluation (Section 4.5), (6) similarity between training and test data (Section 4.6).

Preprocessing
CTR evaluation metrics used in the benchmark work with delexicalized texts (see Section 3.2). However, the implementation of delexicalization provided with the dataset is limited; it only applies to some expressions, leaving other slot values lexicalized. That is why most systems use their own delexicalization methods. The original delexicalization uses placeholders consisting of the domain name and the slot name, e.g. taxi phone. Recent works following DAMD (Zhang et al., 2020b) remove domain names from the placeholders and determine the active domain from changes in the predicted dialogue state or model it directly. We identified five different delexicalization styles among the 13 systems described in Section 3.3. Table 2 shows a sample system turn for which the outputs of all the delexicalization approaches are different. This is a problem since all works use their own preprocessed data as references for BLEU computation. We checked the test set for slot placeholders and found that 70.61% of the utterances contain a slot in at least one delexicalized variant, and only 17.52% of the responses with slots match exactly across all the systems. Moreover, preprocessing scripts of some works remove contracted verb forms or keep suffixes such as "-s", "-ly" when delexicalizing nouns or adverbs, e.g., "moderately" becomes "[pricerange]-ly".

Postprocessing
The MultiWOZ code base does not implement backward lexicalization of texts. Out of 12 systems for which we have the source code available, only four offer scripts for lexicalizing slot values and thus allow further in-depth evaluation.

Database: Surface Forms and Booking
The original MultiWOZ implementation of the database performs only subtle normalization of the database search constraints, such as replacing "&" with "and". However, the slot values can have multiple valid surface forms; e.g., "4pm" and "16:00" or "the botanical gardens at cambridge university" and "cambridge university botanic gardens" correspond to the same database entities. Database query normalization is crucial for end-to-end systems, as opposed to the policy optimization models, which use ground-truth dialogue states with normalized values. The flexibility of the database might affect the Inform & Success rates, because they are based on information about database entries complying with the current dialogue state.
The original database does not contain any information about booking availability, because during the data collection, crowd workers were sometimes instructed to refuse a booking at a specific time, ask for another place, etc., and accept the booking with new constraints. This brings a problem into the evaluation, because some works use the ground-truth booking information (mined from the dialogue state and system action annotations) even during evaluation, whereas others ignore it and let their systems behave randomly.

Evaluation
BLEU: The original MultiWOZ BLEU implementation internally uses a trivial tokenization splitting on whitespace. However, current models often use subword tokenization and complex detokenization to remove any redundant whitespace (Sennrich et al., 2016;Kudo and Richardson, 2018). This new-style detokenization might produce words with leading or trailing punctuation. Some works ignore this fact completely, or use an alternative BLEU implementation, including tokenization, from NLTK (Bird and Loper, 2004).
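The effect of tokenization on n-gram matching can be seen in a small sketch. It computes only the clipped n-gram precision that BLEU builds on, not the full metric, and the two tokenizers (`split_whitespace`, `split_punct`) are simplified stand-ins assumed for this example rather than the tokenizers of any particular BLEU implementation.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp_tokens, ref_tokens, n):
    """Modified n-gram precision of a hypothesis against a single reference."""
    hyp, ref = ngrams(hyp_tokens, n), ngrams(ref_tokens, n)
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return overlap / max(sum(hyp.values()), 1)


def split_whitespace(text):
    """Trivial tokenization: split on whitespace only."""
    return text.split()


def split_punct(text):
    """Simplified tokenization that separates punctuation from words."""
    return re.findall(r"\w+|[^\w\s]", text)


hypothesis = "the address is ADDRESS."   # detokenized model output
reference = "the address is ADDRESS ."   # whitespace-tokenized reference

for tokenize in (split_whitespace, split_punct):
    p1 = clipped_precision(tokenize(hypothesis), tokenize(reference), 1)
    print(tokenize.__name__, round(p1, 2))
```

With whitespace splitting, the token "ADDRESS." does not match "ADDRESS" in the reference, so an otherwise identical response loses unigram and bigram matches.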
Inform & Success rate: We found two main problems here. The first one comes from random database entry sampling: if multiple entities match the dialogue state, one of them is sampled at random from the database results. The set of entries complying with the dialogue state does not have to be a subset of the ground-truth set of entries complying with a given prescribed user goal from the test set. If the database results and the ground-truth set have an imperfect overlap, the sampling may choose an entry from the difference of the two sets, which is counted as a failure. However, if an entry from the intersection of the two sets is chosen, it counts as a match, which may lead to overestimating the system performance. Some systems bypass this by comparing the sets and accepting a dialogue as matching if the sets are intersecting, or if the offered set is a non-empty subset of the ground-truth set. However, these differences result in large variances in the rates (see Section 5).
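The three matching criteria mentioned above can be contrasted in a short sketch. The function names and the entity IDs are hypothetical and serve only to show how the same offered set can be judged differently.

```python
import random


def match_by_sampling(offered, goal, rng):
    """Sample one offered entry at random and check it against the goal set."""
    return bool(offered) and rng.choice(sorted(offered)) in goal


def match_by_intersection(offered, goal):
    """Accept if at least one offered entry is also a goal entry."""
    return bool(offered & goal)


def match_by_subset(offered, goal):
    """Accept only if the offered entries are a non-empty subset of the goal entries."""
    return bool(offered) and offered <= goal


offered = {1, 2, 3}   # hypothetical IDs of entries complying with the dialogue state
goal = {2, 3, 4}      # hypothetical IDs of entries complying with the user goal

print(match_by_intersection(offered, goal))  # the sets overlap
print(match_by_subset(offered, goal))        # entry 1 is not a goal entry
rng = random.Random(0)
print([match_by_sampling(offered, goal, rng) for _ in range(5)])
```

For imperfectly overlapping sets, sampling makes the outcome depend on the random seed, intersection matching always accepts, and subset matching always rejects.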
Another problem is related to the domain-oblivious delexicalization proposed by Zhang et al. (2020b). MultiWOZ responses very rarely contain slots from multiple domains at the same time, so it is sufficient to consider a single active domain for each turn. However, some works that adopt this new delexicalization use the ground-truth active domain during evaluation. Note that true domains have to be inferred from changes in ground-truth dialogue states and system actions.

Output Diversity Metrics
The standard MultiWOZ metrics do not cover the diversity of the outputs, which can show the formulaic or repetitive nature of a system's responses (Holtzman et al., 2020). While diversity is typically measured for non-task-oriented dialogue (Li et al., 2016), we argue that it can serve as an indicator of the naturalness of using a system over longer periods of time even in task-oriented dialogue such as MultiWOZ (Oraby et al., 2018).

Dataset folds
MultiWOZ authors split the data into train, validation, and test folds randomly. Following Lampouras and Vlachos (2016)'s analysis of train-test overlap on other datasets, we inspected the goals of all 1000 test dialogues; 174 of them are also present in the train or validation folds. The test fold does not contain any unseen slot-value pairs, and has only 12 new domain-slot-value triplets. This means that the evaluation does not really check the generalization capabilities of the systems' state tracking, and it theoretically allows the systems to memorize the whole database and bypass it during operation, which is a rather unrealistic assumption.

Experiments
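In this section, we work with outputs produced by all systems described in Section 3.3. We: (1) unify their responses in terms of delexicalization styles, and then compare BLEU when different delexicalized references are used, (2) compare Inform & Success rates obtained with a single implementation to the reported ones, (3) evaluate diversity and discuss similarity of the responses.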

Setup
We report BLEU scores for six different delexicalized references (see Table 2). Five of them are styles used in HDSA, DAMD, AuGPT, UniConv, and LAVA. The sixth is the delexicalization obtained from the MultiWOZ 2.2 span annotations. To make the BLEU-based comparison as fair as possible, we normalized the models' raw outputs. First, we remove start-of-sequence tokens, all "-s" and "-ly" strings and all "s" or "es" attached to a slot placeholder. Subsequently, we lowercase the utterances, identify slot names and map them to a unified slot name ontology. The ontology contains only 18 slot names (the original domain-aware delexicalization uses around 40 slot names). It is possible to map all the slot names used in the 6 different delexicalization styles onto it. The mapping is lossy: to make a single mapping possible, it discards the finer level of detail provided by some systems. For example, slots named departure, destination, and taxi destination are all replaced with the PLACE placeholder. Finally, we pass the utterances through Moses tokenizer and detokenizer 10 (Koehn et al., 2007). To calculate BLEU, we use the SacreBLEU package 11 (Post, 2018), which provides an implementation compatible with the original and is now a de facto standard in MT (cf. Section 2).
9 Note that we work with original authors' predictions, published pre-trained weights, or models trained from scratch, and thus we are not able to carry out a statistical analysis for the reported numbers.
10 See https://github.com/alvations/sacremoses
11 See https://github.com/mjpost/sacrebleu
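The normalization steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released script: the bracketed placeholder format, the `<sos_r>` start-of-sequence token, and the partial `SLOT_MAP` are hypothetical, and the final Moses tokenization and detokenization step is omitted.

```python
import re

# Partial, illustrative mapping onto a unified slot ontology (assumed names).
SLOT_MAP = {
    "departure": "PLACE",
    "destination": "PLACE",
    "taxi_destination": "PLACE",
    "pricerange": "PRICE",
    "restaurant_name": "NAME",
}


def normalize(text, sos_token="<sos_r>"):
    """Normalize a delexicalized response before computing BLEU."""
    text = text.replace(sos_token, " ")             # remove start-of-sequence tokens
    text = re.sub(r"-s\b|-ly\b", "", text)          # remove "-s" and "-ly" strings
    text = re.sub(r"\](?:es|s)\b", "]", text)       # remove "s"/"es" glued to a placeholder
    text = text.lower()

    def unify(match):
        name = match.group(1)
        # Slot names without an entry in the mapping are only upper-cased.
        return SLOT_MAP.get(name, name.upper())

    text = re.sub(r"\[([a-z_]+)\]", unify, text)    # map slot names to the unified ontology
    return " ".join(text.split())                   # collapse redundant whitespace


raw = "<sos_r> The [pricerange]-ly priced [restaurant_name]s go to [taxi_destination] ."
print(normalize(raw))
```

Mapping several fine-grained slot names onto one placeholder is what makes the normalization lossy, as noted above for departure, destination, and taxi destination.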
Inform & Success rates depend on the database. Our database uses fuzzy matching for the different surface forms (see Section 4.3) using the FuzzyWuzzy package with a similarity threshold of 90%. We use several rules to transform time strings, venue names, food types, and venue types to canonical forms matching the entries in the database (e.g., "ten o'clock p.m." is replaced with "22:00").
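A rough sketch of such query normalization is given below. It is not the released implementation: the standard-library `difflib.SequenceMatcher` is swapped in for FuzzyWuzzy, the `canonical_time` rule covers only digit-based time strings (not spelled-out forms such as "ten o'clock p.m."), and the database values are hypothetical.

```python
import re
from difflib import SequenceMatcher


def canonical_time(value):
    """Map strings such as '4pm' or '4:30 pm' to 'HH:MM'; return other input unchanged."""
    m = re.fullmatch(r"\s*(\d{1,2})(?::(\d{2}))?\s*(am|pm)?\s*", value.lower())
    if not m:
        return value
    hour, minute, suffix = int(m.group(1)), int(m.group(2) or 0), m.group(3)
    if suffix == "pm" and hour < 12:
        hour += 12
    if suffix == "am" and hour == 12:
        hour = 0
    return f"{hour:02d}:{minute:02d}"


def similarity(a, b):
    """Character-level similarity in [0, 1] (stand-in for a FuzzyWuzzy ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_lookup(value, db_values, threshold=0.9):
    """Return the most similar database value, or None if it is below the threshold."""
    best = max(db_values, key=lambda candidate: similarity(value, candidate))
    return best if similarity(value, best) >= threshold else None


db_names = ["cambridge university botanic gardens", "cambridge arts theatre"]
print(canonical_time("4pm"))
print(fuzzy_lookup("cambridge university botanic garden", db_names))
```

A character-level threshold handles small spelling variants, while reordered names such as "the botanical gardens at cambridge university" are the kind of case that calls for explicit rules.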
Our implementation of the Inform & Success rates follows the definition in Section 3.2. The list of offered database entries, i.e., those complying with the current dialogue state, is updated only if a venue name or a train ID is mentioned (cf. Table 3). Following HDSA, we accept a dialogue as matching if the set of offered entries is a non-empty subset of the set of entries matching the particular dialogue goal. Active domains of turns are taken from the original slot names if possible. If slot placeholders do not include the domain name, we either use model predictions if available, or estimate the domain from changes of state predictions in subsequent turns.
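The bookkeeping described in this paragraph can be illustrated with a simplified sketch. The turn and goal data structures and the function names are assumptions made for this example, and the sketch presumes that database lookups and active domains have already been resolved for every turn; it is not the released evaluation script.

```python
def evaluate_dialogue(turns, goal):
    """Return (match, success) for a single dialogue.

    turns: list of dicts with hypothetical keys
        "domain"        active domain of the system turn,
        "placeholders"  set of slot placeholders in the delexicalized response,
        "db_matches"    set of entity IDs complying with the current dialogue state.
    goal: dict mapping each goal domain to
        {"entities": set of goal entity IDs, "requestable": set of slot names}.
    """
    offered = {domain: set() for domain in goal}
    provided = {domain: set() for domain in goal}
    for turn in turns:
        domain = turn["domain"]
        if domain not in goal:
            continue
        # The offered set changes only if a venue name or a train ID is mentioned.
        if turn["placeholders"] & {"NAME", "TRAINID"}:
            offered[domain] = set(turn["db_matches"])
        provided[domain] |= turn["placeholders"] & goal[domain]["requestable"]
    # Matching: offered entries are a non-empty subset of the goal entries.
    match = all(bool(offered[d]) and offered[d] <= goal[d]["entities"] for d in goal)
    # Success: matching, and all requested slots were mentioned.
    success = match and all(goal[d]["requestable"] <= provided[d] for d in goal)
    return match, success


def rates(results):
    """Inform and Success rates (in %) from a list of (match, success) pairs."""
    n = len(results)
    inform = 100.0 * sum(match for match, _ in results) / n
    success = 100.0 * sum(success for _, success in results) / n
    return inform, success


goal = {"restaurant": {"entities": {19212, 19185, 19197, 19219},
                       "requestable": {"ADDRESS", "POST"}}}
turns = [
    {"domain": "restaurant", "placeholders": {"NAME", "ADDRESS"}, "db_matches": {19212, 19185}},
    {"domain": "restaurant", "placeholders": {"POST"}, "db_matches": {19212, 19185}},
]
print(evaluate_dialogue(turns, goal))
```

The goal entity IDs are those of the Appendix A example; the per-turn database matches are invented for illustration.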
To better explain the differences between the reported scores and ours, we also provide optimistic Inform & Success rates, which adopt deviations from the original implementation that we found in some systems and that can potentially overestimate results. In this setting, we: (1) use intersection entry matching instead of subset matching, (2) ignore other search constraints if a name or ID is provided, (3) use ground-truth active domains. 13 Note that (2) is more permissive with respect to the system's state tracking as the ground-truth context used during response prediction often contains ground-truth names or IDs. These are then used for the database search even if user constraints are not predicted correctly.

Results
BLEU: Table 4 summarizes BLEU evaluation using different reference texts. We notice that using a different delexicalization might substantially change the score (up to 2% BLEU absolute). Most systems perform best on the references produced by their native delexicalization used for training. We can also see that different delexicalization styles result not only in different absolute values, but also in a different relative ordering of the systems. This shows that having a single standard delexicalization (which should always be used for model evaluation and score comparison, and preferably also during model development) is very important for any fair comparison between the models. Unlike in the case of end-to-end systems, the reported scores of the policy optimization models are higher than ours.
13 We adopt the scripts for getting ground-truth active domains from DAMD's code base.
Inform & Success rate: Table 5 shows our numbers and the reported numbers for Inform & Success. The corpus data, i.e., ground-truth responses and dialogue states, yield an Inform rate of 93.7% and a Success rate of 90.9%. When evaluating in the optimistic setup, these numbers grow to 97.9% and 96.6%, respectively.
Our numbers differ from the reported scores of end-to-end models to a large degree, e.g., DAMD's reported performance is around 20% higher for both rates. However, the optimistic setting results in much lower differences. This shows that DAMD has problems with DST, which are hidden in the optimistic setup. The original UBAR numbers are very high because some ground-truth data were used during evaluation. AuGPT reports higher rates caused by a different Inform rate computation, where the set of offered venues is obtained only at the end of the dialogue. Our scores are similar to the reported ones for SOLOIST and DoTS. UniConv shows the largest difference in rates among the policy optimization models (ca. 17% for both metrics). LAVA reports higher rates similar to ours in the optimistic setting, but the difference is small and may be caused by MultiWOZ version differences. Our rates for SFN are much higher than the reported ones. The differences in rates for MarCo and HDSA can be attributed to our more flexible database.

Table (contents not recoverable). Columns: Measure, Ref., End-to-end models, Policy optimization models. Each system has its own column. "*" denotes that scores for this system are computed on a subset of 91.66% test utterances. SOLO., LAB., UC stand for SOLOIST, LABES, and UniConv, respectively.

Evaluating Diversity
While the scores and rates differ between the evaluated systems, the generated utterances are similar and uniform (cf. Appendix B). To further understand differences between the systems, we analyzed the diversity of their responses (see Table 6).
We compare the texts on several diversity measures, following van Miltenburg et al. (2018) and Dušek et al. (2020): number of unique output tokens and trigrams, Shannon entropy and bigram conditional entropy, mean segmental type-token ratio (MSTTR-50), 14 and average output length. We used the normalized texts with unified slot ontology (see Section 5.2) for the comparison. The ground-truth responses with MultiWOZ 2.2 delexicalization were used as reference. Even though the systems use different delexicalization schemes, we can draw some conclusions from the analysis. First, all the systems use rather small vocabularies. The number of used trigrams is orders of magnitude lower compared to human-produced texts. The bigram conditional entropy is also much lower for all systems. Models which employ reinforcement learning, i.e., HDNO, SFN, and LAVA, produce the least diverse outputs. HDNO uses only 315 trigrams, which is around 1.2% of the distinct trigrams seen in reference texts. On the other hand, AuGPT, UBAR, and DoTS seem to use a broader range of expressions. The outputs of SOLOIST are extraordinarily diverse and long. However, they are still much closer to the other models than to the human reference.
14 MSTTR measures the average type-token ratio over the output text cut into segments of equal length (50 in our case). This reduces dependency on the overall text length, which is very strong in regular type-token ratio.
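The diversity measures listed above can be computed along the following lines. This is a minimal sketch rather than the released script: it assumes whitespace-tokenized responses, counts n-grams within responses only, and drops the trailing incomplete segment when computing MSTTR; the function names are illustrative.

```python
import math
from collections import Counter


def ngram_counts(responses, n):
    """Count n-grams within each tokenized response (never across responses)."""
    counts = Counter()
    for tokens in responses:
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def shannon_entropy(counts):
    """Entropy (in bits) of the distribution given by a Counter."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def bigram_conditional_entropy(responses):
    """H(w_i | w_{i-1}) in bits, estimated from bigram counts."""
    bigrams = ngram_counts(responses, 2)
    contexts = Counter()
    for (first, _), count in bigrams.items():
        contexts[first] += count
    total = sum(bigrams.values())
    return -sum(c / total * math.log2(c / contexts[b[0]]) for b, c in bigrams.items())


def msttr(tokens, segment=50):
    """Mean type-token ratio over consecutive segments of equal length."""
    segments = [tokens[i:i + segment] for i in range(0, len(tokens) - segment + 1, segment)]
    if not segments:
        return float("nan")
    return sum(len(set(s)) / segment for s in segments) / len(segments)


def diversity(texts, segment=50):
    """Collect the diversity measures for a list of response strings."""
    responses = [text.split() for text in texts]
    flat = [token for tokens in responses for token in tokens]
    return {
        "unique_tokens": len(set(flat)),
        "unique_trigrams": len(ngram_counts(responses, 3)),
        "entropy": shannon_entropy(ngram_counts(responses, 1)),
        "cond_entropy": bigram_conditional_entropy(responses),
        "msttr": msttr(flat, segment),
        "avg_length": len(flat) / len(responses),
    }


print(diversity(["a b a b", "a c"], segment=3))
```

A system that repeats a handful of templates yields few unique trigrams and a low conditional entropy, because the next word is almost fully determined by the previous one.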

Conclusion
The MultiWOZ benchmark is unique for its size and the inclusion of a complete database, making it possible to build end-to-end task-oriented dialogue systems. Because of its naturalness and thanks to multiple fixes and revisions of state annotations, it became very popular for dialogue state tracking. However, it still has limitations for context-to-response generation, partially because of the lack of standardized preprocessing and postprocessing. Since standard, easy-to-use evaluation scripts are not available, researchers are motivated to include their own modifications. This may appear unimportant, but as we showed in our analysis of 13 systems' outputs, it results in large differences in scores and makes any comparison or tracking of progress in this area problematic.
We contribute to the solution of this problem by releasing evaluation scripts, which allow consistent evaluation of future work. We further include the evaluation of output diversity, which adds an important aspect missing from corpus-based MultiWOZ evaluation so far.
Future work should include a manual revision of MultiWOZ 2.2 span annotation to reduce training noise and to enable fair evaluation on lexicalized outputs. More important, however, is the use of human evaluation and evaluation of full dialogues in addition to corpus-based metrics (Liu et al., 2016), which is still not standard for end-to-end dialogue systems (cf. Section 3.3).
Goal database entries (ID): 19212, 19185, 19197, 19219 (cheap and Chinese search constraints)
System: thank you for using our services.

Table 7 walks through the process of Inform & Success calculation. Rows group conversation turns. The first column shows the last user utterance, the corresponding ground-truth system response and the delexicalized and normalized generated response. The second column shows the current dialogue state. The "Offered entities" column shows the changes of the set of matching venue or train IDs. Note that the set is updated only if the generated response contains the NAME or TRAINID placeholder. The "Active domain" column shows the currently active domain. The "Provided Info" column lists requestable slots mentioned until the given point. This sample conversation is (1) matching, i.e., contributes positively to the Inform rate, because the set of offered entities or database entries at the end of the dialogue is a non-empty subset of the set of goal database entries, and also (2) successful because it is matching and all requested information defined by the dialogue goal, i.e., the restaurant address and post code, were provided. Note that the ground-truth context is used to generate system utterances during the evaluation, and thus the dialogue state might contain information from past ground-truth system utterances (such as the area slot in our example, which was never mentioned in the generated system or user utterances).

Tables 8 and 9 give a comparison of delexicalized utterances generated by all examined systems for two different contexts. Slot placeholders are shown in capital letters. While there are some differences in the systems' behavior, the outputs often contain very similar phrasing.

Context
User: Hello, I am doing research on plant life in England, and visiting cambridge as part of that, can you give me more info on the cambridge university botanic gardens.

System Normalized utterance
Corpus that sounds like an exciting trip! NAME is located in the AREA on ADDRESS. its admission fee is PRICE.
DAMD the address is ADDRESS and the postcode is POST. is there anything else i can help you with?
MinTL NAME is a TYPE in the AREA. it is located at ADDRESS. their phone number is PHONE.
UBAR NAME is a TYPE located in the AREA at ADDRESS. it is PRICE to enter and is located on ADDRESS. is there anything else i can help you with?
SOLOIST certainly! NAME is located at ADDRESS. their phone number is PHONE. is there anything else i can help you with?
AuGPT NAME is a TYPE located in the AREA of town. the entrance fee is 4 pounds. the address is ADDRESS.
LABES NAME is located in the AREA area of town
DoTS NAME is a TYPE in the AREA. it's free to enter. do you need any other information?
MarCo NAME is a nightclub in the AREA of town. the address is ADDRESS and the postcode is POST.
HDSA it is a museum in the AREA of town. the entrance fee is free. would you like the address?
HDNO NAME is a museum in the AREA. the address is ADDRESS. the phone number is PHONE. is there anything else i can help you with?
SFN NAME is located at ADDRESS. the postcode is POST. the phone number is PHONE. can i help you with anything else?
UniConv sure! it is a TYPE in the AREA part of town. would you like the address?
LAVA the NAME is located at ADDRESS, postcode POST. the phone number is PHONE.
Table 8: Side-by-side comparison of system outputs for the 1st system turn of the conversation PMUL3012.

Context
User: I'm looking for a nice place to eat.
System: There are many fine places to eat. What type of cuisine would you prefer?
User: I don't care about the cuisine type, but I want it to be somewhere expensive and in the centre please.

System Normalized utterance
Corpus there are COUNT PRICE eateries in the AREA. how about one of my favorites -NAME? they serve FOOD food and it's delicious!
DAMD how about NAME? it is an FOOD restaurant that is in the AREA of town and is in the PRICE price range .
LAVA NAME is located at ADDRESS, postcode POST. the phone number is PHONE.
Table 9: Side-by-side comparison of system outputs for the 2nd system turn of the conversation PMUL2489.