MultiWOZ 2.4: A Multi-Domain Task-Oriented Dialogue Dataset with Essential Annotation Corrections to Improve State Tracking Evaluation

The MultiWOZ 2.0 dataset has greatly stimulated research on task-oriented dialogue systems. However, its state annotations contain substantial noise, which hinders a proper evaluation of model performance. To address this issue, massive efforts have been devoted to correcting the annotations, and three improved versions (i.e., MultiWOZ 2.1-2.3) have since been released. Nonetheless, plenty of incorrect and inconsistent annotations remain. This work introduces MultiWOZ 2.4, which refines the annotations in the validation set and test set of MultiWOZ 2.1. The annotations in the training set remain unchanged (identical to MultiWOZ 2.1) to encourage robust and noise-resilient model training. We benchmark eight state-of-the-art dialogue state tracking models on MultiWOZ 2.4. All of them demonstrate much higher performance than on MultiWOZ 2.1.


Introduction
Task-oriented dialogue systems serve as personal assistants. They play an important role in helping users accomplish numerous tasks such as hotel booking, restaurant reservation, and map navigation. An essential module in task-oriented dialogue systems is the dialogue state tracker, which aims to keep track of users' intentions at each turn of a conversation. The state information is then leveraged to determine the next system action and generate the next system response.
However, substantial noise has been found in the dialogue state annotations of MultiWOZ 2.0 (Eric et al., 2020). To remedy this issue, Eric et al. (2020) fixed 32% of dialogue state annotations across 40% of the dialogue turns, resulting in an improved version, MultiWOZ 2.1. Despite the significant improvement in annotation quality, MultiWOZ 2.1 still severely suffers from incorrect and inconsistent annotations (Hosseini-Asl et al., 2020). The state-of-the-art joint goal accuracy (Zhong et al., 2018) for dialogue state tracking on MultiWOZ 2.1 is merely around 60% (Li et al., 2021). Even worse, the noise in the validation set and test set makes it challenging to assess model performance properly and adequately. To reduce the impact of noise, different preprocessing strategies have been utilized by existing models. For example, TRADE (Wu et al., 2019) fixes some general annotation errors. SimpleTOD (Hosseini-Asl et al., 2020) cleans partial noisy annotations in the test set. TripPy (Heck et al., 2020) constructs a label map to handle value variants. These preprocessing strategies, albeit helpful, lead to an unfair performance comparison. In view of this, we argue that it is valuable to further refine the annotations of MultiWOZ 2.1.
As a matter of fact, massive efforts have already been made to further improve the annotation quality of MultiWOZ 2.1, resulting in MultiWOZ 2.2 and MultiWOZ 2.3. In this work, we introduce MultiWOZ 2.4, an updated version on top of MultiWOZ 2.1, to improve dialogue state tracking evaluation. Specifically, we identify and fix all the incorrect and inconsistent annotations in the validation set and test set. This refinement results in changes to the state annotations of more than 40% of turns over 65% of dialogues. Since our main purpose is to improve the correctness and fairness of model evaluation, the annotations in the training set remain unchanged. Even so, the empirical study shows that much better performance can be achieved on MultiWOZ 2.4 than on all the previous versions (i.e., MultiWOZ 2.0-2.3). Furthermore, a noisy training set motivates us to design robust and noise-resilient training mechanisms, e.g., data augmentation (Summerville et al., 2020) and noisy label learning (Han et al., 2020a). Considering that collecting noise-free large multi-domain dialogue datasets is costly and labor-intensive, we believe that training robust dialogue state tracking models from noisy training data will be of great interest to both industry and academia.

Annotation Refinement
In the MultiWOZ 2.0 dataset, the dialogue state is represented as a set of predefined slots (refer to Table 1 for all the slots of each domain) and their corresponding values. The slot values are extracted from the dialogue context. For example, attraction-area=centre means that the slot is attraction-area and its value is centre. Since a dialogue may involve multiple domains and each domain also has multiple slots, it is impractical to ensure that the state annotations obtained via a crowdsourcing process are consistent and noise-free. Even though MultiWOZ 2.1 has tried to correct the annotation errors, the refining process was based on crowdsourcing as well. Therefore, MultiWOZ 2.1 still suffers from incorrect and inconsistent annotations.
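For illustration, a dialogue state can be viewed as a mapping from slot names to values. The sketch below follows the running example above; it is our own minimal illustration, not the dataset's actual file format, and the second slot is hypothetical:

```python
# A dialogue state maps predefined slot names to values extracted from
# the dialogue context. Unfilled slots are simply absent, which is
# equivalent to their taking the special value "none".
state = {
    "attraction-area": "centre",
    "restaurant-food": "chinese",  # hypothetical second slot for illustration
}

def get_value(state, slot):
    """Return a slot's value, treating missing slots as 'none'."""
    return state.get(slot, "none")
```

Representing unfilled slots implicitly keeps each state small, since a dialogue typically fills only a handful of the 30 predefined slots.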

Annotation Error Types
We identify and fix ten types of annotation errors (including inconsistent annotations) in the validation set and test set of MultiWOZ 2.1. Figure 1 shows an example for each error type.
• Context Mismatch: The slot has been annotated, but its value is inconsistent with the one mentioned in the dialogue context.
• Mis-Annotation: The slot is not annotated, even though its value has been mentioned.
• Not Mentioned: The slot has been annotated, but its value has not been mentioned in the dialogue context at all.
• Multiple Values: The slot should have multiple values, but not all values are included.
• Typo: The slot has been correctly annotated, except that its value includes a typo.
• Implicit Time Processing: This relates to the slots that take time as the value. Instead of copying the time specified in the dialogue context, the value has been implicitly processed (e.g., adding or subtracting 15 min).
• Slot Mismatch: The extracted value is correct, but it has been matched to a wrong slot.
• Incomplete Value: The slot value is a substring or an abbreviation of its full form (e.g., "Thurs" vs. "Thursday").
• Delayed Annotation: The slot has been annotated several turns later than the turn in which its value was first mentioned in the dialogue context.
• Unnecessary Annotation: These annotations are not incorrect, but they exacerbate inconsistencies, as different annotators have different opinions on whether to annotate these slots. In general, the values of these slots are mentioned by the system to respond to previous user requests or to provide supplementary information. We found that in most dialogues these slots are not annotated; hence, we remove these annotations. The name-related slots are an exception: if the user requests more information (e.g., address and postcode) about the recommended "name", the slots will be annotated.
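As a rough illustration of how some of these error types can be surfaced automatically, the sketch below flags "Not Mentioned" candidates by checking whether each annotated value appears verbatim in the dialogue context. This is a simplified heuristic of our own for illustration only (it ignores paraphrases, coreference, and value variants); it is not the procedure used to refine the dataset:

```python
def flag_not_mentioned(context_turns, state):
    """Return slots whose annotated value never appears verbatim in the
    dialogue context (candidate 'Not Mentioned' annotation errors)."""
    context = " ".join(context_turns).lower()
    flagged = []
    for slot, value in state.items():
        if value in ("none", "dontcare"):
            continue  # special values need no textual support
        if value.lower() not in context:
            flagged.append(slot)
    return flagged
```

In practice such a checker only narrows down candidates for manual inspection, since legitimate values can be implied rather than stated (e.g., the implicitly processed time values above).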

Annotation Refinement Procedure
In the validation set and test set of MultiWOZ 2.1, there are 2,000 dialogues with more than 14,000 dialogue turns. There are 5 domains with a total of 30 slots (the bus domain and hospital domain only occur in the training set). To guarantee that the refined annotations are as correct and consistent as possible, we decided to rectify the annotations ourselves rather than rely on crowd-workers. However, checking the annotations of all 30 slots at each turn would be too heavy a workload. To ease the burden, we instead only checked the annotations of turn-active slots. A slot being turn-active indicates that its value is determined by the dialogue context of the current turn and is not inherited from previous turns. The average number of turn-active slots in the original annotations and in the refined annotations is 1.16 and 1.18, respectively. The full dialogue state is then obtained by accumulating all turn-active states from the first turn to the current turn.
We also observed that some slot values are mentioned in different forms, such as "concert hall" vs. "concerthall" and "guest house" vs. "guest houses". The name-related slot values may have the word "the" at the beginning, e.g., "Peking restaurant" vs. "the Peking restaurant". We normalized these variants by selecting the most frequent form. In addition, all time-related slot values have been converted to the 24-hour format. We performed the above refining process twice to reduce mistakes; it took us one month to finish this task. Table 2 shows the count and percentage of slot values changed in MultiWOZ 2.4 compared with MultiWOZ 2.1. Note that none and dontcare are regarded as two special values. As can be seen, most slot values remain unchanged. This is because a dialogue only has a few active slots and all the other slots always take the value none. Table 3 further reports the ratio of refined slots, turns, and dialogues. Here, the ratio of refined slots is computed on the basis of refined turns.
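The accumulation of turn-active annotations into full dialogue states can be sketched as follows (our own illustration of the procedure described above; a new value for a slot overwrites the previously accumulated one):

```python
def accumulate_states(turn_active_states):
    """Build the full dialogue state at each turn by accumulating
    turn-active slot-value pairs from the first turn up to that turn."""
    full_state = {}
    per_turn_full_states = []
    for active in turn_active_states:
        full_state.update(active)  # later values overwrite earlier ones
        per_turn_full_states.append(dict(full_state))  # snapshot for this turn
    return per_turn_full_states
```

This is why checking only the turn-active slots suffices: every full state is fully determined by the sequence of turn-active annotations.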
It is shown that the corrected states relate to more than 40% of turns over 65% of dialogues. On average, the annotations of 1.53 (30 × 5.10%) slots at each refined turn have been rectified. We then report the value vocabulary size (i.e., the number of candidate values) of each slot and its value change ratio in Table 4. For some slots, the value vocabulary size decreases due to value normalization and error correction. For other slots, the value vocabulary size increases, mainly because a few labels that contain multiple values have been additionally introduced. Table 4 also indicates that the value change ratio of the name-related slots is the highest. Since these slots usually have longer values, the annotators are more likely to make incomplete and inconsistent annotations.

Benchmark Evaluation
In this part, we present some benchmark results.

Benchmark Models
In recent years, many neural dialogue state tracking models have been proposed based on the MultiWOZ dataset. These models can be roughly divided into two categories: predefined ontology-based methods and open vocabulary-based methods. The ontology-based methods perform classification by scoring all possible slot-value pairs in the ontology and selecting the value with the highest score as the prediction. By contrast, the open vocabulary-based methods directly generate or extract slot values from the dialogue context. We benchmark eight models of both types on our refined dataset; among them, the first three are ontology-based approaches.
The rest are open vocabulary-based methods. For all these models, we employ the default hyperparameter settings to retrain them on MultiWOZ 2.4.

Results Analysis
We adopt the joint goal accuracy (Zhong et al., 2018) and slot accuracy as the evaluation metrics. The joint goal accuracy is defined as the ratio of dialogue turns in which every slot value has been correctly predicted. The slot accuracy is defined as the average accuracy of all slots. The detailed results are presented in Table 5. As can be observed, all models achieve much higher performance on MultiWOZ 2.4. The ontology-based models demonstrate the largest performance gains, mainly benefiting from the improved ontology. SAVN and TripPy show the smallest performance increase, because they have already utilized value normalization techniques to tackle label variants in MultiWOZ 2.1. We then report the joint goal accuracy of SUMBT and TRADE on different versions of the dataset in Table 6. We also include the domain-specific accuracy of SOM-DST and STAR in Table 7, which shows that, except for SOM-DST in the taxi domain, both methods demonstrate higher performance in each domain.
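Under these definitions, the two metrics can be computed as in the sketch below (our own illustration, not code from any benchmark model; states here list only filled slots, with absent slots treated as taking the value none):

```python
def joint_goal_accuracy(predictions, labels):
    """Fraction of turns whose predicted state matches the gold state exactly."""
    correct = sum(pred == gold for pred, gold in zip(predictions, labels))
    return correct / len(labels)

def slot_accuracy(predictions, labels, slots):
    """Average, over all slots, of the per-slot prediction accuracy across turns."""
    total = 0.0
    for slot in slots:
        hits = sum(pred.get(slot, "none") == gold.get(slot, "none")
                   for pred, gold in zip(predictions, labels))
        total += hits / len(labels)
    return total / len(slots)
```

Note that joint goal accuracy is much stricter: a single wrong slot makes the whole turn incorrect, which is why it sits far below slot accuracy for all models.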

Per-Slot (Slot-Specific) Accuracy
In the previous subsection, we presented the joint goal accuracy and average slot accuracy of eight state-of-the-art dialogue state tracking models. The results strongly verify the quality of our refined annotations. Here, we further report the per-slot (slot-specific) accuracy of SUMBT on different versions of the MultiWOZ dataset. The slot-specific accuracy is defined as the ratio of dialogue turns in which the value of a particular slot has been correctly predicted. The results are shown in Table 8, from which we can observe that the majority of slots (21 out of 30) demonstrate higher accuracies on MultiWOZ 2.4. Even though MultiWOZ 2.3 (Han et al., 2020b) additionally introduces co-reference annotations as a kind of auxiliary information, it still shows the best performance on only 7 slots. Compared with MultiWOZ 2.1, SUMBT achieves higher slot-specific accuracies on 26 slots on MultiWOZ 2.4. These results confirm again the utility and validity of our refined version. Overall, most slots demonstrate stronger performance on MultiWOZ 2.4 than on all the other versions.

Case Study
In order to understand more intuitively why the refined annotations can boost performance, we showcase several dialogues from the test set.
Note that all the benchmark models are retrained on the original noisy training set; the only difference is that we use the cleaned validation set to choose the best model and then report the results on the cleaned test set. Even so, our empirical study shows that all the benchmark models obtain better performance on MultiWOZ 2.4 than on all the previous versions. Considering that all the previous refined versions also corrected (part of) the annotation errors in the training set, the superiority of MultiWOZ 2.4 indicates that existing versions have not fully resolved the incorrect and inconsistent annotations. The cleaned validation set and test set of MultiWOZ 2.4 can more appropriately reflect the true performance of existing models. In addition, the refined validation set and test set can also be combined with the training set of MultiWOZ 2.3, from which even higher performance of existing methods can be expected, as MultiWOZ 2.3 has the cleanest training set to date.
On the other hand, it is well understood that deep neural models are data-hungry. However, it is costly and labor-intensive to collect high-quality large-scale datasets, especially dialogue datasets that involve multiple domains and multiple turns. A dataset composed of a large noisy training set and a small clean validation set and test set is more common in practice. In this regard, our refined dataset better reflects the realistic situations encountered in practice. Moreover, a noisy training set may motivate the design of more robust and noise-resilient training paradigms. As a matter of fact, noisy label learning (Han et al., 2020a; Song et al., 2020) has been widely studied in the machine learning community to train robust models from noisy training data, and numerous advanced techniques have been investigated. We hope these techniques can also be applied to the study of dialogue systems and thus accelerate the development of conversational AI.
In this work, we introduce MultiWOZ 2.4, an updated version of MultiWOZ 2.1, by rectifying all the annotation errors in the validation set and test set. We keep the annotations in the training set as is to encourage robust and noise-resilient model training. We further benchmark eight state-of-the-art models on MultiWOZ 2.4 to facilitate future research. All the chosen benchmark models demonstrate much better performance on MultiWOZ 2.4 than on MultiWOZ 2.1.

Potential Impacts
We believe that our refined dataset MultiWOZ 2.4 will have substantial impacts in academia. First, the cleaned validation set and test set help evaluate the performance of dialogue state tracking models more properly and fairly, which is undoubtedly beneficial to the research of task-oriented dialogue systems. In addition, MultiWOZ 2.4 may also serve as a potential dataset to assist the research of noisy label learning in the machine learning community.