Mismatch between Multi-turn Dialogue and its Evaluation Metric in Dialogue State Tracking

Dialogue state tracking (DST) aims to extract essential information from multi-turn dialogue and take appropriate actions. A belief state, one of the core pieces of information, refers to the subject and its specific content, and appears in the form of domain-slot-value. The trained model predicts "accumulated" belief states at every turn, and joint goal accuracy and slot accuracy are mainly used to evaluate the predictions; however, we point out that these metrics have a critical limitation when evaluating belief states that accumulate as the dialogue proceeds, especially on the most widely used MultiWOZ dataset. Additionally, we propose relative slot accuracy to complement the existing metrics. Relative slot accuracy does not depend on the number of predefined slots, and allows intuitive evaluation by assigning relative scores according to the turn of each dialogue. This study also encourages reporting not only joint goal accuracy but also various complementary metrics in DST tasks for the sake of a realistic evaluation.


Introduction
The dialogue state tracking (DST) module structures the belief state that appears during the conversation in the form of domain-slot-value, to provide an appropriate response to the user. Recently, multi-turn DST datasets have been constructed using the Wizard-of-Oz method to reflect more realistic dialogue situations (Wen et al., 2017; Mrkšić et al., 2017; Budzianowski et al., 2018). The characteristic of these datasets is that belief states are "accumulated" and recorded every turn. That is, the belief states of the previous turns are included in the current turn. This design tests whether the DST model tracks the essential information that has appeared up to the present point.
Joint goal accuracy and slot accuracy are utilized in most cases to evaluate the prediction of accumulated belief states. Joint goal accuracy strictly determines whether every predicted state is identical to the gold state, whereas slot accuracy measures the ratio of correct predictions. However, we find that these two metrics solely focus on "penalizing states that fail to predict," without considering "rewards for well-predicted states." Accordingly, as also pointed out in Rastogi et al. (2020a), joint goal accuracy underestimates the model prediction because of its error accumulation attribute, while slot accuracy overestimates it because of its dependency on predefined slots.
However, there is a lack of discussion on metrics for evaluating the most widely used MultiWOZ dataset, even though a recently published dataset (Rastogi et al., 2020b) proposes some alternative metrics. To address this challenge, we propose reporting relative slot accuracy along with the existing metrics on the MultiWOZ dataset. While slot accuracy suffers from overestimation because it always considers all predefined slots in every turn, relative slot accuracy does not depend on predefined slots, and calculates a score that is affected solely by slots that appear in the current dialogue. Therefore, relative slot accuracy enables a realistic evaluation by rewarding the model's correct predictions, a complementary approach that joint goal and slot accuracies cannot fully cover. We expect that the proposed metric can be adopted to evaluate model performance more intuitively.

Joint Goal Accuracy
Joint goal accuracy, developed from Henderson et al. (2014b) and Zhong et al. (2018), can be considered an ideal metric in that it verifies that the predicted belief states perfectly match the gold label. Equation 1 expresses how the joint goal accuracy is calculated at each turn, depending on whether the slot values match:

\[
\text{JGA} =
\begin{cases}
1 & \text{if predicted state} = \text{gold state} \\
0 & \text{otherwise}
\end{cases}
\tag{1}
\]

However, joint goal accuracy underestimates the accumulated states because it scores all later turns as zero once the model mispredicts at a particular turn, regardless of the quality of the model's predictions at those later turns. As illustrated in Figure 1, we measured the relative position within the dialogue of the turn causing this phenomenon. We used MultiWOZ 2.1 (Eric et al., 2019), and analyzed the 642 samples, out of the 999 dialogues in the test set, in which the joint goal accuracy of the last turn is zero. The DST model selected for primary verification is SOM-DST (Kim et al., 2020), one of the latest DST models. The relative position where joint goal accuracy first became zero was mainly at the beginning of the dialogue.¹ This means that the joint goal accuracy after the beginning of the dialogue is unconditionally measured as zero because of the initial misprediction, even though the model may correctly predict new belief states at later turns. Failing to measure performance on the latter part of a dialogue means the metric cannot account for the various dialogue situations provided in the dataset, which is a critical issue in building a realistic DST model.

¹ 59 of the 642 samples have a joint goal accuracy of 1 in the middle of the dialogue, owing to a coincidental situation or differences in the interpretation of annotations. Table A1 and Table A2 show the dialogue situations in detail, and Table A3 and Table A4 show the corresponding belief states. Refer to Appendix A.
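As a minimal sketch, the per-turn computation of Equation 1 and its error-accumulation effect can be written as follows; the slot names and belief states are illustrative, not drawn from MultiWOZ:

```python
def joint_goal_accuracy(pred_states, gold_states):
    """Return a list of 0/1 JGA scores, one per turn.

    Each element of pred_states / gold_states is the accumulated belief
    state of that turn, e.g. {"restaurant-area": "centre", ...}.
    """
    return [1 if p == g else 0 for p, g in zip(pred_states, gold_states)]

gold = [
    {"restaurant-area": "centre"},
    {"restaurant-area": "centre", "restaurant-food": "thai"},
    {"restaurant-area": "centre", "restaurant-food": "thai",
     "restaurant-pricerange": "cheap"},
]
# The model misses one slot at turn 0; because states accumulate, every
# later turn also mismatches, so JGA stays 0 even though the new slots
# introduced at turns 1 and 2 are predicted correctly.
pred = [
    {},
    {"restaurant-food": "thai"},
    {"restaurant-food": "thai", "restaurant-pricerange": "cheap"},
]
print(joint_goal_accuracy(pred, gold))  # [0, 0, 0]
```

The example makes the underestimation concrete: two of the three turns contain only correctly predicted new information, yet the metric awards zero throughout.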

Slot Accuracy
Slot accuracy can compensate for situations where joint goal accuracy does not fully evaluate the dialogue situation. Equation 2 expresses how the slot accuracy is calculated. T indicates the total number of predefined slots for all the domains. M denotes the number of missed slots that the model does not accurately predict among the slots included in the gold state, and W denotes the number of wrongly predicted slots among the slots that do not exist in the gold state:

\[
\text{Slot Acc.} = \frac{T - M - W}{T} \tag{2}
\]

Figure 2 illustrates the total number of annotated slots in MultiWOZ 2.1 to show the limitation of slot accuracy. Each value on the x-axis in Figure 2 indicates the "maximum" number of slots that appear in a single dialogue, and we confirmed that approximately 85% of the test set uses fewer than 12 of the 30 predefined slots in the experiment. Because the number of belief states appearing in the early and middle turns of the dialogue is smaller, and even fewer states are falsely predicted, calculating slot accuracy using Equation 2 reduces the influence of M and W, and the final score is dominated by the total slot number T. Accordingly, several previous studies still report model performance using solely joint goal accuracy, because slot accuracy depends excessively on the number of predefined slots, making the performance deviation among models trivial (refer to Table A5).
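A minimal sketch of Equation 2 illustrates the overestimation; the slot names below are invented for illustration, and the slot counting follows the definitions of M and W given above:

```python
def slot_accuracy(pred, gold, num_predefined_slots=30):
    """Per-turn slot accuracy; pred and gold map slot names to values."""
    # M: gold slots the model missed or predicted with the wrong value.
    missed = sum(1 for s, v in gold.items() if pred.get(s) != v)
    # W: predicted slots that do not exist in the gold state.
    wrong = sum(1 for s in pred if s not in gold)
    T = num_predefined_slots
    return (T - missed - wrong) / T

gold = {"hotel-area": "north", "hotel-stars": "4"}
pred = {}  # the model predicts nothing at all this turn
print(slot_accuracy(pred, gold))  # 28/30 ~ 0.933: high despite total failure
```

Because the 28 predefined slots that never appear in this turn all count as "correct", a model that predicts nothing still scores above 93%, which matches the overestimation described above.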
Furthermore, according to Table A6, we observe that slot accuracy tends to be too high. The slot accuracies of turns 0 and 1 are approximately 96%, despite the model not correctly predicting any states at all. It becomes difficult to compare various models in detail if each model shows high performance even though nothing is adequately predicted. In addition, as the turns progress, there is no reward for a situation in which the model tracks the belief state without any difficulty. The case of correctly predicting two out of three states in turn 4 and the case of correctly predicting three out of four in turn 5 exhibit the same slot accuracy. Therefore, the slot accuracy measured according to Equation 2 differs from our intuition.

Other Metric
Recently, Rastogi et al. (2020b) proposed a metric called average goal accuracy. The main difference between the average goal accuracy and the proposed relative slot accuracy is that the average goal accuracy only considers the slots with non-empty values in the gold states of each turn, whereas the proposed relative slot accuracy considers those in both gold and predicted states. Since average goal accuracy ignores the predicted states, it cannot properly distinguish a better model from a worse model in some specific situations. We will discuss it in more detail in Section 4.1.

Relative Slot Accuracy
As can be observed in Equation 2, slot accuracy has the characteristic that the larger the number of predefined slots (T), the smaller the deviation between the prediction results. The deviation among DST models will become even smaller when datasets with more diverse dialogue situations are constructed, because the number of predefined slots will continually increase. It is therefore not an appropriate metric in terms of scalability. We propose relative slot accuracy, which is not affected by predefined slots, and assigns rewards and penalties that fit human intuition at every turn. Equation 3 expresses how the relative slot accuracy is calculated, where T* denotes the number of unique slots appearing in the predicted and gold states of a particular turn:

\[
\text{Relative Slot Acc.} = \frac{T^{*} - M - W}{T^{*}} \tag{3}
\]
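A minimal sketch of Equation 3 follows; the slot names are illustrative, and the handling of a turn with no slots in either state (returning 1.0) is our assumption, since the equation is undefined when T* = 0:

```python
def relative_slot_accuracy(pred, gold):
    """Per-turn relative slot accuracy (Equation 3)."""
    slots = set(pred) | set(gold)   # T*: unique slots appearing this turn
    if not slots:                   # assumption: empty pred vs empty gold
        return 1.0                  # is treated as a perfect prediction
    missed = sum(1 for s, v in gold.items() if pred.get(s) != v)
    wrong = sum(1 for s in pred if s not in gold)
    return (len(slots) - missed - wrong) / len(slots)

gold = {"hotel-area": "north", "hotel-stars": "4"}
print(relative_slot_accuracy({}, gold))                       # 0.0
print(relative_slot_accuracy({"hotel-area": "north"}, gold))  # 0.5
```

In contrast to slot accuracy, predicting nothing now scores 0, and correctly recovering one of the two slots that actually appear scores 0.5, because only the slots seen in the dialogue enter the denominator.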
Relative slot accuracy rewards well-predicted belief states by measuring scores over the accumulating turns. The relative scoring is discussed further in Section 4.1.

Experiments
We evaluated on MultiWOZ 2.1, an improved version of MultiWOZ 2.0 (Budzianowski et al., 2018), which has been adopted in several studies, as shown in Table A5. Five domains (i.e., hotel, train, restaurant, attraction, and taxi) are adopted in the experiment, following Wu et al. (2019), and there are a total of 30 domain-slot pairs. We selected the DST models in Table A5 that perform the MultiWOZ experiment with the original authors' reproducible code². Additionally, we report the F1 score, which can be calculated using the current predicted and gold states. Table 1 presents the overall results. Regarding slot accuracy, the difference between the largest and smallest values is only 1.09%; this can be one reason several researchers do not report it. Meanwhile, relative slot accuracy explicitly highlights the deviation among models, showing a 5.47% difference between the largest and smallest values. Furthermore, the correlation between joint goal accuracy, the most widely adopted metric, and relative slot accuracy with respect to each turn is lower than the correlation between joint goal accuracy and slot accuracy, as illustrated in Figure 3. In other words, the proposed reward-considering evaluation metric enables comparison from a different perspective.

Domain-specific Evaluation
We report the joint goal, slot, and relative slot accuracies per domain using the SOM-DST model in Table 2. Relative slot accuracy derives a domain-specific score from the turn configuration and prediction ratio of each domain by excluding slots that do not appear in the conversation. For example, the taxi domain shows a low score, meaning that it has relatively many incorrect predictions compared to the number of times slots belonging to the taxi domain appear. Because slot accuracy cannot distinguish this trend, the score of the hotel domain is lower than that of the taxi domain under slot accuracy. In summary, relative slot accuracy enables relative comparison according to the distribution of domains in a dialogue.

Dependency on Predefined Slots
As discussed in Section 2.2, slot accuracy, which requires the total set of predefined slots, is not a scalable method for evaluating current dialogue datasets in which each dialogue covers only a few domains. For example, when evaluating a dialogue sample that deals solely with the restaurant domain, even domains that never appear at all (i.e., hotel, train, attraction, and taxi) are involved in measuring performance, making deviations among different models trivial. In contrast, relative slot accuracy evaluates the model's predictive score without being affected by slots never seen in the current dialogue, which is more realistic, considering that each dialogue has its own turn and slot composition. Figure 4 illustrates the means and standard deviations of the model performances in Table 1. As can be observed from the results, relative slot accuracy has a higher deviation than slot accuracy, enabling a detailed comparison among the methodologies.
Reward on Relative Dialogue Turn
Relative slot accuracy is able to reward the model's correct predictions by measuring the accuracy on a relative basis for each turn. Table A6 compares the slot and relative slot accuracies. The relative slot accuracy from turns 0-3 is measured as 0 because it calculates the score based on the unique states of the current turn according to Equation 3. In addition, regarding slot accuracy in turns 4, 5, and 6, there is no score improvement for an additional well-predicted state by the model, whereas in the case of relative slot accuracy the score increases when the newly added state is matched. Therefore, relative slot accuracy can provide an intuitive evaluation reflecting the current belief state recording method, in which the number of slots accumulates incrementally as the conversation progresses.
Comparison to Average Goal Accuracy
Relative slot accuracy can compare DST model performances more properly than average goal accuracy, as mentioned in Section 2.3. Table 3 describes how these two metrics yield different values for the same model predictions. In this example, average goal accuracy cannot consider the additional belief states incorrectly predicted by Model B, resulting in the same score for the two models. In contrast, relative slot accuracy can apply a penalty proportional to the number of wrong predictions because it includes both gold and predicted states when calculating the score. Consequently, relative slot accuracy has finer discriminative power than average goal accuracy.
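The contrast can be sketched as follows; the slot names and values are hypothetical, not taken from Table 3, and the empty-gold handling in both functions is our assumption:

```python
def average_goal_accuracy(pred, gold):
    """Fraction of gold (non-empty) slots predicted correctly; ignores
    extra predicted slots entirely."""
    if not gold:
        return 1.0  # assumption for the empty-gold edge case
    return sum(1 for s, v in gold.items() if pred.get(s) == v) / len(gold)

def relative_slot_accuracy(pred, gold):
    """Equation 3: denominator T* covers slots in pred or gold."""
    slots = set(pred) | set(gold)
    if not slots:
        return 1.0  # assumption when no slot appears at all
    missed = sum(1 for s, v in gold.items() if pred.get(s) != v)
    wrong = sum(1 for s in pred if s not in gold)
    return (len(slots) - missed - wrong) / len(slots)

gold = {"taxi-destination": "cinema"}
model_a = {"taxi-destination": "cinema"}                       # clean prediction
model_b = {"taxi-destination": "cinema", "taxi-leaveat": "9"}  # spurious extra slot

print(average_goal_accuracy(model_a, gold),
      average_goal_accuracy(model_b, gold))   # 1.0 1.0 (same score)
print(relative_slot_accuracy(model_a, gold),
      relative_slot_accuracy(model_b, gold))  # 1.0 0.5 (B penalized)
```

Average goal accuracy cannot separate the two models because it never inspects the predicted state's extra slots, while relative slot accuracy counts the spurious slot in both T* and W.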

Conclusion
This paper points out the challenge that the existing joint goal and slot accuracies cannot fully evaluate the accumulating belief states of each turn in the MultiWOZ dataset. Accordingly, relative slot accuracy is proposed. This metric is not affected by unseen slots in the current dialogue situation, and rewards the model's correct predictions. When the DST task is scaled up to deal with more diverse conversational situations, a realistic model evaluation will be possible using relative slot accuracy. Moreover, we suggest that future studies report various evaluation metrics to complement the limitations of each metric, rather than reporting joint goal accuracy alone.

A Complementary discussions of joint goal accuracy
Our findings show that if the model makes an incorrect prediction, the error accumulates until the end of the dialogue, and the joint goal accuracy remains at zero. In this section, we discuss a few cases of 59 dialogues that do not show the trend among 642 dialogues selected in Section 2.1; however, it is important to note that these few cases have negligible effect on the trend in Figure 1, solely changing the position where the joint goal accuracy first becomes zero.
We sampled dialogues from the MultiWOZ 2.1 test set in Table A1 and Table A2, and marked values appearing in the dialogue in bold. Table A3 and Table A4 indicate the corresponding belief states of each dialogue. In the first dialogue, presented in Table A1, the joint goal accuracy is measured as 1 at turn 2. In this case, the model incorrectly predicted the restaurant-pricerange slot at turns 0 and 1, and then an utterance about that slot happened to appear. In a general case, the wrong prediction of the restaurant-pricerange slot at turn 0 would accumulate to the last turn. However, in this case, another incorrect prediction at turn 3 causes error accumulation in this dialogue.
The second dialogue, presented in Table A2, shows an incorrect prediction arising from the interpretation of annotations at turn 4. In other words, because the dialogue about the hotel-internet slot spans turns 4 and 5, it is solely an error depending on the prediction timing of the model. Because the correct belief state was predicted from turn 5 onward, it cannot be said to be an error accumulation phenomenon; however, the model did not predict the hotel-pricerange slot at turn 6, which is the last turn in this case.
In conclusion, it can be determined that the model does not seem to accumulate erroneous predictions because of an accidental situation or interpretation of annotations, but this does not negate the error accumulation phenomenon. Furthermore, the fact that the starting point of making the joint goal accuracy of subsequent turns to 0 mainly occurs at the beginning of the dialogue does not change.
Table A1 (excerpt):
Turn 0 — System: "" / User: "can you help me find a nice restaurant ?"

Table A2:
Turn 0 — System: "" / User: "i would like help finding a train headed to cambridge ."
Turn 1 — System: "i will be happy to help you find a train . can you tell me where you will be departing from ?" / User: "departing from london kings cross on tuesday ."
Turn 2 — System: "when would you like to leave or arrive by ?" / User: "i need to arrive by 18:30 ."
Turn 3 — System: "take train tr1434 , which will arrive at 18:08 . shall i book you for that train ?" / User: "can i get the price for a ticket , first ?"
Turn 4 — System: "sure ! the ticket is 23.60 pounds ." / User: "thanks ! i am also looking for a hotel called archway house . can you tell me if they have free wifi ?"
Turn 5 — System: "they do . would you like to book a room ?" / User: "i would first like to know what their price range and hotel type are , thank you ."
Turn 6 — System: "archway house is a moderate -ly priced guesthouse . would you like their address or perhaps to book a room there ?" / User: "thank you , but no . you've already helped me with everything i needed today ."

Table A6: SOM-DST prediction of a MultiWOZ 2.1 test sample (PMUL4648.json). The joint goal accuracy of every turn is 0 because of the belief states marked in red. When calculating the score, the total number of slots is set to 30, covering the hotel, train, restaurant, attraction, and taxi domains in MultiWOZ 2.1. Relative slot accuracy can be calculated using only the slot-values appearing in the dialogue, unaffected by unused information.