Comprehensive Study: How the Context Information of Different Granularity Affects Dialogue State Tracking?

Dialogue state tracking (DST) plays a key role in task-oriented dialogue systems by monitoring the user’s goal. In general, there are two strategies for tracking a dialogue state: predicting it from scratch and updating it from the previous state. The scratch-based strategy obtains each slot value by consulting the entire dialogue history, whereas the previous-based strategy relies on the current turn of dialogue to update the previous dialogue state. However, it is hard for the scratch-based strategy to correctly track short-dependency dialogue states because of noise; meanwhile, the previous-based strategy is of limited use for long-dependency dialogue state tracking. Context information of different granularity therefore plays different roles in tracking different kinds of dialogue states. In this paper, we study and discuss how the context information of different granularity affects dialogue state tracking. First, we explore how strongly different granularities affect dialogue state tracking. Then, we discuss how to combine multiple granularities for dialogue state tracking. Finally, we apply the findings about context granularity to the few-shot learning scenario. We have publicly released all code.


Introduction
Currently, task-oriented dialogue systems, which aim to assist the user in completing certain tasks such as buying products or booking a restaurant, have attracted great attention in academia and industry (Chen et al., 2017). As a key component of a task-oriented dialogue system, dialogue state tracking plays an important role in understanding the natural language given by the user and expressing it as a certain dialogue state (Rastogi et al., 2017; Goel et al., 2018). The dialogue state for each turn of a dialogue is typically presented as a series of slot value pairs that represent information about the user's goal up to the current turn. For example, in Figure 1, the dialogue state at turn 2 is {(attraction-type, cinema), (attraction-area, south)}.
Figure 1: Examples of dialogue state tracking with context information of different granularity at the sixth turn of a dialogue. A slot in a dialogue state refers to the concatenation of a domain name and a slot name. In the figure, (a) represents predicting the dialogue state from scratch, where slots in three domains need to be predicted and the challenge of encoding longer text is faced; (b) indicates updating the dialogue state from the previous state, where the slot taxi-departure cannot be predicted due to the absence of the corresponding dialogue history content; (c) represents dialogue state tracking with context information of granularity 4, which tracks from the second turn and uses less dialogue history content (4 turns) to provide evidence for the prediction of all slots.
In general, there are two strategies for tracking a dialogue state: predicting it from scratch and updating it from the previous state. The scratch-based strategy obtains each slot value in the dialogue state by consulting the entire dialogue history (Xu and Hu, 2018; Lei et al., 2018; Goel et al., 2019; Ren et al., 2019; Wu et al., 2019; Shan et al., 2020); its advantage is that it preserves the integrity of the dialogue information. The previous-based strategy relies on the current turn of dialogue to update the previous dialogue state (Mrkšić et al., 2017; Chao and Lane, 2019; Kim et al., 2020; Heck et al., 2020; Zhu et al., 2020); its main characteristic is that it greatly improves the efficiency of dialogue state prediction by avoiding the computational cost of encoding the entire dialogue history.
However, both strategies have serious drawbacks that stem from their own characteristics. For the scratch-based strategy, it is hard to correctly track short-dependency dialogue states because of the noise introduced by encoding the entire dialogue history. For example, the dialogue history of turns 1 to 3 in Figure 1 (a) does not contribute to the prediction of slot values in the restaurant domain. For the previous-based strategy, it is difficult to track long-dependency dialogue states because it uses only the limited dialogue information in the current turn and the previous state. As in Figure 1 (b), the slot taxi-departure cannot be predicted due to the absence of the corresponding dialogue history content.
Context information of different granularity therefore plays different roles in tracking different kinds of dialogue states. Intuitively, less context information is needed for short-dependency dialogue states, while more context information must be taken into account for long-dependency dialogue state tracking. For example, the dialogue state in Figure 1 (c) is tracked from turn 2, using context information of granularity 4 (turns 3 to 6), which provides evidence for the prediction of all slots while bringing in as little noise as possible.
Thus, in this paper, we study and discuss how the context information of different granularity affects dialogue state tracking. To the best of our knowledge, this is the first detailed investigation of the impact of context granularity in dialogue state tracking, and it promotes research on dialogue state tracking strategies. Our investigation mainly focuses on three points (the code is released at https://github.com/yangpuhai/Granularity-in-DST):
• How greatly different granularities affect dialogue state tracking?
• How to combine multiple granularities for dialogue state tracking?
• Application of context information granularity in few-shot learning scenario.
The rest of the paper is organized as follows. Section 2 introduces the relevant definitions and formulas for dialogue state tracking strategies. Section 3 lists the detailed experimental settings. Section 4 presents the investigation and its results, followed by conclusions in Section 5.

Preliminary
To describe the dialogue state tracking strategies, we first introduce the notation used in this paper.
Dialogue Content: A dialogue of $N$ turns is a sequence of turns $(T_1, T_2, \ldots, T_N)$, where $T_i = (S_i, U_i)$ is the dialogue content of the $i$-th turn, which includes the system utterance $S_i$ and the user utterance $U_i$.
Dialogue State: We define $E = (B_0, B_1, B_2, \ldots, B_N)$ as all dialogue states up to the $N$-th turn of the dialogue, where $B_i$ is the set of slot value pairs representing the information provided by the user up to the $i$-th turn. In particular, $B_0$ is the initial dialogue state, which is an empty set.
Granularity: In dialogue state tracking, the number of dialogue turns spanning from a certain dialogue state $B_m$ in the dialogue to the current dialogue state $B_n$ is called the granularity, that is, $G = |(T_{m+1}, \ldots, T_n)| = n - m$. For example, the granularities of context information in (a), (b), and (c) in Figure 1 are 6, 1, and 4, respectively.
| Model | Open vocabulary | Encoder | Decoder | Tracking strategy |
|---|---|---|---|---|
| SpanPtr (Xu and Hu, 2018) | ✓ | RNN | Extractive | scratch-based |
| TRADE (Wu et al., 2019) | ✓ | RNN | Generative | scratch-based |
| BERTDST (Chao and Lane, 2019) | ✓ | BERT | Extractive | previous-based |
| SOMDST (Kim et al., 2020) | ✓ | BERT | Generative | previous-based |
| SUMBT (Lee et al., 2019) | × | BERT | Classification | previous-based |

Table 2: Statistics on the characteristics of the 5 baselines studied in the paper. In the decoder, the extractive mode refers to the extraction of slot values directly from the dialogue context, the generative mode refers to the vocabulary-dependent sequence decoding, and the classification mode is the slot value ontology-based classification.

Assuming that the dialogue state of the $N$-th turn is to be inferred, dialogue state tracking under a certain granularity is as follows:
$$B_N = \mathrm{tracker}\big((T_{N-G+1}, \ldots, T_N),\ B_{N-G}\big)$$
where $G \in \{1, 2, \ldots, N\}$ is the granularity of context information and tracker represents a dialogue state tracking model.
In particular, if $G = 1$, then:
$$B_N = \mathrm{tracker}(T_N,\ B_{N-1})$$
This case corresponds to the strategy of updating from the previous state. Therefore, the previous-based strategy is a special case of dialogue state tracking in which the context granularity is minimal. If $G = N$, then:
$$B_N = \mathrm{tracker}\big((T_1, \ldots, T_N),\ B_0\big)$$
This case corresponds to the strategy of predicting the state from scratch. Similarly, the scratch-based strategy is also a special case of dialogue state tracking, with context information of maximum granularity. Since the maximum granularity $N$ differs between dialogues, this paper uses 0 to refer to the maximum granularity $N$, -1 to refer to granularity $N - 1$, and so on.
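The granularity-based tracking defined in this section can be sketched in code. The following is a minimal illustration, not the released implementation: the names `resolve_granularity` and `select_context` are hypothetical, dialogue states are assumed to be plain dictionaries, and clipping a granularity that exceeds the number of available turns is an assumption of this sketch.

```python
from typing import Dict, List, Tuple

Turn = Tuple[str, str]   # (system utterance S_i, user utterance U_i)
State = Dict[str, str]   # slot -> value, e.g. {"attraction-area": "south"}


def resolve_granularity(g: int, n: int) -> int:
    """Map a granularity label to an absolute granularity G in {1, ..., n}.

    Positive labels are absolute; 0, -1, -2, ... stand for n, n-1, n-2, ...
    Clipping the result into [1, n] is an assumption of this sketch.
    """
    absolute = g if g > 0 else n + g
    return max(1, min(absolute, n))


def select_context(turns: List[Turn], states: List[State],
                   n: int, g: int) -> Tuple[List[Turn], State]:
    """Return ((T_{n-G+1}, ..., T_n), B_{n-G}), the tracker input at turn n.

    turns[i - 1] holds T_i; states[i] holds B_i, with states[0] == {} as B_0.
    """
    G = resolve_granularity(g, n)
    return turns[n - G:n], states[n - G]
```

For the six-turn dialogue of Figure 1, `select_context(turns, states, 6, 4)` returns turns 3 to 6 together with $B_2$, while labels 1 and 0 reproduce the previous-based and scratch-based inputs, respectively.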

Experimental Settings
In order to investigate how the context information of different granularity affects dialogue state tracking, we analyze the performance of several different types of dialogue state tracking models on different datasets. For a clearer illustration, the detailed settings are introduced in this section.

Datasets
Our experiments were carried out on 5 datasets: Sim-M, Sim-R, WOZ2.0, DSTC2, and MultiWOZ2.1. Their statistics are shown in Table 1.
Sim-M and Sim-R are multi-turn dialogue datasets in the movie and restaurant domains, respectively, which are specially designed to evaluate the scalability of dialogue state tracking models. Their test sets include a large number of unknown slot values, so the generalization ability of the model can be reflected more accurately.
WOZ2.0 and DSTC2 datasets are both collected in the restaurant domain and have the same three slots: food, area, and price range. These two datasets provide automatic speech recognition (ASR) hypotheses of user utterances and can therefore be used to verify the robustness of the model against ASR errors. As in previous works, we use manually transcribed user utterances for training and the top ASR hypothesis for testing.
MultiWOZ2.1 is the corrected version of the MultiWOZ (Budzianowski et al., 2018). Compared to the four datasets above, MultiWOZ2.1 is a more challenging and currently widely used benchmark for multi-turn multi-domain dialogue state tracking, consisting of 7 domains, over 30 slots, and over 4500 possible slot values. Following previous works (Wu et al., 2019;Kim et al., 2020;Heck et al., 2020;Zhu et al., 2020), we only use 5 domains (restaurant, train, hotel, taxi, attraction) that contain a total of 30 slots.

Baselines
We use 5 different types of baselines whose characteristics are shown in Table 2.
SpanPtr: This is the first model to extract slot values directly from the dialogue context without an ontology. It encodes the whole dialogue history with a bidirectional RNN and extracts the slot value for each slot by generating the start and end positions in the dialogue history (Xu and Hu, 2018).
TRADE: This model is the first to consider knowledge transfer between domains in the multi-domain dialogue state tracking task. It represents a slot as a concatenation of domain name and slot name, encodes all dialogue history using a bidirectional RNN, and finally decodes each slot value using a pointer-generator network (Wu et al., 2019).
BERTDST: This model decodes only the slot values of the slots mentioned in the current turn of dialogue, and then uses a rule-based update mechanism to update from the previous state to the current turn state. It uses BERT to encode the current turn of dialogue and extracts slot values from the dialogue as spans (Chao and Lane, 2019).
SOMDST: This model takes the dialogue state as an explicit memory that can be selectively overwritten, and inputs it into BERT together with the current turn dialogue. It then decomposes the prediction for each slot value into operation prediction and slot generation (Kim et al., 2020).
SUMBT: This model uses an ontology and is trained and evaluated on the dialogue session level instead of the dialogue turn level. BERT is used in the model to encode turn-level dialogues, and a unidirectional RNN is used to capture the session-level representation (Lee et al., 2019).

Configurations and Metrics
Our implementations are based on the official source code of SOMDST and SUMBT, while SpanPtr, TRADE, and BERTDST are reimplemented for this paper. All BERT-based models use pre-trained BERT (Vaswani et al., 2017) (BERT-Base, Uncased), which has 12 hidden layers of 768 units and 12 self-attention heads, while RNN-based models use GRU (Cho et al., 2014). We use Adam (Kingma and Ba, 2014) as the optimizer and use greedy decoding. We set the number of training epochs separately for each model, and training stops early when the model's performance on the development set fails to improve for 15 consecutive epochs. All results are averaged over three runs with different random seeds. The detailed hyperparameter settings are given in Appendix A.
Since the length of the dialogue history is related to the granularity, the input length of the model needs to adapt to the granularity. Especially for the model with BERT as the encoder, in order to prevent the input from being truncated, we set the max sequence length to exceed almost all the inputs under different granularity. See Appendix A for details on the max sequence length settings.
Following previous works (Xu and Hu, 2018;Wu et al., 2019;Kim et al., 2020;Heck et al., 2020), the joint accuracy (Joint acc) and slot accuracy (Slot acc) are used for evaluation. The joint accuracy is the accuracy that checks whether all the predicted slot values in each turn are exactly the same as the ground truth slot values. The slot accuracy is the average accuracy of slot value prediction in all turns.
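The two metrics can be made concrete with a short sketch. It assumes that each dialogue state is a dictionary from slot to value listing only filled slots, and that an unfilled slot counts as the value "none" when computing slot accuracy; the function names and this convention are illustrative and are not taken from the released code.

```python
from typing import Dict, List

State = Dict[str, str]  # slot -> value; only filled slots are listed (assumption)


def joint_accuracy(preds: List[State], golds: List[State]) -> float:
    """Fraction of turns whose predicted state equals the gold state exactly."""
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)


def slot_accuracy(preds: List[State], golds: List[State],
                  slots: List[str]) -> float:
    """Average per-slot accuracy over all turns.

    A slot missing from a state is compared as "none" (assumption).
    """
    hits = 0
    for p, g in zip(preds, golds):
        hits += sum(p.get(s, "none") == g.get(s, "none") for s in slots)
    return hits / (len(golds) * len(slots))


slots = ["restaurant-food", "restaurant-area", "restaurant-price range"]
golds = [{"restaurant-food": "thai"},
         {"restaurant-food": "thai", "restaurant-area": "south"}]
preds = [{"restaurant-food": "thai"},
         {"restaurant-food": "thai", "restaurant-area": "north"}]
print(joint_accuracy(preds, golds))        # one of two turns is fully correct
print(slot_accuracy(preds, golds, slots))  # five of six slot predictions match
```

A single wrong slot makes the whole turn count as wrong for joint accuracy, which is why joint accuracy is the stricter of the two metrics.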

Experimental Analysis
This section presents our detailed investigation of how the context information of different granularity affects dialogue state tracking, focusing on the impact of granularity on dialogue state tracking, the combination of multiple granularities, and the application of context granularity in few-shot learning scenario. For simplicity, in all experimental results, the maximum granularity is expressed as 0, the maximum granularity minus 1 is expressed as -1, and so on.

How greatly different granularities affect dialogue state tracking?
The first part of our investigation looks at the validity of the context granularity used by current dialogue state tracking models and tries to figure out how different granularities affect dialogue state tracking. The experimental results are shown in Table 3. Some dialogue state tracking models do not use the appropriate granularity, and their performance improves greatly when they are trained with context of the appropriate granularity. For example, the joint accuracy of SpanPtr with granularity -3 on WOZ2.0 improved by 42%, while the joint accuracy of BERTDST with granularity 4 on MultiWOZ2.1 improved by 19%. These results suggest that dialogue state tracking differs significantly across granularities; therefore, the granularity should be chosen carefully according to the characteristics of the model and dataset. Comparing the results on different models and datasets in Table 3, we find that:
• For different models, models with generative decoding prefer larger granularity, because they require more context information to effectively learn a vocabulary-based distribution. For example, TRADE and SOMDST both perform better at larger granularity. Meanwhile, models with extractive decoding are more dependent on the characteristics of the dataset. In general, models with generative decoding have clear advantages over models with extractive decoding.
• For different datasets, when the dataset involves multiple domains and there are a large number of long-dependency dialogue states, context information of larger granularity can [...] determine the effectiveness of small granularity in dialogue state tracking. However, when there are more turns of dialogue, resulting in less information in each turn, a larger granularity may be required to provide enough information. For example, SpanPtr performs best on the DSTC2 dataset at maximum granularity.
As can be seen from the above analysis, different granularities have their own advantages in different situations of dialogue, so it is natural to wonder whether multiple granularities can be combined to achieve better dialogue state tracking. Next, let's discuss the issue of multi-granularity combination.

How to combine multiple granularities for dialogue state tracking?
Following the above analysis, we discuss how to combine multiple granularities in dialogue state tracking, focusing on three aspects: (1) the relationship between granularities, (2) the performance of multi-granularity combination, and (3) the limitations of multi-granularity combination.
The relationship between granularities: First, we use different granularities in the training and inference phases of dialogue state tracking to figure out the relationship between different granularities, as shown in Figure 2. When the granularity of context information in the inference phase is fixed, a dialogue state tracking model trained with another granularity still generalizes to this inference granularity. Some models trained at other granularities, such as BERTDST in Figure 2 (b) and (f), even perform better. Meanwhile, as the granularity gap increases, the context information becomes more and more inconsistent, and the ability of the model to generalize across granularities gradually decreases. These phenomena can be summarized as follows: the knowledge learned by a dialogue state tracking model from context information of different granularity is transferable, and the smaller the gap between granularities, the greater the knowledge transfer effect.
Performance of multi-granularity combination: Then, we use the knowledge transfer between context information of different granularity to improve the baselines. Specifically, we add the most adjacent granularity to the training phase of the model, that is, contexts under two granularities are used for training, while the inference phase remains unchanged. The results are shown in Table 4. In most cases, the performance of the baseline models is significantly enhanced, suggesting that adding context information of more granularities to the training phase can indeed improve the generalization of the dialogue state tracking model. In some cases, however, multi-granularity combination reduces performance, as for SpanPtr, TRADE, and BERTDST on the DSTC2 dataset. The main reason is likely the large deviation between the context information of different granularities in the combination, as can be seen from the large performance drop of SpanPtr, TRADE, and BERTDST on the DSTC2 dataset at other granularities in Table 3.
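One way to realize training under several granularities is to emit one training example per turn and per granularity. The sketch below is an illustration under stated assumptions, not the procedure of the released code: `build_examples` is a hypothetical helper, granularity labels follow the convention that 0, -1, ... count down from the maximum, labels are clipped to the number of available turns, and examples that coincide in early turns are emitted only once.

```python
from typing import Dict, List, Tuple

Turn = Tuple[str, str]                      # (system utterance, user utterance)
State = Dict[str, str]                      # slot -> value
Example = Tuple[List[Turn], State, State]   # (context, starting state, target state)


def build_examples(turns: List[Turn], states: List[State],
                   granularities: List[int]) -> List[Example]:
    """Create training examples for one dialogue under several granularities.

    turns[i - 1] holds T_i; states[i] holds B_i, with states[0] == {} as B_0.
    """
    examples: List[Example] = []
    for n in range(1, len(turns) + 1):
        seen = set()
        for g in granularities:
            absolute = g if g > 0 else n + g   # 0, -1, ... are relative to n
            G = max(1, min(absolute, n))       # clipping is an assumption
            if G in seen:                      # labels can coincide in early turns
                continue
            seen.add(G)
            examples.append((turns[n - G:n], states[n - G], states[n]))
    return examples
```

With `granularities=[1, 2]`, a previous-based model would see each turn both with the single current turn and with the last two turns as context, while inference would still use granularity 1 only.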

Limitations of multi-granularity combination: Given that multi-granularity combination can improve generalization, is it better to use context information of more granularities in the training phase? To answer this question, we gradually add more granularities to the training phase while keeping the inference granularity unchanged. The experimental results are shown in Figure 3. There is an upper limit to the benefit of multi-granularity combination in the training phase. Generally, adding the granularity with the smallest gap brings the best effect; after that, performance declines as the number of granularities increases.
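The sweep described above, in which granularities are added one at a time, can be organized as a sequence of nested training sets. The helper below is a hypothetical sketch: it assumes absolute (positive) granularity labels, orders candidates by their gap to the inference granularity, and breaks ties toward the smaller granularity, none of which is specified in the text.

```python
from typing import List


def nested_training_sets(inference_g: int,
                         candidates: List[int]) -> List[List[int]]:
    """Training granularity sets that grow by one granularity at a time.

    The first set contains only the inference granularity; each later set
    adds the remaining candidate with the smallest gap to it.
    """
    ordered = sorted((g for g in candidates if g != inference_g),
                     key=lambda g: (abs(g - inference_g), g))
    return [[inference_g] + ordered[:k] for k in range(len(ordered) + 1)]


for training_set in nested_training_sets(1, [1, 2, 3, 4]):
    print(training_set)
```

Each set would then be used to train a separate model that is evaluated at the fixed inference granularity, so that the number of added granularities at which performance peaks can be read off directly.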

Application of context information granularity in few-shot learning scenario
Considering the knowledge transfer between granularities in multi-granularity combination, we explore the application of multi-granularity combination in the few-shot learning scenario. Figure 4 shows the joint accuracy of the model with different multi-granularity combinations and the percentage improvement relative to the baseline model on the WOZ2.0 dataset with different training data scales. Under different scales of training data, multi-granularity combination achieves better performance than single granularity in most cases. Moreover, it can be seen from (a), (d) and (e) that the advantage of multi-granularity combination gradually grows as the scale of the training dataset decreases. Therefore, the performance of multi-granularity combination in few-shot learning is worth exploring.
We conduct detailed experiments on all 5 datasets in the paper to fully explore the potential of multi-granularity combination in few-shot learning, as shown in Table 5. Multi-granularity combination has a very significant effect in few-shot learning, and in some cases achieves a relative improvement of more than 10%, such as SpanPtr on Sim-R and WOZ2.0, BERTDST on Sim-M, and SOMDST on WOZ2.0 and DSTC2. Meanwhile, in few-shot learning, the upper limit of multi-granularity combination can be higher, and better performance can be achieved when more granularities are added in the training phase.
Table 5: Joint accuracy of baseline models in few-shot learning before and after applying multi-granularity combination in training phase. TG and IG are the training granularity and inference granularity, respectively. * refers to the granularity originally used in the baseline. 10% and 5% refer to the scale of the training data.
The above experimental results of multi-granularity combination in few-shot learning show that there is indeed knowledge transfer between contexts of different granularity, and that the model can model dialogue more adequately by learning from dialogue contexts of different granularity.

Conclusion
In this paper, we analyze the defects of the two existing dialogue state tracking strategies when dealing with context of different granularity and conduct a comprehensive study of how the context information of different granularity affects dialogue state tracking. Extensive experimental results and analysis show that: (1) different granularities have their own advantages in different situations of dialogue state tracking; (2) multi-granularity combination can effectively improve dialogue state tracking; (3) applying multi-granularity combination in few-shot learning can bring significant improvements. In future work, dynamic context granularity can be used in training and inference to further improve dialogue state tracking.

Ethical Consideration
This work may contribute to the development of conversational systems. In the narrow sense, this work focuses on dialogue state tracking in task-oriented dialogue systems, hoping to improve the ability of conversational AI to understand human natural language. If so, these improvements could have a positive impact on the research and application of conversational AI, which could help humans complete goals more effectively through a more intelligent way of communication. However, we do not forget the other side of the coin. Replacing human agents with conversational AI may affect human communication and may lead to human-machine conflicts, which need to be considered more broadly in the field of conversational AI.