Dual Slot Selector via Local Reliability Verification for Dialogue State Tracking

The goal of dialogue state tracking (DST) is to predict the current dialogue state given all previous dialogue contexts. Existing approaches generally predict the dialogue state at every turn from scratch. However, the overwhelming majority of the slots in each turn should simply inherit the slot values from the previous turn. Therefore, treating all slots equally in each turn is not only inefficient but may also introduce additional errors through redundant slot value generation. To address this problem, we devise the two-stage DSS-DST, which consists of the Dual Slot Selector, based on the current turn dialogue, and the Slot Value Generator, based on the dialogue history. The Dual Slot Selector determines, for each slot, whether to update its value or to inherit the value from the previous turn, based on two aspects: (1) whether there is a strong relationship between the slot and the current turn dialogue utterances; (2) whether a slot value with high reliability can be obtained for the slot from the current turn dialogue. The slots selected to be updated are permitted to enter the Slot Value Generator, which updates their values by a hybrid method, while the other slots directly inherit the values from the previous turn. Empirical results show that our method achieves 56.93%, 60.73%, and 58.04% joint accuracy on the MultiWOZ 2.0, MultiWOZ 2.1, and MultiWOZ 2.2 datasets, respectively, setting a new state of the art with significant improvements.


Introduction
Task-oriented dialogue has attracted increasing attention in both the research and industry communities. As a key component in task-oriented dialogue systems, Dialogue State Tracking (DST) aims to extract user goals or intents and represent them as a compact dialogue state in the form of slot-value pairs of each turn dialogue. DST is an essential part of dialogue management in task-oriented dialogue systems, where the next dialogue system action is selected based on the current dialogue state.
Early dialogue state tracking approaches extract a value for each slot predefined in a single domain (Henderson et al., 2014a,b). These methods can be directly adapted to multi-domain conversations by replacing the slots of a single domain with predefined domain-slot pairs. In multi-domain DST, some previous works study the scalability of the model (Wu et al., 2019), some aim to fully utilize the dialogue history and context (Shan et al., 2020; Chen et al., 2020a; Quan and Xiong, 2020), and some attempt to explore the relationship between different slots (Hu et al., 2020; Chen et al., 2020b). Nevertheless, existing approaches generally predict the dialogue state at every turn from scratch. The overwhelming majority of the slots in each turn should simply inherit the slot values from the previous turn. Therefore, treating all slots equally in each turn is not only inefficient but may also introduce additional errors through redundant slot value generation.
To address this problem, we propose a DSS-DST which consists of the Dual Slot Selector based on the current turn dialogue, and the Slot Value Generator based on the dialogue history. At each turn, all slots are judged by the Dual Slot Selector first, and only the selected slots are permitted to enter the Slot Value Generator to update their slot value, while the other slots directly inherit the slot value from the previous turn. The Dual Slot Selector is a two-stage judging process. It consists of a Preliminary Selector and an Ultimate Selector, which jointly make a judgment for each slot according to the current turn dialogue. The intuition behind this design is that the Preliminary Selector makes a coarse judgment to exclude most of the irrelevant slots, and then the Ultimate Selector makes an intensive judgment for the slots selected by the Preliminary Selector and combines its confidence with the confidence of the Preliminary Selector to yield the final decision. Specifically, the Preliminary Selector briefly touches on the relationship of current turn dialogue utterances and each slot. Then the Ultimate Selector obtains a temporary slot value for each slot and calculates its reliability. The rationale for the Ultimate Selector is that if a slot value with high reliability can be obtained through the current turn dialogue, then the slot ought to be updated. Eventually, the selected slots enter the Slot Value Generator and a hybrid way of the extractive method and the classification-based method is utilized to generate a value according to the current dialogue utterances and dialogue history.
Our proposed DSS-DST achieves state-of-the-art joint accuracy on three of the most actively studied datasets: MultiWOZ 2.0, MultiWOZ 2.1 (Eric et al., 2019), and MultiWOZ 2.2 (Zang et al., 2020), with joint accuracy of 56.93%, 60.73%, and 58.04%. These results outperform the previous state of the art by +2.54%, +5.43%, and +6.34%, respectively. Furthermore, a series of ablation studies and analyses are conducted to demonstrate the effectiveness of the proposed method.
Our contributions in this paper are threefold:
• We devise an effective DSS-DST, which consists of the Dual Slot Selector based on the current turn dialogue and the Slot Value Generator based on the dialogue history, to alleviate redundant slot value generation.
• We propose two complementary conditions as the basis of the judgment, which significantly improves the performance of the slot selection.
• Empirical results show that our model achieves state-of-the-art performance with significant improvements.

Related Work
Traditional statistical dialogue state tracking models combine semantics extracted by spoken language understanding modules to predict the current dialogue state (Williams and Young, 2007; Thomson and Young, 2010; Wang and Lemon, 2013; Williams, 2014) or jointly learn speech understanding (Henderson et al., 2014c; Zilka and Jurcicek, 2015; Wen et al., 2017). With the recent development of deep learning and representation learning, most works on DST focus on encoding the dialogue context with deep neural networks and predicting a value for each possible slot (Xu and Hu, 2018; Zhong et al., 2018). For multi-domain DST, slot-value pairs are extended to domain-slot-value pairs for the target (Wu et al., 2019; Chen et al., 2020b; Hu et al., 2020; Heck et al., 2020). These models greatly improve the performance of DST, but the mechanism of treating slots equally is inefficient and may lead to additional errors. SOM-DST (Kim et al., 2020) considered the dialogue state as an explicit fixed-size memory and proposed a selectively overwriting mechanism. Nevertheless, it arguably has limitations because it does not explicitly explore the relationship between slot selection and local dialogue information.
On the other hand, dialogue state tracking and machine reading comprehension (MRC) are similar in many respects (Gao et al., 2020). MRC tasks involve unanswerable questions, and several studies address this topic with straightforward solutions. Liu et al. (2018) appended an empty word token to the context and added a simple classification layer to the reader. Hu et al. (2019) used two types of auxiliary loss to predict plausible answers and the answerability of the question. Zhang et al. (2020c) proposed a retrospective reader that integrates both sketchy and intensive reading. Zhang et al. (2020b) proposed a verifier layer that operates on the context embedding weighted by the start and end distributions over the context word representations, concatenated with the [CLS] token representation of BERT. The slot selection and the mechanism of local reliability verification in our work are inspired by the answerability prediction in machine reading comprehension.
[Figure 1: The architecture of DSS-DST. Components: Embedding (inputs $D_t$ and $B_{t-1}$), Preliminary Selector, Ultimate Selector, Slot Value Generator.]
Figure 1 illustrates the architecture of DSS-DST. DSS-DST consists of Embedding, the Dual Slot Selector, and the Slot Value Generator. In a task-oriented dialogue system, we are given a dialogue $Dial = \{(U_1, R_1); (U_2, R_2); \ldots; (U_T, R_T)\}$ of $T$ turns, where $U_t$ represents the user utterance and $R_t$ represents the system response at turn $t$. We define the dialogue state at turn $t$ as $B_t = \{(S^j, V_t^j) \mid 1 \le j \le J\}$, where $S^j$ are the slots, $V_t^j$ are the corresponding slot values, and $J$ is the total number of such slots. Following previous work, we use the term "slot" to refer to the concatenation of a domain name and a slot name (e.g., "restaurant-food").
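To make the notation concrete, the following minimal Python sketch represents a dialogue state as a mapping from slots to values. The three slot names, the value "italian", and the `NONE` placeholder are illustrative assumptions, not taken from the dataset ontology.

```python
from typing import Dict, List

NONE = "none"  # assumed placeholder for a slot that has no value yet


def make_slot(domain: str, slot_name: str) -> str:
    """Concatenate a domain name and a slot name into a single 'slot'."""
    return f"{domain}-{slot_name}"


# Illustrative slots only; the real slot inventory comes from the dataset.
slots: List[str] = [
    make_slot("restaurant", "food"),
    make_slot("hotel", "area"),
    make_slot("taxi", "destination"),
]

# B_t as a mapping from each slot S^j to its value V_t^j.
state_t: Dict[str, str] = {slot: NONE for slot in slots}
state_t["restaurant-food"] = "italian"
```

A plain dictionary is enough here because the state has a fixed set of slots and exactly one value per slot at each turn.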

Embedding
We employ the representation of the previous turn dialogue state $B_{t-1}$ concatenated to the representation of the current turn dialogue $D_t$ as input:
$$X_t = [\text{CLS}] \oplus D_t \oplus B_{t-1},$$
where [CLS] is a special token added in front of every turn input. Following SOM-DST (Kim et al., 2020), we denote the representation of the dialogue at turn $t$ as $D_t = R_t \oplus ; \oplus U_t \oplus [\text{SEP}]$, where $R_t$ is the system response and $U_t$ is the user utterance. ";" is a special token used to mark the boundary between $R_t$ and $U_t$, and [SEP] is a special token used to mark the end of a dialogue turn. The representation of the dialogue state at turn $t$ is $B_t = B_t^1 \oplus \ldots \oplus B_t^J$, where $B_t^j = [\text{SLOT}]^j \oplus S^j \oplus - \oplus V_t^j$ is the representation of the j-th slot-value pair. "-" is a special token used to mark the boundary between a slot and a value.
$[\text{SLOT}]^j$ is a special token that represents the aggregated information of the j-th slot-value pair. We feed the input $X_t$ to a pre-trained ALBERT (Lan et al., 2019) encoder. Specifically, the input text is first tokenized into subword tokens. For each token, the input is the sum of the token embedding and the segment id embedding. For the segment id, we use 0 for the tokens that belong to $B_{t-1}$ and 1 for the tokens that belong to $D_t$.
The output representation of the encoder is $O_t \in \mathbb{R}^{|X_t| \times d}$, and $h_t^{[\text{CLS}]}, h_t^{[\text{SLOT}]^j} \in \mathbb{R}^d$ are the outputs that correspond to [CLS] and $[\text{SLOT}]^j$, respectively. To obtain the representation of the dialogue and of the state, we split $O_t$ into $H_t$ and $H_{t-1}^B$, the output representations of the dialogue at turn $t$ and the dialogue state at turn $t-1$.
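The input construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: whitespace splitting stands in for subword tokenization, the function name `build_input` is hypothetical, the dialogue is placed before the state, and [CLS] is given segment id 1; the text does not fix the last two choices.

```python
from typing import Dict, List, Tuple


def build_input(
    system_resp: str, user_utt: str, prev_state: Dict[str, str]
) -> Tuple[List[str], List[int]]:
    """Return (tokens, segment_ids) for one turn.

    Whitespace splitting is a stand-in for the encoder's subword tokenizer.
    """
    # D_t: system response, ";" boundary token, user utterance, [SEP].
    dialogue = system_resp.split() + [";"] + user_utt.split() + ["[SEP]"]

    # B_{t-1}: one "[SLOT] slot - value" block per slot-value pair.
    state: List[str] = []
    for slot, value in prev_state.items():
        state += ["[SLOT]", slot, "-"] + value.split()

    # Assumed ordering: [CLS], then the dialogue, then the previous state.
    tokens = ["[CLS]"] + dialogue + state

    # Segment id 1 for dialogue tokens (and, by assumption, [CLS]); 0 for state tokens.
    segment_ids = [1] * (1 + len(dialogue)) + [0] * len(state)
    return tokens, segment_ids


tokens, segment_ids = build_input(
    "hello", "i want italian food", {"restaurant-food": "none"}
)
```

Each [SLOT] position in `tokens` is where the encoder output for the corresponding slot would be read off.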

Dual Slot Selector
The Dual Slot Selector consists of a Preliminary Selector and an Ultimate Selector, which jointly make a judgment for each slot according to the current turn dialogue.
Slot-Aware Matching
Here we first describe the Slot-Aware Matching (SAM) layer, which is used by the subsequent components. A slot can be regarded as a special category of question, so, inspired by the previous success of explicit attention matching between passage and question in MRC (Kadlec et al., 2016; Dhingra et al., 2017; Wang et al., 2017; Seo et al., 2016), we feed a representation $H$ and the output representation $h_t^{[\text{SLOT}]^j}$ at turn $t$ to the Slot-Aware Matching layer by taking the slot representation as the attention to the representation $H$:
$$\text{SAM}(H, j, t) = \text{softmax}\left(H \left(h_t^{[\text{SLOT}]^j}\right)^{\top}\right)$$
The output represents the correlation between each position of $H$ and the j-th slot at turn $t$.
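A minimal pure-Python sketch of such a matching layer follows, assuming the matching is a softmax over the dot products between each position of H and the slot vector. The function names and the toy vectors are illustrative, and no learned parameters are involved.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def slot_aware_matching(H: List[List[float]], h_slot: List[float]) -> List[float]:
    """Correlation of each position of H (N x d) with a slot vector (d).

    Assumed form: softmax over the N dot products H[m] . h_slot.
    """
    scores = [sum(a * b for a, b in zip(row, h_slot)) for row in H]
    return softmax(scores)


# Toy example with N = 3 positions and d = 2.
alpha = slot_aware_matching([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 1.0])
```

The result is one weight per position, summing to one, with the largest weight on the position whose representation aligns best with the slot.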
Preliminary Selector
The Preliminary Selector briefly touches on the relationship between the current turn dialogue utterances and each slot to make an initial judgment. For the j-th slot ($1 \le j \le J$) at turn $t$, we feed its output representation $h_t^{[\text{SLOT}]^j}$ and the dialogue representation $H_t$ to the SAM as follows:
$$\alpha_t^j = \text{SAM}(H_t, j, t),$$
where $\alpha_t^j \in \mathbb{R}^{N \times 1}$ denotes the correlation between each position of the dialogue and the j-th slot at turn $t$. Then we obtain the aggregated dialogue representation $H_t^j \in \mathbb{R}^{N \times d}$ and pass it to a fully connected layer to get the j-th slot's classification logits $\hat{y}_t^j$, from which the Preliminary Selector's score $\text{Pre\_score}_t^j$ is derived. We denote by $U_{1,t}$ the set of indices of the slots selected by the Preliminary Selector, and by $J_{1,t} = |U_{1,t}|$ its size.
Ultimate Selector
The Ultimate Selector makes the judgment on the slots in $U_{1,t}$. Its mechanism is to obtain a temporary slot value for each slot and to calculate the reliability of that value from the dialogue at turn $t$, which serves as its confidence for the slot. Specifically, for the j-th slot in $U_{1,t}$ ($1 \le j \le J_{1,t}$), we first attempt to obtain the temporary slot value $\phi_t^j$ using the extractive method. We employ two different linear layers and feed $H_t$ as the input to obtain the representations $H_t^s$ and $H_t^e$ for predicting the start and end, respectively. Then we feed them to the SAM with the j-th slot to obtain the correlation representations $\alpha_t^{s,j}$ and $\alpha_t^{e,j}$ as follows:
$$\alpha_t^{s,j} = \text{SAM}(H_t^s, j, t), \qquad \alpha_t^{e,j} = \text{SAM}(H_t^e, j, t).$$
The positions of the maximum values in $\alpha_t^{s,j}$ and $\alpha_t^{e,j}$ are the start and end predictions of $\phi_t^j$:
$$p_t^{s,j} = \arg\max_m \alpha_{t,m}^{s,j}, \qquad p_t^{e,j} = \arg\max_m \alpha_{t,m}^{e,j},$$
so that $\phi_t^j$ is the span of the current turn dialogue from position $p_t^{s,j}$ to position $p_t^{e,j}$.
Here we define $V^j$ as the candidate value set of the j-th slot. If $\phi_t^j$ belongs to $V^j$, we calculate its proportion of all possible extracted temporary slot values and calculate $\text{Ult\_score}_t^j$ as the score of the j-th slot:
$$\text{Ult\_score}_t^j = \text{logit\_span}_t^j - \text{logit\_null}_t^j$$
If $\phi_t^j$ does not belong to $V^j$, we employ the classification-based method instead to select a temporary slot value from $V^j$. Specifically, the dialogue representation $H_t^j$ is passed to a fully connected layer to get the distribution over $V^j$. We choose the candidate slot value corresponding to the maximum value as the new temporary slot value $\phi_t^j$.
Threshold-based decision
Following previous studies (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2019), we adopt a threshold-based decision to make the final judgment for each slot in $U_{1,t}$. A slot selection threshold $\delta$ is set in our model. The total score of the j-th slot is the combination of the predicted Preliminary Selector's score and the predicted Ultimate Selector's score:
$$\text{Total\_score}_t^j = \beta\,\text{Pre\_score}_t^j + (1-\beta)\,\text{Ult\_score}_t^j \tag{19}$$
where $\beta$ is the weight. We define the set of the slot indices as $U_{2,t} = \{j \mid \text{Total\_score}_t^j > \delta\}$, and its size as $J_{2,t} = |U_{2,t}|$. The slots in $U_{2,t}$ enter the Slot Value Generator to update their slot values.
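The two decision steps above, picking a span from start and end correlations and combining the two selector scores against a threshold, can be sketched as follows. This is an illustration only: the function names, the inclusive end position, the empty result when the end precedes the start, and the toy scores are assumptions, and the scores are supplied by hand rather than produced by trained selectors.

```python
from typing import Dict, List, Set


def extract_span(
    tokens: List[str], alpha_start: List[float], alpha_end: List[float]
) -> List[str]:
    """Pick the span whose start and end are the argmax positions."""
    start = max(range(len(alpha_start)), key=alpha_start.__getitem__)
    end = max(range(len(alpha_end)), key=alpha_end.__getitem__)
    if end < start:  # assumed handling of an inconsistent prediction
        return []
    return tokens[start : end + 1]  # end position treated as inclusive


def select_slots(
    pre_scores: Dict[str, float],
    ult_scores: Dict[str, float],
    beta: float,
    delta: float,
) -> Set[str]:
    """Return the slots whose weighted total score exceeds delta.

    ult_scores holds entries only for slots that passed the Preliminary Selector.
    """
    selected: Set[str] = set()
    for slot, ult in ult_scores.items():
        total = beta * pre_scores[slot] + (1.0 - beta) * ult
        if total > delta:
            selected.add(slot)
    return selected


span = extract_span(
    ["i", "want", "italian", "food"], [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.2, 0.6]
)
chosen = select_slots(
    {"a": 2.0, "b": 1.0, "c": -3.0}, {"a": 1.0, "b": -4.0}, beta=0.55, delta=0.0
)
```

In the toy call, slot "c" never reaches the second stage, and slot "b" is rejected because its low second-stage score outweighs its positive first-stage score.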

Slot Value Generator
After the judgment of the Dual Slot Selector, the slots in $U_{2,t}$ are the final selected slots. For each slot in $U_{2,t}$, the Slot Value Generator generates a value. Conversely, the slots that are not in $U_{2,t}$ inherit the slot value of the previous turn (i.e., $V_t^j = V_{t-1}^j$). For the sake of simplicity, we describe the process only briefly, because this module uses the same hybrid of the extractive method and the classification-based method as the Ultimate Selector. Notably, the biggest difference between the Slot Value Generator and the Ultimate Selector is that the input utterances of the Slot Value Generator are the dialogues of the previous $k-1$ turns and the current turn, while the Ultimate Selector only uses the current turn dialogue as the input utterances.
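The update-or-inherit rule can be written as a short Python sketch. The function name `update_state` and the `generate_value` callback are hypothetical stand-ins for the Slot Value Generator; only the control flow is meant to reflect the description above.

```python
from typing import Callable, Dict, Set


def update_state(
    prev_state: Dict[str, str],
    selected: Set[str],
    generate_value: Callable[[str], str],
) -> Dict[str, str]:
    """Build the state at turn t from the state at turn t-1.

    Selected slots get a freshly generated value; all others inherit.
    """
    new_state: Dict[str, str] = {}
    for slot, prev_value in prev_state.items():
        if slot in selected:
            new_state[slot] = generate_value(slot)  # stand-in for the generator
        else:
            new_state[slot] = prev_value  # inherit from the previous turn
    return new_state


prev = {"restaurant-food": "none", "hotel-area": "north"}
new = update_state(prev, {"restaurant-food"}, lambda slot: "italian")
```

Because unselected slots are copied unchanged, the generator is only invoked for the slots that were selected.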

Optimization
During training, we optimize both Dual Slot Selector and Slot Value Generator.

Preliminary Selector
We use cross-entropy as the training objective:
$$L_{pre,t} = -\frac{1}{J}\sum_{j=1}^{J}\left[y_t^j \log \hat{y}_t^j + \left(1 - y_t^j\right)\log\left(1 - \hat{y}_t^j\right)\right] \tag{25}$$
where $\hat{y}_t^j$ denotes the prediction and $y_t^j$ is the target indicating whether the slot is selected.
Ultimate Selector
The training objectives of both the extractive method and the classification-based method are defined as cross-entropy losses, where $\text{logit\_p}_t^j$ is the target indicating the proportion of all possible extracted temporary slot values, calculated in the form of Equation 13, and $y_{t,i}^{c,j}$ is the target indicating the probability of the candidate values.
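As a worked illustration, the sketch below computes a cross-entropy objective of this kind, assuming the standard binary form averaged over slots, with predictions given as probabilities that a slot is selected. The function name and the clipping constant `eps` are assumptions added for numerical safety.

```python
import math
from typing import List


def preliminary_loss(
    y_true: List[int], y_pred: List[float], eps: float = 1e-12
) -> float:
    """Binary cross-entropy averaged over the slots of one turn."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Two slots, both predicted with probability 0.5: the loss equals log 2.
loss = preliminary_loss([1, 0], [0.5, 0.5])
```

With uninformative predictions of 0.5 the loss is log 2, and it approaches zero as the predictions approach the targets.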

Slot Value Generator
The training objective $L_{gen,t}$ of this module has the same form as the training objective of the Ultimate Selector.

Datasets and Metrics
We choose MultiWOZ 2.0, MultiWOZ 2.1 (Eric et al., 2019), and the latest MultiWOZ 2.2 (Zang et al., 2020) as our training and evaluation datasets. These are the three largest publicly available multi-domain task-oriented dialogue datasets, including over 10,000 dialogues, 7 domains, and 35 domain-slot pairs. MultiWOZ 2.1 fixes the previously existing annotation errors. MultiWOZ 2.2 is the latest version of this dataset. It identifies and fixes the annotation errors of dialogue states in MultiWOZ 2.1, resolves the inconsistency of state updates and the problems of the ontology, and redefines the dataset by dividing all slots into two types: non-categorical and categorical. As a result, it helps make a fair comparison between different models and will be crucial for future research in this field.
Following TRADE (Wu et al., 2019), we use five domains for training, validation, and testing: restaurant, train, hotel, taxi, and attraction. These domains contain 30 slots (i.e., J = 30). We use joint accuracy and slot accuracy as evaluation metrics. Joint accuracy refers to the accuracy of the dialogue state in each turn. Slot accuracy only considers individual slot-level accuracy.
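The two metrics can be sketched in Python as follows, assuming each turn's state is a dictionary from slot to value and that a turn counts as jointly correct only when every slot matches. The function names and the toy data are illustrative.

```python
from typing import Dict, List

State = Dict[str, str]


def joint_accuracy(preds: List[State], golds: List[State]) -> float:
    """Fraction of turns whose whole predicted state equals the gold state."""
    correct = sum(1 for p, g in zip(preds, golds) if p == g)
    return correct / len(golds)


def slot_accuracy(preds: List[State], golds: List[State]) -> float:
    """Fraction of individual (turn, slot) pairs predicted correctly."""
    total = 0
    correct = 0
    for p, g in zip(preds, golds):
        for slot, value in g.items():
            total += 1
            correct += int(p.get(slot) == value)
    return correct / total


golds = [{"a": "1", "b": "2"}, {"a": "1", "b": "3"}]
preds = [{"a": "1", "b": "2"}, {"a": "1", "b": "2"}]
joint = joint_accuracy(preds, golds)
per_slot = slot_accuracy(preds, golds)
```

In this toy example one of two turns is fully correct while three of four slot predictions are correct, which shows why slot accuracy is typically much higher than joint accuracy.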

Baseline Models
We compare the performance of DSS-DST with the following competitive baselines:
• DSTreader formulates the problem of DST as an extractive QA task and extracts the value of the slots from the input as a span.
• TRADE encodes the whole dialogue context and decodes the value for every slot using a copy-augmented decoder (Wu et al., 2019).
• NADST uses a Transformer-based non-autoregressive decoder to generate the current turn dialogue state.
• PIN integrates an interactive encoder to jointly model the in-turn dependencies and cross-turn dependencies (Chen et al., 2020a).
• DS-DST uses two BERT-base encoders and takes a hybrid approach.
• SAS proposes a Dialogue State Tracker with Slot Attention and Slot Information Sharing to reduce the interference of redundant information (Hu et al., 2020).
• SOM-DST considers the dialogue state as an explicit fixed-size memory and proposes a selectively overwriting mechanism (Kim et al., 2020).
• DST-Picklist performs matchings between candidate values and slot-context encoding by considering all slots as picklist-based slots.
• SST proposes a schema-guided multi-domain dialogue state tracker with graph attention networks (Chen et al., 2020b).
• TripPy extracts all values from the dialog context by three copy mechanisms (Heck et al., 2020).

Training
We employ a pre-trained ALBERT-large-uncased model (Lan et al., 2019) for the encoder of each part. The hidden size of the encoder d is 1024. We use the AdamW optimizer (Loshchilov and Hutter, 2018) and set the warmup proportion to 0.01 and the L2 weight decay to 0.01. We set the peak learning rate to 0.03 for the Preliminary Selector and to 0.0001 for both the Ultimate Selector and the Slot Value Generator. Max-gradient normalization is used, with the gradient clipping threshold set to 0.1. We use a batch size of 8 and set the dropout (Srivastava et al., 2014) rate to 0.1. In addition, we use word dropout (Bowman et al., 2016), randomly replacing input tokens with the special [UNK] token with a probability of 0.1. The max sequence length for all inputs is fixed to 256.
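Word dropout as described here can be sketched in a few lines of Python. The function name and the seeded generator are illustrative, and the sketch applies dropout to every token; whether special tokens such as [CLS] or [SLOT] are exempt is not stated, so that is left as an assumption.

```python
import random
from typing import List, Optional


def word_dropout(
    tokens: List[str], p: float = 0.1, rng: Optional[random.Random] = None
) -> List[str]:
    """Replace each token with [UNK] independently with probability p."""
    rng = rng or random.Random()
    # Assumption: all tokens are eligible, including special tokens.
    return ["[UNK]" if rng.random() < p else tok for tok in tokens]


example = word_dropout(
    ["i", "want", "italian", "food"], p=0.1, rng=random.Random(0)
)
```

Passing a seeded `random.Random` makes the replacement pattern reproducible across runs, which is convenient when comparing seeds.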
We train the Preliminary Selector for 10 epochs and the Ultimate Selector and the Slot Value Generator for 30 epochs. When training the Slot Value Generator, we use the ground-truth selected slots instead of the predicted ones. We set k to 2, β to 0.55, and δ to 0. For all experiments, we report the mean joint accuracy over 10 different random seeds to reduce statistical errors.
Table 1 shows the joint accuracy and the slot accuracy of our model and the baselines on the test sets of MultiWOZ 2.0, 2.1, and 2.2. As shown in the table, our model achieves state-of-the-art performance on all three datasets, with joint accuracy of 56.93%, 60.73%, and 58.04%, a significant improvement over the previous best joint accuracy. In particular, the joint accuracy on MultiWOZ 2.1 exceeds 60%. Although few experimental results are available on MultiWOZ 2.2, our model still leads the existing public models by a large margin. Similar to Kim et al. (2020), our model achieves higher joint accuracy on MultiWOZ 2.1 than on MultiWOZ 2.0. For MultiWOZ 2.2, the joint accuracy of categorical slots is higher than that of non-categorical slots. This is because we use the hybrid of the extractive method and the classification-based method for categorical slots, whereas we can only use the extractive method for non-categorical slots, since they have no ontology (i.e., candidate value set).

Ablation Study
Pre-trained Language Model
For a fair comparison, we employ different pre-trained language models of different scales as encoders for training and testing on the MultiWOZ 2.1 dataset. As shown in Table 2, the joint accuracy with the other ALBERT and BERT encoders decreases to varying degrees. In particular, the joint accuracy with BERT-base-uncased decreases by 1.38%, but it still outperforms the previous state of the art on MultiWOZ 2.1. This result demonstrates the effectiveness of DSS-DST.

Separate Slot Selector
To explore the effectiveness of the Preliminary Selector and the Ultimate Selector individually, we conduct an ablation study of the two slot selectors on MultiWOZ 2.1. As shown in Table 3, the separate Preliminary Selector performs better than the separate Ultimate Selector. This is presumably because the Preliminary Selector is the head of the Dual Slot Selector and is stable when it handles all slots. In contrast, the input of the Ultimate Selector is the set of slots selected by the Preliminary Selector, and its function is to make a refined judgment. Therefore, it is more vulnerable when handling all the slots independently. In addition, when both selectors are removed, the performance drops drastically. This demonstrates that slot selection is integral before slot value generation.
Dialogue History for the Dual Slot Selector
As mentioned above, we consider that slot selection depends only on the current turn dialogue. To verify this, we attach the dialogue of the previous turn to the current turn dialogue as the input of the Dual Slot Selector. We observe in Table 4 that the joint accuracy decreases by 2.37%, which implies that the redundant information in the dialogue history confuses the slot selection in the current turn.

Dialogue History for the Slot Value Generator
We try values of k from one to three to observe the influence of the selected dialogue history on the Slot Value Generator. As shown in Table 5, the model achieves better performance on MultiWOZ 2.1 with k = 2 or k = 3 than with k = 1. Furthermore, the performance with k = 2 is better than that with k = 3. We conjecture that dialogue history far from the current turn is of little help, because the relevance between two sentences in a dialogue is strongly related to their positions. The above ablation studies show that dialogue history confuses the Dual Slot Selector, but it plays a crucial role in the Slot Value Generator. This demonstrates that there are fundamental differences between the two processes, and confirms the necessity of dividing DST into these two sub-tasks.
Table 7 shows the domain-specific results of our model on the latest MultiWOZ 2.2 dataset. We observe that the performance of our model in the taxi domain is lower than in the other four domains. We investigate the dataset and find that all the slots in the taxi domain are non-categorical slots. The lower performance is explained by the fact that we can only use the extractive method for non-categorical slots, since they have no ontology. Furthermore, we test the performance of using only the classification-based method for categorical slots. As illustrated in Table 8, the overall joint accuracy of our model and the joint accuracy of categorical slots decrease by 8.03% and 10.17%, respectively.

Conclusion
We introduce an effective two-stage DSS-DST, which consists of the Dual Slot Selector, based on the current turn dialogue, and the Slot Value Generator, based on the dialogue history. The Dual Slot Selector determines, for each slot, whether to update or to inherit its value based on the two conditions. The Slot Value Generator employs a hybrid method to generate new values for the slots selected to be updated according to the dialogue history. Our model achieves state-of-the-art performance of 56.93%, 60.73%, and 58.04% joint accuracy, with significant improvements (+2.54%, +5.43%, and +6.34%) over the previous best results on the MultiWOZ 2.0, MultiWOZ 2.1, and MultiWOZ 2.2 datasets, respectively. The hybrid method is a promising research direction, and we will explore a more comprehensive and efficient hybrid method for slot value generation in the future.