Scalable-DSC: A Structural Template Prompt Approach to Scalable Dialogue State Correction



Introduction
Task-oriented dialogue systems are becoming increasingly important in facilitating people's daily lives. The Dialogue State Tracking (DST) module plays an important role in dialogue systems (Wang and Lemon, 2013a): it aims to accurately track the user's goals based on the dialogue history and represents the dialogue states as a set of (slot, value) pairs.

Figure 1: An example of error propagation in the DST task. $S_1$ and $U_1$ refer to the system utterance and the user utterance of the first turn, respectively. $B_1^{pred}$ and $B_1^{label}$ represent the dialogue state predicted by the model for the first turn and the ground-truth dialogue state of the first turn. $C_1$ refers to the first turn of the dialogue history. The remaining symbols follow the same pattern. Red characters indicate errors.
Currently, some DST approaches (Ye et al., 2021b; Tian et al., 2021; Zhou et al., 2021; Xu et al., 2023) track the user's goal from the dialogue state of the previous turn, which serves as a compact representation of the dialogue history (Kim et al., 2020), together with the dialogue utterance of the current turn. These approaches improve the efficiency of DST. However, errors generated in the current turn are likely to be carried over to the next turn. Figure 1 illustrates an example of error propagation: the predicted dialogue state $B_1^{pred}$ in the first turn, which contains wrong slot values, influences the prediction $B_2^{pred}$ of the DST model in the second turn. Thus, recent studies (Tian et al., 2021; Xie et al., 2022) focus on correcting wrong slot values in predicted dialogue states to mitigate error propagation. However, the correction modules of these approaches are deeply intertwined with specific DST models, limiting their applicability to other DST models. Moreover, to mitigate the historical-context mismatch between training and inference, these approaches use certain strategies to generate predicted dialogue states, which may contain wrong slot values, as training data. They nevertheless ignore the inconsistency between the error distribution of the dialogue states generated by these strategies during training and that of the dialogue states predicted by the DST model during inference, which limits the correction capability of the model.
To address the above problems, we propose Scalable-DSC, a standalone dialogue state correction model whose responsibility is to correct the wrong slot values in the dialogue state predicted by any DST model. Firstly, to enhance the scalability of Scalable-DSC, we propose the Structural Template Prompt (STP) approach, whose input schema consists of four components: (1) Dialogue History provides Scalable-DSC with real dialogue information, serving as crucial evidence for correcting the dialogue state; (2) State Sequence: using heuristic scripts, the predicted dialogue state of any DST model is transformed into a natural language sequence, which is treated as part of the model's historical context and from which a dialogue state sequence with corrected slot values is then generated; (3) Slot Options enable the model to associate predefined template content with the relevant slot descriptions, making the model aware of which slot information needs to be selected; and (4) Template Options prompt the model to select relevant template content for controlled conditional generation. This generation approach forces Scalable-DSC to first generate a sequence of erroneous states and then continue generating a sequence of corrected states. Secondly, we employ two training strategies to optimize Scalable-DSC in a staged manner. Strategy 1: during training, we use a predictive state simulator (Xie et al., 2022) to dynamically generate predicted dialogue states, which contain different wrong slot values in each epoch, to enhance the model's generalizability. Strategy 2: we use a DST model to predict the dialogue states that serve as training data for Scalable-DSC. In addition, to provide predictive state data during Scalable-DSC training, we propose a DST model based on the STP approach (STP-DST), as shown in Figure 2.

We conducted extensive experiments on MultiWOZ 2.0-2.4 (Budzianowski et al., 2018; Eric et al., 2020; Zang et al., 2020; Han et al., 2021; Ye et al., 2021a), and the results indicate that Scalable-DSC significantly improves the performance of STP-DST by correcting dialogue state errors and achieves a new state of the art. The contributions of this paper are as follows: (1) We propose Scalable-DSC, a standalone dialogue state correction model that can universally correct the predicted results of different DST models, adaptively infer errors in the dialogue state sequence, and then generate a complete and correct sequence.
(2) We introduce the Structural Template Prompt (STP) approach. Through structured prompt texts and the template prompting mechanism, it controls what Scalable-DSC needs to correct, what it needs to associate, and what it needs to generate.
(3) We train Scalable-DSC in stages, using predicted dialogue states generated by a predictive state simulator and predicted dialogue states produced by a DST model. This not only enhances the model's generalization ability but also alleviates the inconsistency between the error distributions seen during training and inference.
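As a minimal illustration of this staged schedule, the sketch below assumes three hypothetical callables: simulate_states (standing in for the predictive state simulator of Xie et al. (2022)), dst_predict (any trained DST model), and train_one_epoch (a standard seq2seq training loop); the epoch counts are placeholders.

```python
def train_scalable_dsc(model, dialogues, simulate_states, dst_predict,
                       train_one_epoch, stage1_epochs=5, stage2_epochs=5):
    # Stage 1 (Strategy 1): the simulator injects different wrong slot values
    # into the gold states at every epoch, broadening the error coverage.
    for _ in range(stage1_epochs):
        noisy_states = [simulate_states(d) for d in dialogues]
        train_one_epoch(model, dialogues, noisy_states)

    # Stage 2 (Strategy 2): predicted states now come from a real DST model,
    # so the training error distribution matches the one seen at inference.
    dst_states = [dst_predict(d) for d in dialogues]
    for _ in range(stage2_epochs):
        train_one_epoch(model, dialogues, dst_states)
```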

Methodology
Figure 2 illustrates the overall framework for correcting erroneous slot values in dialogue states using Scalable-DSC. In the preprocessing stage, Strategy 1 employs a predictive state simulator to simulate predicted dialogue states on the training set, while Strategy 2 utilizes a DST model to generate predicted dialogue states on the training set. In the training stage, we use the predicted dialogue state sequences $B_t^{pred\_seq}$ produced in preprocessing as part of the input schema of the STP approach in Scalable-DSC, and generate the corrected target sequence $B_t^{error\_seq} \oplus B_t^{correct\_seq}$. In the inference stage, Scalable-DSC corrects the prediction results of the DST model on the test set.
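The inference-time correction step can be summarized by the following sketch. Here state_to_sequence and sequence_to_state are hypothetical stand-ins for the heuristic script described in Section 2.1, SLOT_OPTIONS and TEMPLATE_OPTIONS stand for the prompt texts, and the output parsing assumes the two-part generation format described below.

```python
def correct_state(dsc_model, tokenizer, history_text, predicted_state):
    """One correction step at inference time (a sketch)."""
    state_seq = state_to_sequence(predicted_state)   # heuristic script
    x = (f"Dialogue History: {history_text} State Sequence: {state_seq} "
         f"Slot Options: {SLOT_OPTIONS} Template Options: {TEMPLATE_OPTIONS}")
    ids = tokenizer(x, return_tensors="pt", truncation=True).input_ids
    out = dsc_model.generate(ids, max_length=512)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    # Output form: "The following dialogue status is incorrect <errors>
    # Actually <full corrected sequence>" -- keep only the corrected part.
    corrected_seq = text.split("Actually", 1)[-1].strip()
    return sequence_to_state(corrected_seq)          # inverse heuristic script
```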

The Input Schema of STP Approach
Figure 2 illustrates the input schema of the STP approach in Scalable-DSC. The schema includes Dialogue History, State Sequence, Slot Options, and Template Options. Each element is described as follows:

Dialogue History: Because the dialogue history provides real information for generating dialogue states, we use it to help Scalable-DSC correct dialogue states that contain erroneous slot values. STP-DST, for its part, relies on the dialogue history and the previous dialogue state to predict the current dialogue state. The Dialogue History for the t-th turn, denoted $H_t$, consists of the dialogue utterances up to and including the t-th turn, where $U_t$ and $S_t$ respectively refer to the user's utterance and the system's utterance in the t-th turn. We use the special tokens [user] and [sys] as prefixes for user utterance sequences and system utterance sequences, respectively. The prefix for the current turn's utterance sequence is [CURR_TURN], and [TURN] separates the utterances of each turn.
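A minimal sketch of this serialization is given below; the exact placement of the special tokens is our assumption, as the text only names the tokens.

```python
SYS, USER, TURN, CURR = "[sys]", "[user]", "[TURN]", "[CURR_TURN]"

def serialize_history(turns):
    """turns: (system_utterance, user_utterance) pairs, oldest turn first."""
    parts = []
    for i, (sys_utt, usr_utt) in enumerate(turns):
        prefix = CURR if i == len(turns) - 1 else TURN
        parts.append(f"{prefix} {SYS} {sys_utt} {USER} {usr_utt}")
    return " ".join(parts)

# serialize_history([("Hi, how may I help you?",
#                     "I need to book a room at autumn house")])
# -> '[CURR_TURN] [sys] Hi, how may I help you? [user] I need to book ...'
```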
State Sequence: We convert dialogue states into natural language sequences using the heuristic script of DS2 (Shin et al., 2022). Figure 3 shows an illustrative example. Given the slot-value pair "(hotel-people, 6)", since the slot value is not "dontcare", the slot name "hotel-people" retrieves the corresponding fragment "for ___ people" from the slot sequence fragments. Filling the missing part with the slot value "6" yields "for 6 people"; prepending the common sequence fragment "The user is looking for a place to stay" gives the final sequence "The user is looking for a place to stay for 6 people" (a code sketch of this conversion appears at the end of this subsection). When the slot value is "dontcare", the slot name is instead looked up in the Dontcare slot sequence fragments. Moreover, we define the dialogue state at turn t as $B_t = \{(S_j, V_t^j)\}_{j=1}^{J}$, where $V_t^j$ is the corresponding value of the slot $S_j$, and $J$ is the size of the set of predefined slots; $S_j$ refers to the domain and slot names connected by "-". $B_t^{pred}$ in Figure 2 represents the predicted dialogue state for the t-th turn, while $B_t^{pred\_seq}$ denotes its natural language sequence form.

Template Options: We concatenate all the sequence fragments in order (the common sequence fragment and the slot sequence fragments) to obtain the template options, providing the model with more granular guidance. We replace the missing slot-value portions in the fragments with the special tokens "[value_i]" (i = 1, ..., J). Notably, we use the slot value "any" as a replacement for the "dontcare" slot value. The benefit is that the template options no longer need to include the Dontcare slot sequence fragment of each slot, which reduces the input length of the template sequence. In Figure 2, $T_{DSC}$ and $T_{DST}$ respectively represent the template options for Scalable-DSC and STP-DST. In particular, $T_{DSC}$ adds a sequence at the beginning of $T_{DST}$: "(The following dialogue status is incorrect | Actually)". This sequence guides Scalable-DSC to first infer the incorrect dialogue state sequence and then correct it.

Slot Options: We establish the connection between the detailed description of the i-th slot and the i-th masked slot-value position in the template options by using the same special token "[value_i]" (i = 1, ..., J). The advantage is that when the model needs to fill in the slot values masked by special tokens in the template, it can associate them with the corresponding slot's detailed description through the special tokens. This helps the model comprehend the semantics of the masked slot-value position and enables more accurate text completion. The slot options for all slots are represented as $O = desc_1 \oplus \dots \oplus desc_J$, where $desc_i$ is the concatenation of the domain name of the i-th slot and its detailed description. For example, the detailed description of the slot "restaurant-food" is: "restaurant - the cuisine of the restaurant you are looking for".
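The following sketch illustrates the State Sequence conversion and the "dontcare" → "any" replacement on the hotel domain; the two template fragments are hypothetical stand-ins for the full DS2-style script, which covers every slot of the five MultiWOZ domains.

```python
COMMON_FRAGMENT = "The user is looking for a place to stay"
SLOT_FRAGMENTS = {
    "hotel-people": "for [value] people",
    "hotel-pricerange": "with a [value] price",
}

def state_to_sequence(state):
    """Convert (slot, value) pairs into a natural language state sequence."""
    pieces = [COMMON_FRAGMENT]
    for slot, value in state:
        value = "any" if value == "dontcare" else value  # 'dontcare' -> 'any'
        pieces.append(SLOT_FRAGMENTS[slot].replace("[value]", value))
    return " ".join(pieces)

print(state_to_sequence([("hotel-people", "6")]))
# The user is looking for a place to stay for 6 people
```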

Sequence Generation in STP Approach
The model automatically selects relevant template fragments from the template options and replaces the special tokens "[value_i]" in the fragments with actual slot values to generate a corrected dialogue state natural language sequence. The generation process of Scalable-DSC proceeds in two steps: (1) generate the phrase "The following dialogue status is incorrect" followed by the incorrect parts of the state sequence, which we denote $B_t^{error\_seq}$; (2) generate the phrase "Actually" followed by a complete and correct state sequence $B_t^{correct\_seq}$. The model is trained with a maximum likelihood objective. Given a training sample $d = \{X, Y\}$, where $X$ is the encoder context of the model and $Y$ is the target output text, the objective is $\mathcal{L}(\Theta) = -\sum \log P(Y \mid X; \Theta)$, where $\Theta$ represents the parameters of the model. As shown in Figure 2, the encoder context of STP-DST is defined as $X_{DST} = \text{Dialogue History:} \oplus H_t \oplus \text{State Sequence:} \oplus B_{t-1}^{pred\_seq} \oplus \text{Template Options:} \oplus T_{DST}$, and the target output $Y_{DST}$ is the state sequence of the current turn.
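A hedged sketch of one training step with Huggingface T5 is shown below; the encoder context and target are toy strings built from the Figure 2 example rather than a real preprocessed sample.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Toy sample: the target first states the erroneous sequence, then the
# corrected one, matching the two-step generation described above.
x = ("Dialogue History: ... State Sequence: The user is looking for a "
     "restaurant with moderate price. Slot Options: ... Template Options: ...")
y = ("The following dialogue status is incorrect The user is looking for a "
     "restaurant with moderate price. Actually The user is looking for a "
     "restaurant located in the west with any price, which serves british.")

inputs = tokenizer(x, return_tensors="pt", truncation=True)
labels = tokenizer(y, return_tensors="pt", truncation=True).input_ids
loss = model(**inputs, labels=labels).loss  # token-level negative log-likelihood
loss.backward()
```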

Dataset
We adopt MultiWOZ 2.0-2.4 (Budzianowski et al., 2018; Eric et al., 2020; Zang et al., 2020; Han et al., 2021; Ye et al., 2021a) as the datasets in our experiments. MultiWOZ is a multi-domain dialogue dataset consisting of dialogues between simulated users and dialogue systems. The training, validation, and test sets contain 8420, 1000, and 1000 dialogues, respectively. Following previous work (Wu et al., 2019; Lee et al., 2019; Kim et al., 2020; Le et al., 2020; Ye et al., 2021b; Wang et al., 2022), we train only on data from five domains (restaurant, hotel, attraction, taxi, train).

Evaluation Metric
We borrow and modify the DS2 (Shin et al., 2022) heuristic script to convert the model's generated state sequences back into dialogue states for evaluation. The model is automatically evaluated on two metrics (Williams et al., 2013). Joint Goal Accuracy, the primary evaluation metric for cross-domain DST tasks, is the percentage of turns for which all slots are correctly identified. Slot Accuracy considers individual slot-level accuracy, calculating the proportion of slots filled with the correct value from a macro perspective. In addition, Final Joint Goal Accuracy (Xie et al., 2022) is defined as the proportion of dialogues in which the predicted dialogue state of the last turn exactly matches the ground-truth dialogue state of the last turn.
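The three metrics can be computed as follows (a sketch; states are represented as slot-to-value dicts, with "none" for unfilled slots):

```python
def joint_goal_accuracy(pred_states, gold_states):
    """A turn counts as correct only if every slot matches exactly."""
    hits = sum(p == g for p, g in zip(pred_states, gold_states))
    return hits / len(gold_states)

def slot_accuracy(pred_states, gold_states, slot_list):
    """Fraction of (turn, slot) cells filled with the correct value."""
    correct = total = 0
    for p, g in zip(pred_states, gold_states):
        for slot in slot_list:
            total += 1
            correct += p.get(slot, "none") == g.get(slot, "none")
    return correct / total

def final_joint_goal_accuracy(last_turn_pairs):
    """last_turn_pairs: (pred, gold) states of each dialogue's final turn."""
    return sum(p == g for p, g in last_turn_pairs) / len(last_turn_pairs)
```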

Settings
We initialize the Scalable-DSC and STP-DST model parameters with pre-trained T5-base (Raffel et al., 2020) from Huggingface Transformers (Wolf et al., 2020). We set the learning rate to 5e-5 and the batch size to 6. The probability β is set to 0.06. We optimize the model parameters with the AdamW optimizer (Loshchilov and Hutter, 2019). Model training was performed on two NVIDIA RTX 3090 GPUs.
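For reference, a minimal setup reproducing these settings might look as follows (the role of β is not spelled out here, so it appears only as a named constant):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

BATCH_SIZE = 6
BETA = 0.06  # probability hyperparameter reported in the settings above
```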

Baselines
We use the following models as baselines for comparison: STAR (Ye et al., 2021b) captures slot correlations with a self-attention mechanism; DSS-DST (Guo et al., 2021) proposes a binary slot-status classifier to determine whether a slot should update its value or inherit it; FPDSC (Zhou et al., 2021) is a multi-level fusion approach that integrates the dialogue history and the previous dialogue state; LUNA (Wang et al., 2022) aligns slots with dialogue utterances; DSDN (Xu et al., 2023) dynamically exploits the previous dialogue state to track the user's goal; TripPy (Heck et al., 2020) proposes three copy mechanisms for filling slot values; SOM-DST (Kim et al., 2020) uses BERT encoding and an RNN-based copy mechanism for decoding dialogue states; MinTL (Lin et al., 2020) predicts only a subset of dialogue states; Transformer-DST (Zeng and Nie, 2020) uses a single BERT as both encoder and decoder; SimpleTOD (Hosseini-Asl et al., 2020) is an end-to-end dialogue system that formulates dialogue state tracking as a sequence task; DS2 (Shin et al., 2022) converts dialogue states into sequence texts to track the user's goals; AG-DST (Tian et al., 2021) uses a second decoding pass to correct the dialogue state; D3ST (Zhao et al., 2022) combines schema information and intent information to accomplish dialogue state tracking; Correctable-DST (Xie et al., 2022) first identifies incorrect slot information in the dialogue state and uses it to guide the model in generating the correct dialogue state.

Experiment Results

Main Results
The lower part of Table 1 shows the results of applying Scalable-DSC with different training configurations to correct the predicted dialogue states of three DST models (STP-DST, SOM-DST (Kim et al., 2020), and STAR (Ye et al., 2021b)). The results show that the correction performance of Scalable-DSC trained in stages, combining the two strategies, exceeds that of Scalable-DSC trained with a single strategy only.

Analysis
Ablation Analysis of STP Approach
We conduct ablation studies to investigate the impact of different prompts in STP. Table 2 shows the results of Scalable-DSC trained with the different training strategies to correct erroneous dialogue states predicted by STP-DST. When the slot value "dontcare" is not replaced with "any", the joint goal accuracy after correction of Scalable-DSC trained with the two strategies drops by 1.26% and 0.91%, respectively. We attribute this decline to the template options having to include the Dontcare slot sequence fragment for each slot, which not only increases the input length but also requires additional decoding of the dontcare sequences, adding pressure to the model's decoding process. When the inputs do not concatenate the slot options, performance decreases by 0.94% and 0.23%, respectively, demonstrating that slot options help the model better understand the implied meaning of the masked slot values within the template options. Furthermore, when the model is trained without template options as prompt texts, performance under the two training strategies decreases by 1.97% and 0.99%, respectively. These results show that template options enable the model to generate controlled corrective information, thereby enhancing its corrective performance.

Effectiveness Analysis of Scalable-DSC
We analyze Scalable-DSC to gain deeper insight into the reasons for its effectiveness.
We continue to employ STP-DST as the target correction model and use Scalable-DSC to correct its predicted dialogue states. Following Quan and Xiong (2020) and Xie et al. (2022), we categorize slot errors into three types: over prediction, partial prediction, and erroneous prediction. Table 3 presents the slot accuracy and the slot error rate of different model combinations on MultiWOZ 2.0-2.4. As shown, Scalable-DSC improves the slot accuracy of STP-DST, demonstrating the benefit of correction. Notably, the largest proportion of slot errors in STP-DST is partial predictions, followed by over predictions, with erroneous predictions the smallest. After correction by Scalable-DSC, the proportions of all three slot error types decrease. In particular, Scalable-DSC primarily corrects over prediction errors in STP-DST, verifying that its improvements mainly come from correcting this error type. The corresponding results for SOM-DST and STAR are shown in Appendix F. We also observe, however, that for the other two error types (partial prediction and erroneous prediction), Scalable-DSC brings less benefit.
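Under one common reading of this taxonomy (over = a value predicted for a slot the user never constrained, partial = a constrained slot left unfilled, erroneous = a wrong value for a constrained slot), the per-turn counts can be sketched as follows:

```python
def categorize_slot_errors(pred, gold, none_value="none"):
    """Count the three error types for one turn; pred/gold map slot -> value."""
    counts = {"over": 0, "partial": 0, "erroneous": 0}
    for slot in set(pred) | set(gold):
        p = pred.get(slot, none_value)
        g = gold.get(slot, none_value)
        if p == g:
            continue
        if g == none_value:        # predicted a slot the user never mentioned
            counts["over"] += 1
        elif p == none_value:      # failed to predict a mentioned slot
            counts["partial"] += 1
        else:                      # predicted the slot but with a wrong value
            counts["erroneous"] += 1
    return counts
```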

Error Analysis of Scalable-DSC
The experiments above have shown the effectiveness of Scalable-DSC. Here we further analyze the correction quality to gain more insight into why it is beneficial. We first investigate whether Scalable-DSC introduces new errors when correcting the dialogue states predicted by STP-DST (i.e., whether it changes correct slot values into incorrect ones).
To this end, we count the slots correctly predicted by STP-DST but erroneously modified by Scalable-DSC; the "AE" column in Table 4 reports this quantity. We observe that Scalable-DSC does introduce new slot errors during correction, although their number is small. Next, we investigate whether correction by Scalable-DSC reduces the number of erroneous slot predictions made by the DST model. The results in the "RE" column of Table 4 show that Scalable-DSC removes a substantial number of erroneous slots.
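The "AE" and "RE" counts can be computed per turn as in the sketch below, using the column definitions from Table 4:

```python
def correction_effects(dst_state, dsc_state, gold_state, slot_list):
    """Per-turn counts behind the 'AE' and 'RE' columns of Table 4 (a sketch)."""
    ae = re = 0
    for slot in slot_list:
        d = dst_state.get(slot, "none")    # before correction
        c = dsc_state.get(slot, "none")    # after correction
        g = gold_state.get(slot, "none")   # ground truth
        if d == g and c != g:
            ae += 1   # newly introduced error: a correct slot was corrupted
        elif d != g and c == g:
            re += 1   # removed error: a wrong slot was fixed
    return ae, re
```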

Related Work
Early DST methods (Williams and Young, 2007; Thomson and Young, 2010; Wang and Lemon, 2013b) relied on pre-defined rules to recognize dialogue states. Although good prior knowledge can alleviate cold-start issues, such methods require extensive manual rule-making, are inflexible, lack scalability, and cannot simultaneously track multiple types of state information. Subsequently, classification-based DST models built on a pre-defined ontology were proposed (Henderson et al., 2014; Mrkšić et al., 2016; Zhong et al., 2018; Ye et al., 2021b), which significantly improved performance. However, it is difficult to design a complete and robust pre-defined ontology (Xu and Hu, 2018). Therefore, most researchers have focused on open-vocabulary DST models (Gao et al., 2019; Wu et al., 2019; Chen et al., 2020; Shin et al., 2022; Su et al., 2021; Zhao et al., 2022), which extract dialogue states directly from the complete dialogue history but ignore the incomplete context caused by truncating the dialogue history. To address this challenge, several works (Tian et al., 2021; Zhao et al., 2021; Zhou et al., 2021; Xu et al., 2023) track users' goals from the dialogue context and the previous dialogue state. However, error propagation then becomes the factor limiting model performance. To address this problem, the idea of correcting the dialogue state has been proposed (Tian et al., 2021; Xie et al., 2022), but these methods rely heavily on specific DST models, i.e., they lack scalability.
Prompt learning is a widely used technique in natural language processing (NLP) that aims to reduce the gap between pre-training objectives and downstream tasks (Liu et al., 2023). Several prompt-based DST works have been proposed (Lee and Jha, 2019; Gao et al., 2020; Lee et al., 2021), using different text prompts to provide task information to the model. For example, Su et al. (2021) use slot names as prompts, Rastogi et al. (2020) use slot descriptions, and Lin et al. (2021) use possible slot values. Other work (Jiang et al., 2020) proposes automatic prompts that do not require manually predefined prompts.

Conclusion and Future Work
We have proposed Scalable-DSC, a new dialogue state correction model that decouples the error correction functionality from any specific DST model, i.e., achieves scalability, by introducing the STP approach. The key idea of STP is to convert the dialogue state into a natural language sequence and to generate a corrected dialogue state sequence under the guidance of the template options, combined with the real dialogue history. Extensive experiments analyze the scalability and correction performance of the model. The results confirm its applicability to other DST models and demonstrate new state-of-the-art performance on MultiWOZ 2.0-2.4. As discussed, limitations still exist; how to better correct all types of errors in dialogue states while avoiding the introduction of new errors remains an interesting topic for future work.

Limitations
This work has two main limitations: (1) Scalable-DSC cannot completely correct all types of errors in predicted dialogue states. As mentioned above, it primarily corrects over prediction errors, while the correction proportion for the other two error types is relatively small.
(2) Scalable-DSC sometimes incorrectly modifies correct dialogue states. The error analysis experiments show that it introduces a small proportion of such incorrect modifications on MultiWOZ 2.0-2.4.

Figure 4 shows the replacement values for the "dontcare" slot values in different slots.

D Experiment on Per Slot Error Rate
We further investigated the error rate of each slot. We used STP-DST as the primary model and the Scalable-DSC [N+R] model to correct its predictions. The results on the MultiWOZ 2.4 test set are shown in Table 6. Scalable-DSC effectively reduces the error rates for most slots. However, slots associated with "name" and "type" have higher error rates in the primary model itself; although their error rates decrease after correction, these slots remain challenging.
E Experiment on the Model's Generation Type
We believe that, when correcting the dialogue state, first identifying the wrong slots before correcting them reduces the inference burden of the model more than direct correction does. This idea is validated in Table 7: the model proposed in this paper achieved an 81.62% JGA, a 0.72% improvement over direct correction, confirming that identifying the wrong slots before correction is more beneficial than direct correction.

F Experiment on Slot Error Rate
We used SOM-DST and STAR as target correction models. Table 8 presents the performance after using Scalable-DSC to correct the predicted dialogue states of these two DST models on MultiWOZ 2.0-2.4. The results demonstrate that Scalable-DSC primarily corrects the over prediction errors of the DST models, which is the main source of its strong correction capability.

Figure 3: The process of converting dialogue states into a dialogue state sequence. The blank "___" represents the position for filling slot values.

Table 2: Ablation analysis of different prompts for STP on the MultiWOZ 2.4 test set.

Table 3: Slot Accuracy and Slot Error Rate on MultiWOZ 2.0-2.4. O refers to the over prediction type, P to the partial prediction type, and E to the erroneous prediction type. ↑: higher is better; ↓: lower is better.

Table 4: Error analysis of Scalable-DSC on MultiWOZ 2.0-2.4. "PE": the number of slot values incorrectly predicted by the DST model. "AE": the number of error slots newly introduced by the Scalable-DSC correction. "RE": the number of error slots removed by the Scalable-DSC correction. "CE": the number of error slots remaining after the Scalable-DSC correction.

Table 5 presents the final joint goal accuracy of our approach on MultiWOZ 2.0-2.4. Our approach outperforms the previous best results of Correctable-DST on MultiWOZ 2.1, 2.2, and 2.4, demonstrating its effectiveness in mitigating error propagation in dialogues.

Table 6: The error rate of each slot on MultiWOZ 2.4.

Table 8: Slot Accuracy and Slot Error Rate on MultiWOZ 2.0-2.4. ↑: higher is better; ↓: lower is better.