MoNET: Tackle State Momentum via Noise-Enhanced Training for Dialogue State Tracking

Dialogue state tracking (DST) aims to convert the dialogue history into dialogue states, which consist of slot-value pairs. Since the dialogue state is condensed structural information that memorizes the entire dialogue history, the state predicted at the previous turn is typically adopted as part of the input for predicting the current state by DST models. However, these models tend to keep the predicted slot values unchanged, which is defined as state momentum in this paper. Specifically, the models struggle to update slot values that need to be changed and to correct slot values that were wrongly predicted in the previous turn. To this end, we propose MoNET to tackle state momentum via noise-enhanced training. First, the previous state of each turn in the training data is noised by replacing some of its slot values. Then, the noised previous state is used as the input to learn to predict the current state, improving the model's ability to update and correct slot values. Furthermore, a contrastive context matching framework is designed to narrow the representation distance between a state and its corresponding noised variant, which reduces the impact of the noised state and makes the model better understand the dialogue history. Experimental results on MultiWOZ datasets show that MoNET outperforms previous DST methods. Ablations and analysis verify the effectiveness of MoNET in alleviating state momentum issues.


Introduction
Dialogue state tracking (DST) is a core component in modular task-oriented dialogue systems (Hosseini-Asl et al., 2020; Yang et al., 2021; Sun et al., 2022, 2023). It extracts users' intents from the dialogue history and converts them into
structural dialogue states, i.e., sets of slot-value pairs. An accurate dialogue state is crucial for generating correct dialogue actions and suitable natural language responses, which are the main tasks of the dialogue management and natural language generation components (Williams and Young, 2007; Thomson and Young, 2010; Young et al., 2010). Earlier DST approaches predict the state directly from the dialogue history (natural language utterances) (Mrkšić et al., 2017; Xu and Hu, 2018; Wu et al., 2019; Chen et al., 2020a). Since the dialogue state is condensed structural information memorizing the entire dialogue history, recent methods incorporate the previously predicted state as the input besides the dialogue history (Ouyang et al., 2020; Kim et al., 2020; Ye et al., 2021).
Conventional DST models taking the previous state as input usually exhibit the characteristic that previously predicted slot values tend to be kept unchanged when predicting the current state, which we define as state momentum in this paper. State momentum makes DST models struggle to modify the previous prediction, which hurts performance when the values of some slots need to be updated as the user's intent changes, or when wrongly predicted slot values need to be corrected. Figure 1 gives an example of a dialogue involving three turns with the two types of state momentum issues. The state hotel-book day-Saturday is predicted in Turn 1 and is kept unchanged in the next two turns, while the user's request is updated to Sunday in Turn 2. Consequently, the predicted state becomes wrong in the following two turns. The dotted arrow represents the ideal prediction cases: the value is updated when the ground truth changes and is corrected when it becomes a wrong input. The solid arrow represents the state momentum issues, where the state is kept unchanged, leading to two consecutive wrong predictions.

One possible reason for the state momentum issue is that in the training data, most slot values in the previous turn are the same as those in the current turn, which limits the ability of conventional DST models to modify slot values during inference. To address this limitation, an intuitive idea is to augment training instances with a higher ratio of slots whose previous values differ from those in the current turn. By incorporating such examples, the DST model can learn to handle more cases where modifying previous predictions is required. Besides, if the DST model can treat wrong and correct dialogue states similarly in their representations, then the former will typically help make further predictions. In other words, by treating incorrect dialogue states as valuable information, the DST model can potentially identify and correct erroneous slot values.
In this paper, we propose MoNET to tackle the state momentum issue via a noise-enhanced training strategy. The core idea is to manually add noise into the previous state to simulate scenarios with wrong state input. First, the previous state of each turn in the training data is noised by replacing some of its slot values. Specifically, for each active slot (with a non-none value), we replace its value with a certain probability. Then, the noised previous state, concatenated with the dialogue history, is used as the input to learn to predict the current state, improving the model's ability to update and correct slot values. Furthermore, a contrastive context matching framework is designed to narrow the representation distance between a state and its corresponding noised variant, which reduces the impact of the noised state and makes the DST model better understand the dialogue history. Such approaches make the model less sensitive to noise, and enhance its ability to modify the slot values of previous states in current predictions. Experiments on the multi-domain dialogue datasets MultiWOZ 2.0, 2.1, and 2.4 show that MoNET outperforms previous DST methods.

Related Work

Dialogue State Tracking

Early studies on DST mainly focus on single-domain dialogues (Williams and Young, 2007; Thomson and Young, 2010; Lee and Kim, 2016). Recent research pays more attention to multi-domain DST using distributed representation learning (Wen et al., 2017; Mrkšić et al., 2017). Previous works implement Seq2seq frameworks to encode the dialogue history and then predict the dialogue state from scratch at every turn (Rastogi et al., 2017; Ren et al., 2018; Lee et al., 2019; Wu et al., 2019; Chen et al., 2020a). Utilizing only the dialogue history is limited for dialogues with many turns, since the state of each turn is accumulated from all previous turns, while it is hard to retrieve state information from a long history.
Current works mainly incorporate the previous state as part of the model input, regarding it as an explicit fixed-size memory (Ouyang et al., 2020; Ye et al., 2022a; Wang et al., 2022). Kim et al. (2020) propose a state operation sub-task, where the model is trained to first predict the operation of each slot-value pair, such as UPDATE, CARRYOVER, etc., so that only the values of a minimal subset of slots are newly modified (Zeng and Nie, 2020; Zhu et al., 2020). These methods improve prediction efficiency and the ability to update slot-value pairs. Tian et al. (2021) deal with the error propagation problem, where mistakes are prone to be carried over to the next turn, and design a two-pass generation process in which a temporary state is first predicted and then used to predict the final state, enhancing the ability to correct wrong predictions. In this paper, we use "state momentum" to denote the issue where a wrong dialogue state is predicted because the previous prediction is kept unchanged when it should have been updated or corrected. To the best of our knowledge, this is the first work to systematically tackle the issue caused by continuously unchanged predictions in the multi-turn DST task.

Contrastive Learning
Contrastive learning aims to generate high-quality representations by constructing pairs of similar examples to learn semantic similarity (Mnih and Teh, 2012; Baltescu and Blunsom, 2015; Peters et al., 2018).
The goal is to help the model semantically group similar instances together and separate dissimilar instances. During training, neighbors with similar semantic representations (positive pairs) are pulled together, while non-neighbors (negative pairs) are pushed apart, enabling the learning of more meaningful representations. In the NLP area, semantic representations can be learned through self-supervised methods, such as center word prediction in Word2Vec, next sentence prediction in BERT, sentence permutation in BART, etc. (Mikolov et al., 2013; Devlin et al., 2019; Lewis et al., 2020). Recent approaches build augmented data samples through token shuffling, word deletion, dropout, and other operations (Cai et al., 2020; Klein and Nabi, 2020; Yan et al., 2021; Wang et al., 2021; Gao et al., 2021; Zhang et al., 2022). In this paper, we construct augmented samples based on the noised and original dialogue states. Given context inputs with the same dialogue history but different states, the model is trained to gather them into similar representations, aiming to learn better representations, reduce the impact of noise, and better understand the dialogue history.

Problem Formulation
In this paper, we focus on building a dialogue state tracking (DST) model which accurately predicts the dialogue state based on the dialogue history and the previous state during multi-turn dialogue interactions. A dialogue state consists of domain-slot-value tuples, typically corresponding to the dialogue topic, the user's goal, and the user's intent. Following previous studies, in the rest of this paper, we omit "domain" and use "slot" to refer to a "domain-slot" pair. All slot-value pairs are from a pre-defined ontology.
Formally, let us define $D_t = [U_t, R_t]$ as the pair of system utterance $U_t$ and user response $R_t$ in the $t$-th turn of a multi-turn dialogue, and $B_t$ as the corresponding dialogue state. Each state $B_t$ contains a set of slot-value pairs, i.e., $B_t = \{(S_j, V_j^i) \mid j \in [1:J]\}$, where $J$ is the total number of slots, and $V_j^i \in \mathcal{V}_j$ is one of the values in the candidate set $\mathcal{V}_j$ for the $j$-th slot $S_j$ in the ontology. Given the dialogue history $\{D_1, \ldots, D_t\}$ and the previous state $B_{t-1}$, the goal of the DST task is to predict the current dialogue state $B_t$.
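To make the formulation concrete, the following minimal sketch (ours, not the paper's code; the slot names and tiny ontology are illustrative) shows how a dialogue state $B_t$ can be maintained as a mapping from slots to ontology values across turns:

```python
# Toy ontology: each slot maps to its candidate value set V_j.
ONTOLOGY = {
    "hotel-book day": ["saturday", "sunday", "monday"],
    "hotel-book stay": ["3", "4"],
}

def update_state(prev_state, turn_labels):
    """Produce B_t from B_{t-1}: carry over previous slot values and
    overwrite/add the slots mentioned in the current turn D_t."""
    state = dict(prev_state)
    for slot, value in turn_labels.items():
        assert value in ONTOLOGY[slot], f"{value!r} not in ontology for {slot}"
        state[slot] = value
    return state

b1 = update_state({}, {"hotel-book day": "saturday", "hotel-book stay": "4"})
b2 = update_state(b1, {"hotel-book day": "sunday"})  # the user changes the day
```

This mirrors the Figure 1 example: a model suffering from state momentum would fail to apply the Turn-2 overwrite and keep "saturday".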

MoNET
As introduced in Section 1, solving the state momentum issue is crucial for the DST task. Therefore, in this paper, we propose MoNET to tackle the state momentum issue via a noise-enhanced training strategy that enhances the model's ability to update and correct slot values. The architecture of MoNET is shown in Figure 2(a); it consists of context BERT encoders, slot and value BERT encoders, the slot-context attention module, the slot-value matching module, and the contrastive context matching framework. Each of them is elaborated on in this section.

Base Architecture
We first introduce the base architecture of MoNET, which is similar to the backbone model in Ye et al. (2022a). A model trained only with the base architecture of MoNET is denoted as "Baseline" and evaluated in Section 5 to compare its performance with the whole MoNET model.

Context Encoder. A BERT encoder encodes the context input, which is the concatenation of the dialogue history and the state in the previous turn:

$X_t = [CLS] \oplus M_t \oplus [SEP] \oplus B_{t-1} \oplus [SEP] \oplus D_t \quad (1)$

where $M_t = D_1 \oplus \ldots \oplus D_{t-1}$ contains the previous utterances, $B_{t-1}$ is the state containing the active slots in the previous turn, and $[CLS]$ and $[SEP]$ are special tokens of the BERT encoder. Then the representations of the context input are derived:

$H_t = \mathrm{BERT}(X_t) \in \mathbb{R}^{|X_t| \times d} \quad (2)$

where $|X_t|$ is the total number of tokens in $X_t$, and $d$ is the encoded hidden size.
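As a concrete illustration of the context input construction, here is a minimal serialization sketch (ours; the exact special-token layout and the flat-string state format are assumptions, not the paper's implementation):

```python
def build_context_input(history, prev_state, current_turn):
    """Serialize the context input X_t: dialogue history M_t, the active
    slots of the previous state B_{t-1}, and the current turn D_t,
    joined with BERT-style special tokens."""
    state_str = " ; ".join(
        f"{slot} = {value}" for slot, value in prev_state.items()
        if value != "none"  # only active slots enter the input
    )
    return f"[CLS] {history} [SEP] {state_str} [SEP] {current_turn} [SEP]"

x_t = build_context_input(
    "i need a hotel for 4 nights starting saturday",
    {"hotel-book day": "saturday", "hotel-parking": "none"},
    "will you try sunday arrival ?",
)
```

The string would then be tokenized and fed to the BERT context encoder; inactive (none-valued) slots are dropped, matching the description of $B_{t-1}$ above.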
Slot and Value Encoders. BERT encoders with fixed parameters are used to derive the slot and value representations:

$h^{S_j} = \mathrm{BERT}_{\mathrm{fixed}}(S_j), \quad h^{V_j^i} = \mathrm{BERT}_{\mathrm{fixed}}(V_j^i) \quad (3)$

where $h^{S_j}, h^{V_j^i} \in \mathbb{R}^{1 \times d}$ are the $[CLS]$ representations of the slot and value.
Slot-Context Attention. For each slot $S_j$, its slot-context-specific feature is extracted by the multi-head attention mechanism (Vaswani et al., 2017):

$r_t^{S_j} = \mathrm{LN}(\mathrm{MultiHead}(h^{S_j}, H_t, H_t)) \quad (4)$

where $\mathrm{LN}$ is the layer normalization.
Slot-Value Matching. The probability of predicting the value $V_j^i$ for the slot $S_j$ is derived by calculating the L2 distance between the value representation $h^{V_j^i}$ and the slot-context representation $r_t^{S_j}$:

$P_\theta(V_j^i \mid X_t, S_j) = \frac{\exp(-\|h^{V_j^i} - r_t^{S_j}\|_2)}{\sum_{V_j^k \in \mathcal{V}_j} \exp(-\|h^{V_j^k} - r_t^{S_j}\|_2)} \quad (5)$

where $\theta$ denotes the trainable parameters of the model.

Training and Inference.
During training, the ground-truth dialogue state is used to form the context input $X_t$ (teacher forcing). For the $t$-th turn, the loss is the sum of the negative log-likelihood over all $J$ slots:

$\mathcal{L}_t^{dst} = \sum_{j=1}^{J} -\log P_\theta(V_j^{i*} \mid X_t, S_j) \quad (6)$

where $V_j^{i*}$ is the ground-truth value of the slot $S_j$ at turn $t$. During inference, the previously predicted state is used to form the context input $X_t$, and the value of the slot $S_j$ is predicted by selecting the one with the smallest distance, corresponding to the largest probability:

$\hat{V}_j = \arg\max_{V_j^i \in \mathcal{V}_j} P_\theta(V_j^i \mid X_t, S_j) \quad (7)$
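The distance-based matching, loss, and inference steps above can be sketched in plain Python (ours; toy 2-d vectors stand in for the BERT representations):

```python
import math

def l2(u, v):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def value_distribution(slot_context_rep, value_reps):
    """P(V | X_t, S_j): softmax over negative L2 distances, so the
    closest value representation gets the highest probability."""
    logits = [-l2(v, slot_context_rep) for v in value_reps]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def nll_loss(probs, gold_index):
    """Per-slot negative log-likelihood of the ground-truth value."""
    return -math.log(probs[gold_index])

def predict(probs):
    """Inference: pick the value with the largest probability,
    i.e. the smallest distance."""
    return max(range(len(probs)), key=lambda i: probs[i])

r_t = [1.0, 0.0]                      # toy slot-context representation
values = [[1.0, 0.1], [-1.0, 0.0]]    # toy candidate value representations
probs = value_distribution(r_t, values)
```

Here the first candidate is much closer to the slot-context feature, so it receives the higher probability and is selected at inference.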

Noised Data Construction
As described previously, an intuitive idea to tackle the state momentum issue is to increase the number of training instances where the slot-value pairs in the previous turn differ from those in the current turn. Based on this point, we utilize noised data to train the DST model. Generally, for each active slot (with a non-none value) in the previous dialogue state, we introduce noise by replacing its original value with another value with probability $p$ (used as the noise threshold), as in the example shown in Figure 2(b). Formally, at each training step, given a batch of training instances, a noised context input $X_t^+$ is constructed for each instance based on its original context input $X_t = f(M_t, B_{t-1}, D_t)$ as follows: for each active slot $S_j$ in $B_{t-1} = \{(S_j, V_j^i)\}$, a real number $a \in [0, 1]$ is sampled to determine whether the original $V_j^i$ is replaced with a randomly selected value $V_j^k \in \mathcal{V}_j \setminus \{V_j^i\}$ from the ontology or kept unchanged:

$\tilde{V}_j = \begin{cases} V_j^k, & a < p \\ V_j^i, & a \geq p \end{cases} \quad (8)$
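The noising rule above can be sketched as follows (ours; a dict-based state stands in for $B_{t-1}$, and the ontology and slot names are illustrative):

```python
import random

def noise_state(prev_state, ontology, p, rng=random):
    """Return a noised copy of B_{t-1}: with probability p, each active
    slot's value is replaced by a *different* value from its ontology set."""
    noised = {}
    for slot, value in prev_state.items():
        if rng.random() < p:  # a < p: replace with another value
            noised[slot] = rng.choice([v for v in ontology[slot] if v != value])
        else:                 # a >= p: keep the original value
            noised[slot] = value
    return noised

ontology = {"train-day": ["friday", "saturday", "sunday"],
            "train-destination": ["cambridge", "ely"]}
state = {"train-day": "saturday", "train-destination": "cambridge"}
noised = noise_state(state, ontology, p=1.0)  # p=1: every value replaced
```

With $p = 0$ the state is returned unchanged; with $p = 1$ every active slot receives a different ontology value, as in the "train-day-Saturday" to "train-day-Friday" replacement of Figure 2(b).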

Noised State Tracking
Similar to $X_t$, the noised context input $X_t^+$ is also used as the model input to predict the state $B_t$ as the training target, aiming to improve the model's ability to dynamically modify previous slot values in current predictions. Specifically, the representation $H_t^+$ of $X_t^+$ is first derived by the BERT context encoder described in Section 3.2.1:

$H_t^+ = \mathrm{BERT}(X_t^+) \in \mathbb{R}^{|X_t^+| \times d} \quad (9)$

Then, similar to the previous process, for each slot $S_j$, $X_t^+$ is used to predict its value based on the distribution:

$P_\theta(V_j^i \mid X_t^+, S_j) \quad (10)$

Eventually, the loss for the noised state tracking is:

$\mathcal{L}_t^{dst+} = \sum_{j=1}^{J} -\log P_\theta(V_j^{i*} \mid X_t^+, S_j) \quad (11)$

Contrastive Context Matching
Inspired by contrastive learning approaches, which pull similar samples closer and push diverse samples apart, a contrastive context matching framework is designed to narrow the representation distance between $X_t$ and its noised variant $X_t^+$, aiming to reduce the impact of the noised state $B_{t-1}^+$ and help the model better understand the dialogue history. Specifically, in a batch of $N$ instances with original context inputs $\{X_t^n\}_{n=1}^N$, we construct $N$ corresponding noised instances with context inputs $\{X_t^{n+}\}_{n=1}^N$. To clearly describe the context inputs, in this section, we temporarily add the superscript $n$ to $X_t$ and $H_t$, i.e., $X_t^n$ and $H_t^n$, to indicate the in-batch index. For each context input $X_t^n$, its noised sample $X_t^{n+}$ is regarded as its positive pair, and the remaining $(2N-2)$ instances in the same batch with different dialogue histories are considered negative pairs. The model is then trained to narrow the distance of the positive pair and enlarge the distances of the negative pairs in the representation space with the following training objective (Chen et al., 2020c):

$\mathcal{L}_t^{cm} = -\log \frac{\exp(\mathrm{sim}(H_t^n, H_t^{n+}) / \tau)}{\sum_{m=1}^{2N} \mathbb{1}_{[m \neq n]} \exp(\mathrm{sim}(H_t^n, H_t^m) / \tau)} \quad (12)$

where $\mathrm{sim}(\cdot, \cdot)$ is the similarity between context representations and $\tau$ is a temperature parameter.
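A plain-Python sketch of this objective (ours; an NT-Xent-style formulation in the spirit of Chen et al. (2020c), with toy 2-d vectors in place of mean-pooled BERT representations):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def contrastive_loss(originals, noised, tau=0.1):
    """Average NT-Xent loss over a batch: for each of the 2N context
    representations, its noised/original counterpart is the positive,
    and the remaining 2N-2 in-batch representations are negatives."""
    reps = originals + noised
    two_n = len(reps)
    n = len(originals)
    total = 0.0
    for i in range(two_n):
        pos = (i + n) % two_n  # index of the positive pair
        denom = sum(math.exp(cosine(reps[i], reps[j]) / tau)
                    for j in range(two_n) if j != i)
        numer = math.exp(cosine(reps[i], reps[pos]) / tau)
        total += -math.log(numer / denom)
    return total / two_n

batch = [[1.0, 0.0], [0.0, 1.0]]           # original context reps (N=2)
aligned = contrastive_loss(batch, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(batch, [[0.0, 1.0], [1.0, 0.0]])
```

A batch whose noised representations match their originals (aligned) yields a much lower loss than one whose positives are misassigned (swapped), which is exactly the pressure that pulls $H_t^n$ and $H_t^{n+}$ together.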

Optimization
The total training loss for each instance is the sum of the losses from the slot-value matching for DST and the contrastive context matching for representation learning, where the former is the average of the losses using the original and the noised context inputs described in Sections 3.2.1 and 3.2.3:

$\mathcal{L} = \frac{1}{2}(\mathcal{L}_t^{dst} + \mathcal{L}_t^{dst+}) + \mathcal{L}_t^{cm} \quad (13)$

Experiment Setting

Evaluation Metrics
We use joint and slot goal accuracy as the evaluation metrics. Joint goal accuracy is the ratio of dialogue turns in which the values of all slots are correctly predicted. Slot goal accuracy is the ratio of domain-slot pairs whose values are correctly predicted. Both metrics count inactive slots, whose value none must also be correctly predicted.
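The two metrics can be sketched as follows (ours; states are dicts in which an absent slot is implicitly none):

```python
NONE = "none"

def goal_accuracy(preds, golds, all_slots):
    """Joint goal accuracy: fraction of turns where every slot (including
    inactive ones, which must be "none") is predicted correctly.
    Slot goal accuracy: fraction of correct slot predictions overall."""
    joint_hits, slot_hits = 0, 0
    for pred, gold in zip(preds, golds):
        correct = [pred.get(s, NONE) == gold.get(s, NONE) for s in all_slots]
        joint_hits += all(correct)
        slot_hits += sum(correct)
    return joint_hits / len(golds), slot_hits / (len(golds) * len(all_slots))

slots = ["train-day", "train-destination"]
golds = [{"train-day": "sunday"},
         {"train-day": "monday", "train-destination": "ely"}]
preds = [{"train-day": "sunday"},
         {"train-day": "sunday", "train-destination": "ely"}]  # momentum error
joint, slot = goal_accuracy(preds, golds, slots)
```

In this toy run the second turn carries over the stale "sunday", so one of two turns is jointly correct (0.5) while three of four slot predictions are correct (0.75), illustrating why joint accuracy is the stricter metric.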

Training Details
The BERT-base-uncased model is used for the context, slot, and value encoders, with 12 attention layers and a hidden size of 768. During training, only the parameters of the context BERT encoder are updated, while the parameters of the slot and value BERT encoders are frozen. The batch size is set to 8. The AdamW optimizer (Loshchilov and Hutter, 2019) is applied with learning rates of 4e-5 and 1e-4 for the context encoders and the remaining modules, respectively. The temperature parameter $\tau$ is set to 0.1. The noise threshold $p$ defined in Section 3.2.2 is set to 0.3, and its impact on model performance is discussed in Section 5. All models are trained on a P40 GPU device for 6-8 hours.
Results and Analysis

Main Results
Table 1 shows the performance of MoNET and baselines on MultiWOZ 2.0, 2.1, and 2.4. Among them, TripPy and its modified versions employ a ground-truth label map of synonym replacement as extra supervision, which increases their accuracy scores and differs from the other methods tested with common labels. As can be observed, MoNET outperforms the compared methods on all three datasets.

Besides the general joint and slot goal accuracy, we also calculate the slot-level proportion of state momentum errors over all wrong predictions. We train the Baseline model and make predictions on the MultiWOZ 2.4 test set. For each dialogue, starting from the second turn, we count each wrongly predicted slot-value pair that also exists in the previous turn. In total, there are 844 such wrong slot-value pairs out of 2603 wrongly predicted pairs, hence the proportion is (844/2603)*100% = 32.4%, and our MoNET model fixes 47.0% of them (397 of the 844 are correctly predicted). Moreover, in the MultiWOZ 2.4 training set annotations, for each dialogue turn (again excluding the first turn of each dialogue), around 78.1% of slot-value pairs already exist in the previous turn, since slot-value pairs accumulate as the dialogue progresses. These results further indicate the issue caused by unchanged slot-value pairs during multi-turn interactions, and the effectiveness of our method in enhancing the model's ability to modify previous predictions.
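A sketch of how such a proportion can be computed (ours; it counts wrong predicted pairs whose value was already predicted for the same slot in the previous turn, matching the counting described above):

```python
def momentum_error_stats(pred_states, gold_states):
    """Return (carried_wrong, wrong): the number of wrongly predicted
    slot-value pairs from the second turn on, and the subset that were
    already predicted identically in the previous turn."""
    wrong, carried_wrong = 0, 0
    for t in range(1, len(pred_states)):  # skip the first turn
        for slot, value in pred_states[t].items():
            if gold_states[t].get(slot) != value:
                wrong += 1
                if pred_states[t - 1].get(slot) == value:
                    carried_wrong += 1
    return carried_wrong, wrong

# The Figure 1 dialogue: "saturday" is carried over although the
# ground truth changes to "sunday" at Turn 2.
preds = [{"hotel-book day": "saturday"}] * 3
golds = [{"hotel-book day": "saturday"},
         {"hotel-book day": "sunday"},
         {"hotel-book day": "sunday"}]
carried, wrong = momentum_error_stats(preds, golds)
```

On the toy dialogue, both wrong predictions are carried-over values, i.e., pure state momentum errors; on the real test set the analogous ratio is the reported 32.4%.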

Ablation Study
To explore the individual contribution of each part of our model, we compare the whole MoNET with several ablated versions. First, we remove the previous dialogue state from the context input of the Baseline model, so that the modified context input contains only the dialogue history; this variant is denoted as Baseline w/o state. Besides, the two noise-enhanced methods are removed from MoNET respectively, denoted as MoNET-CM (context matching only) and MoNET-ST (noised state tracking only).
Table 2 shows the joint goal accuracy of the full MoNET model and its four ablated variants on the MultiWOZ 2.4 test set. As can be observed, Baseline w/o state gets the lowest accuracy, demonstrating that explicitly using the previous dialogue state as part of the model input is beneficial for making predictions, even though it may contain wrong slot-value pairs. Besides, both MoNET-CM and MoNET-ST outperform the Baseline model, demonstrating the roles of the noised state tracking in modifying slot-value pairs in later turns, and of the context matching framework in learning improved semantic representations. Moreover, MoNET achieves the best performance, demonstrating the effectiveness of integrating the two parts into a unified noise-enhanced training strategy.

Turn-Level Evaluation
Figure 3 shows the turn-level joint goal accuracy of the MoNET and Baseline models, as well as the percentage difference in accuracy (the difference between the two models' accuracy divided by the accuracy of Baseline) on the MultiWOZ 2.4 test set. Generally, the state momentum issue becomes more apparent in dialogues with more turns, since they usually contain more active slot-value pairs, and any wrong pair kept unchanged will affect the accuracy of later predictions. As the number of turns increases, the accuracy of Baseline degrades sharply, while MoNET shows a relatively smaller decline, resulting in a gradually increasing percentage difference in accuracy. This demonstrates the superiority of MoNET in alleviating the accuracy decrease caused by the state momentum issue, especially in dialogues longer than 6-7 turns.

Noise Threshold Selection for Training
To explore the impact of different probabilities of adding noise to the context input during training, we vary the noise threshold $p$ from 0 to 0.5 when training MoNET. The results on the MultiWOZ 2.4 validation set are shown in Figure 4, where MoNET achieves the best performance when the noise threshold $p$ is set to 0.3. Intuitively, a small $p$ makes the noised context input contain few noised slot-value pairs (it is hard to learn meaningful semantics from the noised data); conversely, a large $p$ pushes the noised context input far from the original context input in the representation space (it is hard to group them closer). Both cases make it hard for the model to learn effective features from the noised context input, leading to lower prediction accuracy. Hence, an appropriate probability of adding noise is important to derive the best performance of the DST model.

Anti-noise Probing with Noise Testing
In this section, we conduct noise testing to explore the impact of anti-noise ability on DST models. We first evaluate the DST performance of MoNET and Baseline by introducing different ratios of noise (with $p$ from 0 to 1) into the oracle previous dialogue state used as the model input. Figure 5 shows the performance of MoNET and Baseline on MultiWOZ 2.4. Both achieve high accuracy when the noise ratio is 0, since the oracle previous dialogue state is used as the model input; as the noise ratio increases, the joint goal accuracy of Baseline declines sharply, while MoNET degrades much more smoothly. Furthermore, for each dialogue turn, we also show the L2 distance between the original and noised context representations, i.e., the mean pooling of all token representations $H_t$ and $H_t^+$. As can be observed, as the noise ratio increases, the distance between the two representations stays much lower for MoNET than for Baseline. These results indicate that MoNET achieves a higher anti-noise ability by generating relatively similar representations for the original and noised contexts, which helps the DST model maintain acceptable performance even with a high ratio of noise in its input.

Case Study and Attention Visualization
Table 3 gives two prediction examples using MoNET and Baseline on the MultiWOZ 2.4 test set, corresponding to the two types of state momentum cases. In the first one, both models correctly predict the slot-value pair "train-day-Sunday", while only MoNET updates it in the next turn when the ground truth changes to "train-day-Monday". In the second one, both models make a wrong prediction, "taxi-destination-Gonville and Caius College".

Extension on Generation-based Models
In addition to the original classification-based MoNET model, we also evaluate our approach within a simple generation framework using T5-base as the backbone pre-trained model (Raffel et al., 2020). The ontology is built from the database and training set annotations, and is only used for noise value construction.

Conclusion
In this paper, we define and systematically analyze the state momentum issue in the DST task, and propose MoNET, a training strategy equipped with noised DST training and a contrastive context matching framework. Extensive experiments on the MultiWOZ 2.0, 2.1, and 2.4 datasets verify its effectiveness compared with existing DST methods. Supplementary studies and analyses demonstrate that MoNET has a stronger anti-noise ability, which helps alleviate the state momentum issues.

Limitations
Our proposed MoNET is a classification-based method requiring a pre-defined ontology containing all slot-value pairs. Moreover, during prediction, for each slot, its distance to all possible values is calculated, i.e., the prediction has to be processed 30 times, which is the number of slots in the MultiWOZ dataset. Compared with generation methods, which only process once and do not need an ontology, our method falls short in training efficiency and scalability. However, most task-oriented dialogue datasets contain a knowledge base with slot value information, so it is acceptable to construct the ontology for random sampling. Besides, the results in Section 5.7 demonstrate that our method can be implemented on generation-based backbone models.
Figure 1: A dialogue example of three turns, containing the system utterance (U), the user response (R), the ground truth dialogue state (GT), and the prediction of each turn (Pred). The state "hotel-book day-Saturday" is predicted in the first turn (marked in blue). The dotted arrow represents the ideal predictions, i.e., updating slot values that need to be changed (Turn 2) and correcting wrongly predicted slot values from the previous turn (Turn 3). The solid arrow represents the predictions (marked in red) with state momentum issues.

Figure 2: The model description and noised input construction example. The left part (a) shows the architecture of the MoNET model. A context input representation $H_t^n$ in an $N$-size batch is shown in the contrastive context matching framework as an example, where $H_t^{n+}$ is the context representation of its noised variant. The right part (b) gives an example of constructing the noised context input. For each active slot in state $B_{t-1}$, given a noise threshold $p$, a random number $a$ is sampled. If $a < p$, the slot's value is replaced with another one randomly selected from the ontology (e.g., the pair "train-day-Saturday" is replaced with "train-day-Friday"); otherwise the value is kept unchanged (e.g., the pairs "train-departure-Birmingham" and "train-destination-Cambridge" are unchanged).

Figure 3: Turn-level joint goal accuracy and accuracy difference between MoNET and Baseline on the MultiWOZ 2.4 test set.

Figure 4: Performance on the MultiWOZ 2.4 validation set w.r.t. the noise threshold of adding noise.
The model framework is similar to the BERT-based MoNET in Figure 2(a), where the BERT encoders and slot-value matching modules are replaced with T5 encoders and decoders. The T5 encoders encode the dialogue context inputs, slots, and values. After deriving the slot-context attentive representations, the T5 decoders generate each slot-value pair. Table 4 shows the joint goal accuracy of the T5-based MoNET on the MultiWOZ 2.0 test set, compared with other end-to-end/generation-based models using the same T5-base pre-trained model. As can be observed, our modified MoNET outperforms the T5-base backbone and the other models built on the same T5-base model, indicating its effectiveness and adaptability for generation-based methods.

Table 1: Joint and slot goal accuracy of our MoNET and several previous methods on three MultiWOZ test sets.
Datasets

We choose the MultiWOZ 2.0, 2.1, and 2.4 versions as our datasets. MultiWOZ 2.0 (Budzianowski et al., 2018) is a large-scale multi-domain task-oriented dialogue dataset.

Figure 5: Performance on the MultiWOZ 2.4 test set w.r.t. the noise threshold (corresponding to the ratio of noised slot-value pairs in the dialogue state input). The left plot shows the goal accuracy, and the right plot shows the active slot-context feature similarity.

Table 3: Predictions of two dialogue examples on MultiWOZ 2.4, separated by the double solid line, corresponding to the two state momentum cases. Wrong and correct predicted values are marked in red and blue, respectively.
Figure 6: Attention visualizations of the two dialogue examples mentioned in Table 3.

Table 4: Joint goal accuracy on the MultiWOZ 2.0 test set of baselines using the same T5-base pre-trained model.

While Baseline keeps it unchanged in the next turn, MoNET corrects it, resulting in a joint goal accuracy of 100% for the second turn. Besides, we further explore these two examples by calculating and visualizing the overall attention scores, shown in Figure 6. For each slot, its overall attention score over each token is the weighted sum of the self-attended scores over all tokens in $X_t$. The weights come from the slot-context attention, and the self-attended scores are the average of attention scores over multiple layers in BERT. As can be observed, Baseline pays more attention to the values in the previously predicted state and fails to resolve the state momentum issues; MoNET pays relatively higher attention to the correct tokens ("monday" in the first case and "autumn house" in the second case) and, consequently, successfully updates Sunday to Monday and corrects Gonville and Caius College to Autumn House. These examples and attention visualizations indicate the effectiveness of MoNET in alleviating the two types of state momentum issues.