Improving Dialogue State Tracking with Turn-based Loss Function and Sequential Data Augmentation

While state-of-the-art Dialogue State Tracking (DST) models show promising results, all of them rely on a traditional cross-entropy loss function during the training process, which may not be optimal for improving the joint goal accuracy. Although several recently proposed approaches augment the training set by copying user utterances and replacing the real slot values with other possible or even similar values, they are not effective at improving the performance of existing DST models. To address these challenges, we propose a Turn-based Loss Function (TLF) that penalises the model more heavily if it inaccurately predicts a slot value at the early turns than at later turns, in order to improve joint goal accuracy. We also propose a simple but effective Sequential Data Augmentation (SDA) algorithm that generates more complex user utterances and system responses to effectively train existing DST models. Experimental results on two standard DST benchmark collections demonstrate that our proposed TLF and SDA techniques significantly improve the effectiveness of the state-of-the-art DST model by approximately 7-8% relative reduction in error and achieve new state-of-the-art joint goal accuracy of 59.50 and 54.90 on MultiWOZ2.1 and MultiWOZ2.2, respectively.


Introduction
Task-based Virtual Personal Assistants (VPAs) interact with users in natural language to help complete tasks such as making hotel bookings and restaurant reservations. Dialogue State Tracking (DST) is an essential component for VPAs that aims to track the dialogue state from the user's utterances at each turn (Rastogi et al., 2019). Based on the current dialogue state, VPAs decide the next action to perform. In general, existing DST models rely on an ontology that defines slots for a particular domain/task (e.g. hotel-name and taxi-destination). To accomplish the tracking task, given the user's current utterance, a slot to track and the dialogue history, the DST models need to 1) predict if the user has mentioned the given slot and 2) if so, predict/extract its value from the current utterance. Joint Goal Accuracy (JGA) is a widely used metric to evaluate the effectiveness of DST models (Eric et al., 2019; Shah et al., 2018; Wen et al., 2017). At each turn, the joint goal accuracy is 1.0 if and only if all domain-slot and value pairs are predicted correctly, otherwise 0. The existing DST models rely on the traditional cross-entropy loss function during the training process. We argue that this is not effective for optimizing joint goal accuracy. We illustrate this issue with Dialogue State Prediction 1 & 2 in Figure 1. With the traditional cross-entropy loss function, if two models each make only one mistake during the training process, they will be penalised equally. However, the consequence of the first model incorrectly predicting the value at the first turn is worse than that of the second model failing to predict the value at the fourth turn (i.e. average JGA across 3 turns is 0 and 0.66 for model 1 and model 2).
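The asymmetry described above can be made concrete with a minimal sketch of the per-turn JGA computation. The three-turn dialogue, slot names and values below are invented for illustration and are not taken from Figure 1; the sketch also assumes that a wrong value predicted at the first turn is carried forward in the accumulated state.

```python
from typing import Dict, List

State = Dict[str, str]  # maps a domain-slot name to its value


def joint_goal_accuracy(predicted: List[State], gold: List[State]) -> float:
    """Average per-turn JGA: a turn scores 1.0 only on an exact state match."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    hits = sum(1.0 for p, g in zip(predicted, gold) if p == g)
    return hits / len(gold)


# Hypothetical three-turn dialogue (slot names and values are invented).
gold = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
]
# Model 1 errs at turn 1 and the wrong value is carried forward.
model_1 = [
    {"hotel-area": "south"},
    {"hotel-area": "south", "hotel-stars": "4"},
    {"hotel-area": "south", "hotel-stars": "4", "hotel-parking": "yes"},
]
# Model 2 errs only at the final turn.
model_2 = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "no"},
]

print(joint_goal_accuracy(model_1, gold))
print(joint_goal_accuracy(model_2, gold))
```

Both hypothetical models make a single slot-value mistake, yet the first scores zero on every turn while the second still scores on two of the three turns.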
Training current DST models requires annotated dialogue datasets that cover a wide variety of diverse conversation flows. However, existing dialogue datasets (e.g. MultiWOZ2.1 (Eric et al., 2019)) are relatively small and do not provide coverage of all slot values for open-vocabulary slots (e.g. restaurant name and destination). To alleviate this problem, several recent data augmentation techniques (Giovanni et al., 2020; Summerville et al., 2020; Song et al., 2020) propose replacing the ground-truth values of particular slots using additional information (e.g. restaurant and movie name corpus). Although these augmented dialogues increase coverage of slot values, the complexity of the dialogues remains the same.
To address the aforementioned challenges, we propose a novel Turn-based Loss Function and Sequential Data Augmentation algorithm that improve the effectiveness of DST models. Our contributions are:
• We modify the traditional cross-entropy loss function to take into account the turn information during the training process. Our proposed Turn-based Loss Function (TLF) penalises the DST models more heavily if they fail to predict dependent slot values in subsequent turns. To the best of our knowledge, this work is the first to incorporate turn dependence into the loss function of the DST models.
• We propose a simple but effective Sequential Data Augmentation algorithm (SDA) to generate complex dialogues that can be used to train DST models to generalize more effectively.
• We conduct comprehensive experiments on two DST benchmark datasets. Experimental results demonstrate that our TLF and SDA approaches consistently and significantly improve the effectiveness of the state-of-the-art DST model in terms of joint goal accuracy. In particular, we study the state-of-the-art DST model's behaviour based on turn depth, domains, slot complexity, and robustness using perturbed dialogue history. We find that the model does not perform well at later turn depths or on dialogues with more active slots, and does not depend heavily on aspects of the dialogue history, while the model with our proposed TLF and SDA approaches can effectively address these challenges.

Related Work
DST models can be categorised into two types: the ontology-based (Zhang et al., 2019) and span-based models (Heck et al., 2020; Kim et al., 2020). Zhang et al. (2019) propose an ontology-based DST model that leverages a pre-defined ontology to predict dialogue state based on the similarity between the encoded candidate values and encoded user utterance and slot description. Recent work in DST focuses on the span-based approach to address the scalability and generalisation issues of previous ontology-based models. One scalable span-based DST model encodes the whole dialogue context and decodes the value for every slot using a copy-augmented decoder. Recently, several DST models (Kim et al., 2020; Heck et al., 2020) incorporate the predicted dialogue state from previous turns when tracking the dialogue state at the current turn using a copy mechanism.
Data augmentation is widely used to improve the effectiveness of the existing DST models (Hou et al., 2018; Giovanni et al., 2020; Song et al., 2020). Hou et al. (2018) use a sequence-to-sequence model and delexicalisation to generate a variety of diverse utterances based on the original utterances. These generated augmented utterances help improve the language understanding of the DST models. In addition, the span-based DST models often encounter out-of-vocabulary words (e.g. an unseen restaurant name) at inference time. As a result, these DST models are likely to fail to extract unseen words from the utterances. To address this problem, Summerville et al. (2020) augment the training dataset by randomly replacing original slot values with other possible values obtained from external corpora (e.g. a restaurant name corpus). Similar to Summerville et al. (2020), Song et al. (2020) augment the training data by copying user utterances and replacing the ground-truth slot values with randomly generated strings. Recently, Li et al. (2021) proposed to use a pre-trained utterance generator and a counterfactual goal generator to create novel user utterances that are correlated with the original system response. Their approach showed significant improvement in DST performance.

Neural Models for DST
We first formalise the task of Dialogue State Tracking and define key notations. Then, we briefly describe the general architecture of existing neural network DST models (see Figure 2), which consists of three main components: the dialogue encoder, the slot operation predictor, and the slot value predictor.

Problem Statement and Notation
DST tracks the state of the user at a particular turn given the user's utterance and system response. Let $X = \{(U_1, R_1), (U_2, R_2), \dots, (U_T, R_T)\}$ be the sequence of user utterance $U$ and system response $R$ pairs, given a dialogue context with $T$ turns. Each $(U_t, R_t)$ pair can involve a single or multiple domains (e.g. restaurant and taxi) and a certain number of slots (e.g. restaurant-name and taxi-destination) associated with the domains. Let $B = \{B_1, B_2, \dots, B_T\}$ be the dialogue state of the user for each turn. We denote all the $N$ possible domain-slot pairs as $S = \{s_1, s_2, \dots, s_N\}$. Each dialogue state $B_t$ is a set of tuples $(s, v)$, where $s \in S$ is a domain-slot pair and $v$ is a value associated with the domain-slot $s$.

The Dialogue Encoder
The dialogue encoder is the core component of DST models that aims to capture the user's intent from the dialogue context (see the blue box in Figure 2). The input of the dialogue encoder is the dialogue context at turn $t$, which consists of the current utterance $U_t$, the system response $R_t$ and the dialogue history $H_t = (U_{t-1}, R_{t-1}), \dots, (U_1, R_1)$. Existing DST models exploit pre-trained language models (e.g. BERT (Devlin et al., 2019)) to encode the input as follows:
$$[e^{CLS}_t, e^{1}_t, \dots, e^{seq_{max}}_t] = \mathrm{BERT}([C] \oplus U_t \oplus [S] \oplus R_t \oplus [S] \oplus H_t \oplus [S])$$
where $\oplus$ is the concatenation operation, and [C] and [S] are BERT's special CLS and SEP tokens. $[e^{CLS}_t, e^{1}_t, \dots, e^{seq_{max}}_t]$ is the output of the dialogue encoder that represents each token in the dialogue context. In particular, $e^{CLS}_t \in \mathbb{R}^d$, where $d$ is BERT's contextual embedding dimension, is the aggregated representation of all input tokens that captures the user's intent from the whole dialogue context, while $[e^{1}_t, e^{2}_t, \dots, e^{seq_{max}}_t]$ is the token-level representation.
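As a rough illustration of how the encoder input could be assembled, the sketch below concatenates the current utterance, the system response and the flattened history into one string delimited by the special tokens. The function name `build_encoder_input`, the segment order and the whitespace flattening of the history are assumptions made for illustration; a real pipeline would pass the result through a BERT tokenizer, which is not shown here.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_encoder_input(utterance: str, response: str,
                        history: List[Tuple[str, str]]) -> str:
    """Join U_t, R_t and H_t into one string delimited by special tokens.

    `history` is ordered most recent first: (U_{t-1}, R_{t-1}), ..., (U_1, R_1).
    The segment order and the flattening of the history are assumptions.
    """
    flat_history = " ".join(f"{u} {r}" for u, r in history)
    segments = [utterance, response, flat_history]
    body = f" {SEP} ".join(s.strip() for s in segments)
    # split/join normalises whitespace, e.g. when the history is empty
    return " ".join(f"{CLS} {body} {SEP}".split())


example = build_encoder_input(
    "i need a taxi",
    "where to ?",
    [("hello", "hi , how can i help ?")],
)
print(example)
```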

The Slot Operation Predictor
The slot operation predictor aims to predict an operation for each slot as one of the slot operations $O_{slot} = \{none, dontcare, update\}$ (see the red box in Figure 2). The none and dontcare operations denote that the slot does not take a value or could be any value, respectively. The update operation denotes that a value of the given slot could be predicted or extracted from the current utterance $U_t$ (see Turn 1 & 2 in Figure 1). If the slot operation predictor predicts the update operation for the given domain-slot pair, then the DST models obtain the value from the slot value predictor described in Section 3.4. The input to the slot operation predictor is the aggregated representation $e^{CLS}_t$, and the probability distribution over the slot operations $O_{slot}$ for domain-slot pair $s$ at turn $t$ is defined as follows:
$$\hat{y}^{slot}_{t,s} = \mathrm{softmax}(W^{slot}_s e^{CLS}_t + b^{slot}_s)$$
where $W^{slot}_s$ and $b^{slot}_s$ are learnable parameters and bias. Then, the cross-entropy loss function for the slot operation prediction is defined as follows:
$$\mathcal{L}_{slot} = -\sum_{s \in S} (y^{slot}_{t,s})^{\top} \log \hat{y}^{slot}_{t,s} \quad (2)$$
where $y^{slot}_{t,s}$ is the one-hot slot operation label for domain-slot pair $s$ at turn $t$.
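The slot operation predictor can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: a softmax over a linear projection of the aggregated representation, followed by cross-entropy against a one-hot label. The two-dimensional embedding and the weight values are invented, and a real model would use learned parameters of dimension d inside a deep-learning framework.

```python
import math
from typing import List, Sequence

SLOT_OPERATIONS = ("none", "dontcare", "update")


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def slot_operation_probs(e_cls: Sequence[float],
                         weights: Sequence[Sequence[float]],
                         bias: Sequence[float]) -> List[float]:
    """Distribution over slot operations from a linear layer on e_cls."""
    logits = [sum(w * x for w, x in zip(row, e_cls)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def cross_entropy(probs: Sequence[float], one_hot: Sequence[int]) -> float:
    """Cross-entropy between a predicted distribution and a one-hot label."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)


# Toy values (assumed): 2-dimensional e_cls, one weight row per operation.
e_cls = [0.5, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

probs = slot_operation_probs(e_cls, weights, bias)
predicted = SLOT_OPERATIONS[probs.index(max(probs))]
loss = cross_entropy(probs, [1, 0, 0])  # gold label: "none"
```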

The Slot Value Predictor
The final component of DST models is the slot value predictor, which aims to extract a value for each domain-slot pair from the dialogue context (the violet box in Figure 2). The slot value predictor takes the token-level representations $[e^{1}_t, e^{2}_t, \dots, e^{seq_{max}}_t]$ of the entire dialogue context for turn $t$ as input and applies a two-way linear mapping to compute the probability of the terms being the start and the end position of the span for slot $s$, $\hat{y}^{start}_{t,s}$ and $\hat{y}^{end}_{t,s}$, respectively, as follows:
$$[\alpha^{i}_{t,s}, \beta^{i}_{t,s}] = W^{value}_s e^{i}_t + b^{value}_s, \qquad \hat{y}^{start}_{t,s} = \mathrm{softmax}(\alpha_{t,s}), \qquad \hat{y}^{end}_{t,s} = \mathrm{softmax}(\beta_{t,s})$$
Similar to the slot operation predictor's loss function (Equation (2)), the loss function for the slot value prediction, $\mathcal{L}_{value}$, is defined as follows:
$$\mathcal{L}_{value} = -\sum_{s \in S} \left( (y^{start}_{t,s})^{\top} \log \hat{y}^{start}_{t,s} + (y^{end}_{t,s})^{\top} \log \hat{y}^{end}_{t,s} \right) \quad (3)$$
where $y^{start}_{t,s}$ and $y^{end}_{t,s}$ are the one-hot start and end position labels for domain-slot pair $s$ at turn $t$. Finally, the DST models are trained using the following joint loss function:
$$\mathcal{L} = \mu_{slot} \mathcal{L}_{slot} + \mu_{value} \mathcal{L}_{value} \quad (4)$$
where $\mu_{slot}$ and $\mu_{value}$ are hyperparameters that control the weights of the slot operation prediction and the slot value prediction, respectively. Note that the joint loss function in Equation (4) is widely used by the existing DST models (e.g. (Heck et al., 2020; Kim et al., 2020; Zhang et al., 2019)).
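The following toy sketch illustrates span prediction and the weighted joint loss in plain Python. It assumes that the two-way linear mapping yields one start score and one end score per token, each normalised with a softmax over token positions. The token representations, weight vectors and the values chosen for the two loss weights are invented for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def span_probs(token_reprs: Sequence[Sequence[float]],
               w_start: Sequence[float], w_end: Sequence[float],
               b_start: float = 0.0, b_end: float = 0.0
               ) -> Tuple[List[float], List[float]]:
    """Start/end distributions over token positions (assumed formulation)."""
    def dot(w, e):
        return sum(a * b for a, b in zip(w, e))
    start = softmax([dot(w_start, e) + b_start for e in token_reprs])
    end = softmax([dot(w_end, e) + b_end for e in token_reprs])
    return start, end


def value_loss(start: Sequence[float], end: Sequence[float],
               gold_start: int, gold_end: int) -> float:
    """Cross-entropy on the gold start and end positions."""
    return -math.log(start[gold_start]) - math.log(end[gold_end])


def joint_loss(l_slot: float, l_value: float,
               mu_slot: float, mu_value: float) -> float:
    """Weighted sum of the slot operation loss and the slot value loss."""
    return mu_slot * l_slot + mu_value * l_value


# Toy token-level representations for a three-token context (assumed values).
tokens = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
p_start, p_end = span_probs(tokens, w_start=[1.0, 0.0], w_end=[0.0, 1.0])
span = (p_start.index(max(p_start)), p_end.index(max(p_end)))
l_value = value_loss(p_start, p_end, gold_start=1, gold_end=2)
total = joint_loss(l_slot=1.0, l_value=l_value, mu_slot=0.8, mu_value=0.2)
```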

Proposed Methods
We now describe the proposed methods that improve the effectiveness of DST models.

Turn-based Loss Function
We start by describing our Turn-based Loss Function which improves the effectiveness of the core DST model. As discussed in Section 1, most existing DST models (e.g. (Heck et al., 2020;Kim et al., 2020;Zhang et al., 2019)) still rely on the traditional cross-entropy loss function during the training process, which may not be optimal to improve the joint goal accuracy. To address this, we incorporate the turn information during the training process. Our proposed TLF penalises the DST model more heavily if it inaccurately predicts a slot value at the early turns than the later turns. This is important to avoid the error cascade in early turns that results in highly degraded JGA in later turns.
To model this dependency explicitly during the training process, we extend the joint loss function in Equation (4) with a turn-based penalty (Equation (5)), where $T$ is the total number of turns for a given dialogue, $t$ is the current turn number and $\lambda$ is a turn weight parameter that controls the influence of the turn-based penalty. For example, if a given dialogue consists of 5 turns ($T = 5$), the model will be penalised more heavily if it makes a mistake at the first turn ($t = 1$) than at the last ($t = 5$).
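To make the idea concrete, the sketch below uses one possible turn weighting that matches the verbal description: the per-turn joint loss is multiplied by a factor that grows with the number of remaining turns. The specific form 1 + λ(T − t) is an assumption chosen for illustration and should not be read as the exact definition of TLF; it only guarantees that earlier turns are weighted more heavily and that λ = 0 recovers the unweighted joint loss.

```python
def turn_weight(t: int, total_turns: int, lam: float) -> float:
    """Hypothetical turn weight: larger for earlier turns, 1.0 when lam == 0."""
    if not 1 <= t <= total_turns:
        raise ValueError("t must satisfy 1 <= t <= total_turns")
    return 1.0 + lam * (total_turns - t)


def turn_based_loss(joint_loss: float, t: int, total_turns: int,
                    lam: float) -> float:
    """Scale the joint loss of turn t by the (assumed) turn weight."""
    return turn_weight(t, total_turns, lam) * joint_loss


# The same mistake (joint loss 2.0) at different turns of a 5-turn dialogue.
for t in range(1, 6):
    print(t, turn_based_loss(2.0, t, total_turns=5, lam=0.1))
```

With this assumed form, an identical error costs more at turn 1 than at turn 5, which mirrors the five-turn example in the text.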

Sequential Data Augmentation
Our proposed Sequential Data Augmentation algorithm improves the generalizability of DST models. The overall training process of DST algorithms with SDA is summarised in Algorithm 1.
In particular, for each turn $t$, given the current utterance $U_t$, system response $R_t$ and dialogue history $H_t$, we generate augmented training data by concatenating $U_t$ and $R_t$ with $U_{t+1}, U_{t+2}, \dots, U_{t+\eta}$ and $R_{t+1}, R_{t+2}, \dots, R_{t+\eta}$, respectively. The hyperparameter $\eta$ controls the complexity of the augmented dialogues. For example, for turn 1 in Figure 1, the augmented utterance $U^{augment}_1$ and the augmented system response $R^{augment}_1$ are obtained by appending the utterances and responses of the following $\eta$ turns to $U_1$ and $R_1$. The larger $\eta$ is, the more complex the augmented dialogue becomes. We hypothesize that the complex augmented dialogues help the DST model to learn to track dialogue state more effectively for several reasons. First, the longer augmented dialogues help the model to understand the intent deeper in the conversation than the original dialogues, which are often relatively short. Second, the augmented dialogues contain more ground-truth labels than the original dialogues, which helps to train the DST models more effectively to extract the domain-slot values from the utterance and system response. For example, in Figure 1, the utterance on turn 1, $U_1$, contains only one ground-truth label, whereas the augmented utterance $U^{augment}_1$ and the augmented system response $R^{augment}_1$ contain 3 ground-truth labels, highlighted in green. Our proposed SDA algorithm differs from the previous data augmentation algorithms (e.g. (Summerville et al., 2020; Song et al., 2020)) in two key ways: 1) SDA takes into account the sequential property of dialogues when generating the augmented training data, while Summerville et al. (2020) and Song et al. (2020) do not, and 2) Summerville et al. (2020) and Song et al. (2020) require external information (e.g. a restaurant name corpus), while SDA does not.
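The concatenation step described above can be sketched as follows. This is a minimal illustration, not a reproduction of Algorithm 1: the `Turn` and `AugmentedExample` containers, the whitespace joining, the truncation of the window at the end of the dialogue, and the choice of the last merged turn's state as the training target are all assumptions.

```python
from typing import Dict, List, NamedTuple


class Turn(NamedTuple):
    utterance: str
    response: str
    state: Dict[str, str]  # accumulated dialogue state after this turn


class AugmentedExample(NamedTuple):
    utterance: str         # U_t joined with U_{t+1}, ..., U_{t+eta}
    response: str          # R_t joined with R_{t+1}, ..., R_{t+eta}
    history_turns: int     # number of original turns preceding turn t
    state: Dict[str, str]  # assumed target: state after the last merged turn


def sequential_augment(dialogue: List[Turn], eta: int) -> List[AugmentedExample]:
    """Create one augmented example per turn by merging the next eta turns."""
    if eta < 0:
        raise ValueError("eta must be non-negative")
    examples = []
    for t in range(len(dialogue)):  # zero-based turn index
        last = min(t + eta, len(dialogue) - 1)  # assumed: truncate at the end
        if last == t:
            continue  # nothing to merge (eta == 0 or final turn)
        window = dialogue[t:last + 1]
        examples.append(AugmentedExample(
            utterance=" ".join(x.utterance for x in window),
            response=" ".join(x.response for x in window),
            history_turns=t,
            state=dict(dialogue[last].state),
        ))
    return examples
```

The augmented examples would be added to the original training turns, so that η = 0 leaves the training set unchanged.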

Experimental Setup
We conduct experiments on the two most widely used multi-domain task-based dialogue state tracking datasets, MultiWOZ2.1 (Eric et al., 2019) and MultiWOZ2.2. These two are the largest datasets, each containing over 10,000 dialogues across seven domains: restaurant, taxi, attraction, hotel, train, hospital and police. Following Zhang et al. (2019), we remove the hospital and police domains in MultiWOZ2.1 and MultiWOZ2.2 because they only appear in the training dataset. This results in five domains with 30 domain-slot pairs. We use the standard training/validation/test splits provided in the original datasets. Following previous literature (Heck et al., 2020; Zhang et al., 2019), we evaluate all the DST models using the Joint Goal Accuracy (JGA) metric (Henderson et al., 2014). At each turn, JGA is 1.0 if and only if all domain-slot and value pairs are correctly predicted, otherwise 0. The score is averaged across all turns in the test set.

Baseline Models
We compare our proposed approaches with a variety of recent DST baselines. TRADE encodes the whole dialogue context using bidirectional Gated Recurrent Units (GRU) and generates the value for every slot using a GRU-based copy mechanism. DST-picklist (Zhang et al., 2019) is an ontology-based DST model that requires a pre-defined ontology with all possible values for each domain-slot pair. DS-DST (Zhang et al., 2019) is a hybrid DST model that jointly trains both the ontology-based and span-based models. SOM-DST (Kim et al., 2020) is a span-based DST model that uses the copy mechanism for the slot operation prediction and a GRU for the slot value prediction. TripPy (Heck et al., 2020) is the state-of-the-art span-based DST model that uses a triple copy mechanism to track the dialogue state. DialoGLUE (Mehri et al., 2020) is the TripPy model that uses ConvBERT, a BERT model fine-tuned on an open-domain dialogue corpus consisting of 700M conversations, as the dialogue encoder. TripPy-V is the TripPy model that uses the existing Value-based Data Augmentation (VDA) (Summerville et al., 2020), which randomly replaces original slot values with other possible values. TripPy-CoCo (Li et al., 2021) is the TripPy model trained on the augmented data generated by the Controllable Counterfactual (CoCo) data augmentation algorithm, which consists of three main components: value substitution, controllable counterfactual generation and a classifier filter.

Implementation Details
We implement our proposed TLF and SDA approaches using PyTorch. The hyperparameters of TLF and SDA (i.e. the turn weight parameter λ and the sequence number parameter η) are tuned on the validation set. We use the pre-trained BERT-base-uncased model (Devlin et al., 2019) with 12 hidden layers and embedding dimension d = 768 as the dialogue encoder. We optimise all baselines in the same way, using cross-entropy loss and the Adam optimiser (Kingma and Ba, 2014) with a learning rate of 2e-5. For the remaining hyperparameters, we use the optimized values reported in the original papers.

Experimental Results and Discussion
Table 1 reports the effectiveness of DST models in terms of joint goal accuracy on the two datasets. The table contains two groups of rows. The first group reports the effectiveness of the TripPy model that uses our proposed Turn-based Loss Function (TLF) and Sequential Data Augmentation (SDA) approaches compared to the baselines. The second group reports the effectiveness of TLF, SDA and the existing data augmentation algorithms (i.e. VDA and CoCo). The encoder column indicates the pre-trained language model used by the baselines as the dialogue encoder, described in Section 3.2. Due to their recency or a lack of details, we were not able to re-implement all baselines. For those baselines, we include the as-reported results and are unable to test for statistical significance.
We first reproduce results with the TripPy and SOM-DST models in Table 1. We find that the relative dialogue state tracking effectiveness of these two models on MultiWOZ2.1 is consistent with the results reported in the original papers (Heck et al., 2020; Kim et al., 2020). For instance, SOM-DST outperforms both TRADE and DS-DST and is as effective as the state-of-the-art ontology-based DST model (DST-picklist). Similarly, we observe that TripPy outperforms all the ontology-based and span-based baselines on the MultiWOZ2.1 and MultiWOZ2.2 datasets. Note that MultiWOZ2.2 is the most recent DST dataset and has not been widely used in the previous literature, hence some baseline results are not yet available on MultiWOZ2.2. The results of TRADE and DS-DST on MultiWOZ2.2 are as-reported results from prior work.
Comparing the baseline model that uses our TLF and SDA approaches (TripPy-TS) with the baselines across the two datasets in Table 1, we observe that TripPy-TS consistently and significantly outperforms all the ontology-based and span-based DST baselines in terms of JGA across all datasets. TripPy-TS improves joint goal accuracy by 7.17%, 11.63% and 14.12% relative reduction in error over the base TripPy, DST-picklist and SOM-DST models that use BERT as the dialogue encoder on MultiWOZ2.1. Comparing TripPy-TS with DialoGLUE, the TripPy model that uses a BERT fine-tuned on 700 million open-domain dialogues, we observe that TripPy-TS still outperforms DialoGLUE by 1.36% relative reduction in error on MultiWOZ2.1, although TripPy-TS only uses the pre-trained BERT-base-uncased model as the dialogue encoder. Similar results are observed on MultiWOZ2.2, where TripPy-TS outperforms TripPy, DS-DST and TRADE by 8.26%, 6.19% and 20.93% relative reduction in error, respectively. These results imply that our TLF and SDA approaches significantly and consistently improve the effectiveness of the state-of-the-art DST model, TripPy.
Next, we further analyse the effectiveness of TLF and SDA using an ablation study. TripPy-T and TripPy-S denote the baseline model using only TLF and only SDA, respectively. Comparing TripPy-TS to TripPy-T, we observe that removing SDA significantly decreases effectiveness in terms of JGA across both datasets: TripPy-T falls short of TripPy-TS by around 1.55-5% in terms of relative reduction in error. These results indicate the importance of SDA in enhancing the effectiveness. In addition, comparing TripPy-S with TripPy-V, which uses the existing Value-based Data Augmentation, we find that SDA is more effective than VDA in improving the effectiveness of TripPy. Comparing TripPy-TS and TripPy-S, we observe that removing TLF decreases performance by approximately 2.53% and 1.87% in terms of relative reduction in error on the MultiWOZ2.1 and MultiWOZ2.2 datasets, respectively. These results are intuitive because TLF improves the effectiveness of DST models by penalising them heavily if they fail to accurately predict the dialogue state at the early to mid turn depths. Overall, we find that our proposed TLF and SDA approaches together consistently and significantly improve the effectiveness of the state-of-the-art DST model (TripPy) across the two datasets.
We further compare the performance of SDA (TripPy-S) and the state-of-the-art CoCo data augmentation algorithm (TripPy-CoCo). Note that TripPy-CoCo (2x) denotes the TripPy model trained on augmented data two times larger than the original training data. First, we observe that TripPy-S outperforms TripPy-CoCo (1x) and (2x) by 3.63% and 1.91% relative reduction in error on MultiWOZ2.1. Although TripPy-CoCo (4x) and (8x) are more effective than TripPy-S, such a comparison is not fair for several reasons. First, the value substitution component of CoCo relies on a manually created, pre-defined value set for each domain-slot. Second, CoCo initialises the parameters of the controllable counterfactual generation model and the classifier filter using the pre-trained T5 (Raffel et al., 2020) and BERT models. Third, CoCo uses MultiWOZ2.2 to train the controllable counterfactual generation model and the classifier filter, yet evaluates the performance of TripPy-CoCo on MultiWOZ2.1. In contrast, our SDA approach uses neither a pre-defined value set for each domain-slot, nor advanced pre-trained models (i.e. T5), nor the MultiWOZ2.2 dataset during the training process.

Turn Depth and Domain-specific JGA
Dialogues vary in length and longer dialogues are likely to be more challenging. In this section, we study the relationship between the depth of dialogue and the effectiveness of different models. The trend in Figure 3 clearly shows that the effectiveness of all models decreases steadily and dramatically as the turn depth increases. Comparing the effectiveness of TripPy, which uses the traditional cross-entropy loss function, and TripPy-T, which uses our proposed Turn-based Loss Function, we observe that TripPy-T outperforms the baseline consistently from the third turn through to turn ten. The improvement of TripPy-T compared to TripPy from the early to mid turn depths has a large impact in the later turns. These results imply that we should penalise the model more heavily if it fails to predict the slot value early in the conversation, as the error from the first turn cascades into later turns, degrading JGA.
Next, we investigate the utility of our proposed SDA algorithm to improve the quality of DST. We see that the performance of TripPy and TripPy-S is similar on the first and second turns. Then, from the third to the tenth turn, TripPy-S consistently outperforms TripPy on MultiWOZ2.2. When we compare TripPy with TripPy-T, SDA does not improve the performance of TripPy at the early turns but increases the effectiveness of the model in the later turns. Finally, the performance of TripPy-TS is comparable with that of the base model at the first and second turns. Interestingly, we observe that TripPy-TS consistently outperforms TripPy from the third turn to the tenth turn. This demonstrates that TLF and SDA are complementary and together play an important role in improving the quality of state tracking across increasing turn depths.
We also analyse the effectiveness of the algorithms by examining joint goal accuracy for each domain over turn depths. First, in Figure 4, the results show that TripPy-TS consistently outperforms TripPy across the first five turns on the restaurant, hotel, attraction and train domains on MultiWOZ2.2. In particular, TripPy-TS outperforms TripPy by approximately 15%, 7%, 14% and 20% relative reduction in error for the restaurant, hotel, attraction and train domains. We also observe that TripPy-TS outperforms TripPy from the third turn on the taxi domain. On the fourth turn, TripPy completely fails to track the state from fourteen dialogues while TripPy-TS accurately predicts three. The performance of TripPy-TS averaged across all turns on the taxi domain is better than the performance of TripPy by approximately 22% relative reduction in error (i.e. 31.70 JGA compared to 25.79 JGA).
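The taxi-domain figures quoted above allow a quick arithmetic check. The sketch below computes two common relative measures from a pair of JGA scores; both definitions are assumptions, since the text does not spell out its formula. With 31.70 and 25.79, the relative gain in JGA comes to roughly 22.9%, whereas a relative reduction in error rate (taking error as 100 − JGA) would be roughly 8.0%, so the quoted figure of approximately 22% corresponds to the first definition.

```python
def relative_jga_gain(new_jga: float, base_jga: float) -> float:
    """Relative improvement in JGA, in percent (assumed definition)."""
    return (new_jga - base_jga) / base_jga * 100.0


def relative_error_reduction(new_jga: float, base_jga: float) -> float:
    """Relative reduction in error rate (100 - JGA), in percent (assumed)."""
    base_error = 100.0 - base_jga
    new_error = 100.0 - new_jga
    return (base_error - new_error) / base_error * 100.0


# Taxi-domain JGA values quoted in the text.
trippy_ts, trippy = 31.70, 25.79
print(round(relative_jga_gain(trippy_ts, trippy), 2))
print(round(relative_error_reduction(trippy_ts, trippy), 2))
```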

Complex Turn-specific JGA
We compare TripPy and TripPy-TS on turns with a particular number of active slots. We define a simple turn to be one containing either one or two active slots, and a complex turn to be one containing either three or four active slots. In Figure 5, we observe that TripPy-TS consistently outperforms TripPy on the simple turns by approximately 6-10% relative reduction in error in terms of JGA across MultiWOZ2.1 and MultiWOZ2.2. It is clear on both datasets that TripPy-TS is more effective in predicting state for the complex turns. TripPy-TS outperforms TripPy by 17-21% and 23-34% relative reduction in error in terms of JGA over the complex turns on MultiWOZ2.1 and MultiWOZ2.2, respectively. These results imply that TLF and SDA together consistently improve the effectiveness of TripPy in accurately predicting the state over the turns with more active slots. As illustrated in Section 4.2, the augmented dialogues generated by our algorithm contain more active slots than the original dialogues in the training set, which usually have a small number of active slots per turn. These augmented dialogues help the model to learn to effectively extract the slot values from complex turns.
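The grouping of turns by number of active slots can be sketched as below. The bucket boundaries follow the definitions in the text (one or two active slots for simple, three or four for complex); the extra "other" bucket and the input format, a list of (number of active slots, turn predicted correctly) pairs, are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def turn_complexity(num_active_slots: int) -> str:
    """Bucket a turn by its number of active slots."""
    if num_active_slots in (1, 2):
        return "simple"
    if num_active_slots in (3, 4):
        return "complex"
    return "other"  # assumed catch-all for counts outside 1-4


def jga_by_complexity(turns: Iterable[Tuple[int, bool]]) -> Dict[str, float]:
    """Average JGA per complexity bucket from (active slots, correct) pairs."""
    hits: Dict[str, int] = defaultdict(int)
    counts: Dict[str, int] = defaultdict(int)
    for num_active_slots, correct in turns:
        bucket = turn_complexity(num_active_slots)
        counts[bucket] += 1
        hits[bucket] += int(correct)
    return {bucket: hits[bucket] / counts[bucket] for bucket in counts}


# Invented per-turn outcomes: (number of active slots, joint goal correct?).
example_turns = [(1, True), (2, False), (3, True), (4, True), (5, False)]
print(jga_by_complexity(example_turns))
```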

Impact of hyperparameters
We evaluate the sensitivity of the hyperparameters for our proposed methods. First, we study the effect of the turn weight parameter λ of TLF in Equation (5) by varying its value in the range of {0, 0.01, 0.05, 0.1, 0.5}. Note that λ = 0 corresponds to the baseline model without TLF. From the left plot in Figure 6, we observe that setting λ = 0.01, 0.05, 0.1 or 0.5 is more effective than the baseline with λ = 0 across both datasets. Next, we study the impact of the sequence number parameter η of the Sequential Data Augmentation algorithm. From the right plot in Figure 6, we observe that η = 1, 2, 3, 4 or 5 is more effective than η = 0. The highest effectiveness is obtained with η = 3 and η = 4 on the two datasets, respectively.

Impact of perturbed dialogue history
Recent work by Sankar et al. (2019) shows that existing task-based models seldom understand or use the dialogue history effectively. We first study how the state-of-the-art TripPy model uses the dialogue history (i.e. $H_t$ in Section 3.2). Then, we compare this with the behavior of the proposed TLF and SDA methods. We apply four types of perturbation operations to the dialogue history in the test set. P1 and P2 are utterance-level perturbation operations that shuffle and reverse the sequence of utterances in the dialogue history, respectively. P3 and P4 are word-level perturbation operations that randomly shuffle the words within an utterance and reverse the ordering of words, respectively.
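The four perturbation operations can be sketched as simple list manipulations. The function names, the representation of the history as a list of whitespace-tokenised utterance strings, and the use of a seeded random generator are assumptions made for illustration.

```python
import random
from typing import List


def p1_shuffle_utterances(history: List[str], rng: random.Random) -> List[str]:
    """P1: randomly shuffle the order of utterances in the history."""
    shuffled = list(history)
    rng.shuffle(shuffled)
    return shuffled


def p2_reverse_utterances(history: List[str]) -> List[str]:
    """P2: reverse the order of utterances in the history."""
    return list(reversed(history))


def p3_shuffle_words(history: List[str], rng: random.Random) -> List[str]:
    """P3: randomly shuffle the words within each utterance."""
    perturbed = []
    for utterance in history:
        words = utterance.split()
        rng.shuffle(words)
        perturbed.append(" ".join(words))
    return perturbed


def p4_reverse_words(history: List[str]) -> List[str]:
    """P4: reverse the word order within each utterance."""
    return [" ".join(reversed(utterance.split())) for utterance in history]


# Invented dialogue history for illustration.
history = ["i need a hotel", "which area ?", "the north please"]
rng = random.Random(0)  # seeded so that runs are repeatable
print(p1_shuffle_utterances(history, rng))
print(p2_reverse_utterances(history))
print(p3_shuffle_words(history, rng))
print(p4_reverse_words(history))
```

Each function returns a new list, so the original history remains available for the unperturbed evaluation.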
In Figure 7, we observe that the effectiveness of TripPy and TripPy-TS decreases by a relative 0.5-2.5% across different perturbation operations on MultiWOZ2.1. These results are intuitive because the DST models are less likely to capture intent from the perturbed history and are hence less effective. For the P3 operation, which randomly shuffles the words within the utterance, the results show that this operation negatively impacts the performance of TripPy-TS, while it only slightly affects TripPy. Similar to the results observed on MultiWOZ2.1, on MultiWOZ2.2 we also observe that both TripPy and TripPy-TS suffer from the four perturbation operations, except on P1 with TripPy. As desired, the results show that TripPy-TS is more sensitive than TripPy to the four perturbation operations. This implies that our proposed TLF and SDA approaches rely more heavily on turn and word order, which is the behavior that we expect and desire in a tracking model. We hypothesize that the augmented dialogues (which are complex and long) generated by SDA force the model to incorporate the dialogue order during the training process.

Conclusion
We propose two novel algorithms, TLF (Turn-based Loss Function) and SDA (Sequential Data Augmentation), that improve the effectiveness of state-of-the-art dialogue state tracking models. TLF penalizes such models more heavily if they fail to predict slot values in the early to middle turns of the conversation. SDA, in turn, generates more complex dialogues that are used to train existing DST models to generalize more effectively. Our comprehensive experiments on two benchmark datasets demonstrate the combined utility of TLF and SDA to improve the effectiveness of the leading model in the literature. Indeed, TLF and SDA significantly improve the effectiveness of TripPy by approximately 8.26% relative reduction in error on MultiWOZ2.2, which constitutes the new state-of-the-art result on this most recent benchmark dataset. For future work, we plan to extend both TLF and SDA to incorporate additional information such as dialogue length and the number of active slots during the training process, to be even more effective for long and complex dialogues. We also plan to investigate the effectiveness of TLF and SDA on other existing DST models.