Preview, Attend and Review: Schema-Aware Curriculum Learning for Multi-Domain Dialogue State Tracking

Existing dialog state tracking (DST) models are trained with dialog data in a random order, neglecting rich structural information in a dataset. In this paper, we propose to use curriculum learning (CL) to better leverage both the curriculum structure and the schema structure of task-oriented dialogs. Specifically, we propose a model-agnostic framework called Schema-aware Curriculum Learning for Dialog State Tracking (SaCLog), which consists of a preview module that pre-trains a DST model with schema information, a curriculum module that optimizes the model with CL, and a review module that augments mispredicted data to reinforce the CL training. We show that our proposed approach improves DST performance over both a transformer-based and an RNN-based DST model (TripPy and TRADE) and achieves new state-of-the-art results on WOZ2.0 and MultiWOZ2.1.


Introduction
Dialog state tracking (DST) extracts users' goals in task-oriented dialog systems, where dialog states are often represented as a set of slot-value pairs (Williams et al., 2016). Due to the language variety of multi-turn dialogs, slots and values are often expressed indirectly in the conversation (through co-references, ellipsis, and diverse surface forms), which is a major bottleneck for improving DST performance (Gao et al., 2019; Hu et al., 2020). Many existing DST methods have focused on designing better model architectures to tackle these problems (Dai et al., 2018; Kim et al., 2020), but still do not fully exploit two important aspects of structural information.
The first is curriculum structure in a dataset. Such a structure relies on a measure of the difficulty of examples, which can be used to guide the model training in an easy-to-hard manner, imitating the meaningful learning order in human curricula. This paradigm is called curriculum learning (CL) (Bengio et al., 2009) and has been shown useful in various other problems. DST training examples also vary greatly in their difficulty levels. As shown in Figure 1, for the same slot 'taxi-departure', a user can either inform its value 'nandos' explicitly in a simple utterance or convey her intention implicitly via multi-round interactions, requiring a complex inference process to find the value 'golden house' referred to by the slot 'restaurant-name'. However, CL has rarely been studied in DST, and models are often trained with dialog data in a random order.
Figure 1: An easy and a hard dialog example for DST.
In addition, schema structure is prominent in multi-domain task-oriented dialogs. A schema is specified by a collection of all possible slots and their values, which describes semantic relations among them. Some previous work utilized the structure via an extra schema graph in a regular training process. We propose to incorporate schema information into CL through a pre-curriculum process, in which a DST model can be pre-trained with schema-related objectives to prepare for upcoming DST examples. To reinforce the CL training, we can also expand those examples with frequent mispredictions during CL based upon the schema, enabling the model to accumulate more experience and perform better on similar cases.
Built on these motivations, we propose a novel framework named Schema-aware Curriculum Learning for Dialog State Tracking (SaCLog), which consists of three components: 1) a preview module that pre-trains the base part of a DST model (e.g., BERT and RNN) with objectives capturing the connections between the schema and dialog contexts, 2) a curriculum module that organizes training data from easy to hard and optimizes the model with CL, and 3) a review module that leverages schema-based data augmentation to extend mispredicted data and further boost the CL training process. The proposed approach is model-agnostic, in the sense that it can be incorporated into different DST models. To the best of our knowledge, this is the first attempt to apply CL to the DST task. We show that our proposed approach improves DST performance over both a transformer-based and an RNN-based DST model (TripPy and TRADE) and achieves new state-of-the-art results on WOZ2.0 and MultiWOZ2.1.

Problem Formulation
We denote a dialog context containing $t$ turns as $X_t = \{(R_1, U_1), \ldots, (R_t, U_t)\}$, where $R_i$ and $U_i$ represent the system and user utterances at the $i$-th turn, respectively. DST is tasked to extract turn-level or discourse-level dialog states in the form of a set of slot-value pairs given $X_t$. A turn-level dialog state $Y_t = \{(s, v_t), s \in S\}$ is the set of slot-value pairs extracted only from $(R_t, U_t)$ at the current turn $t$, where $S$ is a predefined set of slots in the schema and $v_t$ is the corresponding value of the slot $s$. A discourse-level dialog state $Z_t$ is the accumulation of the turn-level states $Y_1, \ldots, Y_t$, representing all slot-value pairs that have been expressed over the course of the dialog until the $t$-th turn. We denote a dialog data for DST as $d_t = \{X_t, Y_t, Z_t\}$ and the training dataset as $D$.
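The relation between turn-level and discourse-level states can be illustrated with a minimal sketch. The update rule below (a later value overwrites an earlier one, and a value of none leaves the slot untouched) is an assumption made for illustration; the text above does not specify how deletions or changes are merged, and the function name `accumulate_states` is hypothetical.

```python
from typing import Dict, List

State = Dict[str, str]  # slot name -> value


def accumulate_states(turn_states: List[State]) -> List[State]:
    """Build discourse-level states Z_1..Z_t from turn-level states Y_1..Y_t.

    Assumption (not from the paper): a later value for a slot overwrites an
    earlier one, and a turn-level value of "none" leaves the slot unchanged.
    """
    discourse: State = {}
    history: List[State] = []
    for turn in turn_states:
        for slot, value in turn.items():
            if value != "none":
                discourse[slot] = value
        history.append(dict(discourse))  # snapshot of Z_i after turn i
    return history


# Two turns loosely following the hard example of Figure 1.
turns = [
    {"restaurant-name": "golden house", "taxi-departure": "none"},
    {"taxi-departure": "golden house"},
]
states = accumulate_states(turns)
```

Here `states[-1]` plays the role of $Z_t$, while each element of `turns` plays the role of one $Y_i$.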

Schema-Aware Curriculum Learning
In this section, we first introduce the core curriculum module, which applies basic CL to the DST task; we then describe the preview and review modules, which exploit the schema structure to facilitate the CL training process. The overall framework of SaCLog is shown in Figure 2.

Curriculum Learning for DST
We propose curriculum learning for DST and design two sub-modules: a difficulty scorer that measures the difficulty level of a dialog example with respect to a DST model, as well as a training scheduler module that arranges the scored data as a sequence of easy-to-hard training stages.

The Difficulty Scorer
As a dialog example could be intuitively complex for humans or inherently difficult for neural networks (NNs), both model-based and rule-based scores should be considered. We propose to use a hybrid scoring function that combines the advantages of model predictions and rules.
For model-based difficulty, we predict scores in a cross-validation-like manner. We divide $D$ into $K$ equal-sized subsets, where $K - 1$ subsets are used to train a DST model that predicts the remaining one. This process is repeated $K$ times until every subset has been predicted. For each $d_t$, the score $r^{mod}_t \in [0, 1]$ is computed based on the average accuracy of all mentioned slots (whose values are not none) in $Y_t$. In our experiment, we train six models with the same architecture and different initialization seeds to obtain the mean value $\bar{r}^{mod}_t$ of the model scores.
For rule-based difficulty, we consider four factors to fuse human prior knowledge about DST into our curriculum design: 1) the current dialog turn number $t$; 2) the total number of tokens in $(R_t, U_t)$; 3) the number of mentioned named entities, such as hotel names, in $Z_t$; 4) the number of newly added or changed slots in $Y_t$. We set the maximum values of these factors to 7/50/4/6, respectively, and normalize each factor into a score $r^{rul,i}$. Finally, the hybrid difficulty score $r^{hyb}$ is calculated jointly from the model-based score and the four rule-based scores using weights $\alpha_0, \ldots, \alpha_4$, where $r^{hyb} \in [0, 1]$ and $\sum_{i=0}^{4} \alpha_i = 1$.
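As a rough illustration of the hybrid scoring, the sketch below clips each rule-based factor at its stated maximum (7/50/4/6), scales it to [0, 1], and mixes the result with a model-based score through weights that sum to 1. The clip-and-divide normalization, the plain weighted sum, the example weights, and the convention that the model-based input is already oriented so that larger means harder are all assumptions; the exact formula is not recoverable from the text.

```python
from typing import List, Sequence

# Maximum values of the four rule-based factors, as stated in the text:
# turn number / token count / named entities / added-or-changed slots.
RULE_MAX = (7, 50, 4, 6)


def normalize_rules(factors: Sequence[float]) -> List[float]:
    """Clip each raw factor at its maximum and scale it into [0, 1] (assumed scheme)."""
    return [min(f, m) / m for f, m in zip(factors, RULE_MAX)]


def hybrid_score(model_difficulty: float,
                 factors: Sequence[float],
                 alphas: Sequence[float]) -> float:
    """Weighted combination of one model-based and four rule-based scores.

    `model_difficulty` is assumed to lie in [0, 1] with larger meaning harder.
    `alphas` holds alpha_0..alpha_4 and must sum to 1.
    """
    if len(alphas) != 5 or abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("alphas must contain five weights summing to 1")
    rules = normalize_rules(factors)
    return alphas[0] * model_difficulty + sum(
        a * r for a, r in zip(alphas[1:], rules)
    )


# Hypothetical weights and factor values, chosen only for illustration.
score = hybrid_score(0.5, factors=(7, 25, 2, 3), alphas=(0.6, 0.1, 0.1, 0.1, 0.1))
```

Because every input lies in [0, 1] and the weights are non-negative and sum to 1, the result also stays in [0, 1], which matches the stated range of $r^{hyb}$.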

The Training Scheduler
We adopt a widely used strategy called baby step (Spitkovsky et al., 2010) to organize the scored data for CL. Specifically, we divide the score range uniformly into $N$ intervals and distribute the sorted data into $N$ buckets accordingly. The optimization starts from the easiest bucket as the initial training stage. After reaching a fixed maximum number of epochs or convergence, the next bucket is merged into the current training subset, which is then shuffled for the next training stage. In our experiment, we set the maximum number of epochs to 3, and treat training as converged if the training loss ceases to decrease and the loss value stays within a threshold of 15 for 100 steps. Once all buckets have been aggregated, we continue to train the model for several extra epochs.
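The bucketing and stage accumulation described above can be sketched as follows. This is a simplified illustration that assumes scores in [0, 1]; the function names are hypothetical, and the per-stage training loop (epoch limit, convergence check, shuffling) is left out.

```python
from typing import Iterator, List, Sequence


def make_buckets(scores: Sequence[float], n_buckets: int) -> List[List[int]]:
    """Assign example indices to N buckets using uniform intervals of the score."""
    buckets: List[List[int]] = [[] for _ in range(n_buckets)]
    for idx, s in enumerate(scores):
        # A score of exactly 1.0 falls into the last (hardest) bucket.
        b = min(int(s * n_buckets), n_buckets - 1)
        buckets[b].append(idx)
    return buckets


def baby_step_stages(buckets: List[List[int]]) -> Iterator[List[int]]:
    """Yield the cumulative training subset of each stage, easiest bucket first.

    The caller is expected to shuffle each yielded subset before training on it.
    """
    current: List[int] = []
    for bucket in buckets:
        current = current + bucket
        yield list(current)


scores = [0.05, 0.95, 0.32, 0.31, 1.0]
buckets = make_buckets(scores, n_buckets=10)
stages = list(baby_step_stages(buckets))
```

In this sketch an empty bucket simply produces a stage whose subset equals the previous one; a real scheduler might skip such stages.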

The Preview Module
In human learning, previewing learning materials helps develop an overall picture of what will be covered and can benefit the learning process. In our task, we propose new pre-training objectives to learn a structural inductive bias from the schema structure. Specifically, our preview module contains a slot encoder that computes a slot embedding $e_s$ for each input slot $s$, and a dialog context encoder that extracts the hidden states of $X_t$ as $E_t = [e^1_t, e^2_t, \ldots]$. From $e_s$ and $E_t$, the module predicts a binary sequence $B^s_t$ and classification logits $c^s_t$ using the following components: $\mathrm{Att}(k, V)$ is the attention function that uses the vector $k$ to query the vector sequence $V$ to get a context vector, and $\oplus$ denotes vector concatenation. $\phi^{sig}_d(\cdot)$ and $\phi^{sft}_d(\cdot)$ denote an FNN with one hidden layer of the same size as the input layer and an output layer of size $d$, whose output activation is sigmoid and softmax, respectively. $B^s_t$ is a binary sequence indicating which span of $X_t$ belongs to the value of $s$, while $c^s_t$ is the classification logits indicating whether $s$ is added, deleted, changed, or not mentioned in $Y_t$. Therefore, for each slot $s$, we have a binary sequence loss $L_{seq}$ and a classification loss $L_{cls}$ to optimize. Such pre-training objectives help the encoders understand how a slot is roughly operated on in the current dialog context and how it is connected with all possible tokens of its values in the schema. The dialog context encoder is used to initialize the parameters of the base part of a DST model. The pre-training corpus is constructed from MultiWOZ2.1 dialogs and off-the-shelf synthesized dialogs (Campagna et al., 2020), and contains 337,346 dialog data in total.
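To make the binary sequence target concrete, the following sketch builds a 0/1 label sequence that marks where a slot value occurs in a tokenized context. Exact token matching and whitespace tokenization are assumptions made for illustration; the text does not state how the span labels are derived, and `span_labels` is a hypothetical helper.

```python
from typing import List


def span_labels(context_tokens: List[str], value_tokens: List[str]) -> List[int]:
    """Return a 0/1 sequence marking every occurrence of the value span in the context."""
    labels = [0] * len(context_tokens)
    n = len(value_tokens)
    if n == 0:
        return labels
    for start in range(len(context_tokens) - n + 1):
        if context_tokens[start:start + n] == value_tokens:
            labels[start:start + n] = [1] * n
    return labels


tokens = "i need a taxi from golden house to the station".split()
labels = span_labels(tokens, ["golden", "house"])
```

A sequence such as `labels` could serve as the target of a per-token sigmoid output trained with a binary cross-entropy loss, in the spirit of $L_{seq}$.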
We also leverage a language modelling (LM) loss as an auxiliary loss $L_{aux}$ to learn contextual representations of natural language. To be specific, we use the MLM loss (Devlin et al., 2019) as $L_{aux}$ for transformer-based DST models and the sum of the forward and backward LM losses (Peters et al., 2018) for RNN-based DST models. We only use the original MultiWOZ2.1 dialogs to optimize $L_{aux}$, considering that synthesized data is not suitable for natural language modelling. However, both the original and synthesized data are used to optimize $L_{seq}$ and $L_{cls}$.

The Review Module
The process of review often helps a learner consolidate newly learned difficult concepts. We design a review module that treats mispredicted examples as the concepts that the DST model has not grasped during CL, and utilizes a schema-based data augmenter to produce similar cases from these examples. Specifically, the DST model is monitored at each stage of the CL training process. If the model has not converged at the end of an epoch in a training stage, we choose the top 10% of incorrectly predicted examples according to their training losses as the resource for enlarging the cumulative dataset. The schema-based data augmenter uses three practical techniques to generate data, as follows.
Slot Substitution. A mentioned slot name in $(R_t, U_t)$ is changed into another slot name when its value is dontcare. Specifically, we first collect a word set for each slot name, e.g. {'arrive', 'arriving', 'arrived'} for the slot 'taxi-arriveby'. Then, for a dialog data $d_t$ where $Y_t$ contains a slot $s$ with the value dontcare, we substitute the word of $s$ in the utterance with some word of another slot $s'$ that is of the same domain and not mentioned in $Y_t$.
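A minimal sketch of slot substitution is given below. The word set for 'taxi-leaveat', the rule of swapping the word at the same list position, and the function name are hypothetical; only the 'taxi-arriveby' word set comes from the text.

```python
from typing import Dict, List, Optional, Tuple

# Word sets per slot. Only the 'taxi-arriveby' entry is taken from the text;
# the 'taxi-leaveat' entry is a hypothetical counterpart with aligned forms.
SLOT_WORDS: Dict[str, List[str]] = {
    "taxi-arriveby": ["arrive", "arriving", "arrived"],
    "taxi-leaveat": ["leave", "leaving", "left"],
}


def substitute_slot(utterance: str, state: Dict[str, str],
                    source: str, target: str
                    ) -> Optional[Tuple[str, Dict[str, str]]]:
    """Swap a word of `source` for a word of `target` when `source` is dontcare.

    Returns the new utterance and turn-level state, or None if the
    preconditions described in the text do not hold.
    """
    same_domain = source.split("-")[0] == target.split("-")[0]
    target_mentioned = state.get(target, "none") != "none"
    if state.get(source) != "dontcare" or target_mentioned or not same_domain:
        return None
    tokens = utterance.split()
    for i, tok in enumerate(tokens):
        if tok in SLOT_WORDS[source]:
            # Assumed rule: pick the target word at the same list position.
            tokens[i] = SLOT_WORDS[target][SLOT_WORDS[source].index(tok)]
            new_state = {k: v for k, v in state.items() if k != source}
            new_state[target] = "dontcare"
            return " ".join(tokens), new_state
    return None


result = substitute_slot("i do not care when i arrive",
                         {"taxi-arriveby": "dontcare"},
                         "taxi-arriveby", "taxi-leaveat")
```

The augmented example keeps the dontcare value but attaches it to the substituted slot, so the label stays consistent with the rewritten utterance.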
Value Replacement. A slot's value is replaced with another proper one when the value is explicitly contained in $U_t$. Specifically, we leverage the predefined schema of the dialog dataset to produce a value set for each slot and use the label map of Heck et al. (2020) to locate the value span within the utterance. The target value is then replaced with another value of the same slot.
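Value replacement can be sketched in a few lines. For simplicity the example locates the value by plain substring search, which is an assumption standing in for the label map of Heck et al. (2020); the schema contents and the function name are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple


def replace_value(utterance: str, slot: str, value: str,
                  schema: Dict[str, List[str]],
                  rng: random.Random) -> Optional[Tuple[str, str]]:
    """Replace `value` in the utterance with another value of the same slot.

    Returns the new utterance and the new value, or None when the value is
    not explicitly contained in the utterance or no alternative exists.
    """
    if value not in utterance:
        return None
    candidates = [v for v in schema.get(slot, []) if v != value]
    if not candidates:
        return None
    new_value = rng.choice(candidates)
    # Replace only the first occurrence; the label must be updated to new_value.
    return utterance.replace(value, new_value, 1), new_value


# Hypothetical schema with two values for one slot.
schema = {"taxi-departure": ["nandos", "golden house"]}
augmented = replace_value("pick me up at nandos", "taxi-departure",
                          "nandos", schema, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs.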
Dialog Recombination. To recombine a dialog data $d_t$, we randomly search for another dialog data in $D$ that has the same mentioned slots (whose values are not none) in $Y_t$. We then cut and stitch their histories $X_{t-1}$ and current utterances $(R_t, U_t)$, and exchange their $Y_t$ to produce two new dialog data.
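One possible reading of dialog recombination is sketched below: the history of one example is stitched to the current turn of the other, and each new example takes the turn-level state that belongs to its new current turn. This interpretation, the dictionary layout, and the function names are assumptions; the discourse-level state of each new example would need to be recomputed and is omitted here.

```python
from typing import Dict, FrozenSet, Optional, Tuple

Example = Dict[str, object]  # keys: "history", "turn", "Y" (hypothetical layout)


def mentioned_slots(turn_state: Dict[str, str]) -> FrozenSet[str]:
    """Slots whose turn-level value is not none."""
    return frozenset(s for s, v in turn_state.items() if v != "none")


def recombine(a: Example, b: Example) -> Optional[Tuple[Example, Example]]:
    """Stitch each history to the other example's current turn and turn-level state."""
    if mentioned_slots(a["Y"]) != mentioned_slots(b["Y"]):
        return None
    new_a = {"history": a["history"], "turn": b["turn"], "Y": b["Y"]}
    new_b = {"history": b["history"], "turn": a["turn"], "Y": a["Y"]}
    return new_a, new_b


a = {"history": ["hist a"], "turn": ("sys a", "usr a"),
     "Y": {"hotel-name": "x", "hotel-area": "none"}}
b = {"history": ["hist b"], "turn": ("sys b", "usr b"),
     "Y": {"hotel-name": "y"}}
pair = recombine(a, b)
```

Requiring identical sets of mentioned slots keeps the stitched turn plausible as a continuation of the borrowed history.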

Experiments
Two popular datasets, WOZ2.0 (Wen et al., 2017) and MultiWOZ2.1, are used to verify our approach. WOZ2.0 is a single-domain dataset with 1,200 dialogs and 3 slots. MultiWOZ2.1 is a multi-domain dialog dataset with 10,438 dialogs and 30 slots spanning 7 domains. The data splits (train/valid/test) of WOZ2.0 and MultiWOZ2.1 are 600/200/400 and 8,438/1,000/1,000, respectively. We use the joint goal accuracy (JGA), the ratio of dialog data whose $Z_t$ is correct, as the evaluation metric. We apply SaCLog to TripPy (Heck et al., 2020), a transformer-based DST model, and TRADE, an RNN-based DST model, to show its effect. The slot encoder and the dialog context encoder share weights. We use a BERT base as the encoder and the [CLS] embedding as the slot embedding in TripPy, and use a bi-GRU as the encoder and the concatenation of the first and last hidden states as the slot embedding in TRADE. We also follow TripPy and add 2 new slot operations (i.e. refer/dontcare) to the classification types of $L_{cls}$.
Implementation Details. For the preview module, we use Adam (Kingma and Ba, 2015) with a fixed learning rate of 3e-5 for 3 epochs of pre-training. The batch size for $L_{aux}$ is 14 and the batch size for $L_{seq}$ and $L_{cls}$ is 64. For the curriculum module, we use a warm-up strategy for the Adam optimizer with a maximum learning rate of 1e-4. Before CL, we train models on the full dataset for 2 epochs. After all subsets are accumulated, we train for 10 extra epochs with a minimum learning rate of 1e-6. We set the bucket number $N = 10$ and the number of cross-validation folds $K = 5$. The batch size is 36 and the maximum length is 256. To simplify the review process, we conduct data augmentation after the CL training is finished.

Table 1: Results of baselines on MultiWOZ2.1 and WOZ2.0 (JGA).
Models                                   MultiWOZ2.1     WOZ2.0
GLAD (Zhong et al., 2018)                35.57% **       88.1±0.4%
SUMBT                                    46.65% **       91.0±1.0%
DST-picklist                             53.30%          -
TripPy (Heck et al., 2020)               55.29±0.28%     92.7±0.2%
SimpleTOD (Hosseini-Asl et al., 2020)    55.72%          -
CHAN (Shan et al., 2020)                 58.

Table 2: JGA with different difficulty scores and modules.
Models                         JGA
+ CL (rule-based)              58.38±0.17%
+ CL (model-based)             58.71±0.21%
+ CL (hybrid)                  58.85±0.23%
+ SaCLog (w/o. review)         60.19±0.26%
+ SaCLog (w/o. preview)        60.23±0.34%
+ SaCLog                       60.61±0.31%
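The JGA metric defined above can be computed with a short helper. This is a minimal sketch assuming each discourse-level state is represented as a dictionary from slot names to values; the function name and the toy data are hypothetical.

```python
from typing import Dict, Sequence

State = Dict[str, str]  # slot name -> value


def joint_goal_accuracy(predictions: Sequence[State],
                        references: Sequence[State]) -> float:
    """Fraction of dialog data whose predicted discourse-level state is fully correct."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(1 for p, r in zip(predictions, references) if p == r)
    return correct / len(references)


preds = [{"hotel-area": "north"}, {"hotel-area": "north", "hotel-stars": "4"}]
golds = [{"hotel-area": "north"}, {"hotel-area": "north", "hotel-stars": "5"}]
jga = joint_goal_accuracy(preds, golds)
```

A single wrong slot value makes the whole state count as incorrect, which is why JGA is a strict metric.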

Performance of TripPy+SaCLog
Table 1 shows the results of our approach compared with various baselines. Based upon TripPy, we obtain state-of-the-art performance on both datasets with SaCLog. The two closest baselines, ConvBERT (Mehri et al., 2020) and CoCoAug (Li et al., 2021), are also built upon TripPy: ConvBERT enhances its performance by using external large-scale conversational corpora to pre-train a BERT base, and CoCoAug leverages a delicate counterfactual augmentation technique to produce much larger training data. Our method, in contrast, benefits from the CL framework and improves TripPy by utilizing the preview and review modules.
We also compare different difficulty scores by adding the curriculum module and utilizing the same pre-trained BERT base. As we can see, using the hybrid difficulty score achieves better JGA (58.85%) than using either single score, indicating that both model prediction and human knowledge are necessary. When the other two modules are incorporated into the CL framework, the performance is boosted further. The combination of both modules increases the JGA by 1.76% (from 58.85% to 60.61%), suggesting that schema-aware pre-training and dialog augmentation are crucial for improving DST performance in the CL training.

Performance of TRADE+SaCLog
We also apply SaCLog to the classical RNN-based generative DST model, TRADE. As Table 3 shows, SaCLog improves TRADE by around 3∼4% JGA on both datasets, demonstrating the effectiveness of SaCLog on different types of base DST models.

Related Work
Curriculum Learning (CL) has attracted increasing research interest in various NLP tasks, such as machine translation, general language understanding, reading comprehension (Tay et al., 2019) and open-domain chatbots (Bao et al., 2020; Su et al., 2020). Yet, the research on using CL in task-oriented dialog systems is limited. There has been some work (Saito, 2018; Zhao et al., 2021) on using CL in dialog policy learning, but applying CL to DST has not been investigated. Learning a structural inductive bias during pre-training has been shown to be beneficial in downstream tasks that require parsing semantics, such as text-to-SQL (Yu et al., 2021) and table cell recognition. There are also many works (Hou et al., 2018; Yoo et al., 2020) on dialog augmentation. We aim to integrate these methods to build a general CL framework for DST.

Conclusion
In this paper, we propose a model-agnostic framework named schema-aware curriculum learning for DST, which exploits both the curriculum structure and the schema structure in task-oriented dialogs and is shown to substantially improve DST performance. In the future, we plan to investigate CL approaches for other dialog modeling tasks.