Domain-Lifelong Learning for Dialogue State Tracking via Knowledge Preservation Networks

Dialogue state tracking (DST), which estimates user goals given a dialogue context, is an essential component of task-oriented dialogue systems. Conventional DST models are usually trained offline, which requires a fixed dataset prepared in advance. This paradigm is often impractical in real-world applications since online dialogue systems usually involve continually emerging new data and domains. Therefore, this paper explores Domain-Lifelong Learning for Dialogue State Tracking (DLL-DST), which aims to continually train a DST model on new data to learn incessantly emerging new domains while avoiding catastrophically forgetting old learned domains. To this end, we propose a novel domain-lifelong learning method, called Knowledge Preservation Networks (KPN), which consists of multi-prototype enhanced retrospection and multi-strategy knowledge distillation, to solve the problems of expression diversity and combinatorial explosion in the DLL-DST task. Experimental results show that KPN effectively alleviates catastrophic forgetting and outperforms previous state-of-the-art lifelong learning methods by 4.25% and 8.27% of whole joint goal accuracy on the MultiWOZ benchmark and the SGD benchmark, respectively.


Introduction
Task-oriented dialogue systems aim to help users accomplish various tasks, such as reserving restaurants, booking flights, and checking the weather (Young et al., 2013; Lei et al., 2018). Dialogue state tracking (DST) is an essential component of task-oriented dialogue systems, which estimates user goals for downstream modules (Bohus and Rudnicky, 2006; Henderson et al., 2014b; Mrkšić et al., 2017; Shan et al., 2020). Given a user utterance and its dialogue history, a DST model should be able to output an accurate dialogue state. In general, the dialogue state is represented as a set of slot-value pairs, such as ((restaurant-area, north), (restaurant-price, expensive)). Previous DST models are usually trained offline, which requires a fixed dataset prepared in advance. These offline solutions are often impractical in real-world applications, as online dialogue systems usually involve continually emerging new data and domains, especially when new services are introduced. In addition, it is infeasible to retrain DST models from scratch every time new domain data arrives due to computational costs, storage budgets, and data privacy (McMahan et al., 2017). To tackle this realistic issue, we explore Domain-Lifelong Learning for Dialogue State Tracking (DLL-DST). As shown in Figure 1, the DLL-DST task aims to continually train a DST model on new data to learn incessantly emerging new domains. At each step, new data generally contains one or multiple new domains, and the updated model should be able to make accurate predictions for all the domains observed so far.
A plain approach to domain-lifelong learning is to simply fine-tune a pre-trained model on new data. However, this approach suffers from the problem of catastrophic forgetting (McCloskey and Cohen, 1989; French, 1999). To be more specific, fine-tuning the model on new data usually results in a significant performance drop on old data. To address this problem, there are two mainstream lifelong learning methods: (1) regularization-based methods, which try to identify and preserve the parameters important to old data (Kirkpatrick et al., 2017; Aljundi et al., 2018); (2) replay-based methods, which reserve some representative old samples and combine them with new data to re-train the model (Rebuffi et al., 2017; Hou et al., 2019). Recently, replay-based methods have shown promising results in alleviating catastrophic forgetting of class-lifelong learning tasks in NLP scenarios (Han et al., 2020; Cao et al., 2020).
However, when deploying previous replay-based methods on the DLL-DST task, we find two main problems: expression diversity and combinatorial explosion. Expression diversity: In the DST task, dialogue texts usually contain a variety of expressions for each dialogue state, as shown in Figure 2. The expression diversity makes it difficult for previous replay-based methods to select the most representative old samples. The unrepresentative samples, such as the first utterance in Figure 2, do not contain typical expressions for any slot. Retraining models with these unrepresentative samples is not conducive to retaining the performance on old domains. Combinatorial explosion: Ideally, we should reserve at least one sample for each dialogue state in old domains. However, the classes of dialogue states explode rapidly as the number of slot-value pairs increases. For example, the MultiWOZ 2.1 dataset (Eric et al., 2019) has an average of 2732 classes of dialogue states per domain. In the DLL-DST task, it is infeasible for replay-based methods to reserve samples for each class of dialogue states in old domains due to limitations of memory capacity and training time. Since the reserved samples involve only a few types of dialogue states, previous methods may gradually forget previous knowledge, leading to catastrophic forgetting.
To address the above two problems, we propose Knowledge Preservation Networks (KPN), which contain two main components: (1) to handle expression diversity, we propose multi-prototype enhanced retrospection, which computes multiple slot prototypes for each domain and selects the most representative old samples based on these slot prototypes; (2) to cope with the combinatorial explosion problem, we propose multi-strategy knowledge distillation, which enables the model at the current step to preserve the knowledge of the model trained in the last step from multiple aspects, instead of just reserving some old samples. Experimental results demonstrate that KPN outperforms previous state-of-the-art lifelong learning methods by 4.25% and 8.27% of whole joint goal accuracy on the MultiWOZ benchmark and the SGD benchmark, respectively.
The contributions of this paper are listed as follows:
• To the best of our knowledge, we are the first to formally introduce domain-lifelong learning into dialogue state tracking, and we construct two benchmarks from two widely used DST datasets, MultiWOZ 2.1 and SGD.
• We propose Knowledge Preservation Networks, which handle expression diversity and combinatorial explosion in the DLL-DST task via multi-prototype enhanced retrospection and multi-strategy knowledge distillation.
• Experimental results show that our method outperforms previous lifelong learning methods and achieves state-of-the-art performance. We will release the source code and the benchmarks for further research (https://github.com/liuqingbin/Knowledge-Preservation-Networks).

Task Formulation
The DST task is usually formulated as a slot-filling task (Bohus and Rudnicky, 2006). At each dialogue turn, the DST model takes the user utterance and the dialogue history as input and predicts values for each slot. As shown in Figure 2, the DST model is expected to fill the slot "restaurant-price" with the value "expensive". The DLL-DST task continually trains DST models on emerging data to learn new domains. New data arrives in a stream form ($\{D_1, D_2, ..., D_N\}$). At each step, the new data ($D_i$) can contain one or multiple new domains. In addition, inspired by other lifelong learning work (Lopez-Paz and Ranzato, 2017; Zenke et al., 2017), we treat dialogues across the same multiple domains as data of a special domain, since these cross-domain dialogues usually contain specific expressions that distinguish them from other dialogues, such as domain transformation and slot reference (Ouyang et al., 2020; Hu et al., 2020). Each new dataset has its own training and test sets ($D_i^{train}$ and $D_i^{test}$). When new data ($D_i$) arrives, the DST model is optimized using the new training data ($D_i^{train}$). The updated model should still perform well on all previous domains. Therefore, in the testing stage of the i-th step, we evaluate the updated model on the test data of all observed domains (i.e., $\bigcup_{k=1}^{i} D_k^{test}$). This evaluation protocol implies that achieving high performance becomes increasingly difficult for DST models as more domains arrive.
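The evaluation protocol above can be sketched as a simple loop. This is only an illustrative sketch: `run_lifelong_protocol`, `train_step`, and `predict` are hypothetical names, and exact-match on the full dialogue state is assumed as the correctness criterion.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (dialogue context, gold dialogue state as slot -> value)
Sample = Tuple[str, Dict[str, str]]


def run_lifelong_protocol(
    stream: Sequence[Dict[str, List[Sample]]],
    train_step: Callable[[List[Sample]], None],
    predict: Callable[[str], Dict[str, str]],
) -> List[float]:
    """Train on D_1..D_N in order; after step i, test on all D_k^test with k <= i."""
    seen_test: List[Sample] = []
    scores: List[float] = []
    for data in stream:
        train_step(data["train"])        # only the newly arrived training data
        seen_test.extend(data["test"])   # union of the test sets observed so far
        correct = sum(predict(ctx) == gold for ctx, gold in seen_test)
        scores.append(correct / len(seen_test))
    return scores
```

A model that forgets earlier domains is penalized at every later step, because the test pool only grows.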

Method
In this work, we propose Knowledge Preservation Networks (KPN) to handle the DLL-DST task. KPN consists of two core components, i.e., multi-prototype enhanced retrospection and multi-strategy knowledge distillation, for dealing with the two main challenges, i.e., expression diversity and combinatorial explosion. The framework of KPN is shown in Figure 3.

Background
Our method, KPN, is a lifelong learning framework, which is model-agnostic. The DST model is only a basic component, not our research focus. DST models, such as TRADE (Wu et al., 2019), SAS (Hu et al., 2020), and SOM-DST (Kim et al., 2020), can all be used as this basic component. We adopt the previous best model, SOM-DST, in this work.
In each dialogue turn, SOM-DST simplifies the dialogue history to the last system response and the last dialogue state, and then combines them with the current user utterance into an input sequence for the BERT encoder (Devlin et al., 2019). BERT is a multi-layer Transformer (Vaswani et al., 2017), pre-trained on large-scale unlabeled corpora. To fit the input form of BERT, the tokens [CLS] and [SEP] are placed in the input sequence. In addition, a special token [SLOT] is placed at the beginning of each slot in the last dialogue state. The BERT encoder obtains the contextual representation for the input sequence. The encoded hidden state of [SLOT] is used as the feature vector of each slot.
For each slot, SOM-DST first classifies it into one of four categories: "dontcare", "carryover", "update", and "delete". "dontcare" means that the user does not care about this slot. "carryover" indicates that the slot inherits the value of the same slot from the last dialogue state. "update" means that the model needs to generate a value for the slot. "delete" means that this slot does not contain any value. A softmax classifier is added to the feature vector of each slot to predict its category. The cross-entropy loss is used to train the classifier:
$$L_{c} = -\sum_{x \in N} \sum_{s \in C} y_s \log p_s$$
where $y_s$ is the one-hot label for the slot s and $p_s$ is the predicted probability. N is the set of training samples and C is the set of slots of all observed domains. For each slot belonging to the "update" category, SOM-DST generates a value for this slot via the GRU decoder (Cho et al., 2014). The decoder is equipped with the ability to copy words from the input sequence (Kim et al., 2020). The cross-entropy loss is used to train the generation probability:
$$L_{g} = -\sum_{x \in N} \sum_{s \in U} \sum_{i=1}^{d} y_i^v \log p_i^v$$
where $p_i^v$ is the predicted probability of the i-th word of the value v, $y_i^v$ is the one-hot label, d is the length of the value, and U is the set of slots that are predicted to be in the "update" category.
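As a minimal sketch of the slot classification objective described above, the snippet below computes a softmax cross-entropy over the four categories for each slot of one sample. The names `OPERATIONS` and `slot_operation_loss` and the use of raw logits as input are assumptions made for illustration; they are not taken from the SOM-DST code.

```python
import math
from typing import Dict, List

OPERATIONS = ["dontcare", "carryover", "update", "delete"]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def slot_operation_loss(
    logits_per_slot: Dict[str, List[float]], gold: Dict[str, str]
) -> float:
    """Sum over slots of -log p(gold category) for a single sample."""
    loss = 0.0
    for slot, logits in logits_per_slot.items():
        probs = softmax(logits)
        loss -= math.log(probs[OPERATIONS.index(gold[slot])])
    return loss
```

With a one-hot label, the inner sum over categories reduces to the negative log-probability of the gold category, which is what the loop accumulates.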

Multi-Prototype Enhanced Retrospection
In this paper, we focus on the domain-lifelong learning scenario for DST. Given a model trained on old data, we aim to continually learn a unified DST model for all domains observed so far based on a new combined dataset $N = D_i^{train} \cup P$. $D_i^{train}$ is the training data of the new domains at step i. P is a bounded memory that stores representative old samples, denoted as $P = \{P_1, P_2, ..., P_m\}$. $P_i$ is the set of stored samples for the i-th domain, and m is the number of old domains.
Since the DLL-DST task suffers from expression diversity, we propose multi-prototype enhanced retrospection to reserve the most representative samples of old domains. In this way, important information about the data distribution of all previous domains enters the subsequent training process. This approach is inspired by prototype learning (Snell et al., 2017;Yang et al., 2018), which uses prototypes as representative points.
Specifically, after learning on the new domains, we store $|P_i| = B/m$ samples for each new domain, where m is the number of all observed domains and B is the total number of samples that can be reserved. We encode all samples of the i-th domain into hidden representations and compute a prototype $\mu_s$ for each slot s in this domain:
$$\mu_s = \frac{1}{|N|} \sum_{x \in N} f_s(x)$$
where N is the set of training samples of the i-th domain and $f_s(x)$ is the hidden state of the [SLOT] token in front of the slot s, which serves as the slot representation. Then, we compute the distance between the slot representation of each training sample and the corresponding slot prototype. Based on the average distance over all slots, we produce a sorted list of new training samples. Intuitively, the closer a sample is to these slot prototypes, the more representative it is for these slots. Based on the sorted list of samples, the top B/m samples are selected as exemplars to be stored in the memory.
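The selection procedure above can be sketched in a few lines. This is a hedged illustration, not the released implementation: the function names are hypothetical, every sample is assumed to provide a feature vector for every slot of its domain, and Euclidean distance is assumed because the text does not name the distance measure.

```python
import math
from typing import Dict, List

# slot name -> hidden state of the [SLOT] token for one sample
SlotFeatures = Dict[str, List[float]]


def slot_prototypes(samples: List[SlotFeatures]) -> SlotFeatures:
    """One prototype per slot: the mean slot representation over the domain."""
    protos: SlotFeatures = {}
    for slot in samples[0]:
        dim = len(samples[0][slot])
        protos[slot] = [
            sum(s[slot][d] for s in samples) / len(samples) for d in range(dim)
        ]
    return protos


def rank_by_prototype_distance(samples: List[SlotFeatures]) -> List[int]:
    """Sample indices sorted by average distance to the slot prototypes."""
    protos = slot_prototypes(samples)

    def avg_dist(sample: SlotFeatures) -> float:
        # Euclidean distance is an assumption of this sketch
        return sum(math.dist(sample[s], protos[s]) for s in protos) / len(protos)

    return sorted(range(len(samples)), key=lambda i: avg_dist(samples[i]))


def select_exemplars(samples: List[SlotFeatures], quota: int) -> List[int]:
    """Keep the `quota` samples closest to the prototypes (quota plays the role of B/m)."""
    return rank_by_prototype_distance(samples)[:quota]
```

Because the ranking averages over all slots, a sample that is typical for one slot but far from the others is ranked below a sample that is moderately typical for every slot.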
Since the storage size of memory is constant, when new domains arrive, the memory needs to remove some reserved exemplars of old domains to allocate space for the exemplars of new domains. Suppose the number of new domains is t. The memory needs to remove B/(m−t)−B/m stored samples of each old domain. For each old domain, we remove the samples that are far from these prototypes according to the sorted list.
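The memory update can be checked with a small worked example. The sketch below assumes integer division for the per-domain quota and assumes that each stored exemplar list is already sorted from closest to farthest from the prototypes; `shrink_memory` and `removed_per_old_domain` are hypothetical helper names.

```python
from typing import Dict, List


def removed_per_old_domain(budget: int, m: int, t: int) -> int:
    """B/(m-t) - B/m, with m = all observed domains and t = new domains."""
    return budget // (m - t) - budget // m


def shrink_memory(
    memory: Dict[str, List[str]], budget: int, num_new_domains: int
) -> Dict[str, List[str]]:
    """Truncate each old domain to the new quota, dropping the farthest exemplars."""
    m = len(memory) + num_new_domains
    quota = budget // m
    return {domain: exemplars[:quota] for domain, exemplars in memory.items()}
```

For example, with B = 50, two old domains, and three new domains, each old domain shrinks from 25 to 10 exemplars, so 15 are removed per old domain.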

Multi-Strategy Knowledge Distillation
As mentioned above, just reserving some old samples makes previous lifelong learning methods still suffer from combinatorial explosion. Since the reserved samples involve only a few types of dialogue states, these methods may gradually forget the previous knowledge. To handle this problem, we propose multi-strategy knowledge distillation, which preserves the knowledge of the model trained in the last step through multiple distillation strategies. In this way, the current model can perform well on the old domains. Knowledge distillation is an effective way to transfer knowledge from one network to another (Hinton et al., 2015).

Encoder Feature Distillation
For each slot, we denote its feature vector extracted by the BERT encoder of the current model and the BERT encoder of the last model as $f_s(x)$ and $f_{s,*}(x)$, respectively. To preserve the feature distribution in the original encoder, we adopt an encoder feature distillation loss based on $\langle f_s(x), f_{s,*}(x) \rangle$, which measures the cosine similarity between the two features. This loss is computed for all samples from the new domains and the reserved exemplars. If the features of the current encoder do not greatly deviate from those of the last encoder, the current model can effectively preserve the knowledge of the model trained in the last step.
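A minimal sketch of a cosine-based feature distillation term follows. The text states only that cosine similarity between the two features is measured; penalizing `1 - cosine similarity` per slot, as done here, is an assumption of this sketch, and the function names are hypothetical.

```python
import math
from typing import Dict, List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm  # assumes neither vector is all zeros


def encoder_feature_distillation(
    current: Dict[str, List[float]], last: Dict[str, List[float]]
) -> float:
    """Sum over slots of (1 - cos) between current and last-step slot features."""
    return sum(1.0 - cosine_similarity(current[s], last[s]) for s in last)
```

The term is zero when every slot feature keeps its direction and grows as the current encoder drifts away from the last one.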

Classifier Prediction Distillation
In addition, we also adopt a classifier prediction distillation, which preserves the previous knowledge of the last classifier. The predicted probabilities are softened with a temperature, where T is the temperature scalar. T is usually greater than 1 to increase the weights of small probability values. The classifier prediction distillation loss is also computed for the training samples of the new domains and the reserved exemplars of the old domains.
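The effect of the temperature can be illustrated with a short sketch in the style of Hinton et al. (2015). The exact loss used by the authors is not recoverable from the text, so the cross-entropy between softened distributions below is an assumption, as are the function names.

```python
import math
from typing import List


def softened(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits divided by the temperature T."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def classifier_prediction_distillation(
    current_logits: List[float], last_logits: List[float], temperature: float = 2.0
) -> float:
    """Cross-entropy of the current softened prediction against the last model's."""
    p = softened(current_logits, temperature)
    q = softened(last_logits, temperature)
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p))
```

With logits `[4, 0, 0, 0]`, raising T from 1 to 2 lifts each small probability from roughly 0.017 to roughly 0.096, which is the sense in which T > 1 increases the weight of small probability values.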

Decoder Feature Distillation
To retain the previous knowledge of the last decoder, we adopt a decoder feature distillation loss to learn the feature distribution of the last decoder. The loss is computed between $g_i(x)$ and $g_i^*(x)$, the i-th hidden states decoded by the current decoder and the last decoder for the slot s.

Generation Prediction Distillation
We adopt another prediction distillation loss $L_{gp}$ to mimic the generation probability predicted by the last decoder. Because the sigmoid function in the decoder (Kim et al., 2020) makes it impossible to adopt the above prediction distillation loss, we use the KL-divergence between $p_i$ and $q_i$ as the generation prediction distillation loss, where $p_i$ and $q_i$ are the i-th probabilities predicted by the current and last decoders for the slot s.
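As a rough sketch of a KL-based distillation term, the snippet below compares two word distributions at one decoding position. The direction KL(q || p), with the last decoder q as the reference, and the small `eps` guard are assumptions of this sketch; the text does not specify either.

```python
import math
from typing import List


def kl_divergence(q: List[float], p: List[float], eps: float = 1e-12) -> float:
    """KL(q || p) for two discrete distributions over the same vocabulary."""
    return sum(
        qk * math.log((qk + eps) / (pk + eps)) for qk, pk in zip(q, p) if qk > 0.0
    )


def generation_prediction_distillation(
    current: List[List[float]], last: List[List[float]]
) -> float:
    """Sum of per-position KL terms over the d decoded positions of one value."""
    return sum(kl_divergence(q_i, p_i) for p_i, q_i in zip(current, last))
```

Unlike the temperature-softened loss, this term works directly on probabilities, so it does not need access to pre-softmax logits.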

Training
During each step of the domain-lifelong learning process, we combine the above losses into a single training loss L, in which α and β are two adjustment coefficients. The coefficients are used to balance the performance of the old domains and the new domains. If α and β are very small, the model will pay more attention to the new domains, thus hurting the performance of the old domains. At each step, we train the model on the combined training set ($D_i^{train} \cup P$) with the loss L, and then select the most representative samples to update the memory. Therefore, our method can continually learn new domains while avoiding catastrophically forgetting old domains.
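One step of the procedure described above (train on the new data plus the memory, then refresh the memory) might be organized as follows. This is a sketch under stated assumptions: `lifelong_step`, `train`, and `rank` are hypothetical names, `rank` stands in for the prototype-based ordering, and the stored exemplar lists are assumed to be sorted from most to least representative.

```python
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def lifelong_step(
    new_data: Dict[str, List[T]],
    memory: Dict[str, List[T]],
    budget: int,
    train: Callable[[List[T]], None],
    rank: Callable[[List[T]], List[T]],
) -> Dict[str, List[T]]:
    """Train on new data plus memory, then rebuild the memory under the budget."""
    combined = [x for samples in new_data.values() for x in samples]
    combined += [x for exemplars in memory.values() for x in exemplars]
    train(combined)  # optimizes the combined loss L in the real system

    quota = budget // (len(memory) + len(new_data))
    updated = {d: ex[:quota] for d, ex in memory.items()}  # shrink old domains
    for domain, samples in new_data.items():
        updated[domain] = rank(samples)[:quota]            # add new exemplars
    return updated
```

The returned dictionary becomes the memory for the next step, so its total size never exceeds the budget B.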

DLL-DST Benchmarks
To the best of our knowledge, we are the first to formally introduce the DLL-DST task. Therefore, we construct two benchmarks based on the following method: for a given DST dataset, we arrange its domains in a fixed random order. Each domain has its own data and ontology (i.e., slot-value pairs). In a domain-incremental manner, the lifelong learning methods continually train a DST model on one or multiple new domains. Following other tasks (Li and Hoiem, 2017; Cao et al., 2020), we adopt one new domain at each step. As described in Section 2, inspired by other tasks, we treat dialogues across the same multiple domains as data of a special new domain, since they usually contain many specific expressions. Based on two widely used datasets, MultiWOZ 2.1 (Eric et al., 2019) and SGD (Rastogi et al., 2019), we propose two instantiations of the above construction method. MultiWOZ benchmark: We use the data splitting of the official MultiWOZ 2.1 dataset. Since the domains in MultiWOZ 2.1 have a long-tail frequency distribution, we use the data of the top 10 most frequent domains (including the combined domains). SGD benchmark: As with the MultiWOZ benchmark, we use the data of the top 15 most frequent domains. Table 1 shows the statistics of the two benchmarks.
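The construction method above can be sketched as follows. The helper names, the `+`-joined label for combined domains, and the seeded shuffle are illustrative assumptions; the released benchmarks may encode domains and ordering differently.

```python
import random
from collections import Counter
from typing import Iterable, List


def domain_label(domains: Iterable[str]) -> str:
    """Treat a dialogue spanning several domains as one special combined domain."""
    return "+".join(sorted(domains))


def build_domain_order(dialogue_labels: List[str], top_k: int, seed: int = 0) -> List[str]:
    """Keep the top_k most frequent domain labels and fix a random order for them."""
    counts = Counter(dialogue_labels)
    frequent = [label for label, _ in counts.most_common(top_k)]
    random.Random(seed).shuffle(frequent)  # fixed seed gives a reproducible order
    return frequent
```

Each label in the returned list then defines one step of the lifelong learning process.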

Experimental Settings
For the DST task, joint goal accuracy (JGA) is used as the evaluation metric (Zhong et al., 2018). For the DLL-DST task, every time the model finishes training on new domains, we report JGA on the test data of all observed domains. For example, after the i-th step, the result is denoted as $JGA_i$. In addition, after the last step, we report Average JGA, which is the average score over all k steps ($\frac{1}{k} \sum_{i=1}^{k} JGA_i$), and Whole JGA, which is the JGA score on the whole testing data of all domains.
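The two summary metrics can be computed as in the sketch below. Joint goal accuracy is taken here as the fraction of turns whose full predicted state exactly matches the gold state; the function names are hypothetical.

```python
from typing import Dict, List

State = Dict[str, str]  # slot -> value


def joint_goal_accuracy(predictions: List[State], golds: List[State]) -> float:
    """Fraction of turns where every slot-value pair is predicted correctly."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


def average_jga(jga_per_step: List[float]) -> float:
    """Mean of JGA_1..JGA_k over the k lifelong learning steps."""
    return sum(jga_per_step) / len(jga_per_step)
```

Whole JGA is simply `joint_goal_accuracy` applied once, after the last step, to the test data of all domains.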
Our method uses HuggingFace's Transformers library to implement the BERT-based DST model. The learning rate is set to 5e-5. The batch size is 4. The hyper-parameters α and β are 0.2 and 0.1, respectively. T = 2 in our experiments. The capacity of memory is 50. All hyper-parameters are obtained by a grid search on the validation set.

Baselines
In this work, we propose a model-agnostic lifelong learning method to handle the DLL-DST task. Therefore, we adopt other model-agnostic lifelong learning methods that achieve state-of-the-art performance on other tasks as our baselines:
• EWC (Kirkpatrick et al., 2017), which slows down the update of important parameters by adding an L2 regularization of parameter changes.
• LwF (Li and Hoiem, 2017), which matches the prediction of the current network with that of the original network by knowledge distillation.
• EMR, which alleviates forgetting by randomly storing some old samples.
• AdapterCL (Madotto et al., 2020), which adds model parameters to learn new data.
• EMAR (Han et al., 2020), which selects representative samples based on only one prototype and consolidates the model through the prototype.
• FineTune, which simply fine-tunes the pre-trained model on new data.
• UpperBound, which uses training samples from all observed domains to train the model. We regard it as the upper bound of the benchmark.
Table 2: Average JGA (%) of all steps ("Avg" column) and whole JGA (%) on the whole testing data ("Whole" column) after the last step.

Main Results
Figure 4 shows the JGA scores over the observed domains during the whole lifelong learning process. We also list the results after the last step in Table 2. From the results, we can observe that: (1) Our proposed method KPN significantly outperforms other baselines and achieves state-of-the-art performance on both the MultiWOZ and SGD benchmarks. For example, compared to EMAR, our method achieves 4.25% and 8.27% improvements of the whole joint goal accuracy on the MultiWOZ benchmark and the SGD benchmark, respectively. This verifies the effectiveness of our method on the DLL-DST task.
(2) At each step of the domain-lifelong learning process, there is a performance gap between EMAR and our method KPN. The reason is that EMAR ignores the problems of expression diversity and combinatorial explosion in the DLL-DST task. Therefore, EMAR fails to reserve the most representative samples and tends to gradually forget the previous knowledge of the original model, eventually resulting in catastrophic forgetting. The architecture-based method, AdapterCL, greatly increases the computation time due to the need to select the parameters to be executed. Besides, because AdapterCL only trains domain-specific parameters, it has weak representation capabilities for each domain and achieves low performance.
(3) FineTune always achieves the worst results on both benchmarks, which confirms that catastrophic forgetting is indeed a major difficulty in the DLL-DST task. In addition, there is still a gap between our method and the upper bound. It indicates that, although we have proposed an effective approach for the DLL-DST task, there remain numerous challenges to be addressed.

Ablation Experiment
In this work, we propose a model-agnostic domain-lifelong learning method, KPN, which consists of two core components: multi-prototype enhanced retrospection and multi-strategy knowledge distillation. In this section, we show ablation studies of the two components.

Effectiveness of Multi-Prototype Enhanced Retrospection
We conduct experiments to verify the effectiveness of the proposed multi-prototype enhanced retrospection. The results are shown in Table 3. From the results, we can see that: (1) For "-MPR", we remove multi-prototype enhanced retrospection and randomly select samples. Our method KPN outperforms this variant by 1.51% and 3.6% in terms of the whole JGA. The results show that the multi-prototype enhanced retrospection is effective in selecting the most representative samples from diverse dialogues.
(2) In addition, we compare our method with previous data selection methods. For "+ iCaRL" (Rebuffi et al., 2017), the model computes only one prototype for each domain based on the hidden state of the [CLS] token and selects samples based on this prototype. For "+ K-Means" (Han et al., 2020), this model selects diverse samples by choosing the central samples of clusters in the [CLS] hidden vector space. KPN significantly outperforms "+ iCaRL" and "+ K-Means". "+ iCaRL" is even worse than the random selection "-MPR" because the [CLS] prototype is often not representative for any slot. By contrast, our method adopts multiple prototypes based on the slot representation, which effectively selects the most representative samples.

Effectiveness of Multi-Strategy Knowledge Distillation
To gain more insights into the multi-strategy knowledge distillation, we test many variants of KPN. The results are shown in Table 4. We can see that: (1) Removing any knowledge distillation strategy, encoder feature distillation (EFD), classifier prediction distillation (CPD), decoder feature distillation (DFD), or generation prediction distillation (GPD), brings performance degradation. If we remove all knowledge distillation strategies (MSKD), the performance further declines. It shows that these knowledge distillation strategies can retain performance on old domains by effectively preserving the knowledge of the original model from multiple perspectives.
(2) When we remove both multi-prototype enhanced retrospection and multi-strategy knowledge distillation (i.e., the model EMR), the performance drops significantly. It indicates that simultaneously exploiting the two components is very effective.

Discussion: Memory Capacity
As shown in Table 5, we test the models that reserve different numbers of samples. Both EMAR and our method KPN achieve performance improvements as the number of reserved samples increases. In each case, our method significantly outperforms EMAR. Our method using only 30 samples achieves comparable performance to EMAR using 50 samples. This demonstrates the effectiveness of our proposed method. The proposed multi-prototype enhanced retrospection effectively selects the most representative samples. The proposed multi-strategy knowledge distillation alleviates the impact of combinatorial explosion.

Dialogue State Tracking
Dialogue state tracking (DST) has recently been an active research area, where typical DST models can be mainly divided into two categories: discriminative DST methods and generative DST methods. Discriminative DST methods use predefined values as categories to simplify DST as a multi-class classification task (Bohus and Rudnicky, 2006; Henderson et al., 2014a). These methods mainly focus on modeling the relation between slots and dialogue history, such as NBT (Mrkšić et al., 2017), GLAD (Zhong et al., 2018), SST, and CHAN (Shan et al., 2020). Generative DST methods treat dialogue state tracking as a generation task (Rastogi et al., 2017; Xu and Hu, 2018; Wu et al., 2019). By generating values from the dialogue history and the vocabulary, generative DST methods handle unknown values that are not predefined in the ontology. Therefore, generative DST methods dominate this research, such as SpanPtr (Xu and Hu, 2018), COMER (Ren et al., 2019), BERT-DST (Chao and Lane, 2019), TRADE (Wu et al., 2019), SAS (Hu et al., 2020), and SOM-DST (Kim et al., 2020).
Despite the great progress in single-domain and multi-domain DST tasks, previous DST methods usually assume that the training data is fixed and contains predefined domains. They cannot learn newly emerging domains online, which makes them impractical in real-world applications. Our method handles the domain-lifelong learning problem, where data of new domains continually arrives, whether it is new single-domain or new multi-domain data.

Lifelong Learning
Lifelong learning, also called continual learning, is a long-standing research topic in machine learning, which enables models to perform online learning on new data (Cauwenberghs and Poggio, 2000; Kuzborskij et al., 2013). Architecture-based methods dynamically extend the model architecture to learn new data (Fernando et al., 2017; Shen et al., 2019). However, the model size grows rapidly with the increase of new data, which limits the application of architecture-based methods. Existing lifelong learning methods can be divided into two main categories: regularization-based methods (Zenke et al., 2017; Aljundi et al., 2018) and replay-based methods (Rebuffi et al., 2017; Hou et al., 2019). Regularization-based methods design reasonable metrics to identify the parameters important to old data and slow down the update of them (Kirkpatrick et al., 2017; Li and Hoiem, 2017). Replay-based methods retain the previous knowledge by storing a small amount of old data (Han et al., 2020). In addition, generative replay-based methods generate old data to alleviate catastrophic forgetting (Shin et al., 2017; Kemker and Kanan, 2018; Ostapenko et al., 2019; Zhai et al., 2019). Although lifelong learning has been widely investigated in NLP and CV scenarios (Kou et al., 2020; Kundu et al., 2020), its exploration in DST is relatively rare.
In other dialogue tasks, Lee (2017) fine-tunes a dialogue model trained on open-domain dialogues to learn task-oriented dialogues. However, their setting is only a one-step incremental process. Shen et al. (2019) continually train a slot-filling model on new data from the same domain. Madotto et al. (2020) introduce continual learning into multiple dialogue tasks. However, they ignore cross-domain dialogues that exist widely in the real world. In addition, they only adopt a plain architecture-based method, which does not address the main challenges of the dialogue tasks.
In contrast to previous work, we formally introduce domain-lifelong learning into DST, which is practical in real-world applications. In addition, we propose Knowledge Preservation Networks to handle the main challenges of the DLL-DST task.

Conclusion
In this paper, we introduce domain-lifelong learning into dialogue state tracking and propose Knowledge Preservation Networks to overcome catastrophic forgetting. To handle expression diversity, we propose multi-prototype enhanced retrospection to reserve the most representative samples. Moreover, to alleviate the adverse effects of combinatorial explosion, we propose multi-strategy knowledge distillation to learn the previous knowledge of the original model. Experimental results on the MultiWOZ and SGD benchmarks demonstrate the effectiveness of our model.