Prompt Conditioned VAE: Enhancing Generative Replay for Lifelong Learning in Task-Oriented Dialogue

Lifelong learning (LL) is vital for advanced task-oriented dialogue (ToD) systems. To address the catastrophic forgetting issue of LL, generative replay methods are widely employed to consolidate past knowledge with generated pseudo samples. However, most existing generative replay methods use only a single task-specific token to control their models. This scheme is usually not strong enough to constrain the generative model due to insufficient information involved. In this paper, we propose a novel method, prompt conditioned VAE for lifelong learning (PCLL), to enhance generative replay by incorporating tasks’ statistics. PCLL captures task-specific distributions with a conditional variational autoencoder, conditioned on natural language prompts to guide the pseudo-sample generation. Moreover, it leverages a distillation process to further consolidate past knowledge by alleviating the noise in pseudo samples. Experiments on natural language understanding tasks of ToD systems demonstrate that PCLL significantly outperforms competitive baselines in building lifelong learning models.


Introduction
Task-oriented dialogue (ToD) systems are of great importance in advanced AI applications (Zhang et al., 2020b; Dai et al., 2020, 2021; He et al., 2022a,b,c). However, most existing ToD systems are developed under the assumption that the data distribution remains unchanged (Zhu et al., 2022). Unless the entire system is retrained, this setup may not be realistic when a deployed ToD system needs to support new features and provide more services over time based on user demands. Without incurring the high cost of retraining, Lifelong Learning (LL) is able to acquire new knowledge continuously while preserving previously learned knowledge (Delange et al., 2021). Hence, it is crucial to equip natural language understanding (NLU) modules, the vital components of ToD systems, with the lifelong learning ability.
The main issue for lifelong learning is catastrophic forgetting (McClelland et al., 1995; Parisi et al., 2019), which refers to the phenomenon that a model forgets previously learned tasks when learning new tasks. Various approaches have been proposed to alleviate this issue (Schwarz et al., 2018; Aljundi et al., 2018; Rusu et al., 2016; Aljundi et al., 2017). Replay-based methods are among the most effective and widely used ones (Rebuffi et al., 2017; Shin et al., 2017; Dai et al., 2022). Their main idea is to retrain on samples or representations from already seen tasks when learning new tasks (Mundt et al., 2020). Some methods explicitly store previously seen real samples for replaying (experience replay) (Rebuffi et al., 2017; Chaudhry et al., 2019). However, this setting is infeasible when data from previous tasks is unavailable due to data security concerns. Other methods generate pseudo samples using a generative model (generative replay). This variant relieves the burden of storing previously seen data and has been widely adopted in previous studies (Delange et al., 2021; Shin et al., 2017; Kemker and Kanan, 2018).
The key to generative replay is to produce pseudo samples that approximate the real data distribution of previous tasks. Intuitively, higher-quality pseudo samples can better preserve learned tasks and lead to less forgetting in LL. However, the generation of pseudo samples for each seen task in previous studies (Sun et al., 2020; Chuang et al., 2020) is usually controlled by a single task-specific token. It has been observed that this scheme is usually insufficient to constrain the pre-trained language model (PLM) (Sun et al., 2020), due to the limited information involved. Consequently, the generated pseudo samples suffer from problems such as not being fluent or not corresponding well to the designated task. Moreover, those special tokens are only introduced in the fine-tuning stage of the PLM. This enlarges the gap between pre-training and fine-tuning of the PLM (Gu et al., 2022) and harms the quality of the generated pseudo samples. In addition, noisy generated pseudo samples may degrade the LL performance.
To address the above issues, we propose a novel method, Prompt Conditioned VAE for Lifelong Learning (PCLL), to enhance generative replay on NLU tasks of ToD systems. To impose strong control over the pseudo-sample generation, PCLL explicitly models latent task-specific distributions using a conditional variational autoencoder (CVAE) (Kingma and Welling, 2014; Zhao et al., 2017). Then it incorporates the corresponding task statistics to guide the generation of pseudo samples. To reduce the gap between pre-training and fine-tuning, we construct natural language prompts to unify different NLU tasks while being specific to each task. These prompts not only contain meaningful semantics compared to special tokens, but also serve as conditions to assist the CVAE in capturing task distributions. Moreover, PCLL employs a knowledge distillation scheme to alleviate the impact of noisy pseudo samples during the replay process. Leveraging the above strategies, PCLL can generate high-quality pseudo samples that better approximate the real distributions of previous tasks while tackling the aforementioned issues.
We validate our method on NLU tasks of ToD systems including both intent detection and slot filling. The results indicate that our approach generates high-quality pseudo samples and significantly outperforms competitive baselines. Our main contributions are as follows:
(1) We propose a novel method, PCLL, to enhance generative replay for building lifelong NLU modules of ToD systems.
(2) Conditioned on prompts, PCLL models latent task distributions with a CVAE to guide the pseudo-sample generation and leverages knowledge distillation to further avoid forgetting.
(3) Our extensive experiments and comprehensive analyses demonstrate the superior performance of PCLL and the high quality of its generated samples.

Lifelong Learning
There are generally three categories of LL methods.
Regularization-based Methods aim to strike a balance between protecting already learned tasks and granting sufficient flexibility for a new task (Mundt et al., 2020). Some methods (Schwarz et al., 2018; Aljundi et al., 2018; Zenke et al., 2017; Ebrahimi et al., 2019) impose constraints on the modification of important weights. Other methods introduce a distillation loss to constrain predicted features of the LL model (Li and Hoiem, 2017; Dhar et al., 2019; Rannen et al., 2017). However, these additional regularization terms may degrade the model performance (Parisi et al., 2019).
Architecture-based Methods dedicate model parameters to each task to prevent forgetting (Delange et al., 2021). Some studies (Fernando et al., 2017; Serrà et al., 2018; Hu et al., 2018) use static architectures and rely on task-specific information to route through the architecture (Mundt et al., 2020), while other studies (Rusu et al., 2016; Aljundi et al., 2017; Zhai et al., 2020; Madotto et al., 2021; Ke et al., 2021; Geng et al., 2021; Zhao et al., 2022b) dynamically grow the architecture in the LL training process. However, these methods either require capacity allocation for tasks at the beginning or are not feasible when model expansion is prohibited by limited resources (Sun et al., 2020).
Replay-based Methods aim to preserve previous knowledge by replaying data from learned tasks. One line of studies (Rebuffi et al., 2017; Chaudhry et al., 2019; Lopez-Paz and Ranzato, 2017; Mi et al., 2020; Han et al., 2020; Liu et al., 2021b) keeps a small number of real samples from old tasks for replaying. However, these methods are impractical when data from old tasks are unavailable. Another line of studies (Shin et al., 2017; Kemker and Kanan, 2018; Xiang et al., 2019) utilizes a generative model to reproduce pseudo samples or representations from old tasks.
In this paper, we focus on improving generative replay, as it does not require allocating extra parameters or model capacity and can be used with any LL model. Specifically, Sun et al. (2020) propose a general framework, LAMOL, for lifelong language learning that replays pseudo samples of previous tasks. Chuang et al. (2020) improve LAMOL by training an extra teacher model before learning each new task; however, this increases the burden of the LL process. Kanwatchara et al. (2021) freeze critical parameters in LAMOL based on rationales, but those rationales are not always available for NLP tasks. None of these previous works takes task statistics into consideration, whereas our PCLL method incorporates information about tasks' distributions to enhance generative replay.

Prompt-based Learning in NLP
Prompt-based learning has been found to be a more effective way to use PLMs than typical fine-tuning (Schick and Schütze, 2021). With prompts, we can convert various downstream tasks to a unified language modeling task (Brown et al., 2020; Schick and Schütze, 2021). Prompts can be either manually designed (Petroni et al., 2019; Yu et al., 2019) or generated automatically (Shin et al., 2020; Jiang et al., 2020; Gao et al., 2021). Some recent studies employ prompt tuning for continual learning in dialogue state tracking (Zhu et al., 2022) and few-shot learning (Qin and Joty, 2022).

Problem Definition
We aim to build an LL model that sequentially learns a stream of NLU tasks $\mathcal{T}^{T} = \{t\}_{t=1}^{T}$ in dialogue systems, where $T$ can potentially be infinite. For each task $t$, a set of samples $\mathcal{D}_t = \{(x_k, y_k)\}_{k=1}^{N_t}$ is drawn from its underlying data distribution. Here, $x_k$ denotes the input utterance, and $y_k$ denotes the output label of NLU. In intent detection tasks, $y_k$ is the intent label of $x_k$; in slot filling tasks, $y_k$ is the set of slot-value pairs contained in $x_k$. Our objective is to learn a model that performs well on all seen tasks and forgets as little as possible.

Overview
We start with a brief overview of our proposed PCLL method for generative replay (see Fig. 1). PCLL consists of two components: an LM-based task solver to solve NLU tasks (Fig. 3) and a CVAE-based generator (Fig. 2) to generate pseudo samples with the help of task-specific latent distributions. For the first task, PCLL is initialized with PLMs, with the other parameters randomly initialized. Before learning a new task $t$, we first use the PCLL model trained on previous tasks to generate pseudo samples for each of the learned tasks $\mathcal{T}^{t-1}$. Then we interleave these pseudo samples with the training data in $\mathcal{D}_t$ and continue to train PCLL. In this way, the model can learn the new task $t$ while consolidating the knowledge of past tasks.
In the following sections, we first illustrate how PCLL learns the current task (Sec. 3.3, 3.4). Then we describe the pseudo-sample generation process (Sec. 3.5), and finally, we introduce a knowledge distillation process to further improve the LL performance (Sec. 3.6).

LM-based Task Solver
Following recent studies (Sun et al., 2020;Chuang et al., 2020), PCLL unifies different NLU tasks into a language modeling (LM) task and implements a task solver based on a PLM.Different from previous studies that introduce randomly initialized special tokens in the fine-tuning stage (Sun et al., 2020), we construct task-specific natural language prompts for the solver.These prompts carry rich semantic information to alleviate the mismatch between fine-tuning and pre-training of PLM.
For each input-output pair $(x, y)$ from task $t$, our task solver is an LM that takes a prompt $g_t(x)$ as input and predicts $y$. Specifically, $g_t(x)$ is constructed as $g_t(x) = g_t^{pre} \oplus x \oplus g_t^{post}$, where $g_t^{pre}$ and $g_t^{post}$ are the prompt prefix and postfix designed for task $t$, respectively, and $\oplus$ denotes the concatenation of word tokens. For instance, if task $t$ is an intent detection task, we design $g_t(x)$ as: "For an utterance from the ID task, x has the following intent ", where "ID" represents the task name of $t$. After serializing the output $y$ into a token sequence, we obtain a natural language sentence $g_t(x, y)$ by simply concatenating $g_t(x)$ with $y$. We list detailed examples in Appendix B.1. Then the PLM $f_{\theta_t}$ for the current task $t$ is optimized on the concatenated sentence by maximizing the following objective (see Fig. 3):
$$\mathcal{L}_{LM} = \log p_\theta(g_t(x, y)) + \lambda \log p_\theta(y \mid g_t(x)), \tag{1}$$
in which the first term learns to decode the constructed sentence given the start token [BOS], and the second term learns to predict the output $y$ after reading the prompt $g_t(x)$. $\lambda$ is a scalar used to balance these two terms.
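As an illustration of the two-term objective described above, the following sketch computes it from per-token log-probabilities of one concatenated sentence. It assumes those log-probabilities have already been produced by a language model; the function name `lm_objective`, the argument `answer_start`, and the numeric values are hypothetical and are not taken from the PCLL implementation.

```python
from typing import Sequence


def lm_objective(token_logprobs: Sequence[float], answer_start: int, lam: float) -> float:
    """Two-term LM objective over one concatenated sentence (prompt + answer).

    token_logprobs[i] is log p(token_i | tokens before i), with [BOS] as the
    initial context. Tokens from index `answer_start` onward belong to y.
    """
    if not 0 <= answer_start <= len(token_logprobs):
        raise ValueError("answer_start must index into token_logprobs")
    # First term: log-likelihood of the whole sentence (prompt and answer).
    full_sentence = sum(token_logprobs)
    # Second term: log-likelihood of the answer tokens given the prompt.
    answer_only = sum(token_logprobs[answer_start:])
    return full_sentence + lam * answer_only


# Hypothetical log-probabilities: three prompt tokens followed by two answer tokens.
example_logprobs = [-1.0, -0.5, -0.5, -0.25, -0.25]
example_value = lm_objective(example_logprobs, answer_start=3, lam=0.5)
```

Because the answer tokens appear in both terms, a larger λ puts more weight on predicting y correctly relative to modeling the prompt and utterance.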

Prompt Conditioned VAE Generator
To construct high-quality pseudo samples, PCLL leverages a CVAE module to build a pseudo-sample generator so that it can incorporate tasks' statistics to guide the generation of pseudo samples. The CVAE module captures task-specific latent distributions by taking utterances as the input, conditioned on prefix prompts, and reconstructing the input during training. Specifically, given an input utterance $x$ in task $t$, we assume a random variable $z$ captures the latent distribution over $x$. We define a conditional distribution as $p(x, z \mid t) = p(x \mid z, t)\, p(z \mid t)$, where we approximate $p(z \mid t)$ and $p(x \mid z, t)$ using deep neural networks with parameters $\phi$ and $\theta$, respectively. We refer to $p_\phi(z \mid t)$ as the prior network and $p_\theta(x \mid z, t)$ as the decoder. To reconstruct $x$, a latent variable $z$ is first sampled from $p_\phi(z \mid t)$ and then $x$ is decoded through $p_\theta(x \mid z, t)$.
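In this study, we assume the prior of $z$ to be a multivariate Gaussian distribution with a diagonal covariance matrix, and introduce a recognition network $q_\psi(z \mid x, t)$ to approximate the intractable true posterior $p(z \mid x, t)$. The goal of the CVAE is to maximize the conditional log-likelihood $\log p(x \mid t) = \log \int p(x \mid z, t)\, p(z \mid t)\, dz$. Employing variational inference, we obtain the following evidence lower bound (ELBO) (Zhao et al., 2017) to maximize:
$$\mathcal{L}_{CVAE} = \underbrace{\mathbb{E}_{q_\psi(z \mid x, t)}\left[\log p_\theta(x \mid z, t)\right]}_{\mathcal{L}_{REC}} - \beta\, \underbrace{\mathrm{KL}\left(q_\psi(z \mid x, t) \,\|\, p_\phi(z \mid t)\right)}_{\mathcal{L}_{KL}}, \tag{2}$$
where $\beta$ is a scalar that balances the reconstruction term $\mathcal{L}_{REC}$ and the Kullback-Leibler (KL) divergence term $\mathcal{L}_{KL}$, and is adjusted by a cyclic annealing schedule (Fu et al., 2019) to alleviate the vanishing latent variable issue (Bowman et al., 2016).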
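To make the KL term and the annealed weight β concrete, here is a minimal sketch. It relies on the closed-form KL divergence between two diagonal Gaussians and on a linear-ramp cyclic schedule in the spirit of Fu et al. (2019). The parameter names `cycle_length` and `ramp_fraction` and their default values are assumptions for illustration; the source does not state the schedule settings used in PCLL.

```python
import math
from typing import Sequence


def diag_gaussian_kl(mu_q: Sequence[float], logvar_q: Sequence[float],
                     mu_p: Sequence[float], logvar_p: Sequence[float]) -> float:
    """KL(q || p) between two diagonal Gaussians given means and log-variances."""
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl


def cyclic_beta(step: int, cycle_length: int, ramp_fraction: float = 0.5) -> float:
    """Cyclic schedule: within each cycle beta ramps linearly from 0 to 1, then holds at 1."""
    position = (step % cycle_length) / cycle_length  # relative position in [0, 1)
    return min(1.0, position / ramp_fraction)


def cvae_objective(reconstruction_logprob: float, kl: float, beta: float) -> float:
    """Objective to maximize: reconstruction term minus beta times the KL term."""
    return reconstruction_logprob - beta * kl
```

Restarting β at zero in every cycle temporarily removes the KL penalty, which gives the decoder repeated opportunities to make use of the latent variable.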
CVAE Implementation. When implementing each network in Eq. 2, we use the prompt prefix $g_t^{pre}$ to represent the task $t$ because $g_t^{pre}$ involves the task name that uniquely identifies $t$. Fig. 2 shows the overall architecture of our PCLL model, in which we use a unidirectional transformer (Vaswani et al., 2017) to encode the concatenated sentence $g_t^{pre} \oplus x$ into hidden representations. Then an attention-average block (Fang et al., 2021) is introduced to pool the hidden representations of $g_t^{pre}$ and $g_t^{pre} \oplus x$ into single vectors, which are further fed into the prior network $p_\phi(z \mid t)$ and the recognition network $q_\psi(z \mid x, t)$, respectively. Next, the reparametrization trick (Kingma and Welling, 2014) is used to obtain latent variables $z$ from the prior and posterior distributions. Then $z$ is injected into the decoder $p_\theta(x \mid z, t)$ by adding it element-wise to each token embedding (word embedding plus position embedding) of the prompt (Fang et al., 2021; Li et al., 2020).
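The reparametrization and latent-injection steps can be sketched as follows. This is a simplified illustration in plain Python, not the paper's implementation: it assumes that z already has the same dimension as the token embeddings, whereas a real model would presumably need a learned projection when the two sizes differ. The function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def reparameterize(mu: Sequence[float], logvar: Sequence[float], rng: random.Random) -> List[float]:
    """Sample z = mu + sigma * eps with eps drawn from a standard normal."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def inject_latent(token_embeddings: Sequence[Sequence[float]], z: Sequence[float]) -> List[List[float]]:
    """Add the latent vector to every token embedding, element-wise."""
    for emb in token_embeddings:
        if len(emb) != len(z):
            raise ValueError("z must match the embedding dimension")
    return [[e + zi for e, zi in zip(emb, z)] for emb in token_embeddings]
```

Writing the sample as a deterministic function of the mean, the log-variance, and external noise is what allows gradients to reach the prior and recognition networks during training.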
In PCLL, the decoder p θ (x|z, t) shares the same parameters with the PLM-based task solver f θ .This allows us to inherit the advantage of PLM and leverage a unified model to solve each task and generate pseudo samples simultaneously.

Pseudo Sample Generation
Generating pseudo samples for learned tasks involves two steps: (1) PCLL generates a pseudo input utterance $x$ guided by a latent task distribution using the CVAE-based generator. Specifically, for each seen task $t'$ ($t' < t$), the model samples a latent variable $z_{t'}$ from the prior network $p_\phi(z_{t'} \mid t')$ with the constructed prompt prefix $g_{t'}^{pre}$ as the input. Then the decoder takes $z_{t'}$ and $g_{t'}^{pre}$, and decodes them into the pseudo input $x$ using top-k sampling (Holtzman et al., 2019). (2) PCLL generates the output $y$ associated with $x$ using the solver (i.e., following Fig. 3).
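The decoding step in (1) relies on top-k sampling. The sketch below shows one common way to draw a single token with top-k sampling from a vector of logits. It is a generic illustration with a hypothetical function name, and the value of k used in PCLL is not stated in this section.

```python
import math
import random
from typing import Sequence


def top_k_sample(logits: Sequence[float], k: int, rng: random.Random) -> int:
    """Sample a token index from the k highest-scoring logits."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the kept logits (shifted by the maximum for stability).
    peak = logits[top[0]]
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]
```

In a full generator this function would be called once per decoding step, with the logits produced by the decoder after it has been conditioned on the prompt prefix and the sampled latent variable.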

Knowledge Distillation
Previous generative replay approaches indistinguishably interleave pseudo data with the current task's training data.However, this naive approach hurts the model performance since these pseudo data may contain noise and may drift from the real data distribution.In this study, we utilize a knowledge distillation (KD) (Hinton et al., 2015) process to prevent our model from being affected by these noisy pseudo data.
When training on a new task $t$, we treat the model obtained on previous tasks $\mathcal{T}^{t-1}$ as a fixed teacher model $f_{\theta_{Tch}}$. For each input-output pair $(x, y)$ in the pseudo data, $f_{\theta_{Tch}}$ is distilled into the current model $f_\theta$ (i.e., the student model) by maximizing a token-level distillation objective, in which $g_t(x, y)_{<l}$ and $y_{<l}$ refer to the token sequences before the $l$-th token in $g_t(x, y)$ and $y$, respectively, and $\mathcal{V}$ represents the vocabulary set.
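One common form of token-level distillation, assumed here for illustration, sums the teacher-weighted student log-probabilities over all positions $l$ and all vocabulary entries $v \in \mathcal{V}$. The sketch below computes that quantity for a single sequence; the function name and the toy distributions in the usage note are hypothetical, and the exact weighting of terms used by PCLL is not recoverable from this section.

```python
import math
from typing import Sequence


def token_level_kd(teacher_probs: Sequence[Sequence[float]],
                   student_logprobs: Sequence[Sequence[float]]) -> float:
    """Sum over positions and vocabulary entries of p_teacher(v) * log p_student(v)."""
    total = 0.0
    for t_row, s_row in zip(teacher_probs, student_logprobs):
        if len(t_row) != len(s_row):
            raise ValueError("teacher and student must share a vocabulary")
        # Skip zero-probability entries so that 0 * (-inf) never appears.
        total += sum(p * lp for p, lp in zip(t_row, s_row) if p > 0.0)
    return total
```

When the teacher distribution at a position is one-hot, the term for that position reduces to the ordinary log-likelihood of the teacher's chosen token, so this objective generalizes training on hard pseudo labels.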
Similarly, when training the CVAE module, we replace the reconstruction term $\mathcal{L}_{REC}$ in Eq. 2 with a distillation objective $\mathcal{L}^{KD}_{REC}$, and thus maximize the following objective over the pseudo data: $\mathcal{L}^{KD}_{CVAE} = \mathcal{L}^{KD}_{REC} - \beta \mathcal{L}_{KL}$. With this KD strategy, the distributions produced by the teacher model contain richer knowledge than one-hot labels (Hinton et al., 2015). These distributions constrain the student model (i.e., $f_\theta$) by preventing its weights from drifting too far when learning new tasks, thereby mitigating forgetting in lifelong learning.
Fig. 1 illustrates the training process of PCLL. Specifically, when learning a new task $t$, we optimize PCLL on training samples of $t$ with the following objective: $\mathcal{L}_{LM} + \mathcal{L}_{CVAE}$. For pseudo samples of previous tasks $t'$ ($t' < t$), we optimize the loss where $\alpha \in [0, 1]$ is a scalar used to adjust the knowledge distillation terms.

Datasets
We evaluate the PCLL method on intent detection and slot filling based on public NLU benchmarks. For intent detection, we collect six datasets that carry intent annotations: HWU (Liu et al., 2019), BANKING (Casanueva et al., 2020), CLINC (Larson et al., 2019), SNIPS (Coucke et al., 2018), ATIS (Hemphill et al., 1990), and TOP (Gupta et al., 2018). The dataset TOP is divided into three disjoint subsets TOP-S1, TOP-S2, and TOP-S3, and these three subsets along with the other five datasets are regarded as separate LL tasks to increase the total number of tasks for sequential training. In total, we have eight tasks to be learned sequentially for this intent detection experiment.
For slot filling, we adopt five datasets that provide slot labels: SNIPS, ATIS, DSTC (Rastogi et al., 2020), MIT-MOVIE, and MIT-RESTAURANT. Each dataset above is regarded as a separate LL task, and thus five tasks are learned in lifelong slot filling experiments. More descriptions of the datasets are in Appendix A.

Implementation Details
We use the pretrained 12-layer GPT2 model (Radford et al., 2019) to initialize the encoder and decoder of our CVAE model. The prior network and the recognition network are both 2-layer MLPs with a hidden size of 128. When learning a new task $t$, PCLL balances the training data of $t$ and pseudo samples by generating $\gamma N_t$ pseudo samples for previously learned tasks, where $\gamma$ is the sampling ratio. We set $\gamma = 0.2$ in our experiments following Sun et al. (2020). Each task for intent detection and slot filling is trained for 5 and 10 epochs, respectively. We train PCLL on six random permutations of the task order. See Appendix B.2 and B.3 for more details.
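The role of the sampling ratio γ in building the training set for a new task can be sketched as follows. The even split of the pseudo-sample budget across previous tasks is an assumption made for this example, since the text only states the total of γN_t pseudo samples. The `generate` callable stands in for the CVAE-based generator plus solver, and all names here are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[str, str]  # (input utterance, serialized output)


def pseudo_sample_budget(num_new_samples: int, gamma: float,
                         previous_tasks: Sequence[str]) -> Dict[str, int]:
    """Split a total budget of gamma * N_t pseudo samples evenly over previous tasks."""
    if not previous_tasks:
        return {}
    total = round(gamma * num_new_samples)
    base, extra = divmod(total, len(previous_tasks))
    # The first `extra` tasks receive one additional sample so the counts sum to `total`.
    return {task: base + (1 if i < extra else 0) for i, task in enumerate(previous_tasks)}


def build_replay_set(new_data: Sequence[Sample], gamma: float, previous_tasks: Sequence[str],
                     generate: Callable[[str, int], List[Sample]],
                     rng: random.Random) -> List[Sample]:
    """Interleave new-task data with generated pseudo samples of earlier tasks."""
    mixed = list(new_data)
    for task, count in pseudo_sample_budget(len(new_data), gamma, previous_tasks).items():
        mixed.extend(generate(task, count))
    rng.shuffle(mixed)
    return mixed
```

With γ = 0.2 and 1,000 new-task samples, this sketch requests 200 pseudo samples in total, so the replayed data remains a small fraction of each training set.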

Baselines
We compare PCLL with the following baselines: Fine-tune directly fine-tunes the model on the task stream without preventing catastrophic forgetting; EWC (Schwarz et al., 2018) and MAS (Aljundi et al., 2018) are two regularization methods that mitigate forgetting by penalizing changes of important parameters for learned tasks; LAMOL-g and LAMOL-t (Sun et al., 2020) are two variants of the generative replay method LAMOL that control the generation of pseudo samples using either a global special token (LAMOL-g) or task-specific special tokens (LAMOL-t); L2KD (Chuang et al., 2020) improves LAMOL by assigning an extra teacher for each new task to perform knowledge distillation; ER (Rolnick et al., 2019) preserves previously seen real samples for replay to prevent forgetting. We also consider some architecture-based baselines: HAT (Serrà et al., 2018) creates a task-based hard attention during training; CTR (Ke et al., 2021) inserts continual learning plug-ins into BERT to mitigate forgetting and encourage knowledge transfer; Adapter (Madotto et al., 2021) builds a residual adapter for each task independently. Since the works of Liu et al. (2021b) and Qin and Joty (2022) are specially designed for dialogue state tracking and few-shot learning, respectively, we do not consider them as baselines.
Besides the above baselines, we further evaluate the model performance when all tasks are trained simultaneously in a multitask learning setting (Multi), which is often seen as an upper bound of LL.For fair comparisons, all baselines are implemented following either the settings of Sun et al. (2020), or their own reported settings.For ER, we store 1% of previously seen samples in memory following the setting of Madotto et al. (2021).

Evaluation Metrics
We use the accuracy score and the macro-averaged F1 score (Coope et al., 2020) to evaluate the performance of intent detection and slot filling tasks, respectively. Moreover, we assume access to a test set for each of the $T$ tasks learned in the LL process, and define $R_{i,j}$ as the test score of task $j$ after finishing learning task $i$. Following previous studies (Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2018a), we use the following two metrics to evaluate the performance of LL: (1) Average Score (Score) is defined as the average test score of all $T$ tasks after the LL process: $\mathrm{Score} = \frac{1}{T} \sum_{j=1}^{T} R_{T,j}$. (2) Learning Curve Area (LCA) is the area under the $Z_b$ curve, which captures the model's performance on all $T$ tasks (Chaudhry et al., 2018b). Specifically, $Z_b$ is the average score for all seen tasks at training step $b$. A high Score and a high LCA are preferred for a good LL model.
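Both metrics can be computed from recorded test scores, as the following sketch shows. The `results` matrix follows the definition of $R_{i,j}$ with 0-based indices. Normalizing LCA as the plain mean of $Z_b$ over the recorded steps is an assumption of this example, since the exact normalization is not given in this section.

```python
from typing import Sequence


def average_score(results: Sequence[Sequence[float]]) -> float:
    """Mean test score over all tasks after the final task has been learned.

    results[i][j] is the test score on task j after learning task i (0-indexed).
    """
    final_row = results[-1]
    return sum(final_row) / len(final_row)


def learning_curve_area(step_scores: Sequence[float]) -> float:
    """Area under the Z_b curve, normalized here as the mean of Z_b over recorded steps."""
    return sum(step_scores) / len(step_scores)
```

For example, with two tasks and `results = [[0.9, 0.0], [0.7, 0.8]]`, only the last row enters Score, which reflects that the drop on the first task from 0.9 to 0.7 is forgetting that the metric penalizes.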

Main Results
Table 1 shows the performances of our model PCLL and all the baselines.Our method PCLL significantly outperforms all baselines by a large margin on both intent detection and slot filling tasks.To better understand the LL process, we also plot the curve of the average score for all the models when trained using the same task order (see Fig. 4).From those results, we can observe that: (1) Regularization-based methods (EWC and MAS) suffer from serious catastrophic forgetting, consistent with the observation of Madotto et al. (2021).
(2) Generative replay methods LAMOL-g, LAMOL-t, and L2KD alleviate the forgetting issue to some extent. However, replaying real samples (i.e., ER) performs much better. This indicates that the quality of samples used for replaying is critical to addressing catastrophic forgetting, which matches our motivation to improve generative replay by generating high-quality pseudo samples. Our method PCLL achieves higher performance than ER, indicating that PCLL can generate high-quality pseudo samples under the guidance of task distributions. Our analyses in Sec. 5.3 further support this claim.
(3) Architecture-based methods HAT, CTR, and Adapter achieve good performance. However, PCLL still outperforms these baselines. This further validates the effectiveness of PCLL. Note that replay-based methods such as PCLL can be used together with these architecture-based methods to further improve the LL performance.
(4) From Fig. 4, we can see that when switching to new tasks, PCLL retains more knowledge about previous tasks (less performance degradation) compared to the baselines. This suggests that PCLL has a better ability to consolidate knowledge and mitigate catastrophic forgetting for LL.

Ablation Studies
We conduct ablation studies to verify the effectiveness of each proposed component in PCLL. (1) w/o Latent means no latent distribution is modeled for each task, i.e., the CVAE model in Section 3.4 is removed, and pseudo samples are generated by directly feeding the prompt prefix into the LM $f_\theta$ without incorporating task-specific statistics. (2) w/o Task ID means no task indicators are involved in the prompts. In other words, we design a task-independent prompt prefix by replacing the task ID with a general description "current task" (see Appendix B.1 for more details). In this way, the CVAE model degenerates to a VAE model that captures a global latent space for all tasks. (3) w/o KD means that the knowledge distillation process in Section 3.6 is not applied.
From Table 2, we can see that: (1) Capturing task-specific latent distributions and incorporating them in the pseudo-sample generation process is crucial for building better LL models (w/o Latent).
(2) Using task-specific prompts helps to generate high-quality pseudo samples, thereby improving the LL performance (w/o Task ID).(3) The proposed knowledge distillation process does mitigate the effects of noisy pseudo-samples and is beneficial for consolidating previously learned knowledge to prevent forgetting (w/o KD).

Soft Prompts vs. Manual Prompts
We conduct analyses on soft prompts by replacing manually designed prompts with soft tokens in PCLL.Specifically, the prompt prefix g pre t and postfix g post t in Eq. 1 are replaced by several randomly initialized task-specific soft (learnable) tokens (Liu et al., 2021a).We also vary the lengths of these soft prompts to analyze their behaviors.
Results in Table 3 show that: (1) Longer prefix prompts (i.e., more parameters guiding the pseudo-sample generation) generally lead to better LL performance; (2) Longer postfix prompts do not always lead to better LL performance. This may be because the postfix prompts are less important than prefix prompts since they do not participate in the pseudo-sample generation. Longer postfix prompts may bring in more noise, degrading the performance; (3) Using manual prompts in PCLL outperforms all its soft-prompt variants even though some soft prompts are much longer than manual prompts. This supports our claim that manual prompts carrying rich semantic information help to alleviate the mismatch between fine-tuning and pre-training of the PLM and to capture tasks' distributions, and thus mitigate catastrophic forgetting in lifelong learning.

Manual Prompts
Different Designs. We validate different designs of manual prompts in PCLL. Specifically, we implement five different prompt templates with different lengths (Appendix B.4). We observe that different manual prompts yield almost the same performance (see Table 8 in the Appendix). This indicates that our method is robust to the design of manual prompts.
Visualization of Attentions.We provide the visualization of the attention scores over several manual prompts employed by PCLL.High attention scores of task names in Fig. 6 indicate that the task indicators play an important role in our manually designed prompts (see Appendix B.5).

Qualities of Pseudo Samples
We validate the quality of pseudo samples generated by PCLL and all our generative replay baselines on intent detection tasks. We use the distinct score Dist-n (Li et al., 2016) to measure the proportion of unique n-grams in the generated pseudo samples' inputs (n = 1, 2, 3, 4). A higher Dist-n indicates more diverse generated pseudo samples, which is usually preferred because diverse samples help to approximate task distributions. As shown in Table 4, PCLL generates more diverse pseudo samples than the other generative replay methods, which suggests that pseudo samples constructed by our method are closer to real samples. Further, we examine whether the generated pseudo samples can restore the distribution of real samples by visualizing the samples' feature space with t-SNE (Van der Maaten and Hinton, 2008). As shown in Fig. 7, pseudo samples generated by PCLL are clustered in a pattern similar to that of real samples, while those of LAMOL-t are scattered in the feature space. This shows that the pseudo samples generated by PCLL follow a distribution closer to that of the real samples than those of our baselines (see Appendix B.6 for more details).
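For reference, Dist-n can be computed as the number of unique n-grams divided by the total number of n-grams. The sketch below uses whitespace tokenization and pools n-grams over all generated inputs; both choices are assumptions of this example and may differ from the evaluation script used for Table 4.

```python
from typing import Iterable


def dist_n(texts: Iterable[str], n: int) -> float:
    """Ratio of unique n-grams to total n-grams, using whitespace tokenization."""
    unique = set()
    total = 0
    for text in texts:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```

A generator that keeps repeating the same phrases yields many duplicate n-grams and therefore a low Dist-n, while more varied outputs push the ratio toward 1.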

Analyses of Latent Variables
To further analyze the behavior of the pseudo sample generator, we visualize the latent space captured by the recognition network on slot filling tasks.Specifically, for each sample in the test dataset, we extract a latent variable z from its posterior distribution and use the t-SNE algorithm (Van der Maaten and Hinton, 2008) to visualize these variables in 2D space.It can be seen from Figure 5 that the latent spaces of different tasks are well clustered and clearly separated.This indicates that the latent variable z is able to capture task-specific knowledge among learned tasks.
We also analyze the influence of the dimension of the latent variable $z$. The results are listed in Table 5. PCLL reaches its best performance when the dimension of $z$ is set to 128. This behavior is reasonable: when the dimension of $z$ is small, it may not capture enough information to model the task distribution; when the dimension is large, it may contain noisy information, leading to poorer performance.

Influence of Sampling Ratio γ
We analyze the influence of the sampling ratio $\gamma$ (ranging from 0.01 to 1.0) on the performance of PCLL. The results in Table 11 indicate that PCLL is more effective in improving the LL performance when considering a small number of pseudo samples (see Appendix B.7 for more details).

Case Study
We present several pseudo samples generated by PCLL and LAMOL in Table 6 on the BANKING task for intent detection (see more cases in Appendix C).We can observe that: (1) Compared to LAMOL, pseudo samples produced by PCLL are closer to real samples from the BANKING dataset; (2) Some samples generated by LAMOL are inconsistent with the task: LAMOL generates samples for the weather domain, which is not related to the BANKING task; (3) LAMOL may also generate unmatched inputs and outputs in pseudo samples (last line in Table 6).These observations verify our claim that a single task-specific token is too weak to constrain the PLM, and our method PCLL helps to generate high-quality pseudo samples that are consistent with each task.

Analyses of Forgetting for PCLL
We provide more fine-grained analyses for the forgetting issue based on findings when learning with our proposed method PCLL.In Appendix D, we carry out the analyses from the following four aspects: (1) unbalanced classes in some tasks, (2) conflicted label spaces for different tasks, (3) noisy pseudo labels for generated samples and (4) the diversity of pseudo samples created by PCLL.

Conclusion
In this paper, we propose PCLL to enhance generative replay for addressing catastrophic forgetting of lifelong learning in building NLU modules of ToD systems.To construct high-quality pseudo samples, PCLL captures task-specific distributions with a prompt conditioned VAE to guide the generation of pseudo samples.Empirical results on two NLU tasks and extensive analyses demonstrate the superior performance of PCLL and the high quality of its generated pseudo samples.Currently, we do not consider lifelong learning in the low-resource setting where only limited labeled data are available.In the future, we will extend our framework to lifelong few-shot learning.

Limitations
Here are some limitations of our work:
• We have not investigated lifelong learning in the low-resource setting where only limited labeled data are available. In future work, we will consider combining PCLL with meta-learning (Zhao et al., 2022a) to extend our framework to a lifelong few-shot learning setting. We will also extend previous studies by using unlabeled data (Zhang et al., 2020a; Zhao et al., 2022b) to build lifelong learning dialogue models.
• We have not considered architecture-based methods for lifelong learning. However, our method PCLL can be readily combined with the architecture-based approach by incorporating parameter-efficient modules (e.g., Adapter (Houlsby et al., 2019; Zhang et al., 2021), LoRA (Hu et al., 2021)) into the model architecture to further mitigate the catastrophic forgetting issue. We will explore this direction in the future.

A Details of Datasets
We list the statistics of datasets for the intent detection and slot filling in Table 7 and give detailed descriptions as follows.
• ATIS consists of audio recordings and corresponding manual transcripts about humans asking for flight information on automated airline travel inquiry systems. The data consists of 17 unique intent categories.

B.1 Prompt Examples of NLU Tasks
We provide detailed examples of the model's inputs and outputs with the designed prompts in PCLL. For intent detection, when we train on the "BANKING" task, an input utterance x of the language model (LM) is modified as "For an utterance from the BANKING task, "I already have one of your cards, how do I link them?" has the following intent ", and the output y of the LM is its corresponding intent annotation: "Card linking". For the ablation study of w/o Task ID, the prompt of the above sample becomes "For an utterance from the current task, "I already have one of your cards, how do I link them?"". For slot filling, when we train on the "MIT-RESTAURANT" task, an input utterance x such as "Does the Casanova restaurant at Kendall Square offer a fixed price menu?" is modified as "In the MIT-RESTAURANT task, if there are any slots and values, what are they in this sentence: "Does the Casanova restaurant at Kendall Square offer a fixed price menu?"? Answer: ", and the output y, which lists the contained slot-value pairs, is formatted as "Restaurant name: Casanova; Location: Kendall Square.". Here, different slot-value pairs are formatted as "slot: value" and separated with ";". If the input x does not contain any slot-value pairs, we use the sentence "No slot in this sentence." as the output y.
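The prompt templates described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`intent_prompt`, `slot_prompt`, `slot_target`) are hypothetical and not taken from the PCLL implementation, and keeping the trailing intent phrase in the w/o Task ID variant is an assumption, since the text quotes that prompt only up to the utterance.

```python
def intent_prompt(task_id, utterance, include_task_id=True):
    """Wrap an utterance in the intent-detection template quoted in the text.

    With include_task_id=False, the task name is replaced by "current",
    mirroring the w/o Task ID ablation (trailing phrase kept by assumption).
    """
    task = task_id if include_task_id else "current"
    return f'For an utterance from the {task} task, "{utterance}" has the following intent '


def slot_prompt(task_id, utterance):
    """Wrap an utterance in the slot-filling template quoted in the text."""
    return (
        f"In the {task_id} task, if there are any slots and values, "
        f'what are they in this sentence: "{utterance}"? Answer: '
    )


def slot_target(pairs):
    """Format (slot, value) pairs as "slot: value" joined by ";".

    An empty list yields the fixed sentence used when no slot is present.
    """
    if not pairs:
        return "No slot in this sentence."
    return "; ".join(f"{slot}: {value}" for slot, value in pairs) + "."


if __name__ == "__main__":
    print(intent_prompt("BANKING", "I already have one of your cards, how do I link them?"))
    print(slot_prompt("MIT-RESTAURANT", "Does the Casanova restaurant at Kendall Square offer a fixed price menu?"))
    print(slot_target([("Restaurant name", "Casanova"), ("Location", "Kendall Square")]))
```

The target string is built separately from the prompt so that the same helper can serve both training (prompt plus target) and inference (prompt only).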

B.2 Different Task Orders
We list the six random permutations of tasks that we use to implement all competing methods in Table 10.

B.3 Model Implementation Details
We use a pre-trained GPT2 model (Radford et al., 2019) as the initialization for the encoder and decoder of CVAE in PCLL. We set the maximum context length as 256. Our model contains a total of 240M parameters. We train all competing methods on 1 Tesla-V100 GPU and it takes around 6 to 10 hours to train all the tasks. Moreover, the training and testing batch sizes are set to 64. The maximum learning rate is 5e-5, and the Adam optimizer is used with parameters β1 = 0.9, β2 = 0.98, and ϵ = 1e-8.

B.4 Analysis of Manual Prompts Designs
We list five different manual templates as the designed prompts of intent detection in Table 9, where Prompt1 is the one we use in Table 1. Let ID refer to the task name, x to the input utterance, and y to the intent of x.

B.5 Analysis of Prompt Attention
We provide the visualization of the attention scores over several samples employed with our designed prompts for intent detection tasks. Specifically, the attention score on each prompt token is calculated using the averaged attention it receives when generating the output prediction. From Fig. 6, we can notice that the task names do contain meaningful information to be attended to when generating predictions.
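The averaging step described above can be illustrated with a small sketch. It assumes the attention weights are already available as a matrix `attn`, where `attn[i][j]` is the weight from the i-th generated output token to the j-th input position; how heads and layers are aggregated into that matrix is not specified in the text and is left as an assumption. The function name is hypothetical.

```python
def prompt_attention_scores(attn, prompt_len):
    """Average the attention each prompt token receives over all output steps.

    attn: list of rows, one per generated output token; each row holds the
          attention weights over input positions (assumed pre-aggregated
          over heads and layers).
    prompt_len: number of leading input positions that belong to the prompt.
    """
    if not attn:
        return [0.0] * prompt_len
    n_out = len(attn)
    return [sum(row[j] for row in attn) / n_out for j in range(prompt_len)]


if __name__ == "__main__":
    # Two output tokens attending over three prompt tokens (toy numbers).
    toy_attn = [
        [0.1, 0.6, 0.3],
        [0.3, 0.2, 0.5],
    ]
    print(prompt_attention_scores(toy_attn, prompt_len=3))
```

Each returned value is the mean weight placed on one prompt token, which is the per-token quantity that a heat map such as Fig. 6 would display.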

B.6 Analysis of Pseudo-sample Quality
We analyze the quality of generated pseudo samples with PCLL and other generative replay-based baselines. Specifically, we first fine-tune a pretrained BERT (Devlin et al., 2019) model using the observed real samples to construct a task classifier. This classifier can determine the task identity of a given sample, and it reaches an accuracy of 98.67% on a hold-out test set. The fine-tuned BERT is used to extract the representation vector of each sample, and the t-SNE algorithm (Van der Maaten and Hinton, 2008) is used to map these vectors into two dimensions. For a specific task order in LL, we gather pseudo samples generated when learning the last task and visualize the feature space of these samples. Note that the last task, ATIS, is not shown in Fig. 7 since there is no need to replay the last task.

B.7 Analysis of Sampling Ratio
Table 11 shows the results on intent detection tasks.
It can be seen that generating more pseudo samples helps to improve the LL performance. Besides, the performance gain slows down as the sampling ratio γ exceeds 0.2. Specifically, generating 5 times more pseudo samples, from γ = 0.01 to γ = 0.05, yields a 10.48 absolute improvement on the Score metric, while increasing γ from 0.2 to 1.0 only yields a 1.63 absolute improvement.
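To make the role of γ concrete, the following sketch computes a replay budget under two assumptions that are not stated in this section: that γ is the ratio of pseudo samples to the new task's training set size, and that the budget is split evenly across previously learned tasks (a common convention in LAMOL-style replay). The function name and the dataset size in the usage example are hypothetical.

```python
def pseudo_sample_budget(new_task_size, gamma, num_previous_tasks):
    """Return how many pseudo samples to generate for each previous task.

    Assumes a total budget of gamma * new_task_size, split as evenly as
    possible across previous tasks (both are assumptions, see lead-in).
    """
    if num_previous_tasks == 0:
        return []
    total = round(gamma * new_task_size)
    base, extra = divmod(total, num_previous_tasks)
    # The first `extra` tasks receive one additional sample each.
    return [base + (1 if i < extra else 0) for i in range(num_previous_tasks)]


if __name__ == "__main__":
    # Hypothetical new task with 10,000 training samples and 3 earlier tasks.
    for gamma in (0.01, 0.05, 0.2, 1.0):
        print(gamma, pseudo_sample_budget(10_000, gamma, 3))
```

Under these assumptions, moving from γ = 0.01 to γ = 0.05 multiplies the total budget by five, which matches the "5 times more pseudo samples" comparison in the text.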

C Case Study
We present more generated pseudo samples from PCLL and LAMOL along with real samples in Table 12. For intent detection, we list real and pseudo samples from the HWU task; for slot filling, we list samples from the MIT-RESTAURANT and DSTC tasks.

D Analyses of Forgetting
We provide more fine-grained analyses for the forgetting issue.
• Classes with fewer samples are more easily forgotten. Some tasks (e.g., ATIS, TOP, MIT-MOVIE) have unbalanced classes. Minor classes that occupy only a small portion of the training samples are less likely to appear in the pseudo samples used for replay. For example, the intent "meal" accounts for only 0.13% of the training samples for ATIS, and there are barely any pseudo samples generated for this intent when replaying. Without these pseudo samples, the model is more likely to forget these minor classes.
• Different tasks may have partially overlapping data distributions and conflicted label spaces, i.e., some tasks may assign different labels to the same set of utterances. For example, in the CLINC dataset, the utterance "transfer funds to the other account" is assigned the label "transfer"; however, in the BANKING dataset, the same utterance is assigned the label "transfer into account". These conflicted label spaces may confuse the model, resulting in incorrectly labeled pseudo samples.
• Noisy pseudo labels created by generative replay may lead to error accumulation, which will degrade the performance on previously learned tasks.
• The diversity of generated pseudo samples for previous tasks tends to decrease as more rounds of replay are performed, and these less diversified pseudo samples lead to more forgetting. Specifically, we conduct analyses on lifelong intent detection tasks with the following task order (CLINC, SNIPS, TOP_S3, BANKING, TOP_S2, HWU, TOP_S1, ATIS). We compare the diversity of pseudo samples for the first task (i.e., CLINC) generated at different replay moments: (1) after learning the first task, (2) after learning three subsequent tasks, and (3) after learning eight subsequent tasks (i.e., after the last task's learning).
In Table 13, we use the distinct scores (Li et al., 2016) to measure the diversity of pseudo samples. We can notice that as we learn more tasks, the diversity of pseudo samples for the first learned task decreases. Therefore, replaying less diverse pseudo samples leads to performance degradation on previous tasks (i.e., forgetting of previous tasks).
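For reference, the distinct-n score (Li et al., 2016) is the number of unique n-grams divided by the total number of n-grams in a set of generated texts. The sketch below is a minimal version that assumes whitespace tokenization and pools n-grams over the whole sample set; the tokenization and pooling actually used for Table 13 are not stated here, so both are assumptions.

```python
def distinct_n(samples, n):
    """Compute distinct-n: unique n-grams / total n-grams over all samples.

    samples: list of generated strings (whitespace tokenization is assumed).
    n: n-gram order, e.g. 1 for distinct-1 or 2 for distinct-2.
    """
    unique = set()
    total = 0
    for text in samples:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    repetitive = ["track my card", "track my card"]
    diverse = ["track my card", "find an event"]
    print(distinct_n(repetitive, 1), distinct_n(diverse, 1))
```

A lower score means more repeated n-grams, so a drop in distinct-n across replay rounds corresponds to the loss of diversity discussed above.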

Figure 1: The training process of our model PCLL.

Figure 2: The architecture of the prompt conditioned VAE generator in PCLL. It captures the task distribution conditioned on prompts and incorporates the latent variable z (or z′) into tokens' embeddings to guide the decoding.

Figure 3: The LM-based solver for NLU tasks. The input-output pair (x, y) is converted into a natural language prompt with g_t^pre and g_t^post.

Figure 4: Learning curves of different methods on intent detection tasks. The dotted lines indicate task switching.

Figure 6: Visualization of attention scores for the natural language prompts of PCLL.

Table 1: Experiment results on both intent detection and slot filling tasks. Each result is an average of six random task orders. The best results among LL models are bold. Our model PCLL is significantly better than other LL baselines with p-value < 0.05 under t-test.

Table 2: Ablation studies on two NLU tasks. Each result is an average of 6 random task orders.

Table 3: Applying soft prompts on lifelong intent detection tasks. #prefix and #postfix indicate the lengths of prefix and postfix prompts, respectively. Each result is an average of 6 random task orders.

Table 4: Distinct scores for generated pseudo samples.

Table 5: Analysis of different dimensions of the latent variable z of PCLL on lifelong intent detection tasks. Each result is an average of six random task orders.

Table 6: Real samples and generated pseudo samples for the BANKING task.

Table 7: Statistics of datasets for intent detection and slot filling.

The number of cycles for the cyclic annealing schedule is set to 4 in each epoch. When generating pseudo samples, the maximum decoded sequence length is set to 96. For baseline implementations, we use BERT to implement HAT and CTR, and choose GPT-2 as the backbone model for other baselines (LAMOL, L2KD, ER, Adapter, EWC, MAS, Finetune).

Table 8: Applying different manual prompts on lifelong intent detection tasks. Each result is an average of 6 random task orders.

Table 11: The LL performance under various sampling ratios γ. Each result is an average of 6 random task orders.

Table 9: Different manual prompts designed for the intent detection module of a ToD system.
Prompt1: For an utterance from the ID task, x has the following intent y
Prompt2: In the ID task, what intent best describes: x? Answer: y
Prompt3: Task ID utterance x intent y
Prompt4: In the task ID, this utterance x has the intent of y
Prompt5: If we consider the intent detection task, for a sample in the ID task, what's the intent of the utterance x? The intent is: y
Utterance: Can you track my card for me? Intent: Card arrival
Utterance: Find me something to do in Boston this weekend Intent: Get event