Variational Autoencoder with Disentanglement Priors for Low-Resource Task-Specific Natural Language Generation

In this paper, we propose a variational autoencoder with disentanglement priors, VAE-DPRIOR, for task-specific natural language generation with no or only a handful of task-specific labeled examples. To tackle compositional generalization across tasks, our model performs disentangled representation learning by introducing a conditional prior for the latent content space and another conditional prior for the latent label space. Both types of priors satisfy a novel property called ϵ-disentangled. We show both empirically and theoretically that the novel priors can disentangle representations even without the specific regularizations used in prior work. The content prior enables directly sampling diverse content representations from the content space learned from seen tasks and fusing them with the representations of novel tasks to generate semantically diverse texts in low-resource settings. Our extensive experiments demonstrate the superior performance of our model over competitive baselines in terms of i) data augmentation in continual zero/few-shot learning, and ii) text style transfer in the few-shot setting.


Introduction
Task-specific Natural Language Generation (NLG) aims to generate texts that satisfy desired attributes of target tasks, such as text style transfer (Jin et al., 2020) and task-specific data augmentation (Lee et al., 2021). Herein, a task includes a set of task-specific labels and, optionally, a set of labeled texts for that task (Han et al., 2020). Although there is already a large amount of labeled data for various tasks, in many application scenarios, such as AI assistants for legal aid, the labeled data of new tasks are still difficult to acquire. As a result, there may be no or just a handful of labeled texts for target tasks. In such a low-resource setting, given a new task, it is desirable to i) identify which information in texts is task-specific and which is task-independent, and ii) systematically and consistently combine the label representations of the new task with task-independent content representations for text generation. As illustrated in Fig. 1, data augmentation needs to combine content representations from seen tasks with novel task labels. In contrast, text style transfer requires combining the content representations extracted from inputs with target styles.
Most prior work assumes access to labeled data for supervised training. However, models trained on seen tasks cannot generalize well to new tasks during inference (Krishna et al., 2022). One of the key reasons is that the parameters of supervised models are tied to seen tasks, so a significant amount of fine-tuning data is needed to adapt to new tasks. Prompt-based and guided decoding methods (Zhang et al., 2022) require significantly less training data, but it is still challenging for them to generate a large number of semantically diverse and coherent texts for new tasks in a robust way, because they cannot leverage the rich contents of the seen tasks.
The key challenge of low-resource task-specific NLG is to disentangle content representations from label representations with few labeled data of target tasks. If content representations still contain task-specific information from seen tasks, they may mislead the language generator after being fused with the representations of new tasks. Prior works tackle this problem by enforcing the random variables of content representations to be independent of those of label representations (Cheng et al., 2020). However, in practice, the two types of random variables are not always independent. For example, the random variables of emotion labels naturally depend on the contents of the events causing them.
In this work, we propose a deep VAE model with novel disentanglement priors, coined VAE-DPRIOR, for task-specific natural language generation in the zero-shot and few-shot settings. In contrast to the widely used unconditional priors in the VAE framework, the new priors are conditional and satisfy a novel property called ϵ-disentangled, which motivates a new way of regularizing for disentangled representations without forcing independence between the corresponding random variables. The new priors build a constrained space for latent content representations and latent label representations with the aims of i) minimizing information overlap between the two types of representations and ii) enabling generalization across tasks with little labeled training data. One of the priors is a conditional Gaussian mixture in the content subspace for sampling rich content representations without accessing the original training data. The other prior is a conditional multivariate Gaussian per label that associates latent label representations with task-specific information, requiring only a label name or a small set of labeled examples. Extending a pre-trained language decoder with those priors via the prefix-tuning technique (Li and Liang, 2021), our model is able to sample rich content representations of seen labels and combine them with the representations of new labels to generate diverse and natural sentences. In addition, we empirically observe that VAE-DPRIOR alleviates posterior collapse (Wang et al., 2020), a long-standing problem of VAEs that makes it difficult to train a latent model to generate coherent and semantically diverse texts.
To sum up, our key contributions are threefold: i) we propose VAE-DPRIOR, a model with novel disentanglement priors for low-resource task-specific NLG, which enables sampling diverse content representations directly from the content prior; ii) we introduce ϵ-disentangled, which sets a novel regularization goal for disentangled representations; iii) our model outperforms competitive baselines in the low-resource settings on the tasks of text style transfer and data augmentation for continual few/zero-shot text classification.

Methodology
To tackle task-specific NLG in low-resource settings, we introduce a deep generative model, VAE-DPRIOR, which employs disentanglement priors, including a content prior for rich contents, to generate coherent and semantically diverse texts. We are provided with a large corpus of labeled sentences D^(0) = {(x_i, y_i)}_{i=1}^n for an initial task T^(0), where a sentence x_i ∈ X is annotated with a seen label y_i ∈ Y. The goal is to learn a single model that can generate diverse texts for any new task or a sequence of K distinct new tasks {T^(1), T^(2), ..., T^(K)}. Each new task includes multiple novel labels, where a label y ∈ Y is associated with a label name and, optionally, a handful of example texts D_sup = {(x_i, y_i)}_{i=1}^m as the support set. The model is evaluated on both data augmentation for continual text classification, described in Sec. 3, and few-shot text style transfer, detailed in Appendix C.1.
For evaluating data augmentation, a text classifier is trained sequentially on the K new tasks and evaluated on the test sets of all seen tasks {T^(1), ..., T^(t)} up to time t. For each task, its training data includes the texts generated by the NLG models, in order to evaluate to what degree the augmented texts improve the classifier performance. In the zero-shot setting, the classifier is trained only on the generated texts using the label names of new tasks, while in the few-shot setting the support sets are also used for data augmentation and classifier training.

Theoretical Framework
In the absence of large training data for new tasks, one of the key challenges is to construct content representations and label representations in the latent space that satisfy information purity. That is, content representations should not contain label information, otherwise old-task information in such content representations may contaminate the combined representations for new tasks, and vice versa.
Formally, the latent space is the sample space Ω for both content and label representations. In the corresponding probability space, we define a random variable vector Z_y for the latent representations of each label y ∈ Y and a random variable vector Z_c for latent content representations. Observable word sequences are denoted by the random variable vector X, where each variable X_v corresponds to a word in the vocabulary V. The statistical dependencies between those random variables are illustrated by the Bayesian network in Fig. 2(a), where C denotes the prior knowledge of contents. The dashed arrow denotes a possible dependency between Z_y and Z_c.
To achieve information purity, the learned models are expected to follow the structure illustrated in Fig. 2(a): there is no dependency between C and Z_y, and no dependency between y and Z_c. However, prior works on disentangled representation learning regularize models by approximating Z_c ⊥⊥ Z_y (Cheng et al., 2020; Wang and Jordan, 2021), which may violate the true statistical relation between Z_c and Z_y. Moreover, even if Z_c ⊥⊥ Z_y holds after regularization, it does not imply Z_y ⊥⊥ C or Z_c ⊥⊥ y. The random variable of a label can still depend on both Z_c and Z_y.
To address this limitation, we propose to regularize the priors of the latent variables to encourage information purity. After training, we expect the mutual information I(Z_y, y) and I(Z_c, C) to be high, while I(Z_y, C) and I(Z_c, y) are low or zero. One way to achieve this is to make I(Z_y, y) high only in the dense regions of Z_y but force it to be low or zero in the dense regions of Z_c, and vice versa. As a result, we expect little overlap between the dense regions of p_{θ_c}(Z_c|C) and those of p_{θ_y}(Z_y|y), so that the distances between those priors are large. We characterize this property by introducing ϵ-disentangled below, and refer to priors satisfying it as disentanglement priors. In Appendix E.1, we conduct an in-depth discussion of this property. We show that if p_{θ_c}(Z_c|C) and p_{θ_y}(Z_y|y) are not ϵ-disentangled, then under a mild assumption at least one of them is non-identifiable, which is a leading cause of posterior collapse (Wang et al., 2020).

VAE with disentanglement priors. Using the disentanglement priors, we employ the maximum likelihood principle to learn the parameters of the joint distribution ∏_{y∈Y} p_θ(X, Z_c, Z_y | C, y). The marginal distribution ∏_{y∈Y} p_θ(X | C, y) is given by

∏_{y∈Y} ∫∫ p_θ(X | Z_c, Z_y, y, C) p_{θ_y}(Z_y | y) p_{θ_c}(Z_c | C) dZ_y dZ_c.   (1)

We learn the above distribution in the VAE framework. Note that the introduction of the conditions C and y makes the priors of both latent variables conditional, which differs from vanilla VAEs that have only unconditional priors for latent variables.
Given a dataset D = {(x_i, y_i)}_{i=1}^n, the training problem for the marginal distribution in Eq. 1 is formulated as a constrained maximum-likelihood problem (2): maximize the log-likelihood of D under Eq. 1, subject to the constraint that p_{θ_c}(Z_c|C) and p_{θ_y}(Z_y|y) are ϵ-disentangled. The disentanglement constraint is achieved by carefully choosing priors satisfying ϵ-disentangled, by applying as a regularizer a divergence measure that requires no absolute continuity between the priors, such as the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012), or by both. In the following, we provide the model details and show how to derive an evidence lower bound (ELBO) in the VAE framework for this optimization problem.
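To make the divergence-based option concrete, the sketch below estimates a kernel MMD between mini-batch samples of the two priors. Because the constraint asks the priors to be far apart rather than matched, one natural reading is to reward a large MMD, i.e., add its negative to the loss; the RBF kernel, bandwidth, and sample shapes here are assumptions of this illustration, not choices specified above.

```python
import torch

def rbf_kernel(a: torch.Tensor, b: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    # Pairwise squared Euclidean distances between rows of a and b.
    d2 = torch.cdist(a, b, p=2).pow(2)
    return torch.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2(x: torch.Tensor, y: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    """Biased estimate of the squared MMD between samples x ~ p and y ~ q."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy

# Stand-in samples from the label prior p(Z_y|y) and the content prior p(Z_c|C).
z_y = torch.randn(64, 128)
z_c = torch.randn(64, 128) + 3.0
# Encourage the two priors to stay apart by maximizing their MMD.
loss_disentangle = -mmd2(z_y, z_c)
```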

Model Details
As illustrated in Fig. 2(b), the overall architecture consists of an inference module, a generator, and the priors. The inference module consists of a pre-trained BERT encoder whose outputs serve as inputs to a label encoder and a content encoder, while the generator comprises a prefix encoder and a pre-trained GPT2 with frozen parameters.
The VAE framework adopts variational distributions to approximate the true distributions (Kingma and Welling, 2019), which ends up maximizing an ELBO. We show in Appendix E.3 that the ELBO objective takes the form

E_{q_ϕ(Z_c, Z_y|X, C, y)}[log p_θ(X | Z_c, Z_y)] − D_KL(q_ϕ(Z_c|X, C) ∥ p_{θ_c}(Z_c|C)) − D_KL(q_ϕ(Z_y|X, y) ∥ p_{θ_y}(Z_y|y)),

where the first term is referred to as the reconstruction loss L_r and the other terms constitute regularizers. Following the convention of VAEs, we refer to the network for q_ϕ(Z_c, Z_y|X, C, y) as the inference module and the network for p_θ(X|Z_c, Z_y) as the generator.

Priors. In the label subspace, we assume p_{θ_y}(Z_y|y) for a label y is a simple factorized Gaussian distribution of the form N(Z_y; μ^p_y, λ_y I), where λ_y is a hyperparameter. Its mean μ^p_y is constructed from the name embedding of label y in the zero-shot setting, and by averaging the label name embedding and the embeddings of its support-set examples in the few-shot setting. Each embedding is obtained by feeding its word sequence to the label encoder shared with the inference module.
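A minimal sketch of how the label-prior mean μ^p_y can be assembled from the name embedding and the (optional) support-set embeddings, and how a label representation would be drawn from N(μ^p_y, λ_y I); the function names are placeholders for this illustration.

```python
import torch

def label_prior_mean(label_name_emb: torch.Tensor,
                     support_embs: list[torch.Tensor]) -> torch.Tensor:
    """Mean of the conditional label prior N(Z_y; mu_y, lambda_y * I).

    Zero-shot: only the label-name embedding is available.
    Few-shot: average the name embedding with the support-set embeddings.
    """
    if not support_embs:
        return label_name_emb
    stacked = torch.stack([label_name_emb] + support_embs, dim=0)
    return stacked.mean(dim=0)

def sample_label_prior(mu_y: torch.Tensor, lambda_y: float = 1.0, n: int = 1) -> torch.Tensor:
    # Draw n samples from the isotropic Gaussian label prior.
    return mu_y + (lambda_y ** 0.5) * torch.randn(n, mu_y.shape[-1])
```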
The content prior p_{θ_c}(Z_c|C) takes the form of a Gaussian mixture, Σ_{k=1}^K p(M = k) N(Z_c; μ^p_{c,k}, λ_c I), where M is the random variable indicating membership to a component Gaussian. Inspired by neural topic modelling (Wang and Yang, 2020), we encode the prior knowledge of content C into k-means clusters, assuming a one-to-one correspondence between a component Gaussian and a cluster. The mean of a Gaussian component N(Z_c; μ^p_{c,k}, λ_c I) is computed as W_c c_k, a linear projection of the corresponding cluster centroid c_k. The k-means clusters are built from BERT sentence embeddings on the training data of seen tasks. Adding new topics is a matter of adding new clusters using incremental clustering techniques.
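A sketch of this construction under stated assumptions: cluster BERT sentence embeddings with k-means, project the centroids into the latent content space with a learned matrix W_c (passed in as a fixed array here for illustration), and sample from the resulting mixture with uniform component weights, which is an assumption of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_content_prior(sentence_embs: np.ndarray, k: int, w_c: np.ndarray, lam_c: float = 1.0):
    """Return the component means mu_{c,k} = W_c c_k of the mixture prior p(Z_c|C)."""
    km = KMeans(n_clusters=k, n_init=10).fit(sentence_embs)
    centroids = km.cluster_centers_            # (k, d_bert) cluster centroids c_k
    means = centroids @ w_c.T                  # linear projection into latent space, (k, d_z)
    return means, lam_c

def sample_content_prior(means: np.ndarray, lam_c: float, n: int, rng=np.random):
    """Draw n content representations: pick a component, then sample its Gaussian."""
    k, d = means.shape
    comp = rng.randint(0, k, size=n)           # assumed uniform p(M = k)
    return means[comp] + np.sqrt(lam_c) * rng.randn(n, d)
```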
In Appendix E.2, we show that p_{θ_c}(Z_c|C) and p_{θ_y}(Z_y|y) are ϵ-disentangled with a small ϵ if their means are far from each other and their variances are sufficiently small.

Inference Module. The inference module is a BERT (Devlin et al., 2018) encoder augmented with an encoder for content and an encoder for labels. Each encoder is built on top of the contextual embedding sequences produced by BERT and yields latent representations of the target type. This design is not only parameter efficient but also leverages the strengths of a large-scale pre-trained transformer model.
A BERT model consists of multiple layers. To provide more capacity for capturing the differences between the two types of latent representations while preserving parameter efficiency, we freeze all layers of BERT except the topmost one, so that the content encoder and the label encoder each employ a topmost transformer layer with different parameters while sharing all the remaining layers of BERT.
In the label subspace, given a contextual word embedding sequence V_l = {v_0, ..., v_u} generated by the corresponding topmost layer of BERT, the LabelEncoder implements q_ϕ(Z_y|X, y) in the form of N(Z_y; μ^q_y, diag(σ_y^2)). In order to build a hidden representation focusing on label-relevant information, we apply the label embedding μ^p_y used in the label prior to V_l via soft attention. In particular, we compute an aggregated representation h_y = attention(μ^p_y, V_l) for a label y by using μ^p_y as the query vector to attend over all vectors of V_l. We compute the mean μ^q_y as a linear transformation of h_y with the weight matrix W^l_μ, and the logarithm of the standard deviation log σ_y as a linear transformation with another matrix W^l_σ. By applying the reparameterization trick (Kingma and Welling, 2019), the latent label representation z_y is a function of μ^q_y and a stochastic noise, which is added as the product of σ_y and a Gaussian noise ϵ_y drawn from N(0, I):

log σ_y, μ^q_y = LabelEncoder(V_l)
z_y = μ^q_y + σ_y ⊙ ϵ_y   (4)

where ⊙ denotes the element-wise product.
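A minimal PyTorch sketch of Eq. 4: the label-prior mean attends over the topmost-layer token embeddings, and the reparameterization trick produces z_y. The plain dot-product attention and the layer sizes are assumptions of this sketch rather than details fixed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelEncoder(nn.Module):
    def __init__(self, hidden: int, latent: int):
        super().__init__()
        self.w_mu = nn.Linear(hidden, latent)        # W^l_mu
        self.w_logsigma = nn.Linear(hidden, latent)  # W^l_sigma

    def forward(self, mu_p_y: torch.Tensor, v_l: torch.Tensor):
        # mu_p_y: (hidden,) label prior mean used as the attention query
        # v_l:    (seq_len, hidden) topmost-layer contextual embeddings
        scores = v_l @ mu_p_y                        # (seq_len,)
        attn = F.softmax(scores, dim=0)
        h_y = attn @ v_l                             # aggregated label-relevant representation
        mu_q = self.w_mu(h_y)
        log_sigma = self.w_logsigma(h_y)
        eps = torch.randn_like(mu_q)                 # reparameterization trick
        z_y = mu_q + log_sigma.exp() * eps
        return z_y, mu_q, log_sigma
```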
In the content subspace, we consider q_ϕ(Z_c|X, C) = N(Z_c; μ^q_c, diag(σ_c^2)). Taking V_c from the corresponding topmost layer of BERT as input, the ContentEncoder consists of a mean-pooling layer followed by one linear layer for the mean and another linear layer for the logarithm of the variance of q_ϕ(Z_c|X, C). The same reparameterization trick is applied to obtain the latent representation z_c.

Generator. Given a pair of latent representations (z_c, z_y), the generator captures p(X|z_c, z_y), factorized in the autoregressive form p(X|z_c, z_y) = ∏_t p(x_t | x_{<t}, z_c, z_y).
We employ the prefix-tuning technique (Li and Liang, 2021), which yields continuous prompts for the decoder in low-resource situations. A continuous prompt is a continuous vector sequence of length L. The prefix encoder consists of L MLPs, each of which computes the vector at position i as M_θ[i, :] = MLP_θ([M'_θ[i, :]; z_y; z_c]), where M'_θ ∈ R^{|M_idx| × H'} is a learned matrix encoding the position information of the continuous prompt. In practice, we find it also useful to prepend the name embedding of the target label to the prefix. Further implementation details are available in Appendix A.
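A sketch of the prefix encoder described above: each of the L positions gets its own MLP over the concatenation of a learned position row M'_θ[i, :] with z_y and z_c. The two-layer MLP, the hidden size, and treating out_dim as a stand-in for the flattened key/value size expected by the frozen GPT2 are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class PrefixEncoder(nn.Module):
    def __init__(self, prefix_len: int, pos_dim: int, z_dim: int, out_dim: int, hidden: int = 512):
        super().__init__()
        self.pos_emb = nn.Parameter(torch.randn(prefix_len, pos_dim))  # M'_theta
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(pos_dim + 2 * z_dim, hidden), nn.Tanh(), nn.Linear(hidden, out_dim))
            for _ in range(prefix_len)
        ])

    def forward(self, z_y: torch.Tensor, z_c: torch.Tensor) -> torch.Tensor:
        # Returns the continuous prompt M_theta of shape (prefix_len, out_dim),
        # prepended to the frozen decoder as its prefix.
        rows = [mlp(torch.cat([self.pos_emb[i], z_y, z_c], dim=-1))
                for i, mlp in enumerate(self.mlps)]
        return torch.stack(rows, dim=0)
```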
In the low-resource settings, the mechanisms for constructing latent label and content representations should be consistent across tasks; otherwise, labeled data would be needed to adjust model parameters to alleviate the discrepancies. Therefore, we minimize the parameters updated across tasks, use the same prior for content representations, and construct label embeddings with the same label encoder for different tasks. The parameters of the pre-trained encoder and the pre-trained decoder are frozen during training. Thus, we only need to train the parameters of the intermediate hidden layers between them on the data of the initial task, and these are also frozen for new tasks. Freezing parameters effectively prevents catastrophic forgetting when learning new tasks.

Training and Inference
Given a training corpus D = {(x_i, y_i)}_{i=1}^n, we derive the objective function L_{θ,ϕ}(D) = L_r + Σ_y L_y(y) + L_c from the objective in Eq. 2 and the ELBO, where L_y and L_c are the KL regularization terms from the ELBO. The constraint is removed by the use of the disentanglement priors.
We first pre-train the whole model on the corpus of the initial task T^(0) without applying any disentanglement constraint or the regularizers derived from the ELBO, and then fine-tune the model with all regularizers. In practice, we find that this two-step approach is important for achieving optimal empirical performance.

Regularization in the Label Subspace. The regularization term L_y(y) in the label subspace is derived from D_KL(q_ϕ(Z_y|X, y) ∥ p_{θ_y}(Z_y|y)).
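Following the same steps as the content-space derivation in Appendix E.3.4 (exact entropy for the q term, the prior term estimated at the reparameterized sample), this KL reduces, up to constants, to the form sketched below; the exact constants and equation number are omitted here.

```latex
\mathcal{L}_y(y) \;=\; D_{\mathrm{KL}}\!\big(q_\phi(Z_y|X,y)\,\big\|\,p_{\theta_y}(Z_y|y)\big)
\;\approx\; \frac{1}{2\lambda_y}\,\big\|z_y - \mu^{p}_{y}\big\|^{2} \;-\; \log \sigma_y \;+\; \mathrm{const},
\qquad z_y = \mu^{q}_{y} + \sigma_y \odot \epsilon_y .
```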
The first term enforces latent label representations z_y to be close to the label prototype μ^p_y obtained from the label prior. In contrast, the corresponding regularization term in a vanilla VAE with unconditional Gaussian priors takes the form ∥z_y∥^2, which only smooths the latent representations without providing any label-specific information.

Regularization in the Content Subspace. Derived from D_KL(q_ϕ(Z_c|X, C) ∥ p_{θ_c}(Z_c|C)), the regularization term L_c takes a similar form to the loss of deep k-means (Fard et al., 2020).
We compute it using EM. The responsibility q_ϕ(M_k|x) denotes the probability of an example x belonging to cluster k, computed with a temperature τ. In our experiments, we employ hard EM, where q_ϕ(M_k|x) indicates whether the current Gaussian has the same index as the component with the minimal Euclidean distance ∥z_c − μ^p_{c,k}∥^2 among all components.

Inference. For data augmentation, our model samples a large number of texts and filters out the ones that are not in accordance with the target labels. For each new label y, we construct the mean μ^p_y of p_{θ_y}(Z_y|y) by averaging the embeddings of the label name phrase and, optionally, its associated texts from the support set. The corresponding embeddings are generated by feeding the name phrases and texts into the label encoder. Then we sample a large number of content embeddings from the content prior p_{θ_c}(Z_c|C). All combinations of label embeddings and content embeddings are fed to the generator to generate candidate examples. We find that low-quality candidates are not in accordance with the target labels; hence, we perform quality control by filtering out irrelevant ones. Specifically, we project each candidate to a latent representation using the label encoder and rank all candidates w.r.t. the Euclidean distance between each representation and its associated name embedding. The top-k candidates are taken as the final outputs.
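A sketch of the quality-control step just described: each candidate is re-encoded with the label encoder and candidates are ranked by Euclidean distance to the target label's name embedding. The encoder callable and top_k value are placeholders for this illustration.

```python
import numpy as np

def filter_candidates(candidates, target_label_emb, encode_with_label_encoder, top_k=50):
    """Keep the top-k generated candidates whose label-encoder representation
    is closest to the target label's name embedding."""
    reps = np.stack([encode_with_label_encoder(text) for text in candidates])   # (n, d)
    dists = np.linalg.norm(reps - target_label_emb[None, :], axis=1)
    order = np.argsort(dists)          # ascending distance = most label-consistent first
    return [candidates[i] for i in order[:top_k]]
```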

Experiments
We evaluate our model on both continual few/zero-shot text classification and few-shot text style transfer. The former requires sampling rich content representations from seen tasks, while the latter requires retaining task-independent contents from inputs. In both cases, it is desirable for models to combine latent label and content representations systematically and consistently across tasks.
The details of few-shot text style transfer are available in Appendix C. We compare VAE-DPRIOR with five style-transfer baselines and show superior results on two datasets in the few-shot setting in terms of style-transfer accuracy, semantic relevance, and naturalness of the generated text.

Continual Zero/Few-shot Learning
Setting. The general setting of continual zero/few-shot text classification has been introduced in Sec. 2. Following a conventional continual learning setting (Lopez-Paz and Ranzato, 2017), a memory M_k is associated with a task T^(k) to store a fixed number of training examples per seen task. Upon the arrival of a new task, given the label names in the task and optionally a support set, a generative model produces new task-specific examples. A classifier is trained on a combined set of examples from the support set, the memories, and the augmented examples, and evaluated on the test sets of all seen tasks. The datasets we use are EMPATHETIC and TACRED. Please refer to Appendices B.1 and B.2 for more details.

Evaluation. We use the widely adopted metric ACC_avg in continual learning, which measures performance by averaging the accuracies of the classifier on the test sets of all seen tasks (Lopez-Paz and Ranzato, 2017). In addition, to measure the diversity of generated examples, we calculate the average similarity score between all pairs of examples within each label, i.e., (1/|Y|) Σ_{i,j} sim(x_i, x_j) · 1[argmax p(y|x_i) = argmax p(y|x_j)], where we use BLEU (Papineni et al., 2002) and word mover's distance (WMD) (Kusner et al., 2015) as the similarity functions. Lower scores indicate more diversified examples within each label.

Baselines. We compare against the following data augmentation baselines: i) EDA (Wei and Zou, 2019) randomly deletes, substitutes, inserts, or swaps words in the original sentences. ii) BERT (Ma, 2019) uses BERT to determine the position to insert or substitute words. iii) RTT (Sennrich et al., 2015) augments datasets by generating paraphrases of the original sentences through round-trip translation. iv) LAMBDA (Kumar et al., 2020) generates examples conditioned on the label text with a pre-trained language model (Raffel et al., 2020) and uses a classifier to filter out low-quality examples, as in our work. v) EX2 (Lee et al., 2021) applies T5 (Raffel et al., 2020) and an extrapolation technique to increase the diversity of generated examples and deal with the low-resource setting. vi) OPTIMUS (Li et al., 2020) is our backbone model, an autoencoder framework that uses BERT as the encoder and GPT2 (Radford et al., 2019) as the decoder. vii) CAUSAL-LENS (Hu and Li, 2021) improves the training of OPTIMUS using an intervention loss and a counterfactual loss.
Both OPTIMUS and CAUSAL-LENS are designed for controllable text generation. We use them for data augmentation by assessing their ability for label-conditional generation.
Main Results and Discussions. We first compare the baselines with our model in its best setting, coined VAE-DPRIOR, which applies both the disentanglement priors and the MMD regularizer between the priors. The results in Table 1 show that it outperforms all data augmentation baselines in all zero/few-shot learning settings by significant margins. Augmentation approaches such as EDA, BERT, and RTT generate adversarial examples of the original sentences via manipulation of words or paraphrasing. However, adversarial distributions are not the same as the true distribution, so their generated examples do not improve the continual learning performance. They even degrade the performance in the five-shot setting compared to no data augmentation.
Although LAMBDA, EX2, OPTIMUS, and CAUSAL-LENS aim to learn the true distribution from labeled data, we observe that they often fail to generate texts in accordance with the correct labels, especially for new tasks. Thus, their performance does not improve given more labeled examples of new tasks. In contrast, VAE-DPRIOR achieves a significantly higher degree of compositional generalization across tasks, evident from the high average accuracy of the classifier trained on its generated examples. The performance of the classifier further improves when our model is fed with more labeled examples of new tasks.
We also evaluate the diversity of the generated examples of all augmentation methods in the one-shot setting, presented in Table 2. The generated sentences from EDA, BERT, and RTT are mostly paraphrases of the original sentences; therefore, they cannot significantly diversify examples at the semantic level. LAMBDA generates examples conditioned on the label names and the first k words of the original sentence, which also limits diversity. EX2 enriches diversity by extrapolating novel samples from the existing sentences within the novel labels. OPTIMUS and CAUSAL-LENS employ a GAN (Goodfellow et al., 2014) and a conditional GAN (Mirza and Osindero, 2014), respectively, to generate diversified latent vectors for producing examples with the novel labels. However, with merely one or five sentences per label, such methods only generate a small sample of texts with novel labels. In contrast, VAE-DPRIOR can combine plenty of seen content representations acquired from the past with the representations of new labels to generate high-quality sentences.
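A sketch of the within-label diversity score used in these comparisons, with sentence-level BLEU from NLTK as the similarity function (WMD would be swapped in analogously). As a simplification, this version averages over pairs rather than normalizing by the number of labels as in the formula above.

```python
from itertools import combinations
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def within_label_similarity(texts_by_label):
    """texts_by_label: dict mapping a predicted label to its generated texts.
    Lower average similarity means more diverse generations."""
    smooth = SmoothingFunction().method1
    total, pairs = 0.0, 0
    for texts in texts_by_label.values():
        toks = [t.split() for t in texts]
        for a, b in combinations(range(len(toks)), 2):
            total += sentence_bleu([toks[a]], toks[b], smoothing_function=smooth)
            pairs += 1
    return total / max(pairs, 1)
```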

Ablation Study
Disentanglement. To show the importance of ϵ-disentanglement, we remove the constraint of the optimization problem (2) by using only the pre-trained model resulting from the first training step, denoted VAE-DPRIOR (AE). As shown in Table 4, it suffers from a significant drop in terms of all metrics in the one-shot setting. In the same table, we also report comparisons with alternative priors: (i) unconditional priors as in the vanilla VAE (VAE (UNCOND)); (ii) the same priors as ours but with the variance coefficients of the two priors increased from 1 to 50 (VAE-DPRIOR (LGVAR)); (iii) a Gaussian mixture with randomly initialized means as the content prior, without fine-tuning the parameters of the prior (VAE-DPRIOR (RAND)); (iv) same as (iii) but with the parameters of the content prior fine-tuned (VAE-DPRIOR (RAND-FT)); and (v) a simple factorized Gaussian conditioned on the averaged sentence embedding of all sentences of the initial task as the content prior (VAE-DPRIOR (GAUSS)). In the one-shot setting, the accuracy drops by more than 7% and 3% on TACRED and EMPATHETIC, respectively, when using the alternative or no priors, indicating the importance of ϵ-disentanglement. Increasing the variance of our priors also jeopardizes ϵ-disentanglement. As evident in Fig. 3, produced with t-SNE (Van der Maaten and Hinton, 2008), the priors of VAE (UNCOND) and VAE-DPRIOR (LGVAR) are severely overlapped, in contrast to VAE-DPRIOR.

We further investigate how the disentanglement regularizers influence our model by removing MMD or replacing MMD with GAN, HSIC, and IDEL (Cheng et al., 2020). As shown in Table 3, except for MMD, the other disentanglement regularizers bring almost no improvement to VAE-DPRIOR. HSIC, GAN, and IDEL enforce independence between latent variables and even hurt performance. We observe that the GAN-based regularizer causes mode collapse, because VAE-DPRIOR with GAN tends to generate overly similar examples. In contrast, if we apply MMD to other types of VAEs, such as a vanilla VAE, it leads to improved performance (see Appendix B.3), because those VAEs do not have the ability to disentangle representations.

Posterior Collapse. Classical VAEs, such as the vanilla VAE (VAE (UNCOND)), suffer from a notorious problem called posterior collapse. Those models learn non-injective mappings between latent variable values and the likelihoods; thus many latent representations are mapped to the same model outputs. We investigate this problem by comparing VAE-DPRIOR with VAE (UNCOND), VQ-VAE (Oord et al., 2017), C-VAE (Jang et al., 2016), VAMP-VAE (Tomczak and Welling, 2018), and VAE-DPRIOR with the alternative priors described before, in terms of diversity and accuracy. If there is severe posterior collapse, models will generate similar texts, indicated by high BLEU scores, and the classifier trained on the augmented data will perform poorly. Unsurprisingly, the results in Table 4 show that VAE-DPRIOR largely outperforms those VAEs. Although VAMP-VAE also introduces conditional priors, the latent variables of its priors are not required to be ϵ-disentangled with a small ϵ. VAE-DPRIOR (RAND-FT) even generates almost identical texts.
Posterior collapse should lead to high ratios of duplicated outputs. We therefore feed each model 200 diverse latent variable values randomly sampled from their priors and compute the duplicate ratios per label. VAE-DPRIOR (RAND-FT) has the highest ratio at 97.38%, followed by VAE (UNCOND), VAMP-VAE, VAE-DPRIOR (RAND-FT), C-VAE, and VQ-VAE with duplicate ratios of 78.33%, 70.38%, 8.06%, 6.10%, and 3.09%, respectively, on EMPATHETIC in one-shot learning. In contrast, our model generates no duplicates with those latent variable values.
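A minimal sketch of the per-label duplicate-ratio computation used here; whitespace/case normalization is an assumption of this sketch rather than a stated detail.

```python
def duplicate_ratio(texts):
    """Fraction of generated texts that exactly repeat an earlier generation."""
    seen, dup = set(), 0
    for t in texts:
        key = t.strip().lower()
        if key in seen:
            dup += 1
        seen.add(key)
    return dup / max(len(texts), 1)
```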
We also investigate the quality of the outputs of those models by sampling representations from the posterior distributions q_ϕ(Z_y|X, y) and q_ϕ(Z_c|X, C). The duplicate ratio of VAE (UNCOND) drops to merely 4.27%, while that of VAMP-VAE increases to 97.95% on EMPATHETIC. Our model still achieves a zero duplicate ratio. However, Appendix B.4 shows that the models sampling from posteriors achieve comparable accuracy to those sampling from priors.

Related Work

Disentangled Representation Learning. Disentangled representation learning has been studied extensively (Higgins et al., 2018). The unsupervised approaches mainly fall into either the VAE framework (Burgess et al., 2018) or Generative Adversarial Learning (GAN) (Tran et al., 2017). Recent works have also incorporated causality theories for robustness (Hu and Li, 2021).
There is growing interest in applying disentangled representation learning to NLP applications, such as text style transfer (John et al., 2018a) and mitigating gender bias (Liu et al., 2020). However, it is challenging for those NLP approaches to work in low-resource settings because they do not store rich content information inside models (Romanov et al., 2018).
Controllable Text Generation. Our method decomposes content and (attribute) label, where the label can be considered an additional control signal for text generation. We therefore connect our work to text style transfer (TST) and controllable text generation (CTG). Representation disentanglement is an important line of research in TST, which disentangles content and attribute representations (John et al., 2018a). Many disentanglement approaches have been proposed to minimize the dependence between these two representations, such as mutual information (Yuan et al., 2020) and orthogonality constraints (Wei et al., 2021). CTG steers the text generation of language models by prompt design (Li and Liang, 2021; Shin et al., 2020) or by training conditioned on controllable variables (Li et al., 2020; Hu and Li, 2021). Our work is closely aligned with (Li et al., 2020) and (Li and Liang, 2021). Since the two have similar implementations and (Li et al., 2020) was designed for generation conditioned on latent variables, we pick (Li et al., 2020) as one of our baselines.

Conclusion
In

Limitations
We have studied ϵ-disentangled only in the VAE framework for task-specific language generation, though we believe it should be useful for a wide range of latent variable models. Although the content prior of our model can already be used to sample rich content representations, there is room to store more information and represent an even richer content space reflecting real-world scenarios. In addition, our model has not considered application scenarios with limited computing resources. Though it is beyond the scope of this work, the heavy use of pre-trained large-scale language models makes the deployment of our model particularly challenging in those cases.

A Implementation Details
We use a learning rate of 5e-5 for our method. The training epochs for our generation model in continual few-shot learning are 120 and 160 for EMPATHETIC and TACRED, respectively. All experiments are run five times with different random seeds and we report the average accuracies. The number of clusters for the deep content clustering loss is 1600, 800, and 3200 when training the model on EMPATHETIC, TACRED, and PERSONALITY, respectively. All methods are trained on V100 GPUs. The total number of parameters is 455,864,068 and the total number of trainable parameters is 401,751,808. For style transfer, the training epochs are 120 and 160 for EMPATHETIC and PERSONALITY, respectively. We use BERT-small (Turc et al., 2019) as the backbone of the label and content encoders and GPT2-medium as the decoder. For data augmentation in continual few-shot learning, each label is augmented with 50 examples generated by the different augmentation methods. OPTIMUS is not designed for style transfer; we adapt it by prepending a label phrase as a prompt before the input sentence, so that style transfer can be done by altering the current label phrase to the novel labels in the new tasks.

B Continual Few-shot Learning

B.1 Setting
We consider a continual few-shot learning setting similar to (Antoniou et al., 2020). The text classification model π^c_θ : X → Y is trained sequentially on K distinct tasks {T^(1), T^(2), ..., T^(K)}. The initial task T^(1) includes a training and a test set (D^(1)_train, D^(1)_test), while each succeeding task T^(k>1) includes a support and a test set (D^(k)_sup, D^(k)_test). D^(1)_train includes enough training data for each base class, while D^(k)_sup = {x, y}^{N×|C_k|}_{i=1} includes only N-shot instances per new class. The classes of T^(k) are disjoint from the classes of previous tasks, C_{1:k-1} ∩ C_k = ∅. As in the conventional continual learning setting of (Lopez-Paz and Ranzato, 2017), a memory M_k is associated with T^(k) to store a fixed number of training instances (either examples selected from the support sets or synthetic data) per seen task. Upon the arrival of each task, the classifier π^c_θ is trained on a combined set of instances from the support set, the memory sets, and the augmented examples generated by our generative model. To generate augmented examples, we sample content from the fixed clustering obtained using the large training data in T^(1), and sample labels from C_{1:k}. Note that for our model, as long as we have the generative model and store the label embeddings, we can regenerate examples from all old tasks; therefore, the memory is not necessary for our model. For a fair comparison with the other baselines, however, we still assume a fixed memory for each old task and use only the examples from this memory for replay when training the classifier. We apply our data augmentation method to EMAR (Han et al., 2020), a SOTA continual learning approach for text classification, and follow EMAR for the classifier architecture and how its parameters are updated during continual learning.

B.2 Datasets
TACRED is a relation detection dataset that includes 42 relations. Following the settings in (Wang et al., 2019; Han et al., 2020), examples are clustered into ten groups given the word embeddings of their label phrases. Five groups are randomly selected as the initial task. The support and test sets of the subsequent tasks are drawn from each of the remaining groups. We also randomly generate the support sets five times with different random seeds. Each support set includes 0, 1, or 5 examples.

B.3 Influence of MMD on VAEs
Table 6 shows the performance of VAEs on five-shot learning of TACRED with or without MMD. We present only five-shot results because we found that MMD brings almost no improvement to the different VAEs in zero/one-shot learning, but it consistently leads to performance improvements for all VAEs except VAE-DPRIOR as the number of shots increases. We conjecture that disentanglement regularization performs better when there is enough label-specific information.

B.4 Influence of Sampling from Posterior Distribution
Table 7 shows the performance of the classifiers using augmented data sampled from the posterior distributions of the VAEs. Since sampling representations from the posterior distributions, q_ϕ(Z_y|X, y) and q_ϕ(Z_c|X, C), requires text X as input, we feed all the text from the training sets of previous tasks to the content encoder to obtain content representations, and the text in the support set of new tasks to the label encoder to obtain label representations. We combine the two types of representations to get augmented data for new tasks. Notice that the VAE-DPRIOR (AE) setting in Table 4 adopts a similar way to sample examples for data augmentation, except that the content and label representations z_c and z_y are generated by using the LabelEncoder and ContentEncoder directly.

B.5 Accuracies of VAE-DPRIOR on PERSONALITY and FEWREL
Table 8 shows the performance of the baselines and VAE-DPRIOR on one-shot learning of PERSONALITY and FEWREL. Please note that we use the exact same FEWREL dataset as in (Wang et al., 2019; Han et al., 2020), except that we split the tasks in a different way. VAE-DPRIOR performs best on both datasets as well.

B.6 Accuracies of VAE-DPRIOR using Soft and Hard EM
Table 9 shows VAE-DPRIOR on the two datasets with soft and hard EM. The temperature τ for soft EM is set to 0.5. VAE-DPRIOR with soft EM has higher performance; however, we select hard EM as our main setting because it brings a faster training speed.
C Few-shot Text-style Transfer

C.1 Setting
We follow the common non-parallel text style transfer setting as in (Nangi et al., 2021), where each text sample x is associated with a style label y. In the few-shot setting, the style transfer model π^s_θ : X → X′ is pre-trained on a training set D_train, which includes abundant training data (e.g., more than 50 instances per style) for each base style C_b. After pre-training, the model parameters are frozen, with only the label embeddings updated based on a support set D_sup = {x, y}^{N×|C_n|}_{i=1}, which includes only N-shot instances for each one of the |C_n| novel styles. Although VAE-DPRIOR can be easily fine-tuned on the support sets, we found that fine-tuning brings negligible performance gain in the few-shot setting. The test set D_test is sampled from both D_train and a corpus D′_train which is from the same distribution as D_train. The style transfer task is to transfer text in D_test into the styles of C_n in D_sup.

C.2 Datasets
The EMPATHETIC dataset includes around 18,000 dialogues. Each dialogue consists of a context description and an associated empathetic type. PERSONALITY includes 200,000 image captions associated with 215 personality types. Since many personalities are highly correlated in terms of their semantics, we cluster these personalities into 35 groups and manually select one type for each group. For EMPATHETIC, we use the context descriptions as the original text to be transferred and their corresponding empathetic types as the styles. For PERSONALITY, we use the image captions and their personalities. We randomly select examples of 28 empathetic types and 30 personality types for the training set and draw support sets from the remaining empathetic and personality types. We draw 0, 1, or 5 examples for each held-out label. After drawing, the remaining examples are considered test examples. For each k-shot setting, the support and test sets are drawn five times with different random seeds to avoid bias during evaluation. Our experiments are run on all the support sets and we report the average performance.

C.3 Evaluation.
We use three automatic metrics, Style-transfer Accuracy, Self-WMD, and Perplexity, to evaluate the accuracy of style transfer, semantic relevance, and naturalness of the generated text, respectively. For Style-transfer Accuracy, we train a BERT (Devlin et al., 2018) classifier on styles; the averaged accuracy on target labels indicates the correctness of the style transfer. Self-WMD (Kusner et al., 2015) measures the WMD between the original text and the transferred text. Perplexity is estimated by an English statistical language model released by (Koehn et al., 2007).

C.4 Baselines.
We compare five style transfer baselines: i) R-VAE-AVG (John et al., 2018b) learns disentangled label and content representations. ii) R-VAE-CF (Nangi et al., 2021) builds on R-VAE-AVG and uses a counterfactual reasoning module to control the generation of label representations. iii) ZF (Smith et al., 2019) is a back-translation model aimed at the zero-shot text transfer problem. The two controllable text generation baselines used in the continual few-shot setting, iv) OPTIMUS and v) CAUSAL-LENS, are extended for style transfer as well. Please refer to Appendix A and their original works for the detailed style transfer implementation.
The inference of VAE-DPRIOR for text style transfer differs from the inference for continual few-shot learning. Given a new style, we start by sampling a name representation and text representations from the posterior label distribution q_ϕ(Z_y|X, y), conditioned on its associated name phrase and the text sequences in the support set, respectively. Then, we create the label representation z_y for the new style by averaging its associated text representations and its name representation. The content representations are sampled from the posterior content distribution q_ϕ(Z_c|X, C), conditioned on the text to be style-transferred. After feeding the content representations and the representations of target styles to the generator, we obtain the most likely outputs by beam search.

Table 10: The results of one-shot style transfer on both datasets.
C.6 Main Results and Discussions.
The results in Table 10 show that our method performs better than all baselines in terms of all metrics except Style Accuracy on EMPATHETIC and Self-WMD on PERSONALITY. An ideal style transfer model should find a good balance across all three evaluation metrics. Though CAUSAL-LENS and OPTIMUS each achieve the best result on a single metric, they fail to perform well across all metrics. We observe that CAUSAL-LENS performs poorly on preserving the content of the original sentence, while OPTIMUS performs poorly on style transfer and basically replicates the original sentences on the PERSONALITY dataset.
In contrast, the average ranking of VAE-DPRIOR over the three metrics is the highest among all methods. Our model performs particularly well in terms of semantic relevance and naturalness while still keeping high style transfer accuracy. Other methods that utilize disentanglement learning, including R-VAE-AVG, R-VAE-CF, and CAUSAL-LENS, often perform well on one metric while losing on the others. We conjecture this is because their methods do not fully disentangle the representations, so they cannot balance well between content preservation and style transfer.

C.7 Complete Automatic Evaluation Results of Style Transfer on two Datasets
The full results of the automatic evaluation on the EMPATHETIC and PERSONALITY datasets are presented in Table 11 and Table 12, respectively. Overall, in all few-shot settings, our method performs best in terms of the average rank among all baselines. On the EMPATHETIC dataset, however, R-VAE-AVG and CAUSAL-LENS outperform our method in terms of Style Accuracy. Through inspection, we found that R-VAE-AVG and CAUSAL-LENS tend to overfit to the support set after fine-tuning on merely a small number of training instances. For example, R-VAE-AVG tends to copy text from the support set, which gains higher Style Transfer Accuracy. But this effect makes the Perplexity and Self-WMD of R-VAE-AVG and CAUSAL-LENS decrease from zero-shot to five-shot learning. In contrast, VAE-DPRIOR performs steadily across the different (zero/few-shot) settings: its Style Accuracy increases without losing performance on content preservation and naturalness.

C.8 Human evaluation result
We hire three crowd-workers to rate the sentences with a score from 1 to 5, indicating whether the generated sentences belong to the target styles and whether the content of the generated sentences is consistent with the original sentences. To evaluate naturalness, we follow the evaluation setting in (Mir et al., 2019) and let the crowd-workers distinguish the human-generated sentences from the model-generated sentences.

Table 15: The style transfer results of different models trained on the PERSONALITY dataset in the one-shot learning setting. The original sentence, "Look at the fissures in the strata columns, beautiful.", is transferred from the original style "appreciative" to the target style "angry".
OPTIMUS: fl in the fissip , the fissile columns in theTyphris ' windows in the Sky .
CAUSAL-LENS: the horizon is filled with dazzling colors .
VAE-DPRIOR: Look at the splashes on the rocks in the middle of the street , I hate looking at rocks . They're so ugly looking , and I can't stand to look at them anymore .

E Proofs

E.1 Discussion about ϵ-Disentangled
To achieve information purity, the learned models should follow the structure illustrated in Fig. 2(a): there is no dependency between C and Z_y, and similarly no dependency between y and Z_c. However, prior works on disentangled representation learning regularize the models by minimizing the mutual information I(Z_c, Z_y) between Z_c and Z_y, such that Z_c ⊥⊥ Z_y when I(Z_c, Z_y) = 0 (Cheng et al., 2020; Wang and Jordan, 2021). In other words, prior works only require that there is no edge between Z_c and Z_y in the Bayesian network. However, this does not imply I(Z_c, y) = 0 or I(Z_y, C) = 0. In the trained models, C can still be the shared parent or child of two independent random variables under the regularization of prior works. In addition, the independence assumption between Z_c and Z_y does not always hold in practice. For example, if Z_y is a random variable for emotion categories and Z_c represents events influencing emotions, they are causally dependent. Forcing the independence assumption may deteriorate model performance.
To address this limitation, we propose to regularize the priors of latent variables to encourage information purity. If we take a close look at I(Z_y, y) = Σ_y ∫ p(Z_y, y) log [p(Z_y, y) / (p(Z_y) p(y))] dZ_y, which simplifies to Σ_y ∫ p(y|Z_y) p(Z_y) log [p(Z_y|y) / p(Z_y)] dZ_y, a high mutual information expects p(Z_y) > 0 whenever p(Z_y|y) is high. Similarly, if we aim for an extremely small I(Z_y, C), we expect a low p(Z_y) or p(Z_y) = 0 whenever p(Z_y|C) > p(Z_y). If we design the priors so that their dense regions do not overlap, we achieve information purity by maximizing the corresponding mutual information.
We do not require absolute continuity for the associated divergence measure because, when the priors are ϵ-disentangled with a fairly low ϵ, one of the priors has zero probability in the regions where the other prior has positive support.
For the cases with more than one latent random variable (vector), we extend this idea to latent variable conditional non-identifiability.

Definition E.2 (Latent variable conditional non-identifiability). Given a likelihood function p(X|Z_a, Z_b; θ) with parameters θ and a dataset D = {x_1, ..., x_n}, the latent variables Z_a are non-identifiable conditioned on Z_b if p(D|Z_a, Z_b; θ) = p(D|Z_b; θ).

Proposition E.3. Given a likelihood function p(X|Z_a, Z_b; θ) with parameters θ and a dataset D = {x_1, ..., x_n}, the latent variables Z_a are non-identifiable conditioned on Z_b if p(Z_a|Z_b; θ) = 1.

Proof: Given p(Z_a|Z_b; θ) = 1,

p(D|Z_b; θ) p(Z_a|Z_b; θ) = p(D|Z_a, Z_b; θ)
p(D|Z_b; θ) = p(D|Z_a, Z_b; θ).

For all x_i in D, if p(x_i|Z_a = z, Z_b = z) > p(x_i|Z_a = z, Z_b = z′) with z ≠ z′ for all z, z′ ∈ Z, where Z is the space of latent variable values, then p(Z_a|Z_b) is high because the two random variable vectors are almost copies of each other. In this case, if p_a(Z_a) and p_b(Z_b) share the same dense regions, or are even identical, such a conditionally non-identifiable case will not be penalized during training. In contrast, if p_a(Z_a) and p_b(Z_b) are ϵ-disentangled with a small ϵ, the parameters leading to conditionally non-identifiable cases are discouraged by receiving a zero or low likelihood p(x_i|Z_a = z, Z_b = z; θ) p_a(z) p_b(z).

E.3 Proofs for VAE with Disentanglement Priors
The main difficulty of maximum likelihood learning for the optimization problem (2) is that the marginal probability of data p(X|C, y) under the model is intractable. We apply variational techniques to derive the ELBO for the optimization problem (2), whose constraint is removed by introducing the disentanglement priors.
In the VAE framework, we adopt variational distributions to approximate the true distributions (Kingma and Welling, 2019), which ends up maximizing an ELBO. More specifically, we introduce a variational posterior q_ϕ(Z_c, Z_y|X, C, y) to approximate the true posterior p_θ(Z_c, Z_y|X, C, y), and derive the ELBO for p_θ(X|C, y) in Sec. E.3.1:

E_{q_ϕ(Z_c, Z_y|X, C, y)}[log p_θ(X, Z_c, Z_y|C, y) − log q_ϕ(Z_c, Z_y|X, C, y)]   (8)

We show in Sec. E.3.2 that the ELBO objective is further decomposed into

E_{q_ϕ(Z_c, Z_y|X, C, y)}[log p_θ(X|Z_c, Z_y)] − D_KL(q_ϕ(Z_c|X, C) ∥ p_θ(Z_c|C)) − D_KL(q_ϕ(Z_y|X, y) ∥ p_θ(Z_y|y)),

where the first term is referred to as the reconstruction loss L_r and the other terms constitute regularizers.

E.3.1 Evidence lower bound (ELBO)
log p_θ(X|C, y) ≥ E_{q_ϕ(Z_c, Z_y|X, C, y)}[log p_θ(X, Z_c, Z_y|C, y) − log q_ϕ(Z_c, Z_y|X, C, y)]

In each batch, the model collects the latent representations (Z_c, Z_y), which form a content representation matrix and a label representation matrix, respectively. We apply the linear kernel to build a Gram matrix K_c = Z_c Z_c^T for content and a Gram matrix K_y = Z_y Z_y^T for labels. The HSIC metric is computed as HSIC(Z_c, Z_y) = (1/m^2) trace(K_c H K_y H), where H = I − (1/m) 1 1^T and m is the size of the batch. Alternatively, we can use the Gaussian kernel for both types of representations.
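A sketch of the batch HSIC computation just described, using the linear kernel; switching to the Gaussian kernel only changes how the Gram matrices are built.

```python
import torch

def hsic_linear(z_c: torch.Tensor, z_y: torch.Tensor) -> torch.Tensor:
    """Batch HSIC with linear kernels: (1/m^2) * trace(K_c H K_y H)."""
    m = z_c.shape[0]
    k_c = z_c @ z_c.T
    k_y = z_y @ z_y.T
    h = torch.eye(m) - torch.ones(m, m) / m   # centering matrix H = I - (1/m) 11^T
    return torch.trace(k_c @ h @ k_y @ h) / (m ** 2)
```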

Figure 1: Generation of task-specific examples for data augmentation. In this example, the content representation is sampled from the training set, while the label representation is constructed based on the support set.

Figure 3: The label (red) and content (green) representations sampled from the label and content priors of VAE-DPRIOR, VAE-DPRIOR (LGVAR), and VAE (UNCOND) trained on TACRED.

E.3.4 Derivation of the regularization term for latent content representations

We assume q_ϕ(Z_c|X, C) = N(Z_c; μ^q_c, diag(σ_c^2)) and p_θ(Z_c|C) = Σ_{k=1}^K p(M = k) N(Z_c; μ^p_{c,k}, λ_c I); then we have

D_KL(q_ϕ(Z_c|X, C) ∥ p_θ(Z_c|C)) = Σ_{k=1}^K p(M = k|Z_c) (1/(2λ_c)) ∥Z_c − μ^p_{c,k}∥^2 − log σ_c + const.

Proof: Let Z_c = μ^q_c + σ_c ⊙ ϵ_c, where ϵ_c is drawn from N(0, I). Then

D_KL(q_ϕ(Z_c|X, C) ∥ p_θ(Z_c|C)) = E_{q_ϕ(Z_c|X, C)}[log q_ϕ(Z_c|X, C) − log p_θ(Z_c|C)] = E_{p(ϵ_c)}[log q_ϕ(Z_c|X, C) − log p_θ(Z_c|C)].

Using the reparameterization trick, E_{p(ϵ_c)} log q_ϕ(Z_c|X, C) = − log σ^q_c up to a constant. It remains to estimate log p_θ(Z_c|C), which is a Gaussian mixture. Let γ_k ∈ {0, 1} indicate the k-th component of z; the likelihood function for z takes the form

p_θ(z, γ) = Π_{k=1}^K p(M = k)^{γ_k} N(z | μ^p_{c,k}, λ_c I)^{γ_k}.

In this work, we consider using EM (Bishop and Nasrabadi, 2006), which estimates the expected value of the complete log-likelihood function, given by

E_γ[log p_θ(z, γ)] = Σ_{k=1}^K E(γ_k) {log p(M = k) + log N(z | μ^p_{c,k}, λ_c I)},

where E(γ_k) = p(M = k|z_c) = p(M = k) N(z | μ^p_{c,k}, λ_c I) / Σ_{j=1}^K p(M = j) N(z | μ^p_{c,j}, λ_c I) is estimated in the E-step. For hard EM:

E-step. For each latent content representation z_c, the most likely component Gaussian is given by k* = argmax_k p_θ(M = k) N(Z_c; μ^p_{c,k}, λ_c I).

M-step. Putting the estimated k* into Eq. (10), this step aims to optimize Σ_{k=1}^K γ_{k*} {−(1/(2λ_c)) ∥Z_c − μ^p_{c,k}∥^2} + const.

Putting them together yields the expression for D_KL(q_ϕ(Z_c|X, C) ∥ p_θ(Z_c|C)) stated above.

Table 2: The diversity scores of the generated examples measured with BLEU and WMD in one-shot learning.

Table 3: The ACC_avg of VAE-DPRIOR with different disentanglement losses in zero/few-shot learning.

Table 4: The ACC_avg and diversity scores of the models with different VAE frameworks in one-shot learning. Representations are sampled from the prior label and content distributions.

Table 5: The ACC_avg and diversity scores of VAE-DPRIOR with different quality control methods in one-shot learning.

Table 6: The ACC_avg of VAE-DPRIOR and other VAEs with or without MMD in five-shot learning on TACRED. The label and content representations are sampled from the posterior distributions.

Table 7: The ACC_avg and diversity scores of the models with different VAE frameworks in one-shot learning. The representations are sampled from the posterior label and content distributions.

Table 8: The ACC_avg of the baselines and VAE-DPRIOR in one-shot learning on PERSONALITY and FEWREL.

Table 9: The ACC_avg of VAE-DPRIOR in one-shot learning on TACRED and EMPATHETIC with hard and soft EM.

Table 11: The results of zero, one, and five-shot learning of style transfer on the EMPATHETIC dataset.