DuNST: Dual Noisy Self Training for Semi-Supervised Controllable Text Generation

Self-training (ST) has prospered again in language understanding by augmenting the fine-tuning of big pre-trained models when labeled data is insufficient. However, it remains challenging to incorporate ST into attribute-controllable language generation. Augmented only by self-generated pseudo text, generation models over-exploit the previously learned text space and fail to explore a larger one, suffering from a restricted generalization boundary and limited controllability. In this work, we propose DuNST, a novel ST framework to tackle these problems. DuNST jointly models text generation and classification as a dual process and further perturbs and escapes from the collapsed space by adding two kinds of flexible noise. In this way, our model could construct and utilize both pseudo text generated from given labels and pseudo labels predicted from available unlabeled text, which are gradually refined during the ST phase. We theoretically demonstrate that DuNST can be regarded as enhancing the exploration of the potentially larger real text space while maintaining exploitation, guaranteeing improved performance. Experiments on three controllable generation tasks show that DuNST significantly boosts control accuracy with comparable generation fluency and diversity against several strong baselines.


Introduction
Recently, Pretrained Language Models (PLM) (Liu et al., 2019;Dong et al., 2019;Radford et al., 2019;Raffel et al., 2020) have shown superiority in Natural Language Processing (NLP). However, the ever-growing size of these models also demands more training data, which destabilizes the finetuning of PLMs when labeled data is highly insufficient (Zhang et al., 2021). In this case, Self-training (ST) (Scudder, 1965;Yarowsky, 1995;Grandvalet and Bengio, 2004), a classical semi-supervised paradigm, has come to the fore again. As depicted in Fig. 1, ST produces pseudo labels for text using Step 1: Train a base model Step 2: Synthetic labeling Step 3: Select a subset

Top-K Pseudo Data
Step 4: Update Model

Updated Model
Figure 1: Classic Self-training. ST trains a base classification model on a small labeled data set, then iteratively predicts pseudo labels for unlabeled data to augment the original set and then fit the model to the augmented set. a classifier and then retrains the classifier with augmented data in an iterative process. By this means, ST could utilize a large amount of unlabeled text to denoise the pseudo labels and improve the generalization bound (Wei et al., 2021;Zhang et al., 2022), boosting various Natural Language Understanding (NLU) tasks (Mukherjee and Hassan Awadallah, 2020;Vu et al., 2021;Li et al., 2021).
Nevertheless, how to apply ST to Natural Language Generation (NLG), especially the datahungry attribute-controllable NLG, remains an open question. Different from typical NLU tasks like text classification, controllable NLG takes an attribute label as input to generate a textual sequence meeting the given attribute rather than predicting label given input text. This brings two new challenges for ST. Challenge 1: since model inputs become discrete labels, there is no massive unlabeled data for the NLG model to extend the learned distribution boundary. Challenge 2: augmented only by self-generated text, NLG models focus on exploitation and cannot explore a larger space. As a result, classic ST only works for a few generation tasks, e.g., Machine Translation (He et al., 2020;Jiao et al., 2021) where massive monolingual data exists, but fails to benefit controllable NLG.
To handle these challenges, we propose a novel Dual Noisy Self Training (DuNST), for semi-supervised controllable NLG. DuNST jointly learns to generate text from given attribute labels and predict labels for text, characterizing these two directions as a dual process via a shared Variational AutoEncoder (VAE) (Kingma and Welling, 2014). Such duality allows our model to leverage not only generated pseudo text but also pseudo labels predicted for available unlabeled text. Both generation and classification would be augmented by the two kinds of pseudo data that will hence be gradually refined during the ST process, handling Challenge 1. Besides, DuNST corrupts the generated pseudo text by two kinds of noise, softmax temperature and soft pseudo text, to further disturb and escape from the text space learned at the previous ST iteration, addressing Challenge 2. Our method can be theoretically regarded as propagating local smoothness (Chen et al., 2021) and exploring a larger and potential space, which helps extend the generalization boundary and improve attribute coverage. Therefore, DuNST further boosts controllability with comparable generation fluency and diversity.
In summary, our contributions are as follows: • To the best of our knowledge, we are the first to incorporate Self-training into semisupervised controllable language generation and propose a novel and effective ST method.
• We demonstrate that DuNST explores a larger potential text space and extends the generalization boundary, providing a theoretical interpretation for our method.
• We conduct thorough experiments on three attribute-controllable generation tasks and manifest the superiority of DuNST in improving control accuracy with competitive quality of the generated text, further exploiting the capacity of powerful PLMs for NLG.

Related Work
Controllable Language Generation Attributecontrollable language generation aims to generate high-quality text satisfying desired attributes, e.g., sentiment, topic and style, which could facilitate diverse downstream applications, such as stylistic writing (Ficler and Goldberg, 2017) and language detoxification (Gehman et al., 2020). In the era of PLM, an effective paradigm for controllable NLG lies in fine-tuning PLMs on datasets containing labeled text (Keskar et al., 2019;Gururangan et al., 2020). Nonetheless, with the increasing scale of PLMs, inadequate labeled data severely hampers the effectiveness of fine-tuning (Zhang et al., 2021). As a remedy, two lines of research have been developed. Lightweight tuning searches a trigger (Sheng et al., 2020) or optimizes only a few parameters like adapter (Ribeiro et al., 2021) or prefix (Li andLiang, 2021;Qian et al., 2022), requiring much less training data. Plug-in control steers the generation probability of NLG models towards target attributes through updating cached hidden states (Dathathri et al., 2020) or rectifying the output distribution (Liu et al., 2021;Krause et al., 2021;Yang and Klein, 2021) guided by offthe-shelf attribute classifiers or conditional PLMs at the inference time, without fine-tuning the generators. Despite no/weak dependence on labeled data, these two lines of work would cause limited control accuracy or decreased fluency.
Self-training Recently, Self-training has flourished again by iteratively generating pseudo labels and augmenting the tuning of data-hungry PLMs, showing great advantages in further enhancing NLU (Meng et al., 2020;Vu et al., 2021;Du et al., 2021;Bhat et al., 2021;Chen et al., 2021) and Neural Machine Translation (NMT) (He et al., 2020;Jiao et al., 2021) where massive unlabeled input text exists. Beyond naive ST, Mukherjee and Hassan Awadallah (2020) select unlabeled instances based on prediction uncertainty of the classifier to improve model confidence. Jiao et al. (2021) also collect informative (high-uncertainty) monolingual sentences to enhance the translation quality of hard examples. Relevant to our work, He et al. (2020) corrupts the pseudo target sentences in NMT by synthetic noise like token shuffle or paraphrase to propagate local smoothness. However, as mentioned in Sec.1, due to Challenges 1&2, it is difficult to directly apply ST (as well as the synthetic noise above) to attribute-controllable NLG.
VAE and Dual Learning VAE (Kingma and Welling, 2014) has proven to be effective in generating diverse text when combined with PLMs due to the flexible semantic properties captured in the latent space (Li et al., 2020;Hu et al., 2022), which could further enhance the variety of pseudo text and thus is more suitable for ST. Dual Learning (DL) (He et al., 2016) has been traditionally proposed and applied in NMT and then extended to joint optimization of NLU-NLG tasks (Xia et al., 2017), which is promising for tackling Challenge 1. Tseng et al. (2020) successfully combined DL with VAE for table-to-text and text-to-table generation, but their model cannot simultaneously optimize the two directions and share learned features, not compatible with our design for Challenge 1.
Unlike the aforementioned work, we revisit the challenges of incorporating Self-training with controllable generation and utilize a shared VAE to construct the duality to handle these challenges, leading to a novel and practical ST framework.

Formulation and Overview
Let x be the text, y be the attribute label, P l (x, y) be the empirical distribution of text and label formed by a labeled dataset D L = {x i , y i }, and P u (x) be the marginal distribution of text formed by an unlabeled dataset D U = {x i } from the same domain. We aim to learn an attribute-controllable generator G = P θ (x|y) parameterized by θ (e.g., a large PLM) to generate high-quality text x ∼ P θ (x|y) (in an auto-regressive manner) satisfying the given label y. We also endow our model with the ability of producing pseudo attribute labels for x ∼ P u (x) through jointly learning a text classifier C = P φ (y|x). We simultaneously model and optimize G and C with a shared Dual VAE in Sec. 3.2.
During the training of DuNST (Sec. 3.3), the pseudo labels predicted by C help cover more unseen samples and hence extend the learned distribution boundary (tackling Challenge 1), while the noisy pseudo text generated by G helps disturb the previously learned space, further improving generalization (addressing Challenge 2). Though we emphasize generation in this work, both G and C would be promoted and thus keep refining the augmentation data during ST, which acts as a joint exploration and exploitation process (Sec.3.4).

Dual VAE
We jointly learn the generator and classifier by VAE.
To optimize text generation, we minimize the following two objectives: where L g is generation loss, KL is the Kullback-Leibler divergence between the posterior distribution Q(z|x, y) of the latent variable z and the prior one P (z|y). Optimizing the two terms is equivalent to maximizing a lower bound of P (x|y). The posterior Q(z|x, y) is typically assumed as a multivariate Gaussian N(µ post , σ post ) given input text x and label y, and approximated by where h y is the label embedding, Encoder is a Transformer (Vaswani et al., 2017) encoder, and MLP is a multilayer perceptron. Similarly, we could build the prior P (z|y) by simply implementing [µ prior , log σ prior ] = MLP(h y ).
Symmetrically, we minimize the following two classification losses: where the posterior distribution Q(z|x, y) is shared by classification and generation to enhance the connection of text and corresponding labels. The text is generated by an autoregressive Transformer decoder x = Decoder(z) and the label is predicted by y = MLP(z). The latent vector z is drawn from the posterior distribution in training and from the prior ones in testing. G and C share most parameters (e.g., encoder and decoder) to better utilize the knowledge learned via generation and classification. To avoid KL-vanishing, we also incorporate BOW (bag-of-words) loss L bow (Wang et al., 2017) and Cyclical-annealing (Fu et al., 2019).
The final loss is computed as follows: where λ c , λ g , and λ bow are hyper-parameters. λ kl−c and λ kl−g are set as in (Fu et al., 2019).

Dual Noisy Self-training
Following the practice of self-training in NLU (Vu et al., 2021), we start ST from a strong base model tuned on D L and use the full unlabeled D U to produce pseudo-labels, rather than select part of the data with certain criteria as in (Mukherjee and Hassan Awadallah, 2020;Jiao et al., 2021). The complete DuNST process is described in Alg. 1. As discussed in Sec. 1, augmented only by selfgenerated text, the model would increasingly enhance the exploitation of the previously learned Algorithm 1: Training Process of DuNST Input: Labeled set D L , unlabeled set D U , attribute set Y .
1 Jointly train base model G, C on D L by optimizing Eq.(5), store the best G 0 , C 0 .
Build pseudo labeled dataset: Build soft pseudo text dataset: Eq.(5) and Eq. (7), update the parameter to G epoch and C epoch . 15 end space but fail to explore more, resulting in boosted diversity but marginal improvement of control accuracy (Challenge 2, see Table 1). Injecting noise to pseudo text is an effective way to facilitate exploration. However, the typical synthetic noise (He et al., 2020) (e.g., randomly dropped and shuffled tokens in pseudo text) encourages undirected exploration which may diverge from the valid space and get too noisy for controllable NLG.
To address this problem, we propose two novel and effective types of soft noise to enable directional exploration, namely High-temperature Generation and Soft Pseudo Text, in what follows.
High-temperature Generation (HTG) We introduce temperature τ in the softmax layer: where d m is the output token distribution for the m-th token,x j,0:m−1 is the previously generated m − 1 tokens and σ means softmax. Lower τ (e.g., τ <= 1 ) leads to a sharper distribution and thus motivates more certain output (usually used in NMT). Differently, we choose τ > 1 to encourage more diverse but semantically reasonable (high generation probability) tokens which could enhance local smoothness and help explore potential directions. Besides, the degree of noise is easy to control by adjusting τ for a better trade-off.
Soft Pseudo Text (SPT) HTG improves the diversity of pseudo text, but also takes the risk of sampling invalid tokens and propagating errors in an autoregressive generation. Moreover, HTG produces discrete pseudo text (a point in text space) and thus needs to sample numerous points to cover a small neighborhood space (see Fig. 2).
Therefore, we further propose to generate soft pseudo text, where we directly store the output token distribution d and let G epoch−1 directly learn to reproduce d. Then we replace Eq.(1) with: SPT acts as a kind of Knowledge Distilling (Hinton et al., 2015) in an iterative manner. In this way, we avoid losing relevant semantics information in d and reduce needed samples, further extending the generalization boundary (see Table 3).

Theoretical Analysis
To understand why DuNST could work well, we interpret its advantages with the following theorem: where p is the real joint distribution, q θ is and q θ are models estimated at the current and last ST iteration, respectively, and u is the noise distribution.
Proof. See Appendix B.
In Theorem 1, the first KL term corresponds to the optimization of our dual VAE which approximates the real joint distribution. The second term corresponds to the classic Self-training which works as a regularization. As depicted in Fig. 2, such regularization performs some kind of exploitation to refine the learned space. The last term is the noise to enhance exploration. Compared to the isotropic synthetic noise (too noisy) and the hard pseudo text (inflexible), DuNST with soft pseudo text could explore potential directions, cover larger space and thus further push the boundary.

Tasks
We conduct exhaustive experiments on three controllable generation tasks, described below: Sentiment control with prompt: We evaluate the controllability of sentiment on the IMDb movie review dataset (Maas et al., 2011). We finetuned a RoBERTa-large (Liu et al., 2019) classifier to tell the sentiment of generated text. Following (Dathathri et al., 2020), we use their 15 prompts and another 85 prompts sampled from IMDB (100 in total) as model input. For each prompt and each sentiment, 10 generations are generated.
Topic control w/o prompt: We use the AG-News dataset (Zhang et al., 2015) to evaluate topic controllability (measured by a similarly tuned RoBERTa). We assess our model's ability to generate from scratch on this dataset and sample 300 generations for each topic.
Text detoxification: We use the Jigsaw Toxic Classification Dataset for training and Perspective API 1 for toxic evaluation. Following (Qian et al., 2022), we use the 203 "challenging" prompts (toxicity < 0.5) from RealToxicityPrompts (Gehman et al., 2020), and generate 10 non-toxic sentences for each of these prompts.
We sample 5% of IMDB training samples as labeled data and directly take their provided unlabeled set. Since there is no separate unlabeled text in AGNews, we sample 3% of training samples as labeled data and use the others as unlabeled ones. For a fair comparison, we keep the ratio 1 https://www.perspectiveapi.com/ of labeled/pseudo/unlabeled data to 1:1:30. More details of the dataset are provided in Appendix A.2.

Experimental Settings
We use UniLM-base-cased (Dong et al., 2019) as the shared encoder and decoder of DuNST. The dimension of latent z is 128 for sentiment control (2-class) and 256 for the topic control (4-class). We use AdamW (Loshchilov and Hutter, 2019) with learning rate = 5e-5, warm-up steps = one epoch, and batch size = 8 for optimization across all tasks. The top-p (p = 0.9) sampling method is used for decoding. More implementation details are provided in Appendix A.1.

Evaluation Metrics
We mainly focus on the controllable generation side in this work and consider the following four kinds of metrics. The evaluation of classification is provided in Appendix C.1.
Fluency: We evaluate generation fluency by the perplexity of generated text measured by GPT2-XL (Radford et al., 2019), i.e., Output PPL.
Generalizability: We calculate the perplexity of each model on each test set, i.e., Model PPL, to evaluate the generalizability of the model. Controllability: We evaluate the control accuracy through classification performance (accuracy (Acc) and Macro-F1(F1) ) on the generated text by the two fine-tuned RoBERTa-large classifiers for sentiment and topic, respectively.
Diversity: To evaluate the diversity of generated text, we consider Dist-n (Li et al., 2016) and Self-BLEU (Zhu et al., 2018). More metrics details are described in Appendix A.3.

Baselines
We compare our model with three kinds of (supervised or semi-supervised) strong NLG baselines. Self-training methods: (1) +PT: the naive ST which generates pseudo text at each epoch and updates parameters with both real data and pseudo ones from the last epoch. (2) +PT+Noise: Following (He et al., 2020), we bring synthetic noise (token drop, swap and mask) to the pseudo text for self-training.
(3) +PT(noise)+PL: We apply both +PT+Noise and pseudo labeling. We fine-tune a BERT-base (Devlin et al., 2019) on the few-shot dataset to generate pseudo labels for the real unlabeled text.
(4) +PT(select)+PL: We over-generate noisy pseudo text and select the data by the classification confidence from classifiers above and uncertainty scores (Mukherjee and Hassan Awadallah, 2020). We give more details of the baseline models above in Appendix A.4.

Results
As shown in Table 1, in both with prompt (Sentiment) and without prompt generation (Topic) setting, our DuNST achieves significant improvement on controllability compared to fine-tuned PLMs and lightweight tuning, and is comparable in fluency, generalizability, and diversity. Fine-tuned PLMs obtain limited F1 improvement but severely decreased diversity, indicating they are overfitted to these few labeled data points and fail to cover larger attribute spaces. PF and Ctr-PF only reduce parameters and required data, but lose the capacity of PLMs, performing even worse than tuned PLMs. In contrast, thanks to the duality, DuNST simultaneously improves generation and classification to iteratively refine pseudo labels (Challenge 1) and enhance the quality and diversity of pseudo text, boosting both controllability and diversity.
We also have some interesting findings about existing self-training methods. 1) The classic ST  method +PT even hurts controllability and generalizability in the sense of Challenge 2. As discussed in Sec. 1, merely self-generated text over-stresses exploitation of the learned space and hinders exploration. 2) Traditional synthetic noise motivates isotropic exploration which may diverge from valid attribute distributions (+PT+Noise). 3) Sample selection methods bring a marginal improvement but at the cost of 50% more training time. Thus we did not apply such a selection in DuNST. 4) Additional pseudo-labels significantly improve performance. However, different from our dual method, the fixed pseudo labels by +PT(noise)+PL cannot evolve during ST. By comparison, DuNST utilizes hightemperature sampling, latent space in VAE, and soft text to introduce flexible noise, which encourages directed exploration and benefits controllability and diversity while maintaining good quality. Due to space limitations, we report the results of text detoxification under both automatic and human evaluation in Appendix C.1.

Human Evaluation
To better verify the effectiveness of DuNST, we also conduct a human evaluation. For each model, we generated 100 samples on each task. We invite 5 competent annotators to score these samples on three criteria -Fluency, Novelty, and Attribute Relevance. As shown in Table 2, DuNST consistently outperforms all other baselines on all three metrics, which indicates that DuNST not only has better controllability over attributes but also generates fluent and diverse texts. See Appendix A.7 for detailed evaluation protocol.

Ablation Study
We conduct an ablation study on the IMDb dataset. As shown in Table 3, we can find: 1) Pseudo labels lead to a significant performance improvement. 2) Soft pseudo text outperforms the hard one on controllability and diversity but with slight fluency loss. In semi-supervised settings, solely hard pseudo text (a) VAE posterior of generated texts based on different temperature for topic-controlled generation.  in ST limits model coverage. In contrast, the soft pseudo text brings a smoother noise and the ability to further push the learned boundary.

Analysis
Effect of Duality: We compare our model with a variant (−Dual) where we annotate pseudo labels in advance fix them and cut off classification losses through training. All other settings are retained to be the same. As depicted in Fig. 3, since classification and generation share parameters, without optimizing the classifier and pseudo labels, the latent posterior would gradually shift such that the classification performance greatly drops. As a result, generation F1 reaches its maximum soon and stops increasing. On the other hand, thanks to the simultaneously optimized classifier, DuNST keeps improving classification, further distinguishing the posterior and thus enhancing controllability. Effect of Noise (temperature): To illustrate why noise encourages exploration and improves control, we plot the posterior of generations in different temperatures and visualize the simulated de- cision bound based on training data in Fig. 4(a). We find that higher noise pushes the latent space towards the decision bound and hence leads to more challenging pseudo text. Such data enables the model to learn to better distinguish latent representations under different attributes and push the generalization boundary, thus potentially improving generation controllability. Besides, the noisy pseudo data also helps improve exploration and attribute coverage. We discuss it in Appendix C.3. Fig. 4(b) shows the generation performance of DuNST with different temperatures. We achieve the best controllability and diversity when the temperature is 5 in our setting, which indicates a suitable temperature is necessary to balance exploration and exploitation.
Number of labeled data: We show the effectiveness of different models over changing sizes of training data in Fig. 5(a) (unlabeled data size remains the same). We find that all models obtain improved generation performance with more labeled data, but our DuNST performs similarly when using only 5% of labeled data compared to 100%. Such results demonstrate the superiority of DuNST which additionally learns from unlabeled and pseudo data through self-training. Detailed analyses are described in Appendix C.3.

Number of pseudo text:
We also evaluate DuNST on varying numbers of pseudo text, keeping all the other settings unaltered. As shown in Fig. 5(b), DuNST performs best with equal size of pseudo text and labeled data. More pseudo text brings more noise which hurts generation quality as the model learns more noise than semantics. Too less pseudo text makes the model lose exploration ability and thus fail to extend the learned distribution boundary. Therefore, we should find a suitable noise level to balance exploration and fluency.
Case Study: In order to verify the generation quality and attribute relevance, we present some cases sampled from different models in Table 4. We can see that Ctr-PF and GPT2-ST suffer from repeating expressions (e.g., "poor quality" and "got bored"). In contrast, DuNST produces more diverse and fluent expressions, which are also more faithful to the given negative attribute. We provide more generated cases in Appendix E.

Model Generation
Ctr-PF At 20 years old, ichorous and unrepentant, ichanarito is a well-liked celebrity, but a poor quality actor, and a poor quality movie...

GPT2-ST
At 20 years old, I grew up with it and had no expectations. I was a big fan of the series, I thought it was great, and I wanted to see more of it. But soon enough, I got bored and just got bored watching it. DuNST At 20 years old, the poorest hour of my life had me scratching my head. This was an insult to the intelligence of mankind, and a complete disregard for anyone else's intelligence. Do not watch this movie. It is a disgrace to any religion.

Conclusion
We propose a novel DuNST method to apply Self-training to semi-supervised controllable NLG. DuNST (1) jointly optimize generation and classification via a dual VAE to both leverage pseudo text and pseudo labels, and (2) incorporates two kinds of soft noise into ST, better exploring larger potential text space and extending the attribute distribution boundary. Theoretical analysis demonstrates that DuNST acts as a combination of regularizationlike exploitation and attribute boundary exploration, significantly improving control accuracy with satisfactory generation fluency and diversity.
Since the generation of pseudo data is done in an auto-regressive manner, which takes longer training time. In the future, we plan to explore ways to speed up the self-training process. token to obtain the representation in the encoder. Dimension of latent z is set to be 128 for sentiment-controlled generation and detoxification(2-class) and 256 for topic-controlled generation (4-class). To fuze the latent z better with the Transformer decoder, we use a simplified fusion method of DELLA (Hu et al., 2022) where we concatenate z to the attention output of each token in each Transformer layer, and then add a linear layer to transfer the new attention output to the original shape of attention output. We did not use the low-rank tensor to compute layer-wise latent z to save the number of parameters.

References
To avoid KL-vanishing, we utilize cyclical annealing tricks (Fu et al., 2019) to train DuNST and set the cycle length equal to training steps in each epoch. In each cycle, first the KL weight increases from 0 to 1 linearly for the first 80% steps in a cycle, and keeps to be 1 for the remaining 20% steps. KL annealing is activated for 5 epochs for λ kl−c and 7 epochs for λ kl−g . Besides, we use the KL thresholding scheme (Li et al., 2020) to give up driving down KL for dimensions of z that are already beneath the target compression rate KLlambda.
We tuned KL-lambda ∈ {0.01, 0.03, 0.05, 0.1} (following Li et al. (2020)), λ c ∈ {1, 5, 10}, the ratio of Pseudo Texts (Fig. 5(b)), and softmax temperature ( Fig. 4(b)) to obtain the reported results. We set KL-lambda to be 0.05 for sentimentcontrolled generation, 0.03 for detoxification, and 0.01 for topic-controlled generation. λ c is 10 for sentiment-controlled generation and 1 for topiccontrolled generation and detoxification. For other hyper parameters, λ g is set to be 1 and λ bow is set to be 0.2 for all tasks. We use AdamW (Loshchilov and Hutter, 2019) as optimizer. The training batch size is 8 and the learning rate is 5e − 5. We apply linear warmup to the optimizer and the number of warm-up steps is one epoch.
We implement DuNST and all other baselines based on Huggingface Transformers (Wolf et al., 2020) library of v4.21.1 and use NVIDIA A100 to train our model. The total number of training GPU hours is around 8h for IMDb, 10h for Jigsaw, and 9h for AGNews. The number of parameters of our model is 134.56M for sentimentcontrolled generation and text detoxification. For a topic-controlled generation, the number of parameters is 136.19M. In generation phase, we use top-p sampling (p = 0.9) as decoding method. Other configuration of the generator includes a length penalty to be 1.0, a repetition penalty to be 1.0, and a no-repeat-ngram-size to be 4 for all baselines. All experimental results are trained and tested in a single run.

A.2 Dataset Description
For IMDb 2 dataset (Maas et al., 2011), the authors claimed in their paper that In the interest of providing a benchmark for future work in this area, we release this dataset to the public without claiming any further copyright. For AGNews 3 dataset (Zhang et al., 2015), it is claimed in the website that You are encouraged to download this corpus for any non-commercial use. For Jigsaw 4 dataset, the dataset is under CC0, with the underlying comment text being governed by Wikipedia's CC-SA-3.0. All datasets we used are open-sourced and are used for research only, which is consistent with their intended use.
For IMDb dataset and AGNews dataset, we leave 10% of training set as validation data, and others as training data. For the AGNews dataset, we use the description for text generation and wrote a script to resolve HTML tags. For Jigsaw dataset, we apply a binary setting that we keep the "non-toxic" class unchanged and group all other classes into "toxic" class.
The details of datasets are described in Table A1. For the Jigsaw dataset, there are only 414 toxic data (9.6%) in the Jigsaw dataset, which shows that Jigsaw is an extremely imbalanced dataset, bringing difficulty in detoxification.

A.3 Evaluation Metric Details
We set the minimum generation length to 10, and the maximum length to 490 for sentiment, 50 for detoxification, and 40 for topic. We evaluate generation quality with the following metrics. Fluency: we evaluate generation fluency by the perplexity (Output PPL) of generated text, measured with GPT2-XL. For VAE-like models (Li et al., 2020; Hu et al., 2022), we consider k latent variables z_1, z_2, ..., z_k sampled from the posterior distribution q(z_i|x) and estimate PPL from the joint p(x, z_i). Since the average importance weight is an unbiased estimator of p(x) (Burda et al., 2016), Jensen's inequality gives log p(x) ≥ E[log (1/k) Σ_i p(x, z_i)/q(z_i|x)] =: L_k. Thus we use L_k to estimate the Output PPL of VAE-like models. Generalizability: we compute the perplexity of each model on each test set (Model PPL) to evaluate the generalizability of the model. Controllability: we evaluate control accuracy through classification performance (accuracy (Acc) and Macro-F1 (F1)) on the generated text, using two fine-tuned RoBERTa-large classifiers for sentiment and topic. Table A2 presents the performance of these RoBERTa-large evaluators; their satisfactory classification accuracy and F1 on the two tasks indicate that they can act as reliable evaluators of generation quality. For detoxification, we report the percentage of toxic sentences (Toxic %) using the Google Perspective API, a free API for scoring the toxicity of text. Following Qian et al. (2022), we also use the Perspective API for toxicity evaluation.
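The importance-weighted estimate L_k above can be sketched as follows. This is a minimal pure-Python sketch, not the paper's implementation; the log-density inputs (log p(x, z_i) and log q(z_i|x)) are assumed to be computed elsewhere by the model:

```python
import math

def log_sum_exp(vals):
    """Numerically stable log(sum(exp(v) for v in vals))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def iw_log_likelihood(log_joint, log_posterior):
    """Importance-weighted lower bound L_k <= log p(x) (Burda et al., 2016).

    log_joint[i]     = log p(x, z_i)
    log_posterior[i] = log q(z_i | x), with z_i ~ q(z | x)
    """
    k = len(log_joint)
    log_w = [lj - lp for lj, lp in zip(log_joint, log_posterior)]
    # L_k = log( (1/k) * sum_i w_i )
    return log_sum_exp(log_w) - math.log(k)

def output_ppl(log_joint, log_posterior, n_tokens):
    # Perplexity estimated from the per-token bound.
    return math.exp(-iw_log_likelihood(log_joint, log_posterior) / n_tokens)
```

With k = 1 the bound reduces to the standard single-sample ELBO estimate.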
Diversity: To evaluate the diversity of generated text, we consider the following metrics: (1) Dist-n (Li et al., 2016): the percentage of distinct n-grams in the generated samples. We evaluate n = 1, 2, 3, 4 and report their geometric mean as Dist.
(2) Self-BLEU (Zhu et al., 2018): Self-BLEU averages, over all generated sequences, the BLEU score of each sequence computed with the other generated sequences as references. The BLEU score is the geometric mean of BLEU-n (n = 2, 3, 4). This metric measures the diversity of a set of generated sequences: lower Self-BLEU means the generated sequences are more distinct from each other. Among all the above metrics, Accuracy, F1, AUC, Dist-n, and Self-BLEU are reported as 100 times their original values for convenience.
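The two diversity metrics above can be sketched as follows. This is a simplified pure-Python sketch: the Self-BLEU here uses only modified n-gram precision without a brevity penalty or smoothing, so it approximates rather than reproduces the standard BLEU implementation:

```python
import math
from collections import Counter

def dist_n(texts, n):
    """Dist-n: fraction of distinct n-grams over all generated samples."""
    ngrams = []
    for t in texts:
        toks = t.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def dist_score(texts):
    # Geometric mean of Dist-1..Dist-4, as in the Dist metric.
    vals = [dist_n(texts, n) for n in range(1, 5)]
    return math.prod(vals) ** (1 / 4)

def _mod_precision(hyp, refs, n):
    """Clipped (modified) n-gram precision of hyp against refs."""
    h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    if not h:
        return 0.0
    max_ref = Counter()
    for r in refs:
        rc = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        for g, c in rc.items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in h.items())
    return clipped / sum(h.values())

def self_bleu(texts):
    """Simplified Self-BLEU: geometric mean of modified 2/3/4-gram
    precision of each sample against all other samples as references."""
    toks = [t.split() for t in texts]
    scores = []
    for i, hyp in enumerate(toks):
        refs = toks[:i] + toks[i + 1:]
        ps = [_mod_precision(hyp, refs, n) for n in (2, 3, 4)]
        if min(ps) == 0:
            scores.append(0.0)
        else:
            scores.append(math.exp(sum(math.log(p) for p in ps) / 3))
    return sum(scores) / len(scores)
```

A set of identical sequences gets Self-BLEU 1.0 (no diversity), while fully disjoint sequences get 0.0.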

A.4 Baseline Details
For Finetune LM, we prepend a control sentence to the input. For sentiment-controlled generation, we use "This is a [positive/negative] review" as the control sentence. For topic-controlled generation, we use "The following is about [topic]". For detoxification, we use "This is a [toxic/non-toxic] comment". For T5, which operates in a sequence-to-sequence manner, we feed the control sentence to the encoder and the training text to the decoder. We fine-tune all pre-trained LMs with a learning rate of 5e-5 for 10 epochs, with warmup steps equal to one epoch.
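The control-sentence construction above can be sketched as follows. This is a minimal illustrative sketch; the dictionary keys, function name, and the exact punctuation/spacing after each control sentence are assumptions, not the paper's code:

```python
# Control-sentence templates from the baseline setup (punctuation assumed).
CONTROL = {
    "sentiment": "This is a {} review. ",
    "topic": "The following is about {}. ",
    "toxicity": "This is a {} comment. ",
}

def build_input(task, label, text):
    """Prepend the task's control sentence to the training text.

    For decoder-only LMs the result is one concatenated sequence;
    for T5 the control sentence would instead go to the encoder
    and `text` to the decoder.
    """
    return CONTROL[task].format(label) + text
```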
For GPT2+PT+Noise, we use the same implementation of the noise layer as He et al. (2020). We set both the token drop rate and the mask rate to 5%. Since GPT2 does not have a mask token, we substitute a randomly chosen token instead. We set the word-shuffle parameter to 1.1.
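A minimal sketch of the three noise operations (token drop, random substitution in place of masking, and bounded word shuffle) is shown below. Function names and the `rng` parameter are illustrative, and the bounded-shuffle formulation is the common one where each position is perturbed by uniform noise and re-sorted; this is an assumption about the referenced noise layer, not its exact code:

```python
import random

def drop_tokens(tokens, p_drop=0.05, rng=random):
    """Drop each token independently with probability p_drop."""
    kept = [t for t in tokens if rng.random() >= p_drop]
    return kept or tokens[:1]  # never return an empty sequence

def substitute_tokens(tokens, vocab, p_sub=0.05, rng=random):
    """GPT2 has no mask token, so 'masked' positions are replaced
    with a randomly chosen vocabulary token instead."""
    return [rng.choice(vocab) if rng.random() < p_sub else t for t in tokens]

def shuffle_tokens(tokens, k=1.1, rng=random):
    """Bounded shuffle: perturb each position i by uniform noise in
    [0, k) and sort, so tokens move only a short distance."""
    keys = [i + rng.uniform(0, k) for i in range(len(tokens))]
    return [t for _, t in sorted(zip(keys, tokens), key=lambda x: x[0])]
```

With k < 1 the shuffle is the identity (positions cannot cross); k = 1.1 allows occasional adjacent swaps.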
For the pseudo-labeling-based methods, we report the performance of the pseudo labeler, BERT-base-cased, in Table A2.
For GPT2+PT(select)+PL, we over-generate twice the number of pseudo texts and compute an uncertainty score and a classification confidence from the BERT-base-cased classifier. The classification confidence s_conf is the softmax probability of the predicted label. The uncertainty score s_uncertain is the Bayesian Active Learning by Disagreement (BALD) score computed by Monte-Carlo Dropout (Mukherjee and Hassan Awadallah, 2020). A high BALD score means the model is highly confused, so we select the samples with high confidence and low BALD scores.
For prefix-tuning (PF) and contrastive prefix-tuning (Ctr-PF), we follow the implementation details described in Qian et al. (2022).
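The uncertainty-based selection described above can be sketched as follows. This is a pure-Python sketch; `selection_score` uses an illustrative confidence-minus-BALD combination, since the exact scoring formula is not reproduced in this appendix:

```python
import math

def _entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def bald_score(mc_probs):
    """BALD via MC dropout: entropy of the mean prediction minus
    the mean entropy of the T stochastic predictions.

    mc_probs: T lists of class probabilities from T dropout passes.
    High BALD = the passes disagree = the model is confused.
    """
    t = len(mc_probs)
    n_cls = len(mc_probs[0])
    mean = [sum(run[c] for run in mc_probs) / t for c in range(n_cls)]
    return _entropy(mean) - sum(_entropy(run) for run in mc_probs) / t

def selection_score(mc_probs):
    """Illustrative combination: prefer high confidence, low BALD
    (hypothetical form; not necessarily the paper's exact score)."""
    t = len(mc_probs)
    n_cls = len(mc_probs[0])
    mean = [sum(run[c] for run in mc_probs) / t for c in range(n_cls)]
    s_conf = max(mean)  # softmax probability of the predicted label
    return s_conf - bald_score(mc_probs)
```

Samples would then be ranked by `selection_score` and the top fraction kept.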

A.5 Additional Settings for Detoxification Tasks
As mentioned in A.2, the Jigsaw dataset suffers from severely imbalanced labels: toxic data accounts for only 9.6% of the training data. To alleviate this problem, we can tune the ratio of toxic to non-toxic data when generating pseudo texts and thereby balance the whole training set; increasing the ratio of toxic to non-toxic data in PT yields a less imbalanced dataset. We propose DuNST(pos), in which all pseudo texts are generated from the toxic attribute. Similarly, among the baselines for the detoxification task, we additionally test a new variant of the GPT2-based self-training methods: GPT2+PT(select, all toxic)+PL generates all pseudo texts from the toxic attribute, while GPT2+PT(select)+PL generates toxic and non-toxic pseudo texts at a 1:1 ratio.

A.6 Potential Risks and Broader Impact
Due to the unclean corpus in the Jigsaw dataset, a model trained on this corpus may output rude and toxic expressions. Besides, our model could be used to produce toxic and harmful content simply by setting the attribute label to toxic, which risks producing and propagating harmful information. Topic/sentiment-controlled generated text may also contain biased or politically sensitive expressions. Moreover, since our model significantly improves the controllability and diversity of generated text, it can produce more plausible texts such as news articles and fake movie reviews, which could be exploited to create fake news and disinformation. On the other hand, such generated texts can serve as pseudo data for data augmentation in fake news detection, and thus have the potential to improve current fact-checking and fake-news detection models.

A.7 Human Evaluation Details
We invite 5 competent annotators, all graduate students, to score the generated samples. Each annotator is assigned 420 generated texts: 120/180/120 for the sentiment/topic/detoxification tasks. Each sentiment/topic-controlled text is annotated by 2 annotators, and each text in the detoxification task by 3 annotators. Annotators are paid $30 upon completion, and they were made aware of the potential risks and ethical concerns of machine-generated text. Below we attach the instructions for annotators.
Attention: all machine-generated text involved is used for research only. The generation may contain toxic content including discrimination and offensive content.
Metrics: Sentiment/Topic: Fluency: whether the generated text is fluent and grammatically correct. A small number of HTML tags/special tokens (like <br>) is permitted. The score ranges from 1 (least fluent) to 5 (most fluent).
Novelty: the extent of novelty INSIDE the generated text, e.g., distinct words/expressions versus repeated phrases. The score ranges from 1 (least novel, a sentence/word repeated until the end) to 5 (most novel, every sentence describes something new).
Attribute Relevance: whether the generated text is relevant to its label. Some examples may contain both positive and negative expressions. 5 = exactly relevant; 4 = mostly relevant; 3 = neutral (no sentiment, or 50/50); 2 = slightly relevant, but closer to the other class; 1 = totally opposite.
Detoxification: please evaluate only the quality of the generated text AFTER the prompt, since the prompt itself might be toxic. Toxicity: whether the provided text contains any toxic content, including but not limited to offensive or abusive language, swearwords, hate speech, denigrating messages, microaggressions, discrimination, sexual content, rude words, and ad hominem attacks. The score ranges from 1 (most non-toxic) to 5 (most toxic).

B.1 Derivation of Dual VAE ELBO
To optimize the attribute-controllable generation direction, we aim to learn the conditional distribution of text q(x|y), and derive the evidence lower bound (ELBO) as:
log q(x|y) = log ∫ [q(x, z|y) / p(z|x, y)] p(z|x, y) dz ≥ E_{p(z|x,y)}[log q(x|z, y)] − KL[p(z|x, y)||q(z|y)],
where the inequality follows from Jensen's inequality, and we approximate the true prior q(z|y) and posterior p(z|x, y) with a prior network and a posterior network, the latter also called a recognition network. When we input a prompt c, as in our experiments on IMDb, we similarly get log q(x|y, c) ≥ E_{p(z|x,y,c)}[log q(x|z, y, c)] − KL[p(z|x, y, c)||q(z|y, c)].
For attribute label classification, we maximize q(y|x) and symmetrically obtain an ELBO:
log q(y|x) = log ∫ [q(y, z|x) / p(z|x, y)] p(z|x, y) dz ≥ E_{p(z|x,y)}[log q(y|z, x)] − KL[p(z|x, y)||q(z|x)],
where we similarly approximate the true prior q(z|x) with another prior network. Note that the two optimization directions share most parameters and use the same recognition network, but incorporate different prior distributions.

B.2 Proof of Theorem 1
For brevity, we ignore the hyper-parameter λ and the BOW loss term −L_bow, which is only used to mitigate KL vanishing, and take the training objective to be the sum of the two ELBOs in B.1 over the labeled distribution p(x, y). Now consider the self-training scenario. Define the real distribution formed by the dataset as p(x, y, z), the distribution estimated at the last ST iteration as q_{θ'}(x, y, z), which is formed by the generated pseudo labels and pseudo text, and the one at the current ST iteration as q_θ(x, y, z). As discussed in Sec. 3.3, we add noise to the pseudo text to enhance exploration, so the previously learned q_{θ'}(x, y, z) is perturbed and becomes q_{θ'}(x, y, z) + u, where u is the noise distribution. For brevity, we abbreviate these distributions as p, q_{θ'}, q_θ, and u, respectively. In self-training, we are actually fitting q_θ not only to p but also to q_{θ'} and u; that is, we are minimizing an upper bound of KL[p + q_{θ'} + u||q_θ]. Since p, q_{θ'}, and u are all fixed at the current iteration, the term KL[p||p + q_{θ'} + u] is a constant and can be ignored. It follows that minimizing KL[p + q_{θ'} + u||q_θ] is equivalent to minimizing KL[p||q_θ] + KL[q_{θ'}||q_θ] + KL[u||q_θ], concluding the proof.

C.1 Detoxification Results
As shown in Table C1, our DuNST outperforms all the other baselines on controllability: it outputs the least toxic text while keeping relatively high diversity. We find that generating all-toxic pseudo texts performs better for GPT2 than generating 1:1 toxic/non-toxic pseudo texts, which shows that adding pseudo text in self-training can mitigate the dataset imbalance. The Output PPL and Model PPL of DuNST are larger than those of the baselines, for the following reason. Since we choose toxic prompts marked as "challenging", toxic continuations are more likely and thus receive lower PPL, while some non-toxic continuations may receive high PPL from the GPT2-XL model because they are rarer and less natural given a challenging prompt. This does not mean that generation fluency is worse. Human evaluation on the detoxification task (see Table C2) demonstrates that DuNST's generations do not differ significantly from GPT2's in fluency and novelty; on the other hand, their toxicity level is significantly lower than that of the two baselines, which further demonstrates that DuNST improves generation controllability. Table C3 reports the classification performance of our model on the 3 tasks. We find that self-training on pseudo-labeled data can significantly improve classification performance, and thus generation quality. Pseudo texts bring a slight improvement in classification performance on the sentiment-controlled and topic-controlled generation tasks. For the detoxification task, classification performance drops slightly. We attribute this to distribution shift: we assume the attribute distribution of the test set matches that of the training set, but since all pseudo texts are toxic, the attribute distribution of the augmented training data is shifted, pulling the classifier's predictions away from the test distribution.
Table C4 shows the comparison on the topic generation task: without duality, generation performance drops significantly. Fig. C1 depicts the distributions of BERT-large embeddings of GPT2-generated data, DuNST-generated data, and training data under the World topic, where we use the [CLS] embedding of the BERT-large model as the sentence representation. We find that a larger generation temperature leads to more diverse sentence representations, demonstrating that high-temperature generation of pseudo data can improve generation diversity. Tables C5, C6, and C7 report the complete experimental results on IMDb, AGNews, and the ablation study.

D Limitation
Though DuNST works well, it has the following limitations: • Decelerated training. Like all other self-training methods, DuNST needs to reproduce pseudo labels and pseudo text at each ST iteration. Since the pseudo text (both hard and soft) is generated autoregressively, this step is hard to parallelize and leads to longer training time.
• Effort of hyperparameter tuning. Since our dual structure involves multiple optimization objectives, balancing them requires additional hyperparameter tuning.

E Example of Generation
We sample some texts generated by Ctr-PF, GPT2-ST, and DuNST and present them in Table E1, Table E2, and Table E3.
Sentiment: Negative; Prompt: At 20 years old
Ctr-PF
At 20 years old, ichorous and unrepentant, ichanarito is a well-liked celebrity, but a poor quality actor, and a poor quality movie...

GPT2-ST
At 20 years old, I grew up with it and had no expectations. I was a big fan of the series, I thought it was great, and I wanted to see more of it. But soon enough, I got bored and just got bored watching it...
DuNST
At 20 years old, the poorest hour of my life had me scratching my head. This was an insult to the intelligence of mankind, and a complete disregard for anyone else's intelligence. Do not watch this movie. It is a disgrace to any religion...

Sentiment: Positive; Prompt: At 20 years old
Ctr-PF
At 20 years old, ixxoxx was a fan favorite and loved his work and movies like it was his own and his own. A hard worker who enjoyed getting his hands dirty and trying to figure out the best ways to make a buck, ixixx made his name on film and television...

GPT2-ST
At 20 years old, ixTch was my first exposure to the French film industry. I immediately knew the "circles" of French cinema, where each character had a particular expression in their mind, but the film itself was only a very short one. I was not interested in the story or the plot, just in the characters...
DuNST
At 20 years old, I was 13 when this came out. I remember it as being a typical 1950's soap opera. The writing was excellent. It was realistic, the characters were believable. The story was accurate, but the real problems the family had with adult life in particular were out to bother with...

Sentiment: Negative; Prompt: The book
Ctr-PF
The book I loved a lot about The Legend of Zelda: Breath of the Wild. Its protagonist, Zelda, is an ordinary woman who grows up playing on a farm. She gets to choose a young boy in her village who can only dream about Zelda. The village gets more and more out of her. She tries to save the villagers by making them happy by getting them into a happy life. The game takes this into an extreme (it's not like Zelda is a good actress, just a poor one)...

GPT2-ST
The book, "Fatal Attraction" by R.D.Hilbert, was a great book and even if the book had been better, they probably would have given it a 10 out of 10. I was disappointed by this movie. I do not believe that a movie can be this boring, with actors playing their roles and acting like the movie would give it a realistic look...
DuNST
The book was a rather confusing and disappointing adaptation of a classic. It may very well have been a good book if it had been adapted to a movie, but this movie could have been much better. It's impossible to pinpoint exactly what exactly was the point of this film...

Sentiment: Positive; Prompt: The book
Ctr-PF
The book has an overall feel to it and is quite good. There are plenty of laughs and a decent ending to the book. However, there are some other issues with the author, who has some questionable editing that can ruin the book, making the movie and television series too good to be true...

GPT2-ST
The book is a perfect example of why not to make a movie. The dialogue, the actors, the sets, everything. That said, the book is perfect, despite the fact that it has very little plot, and that is good. The acting is good and is also very funny. The story is even more funny than the book...
DuNST
The book ( which I'm in a "Truly Hip" ) is a fantastic comic book. The movie adaptation of the book is a superb one. It's faithful to the spirit of the book and looks to the footage and characters. I highly recommend this film if you like and appreciate the story...
Table E1: Example text for sentiment-controlled generation. Words in blue/red are positive/negative indicators, respectively.

Model Generation
Topic: World
Ctr-PF
(1)On Wednesday, the Iraqi security forces stormed the main Islamic State office, killing nine members, killing 12 people, but did not withdraw from the building and taking over two other buildings.
(2)He said this could not be confirmed, but was quoted by Reuters. Hide Caption : A woman reacts after her car was blown up near an Islamist group rally in Dhaka.

GPT2-ST
(1)LOS ANGELES (Reuters) -A former crematory operator agreed on Wednesday to plead guilty to dumping bodies and ashes at the same crematory site where he was born, officials said...
(2)GAZA (Reuters) -A rocket killed two Israeli soldiers in Gaza on Wednesday, the first time the army in occupied territory in more than a year that Hamas militants have launched a fierce...
DuNST
(1) AP -An Italian aid worker walked free from the southern Philippines on Sunday, a day after he was abducted at gunpoint on the streets of Real Aires.
(2) AFP -The United States and South Korea failed to hammer out a deal over a timetable for the planned reduction of US forces in Iraq, with Seoul asking for more troops to join another group.
Topic: Sport
Ctr-PF
(1)This past weekend, when the Los Angeles Lakers drafted Michael Jordan, he looked like a real contender to play the role of mentor at the small forward spot.
(2)Houston is now playing "The Voice of America" at Madison Square Garden. The Knicks are 0-5 and facing a 10-point Los Angeles Lakers team that, if they win tonight...

GPT2-ST
(1) SEATTLE (Reuters) -Olympic chiefs may have to reconsider their decision to stage a one-day event in Atlanta after protests from marchers in the southern city...
(2)NEW YORK (Reuters) -Tommy Haas looked as though he had the flu, as he sat in his BMW 712 at the World Championship in Akron, Ohio, on Friday.
DuNST
(1) South Carolina assistant Skip Holtz left the game with an injured tailback Ciatrick Fason.
Freshman Adrian Peterson rushed for 140 yards and two touchdowns and Ronnie Brown added 127 yards.
(2) AP -The New York Yankees wasted little time getting down to business, and their starting pitcher, Tony Womack, was allowed to sit out Saturday night after missing two games because of an elbow injury...
Table E2: Example text for topic-controlled generation.

Model Generation
Topic: Business
Ctr-PF
(1)President Barack Obama has said his administration is "very concerned about Iran's nuclear program and concerns about the growing threat from terrorist groups in Iran.
(2) A number of firms have taken steps to make their online business more efficient and more efficient. New York-based Gartner says that new companies such as AT&T, Bell, IBM...
GPT2-ST
(2)SINGAPORE (Reuters) -Asian stock markets opened lower on Thursday, helped by poor weather forecasts and gains by technology firms, but some oil-related stocks remained higher.
DuNST
(1) Tokyo stocks plunged Monday morning as investors took profits from recent gains. The US dollar was up against the Japanese yen. The Nikkei Stock Average of 225 issues was up 36.
(2) NEW YORK, Aug 18 ( Reuters ) -Rupert Murdoch's News Corp. Ltd. has agreed to sell its stake in Sky Latin America to DirecTV Group...
Topic: Sci/Tech
Ctr-PF
(1)$1,000 for 'Millionaire's 'Rape Crisis' Victim Fund By Michael S. Osterholm -6/9/17 07:08:04:
(2) US government has approved a $10 million loan to provide medical equipment to the Palestinian Authority for 'humanitarian and medical equipment' on the West Bank. The agreement provides $3 million for a...

GPT2-ST
(1)NEW YORK (Reuters) -The U.S. Securities and Exchange Commission has voted 5-0 to recommend that Internet advertising services stop soliciting fees from Web sites, according to a...
(2)LOS ANGELES (Reuters) -"Cell " phones offer fast data rates, low prices and no worries about getting fat in the long run, says a survey by analysts at research firm...
DuNST
(1) Toshiba has announced a new transmission system for routers and switches that will improve automatic transmission rates. The 6500 Super_GSM/GPRS system will feature high clock speed...
(2) At a press conference this week, Bill Murray, Microsoft's CEO, expressed doubt that the software giant's strategy for regaining PC identity is considerable, but heard little reason to believe it...
Table E3: Example text for topic-controlled generation (continued).