Controllable Text Generation via Probability Density Estimation in the Latent Space

Previous work on controllable text generation has explored control from the latent space, such as optimizing a representation with attribute-specific classifiers or sampling one from relevant discrete samples. However, these methods cannot effectively model a complex space with diverse attributes, high dimensionality, and an asymmetric structure, leaving subsequent controls unsatisfactory. In this work, we propose a novel control framework using probability density estimation in the latent space. Our method utilizes an invertible transformation function, the Normalizing Flow, that maps the complex distributions in the latent space to simple Gaussian distributions in the prior space. Thus, we can perform sophisticated and flexible controls in the prior space and feed the control effects back into the latent space owing to the bijection property of invertible transformations. Experiments on single-attribute and multi-attribute control reveal that our method outperforms several strong baselines on attribute relevance and text quality, achieving a new SOTA. Further analysis of control strength adjustment demonstrates the flexibility of our control strategy.


Introduction
Controllable text generation, a fundamental issue in natural language generation, refers to generating fluent and attractive sentences conditioned on target attributes (Zhang et al., 2022a). With the development of pre-trained language models, early work explores converting generative language models into conditional models by altering their parameters using fine-tuning (Keskar et al., 2019) or reinforcement learning (Khalifa et al., 2020). Due to the high cost of modifying parameters (Brown et al., 2020; Zhang et al., 2022b), control approaches that leave the language model's parameters fixed have emerged. Our code is available at https://github.com/HappyGu0524/MultiControl.

Figure 1: Illustration of methods controlling in the Latent Space. The orange background denotes the latent space. Blue and red represent two attributes. Prefix-Tuning represents attributes as points in the manifold and composes them by interpolation. LatentOps uses classifiers to continuously estimate distributions of attributes and controls by optimizing in the manifold. Discrete maps sentences to discrete samples in the latent space and controls with direct search. Our method deploys probability density estimation by transforming the complex latent space into a well-formed prior space, where common control strategies can be more effective.
Recent studies achieve satisfying control effects by influencing the fixed language model from the latent space (Yu et al., 2021; Qian et al., 2022) with prefix-tuning (Li and Liang, 2021). However, modeling the complex latent space remains a problem that plagues control performance. As in the left part of Figure 1, Gu et al. (2022b) provide a distributional perspective to reveal that distributions of attributes in a high-dimensional latent space are usually asymmetric and even non-convex, making simple control strategies such as interpolation (Qian et al., 2022) and optimization (Liu et al., 2022) inefficient. For example, interpolation may exceed the support set of distributions, making generated sentences unable to acquire desired attributes. Besides, the optimization process can get stuck at saddle points or local optima. Although Gu et al. (2022b) mitigate the problem with discrete modeling and direct search, they leave the control process more complicated: searching for the intersection of attributes suffers from the high dimensionality of the space and from noise in the samples.
In this paper, we are dedicated to better modeling the latent space and its control methods. As in the right part of Figure 1, we propose probability density estimation in the latent space via invertible transformation. Complex distributions of attributes in the latent space are mapped (by a bijection between continuous spaces) to simple ones, such as Gaussian distributions, in the prior space. Thus, traditional control strategies such as interpolation become tractable and explainable in this normalized prior space. We then explore the relationship between the two spaces and prove under what circumstances control in the prior space can be effectively fed back to the latent space. Finally, our controllable text generation process becomes: control attributes in the prior space, activate the language model in the latent space, and take effect on sentences in the language space.
We conduct experiments on single-attribute control and multi-attribute control. The datasets we use are IMDb movie reviews (Maas et al., 2011) for Sentiment, AGNews (Zhang et al., 2015) for Topic, and the Jigsaw Toxic Comment Classification Challenge dataset for Detoxification. We measure the control ability of our method by the relevance of generated sentences to each attribute. For generation quality, we evaluate sentences with perplexity and distinctness, concerning fluency and diversity respectively. Results show that our method significantly outperforms baseline models, and analytical experiments on control strength adjustment reveal its flexibility. The main contributions of our work are summarized as follows:
• We propose a novel framework for controllable text generation that introduces a well-formed prior space for tractable and flexible control via invertible transformation.
• We theoretically explore approaches to exploit invertibility to feed control in the prior space back into the latent space.
• We experimentally demonstrate the effectiveness of our method compared to strong baselines, achieving the SOTA.
Related Work

Normalizing Flow
The Normalizing Flow (Dinh et al., 2014, 2016; Kingma et al., 2016; Papamakarios et al., 2017; Kingma and Dhariwal, 2018), consisting of a sequence of invertible transformations for continuous variables, is a powerful deep generative model (Kingma and Welling, 2013; Goodfellow et al., 2020; Ho et al., 2020) that captures the inner probabilistic distribution of complex and high-dimensional data (Oussidi and Elhassouny, 2018), including images and text. In natural language processing, Normalizing Flows are often used as enhanced prior distributions in VAE structures (Ma et al., 2019; Ding and Gimpel, 2021) or as deep generative language models (Tran et al., 2019; Ziegler and Rush, 2019; Tang et al., 2021). Besides, Wu et al. (2022) use the Normalizing Flow as prefix-tuning for controllable image generation. However, previous work usually treats the Normalizing Flow as an ordinary generative model, easily replaced by stronger models like the denoising diffusion model (Ho et al., 2020), while ignoring its invertible property. In this work, we explore the potential of flexibly applying the Normalizing Flow's invertible feature to controllable text generation.

Figure 2: An overview of our framework. Part 1 represents the process of estimating the latent space, which trains an autoencoder framework by prefix-tuning on the fixed language model. Part 2 denotes the invertible transformation from latent space to prior space at the training stage. Part 3 consists of two steps: we first operate in the prior space for control and then feed the effect back into the latent space to activate the language model.

Methodology
As illustrated in Figure 2, our framework is divided into three parts, where the former two are training phases, and the latter is the generation phase.

Estimating the Latent Space
Given sentence and attribute pairs {(sᵢ, aᵢ)}, we use a learnable encoder to map each sentence to a sample point xᵢ ∈ R^{n×1}, which can activate the fixed language model to reconstruct the same sentence afterward via prefix-tuning. We denote the training loss of this reconstruction target as:

L_rec = −Σᵢ log p_LM(sᵢ | Prefix(xᵢ)),

where we can regard each point xᵢ as being sampled from a continuous Latent Space. It is worth noting that estimating the Latent Space can be a pre-training or pre-processing phase, in which we can directly use the parameters and mapped points provided by Gu et al. (2022b).

Invertible Transformation
The invertible transformation, denoted z = f_K ∘ ··· ∘ f_1(x) = F_θ(x), maps points xᵢ in a complex distribution to points zᵢ ∈ R^{n×1} in a simple distribution, such as a Gaussian distribution, through a series of invertible functions {fᵢ(·)}. The probability density function p(x) can be derived as p(x) = π(z) |det(dF_θ(x)/dx)|, and the corresponding training target is:

L = −Σᵢ [ log π(F_θ(xᵢ)) + log |det(dF_θ(xᵢ)/dxᵢ)| ].

For controllable text generation, we have to model the conditional probability p(x|a). Therefore, we can decompose the probability as:

p(x|a₁, ···, a_d) = π(z|a₁, ···, a_d) |det(dF_θ(x)/dx)|.

This means distributions of different attributes in the Latent Space are mapped to distributions in the Prior Space through the same invertible transformation. When all sentences possess labels for every attribute, which is an ideal supervised situation, we can obtain attribute distributions and their correlations. However, we usually encounter a semi-supervised situation where each sentence has only a single attribute label. As a result, we set a stricter transformation constraint for semi-supervision, requiring p(x|a) = π(z|a) |det(dF_θ(x)/dx)| for each attribute a separately. Our target becomes:

L = −Σ_a Σ_{i: aᵢ=a} [ log π(F_θ(xᵢ)|a) + log |det(dF_θ(xᵢ)/dxᵢ)| ].

In this case, we train each attribute independently under the same spatial mapping, where attribute correlations in the Latent Space can still be revealed by operations in the Prior Space. It is worth noting that the amounts of training data for different attributes should be as consistent as possible to ensure the balance of the transformation. Besides, for the convenience of control, we set the covariance matrices Σ ∈ R^{n×n} of the prior distributions to be diagonal, Σ_a = σ_a σ_aᵀ ⊙ I, where π(z|a) = N(µ_a, σ_a²).
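To make the change-of-variables formula concrete, here is a minimal one-dimensional sketch (not our actual multi-layer flow): a single affine step F(x) = (x − µ)/σ plays the role of F_θ and maps N(µ, σ²) onto the standard Gaussian prior, so the density reconstructed via log π(z) + log |dF/dx| matches the analytic one.

```python
import math

def gauss_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def flow_forward(x, mu, sigma):
    """One affine flow step F(x) = (x - mu) / sigma, with dF/dx = 1/sigma."""
    z = (x - mu) / sigma
    log_det = -math.log(sigma)  # log |dF/dx|
    return z, log_det

def log_px(x, mu, sigma):
    """Change of variables: log p(x) = log pi(z) + log |dF/dx|."""
    z, log_det = flow_forward(x, mu, sigma)
    return math.log(gauss_pdf(z)) + log_det

# The flow maps N(mu, sigma^2) to the standard Gaussian prior, so the
# reconstructed density equals the analytic density of N(mu, sigma^2).
x, mu, sigma = 1.3, 2.0, 0.5
assert abs(math.exp(log_px(x, mu, sigma)) - gauss_pdf(x, mu, sigma)) < 1e-12
```

Stacking several such steps, each contributing its own log-determinant term, gives exactly the negative log-likelihood training target above.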

Control in the Prior Space
In this part, we will first prove three significant properties that bridge the prior and latent spaces, and then we will introduce how to conduct flexible control in the prior space and feed the control effect back into the latent space.

Theoretical Support for Control
Attribute Preservation We say z possesses the attribute a if it lies in the support set of π(z|a), i.e., z ∈ supp(π_a), which is equivalent to π(z|a) > 0. Thus:

π(z|a) > 0 ⟹ p(F_θ⁻¹(z)|a) = π(z|a) |det(dF_θ(x)/dx)| > 0 ⟹ F_θ⁻¹(z) ∈ supp(p_a),

which means that sampling in the Prior Space contributes to sampling in the Latent Space.
In addition, the point where attributes are most tightly combined is considered the center of the intersection: z* = argmax_z min{π(z|a₁), ···, π(z|a_d)}.
Intersection Invertibility Though there does not necessarily exist a mapping from z* to the intersection center in the Latent Space, we can restrict the region of this mapping to an upper bound. Since z* lies in the (n−d+1)-dimensional subspace I = {z | π(z|a₁) = ··· = π(z|a_d)}, named the Intersection Subspace, we have:

F_θ⁻¹(z*) ∈ F_θ⁻¹(I) = {x | p(x|a₁) = ··· = p(x|a_d)},

which means that the Intersection Subspace in the Prior Space corresponds to the one in the Latent Space.

Figure 3: Intersection of two distributions. In the case of isotropy, the intersection of two Gaussian distributions is also a Gaussian distribution, where the interpolation of the two distribution centers can lie at the intersection center.
Inequality Maintenance We define the discrepancy between two attributes concerning control strength as D_{a₁||a₂}(x) = p(x|a₁) − p(x|a₂), measuring the degree of their mutual exclusion. Thus:

D_{a₁||a₂}(x) = (π(z|a₁) − π(z|a₂)) |det(dF_θ(x)/dx)|, so π(z|a₁) > π(z|a₂) ⟺ p(x|a₁) > p(x|a₂),

which means that an inequality between two attributes in the Prior Space also holds in the Latent Space.
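A one-dimensional numerical sketch of this property, using two illustrative attribute priors N(0, 1) and N(1.5, 1): because the two conditionals share one invertible map, the Jacobian factor is the same strictly positive number for both, so it rescales the difference but can never flip its sign.

```python
import math

def gauss_pdf(z, mu, sigma=1.0):
    """Density of N(mu, sigma^2) at z."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def latent_diff(z, jac):
    """D_{a1||a2} mapped to latent space: the shared, strictly positive
    Jacobian factor `jac` rescales the prior-space difference only."""
    prior_diff = gauss_pdf(z, 0.0) - gauss_pdf(z, 1.5)
    return prior_diff * jac

for z in [-1.0, 0.0, 2.0]:
    d_prior = gauss_pdf(z, 0.0) - gauss_pdf(z, 1.5)
    for jac in [0.1, 1.0, 10.0]:   # any |det dF/dx| > 0
        assert (latent_diff(z, jac) > 0) == (d_prior > 0)
```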

Details for Control
Given the Attribute Preservation property above, sampling a point x_a related to attribute a in the Latent Space is equivalent to sampling z_a ∼ N(µ_a, σ_a²) in the Prior Space and then transforming it as x_a = F_θ⁻¹(z_a). For single-attribute control, we convert the sampling strategy to z_a = µ_a + σ_a ⊙ ε, ε ∼ N(0, λ²I), where λ is a hyperparameter (we discuss how λ influences control strength in §5).
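The sampling step can be sketched as follows; the means, scales, and the helper name are illustrative, and in the actual framework the result would be passed through F_θ⁻¹ to activate the language model.

```python
import random

def sample_attribute(mu, sigma, lam=0.7, rng=random):
    """Sample z_a = mu_a + sigma_a * eps with eps ~ N(0, lam^2 I),
    dimension-wise, since the prior covariance is diagonal."""
    return [m + s * rng.gauss(0.0, lam) for m, s in zip(mu, sigma)]

rng = random.Random(0)
mu_a, sigma_a = [1.0, -2.0, 0.5], [0.3, 0.2, 0.4]  # made-up 3-d attribute prior
zs = [sample_attribute(mu_a, sigma_a, lam=0.7, rng=rng) for _ in range(2000)]

# Empirical mean stays close to mu_a; lam < 1 concentrates samples at the mode.
mean = [sum(z[i] for z in zs) / len(zs) for i in range(3)]
assert all(abs(m - t) < 0.05 for m, t in zip(mean, mu_a))
```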
As illustrated in the upper left of Figure 3, interpolation between µ_a and µ_ā corresponds to a line (a 1-dimensional subspace) in the Prior Space that passes through the Intersection Subspace, where the intersection point is ẑ = α*µ_a + ᾱ*µ_ā with ᾱ* = 1 − α*. Therefore, sampling with ẑ as the center has a great opportunity to draw from the Intersection Subspace in the Prior Space, which approximates sampling from the Intersection Subspace in the Latent Space based on Intersection Invertibility. It is worth noting that when the distributions are isotropic, ẑ = z* as in Figure 3, which improves the effect of interpolation. The Inequality Maintenance further ensures that, for ẑ = αµ_a + (1−α)µ_ā, α > α* ⟺ p(F_θ⁻¹(ẑ)|a) > p(F_θ⁻¹(ẑ)|ā), which means that positive sentiment is guaranteed to be more powerful than negative as long as our weight is larger than α*. However, the monotonicity of control strength in α cannot be theoretically guaranteed, contrary to intuition. Fortunately, our experiment in §5.1 demonstrates that the control strength is monotonic at a coarse granularity. When trading off control strength between two polarities, α usually ranges from 0 to 1. We can even make α slightly larger than 1, which extends the control strength, as long as the sampled points are guaranteed to remain within their distribution.
Multi-Attribute Control Due to the spatial symmetry of the Gaussian distribution, when we constrain the covariance matrix to be diagonal, the trained distribution is approximately isotropic (see Appendix B). This means we can simply deploy the interpolation of the attribute centers as:

ẑ = Σᵢ αᵢ µ_{aᵢ}, with Σᵢ αᵢ = 1.
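The multi-attribute center interpolation can be sketched as below; the two-attribute centers and the helper name are hypothetical, chosen only to show the weighted combination.

```python
def interpolate_centers(centers, weights):
    """hat_z = sum_i alpha_i * mu_{a_i}, with weights summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    dim = len(centers[0])
    return [sum(w * c[d] for w, c in zip(weights, centers)) for d in range(dim)]

# Hypothetical 2-d centers for two attributes, e.g. a sentiment and a topic.
mu_pos, mu_world = [1.0, 0.0], [0.0, 2.0]
z_hat = interpolate_centers([mu_pos, mu_world], [0.5, 0.5])
assert z_hat == [0.5, 1.0]
```

Sampling then proceeds around ẑ exactly as in the single-attribute case, and F_θ⁻¹ carries the result back to the latent space.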

Tasks and Baselines
Tasks All our experimental setups, including datasets, evaluation metrics, and generation configurations, follow Gu et al. (2022b) for fair comparisons. The IMDb movie reviews (Maas et al., 2011), the AGNews dataset (Zhang et al., 2015), and the Jigsaw Toxic Comment Classification Challenge dataset are used for 2 sentiments, 4 topics, and 1 detoxification attribute, respectively. It is worth noting that Gu et al. (2022b) randomly sample 10k sentences from each dataset, constituting a minor subset, to balance the data scale for latent space construction, and we directly use this latent space to make a fair comparison. To evaluate attribute relevance, we use the classifiers trained by Gu et al. (2022b) for sentiment and topic, and we utilize the Google Perspective API for detoxification. We also measure text quality with Perplexity and Distinctness (Li et al., 2016). There are 35 prompts used for text generation, as in PPLM (Dathathri et al., 2020). For single-attribute control, each model generates 5 completions for each attribute and each prompt, i.e., 35 × (2 + 4 + 1) × 5 = 1225 sentences. For multi-attribute control, each model generates 35 × (2 × 4 × 1) × 5 = 1400 sentences.

Single-Attribute Control
We demonstrate the automatic evaluation results on single-attribute control in Table 1. In addition to the relevance of each independent attribute, we compute averages for Sentiment and Topic. Models are grouped by the type of approach. We mainly compare control methods in the latent space, with the other two technical routes serving as supplementary references. Biasing methods can achieve decent control at the cost of some fluency. The diversity of their generated sentences is almost the same as the language model's, owing to their plug-and-play property during decoding. Besides, we report the raw GeDi without retraining, which is trained on a superset of our dataset; the results show that its performance is affected by the amount of training data to some extent. Optimization methods in the language space, elegant in theory, are often troubled by high dimensionality when implemented, making it difficult to obtain favorable control effects and text fluency. Optimization in the latent space is a compromise strategy in which the space dimension is relatively reduced, making the control process more robust. However, these methods also suffer from low diversity and inflexible control.
Our method enhances the existing latent-space optimization methods in control strength, with at least 5.0% and 7.3% significant improvements over baselines on sentiment and topic, respectively. For text quality, our model, which samples points from a Gaussian distribution, also exceeds the original prefix-tuning method by 20.5 in average distinctness. Our method performs only moderately on detoxification because we directly use Discrete's latent space, which is not good at this task. Compared with Discrete, which assigns the same weight to different sample points, our method can be seen as sampling from the area where sample points possess higher weights. This means we can sample points with higher control strength. However, our samples are relatively concentrated at the semantic level compared with totally random sampling, making the diversity slightly inferior to Discrete's.
Fortunately, our performance can be further improved by some sophisticated extended controls. We can achieve opposite control, as in contrastive learning, by using negative weights when interpolating. Figure 4 shows a typical situation where we sample blue points according to their probability density function. One reason for existing methods' suboptimal control effect is that exclusive attributes, denoted by the red distribution, interfere with desired ones, the blue. We can use the probability of blue surpassing red, P(D_{a||ā}(z) > 0) = ∫_{−∞}^{z*} π(z|a)dz, and the expectation of the difference between blue and red, E_{a||ā}(z*) = ∫_{−∞}^{z*} π(z|a)(π(z|a) − π(z|ā))dz, to measure the anti-interference ability of the sampling process. The top of Figure 4 shows sampling according to the original Gaussian distribution N(0, 1), where the surpassing probability is about 0.773 and the expectation of the difference is about 0.161.

Figure 4: The probability density functions of two exclusive attributes, with π(z|a) = N(0, 1) and π(z|ā) = N(1.5, 1).

When our new blue sampling distribution, π̃(z|a) = N(−0.2, 1), is moved slightly away from the red by a distance of 0.2, as in the bottom of Figure 4, the surpassing probability ∫_{−∞}^{z*} π̃(z|a)dz and the expectation of the difference ∫_{−∞}^{z*} π̃(z|a)(π(z|a) − π(z|ā))dz increase to 0.829 and 0.171, respectively. This means that a sampling center farther away from interference sources possesses better confidence. Feeding this extended control back into attribute relevance yields improvements of 2.6, 1.9, and 5.0 on Sentiment, Topic, and Detoxification, respectively. It is worth mentioning that this offset needs to balance staying away from interference against maintaining the original sampling area.
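These figures can be reproduced with a few lines of numerical integration, assuming the one-dimensional setup of Figure 4 (π(z|a) = N(0, 1), π(z|ā) = N(1.5, 1), so the densities cross at z* = 0.75); the helper names are ours.

```python
import math

def phi(z, mu=0.0, sigma=1.0):
    """N(mu, sigma^2) density."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def surpass_prob(mu_sample, z_star, sigma=1.0):
    """P(D_{a||abar}(z) > 0): mass of the sampling density below z*."""
    return 0.5 * (1 + math.erf((z_star - mu_sample) / (sigma * math.sqrt(2))))

def expect_diff(mu_sample, z_star, mu_a=0.0, mu_abar=1.5, lo=-10.0, n=50000):
    """integral_{-inf}^{z*} sampling(z) * (pi(z|a) - pi(z|abar)) dz, trapezoid rule."""
    h = (z_star - lo) / n
    total = 0.0
    for i in range(n + 1):
        z = lo + i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * phi(z, mu_sample) * (phi(z, mu_a) - phi(z, mu_abar))
    return total * h

z_star = 0.75  # where N(0,1) and N(1.5,1) intersect
assert abs(surpass_prob(0.0, z_star) - 0.773) < 0.001   # original sampling center
assert abs(surpass_prob(-0.2, z_star) - 0.829) < 0.001  # center shifted away by 0.2
assert abs(expect_diff(0.0, z_star) - 0.161) < 0.001
assert abs(expect_diff(-0.2, z_star) - 0.171) < 0.001
```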

Multi-Attribute Control
Automatic evaluation results on multi-attribute control are demonstrated in Table 2. We group methods in the same way as single-attribute control, and we add an extra average score for all control combinations. Besides, we demonstrate their standard deviations, which denote the stability of models among different attribute combinations.
Compared to single-attribute control, multi-attribute control is more challenging, as all models suffer a drop in overall performance: at least 6.3% and 5.1% in attribute relevance for Sentiment and Topic. There is little drop in detoxification because this attribute is generally compatible with the others. On one hand, biasing models such as GeDi suffer a drop not only in control strength but also in the fluency of the generated text, as multiple biasing signals may conflict. On the other hand, latent optimizing approaches, including our model, undergo an extra loss in diversity, since we have to shrink the variance of the sampling to limit the decline of the control effect. As observed with Discrete (Gu et al., 2022b), this gap between single-attribute and multi-attribute control is reasonable because different attributes usually combine at the sparse edges of their distributions, which is in line with our intuition. It can also be observed in our mapped prior space that the probability density of the attribute combination region is relatively small.
Compared with Discrete, whose estimated latent space we follow, our model improves sentiment, topic, and detoxification by 1.5%, 3.1%, and 4.0%, respectively. Besides, we achieve better stability, with a lower standard deviation of 8.7 on average. It is worth noting that we outperform Discrete in diversity for multi-attribute control because it can only obtain a small number of points in intersection regions, while we can sample from this continuous region.

Influence of λ
During the sampling stage, with ε ∼ N(0, λ²I), we often anticipate that the obtained points have a higher probability density, which is influenced by λ. As mentioned in Figure 4, exclusive attributes can interfere with the control effect, and decreasing λ is another optional strategy to reduce the interference. We plot the probability density function for λ = 0.8 in Figure 5. The probability of blue surpassing red, ∫_{−∞}^{z*} π̃(z|a)dz, is 0.826, and the expectation of the difference between blue and red, ∫_{−∞}^{z*} π̃(z|a)(π(z|a) − π(z|ā))dz, is 0.181, both larger than the original scores.
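The λ = 0.8 numbers can be checked the same way, assuming the Figure 4 setup with the λ-scaled sampling density π̃(z|a) = N(0, λ²); the helper names are ours.

```python
import math

def phi(z, mu=0.0, sigma=1.0):
    """N(mu, sigma^2) density."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def surpass_prob(lam, z_star=0.75, mu_a=0.0):
    """Mass of the N(mu_a, lam^2) sampling density below z*."""
    return 0.5 * (1 + math.erf((z_star - mu_a) / (lam * math.sqrt(2))))

def expect_diff(lam, z_star=0.75, lo=-10.0, n=50000):
    """integral_{-inf}^{z*} N(z; 0, lam^2) (pi(z|a) - pi(z|abar)) dz
    with pi(z|a) = N(0,1), pi(z|abar) = N(1.5,1), via the trapezoid rule."""
    h = (z_star - lo) / n
    total = 0.0
    for i in range(n + 1):
        z = lo + i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * phi(z, 0.0, lam) * (phi(z, 0.0) - phi(z, 1.5))
    return total * h

assert abs(surpass_prob(1.0) - 0.773) < 0.001
assert abs(surpass_prob(0.8) - 0.826) < 0.001
assert surpass_prob(0.6) > surpass_prob(0.8) > surpass_prob(1.0)  # monotone in lambda
assert abs(expect_diff(0.8) - 0.181) < 0.001
```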
Since z* lies beyond the sampling center µ_a, shrinking λ concentrates the sampling distribution around µ_a and increases ∫_{−∞}^{z*} π̃(z|a)dz, which means the probability of blue surpassing red is monotonic as a function of λ. Besides, the Inequality Maintenance we proved before ensures that this advantage carries over to the Latent Space. As a result, decreasing λ enhances the control strength in the Latent Space: points sampled with a smaller λ are more likely to satisfy p(F_θ⁻¹(z)|a) > p(F_θ⁻¹(z)|ā).

Control Strength Adjustment
Control strength adjustment can be directly achieved through interpolation over distribution centers in the approximately isotropic situation. We show the effect of adjusting α in Table 4. Although the change with α at the coarse-grained level, in the form of a Sigmoid-like curve, is approximately continuous and monotonic, we cannot theoretically prove its monotonicity as we did for λ. This is because the training target of the invertible transformation combines the probability density and the Jacobian determinant, which means a higher probability in the latent space will tend to be mapped to a higher probability in the prior space. However, this tendency is not always guaranteed, as the Jacobian determinant can compensate for some loss in probability to obtain a better form of the mapped distribution.

Conclusion
In this work, we present a novel control framework that introduces a well-formed prior space converted from the latent space via invertible transformation. Experimental results confirm the superiority of our model in control effectiveness, control flexibility, and generation quality. We further provide theoretical support for feeding the control effect from the prior space back to the latent space. In the future, we plan to explore the coupling degree between different attributes for better combination, and to increase the scalability of the prior space for customized control strategies.

A Calculation of ẑ
Interpolation between two distribution centers defines a line (a one-dimensional subspace) on which the restricted probability density functions are still two Gaussian distributions. That is, along this line, π(ẑ|a) = N(ẑ; µ_a, σ_a²) and π(ẑ|ā) = N(ẑ; µ_ā, σ_ā²). Our target reduces to solving the equation π(ẑ|a) = π(ẑ|ā) in the one-dimensional situation.
Setting π(ẑ|a) = π(ẑ|ā) gives N(ẑ; µ_a, σ_a²) = N(ẑ; µ_ā, σ_ā²), and taking logarithms yields the quadratic equation

(ẑ − µ_a)²/(2σ_a²) + log σ_a = (ẑ − µ_ā)²/(2σ_ā²) + log σ_ā.

According to this derivation and Figure 6, when σ_a = σ_ā, ẑ is simply the midpoint of µ_a and µ_ā. When σ_a ≠ σ_ā, there are usually two solutions for ẑ, and the one we expect lies in the interval from min(µ_a, µ_ā) to max(µ_a, µ_ā). It is worth noting that there may be cases where both solutions lie outside this interval, which is caused by the distance between µ_a and µ_ā being too small. In this case, the interval between the two solutions of ẑ becomes the region where the two attributes intersect.
As illustrated in Figure 6, it is complicated to accurately calculate the point where two attributes intersect, even in the one-dimensional case. Fortunately, we can observe that ẑ always lies between µ_a and µ_ā, which means we can find an approximate intersection point by adjusting the interpolation parameter in practical use.
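The one-dimensional computation in this appendix can be sketched as follows (the log-density equality becomes a quadratic in ẑ, linear when the variances match; function and variable names are ours, and the example centers are made up):

```python
import math

def intersection_points(mu_a, mu_b, sig_a, sig_b):
    """Solve N(z; mu_a, sig_a^2) = N(z; mu_b, sig_b^2) in one dimension.
    Taking logs gives A z^2 + B z + C = 0 (linear when sig_a == sig_b)."""
    if math.isclose(sig_a, sig_b):
        return [(mu_a + mu_b) / 2.0]
    # (z-mu_a)^2/(2 sig_a^2) + log sig_a = (z-mu_b)^2/(2 sig_b^2) + log sig_b
    A = 1 / (2 * sig_a**2) - 1 / (2 * sig_b**2)
    B = mu_b / sig_b**2 - mu_a / sig_a**2
    C = mu_a**2 / (2 * sig_a**2) - mu_b**2 / (2 * sig_b**2) + math.log(sig_a / sig_b)
    disc = B * B - 4 * A * C  # two real crossings for unequal variances
    return sorted([(-B - math.sqrt(disc)) / (2 * A), (-B + math.sqrt(disc)) / (2 * A)])

# Equal variances: hat_z is the midpoint of the two centers.
assert intersection_points(0.0, 1.5, 1.0, 1.0) == [0.75]
# Unequal variances: two crossings; the useful one lies between the centers.
roots = intersection_points(0.0, 2.0, 1.0, 0.5)
assert any(0.0 < r < 2.0 for r in roots)
```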

B Approximately Isotropic
We analyze the standard deviation σ of each attribute's Gaussian distribution in Table 5, reporting the maximum, minimum, average, and standard deviation values among all dimensions for each σ. The maximum differences between σs are around 1, and the standard deviations are all less than 0.02, in which case we consider the distributions to be approximately isotropic.