Generative Text Modeling through Short Run Inference

Latent variable models for text, when trained successfully, accurately model the data distribution and capture global semantic and syntactic features of sentences. The prominent approach to training such models is the variational autoencoder (VAE). VAE training is nevertheless challenging and often results in a trivial local optimum in which the latent variable is ignored and its posterior collapses into the prior, an issue known as posterior collapse. Various techniques have been proposed to mitigate this issue, most of which focus on improving the inference model to yield latent codes of higher quality. The present work proposes a short run dynamics for inference. It is initialized from the prior distribution of the latent variable and then runs a small number (e.g., 20) of Langevin dynamics steps guided by the posterior distribution. The major advantage of our method is that it requires neither a separate inference model nor an assumption of simple geometry for the posterior distribution, rendering an automatic, natural, and flexible inference engine. We show that models trained with short run dynamics model the data more accurately than strong language model and VAE baselines and exhibit no sign of posterior collapse. Analyses of the latent space show that interpolation between latent codes generates coherent sentences with smooth transitions, and that latent features from unsupervised pretraining improve classification over strong baselines. Together, these results expose a well-structured latent space of our generative model.


Introduction
State-of-the-art language models (LM) are often built on recurrent neural networks (RNN) (Mikolov et al., 2010) or attention-based models (Dong et al., 2019; Vaswani et al., 2017). They are optimized by making a series of next-step predictions, encouraging the models to capture local dependency rather than global semantic features or high-level syntactic properties. A seminal work by Bowman et al. (2016) extends the standard LM to incorporate a continuous latent space aimed at explicitly capturing global features. They formulate and train the model as a variational autoencoder (VAE) (Kingma and Welling, 2014). Indeed, the model is able to generate coherent and diverse sentences through continuous sampling and provide smooth interpolation between sentences, uncovering a well-formed latent space.
However, training VAE for text is challenging and often leads to a trivial local optimum, posterior collapse. Specifically, the training objective of VAE can be decomposed into a reconstruction term and a KL term that regularizes the distance between the posterior and the prior of the latent variable. Due to the autoregressive nature of the decoder, it can reconstruct the data well by simply relying on the one-step-ahead ground-truth and the evolving model state while completely ignoring the latent codes. The posterior hence collapses into the prior, carrying no information. This remains an important open question in the field. As pointed out in Fu et al. (2019), two paths work together to generate sentences in VAE. One path (Path A) goes through the latent codes, while the other (Path B) conditions on the previous ground-truth or previously generated tokens. Posterior collapse describes an easy solution: relying on Path B and ignoring Path A. Prior efforts to address this issue are by and large along the two paths. One can control the information available from Path B to force the decoder to employ more information from Path A. Bowman et al. (2016) drop out the input words to the decoder, and Yang et al. (2017) utilize a dilated CNN to control the size of the context from previously generated words. Along Path A, various techniques have been developed to improve the latent code quality. Bowman et al. (2016) anneal the weight of the KL term from a small number to reduce regularization at the beginning of training (Anneal-VAE), while Fu et al. (2019) further propose a cyclical annealing schedule (Cyclical-VAE). He et al. (2019) update the encoder multiple times before each decoder update (Lagging-VAE). Li et al. (2019) initialize the VAE with an autoencoder (AE) and adopt a hinge loss for the KL term such that the KL is not driven below a target rate (FBP-VAE and FB-VAE). These techniques fall under the framework of amortized variational inference.
Despite the fast inference of amortized models, Cremer et al. (2018) observe that the amortization gap, the gap between the log-likelihood and the ELBO, can be large. Kim et al. (2018) thus propose semi-amortized variational autoencoders (SA-VAE), in which initial variational parameters are obtained from an encoder as in VAE, and the ELBO is then optimized with respect to the variational parameters to refine them.
An alternative to variational inference is Markov chain Monte Carlo (MCMC) sampling. MCMC posterior sampling may take the form of Langevin dynamics (Langevin, 1908) or Hamiltonian Monte Carlo (HMC) (Neal, 2011; Chen et al., 2014). Traditional MCMC can be time-consuming, as the Markov chains require a long running time, with each iteration involving a gradient computation through the decoder.
In this article, we propose to apply a short run inference (SRI) dynamics, such as finite step Langevin dynamics, guided by the posterior distribution of the latent variable as an approximate inference engine. For each training example, we initialize such a short run dynamics from the prior distribution such as Gaussian noise distribution, and run a finite number (e.g., 20) of steps of updates. This amounts to a residual network which transforms the initial noise distribution to an approximate posterior distribution.
One major advantage of SRI is that it is natural and automatic. Designing and tuning a separate inference model is not a trivial task. In prior work, the inference model requires careful tuning to avoid posterior collapse in VAEs for text modeling. For instance, the inference model needs to be aggressively trained (He et al., 2019), pre-trained with an autoencoder (Li et al., 2019), or refined with gradient descent guided by the ELBO (Kim et al., 2018). In contrast, the short run dynamics guided by the log-posterior of the latent variable can be obtained automatically on modern deep learning platforms. In addition, our method does not assume a closed-form density for the posterior, such as a Gaussian with diagonal covariance matrix, and hence can yield a better approximate posterior and better latent codes. Lastly, we optimize the hyper-parameter of the short run dynamics by minimizing the KL divergence between the short-run-dynamics-induced posterior and the true posterior, to further improve the approximate posterior.
Empirically, we show that a model trained with SRI, using an LSTM generator, outperforms a standard LSTM language model while exhibiting active utilization of the latent space, improving over models trained with VAE-based approaches. Moreover, we find that the learned latent space is smooth, allowing for coherent interpolation and reconstruction from noisy samples, and captures sufficient global information, enabling improved classification accuracy over state-of-the-art baselines.
In summary, the contributions of our paper are as follows. (1) We propose short run inference dynamics to train generative models for sentences without the need for an auxiliary inference network. (2) We demonstrate that the generative model trained with SRI accurately models the data distribution and makes active use of the latent space, exhibiting no sign of posterior collapse. (3) We show that the learned latent space is smooth and captures rich global representations of sentences.
Model and learning algorithm

Generative model

Let x be the observed example, such as a sentence, and let z be the latent variable. We may consider z as forming an interpretation or explanation of x, such as the global semantics and/or high-level syntactic properties of the sentence. Consider the generative model z ∼ p(z), x | z ∼ p_θ(x|z), where p(z) is the prior and p_θ(x|z) is given by a generator parameterized by θ. The marginal distribution of x is p_θ(x) = ∫ p_θ(x, z) dz. Given x, the inference of z can be based on the posterior distribution p_θ(z|x) = p_θ(x, z)/p_θ(x).

Learning and inference
Let p_data(x) be the data distribution that generates the example x. The learning of the parameters θ of p_θ(x) can be based on minimizing the KL divergence KL(p_data(x) ‖ p_θ(x)). With a training set {x_i, i = 1, ..., n}, this minimization can be approximated by maximizing the log-likelihood L(θ) = (1/n) Σ_{i=1}^n log p_θ(x_i), which leads to the maximum likelihood estimate (MLE).
The gradient of the log-likelihood L(θ) can be computed according to the identity

∂/∂θ log p_θ(x) = E_{p_θ(z|x)}[∂/∂θ log p_θ(x, z)].

While the marginal distribution p_θ(x) = ∫ p_θ(x|z) p(z) dz is intractable due to the latent variable z being integrated out, the above expectation can be approximated by a Monte Carlo average with samples drawn from p_θ(z|x). Such samples can be obtained by MCMC in the form of Langevin dynamics (Langevin, 1908), which iterates

z_{k+1} = z_k + s ∂/∂z log p_θ(z_k | x) + √(2s) ε_k,  ε_k ∼ N(0, I), (4)

where k denotes the time step of the Langevin dynamics and s is the discretization step size. The gradient term is tractable since ∂/∂z log p_θ(z_k | x) = ∂/∂z log p_θ(z_k, x) and thus does not depend on the intractable p_θ(x). The Langevin dynamics (4) involves a gradient term and a diffusion term. The gradient term shifts the distribution of z_k towards basins of high log-posterior; the diffusion term √(2s) ε_k injects white noise, which induces the randomness required for sampling from p_θ(z|x) rather than merely maximizing it.
For small step size s, the marginal distribution of z_k converges to p_θ(z|x) as k → ∞ regardless of the initial distribution of z_0 (Cover and Thomas, 2006). More specifically, let q_k(z) be the marginal distribution of z_k under the Langevin dynamics; then KL(q_k(z) ‖ p_θ(z|x)) decreases monotonically to 0, that is, increasing k reduces KL(q_k(z) ‖ p_θ(z|x)).
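As a concrete illustration, the Langevin posterior sampler can be checked on a toy linear-Gaussian model where the exact posterior is known in closed form. This is a minimal sketch under assumed toy distributions (z ∼ N(0, I), x|z ∼ N(Az, σ²I)), not the text model of this paper; all names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: z ~ N(0, I), x | z ~ N(Az, sigma^2 I).
# With A = I and sigma = 1, the exact posterior is N(x / 2, I / 2).
d = 2
A = np.eye(d)
sigma = 1.0
x = np.array([1.0, -2.0])

def grad_log_joint(z):
    # d/dz log p_theta(z, x) = d/dz [log p(z) + log p_theta(x|z)]
    return -z + (x - z @ A.T) @ A / sigma**2

def langevin_posterior_samples(n_chains=2000, K=200, s=0.05):
    z = rng.standard_normal((n_chains, d))  # arbitrary initialization
    for _ in range(K):
        noise = rng.standard_normal((n_chains, d))
        z = z + s * grad_log_joint(z) + np.sqrt(2.0 * s) * noise
    return z

samples = langevin_posterior_samples()
print(samples.mean(axis=0))  # close to the exact posterior mean x / 2 = [0.5, -1.0]
```

The sample mean across chains approaches the analytic posterior mean; the small residual bias comes from the finite step size s.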
Finally, the MLE learning can be accomplished by gradient ascent. Each learning iteration updates θ by

θ_{t+1} = θ_t + η_t (1/n) Σ_{i=1}^n E_{p_θt(z_i|x_i)}[∂/∂θ log p_θt(x_i, z_i)], (5)

where η_t is the step size or learning rate, and the expectation E_{p_θt(z_i|x_i)} can be approximated by Monte Carlo sampling from p_θt(z_i|x_i).

Learning with short run inference dynamics
It is computationally impractical to run long Markov chains from p_θ(z|x), as the gradient term in (4) requires back-propagation through the model underlying p_θ(x|z). Earlier work (Han et al., 2017) recruits persistent Markov chains (Tieleman, 2008) {(z_i, x_i), i = 1, ..., n}, such that for each observed example x_i a latent code z_i is updated for a few steps in each learning iteration, and the chains are maintained throughout the learning procedure. This method leads to inconsistent sampling procedures between training and evaluating the model, since persistent Markov chains for evaluation data are not available. Moreover, estimation of the log-likelihood has to resort to means such as annealed importance sampling (Neal, 2001).
Instead, we adopt short run MCMC (Nijkamp et al., 2019), in which we approximately sample from the posterior distribution of the latent variable. We thus propose the following short run inference dynamics with a fixed small number of steps K (e.g., K = 20):

z_0 ∼ p(z);  z_k = z_{k−1} + s ∂/∂z log p_θ(z_{k−1} | x) + √(2s) ε_k,  ε_k ∼ N(0, I), (7)

where k = 1, ..., K and p(z) is the prior distribution of z. Initializing z_0 ∼ p(z) = N(0, I), we perform K steps of Langevin dynamics with step size s.
Finally, the learning procedure updates θ by

θ_{t+1} = θ_t + η_t (1/n) Σ_{i=1}^n E_{q_{s,θt}(z_i|x_i)}[∂/∂θ log p_θt(x_i, z_i)], (9)

where η_t is the learning rate, and E_{q_{s,θt}(z_i|x_i)} can be approximated by samples drawn from q_{s,θt}(z_i|x_i) using (7). Compared to the MLE learning algorithm (5), we replace p_θt(z|x) by q_{s,θt}(z|x), the distribution of z_K induced by the short run dynamics. Moreover, we may update the step size s of (7), which we elaborate on in the following.
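A minimal end-to-end sketch of learning with short run inference, on an assumed scalar toy model x = θz + ε rather than the paper's LSTM generator; the inner loop implements the K-step dynamics (7) and the outer loop a Monte Carlo version of the update (9).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (an illustrative assumption, not the paper's text model):
# z ~ N(0, 1), x | z ~ N(theta * z, 1), with true theta = 2.
n = 500
z_true = rng.standard_normal(n)
x = 2.0 * z_true + rng.standard_normal(n)

def short_run_inference(theta, x, K=20, s=0.1):
    z = rng.standard_normal(x.shape)            # z_0 ~ p(z) = N(0, 1)
    for _ in range(K):
        grad = -z + theta * (x - theta * z)     # d/dz log p_theta(z, x)
        z = z + s * grad + np.sqrt(2.0 * s) * rng.standard_normal(x.shape)
    return z

# Outer loop: gradient ascent using the short-run samples.
theta, eta = 1.0, 0.05
for _ in range(500):
    z = short_run_inference(theta, x)
    theta += eta * np.mean(z * (x - theta * z))  # Monte Carlo estimate of (9)

print(round(theta, 2))
```

With K = 20 and s = 0.1, the learned θ lands near the true value 2, with a small bias caused by the finite step size and chain length, mirroring the perturbation-of-MLE view developed in the next subsection.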

Theoretical understanding
Given θ_t, the updating equation (9) is a one-step gradient ascent on

Q_s(θ) = (1/n) Σ_{i=1}^n E_{q_{s,θt}(z_i|x_i)}[log p_θ(x_i, z_i)].

Compared to the log-likelihood in MLE learning, using the identity E_q[log p_θ(x, z)] = log p_θ(x) − KL(q ‖ p_θ(z|x)) − H(q), where H(q) denotes the entropy of q, we have

Q_s(θ) = L(θ) − (1/n) Σ_{i=1}^n KL(q_{s,θt}(z_i|x_i) ‖ p_θ(z_i|x_i)) − (1/n) Σ_{i=1}^n H(q_{s,θt}(z_i|x_i)).

Since the last term has nothing to do with θ, gradient ascent on Q_s(θ) is equivalent to gradient ascent on

Q̃_s(θ) = L(θ) − (1/n) Σ_{i=1}^n KL(q_{s,θt}(z_i|x_i) ‖ p_θ(z_i|x_i)), (12)

which is a perturbation of the log-likelihood L(θ) and, since the KL divergence is non-negative, a variational lower bound of L(θ).
The fixed point of the learning algorithm (9) solves the following estimating equation:

(1/n) Σ_{i=1}^n E_{q_{s,θ}(z_i|x_i)}[∂/∂θ log p_θ(x_i, z_i)] = 0.

If we approximate E_{q_{s,θt}(z_i|x_i)} by Monte Carlo samples from q_{s,θt}(z_i|x_i), then the learning algorithm becomes a Robbins-Monro algorithm for stochastic approximation (Robbins and Monro, 1951), whose convergence to the fixed point follows from the regularity conditions of Robbins-Monro. The above estimating equation is a perturbation of the maximum likelihood estimating equation, in which the true posterior p_θ(z_i|x_i) is replaced by q_{s,θ}(z_i|x_i).

Optimizing step size
We can optimize the step size s by maximizing Q̃_s(θ) defined in equation (12), which is equivalent to minimizing the KL divergence between the short-run-dynamics-induced posterior and the true posterior, since the first term L(θ) does not involve s. Q̃_s(θ) involves the entropy of q_{s,θt}(z_i|x_i); we provide the details of its computation in the supplementary materials. The step size optimization can be done by grid search or stochastic gradient descent. In this work, we optimize the step size s with grid search guided by maximizing Q̃_s(θ).
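For intuition, the grid search can be carried out in closed form on a linear-Gaussian toy model (z ∼ N(0,1), x|z ∼ N(z,1)), where the K-step Langevin marginal q_s(z|x) is Gaussian with a known mean and variance, so KL(q_s ‖ p(z|x)) can be evaluated analytically. This is a hypothetical stand-in for the entropy-based computation described in the supplementary materials.

```python
import numpy as np

# Toy setup: z ~ N(0,1), x | z ~ N(z,1); exact posterior is N(x/2, 1/2).
# Each Langevin step is z <- (1 - 2s) z + s x + sqrt(2s) eps, so the K-step
# marginal is Gaussian with a closed-form mean coefficient and variance.
K = 20
Ex2 = 2.0        # E[x^2] under the model marginal N(0, 2)
post_var = 0.5   # exact posterior variance

def expected_kl(s):
    a = 1.0 - 2.0 * s                        # one-step contraction factor
    mean_coef = (1.0 - a**K) / 2.0           # q_s mean is mean_coef * x
    v = a**(2 * K) + (2.0 * s / (1.0 - a**2)) * (1.0 - a**(2 * K))
    mean_gap2 = (mean_coef - 0.5) ** 2 * Ex2  # squared mean error, averaged over x
    return 0.5 * (np.log(post_var / v) + (v + mean_gap2) / post_var - 1.0)

grid = [0.01, 0.02, 0.05, 0.1, 0.15, 0.2, 0.3]
best_s = min(grid, key=expected_kl)
print(best_s)  # → 0.1
```

The grid search exhibits the expected trade-off: too small an s leaves the chain far from the posterior after K steps, while too large an s inflates the stationary variance of the discretized dynamics.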

Algorithm
The learning procedure is summarized in Algorithm 1. Note that we only optimize s every T_s iterations, so that its computational cost is negligible.
Algorithm 1: Learning with SRI.
input: Learning iterations T, step size update interval T_s, learning rate η, initial weights θ_0, batch size m, number of steps K, initial step size s.
output: Weights θ_{T+1}.
for t = 0 : T do
1. Sample a mini-batch of observed examples {x_i, i = 1, ..., m}.
2. For each x_i, initialize z_0 ∼ p(z).
3. Infer the latent codes z_i by K steps of dynamics (7) with step size s.
4. Update θ according to (9).
5. Every T_s iterations, update s.

Log-likelihood computation
Unlike traditional MCMC, short run inference enables computation of the marginal log-likelihood log p(x). For any z, log p_θ(x) = log p_θ(x|z) + log p(z) − log p_θ(z|x). Replacing the intractable posterior p_θ(z|x) with the short run density q_K(z|x) and evaluating at a sample z_K drawn by (7), we obtain

log p_θ(x) ≈ log p_θ(x|z_K) + log p(z_K) − log q_K(z_K|x). (15)

While most terms in (15) are readily available, log q_K(z_K|x) requires special treatment. We may rewrite the dynamics (7) in the form z_k = R_k(z_0), where R_k is defined by a k-step Langevin dynamics. Let the distribution of z_k be denoted q_k(z). Then, by the change of variable theorem,

q_k(z) = p(R_k^{-1}(z)) |det(dR_k^{-1}(z)/dz)|. (18)

Instead of inverting R_k, we draw z_0 ∼ p(z) and compute the log determinant of the Jacobian dR_k(z_0)/dz_0. See the supplementary materials for more details.

Related Work

Semi-amortized inference. Kim et al. (2018) is most closely related to our work. They propose SA-VAE, where initial variational parameters obtained from the inference model are further refined by running a small number of gradient updates (e.g., 20) guided by the ELBO. In our work, instead of relying on a parametric variational distribution, we run a few gradient updates on the log-posterior of the latent variable, initialized from the prior distribution, to draw samples directly. Thus, there is no need to design and tune an extra inference model, which is highly non-trivial considering that posterior collapse occurs easily in VAE training.
Alternating back-propagation. Han et al. (2017) propose to learn generative models for images by maximum likelihood, where the learning algorithm iterates over two steps: (i) inferring the latent variable by sampling from its posterior distribution with Langevin dynamics; (ii) updating the model parameters based on the inferred latent codes. In the training stage, in step (i), the Langevin dynamics is initialized from the latent codes inferred in the last epoch, which is called persistent chain in the literature (Tieleman, 2008). In contrast, the short run dynamics always initializes the gradient descent updates from the prior noise distribution. Data-independent initialization renders the dynamics in training and testing consistent.
Short run MCMC. Nijkamp et al. (2019) introduce short run MCMC as a learned sampling dynamics guided by an energy-based model. It shares the same theoretical underpinning as early work using stochastic gradient Langevin dynamics to learn mixtures of Gaussians and logistic regression models on large-scale data (Welling and Teh, 2011). Our short run inference method for learning latent variable models for text is inspired by these works.

Experiments
We apply our method to train latent variable models on text datasets. The dimension of the latent variable is 32 in all experiments. The generator is implemented with a one-layer uni-directional LSTM (Hochreiter and Schmidhuber, 1997). The number of hidden units and the word embedding size of the LSTM vary among datasets to closely follow the experimental setup in recent work (Fu et al., 2019; Li et al., 2019). The number of steps of the short run dynamics is 20 for all experiments. The sample from the short run dynamics is used to predict the initial hidden state of the LSTM. It is also concatenated with the word embeddings and then fed to the LSTM as input at each time step.
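The wiring of the latent sample into the decoder can be sketched at the shape level. The weight matrix and the embedding/hidden sizes below are hypothetical stand-ins for trained parameters; only the 32-dimensional latent size comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 32-dim latent (as in our experiments); embedding and
# hidden sizes are illustrative, as are the random weights.
z_dim, emb_dim, hid_dim, seq_len = 32, 64, 128, 10
W_init = 0.01 * rng.standard_normal((z_dim, hid_dim))  # maps z to the initial state

z = rng.standard_normal(z_dim)   # sample produced by the short run dynamics
h0 = np.tanh(z @ W_init)         # predicted initial hidden state of the LSTM

# At every time step, z is concatenated with the word embedding as LSTM input.
embeddings = rng.standard_normal((seq_len, emb_dim))
lstm_inputs = np.hstack([embeddings, np.tile(z, (seq_len, 1))])

print(h0.shape, lstm_inputs.shape)  # → (128,) (10, 96)
```

Feeding z at every step, in addition to conditioning the initial state, gives the decoder persistent access to the global latent information along the whole sequence.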
The short run inference is more computationally costly than the vanilla VAE and has a training cost comparable to some improved versions of VAE. The number of inner steps of SRI (20 steps) is about the same as that of SA-VAE and Lagging-VAE. In training, SRI converges faster than SA-VAE and comparably to Lagging-VAE in our experiments. In inference, our sampling-based approach is slower than amortized inference. Our method trades a feasible computational cost for accurate inference, whose empirical performance is presented in the following experiments.

[Table 2: sentences decoded from linear interpolation in the latent space, for FB-VAE and our model.]

Language Modeling
We evaluate our method on language modeling with the Penn Tree Bank (PTB) (Marcus et al., 1993), the Yahoo corpus, and a downsampled version of the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015), following the preprocessing of prior work. Ideally, a language model with a latent variable would be expected to make use of the latent space and accurately model the data distribution. To measure the utilization of the latent space, three quantitative metrics are often considered in prior work (Bowman et al., 2016; Li et al., 2019; Fu et al., 2019): the reconstruction error (Recon), the number of active units (AU), and the magnitude of the KL term. The reconstruction error is the negative log-likelihood of the observed data evaluated under the posterior, E_{q(z|x)}[− log p_θ(x|z)]. A latent dimension is considered active if its distribution changes depending on the observations. Following Burda et al. (2016), a latent dimension is defined to be active if Cov_x(E_{z∼q(z|x)}[z]) > 10^{-2}. Perplexity (PPL) based on the marginal log-likelihood of x is adopted to measure how accurately the model captures the data. The marginal log-likelihood is estimated with importance sampling, with z samples from the trained short run dynamics as importance samples.

[Table 3: sample sentences generated by the models.]
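The active units (AU) criterion above can be computed directly from per-example posterior means; a minimal sketch with a hypothetical matrix of posterior means:

```python
import numpy as np

def active_units(post_means, threshold=1e-2):
    # post_means: (num_examples, latent_dim) array whose rows are
    # E_{z ~ q(z|x)}[z] for each observation x.
    # A dimension is active if its posterior mean varies across the data,
    # i.e., Cov_x(E_z[z]) exceeds the threshold (Burda et al., 2016).
    variances = post_means.var(axis=0)
    return int((variances > threshold).sum())

# Hypothetical check: dimension 0 varies with the data, dimension 1 collapses.
rng = np.random.default_rng(0)
active = rng.standard_normal((1000, 1))
collapsed = 1e-3 * rng.standard_normal((1000, 1))
post_means = np.hstack([active, collapsed])
print(active_units(post_means))  # → 1
```

A fully collapsed posterior would give AU = 0, since every dimension's mean would ignore the observation.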
Besides the standard LM and the vanilla VAE with KL weight annealing, VAEs with recent state-of-the-art training techniques, Cyclical-VAE (Fu et al., 2019), Lagging-VAE (He et al., 2019), SA-VAE (Kim et al., 2018), FBP-VAE and FB-VAE (Li et al., 2019), are also included for comparison. The results are displayed in Table 1. In terms of PPL, our method outperforms all the baselines on the PTB and Yahoo datasets, while it does slightly worse than Lagging-VAE and better than the other baselines on SNLI. This indicates that the model trained with our method accurately models the data distribution. On the other hand, our method yields the lowest reconstruction error and the highest KL, with all latent dimensions active, on all three datasets, exposing active use of the latent space. Taken together, these results suggest that the model trained with short run dynamics is balanced between modeling the data and utilizing the latent space. Figure 1 displays a t-SNE plot of the SRI-induced aggregate posterior E_{p_data}[q(z|x)] and the marginal density of each dimension. The t-SNE plot demonstrates that the SRI-induced aggregate posterior is multi-modal, and the marginal densities are uni-modal but clearly deviate from the zero-centered standard Gaussian prior. These visualizations demonstrate that the aggregate posterior in our model is clearly different from the isotropic Gaussian prior, and thus our model does not exhibit posterior collapse, consistent with our analysis above.

Latent Space Analysis
The quality of the latent space with SNLI is examined through interpolation, generation, and noisy reconstruction.

Interpolation
Interpolation allows us to appraise the smoothness of the latent space. In particular, two samples z_1 and z_2 are drawn from the prior. We linearly interpolate between them and then decode the interpolated points. FB-VAE (Li et al., 2019) is considered the state-of-the-art text VAE that mitigates posterior collapse. Due to space limits, we only include this method for comparison in the interpolation and generation experiments. Table 2 shows the decoded samples. Although the interpolated sentences from FB-VAE appear smooth, the first two sentences are repetitive. In comparison, the decoded sentences from our model transition more smoothly. While the interpolated sentences from our model are diverse, their syntactic properties and topic information remain consistent in neighborhoods along the path, exposing a smooth latent space.
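The interpolation procedure itself is straightforward; a sketch with the 32-dimensional latent space used in our experiments (the decoder call, which would map each point to a sentence, is omitted):

```python
import numpy as np

def interpolate_latents(z1, z2, num_points=7):
    # Linear interpolation between two latent codes; each interpolated
    # point would then be decoded into a sentence by the generator.
    alphas = np.linspace(0.0, 1.0, num_points)
    return np.stack([(1.0 - a) * z1 + a * z2 for a in alphas])

rng = np.random.default_rng(0)
z1, z2 = rng.standard_normal(32), rng.standard_normal(32)  # two prior samples
path = interpolate_latents(z1, z2)
print(path.shape)  # → (7, 32)
```

Decoding the rows of `path` in order yields the sentence sequence examined in Table 2, with the endpoints decoding z_1 and z_2 themselves.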

Generation
We sample from the prior distribution and decode the sentences in a greedy manner. Table 3 displays the samples from our model and FB-VAE. Samples from both models are grammatically correct and semantically meaningful in general; FB-VAE samples nevertheless show more local grammar errors. More generated samples are given in the supplementary materials.

Noisy Reconstruction

Zhao et al. (2018) reason that a latent variable model's capacity to reconstruct from noisy data reveals the smoothness of the latent space. We impose discrete noise on the data by swapping tokens in a sentence k times, where k = 1, 2, 3, 4 in this experiment. The reconstruction error (negative log-likelihood) under each condition is reported in Table 4. Notice that the AE yields the lowest reconstruction error when the noise level is low (k = 1), but its performance deteriorates quickly as the noise level increases, implying that the latent space of the AE is not smooth. In contrast, the other models, which regularize the latent space, do not exhibit a drastic decline in reconstruction performance with increasing noise level. Furthermore, the model trained with our method demonstrates reconstruction either outperforming the other methods or comparable to the best, revealing that the model trained with SRI has a smooth latent space.
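The token-swapping noise can be sketched as follows; the example sentence and seed are illustrative:

```python
import random

def swap_tokens(sentence, k, seed=0):
    # Discrete noise: swap a randomly chosen pair of tokens, repeated k times.
    tokens = sentence.split()
    rng = random.Random(seed)
    for _ in range(k):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)

clean = "a man is riding a horse through the snow"
for k in (1, 2, 3, 4):
    print(k, swap_tokens(clean, k))
```

Swapping preserves the multiset of tokens, so the noise perturbs word order only; a model with a smooth latent space should still map the corrupted sentence to a latent code near that of the clean sentence.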

Classification
The latent space of a well-learned latent variable model should capture highly informative features, such that data points cluster into meaningful groups in the latent space. We hence further probe the latent space structure by investigating the clustering and classification performance of the SRI-inferred latent codes. Following prior work (Fu et al., 2019; Li et al., 2019), we utilize the Yelp sentiment dataset as preprocessed in Shen et al. (2017). We train a Gaussian mixture model for clustering (zero labels) and an SVM with 100, 1,000, or 10,000 labels. The results are displayed in Table 5. Our method consistently improves over the VAE approaches and the AE. The improvement is especially clear in the zero-shot setting and the small data regime (0 and 100 labels), revealing a well-structured latent space learned by SRI.

Conclusion
This work proposes short run inference dynamics to infer latent variables in text generative models. The SRI dynamics is always initialized from the prior distribution of the latent variable and then performs a small number (e.g., 20) of Langevin dynamics updates guided by the posterior distribution. This simple and automatic inference method induces a good approximate posterior and provides good latent codes.
The model trained with SRI accurately models the text data compared to strong language model and generative model baselines, and shows no sign of posterior collapse, which is non-trivial to avoid and for which several remedies have been proposed in prior work. Moreover, the learned latent space is smooth and captures rich representations of sentences.