TILGAN: Transformer-based Implicit Latent GAN for Diverse and Coherent Text Generation

Conventional autoregressive models have achieved great success in text generation but suffer from the exposure bias problem: token sequences in the training and generation stages are mismatched. While generative adversarial networks (GANs) can remedy this problem, existing implementations of GANs directly on discrete outputs tend to be unstable and lack diversity. In this work, we propose TILGAN, a Transformer-based Implicit Latent GAN, which combines a Transformer autoencoder and a GAN in the latent space with a novel design and distribution matching based on the Kullback-Leibler (KL) divergence. Specifically, to improve local and global coherence, we explicitly introduce a multi-scale discriminator to capture the semantic information at varying scales among the sequence of hidden representations encoded by the Transformer. Moreover, the decoder is enhanced by an additional KL loss to be consistent with the latent-generator. Experimental results on three benchmark datasets demonstrate the validity and effectiveness of our model, which obtains significant improvements and a better quality-diversity trade-off in automatic and human evaluation for both unconditional and conditional generation tasks. Our code is available at https://github.com/shizhediao/TILGAN.


Introduction
In recent years, Transformer-based autoregressive (AR) models have made a dramatic impact on text generation tasks such as machine translation (Vaswani et al., 2017) and dialogue systems (Ham et al., 2020), especially with the emergence of large pre-trained language models (Radford et al., 2019; Brown et al., 2020; Wu et al., 2020). However, AR models predict the next token conditioned on the ground truth during training and on their own previously generated tokens during inference. This mismatch between the training and generation stages degrades the quality of generated texts and hurts generalization to unseen data (Wiseman and Rush, 2016; Welleck et al., 2020).
Generative adversarial networks (GANs, Goodfellow et al., 2014) provide a promising approach to solving the exposure bias problem (Yu et al., 2017; Kusner and Hernández-Lobato, 2016). This is because GANs aim at matching the distributions of the generated and real data instead of forcing the model output to align with a single correct sequence, and thus provide the potential to bypass the discrepancy issue. However, it is non-trivial to apply GANs to discrete data since gradients cannot be back-propagated through discrete tokens in the usual way. Existing approaches have implemented adversarial training for discrete generation via reinforcement learning (RL) (Yu et al., 2017; Lin et al., 2017; Fedus et al., 2018) and Gumbel-Softmax (Kusner and Hernández-Lobato, 2016). Nevertheless, these approaches suffer from high variance, which causes unstable performance and slow convergence, motivating other methods based on latent feature matching (Zhao et al., 2018; Chen et al., 2018).
In this work, we propose TILGAN, a Transformer-based Implicit Latent GAN, which combines a Transformer autoencoder and a GAN in the latent space with novel designs and a learning formulation based on the Kullback-Leibler (KL) divergence to enhance text generation in both fidelity and diversity. Specifically, inspired by the representation capacity of Transformer AR models, we first incorporate Transformer architectures to improve GANs for text generation. Note that previous latent feature matching methods are mostly RNN-based and assume a single vector in the latent space, so they do not directly handle a sequence of latent representations encoded by a Transformer. Moreover, a single latent vector hinders the incorporation of correlations among different tokens, losing crucial semantic information captured by the Transformer structure. This is especially problematic for local and global coherence (Bińkowski et al., 2020). In this paper, we directly match the distributions of multi-token sequences in the latent space, which is better suited to the Transformer structure. To do so, we have to resolve two challenges. The first is how to perform distribution matching: we introduce a multi-scale discriminator over the Transformer latent space to utilize the semantic information at different scales, where a global discriminator takes the entire sequence of latent representations as input, and a local discriminator takes only a randomly sampled local neighborhood. The second is how to train the decoder reliably: we augment the autoencoder loss with another KL loss optimized by a GAN, forcing the latent representations of the decoding output to be compatible with the generated latent representations from the latent-generator.
We provide a theoretical justification for the proposed formulation by connecting it to the standard goal of generative modeling. Experimental results on three datasets illustrate that TILGAN outperforms all baselines in both unconditional and conditional generation tasks, achieving state-of-the-art performance. Particularly, TILGAN exhibits a better quality-diversity trade-off evaluated by automatic metrics such as SelfBLEU and TestBLEU as well as human evaluation. Further analyses also confirm the effectiveness of each component of our method, where decoder enhancement greatly benefits generation quality, while the multi-scale discriminator and KL objective provide great performance gains in generation diversity, and the implicit prior contributes to both.

Model and Formulation
In this section, we introduce the proposed model and the learning formulation. Let x ∈ X denote a sentence following the real data distribution p_r(x), with X = V^n where V is the vocabulary, m = |V| is the vocabulary size, and n is the sequence length, and let z ∈ Z be the latent variable following a prior distribution p_z(z). We consider a probabilistic model containing an encoder E_φ : X → Z and a decoder G_θ : Z → X. Both are generally stochastic mappings with parameters φ and θ, and induce the encoder conditional distribution q_φ(z|x) and the decoder conditional p_θ(x|z), respectively. Note that previous approaches to text generation use deterministic encoders and decoders (Zhao et al., 2018), which restricts the expressiveness of the modeled distribution family. We first ensure the consistency between E_φ and G_θ by minimizing the negation of the expected reconstruction log-likelihood, which coincides with the reconstruction term in the evidence lower bound (ELBO).
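Concretely, this consistency term (denoted L_c below, matching its later use in the objective) is the standard negative expected reconstruction log-likelihood; the display is our rendering of the term described in the text:

\[
\mathcal{L}_c(\phi,\theta) \;=\; -\,\mathbb{E}_{x \sim p_r(x)}\,\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\ln p_\theta(x \mid z)\big].
\]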
The generated data distribution is given by p_G(x) = E_{z∼p_z(z)}[p_θ(x|z)]. To achieve good generation performance, we design the model so that the distribution family of p_G(x) is large enough to contain the real one p_r(x). As described in Section 2.3, we use Transformers to model E and G, which we assume to have sufficient capacity to reconstruct data well and learn informative latent representations. In this way, p_θ(x|z) is assumed to be expressive enough. To further enhance the capacity of p_G(x), we propose to use an implicit prior p_z, obtained by transforming samples from a simple distribution with a deep neural network.
Consider a random vector ε ∈ E following some simple distribution p_ε, such as a standard Gaussian. We then propose to learn a latent-generator g_β : E → Z with parameter β so that the distribution of g_β(ε) matches that of E_φ(x), by minimizing the KL divergence L_g(φ, β) = D_KL(q_φ(z) ‖ p_β(z)), where p_β(z) denotes the distribution of g_β(ε) and q_φ(z) = E_{x∼p_r}[q_φ(z|x)] is the distribution of E(x), a.k.a. the aggregated posterior. The advantage of the KL divergence is that it imposes a heavy penalty when q_φ(z) > 0 but p_β(z) ≈ 0, which means that it favors a g that covers all the diverse modes of q_φ(z); this is commonly known and was verified empirically in Shen et al. (2020). Hence minimizing the KL encourages better diversity in generation compared with the Jensen-Shannon (JS) divergence or the Wasserstein distance, which are often used in the literature on generative models. Therefore, we formulate the overall objective function to be minimized as

L(φ, θ, β) = L_c(φ, θ) + λ L_g(φ, β),    (2)

where λ > 0 is a coefficient balancing the two terms.

Figure 1: The overall architecture of TILGAN. Blue and orange stand for the global and local discriminators, respectively, and green denotes the route of the enhanced decoder.
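For illustration, a minimal PyTorch sketch of the latent-generator g_β is shown below; the 3-layer MLP follows the implementation details reported later, but the noise dimension and the reshape into a (sequence length × latent dimension) output are our assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class LatentGenerator(nn.Module):
    """Latent-generator g_beta: maps simple Gaussian noise to a sequence of latent vectors."""
    def __init__(self, noise_dim=128, seq_len=15, latent_dim=512, hidden=512):
        super().__init__()
        self.seq_len, self.latent_dim = seq_len, latent_dim
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, seq_len * latent_dim))

    def forward(self, eps):                # eps: (batch, noise_dim), eps ~ N(0, I)
        z = self.net(eps)                  # a sample from the implicit prior p_beta(z)
        return z.view(-1, self.seq_len, self.latent_dim)

# usage: z_fake = LatentGenerator()(torch.randn(8, 128))   # shape (8, 15, 512)
```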
Decoder Enhancement During testing, a new sentence is generated by first sampling ε ∼ p_ε, then computing the latent variable g(ε), and finally generating G(g(ε)). This means the decoder takes the output of the latent-generator g as input, which it has never seen during training. Although the KL term aims at matching the distributions of g(ε) and E(x), it is possible that they do not match perfectly. In such cases, the decoder may generate data with poor fidelity that is far from real.
To resolve this and reliably train the decoder, we propose to enhance the decoder by letting it see the generated latent g(ε) during training. Formally, let p_g be the distribution of E(G(g(ε))). We add to the loss function (2) another KL term, with coefficient λ_1 > 0, that forces p_g to be compatible with the distribution of the generated latents. Since this term is designed to enhance the decoder, we regard the encoder and prior parameters φ and β as fixed constants; in other words, during optimization we do not propagate gradients of this term with respect to φ and β and only update the decoder parameter θ.
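One plausible rendering of this enhancement term, under the assumption that it mirrors the KL formulation used for L_g (both the direction of the KL and the name L_d are our guesses, not fixed by the text), is

\[
\mathcal{L}_d(\theta) \;=\; \lambda_1\, D_{\mathrm{KL}}\!\big(p_\beta(z)\,\big\|\,p_g(z)\big),
\qquad p_g(z) \text{ being the distribution of } E_\phi\big(G_\theta(g_\beta(\varepsilon))\big),
\]

with φ and β treated as constants (stop-gradient) so that only θ is updated.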

Algorithm
In this section, we propose a GAN-based algorithm for the optimization of the above formulation. Since p_β(z) is implicit, the KL term L_g in (2) does not admit a closed form that can be optimized directly. We therefore introduce a discriminator to estimate the gradients, following Shen et al. (2020). In Lemma 1, we present the gradient formulas of L_g.
Lemma 1. Let D(z) = ln(q_φ(z)/p_β(z)). Then the gradients of L_g with respect to φ and β can be written as expectations of terms involving D and the scaling factor s_D(z) = e^{D(z)}, where the expectations are taken over all the randomness.
Since D depends on the unknown densities q_φ and p_β, the gradients in Lemma 1 cannot be directly computed from the data. We therefore estimate them by training a discriminator D_ψ with parameter ψ via empirical logistic regression on finite samples S_e from q_φ(z) and S_g from p_β(z). This leads to a GAN algorithm; the optimization of the enhanced loss (3) is similar. However, GANs are commonly known to suffer from unstable training or vanishing gradients. To stabilize our algorithm, we adopt the scaling clipping technique from Shen et al. (2020) and clip the scaling factor into the range [r_0, 1/r_0], where r_0 = 0.5 turns out to work well in all our experiments; we denote the clipped scaling by s̄_D(z) = max{min{s_D(z), 2}, 0.5}.
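The two practical ingredients of this estimator are sketched below in PyTorch; the helper names and the particular binary cross-entropy form of the logistic-regression loss are illustrative assumptions, while the clipping range follows the r_0 = 0.5 setting stated above.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D_psi, z_real, z_fake):
    """Empirical logistic regression for the density-ratio discriminator.

    z_real: latents encoded from data, finite samples from q_phi(z)   (set S_e)
    z_fake: latents from the latent-generator, samples from p_beta(z) (set S_g)
    At optimum, the logit D_psi(z) approximates ln(q_phi(z) / p_beta(z)).
    """
    logits_real = D_psi(z_real)
    logits_fake = D_psi(z_fake)
    loss = F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real)) \
         + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))
    return loss

def clipped_scaling(d_values, r0=0.5):
    """Scaling factor s_D(z) = exp(D(z)), clipped into [r0, 1/r0] (r0 = 0.5 in the paper)."""
    return torch.clamp(torch.exp(d_values), min=r0, max=1.0 / r0)
```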
For the optimization of the consistency loss L_c, we adopt the reparametrization trick from Kingma and Welling (2014) and estimate it with Monte Carlo samples of the reparametrized latent variables. The whole training procedure is summarized in Algorithm 1, where the colored parts stand for the enhanced decoder (green) and the multi-scale discriminator (blue and orange) introduced later.

Algorithm 1: TILGAN
Input: initial φ, θ, β, ψ, ξ, batch size N, local size M
while not converged do
    // Update discriminators
    Randomly sample local blocks of size M from the encoded and generated latent sequences; update the global and local discriminators by descending their logistic-regression losses
    // Update encoder, decoder and latent-generator
    Obtain the data, noise, encoded, generated and reconstructed latent samples as above; compute the φ-, θ- and β-gradients and update the parameters φ, θ, β using these gradients
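A highly simplified PyTorch sketch of one training step in the spirit of Algorithm 1 is given below. It reuses the discriminator_loss helper sketched above; the module and optimizer arguments are assumptions, the adversarial surrogate for L_g omits the clipped scaling factor from Lemma 1, and the decoder-enhancement update is left out for brevity.

```python
import torch

def tilgan_training_step(batch_tokens, encoder, decoder, latent_generator,
                         disc_global, disc_local, opt_ae, opt_gen, opt_disc,
                         local_size=3, lam=1.0, noise_dim=128):
    """One (simplified) TILGAN update; all modules and losses here are illustrative."""
    B = batch_tokens.size(0)

    # Encode real data and sample generated latents from the implicit prior.
    z_real = encoder(batch_tokens)                          # (B, n, d) latent sequence
    eps = torch.randn(B, noise_dim, device=batch_tokens.device)
    z_fake = latent_generator(eps)                          # (B, n, d) implicit prior sample

    # 1) Update the global and local discriminators.
    i = torch.randint(0, z_real.size(1) - local_size + 1, (1,)).item()
    d_loss = discriminator_loss(disc_global, z_real.detach(), z_fake.detach()) \
           + discriminator_loss(disc_local,
                                z_real[:, i:i + local_size].detach(),
                                z_fake[:, i:i + local_size].detach())
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # 2) Update encoder, decoder and latent-generator.
    recon_logits = decoder(z_real)                          # reconstruct the input tokens
    l_c = torch.nn.functional.cross_entropy(
        recon_logits.reshape(-1, recon_logits.size(-1)), batch_tokens.reshape(-1))
    # Simplified adversarial surrogate for the KL term L_g (no scaling factor).
    l_g = disc_global(z_real).mean() - disc_global(z_fake).mean()
    loss = l_c + lam * l_g
    opt_ae.zero_grad(); opt_gen.zero_grad(); loss.backward()
    opt_ae.step(); opt_gen.step()
    return l_c.item(), l_g.item(), d_loss.item()
```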

Architecture
In this section, we present the Transformer-based architecture incorporated with multi-scale discriminators. We propose a Transformer autoencoder framework in which both the encoder and decoder are stacks of self-attention layers, with three novel ingredients designed to improve generation in both quality and diversity: (i) a latent-generator g that transforms Gaussian noise into an implicit prior distribution, (ii) decoder enhancement, and (iii) multi-scale discriminators. Figure 1 illustrates the entire architecture of TILGAN.
As mentioned in Section 1, we introduce multiple discriminators over the Transformer's latent space to utilize the semantic information at different scales, each operating on a different window of representations. Specifically, given an input sentence x = [x_1, x_2, ..., x_n], where x_i stands for the i-th word, the Transformer encoder maps it to a sequence of latent states z = [z_1, z_2, ..., z_n], where z_i is the vector representation corresponding to x_i. We introduce a global discriminator D_ψ taking the whole sequence of representations z as input, and a local discriminator d_ξ with parameter ξ taking only a local neighborhood of M randomly sampled adjacent representations, e.g., [z_{i−1}, z_i, z_{i+1}] with M = 3, as input. Notably, the local discriminator takes generated pieces of sequences into account, so it provides signals of phrase-level fidelity and local coherence, while the global discriminator assesses the general realism and the degree of natural coherence of the whole sequence.
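A minimal PyTorch sketch of the global and local discriminators is given below; the paper specifies MLP discriminators over the latent sequence, but the exact layer sizes, pooling, and window sampling here are our assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleDiscriminator(nn.Module):
    """Global + local discriminators over a sequence of Transformer latent states."""
    def __init__(self, latent_dim=512, seq_len=15, local_size=3, hidden=512):
        super().__init__()
        # Global discriminator D_psi: sees the whole latent sequence.
        self.global_disc = nn.Sequential(
            nn.Linear(latent_dim * seq_len, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
        # Local discriminator d_xi: sees M adjacent latent vectors.
        self.local_disc = nn.Sequential(
            nn.Linear(latent_dim * local_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
        self.local_size = local_size

    def forward(self, z):                        # z: (batch, seq_len, latent_dim)
        b, n, d = z.shape
        global_logit = self.global_disc(z.reshape(b, n * d))
        # Randomly sample a window of M adjacent representations, e.g. [z_{i-1}, z_i, z_{i+1}].
        i = torch.randint(0, n - self.local_size + 1, (1,)).item()
        local_logit = self.local_disc(z[:, i:i + self.local_size].reshape(b, -1))
        return global_logit, local_logit
```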

Extension to Conditional Generation
Our proposed framework can be readily extended to conditional generation tasks such as story completion. Specifically, the goal is to learn a conditional real data distribution p_r(x|c), where c is the given context following p_r(c) and x is the missing content to complete. We propose to feed c into all three components of our model, i.e., the encoder E, the decoder G, and the latent-generator g, and to condition the corresponding terms of the objective function (2) on c.
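Under the assumption that each term of (2) is simply conditioned on c (our reading of the description above, not a verbatim reproduction of the paper's equation), the conditional objective reads

\[
\mathcal{L}(\phi,\theta,\beta \mid c)
= -\,\mathbb{E}_{(c,x)\sim p_r}\,\mathbb{E}_{z\sim q_\phi(z\mid x,c)}\big[\ln p_\theta(x\mid z,c)\big]
+ \lambda\, D_{\mathrm{KL}}\!\big(q_\phi(z\mid c)\,\big\|\,p_\beta(z\mid c)\big).
\]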

Theoretical Justification
The goal of generative modeling is to learn the generated distribution p G (x) that is close to the real data distribution p r (x). Our proposed formulation in (2), however, does not explicitly optimize a distance measure between p G and p r , so it is unclear whether our method can match the distributions in the data space. In this section, we provide justification for the proposed formulation (2) by connecting it with the above goal, based on the analysis of WAE (Tolstikhin et al., 2018).
Let P_G and P_r be the probability measures induced by p_G(x) and p_r(x), respectively. The Kantorovich formulation of the optimal transport (OT) problem with the L_1 cost is W_1(P_r, P_G) = inf_{Γ ∈ P(x∼P_r, y∼P_G)} E_{(x,y)∼Γ}[c(x, y)], where c(x, y) = ‖x − y‖_1 is the cost function and P(x∼P_r, y∼P_G) is the set of all joint distributions of (x, y) with marginals P_r and P_G, respectively. Note that W_1(P_r, P_G) is also known as the 1-Wasserstein distance between P_r and P_G. The following theorem gives an upper bound of the 1-Wasserstein distance; its proof is given in Appendix B.
Theorem 1. Let p_θ(x|z) be a multivariate multinomial distribution with mean matrix Ḡ(z) ∈ R^{m×n}, a common choice for text modeling, i.e., each one-hot token x_i|z follows a multinomial distribution whose mean is the i-th column of Ḡ(z). Then W_1(P_r, P_G) is upper bounded by the objective value of the constrained reconstruction problem (4), in which q_z(z) = E_{x∼p_r}[q(z|x)] is the aggregated posterior and p_β(z) is the implicit prior.
Hence, by minimizing (4) with respect to θ and β, we learn the composite generator G_θ(g_β(ε)) : E → X that minimizes an upper bound of W_1(P_r, P_G), which is consistent with the standard goal of generative modeling. However, this optimization problem is generally intractable due to the equality constraint and its nonparametric nature. Our formulation (2) can be regarded as an approximate version of it, obtained by parametrizing q(z|x) with the distribution family induced by a stochastic encoder mapping E_φ and by relaxing the hard constraint q_z(z) = p_β(z) into the relative entropy regularization D_KL(q_z(z) ‖ p_β(z)).
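Schematically, the relaxation described above can be summarized as follows (our paraphrase; the exact form of the reconstruction cost inside (4) is abbreviated):

\[
\min_{\theta,\beta}\ \min_{q(z|x)\,:\ q_z(z)=p_\beta(z)}\ \mathbb{E}_{x\sim p_r}\,\mathbb{E}_{z\sim q(z|x)}\big[\text{reconstruction cost of } x \text{ under } p_\theta(\cdot\mid z)\big]
\;\;\longrightarrow\;\;
\min_{\phi,\theta,\beta}\ \mathcal{L}_c(\phi,\theta) + \lambda\, D_{\mathrm{KL}}\!\big(q_\phi(z)\,\big\|\,p_\beta(z)\big).
\]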

Datasets
We conduct our experiments on three benchmark datasets: MSCOCO (Lin et al., 2014), WMTNEWS, and ROCSTORY (Mostafazadeh et al., 2016). All preprocessing steps are the same as in Chen et al. (2018) and Wang and Wan (2019). The statistics of the resulting datasets are reported in Table 1.

Baselines
Unconditional Generation Three simplified variants of TILGAN are implemented for comparison:
• TILGAN_P: a plain baseline using our backbone model, i.e., a Transformer autoencoder and a GAN in the latent space based on the KL divergence.
• TILGAN_E: TILGAN_P equipped with decoder enhancement.
• TILGAN_MD: TILGAN_P with the multi-scale discriminator.
In addition, we adopt existing models as baselines, including a recurrent neural network language model (RNNMLE), SeqGAN (Yu et al., 2017), and the other GAN-based models compared in Table 2 (RankGAN, LeakGAN, GSGAN, FM-GAN, and ARAE), as well as a Transformer language model trained with MLE (TMLE).

Automatic Evaluation Metrics
Unconditional Generation
• TESTBLEU (Yu et al., 2017): a quality metric comparing the n-gram similarity between generated samples and the whole test set.
• SELFBLEU (Zhu et al., 2018): a diversity metric calculating the similarity between one generated sentence and all of the remaining generated sentences. The lower the SelfBLEU score, the higher the diversity of the generation. Specifically, following Chen et al. (2018), we report BLEU-2/3/4/5 for TestBLEU and BLEU-2/3/4 for SelfBLEU.
Conditional Generation
• BLEU (Papineni et al., 2002): the BLEU score is calculated by taking the geometric mean of the n-gram BLEU scores for n from 1 to 4.
• DIVERSITY (Li et al., 2016): the proportion of distinct n-grams in the generated results, which evaluates the degree of diversity. D1 and D2 are reported for unigrams and bigrams, respectively.
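For reference, a small Python sketch of Self-BLEU computed with NLTK is shown below; the smoothing and tokenization choices here are illustrative and may differ from the exact setup of Zhu et al. (2018) used in the paper.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(generations, n=3):
    """Self-BLEU over a list of tokenized generations (lower = more diverse).

    Each sentence is scored against all remaining generations as references.
    """
    weights = tuple(1.0 / n for _ in range(n))
    smooth = SmoothingFunction().method1
    scores = []
    for i, hyp in enumerate(generations):
        refs = generations[:i] + generations[i + 1:]
        scores.append(sentence_bleu(refs, hyp, weights=weights,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)

# usage: self_bleu([["a", "dog", "runs"], ["a", "cat", "sits"], ["a", "dog", "sits"]], n=2)
```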

Implementation
Unconditional Generation We implement a Transformer-based autoencoder with 2 layers, 4 heads, 512 embedding dimensions, and 512 hidden dimensions. The generator and the discriminators are implemented as 3-layer multi-layer perceptrons (MLPs). We set the maximum sequence length to 15 and 32 for MSCOCO and WMTNews, respectively. During training, each sentence is padded to the maximum length when fed into the encoder, and the encoder produces a latent vector for every input token. During testing, the latent-generator generates a sequence of latent vectors with the same maximum length, and then, conditioned on the latent vectors, the decoder generates a sentence, which ends when a special token, <EOS>, is generated. We adopt Adam (Kingma and Ba, 2015) as the optimizer with learning rates of 0.00025 and 0.0001 for the autoencoder and the GAN structure, respectively, and a dropout rate of 0.3.

Conditional Generation We adopt the same Transformer encoder-decoder architecture as the backbone model and setups similar to Wang and Wan (2019). The Transformer has 6 layers, 8 self-attention heads, and 512-dimensional hidden states, and uses shared attention layers for the encoder and decoder, which allows the decoder to attend to the encoder states and the decoder states at the same time to make the completed story more coherent. The generator has 3 layers and the discriminator has 4 layers. We adopt Adam (Kingma and Ba, 2015) as the optimizer with a learning rate of 0.0001 and a dropout rate of 0.15. More details of the experimental setup and hyperparameter settings are given in Appendix A.
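The unconditional backbone can be approximated with standard PyTorch modules as sketched below; the feed-forward dimension, dropout placement, and the omitted embedding/positional-encoding/output layers are assumptions beyond the hyperparameters stated above.

```python
import torch.nn as nn

# Hyperparameters as reported for the unconditional setting (MSCOCO uses max length 15).
D_MODEL, N_HEADS, N_LAYERS, FFN_DIM, DROPOUT = 512, 4, 2, 512, 0.3

# A minimal stand-in for the Transformer autoencoder backbone.
enc_layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEADS,
                                       dim_feedforward=FFN_DIM, dropout=DROPOUT)
encoder = nn.TransformerEncoder(enc_layer, num_layers=N_LAYERS)

dec_layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=N_HEADS,
                                       dim_feedforward=FFN_DIM, dropout=DROPOUT)
decoder = nn.TransformerDecoder(dec_layer, num_layers=N_LAYERS)
```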

Unconditional Generation
• Generation Quality The first experiment compares the quality of sentences generated by different models. In general, as shown in Table 2, TILGAN outperforms all baseline models in TestBLEU on both the MSCOCO and WMTNews datasets, which clearly indicates the advantages of our proposed framework. We make five main observations. First, TMLE is comparable to most GAN baselines, which shows the powerful fitting capacity of the Transformer architecture as well as the inferior performance of previous GAN implementations. Despite this, TILGAN_P outperforms TMLE by a wide margin, demonstrating that our backbone model combining a Transformer autoencoder and a GAN not only takes advantage of the Transformer's capacity, but also benefits from our GAN formulation to further boost performance. Furthermore, compared with TILGAN_P, our full version TILGAN achieves improvements of 13.3% and 13.2% in TestBLEU-5 on MSCOCO and WMTNews, respectively, which confirms the effectiveness of the proposed multi-scale discriminators and decoder enhancement. In detail, when comparing the simplified variants TILGAN_E vs. TILGAN_MD, the improvement of TILGAN_E over the plain baseline TILGAN_P is larger than that of TILGAN_MD, which illustrates that decoder enhancement is more crucial to improving generation quality. Lastly, we compare TILGAN with all previous methods and observe an average improvement of 6.3% in TestBLEU-5 on the two datasets over the previous state-of-the-art FMGAN, which suggests the superiority of our method.
• Generation Diversity Generation diversity is evaluated by the SelfBLEU scores shown in Table 2. First, compared with baselines that have comparable or worse TestBLEU, e.g., FM-GAN and LeakGAN, TILGAN achieves lower SelfBLEU scores, which indicates a better quality-diversity trade-off. We notice that RNNMLE achieves the best SelfBLEU score on WMTNews, but its quality as measured by TestBLEU is quite low and it tends to generate incoherent or meaningless segments, which can be confirmed by the generated samples in Appendix C. In addition, from the results of TILGAN_MD, we find that incorporating the multi-scale discriminator leads to a significant drop in SelfBLEU, suggesting that most of the gains in generation diversity are attributable to our design of multi-scale discriminators rather than to decoder enhancement. Moreover, when comparing performance across the two datasets, we find that the SelfBLEU scores of our models are lower on MSCOCO than on WMTNews, illustrating that it is easier to generate diverse texts on MSCOCO. One possible reason is that the texts in MSCOCO are shorter than those in WMTNews, as shown in Table 1; when generating long sequences, models are prone to producing repeated tokens and phrases. The same phenomenon is also observed for many other baseline models such as ARAE and FM-GAN.

Conditional Generation
In addition to unconditional generation, we test our model on a story completion task to verify its ability in conditional generation. Table 3 reports the results on four automatic metrics, from which we make several observations. (i) Overall, among all models, TILGAN achieves new state-of-the-art results on the ROCStory dataset, showing the superiority of our method. (ii) Our model obtains substantial improvements in the quality metrics of the generated answers, with gains of 0.32% in BLEU and 6.92% in adversarial success. This demonstrates that the generated plots are highly coherent: they not only share a higher proportion of word overlap with the ground-truth answers, but also have a higher success rate in fooling the BERT classifier. (iii) As for diversity, TILGAN improves upon the state-of-the-art methods from 3.63% to 3.88% on D1 and from 23.46% to 25.61% on D2, showing that TILGAN produces stories consisting of more diverse and distinct n-grams.

Human Evaluation
Due to the limitations of automatic evaluation metrics, we invite 5 judges to rate 100 sentences generated by different models on a scale from 1 to 5 for both unconditional and conditional generation tasks. The results for unconditional generation are shown in Table 2. TILGAN shows a superior performance, which confirms that our model is able to generate more realistic samples than the baseline models on two datasets. Among all baseline models, FMGAN has a high quality score but a low diversity score, which indicates that most of its generated samples are repeated sentences that lack diversity. Additional evidence is shown through the case study in Section 6.2.
In addition, the human evaluation results on story completion are shown in Table 3, where we only compare TILGAN with the best baseline, i.e., T-CVAE. Following Wang and Wan (2019), we use the Gram metric to evaluate whether the generated story plot proceeds naturally, and the Logic metric to evaluate whether the plot is reasonable and coherent. Compared with T-CVAE, TILGAN is better in both Gram and Logic, demonstrating that the generated story plots of TILGAN are more natural and coherent.

Ablation Study

• Implicit Prior vs. Implicit Posterior In addition to imposing an implicit prior, one can instead impose an implicit posterior by moving the transformation network of the latent-generator to the encoder and leaving a Gaussian prior. This results in a variant with nearly the same total number of parameters, named IMP_POST. We see from Table 2 that IMP_POST performs worse than TILGAN_P with an implicit prior, suggesting that enlarging the distribution family of the posterior q_φ(z|x) contributes less to the overall generation performance than enlarging that of the prior p_z(z), which is consistent with the analysis in the second paragraph of Section 2.1.

Case Study
To further analyze the quality and diversity of the generated sentences, some of them are examined and presented in Table 4, and more examples are shown in Appendix C. First, the samples generated by TILGAN are more coherent and semantically meaningful; the majority of TILGAN's outputs follow subject-verb-object order, while those of other models often do not. In addition, TILGAN exhibits more diverse sentence structures and word choices than the others. For example, although each sentence generated by FMGAN looks good in quality, there are many repeated sentences or phrases, leading to low diversity. The case study is consistent with the human evaluation results in Section 5.3.

Related Work
Conventional text generation models leverage maximum likelihood estimation (MLE) with teacher forcing and have shown powerful generation capabilities (Mikolov et al., 2010; Cho et al., 2014; Bahdanau et al., 2016; Radford et al., 2019; Brown et al., 2020), but they suffer from the exposure bias problem. To address this, several solutions have been introduced, including scheduled sampling (Bengio et al., 2015), professor forcing (Lamb et al., 2016), and Gibbs sampling (Su et al., 2018). GAN-based text generation methods can be categorized into three classes: reinforcement learning (RL) based methods, Gumbel-Softmax (GS) based methods, and latent feature matching methods. RL-based methods (Yu et al., 2017; Lin et al., 2017; Che et al., 2017; Fedus et al., 2018) design a reward incorporating the discriminators and use policy gradient or actor-critic approaches to update the generator, thereby resolving the issue of propagating gradients through discrete tokens. However, they suffer from high variance and mode collapse caused by the unstable policy gradient training process and the lack of a reliable guiding signal (Chen et al., 2018). GS-based methods (Kusner and Hernández-Lobato, 2016) apply Gumbel-Softmax, a continuous relaxation technique that transforms the output of a generator to be as close to one-hot as possible, so that samples from a discrete distribution such as a multinomial become differentiable with respect to the distribution parameters.
Latent feature matching methods (Zhao et al., 2018) learn a manifold in the latent space instead of the discrete output space.

This kind of method usually incorporates an autoencoder to build the feature space and forces the generator's latent output distribution to approach the latent distribution of real data. Our method also resides in this category. To ease adversarial training, adversarial feature matching introduces a kernelized discrepancy metric to match high-dimensional latent representations of real and synthetic sentences. ARAE (Zhao et al., 2018) extends AAE (Makhzani et al., 2015) to model discrete sequences and learns a parameterized prior with a generative model trained with WGAN. In contrast to our TILGAN, whose Transformer-based encoder and decoder are both stochastic, ARAE uses an RNN-based encoder and decoder that are both deterministic, as required in their theory, which reduces the model's expressiveness and results in much poorer performance than ours, as shown in Table 2. iVAE (Fang et al., 2019) proposes a VAE (Kingma and Welling, 2014) with an implicit posterior, which is inferior to the implicit prior that we adopt according to the ablation study in Section 6.1. WAE-S (Bahuleyan et al., 2019) is a WAE (Tolstikhin et al., 2018) with a stochastic encoder trained using MMD, with the distinct goal of improving reconstruction ability.

Table 4: Sample sentences generated on MSCOCO.
RankGAN:
(1) A blue blue train sits on tracks with his residential asian toys.
(2) A reflection of two birds walking by a sidewalk.
FMGAN:
(1) A man is standing on a table with a dog.
(2) A man is standing on a table with a dog on a field.
(3) A man is standing on a field of a large building.
TILGAN:
(1) A little boy sitting on a bench with a little girl.
(2) A blue and white public transit bus is driving down a city street.
(3) A train is going down the tracks in a forest.

Conclusion
In this paper, we proposed the Transformer-based Implicit Latent GAN (TILGAN) for text generation. It combines a Transformer autoencoder and a GAN by matching the distributions of multi-token sequences in the Transformer's latent space based on the KL divergence. To improve local and global coherence, we introduced a multi-scale discriminator to utilize semantic information at varying scales. To train the decoder reliably, we enhanced the objective function with another KL term, forcing the decoder to be compatible with the latent-generator. We theoretically connected the proposed formulation with the standard goal of generative modeling. Empirically, TILGAN achieved state-of-the-art performance on three widely used datasets for unconditional generation and the story completion task, demonstrating its ability to generate texts of high quality and diversity compared with existing approaches.

B Proof of Theorem 1
Proof of Theorem 1. In this proof, we let x be the real data, y be the generated data, and z be the latent variable. Let p_G(y, z) = p_β(z) p_θ(y|z) be the joint distribution of (y, z), where z is sampled from the prior p_β(z) and then y is sampled from the decoder conditional p_θ(y|z). Further, let P_{x,y,z} denote the set of all joint distributions of (x, y, z) such that x ∼ p_r(x), (y, z) ∼ p_G(y, z), and x ⊥⊥ y | z; let P_{x,z} be the set of marginal distributions of (x, z) induced by P_{x,y,z}, that is, the set of distributions with marginals x ∼ p_r(x) and z ∼ p_β(z).
Recall that n is the sequence length and m is the number of words in the vocabulary. For the i-th word x_i, which is an m-dimensional one-hot vector, the indicator 1(x_i = j) equals 1 if the j-th dimension of x_i is 1 and equals 0 otherwise, for j = 1, ..., m. The claimed bound then follows from a chain of inequalities, in which the first inequality comes from Tolstikhin et al. (2018, eq. 9) and the second is due to the fact that 1 − l < − ln l for all l ∈ (0, 1), leading to the desired result.

C Generated Samples

Story completion examples on ROCStory (the blank marks the missing sentence to be generated):

Story: krista was organizing her office . ____________________ . they were big and heavy . she assembled them carefully . when she put all her books on them , they collapsed !
T-CVAE: she had a bunch of books .
TILGAN: she bought some new books .

Story: the man won a contest . he went to the station to collect . ____________________ . he did n't really like the band . he tried to sell them back to the radio employees .
T-CVAE: he got a ticket for a band .
TILGAN: he saw some band members .

Story: billy is bored . billy sits with his friends thinking of something to do . billy suggest they all head to the lake to go fishing . ____________________ . billy takes his friends to go fishing and has great time .
T-CVAE: billy and his friends go fishing together .
TILGAN: billy and his friends go fishing .

Story: ____________________ . her house was full of dust . she could n't believe how filthy it was . alicia then decided to clean it . when she was done cleaning and it sparkled .
T-CVAE: alicia was in the basement .
TILGAN: alicia was cleaning her house .

Generated samples on MSCOCO:

- a couple of people are walking on a log and trees .
- a street sign on a street .
- the train car traveling mannequin driving down the tracks .
- is in a bathroom with a sink controls .
- people standing next to a large building .
RankGAN:
- a blue blue train sits on tracks with his residential asian toys .
MLE:
- an orange booth contains the in traffic light under a sign
- a reflection of two birds walking by a sidewalk .
- a man with hat on the horse in the street
- a man tourist train in the egret
- a couple walking around city tracks with people
- a white fire hydrant stands next to each other .
- the bird are walking next to a small blue coop
LeakGAN:
- a table topped with pots . .
FM-GAN:
- a man is standing on a skateboard on the beach
- a bathroom with a glass shower , sink , toilet and sink .
- a man on a tennis game with a kite
- a woman wearing a glass is sitting on a cupboard .
- a man on a table with a red and white and a building
- a group of men talking .
- a man is standing on a table with a dog
ARAE:
- a city street sign in the park bench parked in the group group the man
TILGAN:
- a little boy sitting on a bench with a little girl
- two people standing at motorcycles on the bench
- a group of people in the middle of a field
- two people standing at motorcycles on the bench at white kitchen
- a large passenger plane flying through the sky
- there is a city bus on their city street sign parked in blue blue bus
- a woman is sitting in a kitchen next to a restaurant
- a white plane on the air plane parked in snow group their plane parked
- a small bird sitting on a branch of a tree

Generated samples on WMTNews:

SeqGAN:
- it said the yield on our most traveled to china ' s capital for " the annual bank of cost credit against cuba .
- the cars ( , taiwan argues that the cease -fire contained included already reported on the outlook .
- " both the republican leader in the years of leadership she was in rome and signed a close cabinet , " she said .
- russia has said in its twitter documents since january which demanded that lawmakers had clearly been involved about .
RankGAN:
- the lakers left the us to hope of the office and chose the general administration to build a further mission .
- ms . bush had been remembered in a red ring after it taking these sunday week made focusing himself against the number of games .
- " " i ' d thought something i am running everything i saw my own american life usually respect at all , as some equipment they have that .
- and i was hoping that management does still even bring to hillary carson , and listen to a guy playing off back .
MLE:
- what we need do this case if anybody had now touched a in -town community , all your kids in exchange barriers are needed in , no more , all may come out the coming in reflect the options .
- so the scottish government is significant until not extension you own very so hard to change my job for your six points at half , social opinion and take beyond .
- us president -elect donald trump will re consider an effort to set out that it would be to accept from the us -city solution to the world .
- local judges ask for her children and went into a video itself , she said for 2016 ' s early next day , he said .
LeakGAN:
- picture west eight my might confidence , zero confidence my either nazi a a time having accounts , skills a difference x having must difference time having a develop pakistan confidence time time killed wilson partners nazi unfair zero phones develop vital confidence a might showed a having confidence develop a
- pupils evidence accounts having confidence confidence theft abortion time time sized time west coming a unfair time affecting time my theft a a killed killed phones , , time questioned pakistan a partners evidence sized confidence unfair my eight time pakistan zero zero confidence partners either seventh having , killed a
GsGAN:
- i hope that i do something like that it ' s a very important thing , i know what you want to see this i didn ' t know , i think it does not be able to work out this way that ' s not a lot better needs to
- the actor is it , well , which was a good job in a writing -christmas time out of a three -year -old woman who had been charged for the murder , he said that he was investigating the government ' s decision to 19 . 6 million
- to give the first time , it added , he had a few days and now he ' s not just a new administration , he will do the same time before .
- it is that , but the two -year -old woman he didn ' t agree on : they had been a right ago because i wanted to have to do something i was trying to kill them , i am , but i can do it had to stay
FM-GAN:
- The United States , the United States has been a major group of the United States in the United States , the United States in the United States .
- We have to be able to pay the money to be able to pay the money to be able to pay the money to pay for the same time ," he said .
- " It ' s a lot of people who have been a woman , and I have been told the police ," she said .
- The man ' s death was a " bit of the incident , but the police said that the police had been taken to the city , and the police officer was a " very dangerous -driving area .
- We have to be able to do the government to be able to do the government to be able to leave the country ," he said .
ARAE:
- a more . 5 per cent the company said it would not be expected to rise if he hit the 2 percent year , it said , rose 2 , 500 ,when the only reason only be the best way for the best time for the best time for the best time for them and not being able to have done with much
- the fact : the fact only now not being able to have done with a much more time for the age amount time with a much time for the best time for
- the fact the only reason only be the best way .
- the fact : the fact only now being a more person with a person with each person with a much time with the best time for them as much as a person
TILGAN:
- many people who died , although they didn ' t have been on the same day , not just because of those who had been out of them .
- " i had to be able to get a good deal with the right time ," he said in a statement .
- that ' s why , in my life is now that ' s not the same thing , and how much money is .
- we are still working closely with the community who is still in the world , but we can ' t be the best .
- we can ' t get some good players in the league , but not only because we ' ve played well .