Recurrence Boosts Diversity! Revisiting Recurrent Latent Variable in Transformer-Based Variational AutoEncoder for Diverse Text Generation

Variational Auto-Encoder (VAE) has been widely adopted in text generation. Among its many variants, recurrent VAE learns token-wise latent variables, each conditioned on the preceding ones, which captured sequential variability well in the era of RNN. However, it is unclear how to incorporate such recurrent dynamics into the recently dominant Transformer due to its parallelism. In this work, we propose TRACE, a Transformer-based recurrent VAE structure. TRACE imposes recurrence on segment-wise latent variables with arbitrarily separated text segments and constructs the posterior distribution with residual parameterization. Besides, we design an acceleration method that approximates idempotent matrices, which allows parallelism while maintaining the conditional dependence of latent variables. We demonstrate that TRACE can enhance the entanglement between each segment and the preceding latent variables, and we derive a non-zero lower bound of the KL term, providing a theoretical guarantee of generation diversity. Experiments on two unconditional and one conditional generation tasks show that TRACE achieves significantly improved diversity while maintaining satisfactory generation quality.

Among all variants, temporal VAE (Fabius et al., 2015; Chung et al., 2015) was prevalent in the era of RNN; it captures temporal variability by introducing a series of dependent latent variables, each associated with one time step. Such VAE variants have succeeded in various sequence modeling tasks, e.g., dialog generation (Kim et al., 2020), audio generation (Franceschi et al., 2020), and time series prediction (Li et al., 2019).
Temporal VAE can be categorized into three paradigms according to the dependency of the prior distributions at each time step: a) independent normal distributions (abbr. IND) (Li et al., 2020c); b) context-conditioned Gaussian distributions (abbr. CGD) (Du et al., 2018), which are conditioned on the preceding text; and c) recurrent Gaussian distributions (abbr. RGD), i.e., Recurrent VAE (Chung et al., 2015), which are conditioned on both the preceding text and the preceding latent variables. Both IND and CGD ignore the interaction of latent variables, limiting their expressive ability. In comparison, by introducing the dependency among latent variables, RGD can better model sequential variability and thus greatly improve generation diversity while maintaining satisfactory quality. We provide the theoretical proof of this advantage in Sec. 4.3.
These paradigms can be easily implemented with RNN, benefiting from RNN's natural recurrent structure. Stepping into the age of Transformer (Vaswani et al., 2017), it is promising to adapt temporal VAE to this popular architecture. The IND and CGD paradigms are naturally compatible with Transformer because their latent variables at each time step are independent, so they can be simply combined with the parallel computation of Transformer self-attention via causal and non-causal masks (Lin et al., 2020). However, there is no off-the-shelf solution for incorporating RGD into Transformer-based VAEs, since recurrence is a natural obstacle to parallelism (recurrent latent variables need to be sampled sequentially), which limits the capacity of this promising VAE paradigm.
Could we equip Transformer with such recurrent dynamics for better diversity while keeping the training parallelism? To answer this question, we propose TRACE (Transformer Recurrent AutoenCodEr), a novel Transformer-based recurrent VAE structure. TRACE imposes recurrence on segment-wise (instead of token-wise) latent variables with arbitrary segmentation, e.g., sentences or segments of a specified length. Besides, we construct the posterior distribution using residual parameterization and layer normalization, from which we derive a non-zero lower bound of the KL loss to alleviate KL vanishing (Bowman et al., 2016). Moreover, to accelerate training, we design a method to recover the parallelism in Transformer by approximating idempotent parameter matrices for the latent space, leading to improved diversity, satisfactory quality, and faster training.
In summary, our contributions are as follows: We are the first to (i) incorporate recurrent VAE into Transformer, with recurrence on segment-wise latent variables that allows a flexible trade-off between diversity and quality; (ii) propose a method to recover parallelism and accelerate training with comparable performance; (iii) mathematically demonstrate that our model has a derived non-zero lower bound of the KL term to mitigate KL vanishing, together with a theoretical interpretation of the diversity improvement; and (iv) validate the effectiveness of our model on two unconditional and one conditional generation tasks.

Related Work
VAE has shown great effectiveness in a wide range of NLG tasks, such as storytelling (Yu et al., 2020; Fang et al., 2021), dialogue generation (Serban et al., 2017; Bao et al., 2020), and poetry composition (Yi et al., 2021). To further improve the expressive ability of VAE, researchers have proposed various variants, e.g., vMF-VAE (Xu and Durrett, 2018), which replaces the latent distribution with the von Mises-Fisher distribution; ml-VAE (Bouchacourt et al., 2018), which learns multi-level latent variables; and BN-VAE (Zhu et al., 2020), which utilizes batch normalization to obtain a non-zero KL lower bound.
Among all variants, temporal VAE is the most prevalent one in the era of RNN; it introduces latent variables at each timestep and naturally fits the recurrent structure of RNN. Existing temporal VAEs fall into three paradigms according to the parameterization and dependence of the latent variables' prior distributions, namely IND, CGD, and RGD, as mentioned in Sec. 1. For example, TWR-VAE (Li et al., 2020c) utilizes a timestep-wise regularisation through independent latent variables (IND). VAD (Du et al., 2018) incorporates CGD into latent variables and augments the posterior distribution with a backward RNN. Recurrent VAE (Chung et al., 2015) learns token-wise latent variables, each sequentially conditioned on the preceding ones as well as the context (i.e., RGD). By modeling the trajectory of both the observed text sequence and the latent space, recurrent VAE captures sequential variability better (Goyal et al., 2017; Hajiramezanali et al., 2020). Besides, we will show that such recurrent dynamics can theoretically reinforce the dependence on the stochastic and generalized latent space, thus boosting generation diversity by a large margin.
Recently, with the flourishing of the powerful Transformer architecture, researchers have devoted much effort to combining it with VAE for text modeling and generation (Wang and Wan, 2019; Li et al., 2020a; Fang et al., 2021; Hu et al., 2022). VAEs can promote generation diversity with satisfactory quality, benefiting from the intrinsic randomness of the latent space. Therefore, VAE-based Transformers are essential for various tasks demanding creativity, such as advertising text generation (Shao et al., 2019). Two of the temporal VAE paradigms, IND and CGD, can be easily adapted to Transformer. For instance, SVT (Lin et al., 2020) applies CGD-based VAE to dialogue generation. Nonetheless, the integration of recurrent VAE remains an open challenge due to the conflict between the parallelism of Transformer and the recurrent dependence of recurrent VAE. To fully exploit the expressive power of recurrence, we revisit recurrent VAE in Transformer and propose TRACE, which possesses the advantages of both generation diversity and training parallelism.

VAE
As one of the representative generative models, VAE has proven to be an effective paradigm for estimating the data distribution by introducing a latent variable z and modeling the joint distribution: p(x, z) = p(z) p(x|z). (1) The prior distribution p(z) is commonly a standard Gaussian distribution. The conditional distribution p(x|z), generally parameterized by a neural network known as the generative network (decoder), recovers the observed data from the latent variable. Directly estimating p(x|z) involves the intractable posterior distribution p(z|x). Instead, VAE introduces a variational approximation q(z|x) and derives the Evidence Lower BOund (ELBO): log p(x) ≥ E_{q(z|x)}[log p(x|z)] − KL(q(z|x) || p(z)), (2) where KL denotes the Kullback-Leibler divergence.
In practice, the approximate posterior q(z|x) is parameterized as a Gaussian distribution N(µ, diag(σ²)), where µ and σ are estimated by a neural network known as the inference network (encoder). The generative network p(x|z) and the inference network q(z|x) are jointly optimized by maximizing the lower bound in Eq. (2).
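As a concrete illustration, the closed-form KL between two diagonal Gaussians and the reparameterization trick used to sample z can be sketched in a few lines. This is a minimal NumPy sketch; the function names are ours, not from the paper:

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def reparameterize(mu, logvar, rng):
    # z = mu + eps * sigma with eps ~ N(0, I): differentiable sampling
    eps = rng.standard_normal(np.shape(mu))
    return mu + eps * np.exp(0.5 * logvar)
```

With p(z) the standard Gaussian, the KL term of Eq. (2) is `gaussian_kl(mu, logvar, 0, 0)`; identical distributions give a KL of exactly zero, which is the KL-vanishing regime discussed later.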

Temporal VAE
Unlike standard VAE, which involves only one latent variable z, temporal VAE learns one latent variable at each time step. Denote z_t ∈ R^l and x_t ∈ R^h as the latent variable and the observed data at the t-th step, respectively. Next, we present the mathematical details of the three paradigms of temporal VAE, namely IND, CGD, and RGD.

IND:
The prior distribution p(z_t) is the standard Gaussian distribution N(0, I), and the posterior distribution is conditioned on the preceding context as q(z_t|x_{≤t}). The ELBO of IND is then: Σ_{t=1}^{T} E_{q(z_t|x_{≤t})}[log p(x_t|z_t, x_{<t})] − KL(q(z_t|x_{≤t}) || p(z_t)). (3) CGD: CGD constructs the prior distribution conditioned on the observed text, p(z_t|x_{<t}), and the posterior distribution based on the complete text, giving the ELBO: Σ_{t=1}^{T} E_{q(z_t|x_{≤T})}[log p(x_t|z_t, x_{<t})] − KL(q(z_t|x_{≤T}) || p(z_t|x_{<t})). (4) RGD: RGD parameterizes the generative process by the following factorization: p(x_{≤T}, z_{≤T}) = Π_{t=1}^{T} p(x_t|z_{≤t}, x_{<t}) p(z_t|z_{<t}, x_{<t}). (5) The latent variable z_t follows the prior distribution p(z_t|z_{<t}, x_{<t}), and the posterior follows q(z_t|z_{<t}, x_{≤t}). We then obtain the ELBO: E_{q(z_{≤T}|x_{≤T})}[Σ_{t=1}^{T} log p(x_t|z_{≤t}, x_{<t})] − Σ_{t=1}^{T} E[KL(q(z_t|z_{<t}, x_{≤t}) || p(z_t|z_{<t}, x_{<t}))], (6) where q(z_{≤T}|x_{≤T}) can be factorized as Π_{t=1}^{T} q(z_t|z_{<t}, x_{≤t}). We present the detailed derivation of Eq. (6) in Appendix B.1.
In an RNN-like backbone, we can construct the representation of x_{≤t} from the hidden state at the t-th step and compute the distribution parameters of z_t from it.

Method
To incorporate recurrent VAE (RGD) into Transformer, we propose TRACE, which learns recurrent segment-wise latent variables, and we design an acceleration method to make full use of the parallelism of Transformer. We present the adaptation of recurrent VAE to Transformer and the residual parameterization in Sec. 4.1, describe the parallel training method in Sec. 4.2, and provide a theoretical interpretation of TRACE's effectiveness in boosting diversity in Sec. 4.3.

Transformer-based Recurrent VAE
Different from the token-wise latent variables used in RNN-based VAEs, TRACE learns a segment-wise latent variable z_t based on the representation of the t-th segment, x_t. We can devise different principles to separate the segments, such as inherent boundaries like sentences or utterances, or a fixed segment length. We add a special token [SEP] to the end of each segment.
Fig. 1 depicts the architecture of TRACE. At the encoder, we design two kinds of attention mask matrices. First, we introduce an extra mask matrix, a partitioned lower triangular matrix (the left of Fig. 1), which allows each token to attend to all tokens in the same segment and in previous segments. Second, we design an intra mask matrix, a strictly partitioned (block-diagonal) matrix that makes each token attend only to the tokens within the same segment. We feed the separated text sequence into the Transformer encoder twice, with the extra and the intra mask matrix, respectively. The output of the t-th [SEP] token from the final encoder layer under the two passes can then be used as the representation of x_{<t} and of x_t, respectively. Now, we can obtain the parameters of the prior distribution of z_t as (µ_t, σ_t) = f(z_{t−1}, x_{<t}), where f is the prior network, parameterized as linear layers W_fµ, W_fσ ∈ R^{(l+h)×l} applied to the concatenation of z_{t−1} and the representation of x_{<t}. The prior distribution of z_1 is the standard Gaussian distribution.
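The two mask matrices can be built directly from per-token segment indices. Below is a minimal NumPy sketch; the helper name `build_masks` and the `seg_ids` input are our own illustration, not the paper's implementation:

```python
import numpy as np

def build_masks(seg_ids):
    """Build the two attention masks from per-token segment indices.

    extra: token i may attend to token j iff j is in the same or an earlier
           segment (a partitioned lower triangular matrix of segment blocks).
    intra: token i may attend to token j iff both are in the same segment
           (a block-diagonal matrix).
    """
    seg = np.asarray(seg_ids)
    extra = (seg[None, :] <= seg[:, None]).astype(np.int64)
    intra = (seg[None, :] == seg[:, None]).astype(np.int64)
    return extra, intra
```

For example, three tokens with segment indices `[0, 0, 1]` yield an extra mask whose last row is all ones (segment 1 sees segment 0) and an intra mask with two diagonal blocks.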
For the posterior distribution, we utilize residual parameterization (Vahdat and Kautz, 2020), which parameterizes the relative difference between the prior and posterior distributions. In this case, the difference lies in x_t. Therefore, we compute the posterior distribution by predicting a residual from the representation of x_t with the posterior network g, parameterized as linear layers analogous to f, and adding the residual to the prior parameters. We regularize the output of g by layer normalization (Ba et al., 2016). To reveal the benefits of residual parameterization and layer normalization, we give the following theorem: Theorem 1 The KL term between the posterior and prior distributions has a non-zero lower bound determined by l, γ, and β, where l is the latent dimension, and γ and β are the parameters of layer normalization.
We leave the proof to Appendix B.3. Theorem 1 indicates that we can easily control the lower bound of the KL term by setting a fixed γ and hence mitigate KL vanishing. We choose layer normalization here since it is superior to batch normalization in Transformer-based models (Shen et al., 2020) (see Table 5). Besides, Theorem 1 is compatible with both unconditional and conditional generation, in contrast to the BN-VAE model (Zhu et al., 2020).
After deriving the prior and posterior distributions, we can compute the KL loss and sample the latent variables with the reparameterization trick (Kingma and Welling, 2014). The sampled latent variables z_t are injected into the Transformer decoder by adding them to the input embeddings.
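To make the residual construction concrete, here is one plausible NumPy sketch of a residual posterior with a fixed-scale layer normalization on the residual; the exact parameterization in TRACE may differ, and all names here are illustrative:

```python
import numpy as np

def layer_norm(x, gamma=3.0, beta=0.0, eps=1e-6):
    # layer normalization with fixed scale gamma and shift beta
    return gamma * (x - x.mean()) / (x.std() + eps) + beta

def posterior_params(mu_prior, log_sigma_prior, delta_mu, delta_log_sigma, gamma=3.0):
    """Residual parameterization: the posterior network predicts only the
    shift (delta_mu, delta_log_sigma) relative to the prior. Layer
    normalization fixes the scale of the shift, which keeps the posterior
    a fixed distance from the prior and hence the KL term away from zero."""
    mu_post = mu_prior + layer_norm(delta_mu, gamma)
    log_sigma_post = log_sigma_prior + layer_norm(delta_log_sigma, gamma)
    return mu_post, np.exp(log_sigma_post)
```

With a zero residual, the posterior collapses exactly onto the prior; with any non-constant residual, layer normalization rescales it to standard deviation γ, which is the mechanism behind the non-zero KL lower bound of Theorem 1.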

Parallel Training for Recurrent VAE
The method above requires sequentially sampling each latent variable, which hinders parallel training in Transformer. For acceleration, we further design a parallel training method.
With the reparameterization trick, to sample z_t ∼ N(µ, σ²) we first sample a white noise ϵ ∼ N(0, I) and then compute z = µ + ϵ • σ, where • denotes element-wise multiplication. We omit the layer normalization for simplicity. For each t, the recurrent sample can be approximated as z_t ≈ v_t + Σ_{i=1}^{t−1} u_i, where v_t and u_i depend only on the encoder states, and ϵ_t and ξ_i are independent white noises sampled from the standard Gaussian distribution.
We leave the complete derivation in Appendix B.2.
In this way, we can train the model in parallel while keeping the advantage of RGD. We first compute v_t for all time steps in parallel, and then obtain u_t in parallel based on v_t. Then, multiplying the concatenation of u_t over all time steps by a lower triangular matrix of ones yields Σ_{i=1}^{t−1} u_i for every t in parallel. Finally, summing the two terms gives the approximation of z_t. Similarly, we obtain the latent samples from the posterior distribution in Eq. (10).
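The key parallel step, replacing the sequential accumulation with a single matrix multiplication against a lower triangular matrix of ones, can be sketched as follows (illustrative NumPy, not the authors' code):

```python
import numpy as np

def parallel_cumulative(u):
    """Given per-step terms u of shape (T, l), return s with
    s[t] = sum_{i<=t} u[i] for all t at once, via one matmul with a
    lower triangular matrix of ones instead of a sequential loop."""
    T = u.shape[0]
    return np.tril(np.ones((T, T))) @ u
```

The result matches a step-by-step cumulative sum exactly, but the matmul form maps onto the batched parallel computation available in Transformer training.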
In the approximation of Eq. (12), we assume that W_fµ1 and W_fσ1 are idempotent matrices to avoid computing matrix powers. However, such a simplification may introduce too much noise. To stabilize training, we adopt spectral normalization (Miyato et al., 2018) to restrain W_fµ1 and W_fσ1.
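Spectral normalization divides a weight matrix by its largest singular value, typically estimated with power iteration; PyTorch exposes this as `torch.nn.utils.spectral_norm`. A minimal plain-NumPy sketch for illustration:

```python
import numpy as np

def spectral_normalize(W, n_iters=100, seed=0):
    """Divide W by its largest singular value, estimated via power
    iteration, so the normalized matrix has spectral norm ~= 1."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # estimated largest singular value
    return W / sigma
```

Bounding the largest singular value at 1 is a soft surrogate for the idempotency assumption (whose eigenvalues would all be 0 or 1); as the paper notes, it constrains only the top of the spectrum.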

Why Could TRACE Boost Diversity
We give a theoretical demonstration of TRACE's advantage in improving generation diversity. We have the following theorem: Theorem 2 Each reconstruction term in the ELBO, E_{q(z_{≤t}|x_{≤t})}[log p(x_t|z_{≤t}, x_{<t})], is upper bounded by I_q(x_t; z_{≤t}|x_{<t}), where I denotes mutual information.
The proof can be found in Appendix B.4. Based on Theorem 2, optimizing the reconstruction terms amounts to maximizing I(x_t; z_{≤t}|x_{<t}), which enhances the dependency between x_t and z_{≤t}. In this way, the model relies more on the flexible latent space than on the deterministic context, bringing more randomness to improve diversity while maintaining satisfactory coherence.

Dataset
We carry out experiments on two datasets for language modeling and unconditional generation, Yelp and Yahoo (Yang et al., 2017; He et al., 2019), and one dataset, WritingPrompts (WP) (Fan et al., 2018), for conditional generation. We list the detailed statistics of these datasets in Table 1. Due to limited computational capacity, we truncate the training text in WP to a maximum length of 750.

Implementation Details
We use the pretrained language model GPT-2 (Radford et al., 2019) as the backbone. The encoder and decoder of TRACE share the same parameters, are initialized with GPT-2, and are fine-tuned on the target datasets. We set the dimension of the latent space to 32 and use the cyclical annealing trick (Fu et al., 2019) during training. We set the batch size to 32, the learning rate to 5e-5, and γ in the layer normalization to 3. We separate segments with a fixed length of 10 on the Yelp and Yahoo datasets, and use sentences as segments on the WP dataset, with the NLTK toolkit (Bird, 2006) for sentence segmentation. We use the top-k sampling strategy (Holtzman et al., 2020) with k = 50 to decode sequences for all models on all datasets. When the generated token is [SEP], we sample a new latent variable based on the prior distribution to enter a new segment. We implement TRACE and the other VAE baselines with the open-source Huggingface Transformers (Wolf et al., 2020) library (v4.10.0) and conduct all experiments on an NVIDIA GeForce RTX 3090.

Baseline
We compare TRACE with several strong Transformer-based models. All baselines use the same backbone model.
IND/CGD: We implement Transformer versions of two models, TWR-VAE (Li et al., 2020c) and VAD (Du et al., 2018), which belong to the IND and CGD paradigms, respectively. The segment separation is consistent with TRACE.
No Recurrence (Li et al., 2020a): We remove the recurrence and use only one latent variable to verify the effectiveness of recurrence. The latent variable is injected into the decoder by adding it to the text embeddings.

Metrics
We evaluate unconditional generation from three perspectives. (a) Representation learning: we report ELBO, KL, mutual information (MI) (Alemi et al., 2016), and active units (AU) (Burda et al., 2016). We raise the threshold to 0.1 when computing AU to better distinguish the models. (b) Generation quality: we report PPL and CND (Li et al., 2020b) to evaluate the generation capacity of the models. Unlike standard auto-regressive language models such as GPT-2, VAE-based models cannot compute exact PPL. Therefore, following He et al. (2019), we use importance-weighted samples to approximate log p(x) and estimate PPL. CND measures the divergence between the generated text and the ground-truth test set. (c) Generation diversity: we report Self-BLEU (Zhu et al., 2018), Dist (Li et al., 2016), and JS (Jaccard similarity) (Wang and Wan, 2018) to evaluate the diversity of the generated text.
For story generation, we consider both quality and diversity. We report BLEU (Papineni et al., 2002), ROUGE-1, 2, L (Lin and Hovy, 2002), and BERTScore (Zhang et al., 2020) to evaluate the quality of generated samples, along with the same diversity metrics used in unconditional generation. More details about the metrics are given in Appendix A.1.
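For reference, the Dist-n metric is simple to compute: the ratio of unique n-grams to total n-grams over the generated corpus. A minimal sketch (illustrative; not the evaluation code used in the paper, which may differ in tokenization):

```python
def distinct_n(texts, n=1):
    """Dist-n: number of unique n-grams divided by total n-grams across
    all generated texts; higher means more diverse."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)
```

Self-BLEU is complementary: it treats each generated sample as a hypothesis and the rest as references, so lower Self-BLEU means higher diversity, while higher Dist-n means higher diversity.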

Unconditional Generation
We present the results of the unconditional generation task on the Yelp and Yahoo datasets in Table 2. As the results show, TRACE achieves significant improvements on most metrics. Better ELBO and MI indicate that TRACE has a stronger capability for representation learning. In particular, the considerably higher MI empirically validates Theorem 2, showing that with the RGD mechanism the observed data is connected to the latent space more closely. Higher KL and AU also empirically show the benefit of residual parameterization. Besides, lower PPL and comparable CND indicate acceptable quality of the text generated by TRACE.
Moreover, TRACE produces much more diverse text than the baselines. Among the baselines, the CGD baseline suffers from the KL vanishing problem, which means the decoder ignores the latent variables. Therefore, without the randomness arising from the latent variables, CGD performs the worst on generation diversity. In contrast, TRACE obtains the most prominent enhancement on all diversity metrics. This improvement originates from two aspects. First, compared to GPT-2 and No Recurrence, sampling latent variables at each time step brings extra randomness to the output. Second, unlike the other two temporal VAE baselines, TRACE is endowed with the theoretical advantage of RGD, which strengthens the interaction between the text and the latent space and thus absorbs more flexibility from the generalized latent space. IND simply increases randomness from the standard Gaussian prior, which hurts quality while yielding only limited diversity.
Lastly, comparing TRACE-R and TRACE-P, despite a slight drop in quality and diversity, TRACE-P still outperforms the baselines on most metrics except CND. Such a marginal cost is acceptable considering the acceleration from parallelism.
See the comparison of training speeds in Sec. 5.8. The performance loss is mainly caused by the last two approximation steps in Appendix B.2. Theoretically, we approximate the multiplication of t independent random variables by one Gaussian distribution using the central limit theorem (which holds in the infinite-length limit) in Eq. (35), and restrict two parameter matrices to be idempotent (which holds only when every eigenvalue is 1 or 0) in Eq. (36). In practice, these assumptions do not strictly hold because the sequence length is finite and spectral normalization only restrains the largest eigenvalue (stricter limits make optimization difficult). Therefore, we empirically weaken the interaction of latent variables and sacrifice some of their information, leading to decreased performance compared to TRACE-R.

Conditional Generation
We report the results of conditional generation on the WP dataset. As shown in Table 3, TRACE achieves comparable generation quality (still better than GPT-2) and significant improvement in diversity (especially on Self-BLEU and Dist compared with the other VAE baselines). Although the quality of text generated by TRACE-P is relatively lower, TRACE-P still outperforms GPT-2 on both quality and diversity. The overall enhancement of diversity further empirically validates our theoretical analysis. Interestingly, TRACE-P achieves better generation diversity than TRACE-R on WP, opposite to the results on the Yelp and Yahoo datasets. This contrary tendency mainly originates from the different sequence lengths. On the Yelp and Yahoo datasets, whose texts are relatively short, TRACE-R performs better. However, on the WP dataset with much longer texts, the approximation in TRACE-P reduces the exploitation of x_{<t} by weakening the interaction of each x_t, as shown in
Eq. (36), which loosens the dependency on the context and forces TRACE-P to produce more uncertain text (better diversity but lower quality) than TRACE-R.

Human Evaluation
We conduct a human evaluation on the WP dataset. We generate 50 samples with each model given source inputs from the test set and invite five proficient annotators to assess the generated text by scoring three criteria: Fluency (whether the generated text is syntactically fluent), Coherence (whether the generated part is consistently structured and coherent with the input), and Novelty (whether each generated instance is novel and distinct), which together cover both the generation quality and the diversity we care about. See Appendix A.2 for more evaluation details. We report the evaluation results in Table 4. TRACE obtains satisfactory results on Fluency and Coherence and stands superior to the other baselines on Novelty. The results of the human evaluation are consistent with the automatic evaluations.

Ablation Study
Table 5 shows the results of the ablation study on the Yelp dataset. We mainly justify the effectiveness of the residual parameterization, the layer normalization applied after the posterior network, and the spectral normalization of the prior network in TRACE-P.
We also compare with batch normalization by replacing the layer normalization with it. Experimental results show that all of these components benefit quality or diversity. Specifically, without layer normalization or residual parameterization, TRACE tends to ignore the latent variables and lose the diversity brought by the recurrent structure during generation. The difference between batch normalization and layer normalization is relatively marginal, while the latter still performs better.

Analysis
Training Speed. We compare the training speed of TRACE-R and TRACE-P on the Yelp dataset with different pre-defined fixed segment lengths. A small segment length leads to more recurrence steps in TRACE-R. As shown in Fig. 2, a small segment length yields more segments, resulting in much more training time for TRACE-R. In contrast, for TRACE-P, our proposed parallel training method remarkably shortens the training time. When the segment length is 1 (token-wise), training TRACE-P is more than twice as fast as training TRACE-R. When the segment length is 5, TRACE-P is still 50% faster than TRACE-R. As the segment length grows, the number of segments decreases, and consequently the training speeds of TRACE-R and TRACE-P converge. These empirical results confirm the effectiveness of TRACE-P's parallelism. Besides, it is worth emphasizing that despite the slowdown of TRACE-R in the training phase, the inference process is hardly influenced by the recurrent structure because of the auto-regressive decoding manner. Since the efficiency bottleneck of deploying NLG models mainly lies in inference, we believe the speed of TRACE-R is still practical enough for downstream NLG tasks.
Prompt: When everyone turns 18 , they receive a pet which is figurative of their personality.You 're the first person to receive a dragon...

IND:
When people turn eighteen, they get an animal which represents them as one and can be found everywhere in their town. We didn't really ask for it though, only that we should at least know how it worked. Nowadays we had to choose between the puppy dog and cat……

TRACE-R: I have been waiting for this moment ever since the wizards gave me my special ability at elementary school. They said I was one step ahead of everyone else, and that your strength can grow exponentially if you use the abilities your dragon gives you wisely and carefully……

Figure 4: Samples generated by different models on the WP dataset. TRACE-R responds to the dragon in the prompt and imagines an engaging story in which the protagonist is given a special ability by wizards and will begin an adventure with his pet dragon.

Separation of Segment
We explore the influence of segment length on the performance of TRACE-R by evaluating CND for generation quality and Self-BLEU for diversity. As Fig. 3 shows, the generation diversity drops as the segment length increases.
This is consistent with intuition: shorter segments mean more frequent latent sampling, which loosens context correlations but enhances sampling randomness and thus improves diversity.

Decoding Strategy
We evaluate the generation quality and diversity of TRACE-R under different decoding strategies, including greedy decoding and beam search with beam size 10. In these cases, the randomness originates only from sampling the latent variables. As shown in Table 6, by recurrently sampling latent variables that interact with the hidden states, TRACE shows more distinct advantages in generating diverse text under greedy decoding or beam search. In contrast, sampling-based decoding itself enhances generation randomness and thus dilutes the diversity improvement brought by our model. Therefore, sampling-based decoding is the most difficult case for further improving diversity. We select this challenging setting in the main experiments to verify the effectiveness of TRACE, and find that it still achieves non-trivial enhancement in diversity, demonstrating the robustness of TRACE across decoding methods.

Case Study
Fig. 4 gives one generation example from TRACE-R and IND on the WP dataset. The input prompt mentions that people receive a dragon as a pet. The text generated by IND only talks about receiving an animal. In contrast, the generation of TRACE-R first tells that wizards gave the protagonist his ability, and then mentions that the ability could grow with the dragon. The text produced by our model tells an engaging story, as if a warrior would go to fight together with his dragon. In general, TRACE-R generates a more vivid continuation than IND.

Conclusion
In this paper, we revisit the recurrent VAE framework prevalent in the era of RNN and propose TRACE, a novel Transformer-based recurrent VAE structure. TRACE learns a series of segment-wise latent variables, each conditioned on the preceding ones.
We establish the latent distributions with a novel residual parameterization. To accelerate training, we design an approximate algorithm for learning the latent variables that fits the Transformer framework. Experiments show that TRACE achieves significant improvement in generation diversity, benefiting from the tight relationship between the text hidden states and the latent space. In the future, we will further explore the potential of TRACE with larger pretrained models such as GPT-3.

Acknowledgement
We thank the anonymous reviewers for their comments. This work is supported by the National Key R&D Program of China (No. 2020AAA0106502) and the Institute Guo Qiang at Tsinghua University.

Limitations
While TRACE achieves significant improvement in generation diversity, it still has some limitations. First, the trade-off between quality and diversity is a common problem in natural language generation, and TRACE is no exception: its diversity increases while its quality inevitably drops a little. We will explore better methods to balance quality and diversity. Second, our parallel acceleration method requires certain approximations, which to some extent hurt the initial advantage of recurrence and lead to a drop in quality. We will continue to design better acceleration methods. Third, the speedup of our parallel version of TRACE is limited; under practical segment-length settings (e.g., 10 and 20), the acceleration is marginal. We plan to further improve our methods to enable faster training in the future.

Figure 1 :
Figure 1: Architecture of TRACE. We add a special token at the end of each segment and obtain the representations of x_t and x_{<t} from the Transformer encoder (inference network) with two kinds of modified attention mask matrices. The solid and dotted lines denote the posterior and prior paths, respectively. The sampled latent variables are added to the token embeddings in the Transformer decoder (generator network).

Figure 2 :
Figure 2: Training speed (average seconds per step) of TRACE-R and TRACE-P with different segment lengths.

Table 1 :
Statistics of datasets. Length denotes the average text length of each dataset.

Table 2 :
Evaluation results for unconditional generation. SB: Self-BLEU. TRACE-R: TRACE with standard RGD. TRACE-P: the parallel version of TRACE. The best/second-best results are in bold and underlined, respectively.

Table 3 :
Evaluation results for conditional generation.

Table 4 :
Human evaluation results on the WP dataset. Scores range from 1 (worst) to 5 (best). The p-value is < 0.01 and the Kappa score is 0.61, indicating acceptable inter-annotator agreement.

Table 5 :
Ablation study on Yelp. +BN means replacing layer normalization with batch normalization. -LN, -RP, and -SN mean removing layer normalization, the residual design, and spectral normalization, respectively.

Table 6 :
Comparison of different decoding strategies on Yelp Dataset.