Contrastive Deterministic Autoencoders For Language Modeling



Introduction
The variational autoencoder (Kingma and Welling, 2013) - henceforth referred to simply as the VAE - is a classical neural model that utilizes a paired encoder-decoder structure. For every data instance x_i, the encoder network of the VAE parametrically produces a compressed code distribution P(z_i|x_i). The decoder network, in turn, uses this P(z_i|x_i) to form an approximation x̂_i of the real input through an intermediate sampling step. By minimizing a reconstruction loss between x_i and x̂_i, along with a KL-divergence between P(z_i|x_i) and the isotropic Gaussian distribution, VAEs can perform both generative and denoising (reconstructive) tasks. Minimizing the KL loss encourages a latent distribution close to the isotropic Gaussian, from which the VAE can sample and decode. Note, however, that the isotropic Gaussian is not required: in NLP, researchers have considered training the latent distribution (Fang et al., 2019) and learning structured, discrete representations (Zhao et al., 2018b).
One of the most pressing problems in practical VAE training arises when the encoder's distribution collapses to the standard Gaussian for every instance, that is, P(z_i|x_i) ≈ N(0, I) for all i. This problem is commonly termed posterior collapse (Bowman et al., 2016; Razavi et al., 2019; Lucas et al., 2019; He et al., 2019) and lies at the heart of modern issues with VAEs. Many fixes have been proposed, ranging from setting a lower bound on the KL term (δ-VAE) (Razavi et al., 2019), to aggressively training the encoder while analyzing the 'lag' between the encoder and the decoder (Agg-VAE) (He et al., 2019), to forcing meaningful usage of the code z_i through skip connections (Skip-VAE) (Dieng et al., 2019). These issues worsen when the VAE employs an autoregressive structure, as for text or videos (Fang et al., 2019; Zhao et al., 2018b; Dai et al., 2020; Long et al., 2019). Thus, mitigating posterior collapse in VAE architectures is likely to have outsized benefits in NLP.
Independent of the VAE model, there exists the idea of aligning the aggregate posterior. The aggregate posterior is the latent space distribution, i.e. the distribution over z formed by evaluating P(x_i)P(z_i|x_i) over all x_i, where the distribution over x_i is usually an empirical average over the training set. In these methods, the individual distributions P(z_i|x_i) may even be Dirac distributions, i.e. the mapping between x_i and z_i is purely deterministic; we can still find, in aggregate, a distribution over z that is close to an isotropic Gaussian. Methods in this vein utilize Wasserstein-based optimal transport (Tolstikhin et al., 2017), maximum mean discrepancy (Kolouri et al., 2018), etc. for this purpose. Such models cannot truly be termed VAEs, as they are often deterministic, but they work similarly, and due to their deterministic nature and differing loss functions, posterior collapse does not usually occur. In VAEs, the quantity to be maximized is the log likelihood log P(x_i). This proves intractable, and an equivalent optimization is done via the ELBO (Evidence Lower BOund) objective. While VAEs can be evaluated by log likelihood, they can also be evaluated by reconstruction error, as well as by the quality of samples generated when an isotropic Gaussian N(0, I) is substituted for P(z) and the Gaussian samples are fed through the decoder network. This sample-based evaluation also applies to deterministic autoencoders, which do not admit an ELBO evaluation, and therefore allows a fair comparison between deterministic autoencoders and variational alternatives.

Our Contributions
We seek to draw parallels to findings on image datasets indicating that deterministic autoencoding models outperform variational ones in terms of sample quality (Ghosh et al., 2019). Given their relative freedom from posterior collapse, and the aforementioned outsized impact of posterior collapse in language modeling, adapting these deterministic architectures to NLP may yield better autoencoders for text. We find that replacing a previous high-performing architecture, the BN-VAE (Zhu et al., 2020), with a similar deterministic variant improves performance on language modeling tasks. We provide an information-theoretic analysis of the constant variance VAE, demonstrating that this case allows mutual information maximization. To aid convergence, we add an entropic regularization through contrastive learning (He et al., 2020). To our knowledge, this linkage of contrastive learning to entropy maximization is novel, though entropic regularization has previously been utilized for deterministic autoencoders (Ghose et al., 2020). We evaluate using perplexity-based benchmarks in both forward and reverse directions (Zhao et al., 2018a) and test the accuracy of our formulation using relatively large autoencoders built from transformer encoder-decoder pairs. In all cases, we observe improvements not only over the previous BN-VAE, but over a broad array of VAE architectures. We term our architecture, shown in Figure 1, the CEAE (Contrastive Entropic Autoencoder). To account for the increasing relevance of large language models, we also test against appropriate architectures that utilize BERT and GPT-2 as parts of an overall VAE architecture.
2 Definitions and Prior Work

Variational Autoencoder
We first present the classical setup of the VAE. We denote by D, E the decoder and encoder halves of the VAE. Generally, D, E are comparable in size, shape, architecture, etc. We propagate every data instance x_i through subnetworks E_μ, E_σ²:

μ_i = E_μ(x_i),  σ_i² = E_σ²(x_i)

These parameters, in turn, define a Gaussian distribution from which z_i is drawn:

z_i ~ N(μ_i, diag(σ_i²))
The decoder is then used to produce an (approximate) reconstruction x̂_i of x_i:

x̂_i = D(z_i)

The loss function for VAE optimization is:

L = ||x_i − x̂_i||² + N(μ_i, diag(σ_i²)) || N(0, I)

where the || sign between the two normal distributions denotes the KL-divergence. Note also that the first term (squared loss) is commonly used in papers that describe VAEs, but it should be replaced by cross entropy in classification tasks (including most NLP tasks). Furthermore, the loss function may not be the same for all variants of the VAE.
For instance, the β-VAE (Higgins et al., 2016) instead optimizes the same objective with the KL term weighted by β, where β is a tunable hyperparameter, generally not equal to 1. Such tuning may improve performance, in particular by avoiding posterior collapse.
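To make the setup concrete, here is a minimal PyTorch sketch of the objective just described (helper names are ours; β = 1 recovers the classical VAE, and the squared loss would be swapped for cross entropy in most NLP decoders):

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # z_i = mu_i + sigma_i * eps, eps ~ N(0, I): the intermediate sampling step.
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    # Squared-error reconstruction term.
    rec = F.mse_loss(x_hat, x, reduction="sum")
    # Closed-form KL( N(mu, sigma^2 I) || N(0, I) ).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl  # beta = 1 gives the VAE, beta != 1 the beta-VAE
```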

Deterministic Autoencoders
Utilizing the same notation as the VAE, if we instead replace the sampling step with z_i = μ_i, the resulting flow is completely deterministic. In this case, training minimizes:

L = ||x_i − x̂_i||² + R(z_i)

where R can be any regularizer on z, the simplest being the least-squares regularizer R(z) = ||z||², which yields the Regularized Autoencoder (RAE) (Ghosh et al., 2019). To generate samples from the RAE, a Gaussian, or a Gaussian mixture model, is fit to the distribution of z after training. With a suitable regularizer R, such as the MMD regularizer of the Wasserstein autoencoder (WAE) (Tolstikhin et al., 2017), guarantees can be made on the latent space distribution P(z) that remove the need for this post hoc fitting and allow drawing samples from N(0, I) directly; this case is our focus of interest. On image datasets such as MNIST, CIFAR-10, and CelebA, deterministic autoencoders hold a notable advantage (Ghosh et al., 2019), and we would like to carry this over to text.
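For contrast with the VAE sketch above, a minimal deterministic RAE sketch under the same notation (function names are ours; the post hoc density fit is indicated as an optional comment):

```python
import torch
import torch.nn.functional as F

def rae_loss(x, x_hat, z, reg_weight=1e-4):
    # Deterministic flow: z_i = mu_i, no sampling step.
    rec = F.mse_loss(x_hat, x, reduction="sum")
    reg = z.pow(2).sum()          # simplest choice: R(z) = ||z||^2
    return rec + reg_weight * reg

# To sample from a trained RAE, fit a density to the latents post hoc,
# e.g. a Gaussian mixture (one option; the WAE-style regularizers
# discussed above remove the need for this step):
#   from sklearn.mixture import GaussianMixture
#   gmm = GaussianMixture(n_components=10).fit(all_z.numpy())
#   samples, _ = gmm.sample(n_samples=100)
```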

VAE Optimization for Language Modeling
The failure mode of VAE optimization (when modeling text, images, or general datasets) manifests as posterior collapse, where N(0, I) takes the place of every latent code's distribution. Autoregressive VAEs suffer the worst in this regard, which impacts NLP disproportionately, since non-autoregressive models are generally not usable there. It has been suggested that a primary cause of the collapse lies in training dynamics (He et al., 2019).
In this view, the inference network (encoder) initially performs very poorly, which causes the generator network (decoder) to neglect the latent codes produced by the encoder. This in turn leads to uninformative codes (the posterior collapse phenomenon). Alternative fixes involve reweighting the KL term. Such methods include the β-VAE, which weights the KL term by β (Higgins et al., 2016), and methods which do not allow the KL to go to zero (Razavi et al., 2019). Architecture-wise, skip connections reduce posterior collapse (Dieng et al., 2019), as does reducing the complexity of the decoder network. During the main training loop, the loss can be amortized (Kim et al., 2018), annealed, or applied in a cyclic manner (Fu et al., 2019), all of which reduce the phenomenon. Finally, the optimizer itself may be changed, with SGD being somewhat preferable to Adam (Srivastava and Sutton, 2017) for the purposes of avoiding posterior collapse. Orthogonal to all these fixes, we may of course simply use deterministic autoencoders.

BN-VAE Architecture -a Strong Baseline
Let μ_ij, σ_ij denote the j-th index of the posterior parameters of the i-th instance of a latent code of dimension K. It may be shown that the expectation of the KL divergence obeys the relation (Zhu et al., 2020):

E[KL] = (1/2) Σ_{j=1}^{K} ( E[μ_ij²] + E[σ_ij² − log σ_ij² − 1] ) ≥ (1/2) Σ_{j=1}^{K} E[μ_ij²]

where the expectation is taken over the samples, i.e., over i, and the inequality follows since σ² − log σ² − 1 ≥ 0. We directly use the resulting inequality: since any random variable X with defined first and second moments satisfies E[X²] = Var(X) + (E[X])², we have

E[KL] ≥ (1/2) Σ_{j=1}^{K} ( Var(μ_ij) + (E[μ_ij])² )

We enforce a batch norm constraint on μ_ij that fixes the expectation and/or the variance, thereby setting the lower bound on the expectation of the KL. Batch normalization here simply refers to the near-ubiquitous practice of minibatch-level normalization, i.e., adding a layer of the form:

BN(x_i) = (x_i − μ) / σ

where μ, σ represent the mean and standard deviation computed over the minibatch of x_i. This batch norm layer is usually accompanied by an affine function, i.e., a function of the form f(x) = Ax + b (Ioffe and Szegedy, 2015). We will explicitly make a distinction between the two parts.
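A sketch of how such a constraint can be imposed on the latent means is below. The layer name and the exact affine treatment are our illustration under the equations above; the BN-VAE paper discusses precisely which parts are fixed versus learned:

```python
import torch
import torch.nn as nn

class BatchNormedMean(nn.Module):
    """Normalize latent means mu over the minibatch, then apply a restricted
    affine map: a fixed scale gamma and a learnable shift. Since BN output has
    zero mean and unit variance per dimension, each dimension j contributes
    (gamma^2 + shift_j^2) / 2 to the KL lower bound derived above."""
    def __init__(self, latent_dim, gamma=0.6):
        super().__init__()
        self.bn = nn.BatchNorm1d(latent_dim, affine=False)
        self.gamma = gamma                                  # fixed scale
        self.shift = nn.Parameter(torch.zeros(latent_dim))  # learnable shift

    def forward(self, mu):
        return self.gamma * self.bn(mu) + self.shift
```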
In experiments against other architectures such as the Skip-VAE (all of which are designs intended to circumvent the posterior collapse issue), the BN-VAE achieves superior performance in both language modeling (NLL metrics) and the downstream usage of learnt representations (Zhu et al., 2020). Our lesson from this success is to set a baseline with a VAE that batch normalizes the means of the latent representations.

Information Theory of Constant Variance Batch Normed Autoencoders
Let the generation process of the latent code be:

z_ij = μ_ij + σ_ij ε,  ε ~ N(0, 1)

with the batch norm constraint that:

E[μ_ij] = a,  Var(μ_ij) = b²

Consider an intermediate case between the VAE (where each σ_ij can be distinct) and the deterministic autoencoder (each σ_ij = 0): set every σ_ij = c for a constant c > 0. Then, indexwise and vectorwise,

z = μ + Z

where Z is a random vector from N(0, c²I). Consider the mutual information between z and μ. It denotes the amount of useful information sent through the encoding, and maximizing it avoids posterior collapse (Zhao et al., 2017). It equals

I(z; μ) = H(z) − H(z|μ)

where H is the entropy function. Now, z|μ = (μ + Z)|μ. Therefore, the required conditional entropy is that of Z, which is independent of μ. This differs from general VAEs, where the variance is instance dependent and σ_i, μ_i are related, relating Z and μ. (Note that here we refer to the instance index i and not the dimension index j, as we discuss instance-level dependence.) This yields the final mutual information expression:

I(z; μ) = H(z) − H(Z)

Now, z has a fixed mean and a fixed variance, since it is the direct sum of two independent random variables Z and μ, both of which, by hypothesis, have fixed means and variances. Under this condition of fixed mean and variance, it is known that the entropy H is maximized iff z is distributed as a Gaussian (Thomas and Cover, 1999). Therefore, a more informative constant variance batch normed VAE induces a more Gaussian representation on z. Since μ = z − Z, where z approaches a Gaussian and Z is itself one, the desired latent mean μ also approaches a Gaussian as the mutual information rises. Note that our analysis holds for any c > 0. This means even very low values of c - slowly annealed to 0, approximating the deterministic autoencoder case - will work as long as the required mutual information is high. To create an aggregate posterior which is the isotropic Gaussian, we assume a = 0, b = 1, i.e.

E[μ_ij] = 0,  Var(μ_ij) = 1

When μ_i becomes a Gaussian under these two constraints, our job of creating an appropriate deterministic autoencoder with the right aggregate posterior is done. We have already discussed how mutual information interacts with that process; observe now that becoming a Gaussian can also be driven by controlling the entropy H(μ_i), which is maximized iff μ_i is Gaussian, which in turn implies z_i is Gaussian. The process is accelerated by minimizing a regularizer of the form −λH(μ_i), λ ≥ 0.

Entropy and Contrastive Regularization
We require an effective estimator of the entropy H(μ_i) in order to drive μ_i toward a Gaussian.
A first step might be repulsion-based estimators such as the Kozachenko-Leonenko estimator (Delattre and Fournier, 2017): for a sample X_1, X_2, ..., X_{N+1} ∈ R^d drawn from an unknown distribution P, with R_i = min_{j≠i} ||X_i − X_j||_2, Y_i = N(R_i)^d, B_d the volume of the unit ball in R^d, and γ ≈ 0.577 the Euler-Mascheroni constant, an estimate of the entropy of the distribution is:

Ĥ = (1/(N+1)) Σ_{i=1}^{N+1} log Y_i + log B_d + γ

The estimator relies primarily on the leading sum over Y_i, which computes "repulsions" between latent representations that are too close. Only this sum (with weight λ) needs to be computed at all for gradient-based learning, as the other terms are constants. In practice, this estimator has been used only sporadically for image autoencoders (Ghose et al., 2020) and rarely for neural networks in general, and direct implementations of the method for language autoencoders lead to convergence failure.
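For concreteness, a sketch of the gradient-carrying repulsion term of this estimator over a batch of latent codes (a hypothetical helper of ours; the constants log B_d and γ are dropped since they carry no gradient):

```python
import torch

def kl_entropy_repulsion(mu, eps=1e-8):
    """Batch estimate of the variable part of the Kozachenko-Leonenko
    entropy: the average over i of d * log R_i, with R_i the distance from
    mu_i to its nearest neighbor in the batch. Larger values mean codes are
    more spread out; a regularizer -lambda * H would subtract this term."""
    n, d = mu.shape
    dists = torch.cdist(mu, mu)                       # pairwise L2 distances
    mask = torch.eye(n, dtype=torch.bool, device=mu.device)
    dists = dists.masked_fill(mask, float("inf"))     # exclude self-distances
    r = dists.min(dim=1).values                       # nearest-neighbor R_i
    return d * torch.log(r + eps).mean()
```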
We turn for a solution to the contrastive learning literature, which has recently emerged as a strong baseline in representation learning for both unsupervised (Chen et al., 2020; Zhang et al., 2022) and supervised (Khosla et al., 2020; Zhang et al., 2022) contexts. In unsupervised contrastive learning, the goal is to learn two alternative representations Z_i, Z_i^+ of an instance X_i ∈ X, so that Z_i, Z_i^+ are close (e.g. by the inner product) while Z_j arising from X_j ∈ X, j ≠ i has a low inner product with Z_i. This is done by minimizing:

L = −log [ exp(⟨Z_i, Z_i^+⟩) / Σ_{j≠i} exp(⟨Z_i, Z_j⟩) ]

Here Z_i^+ arises from a noisy or augmented version of X_i, such as a crop (if X is an image). One suitable method is momentum contrast (MoCo) (He et al., 2020), where Z_i is generated by a model with parameters θ, and a model with parameters θ′, a time average of θ, generates Z_i^+. The encoding method therefore learns to be insensitive to these noises, producing encodings Z_i, Z_i^+ that are more or less invariant to them. Simultaneously, the denominator discourages proximity between codes Z_i, Z_j arising from different instances X_i, X_j - a repulsive regularization controlling the entropy, which is the link our derivation sketches. Unlike directly controlling the entropy with a Kozachenko-Leonenko repulsion loss, this method is empirically well understood in terms of training and implementation. Further, it can be shown that this loss approximates the entropic regularizer for Gaussian distributions in high dimensions; the full proof and derivation appear in the appendix. We use the following loss function L_ent, with μ_i, μ_i^+ being respectively generated by E_θ, E_θ′:

L_ent = −λ_t log [ exp(⟨μ_i, μ_i^+⟩) / Σ_{j≠i} exp(⟨μ_i, μ_j⟩) ]

The details of how to choose m, λ_t and their justifications also appear in the appendix. The overall training loss is formed by adding L_ent to the reconstruction loss.

4 Experimental Details and Methodology

Dataset Choices
We present results primarily on the Yahoo and Yelp corpora (Yang et al., 2017) for language modeling tasks, using autoencoders with LSTM encoder-decoder architectures. This maintains consistency with the BN-VAE (Zhu et al., 2020) in terms of comparing performance on these tasks. We additionally use a small-scale transformer, with its structure based on the Transformer-XL model (Dai et al., 2019), specifically for the Penn Tree Bank (PTB) dataset (Marcus et al., 1993); results on this dataset appear in the appendix. Unlike Transformer-XL, which is a decoder-only model, we employ an encoder-decoder transformer, but keep hyperparameter and training recommendations aligned with the original Transformer-XL source code for PTB. We name our architecture CEAE - Contrastive Entropic Autoencoder.

Metrics
Generally, variational autoencoders are compared on the following metrics:
• Negative log-likelihood of the data (usually estimated via ELBO/importance sampling)
• Mutual information between x_i, z_i, capturing latent code quality (Alemi et al., 2016)
• KL divergence (averaged) between each latent code's distribution and the isotropic Gaussian, i.e. KL(N(μ_i, σ_i²) || N(0, I))
Other metrics such as active units (AU) (Burda et al., 2015) may be used, which capture the number of latent dimensions actually being utilized by the autoencoder. None of the above metrics (except AU) can be used to compare VAEs with deterministic autoencoders. We hence use a different suite of metrics based on forward and reverse perplexities (Zhao et al., 2018a; Gagnon-Marchand et al., 2019):
• Reconstruction error, measured by the negative log likelihood of the decoder-reconstructed input.
• Forward perplexity, where we generate and decode 1000 samples from N(0, I). The perplexity of this sample is evaluated by a critic model, optionally after training the critic on the corresponding train segment of the corpus.
• Reverse perplexity, in which 10,000 samples are generated from the autoencoder just as above, but are now used to train a different model, which is then tested on the test segment of the corresponding corpus.
We chose two different critic models to reflect two ends of the model power spectrum: a simple LSTM language model, and GPT-2 (Radford et al., 2019) (standard configuration, 117M parameters). The reverse-perplexity task was performed only with the LSTM critic, as we found that training GPT-2 on the samples (which are of relatively low quality compared to real text) hurt downstream performance on the uncorrupted test corpus. We add human evaluation of generated samples as a sanity check.
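A sketch of the zero-shot forward-perplexity computation with the Hugging Face transformers library (decoding latent samples into sentences is assumed done elsewhere; our exact evaluation script may differ from this minimal version):

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def forward_perplexity(sentences):
    # Token-level perplexity of generated sentences under zero-shot GPT-2.
    total_nll, total_tokens = 0.0, 0
    for s in sentences:
        ids = tokenizer(s, return_tensors="pt").input_ids
        out = model(ids, labels=ids)   # loss = mean NLL over predicted tokens
        n = ids.size(1) - 1            # number of predicted tokens
        total_nll += out.loss.item() * n
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```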

Comparisons and Benchmarking
We include a full suite of comparisons on the Yahoo and Yelp corpora, with the following architectures targeting the posterior collapse problem:
• Skip-VAE: latent-decoder skip connections (Dieng et al., 2019)
• Annealing of the KL loss (Bowman et al., 2015)
• β-VAE (KL weight β) (Higgins et al., 2016)
• Cyclic annealing of the KL loss (Fu et al., 2019)
• Free bits in the KL loss (Kingma et al., 2016)
• δ-VAE (minimum KL δ) (Razavi et al., 2019)
• von Mises-Fisher (vMF) VAE (Xu and Durrett, 2018)
• Semi-amortized (SA) VAE (Kim et al., 2018)
• Aggressive-VAE (He et al., 2019)
These benchmarks correspond to the ones in (Zhu et al., 2020), from which we also take the required reconstruction metrics and BN-VAE variants. We add a vanilla VAE and an LSTM language model as baselines. For the PTB dataset, we compare the deterministic autoencoder to a VAE setup optimized analogously to the much larger BERT-GPT-2 VAE of (Li et al., 2020). Only the standard VAE is considered, with KL annealing kept for consistency. Standard deviations, implementation and architectural details, hyperparameters, etc. are in the appendix.

Large Language Model Comparison
We also add a comparison to the OPTIMUS architecture of VAEs (Li et al., 2020), with BERT as the encoder, GPT-2 as the decoder, and pre-training on the Wikipedia dataset (details in appendix). Further, the embeddings learnt by our method were compared to BERT zero-shot per-sentence embeddings (this result appears in the appendix).

Results
Text modeling results appear in Tables 1 and 2. In general, our method outperforms the competitors, and where it does not, it is close to the top performer. It should be noted that the LSTM critic's performance relative to GPT-2 stems from the fact that the LSTM is trained on the relevant corpus while GPT-2 is tested zero-shot: even though GPT-2 is a stronger model, the LSTM has more domain knowledge, causing their perplexities to be close. GPT-2 thus evaluates the samples based on general knowledge of the English language and plausibility as English sentences (as it is tested zero-shot), while the LSTM evaluates them with emphasis on domain knowledge (its sole training data). The two perplexities therefore measure different things, and we perform well on both.
To evaluate the quality of the latent space for downstream tasks, we extract the latent representations on a shortened Yelp dataset following (Shen et al., 2017), along with labels for a small fraction of the dataset. These labels reflect the nature of the shortened review (positive or negative). We then apply a simple shallow neural classifier to predict the labels (review valence). Note that the representation learning step always has access to the same number of unlabeled sentences; only the number of labels is varied. Our method proves superior, especially as more labels become available, with all results in Table 3. For human evaluation, we gathered five individuals (graduate students or machine learning engineers) who were given 200 choice-based questions and asked to pick the most coherently generated option among permuted choices. Each choice corresponded to a sample from one method on one corpus. We compare to the BN-VAE and vanilla VAE and find a significant advantage in terms of being chosen, across both LSTM and transformer architectures, as shown in Tables 4 and 5. A follow-up with GPT-4 appears in the appendix. Overall, across a broad variety of tasks, we improve on the BN-VAE architecture.

6 Qualitative and Quantitative Analysis of Interpolations

Autoencoders allow smooth linear interpolation between two sentences in the latent space, in a manner that should preserve both syntactic (sentence structure) and semantic meaning. This is assessed using the subset of the short Yelp review dataset, which consists of short single-sentence reviews of establishments labeled 0 or 1 for a negative or positive review, and which is also used in the classification task of Table 3. We perform the following sanity checks:
• That interpolating between a positive and a negative review yields a neutral review.
• That interpolating between two reviews of the same nature (positive-positive or negativenegative) always yields reviews of the same nature, but of differing content or sentence structure, reflecting the source sentences.
• That these interpolations have numerical scores (from the classifier of Table 3) that match the decoded content.
Results demonstrating these qualitative characteristics are summarized in Table 6. Moving between reviews of different kinds changes the score, as expected, to an ambivalent one, i.e. around 0.5, which is reflected in the text as well, such as in the example on row 3 of the positive-negative interpolation. Between reviews of the same nature (clustered around 0 or 1), interpolation changes sentence structure and content: in the case of two negative reviews, the interpolation closer to the sentence "are you kidding me?" begins with "are you", and the interpolation involving two positive reviews likewise shares common sentence structures.

Conclusions and Future Work
Attention-based transformer models (Vaswani et al., 2017) have achieved great success in all areas of NLP (Devlin et al., 2018; Radford et al., 2018, 2019). Transformer models retain an autoencoder-like parallel with encoder and decoder modules. Though they are dissimilar to VAEs in that decoder-only models are often used for text generation while encoder-only models are used for representation learning, using both simultaneously creates a VAE-like architecture. This analogy has been used to train massive VAEs for NLP, e.g. OPTIMUS (Li et al., 2020), that employ pretrained encoders. We view our work as an indication that deterministic autoencoding can yield better autoencoders for text than traditional VAEs. Issues of VAE training, namely posterior collapse, worsen with increasing model power (Yang et al., 2017; Semeniuta et al., 2017; Razavi et al., 2019); powerful VAE design requires tackling this, and deterministic models may offer the solution. We also consider our work significant for its successful usage of contrastive learning in NLP, where text suffers from less clear augmentations (Rethmeier and Augenstein, 2021) relative to images. We focus on batch norm; however, in natural language tasks, increasing emphasis is being laid on layer normalization, aka LayerNorm (Ba et al., 2016; Xu et al., 2019), which forms a key part of transformers. We discuss the LayerNorm case further in the appendix.

Limitations
Our methodology focuses on autoencoders, which may include transformer architectures; however, by necessity, autoencoders involve an encoder-decoder pairing, which may be absent in architectures that are encoder-only (for word embeddings) or decoder-only (for language generation). In these cases, our approach does not carry over directly and will require some rethinking. Further, text generation is a field in a state of flux. Although we have tested with large models in the form of GPT-2, it is possible our results do not scale to as-yet unreleased but extremely potent models, such as GPT-4.

Ethics Statement
Text generation may produce harmful content, undesirable generations, and algorithmic bias. We do not, however, view our work as being particularly prone to these failure modes any more than other papers in this domain, and we believe no particularly strong ethical statement is warranted.

Appendix : contents
In order, we go over: • The training process and hyperparameters, and how to choose them

A.1 Training Flow
From the discussion in section 3.1, it is clear that we will achieve our goal of an aggregate Gaussian-distributed, deterministic μ_i by either driving up its entropy or requiring high mutual information.
Let σ_i = c. The overall maximum mutual information equals:

I(z; μ) = H(z) − H(Z)

By applying the Gaussian entropy formula, which states that for variance V the entropy of a Gaussian random variable is (1/2) log(2πeV), we obtain that the mutual information per latent dimension is at most

(1/2) log(2πe(1 + c²)) − (1/2) log(2πec²) = (1/2) log(1 + 1/c²)

The upper bound is attained, and the Gaussian distribution formed, only when this bound is saturated; when c is high, the expression is bounded above by a low value, which is reached in order to lower the reconstruction error. We set for epoch t:

c_t = c_0/t for t ≤ t_max,  c_t = 0 for all t > t_max

where t_max is the point at which deterministic training begins, i.e. c = 0. Depending on the dataset, the value of c_0 can be chosen with justification; our process for choosing it appears in appendix A.4. As c goes to zero, the final epochs are trained deterministically, with the entropic loss L_ent taking over. Since the need to raise mutual information arises from the high values of c_t, we utilize an epoch-dependent regularization on H(μ_i), minimizing the contrastive loss (unifying notation from X_i to x_i):

L_ent = −λ_t log [ exp(⟨μ_i, μ_i^+⟩) / Σ_{j≠i} exp(⟨μ_i, μ_j⟩) ]

with λ_t = (1 − 1/t)λ_0, and μ_i, μ_i^+ generated respectively by E_θ, E_θ′, where θ′ = mθ′ + (1 − m)θ is a time-averaged version of the main model E_θ. Here m ≈ 1, and we use m = 0.999 so that θ′ updates slowly, in keeping with standard implementations (He et al., 2020). The overall training loss is formed by adding L_ent to the reconstruction loss. Due to L_ent in the second stage and c in the first stage of training, there is no need to fit a post hoc distribution (Ghosh et al., 2019), and the resultant aggregate posterior is a Gaussian. Two-stage training is necessary, as the contrastive loss only begins to approximate the entropy under assumptions reached later in the training process.
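A sketch of this two-stage recipe in PyTorch follows. Function names are ours, and the batch-level treatment of negatives and the cached-minibatch details of A.4 are omitted:

```python
import torch

def contrastive_entropic_loss(mu, mu_pos, lam_t):
    # L_ent = -lam_t * log( exp(<mu_i, mu_i+>) / sum_{j != i} exp(<mu_i, mu_j>) )
    pos = (mu * mu_pos).sum(dim=1)            # <mu_i, mu_i+>
    logits = mu @ mu.t()                      # <mu_i, mu_j> for all pairs
    mask = torch.eye(mu.size(0), dtype=torch.bool, device=mu.device)
    logits = logits.masked_fill(mask, float("-inf"))  # drop j = i
    return lam_t * (torch.logsumexp(logits, dim=1) - pos).mean()

def momentum_update(enc_prime, enc, m=0.999):
    # theta' <- m * theta' + (1 - m) * theta  (He et al., 2020)
    for p_prime, p in zip(enc_prime.parameters(), enc.parameters()):
        p_prime.data.mul_(m).add_(p.data, alpha=1 - m)

def schedules(t, c0, lam0, t_max):
    # c_t = c0 / t up to t_max (then fully deterministic);
    # lam_t = (1 - 1/t) * lam0 ramps the entropic term in.
    c_t = c0 / t if t <= t_max else 0.0
    lam_t = (1.0 - 1.0 / t) * lam0
    return c_t, lam_t
```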

A.2 Architectural Details
For the LSTM-based VAE on Yahoo and Yelp, architectural details were tuned to match the BN-VAE, with a latent dimension of 512 and single-layer LSTMs of size 1024 for both decoder and encoder, with the feeding pattern of the latent code following the BN-VAE implementation exactly. Minibatches of size 32 with SGD and gradient clipping were used for training, along with annealing on either the KL or the entropy loss, following the original BN-VAE implementation.
For training critic models in the forward and reverse perplexity tasks, a simple recurrent LSTM with two hidden layers of size 200 was used, also trained by SGD. While the LSTM was trained on the text corpus before evaluation on the sample for forward perplexity, GPT-2 was tested zero-shot. Due to the structure of GPT-2, the sample is also evaluated for coherence across all generated sentences, even though each sentence is generated individually. However, we expect differences between our models to reflect only the individual quality of generated sentences, as all of them share this independent generation framework. For the Penn Tree Bank models using a transformer based on Transformer-XL, we built on source code released on the author's website (http://zihangdai.github.io/misc/ptb.zip). The decoder-only model was modified (keeping hyperparameters and latent sizes) into an encoder-decoder model. Architectural details were kept as-is, with 6 layers each for encoding and decoding, and a sentence-level representation based on the start-of-sentence token.

A.3 Randomness and Seed Dependence, Hardware, Number of Runs
We do not consider our method to depend meaningfully on randomness, i.e. on the seed. For the BN-VAE architectures, we ran both with the fixed seed 783435 provided in the BN-VAE repository and without setting any seed, averaged over 10 runs, and reported the better of the two (except for the reconstruction loss, for which we directly use the previously reported figures; we could not exactly replicate those numbers when re-running, but we came within 0.1, so we consider the original BN-VAE figures correct). For our methods, no fixed seed was set and the average over 10 runs is reported directly. Results were obtained on a Tesla V100 32 GB, using PyTorch 1.6, on Ubuntu 18.04.

A.4 Hyperparameter Optimization and Notes on Transformer Training
In general, we keep all hyperparameter and architectural details in line with previous implementations (linked above). As reported in the paper, the number of cached minibatches is r = 3, and every K = 5 minibatches we use the global statistics of the batch norm. We tried m = 0.99 and 0.999 for the momentum update step and found better results with 0.99, which yields the figures reported in the paper; however, m = 0.999 also outperforms BN-VAE. The SGD is trained with a gradient clip of 5.0 and an initial learning rate of 0.5 (BN-VAE) or 0.1 (ours), decayed by a factor of 2 (i.e. multiplied by 0.5) at most 5 times, with the decay criterion based on non-improvement on validation for 5 epochs (same as BN-VAE). In our experiments, we find that the gradient clip is the only sensitive hyperparameter for both architectures.
The BN-VAE parameters that fix the batch norm statistics (γ) are set according to the original implementation's best performances, at 0.6 and 0.7, while for our case the results reported in the paper correspond to setting the contrastive entropic regularizer's weight to 6 × 10^-3. This hyperparameter was chosen by trying all combinations in {2, 3, 6, 7} × 10^{-3, -4}, i.e. among 8 choices. All entropic weights of order 10^-3 obtained results comparable to those reported in the paper, and those of order 10^-4 still outperformed BN-VAE. In general, all architectural details, parameters, and hyperparameters that do not differ between BN-VAE and our method follow the BN-VAE exactly for LSTMs. For transformers, the same KL weight, r, K, and m were kept as for LSTMs; the other parameters were shifted to match the implementation of a small-scale transformer for Penn Tree Bank as in Transformer-XL, with the exception of changing the decoder-only model to one using both transformer encoder and decoder layers to more closely match the autoencoder framework.
Choice of c_0 and associated parameters: We recall from the main text that at epoch zero, the initial channel capacity (maximum mutual information) of the autoencoder with P latent units is:

C = (P/2) log(1 + 1/c_0²)

For perfect decoding of even the train set, consisting of M sentences with lengths n_1, ..., n_M, we have to consider all valid targets. For a sentence, this is any contiguous subsentence; for a length n_i, this count equals n_i(n_i − 1)/2. So we compute the total number of valid subsegments of the train set as:

S = Σ_{i=1}^{M} n_i(n_i − 1)/2

By the channel capacity theorem, for perfect reconstruction we would need (1 + 1/c_0²)^{P/2} ≥ S. However, this assumes a perfect function approximator on the part of the decoder and encoder. As such, we choose c_0 to satisfy (1 + 1/c_0²)^{P/2} = 5S.
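A small sketch of this choice (a direct solve of the displayed equation; variable names are ours):

```python
import math

def choose_c0(sentence_lengths, P):
    """Solve (1 + 1/c0^2)^(P/2) = 5S for c0, with S the number of valid
    contiguous subsegments of the training set and P the latent dimension."""
    S = sum(n * (n - 1) // 2 for n in sentence_lengths)
    target = 5 * S
    # (1 + 1/c0^2) = target^(2/P)  =>  c0 = 1 / sqrt(target^(2/P) - 1)
    return 1.0 / math.sqrt(target ** (2.0 / P) - 1.0)
```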

A.5 OPTIMUS training
We followed the training procedure from the original paper (Li et al., 2020), taking the lowest-λ model unless there was a tie in the perplexity values and a better reconstruction was available. Note that we also examined experimental results using different λ values; the following figures should be read in the context of Table 2. For Yahoo, both λ = 0.5 and λ = 1 obtain better reconstruction for OPTIMUS (275.9 and 270.8, respectively). However, this comes at the cost of significantly worse metrics for G2-F, L-F, and L-R (written as triplets): (45.2, 84.7, 97.5) for λ = 0.5 (i.e. statistically comparable to BN-VAE and significantly worse than CEAE), and (47.6, 88.2, 100.2) for λ = 1, which is worse than the other two models. For Yelp, only λ = 1 is superior, with a reconstruction of 325.8; however, this comes with a G2-F, L-F, L-R triplet of (49.2, 63.1, 84.8), again significantly worse than our CEAE result.
In adapting BN-VAE and CEAE, we note that the original training process used KL thresholding; this was kept as-is. To create smoother training, we fine-tune GPT-2 for 1 epoch, as per OPTIMUS, before beginning the main training loop for all of our models.

B Relation of MoCo to Entropy Approximation
Our analysis here follows previous theoretical analyses of contrastive learning (Wang and Isola, 2020). We also repeatedly use the property that high dimensional isotropic Gaussians are clustered around a scaled hypersphere, that is, their L_2 norm is tightly concentrated; for an exposition on these matters, we refer the reader to (Vershynin, 2018). We use the properties of isotropic Gaussians in high dimensions without further reference. We recall that we use a MoCo loss of the form (with the understanding that μ_i arises from X_i through an encoder E_θ, and correspondingly μ_i^+ through an encoder with parameters θ′):

L = −log [ exp(⟨μ_i, μ_i^+⟩) / Σ_{j≠i} exp(⟨μ_i, μ_j⟩) ]

Generally, this loss also includes the positive pair in the denominator, i.e.

L = −log [ exp(⟨μ_i, μ_i^+⟩) / (exp(⟨μ_i, μ_i^+⟩) + Σ_{j≠i} exp(⟨μ_i, μ_j⟩)) ]

Note that in the general case there is also a temperature hyperparameter τ which scales the inner products, i.e. ⟨μ_i, μ_i^+⟩ becomes ⟨μ_i, μ_i^+⟩/τ and so on. We ignore this for the sake of exposition and set τ = 1; our methods carry over to that case as well. Recall that μ_i, μ_i^+ for us arise from two distinct encoders, both of which receive X_i (or x_i, abusing notation), one with parameters θ and the other with parameters θ′ = mθ′ + (1 − m)θ, updated with m ≈ 1. Near convergence, we may assume that μ_i ≈ μ_i^+, since θ′ ≈ θ, so the numerator term is approximately determined by ∥μ_i∥. Let us now assume for the moment that ∥μ_i∥ = c, a constant. The loss then reduces to minimizing

E_i [ log Σ_{X_j ∈ X} exp(⟨μ_i, μ_j⟩) ]

where, since c is a constant, the numerator can be dropped, and the average over X_i ∈ X is just an expectation over the dataset. The term within the log is, in expectation, a kernel function (a valid kernel for probability density estimation and thus an estimator of the density); specifically, it is the (unscaled) vMF (von Mises-Fisher) kernel (Banerjee et al., 2005), which uses the cosine distance on the hypersphere. We then have

(1/|X|) Σ_{X_j ∈ X} exp(⟨μ_i, μ_j⟩) ≈ C(d, |X|) P̂(μ_i)

where C(d, |X|) is a scaling function for the kernel density dependent on the number of elements and the dimension, and P̂(μ_i) is the kernel estimate of the probability. The expectation of the log then asymptotically converges to the negative entropy −H(μ), so minimizing it maximizes the entropy. Now let us examine the cases where:
• μ_i is not of constant norm
• The sum does not include the positive term
• The convergence is not asymptotic

Varying norm: In this case, we decompose ⟨μ_i, μ_j⟩ = (1/2)∥μ_i∥² + (1/2)∥μ_j∥² − (1/2)∥μ_i − μ_j∥² inside

log Σ_{X_j ∈ X} exp(⟨μ_i, μ_j⟩)

which turns the objective into minimizing the same expression plus norm-dependent terms. But by hypothesis, all μ_i are batch-normed with fixed statistics (0 mean, 1 variance). We conclude that the new term introduced is approximately constant and cannot influence optimization.
More pressingly, we are no longer on the hypersphere, and cannot use the vMF kernel without verifying that it still applies. However, note that the vMF kernel satisfies

exp(⟨μ_i, μ_j⟩) ∝ exp(−(1/2)∥μ_i − μ_j∥²)

where the proportionality holds when ∥μ_i∥ = ∥μ_j∥ is constant. We recognize the right hand side as the Gaussian kernel (Keerthi and Lin, 2003), which is always applicable. It simply remains to ask whether the correction factor exp(−(1/2)∥μ_i∥² − (1/2)∥μ_j∥²) between the two kernels is strongly concentrated. By hypothesis, we are near convergence, i.e. μ_i is approximately Gaussian distributed. We know that high dimensional Gaussians are closely approximated by the uniform distribution on the hypersphere - that is, ∥μ_i∥² is strongly concentrated around a constant (1) - and thus, since the function in question is Lipschitz over the domain, exp(−(1/2)∥μ_i∥² − (1/2)∥μ_j∥²) also concentrates in measure.
Non-inclusion of the positive term: If we instead have

Σ_{j≠i} exp(⟨μ_i, μ_j⟩)

then this is a leave-one-out evaluator of the kernel density (Barnard, 2010). It is well known that in this case the sum (up to some scaling, and with a different bandwidth than the original sum) again approximates the entropy, but with an error equal (in expectation) to the generalization error. Hence, the two analyses do not differ asymptotically.
Non-asymptotic convergence: For this case, we bring back the temperature term and assume it is set "correctly". Asymptotic results guarantee convergence for all temperatures, but in the non-asymptotic domain this case is only analyzable in the low temperature limit, because in the non-low-temperature scenario we face

E_i [ log Σ_{j≠i} exp(⟨μ_i, μ_j⟩/τ) ]

We recognize that the sum (after E log) is of a random variable which is the exponential of a Gaussian when μ_i, μ_j are isotropic Gaussians: ⟨μ_i, μ_j⟩ is the scaled projection of μ_j on a fixed random vector μ_i, hence a one-dimensional Gaussian, making the sum a sum of log-normal random variables. While this can be approximately handled via methods such as Fenton-Wilkinson moment matching, it is far less clean than the case of low τ.
Instead, consider τ ≪ 1. We have that

log Σ_j exp(⟨μ_i, μ_j⟩/τ)

is approximately equal to

max_j ⟨μ_i, μ_j⟩/τ

We recall the definition of the Kozachenko-Leonenko estimator: for a sample X_1, X_2, ..., X_{N+1} ∈ R^d drawn from an unknown distribution P, assuming N > 1, with R_i = min_{j≠i} ∥X_i − X_j∥_2, Y_i = N(R_i)^d, B_d the volume of the unit ball in R^d, and γ the Euler-Mascheroni constant ≈ 0.577, an estimate of the entropy of the distribution is:

Ĥ = (1/(N+1)) Σ_{i=1}^{N+1} log Y_i + log B_d + γ

We do not need to calculate B_d and γ, as they are constants per instance. Rather, we observe that

log Y_i = log N + d log R_i

Changing notation from X_i to μ_i to unify our derivations, note that R_i is attained at the lowest value of ∥μ_i − μ_j∥_2, i.e. at the highest value of ⟨μ_i, μ_j⟩ if ∥μ_i∥, ∥μ_j∥ are constants. Since we do not care about scaling or constant shifts (recall that optimizing a function f is equivalent to working with λf + c for any λ > 0 and any constant c), we can express log R_i, for unit-norm codes and with j* attaining the minimum, as

log R_i = (1/2) log ∥μ_i − μ_{j*}∥² = (1/2) log (2 − 2⟨μ_i, μ_{j*}⟩)

Take the 2 common; it comes out of the log, as log(ab) = log a + log b, and such constants do not matter, since optimizing f is equivalent to working with a linear transform of f up to the learning rate. This leaves us with

arg min_j (1/2) log(1 − ⟨μ_i, μ_j⟩)

Finally, note that μ_i, μ_j are high dimensional isotropic Gaussians of dimension d; thus ⟨μ_i, μ_j⟩ is a zero mean univariate Gaussian of variance 1/d, i.e. with high probability |⟨μ_i, μ_j⟩| ≪ 1. Applying the identity that for x ≪ 1, log(1 + x) ≈ x, we can replace the above term with

arg min_j −(1/2)⟨μ_i, μ_j⟩

Swapping argmin for argmax and flipping the sign, we finally obtain

arg max_j ⟨μ_i, μ_j⟩

which is (up to scaling by 2/τ) the desired non-asymptotic approximation.
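Summarizing this appendix, the chain of approximations linking the MoCo loss to the entropy can be written compactly (a restatement of the argument above, not an additional result; P̂ is the vMF kernel density estimate):

```latex
\mathbb{E}_i\!\left[\log \sum_{j \neq i} e^{\langle \mu_i, \mu_j \rangle}\right]
\;\approx\; \mathbb{E}_i\!\left[\log\!\big(C(d,|\mathcal{X}|)\,\hat{P}(\mu_i)\big)\right]
\;\xrightarrow{\;|\mathcal{X}|\to\infty\;}\; -H(\mu) + \text{const},
\qquad
\log \sum_{j\neq i} e^{\langle \mu_i, \mu_j \rangle / \tau}
\;\xrightarrow{\;\tau \ll 1\;}\; \max_{j\neq i} \frac{\langle \mu_i, \mu_j \rangle}{\tau}.
```

That is, minimizing the contrastive denominator maximizes H(μ) asymptotically, and in the low-temperature regime it reproduces the nearest-neighbor repulsion of the Kozachenko-Leonenko estimator.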

C Extra Results - Penn Tree Bank
Here, we add results on the Penn Tree Bank dataset in Table 11. Only a subset of models, those that do not use fixes specific to the LSTM architecture, were considered, since we had to adapt the methods to transformers. Our base model is Transformer-XL.

D GPT-4 validation
We asked GPT-4 to validate the choices generated by our models, using the following prompt: "You are a human asked to choose between more realistic sentences. Among the following sentences, which is the most consistent and high quality semantically, grammatically, and linguistically?". This yielded the following results. Note that GPT-4 showed an even stronger preference for our model than the human evaluators did.

E BERT validation
We randomly selected 1000 sentences from the Yelp and Yahoo corpora. This yields a total of (1000 × 999)/2 pairs of sentences, each of which yields an inner product similarity. The corresponding similarities were computed from BERT (Devlin et al., 2018). We can then compare the quality of embeddings from BN-VAE, VAE, and CEAE against BERT over three choices of correlation metrics: Pearson (linear), Spearman (rank-based), and Kendall (pairwise). The results are as follows.
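A sketch of the correlation computation with scipy (the construction of the two flattened similarity vectors is assumed done beforehand):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def embedding_agreement(sim_model, sim_bert):
    # Correlate pairwise sentence similarities from an autoencoder's
    # latent space with BERT's, over all (1000 x 999)/2 pairs.
    sim_model, sim_bert = np.asarray(sim_model), np.asarray(sim_bert)
    return {
        "pearson": pearsonr(sim_model, sim_bert)[0],
        "spearman": spearmanr(sim_model, sim_bert)[0],
        "kendall": kendalltau(sim_model, sim_bert)[0],
    }
```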

F The Layer norm case
Under layer normalization (Ba et al., 2016), the normalization is done not over a minibatch but over a layer. A consequence is that the resulting latents form a hyperspherical space, i.e. ∥z∥ = c for a constant c. Now, we know that the solution of

arg max H(z) subject to ∥z∥ = c

is the uniform distribution over the hyperspherical shell of radius c. It is also well known that in high dimensions, the multivariate isotropic Gaussian has almost all of its support concentrated near such a hyperspherical shell (Vershynin, 2018), so the LayerNorm constraint remains compatible with our goal of a near-Gaussian aggregate posterior.

Table 1 :
Language modeling. Reconstruction log loss (Rec), forward perplexity with GPT-2 (GPT2-F), forward perplexity with LSTM (L-F), and reverse perplexity with LSTM (L-R), in that order, on Yahoo and Yelp. Reverse perplexity with GPT-2 was not meaningful, as the fine-tuning has little effect given the large corpus already seen in pretraining. Statistical analysis (standard deviations, confidence intervals, etc.) appears in the appendix. * indicates that the best method's advantage is not statistically significant.

Table 2 :
Performance of transformer autoencoders on Yahoo and Yelp, evaluated using the same metrics.

Table 3 :
Accuracy on Yelp - downstream task performance, with a small MLP trained on a fixed number of labeled samples. * indicates statistical insignificance.

Table 4 :
Frequency of choice among generated samples among LSTM models.

Table 5 :
Frequency of choice among generated samples among transformer models.

Table 6 :
Interpolation between different reviews using LSTM models, with weights (0.8, 0.2) and (0.2, 0.8) respectively for the first/second interpolations relative to the source sentences. The estimated review score, calculated by the classifier used to compute the accuracy in Table 3, changes in a manner matching the text. Sentences tagged "Source" are original sentences retrieved as-is from the corpus.

Table 7 :
Frequency of choice among generated samples among LSTM models by GPT-4.

Table 8 :
Frequency of choice among generated samples among transformer models by GPT-4.