Towards Robust and Semantically Organised Latent Representations for Unsupervised Text Style Transfer

Recent studies show that auto-encoder based approaches successfully perform language generation, smooth sentence interpolation, and style transfer over unseen attributes using unlabelled datasets in a zero-shot manner. The latent space geometry of such models is organised well enough to perform well on datasets where the style is "coarse-grained", i.e. a small fraction of the words in a sentence alone are enough to determine the overall style label. A recent study uses a discrete token-based perturbation approach to map "similar" sentences ("similar" defined by low Levenshtein distance/high word overlap) close by in latent space. This definition of "similarity" does not look into the underlying nuances of the constituent words and therefore fails to distinguish sentences with different style-based semantics while mapping latent neighbourhoods. We introduce EPAAEs (Embedding Perturbed Adversarial AutoEncoders), which complete this perturbation model by adding a finely adjustable noise component on the continuous embedding space. We empirically show that EPAAEs (a) produce a better organised latent space that clusters stylistically similar sentences together, (b) perform better than their counterparts on a diverse set of text style transfer tasks, and (c) are capable of fine-grained control of style transfer strength. We also extend the text style transfer task to NLI datasets and show that these more complex definitions of style are learned best by the EPAAE. To the best of our knowledge, extending style transfer to NLI tasks has not been explored before.


Introduction
The Text Style Transfer (TST) task is a form of controlled language generation. The goal is to produce fluent style-altered sentences from a given base sentence while preserving its style-independent content. The definition of "style" depends on the class labels of the end task. Our work focuses on the unsupervised scenario, i.e. training on completely unlabelled corpora. By inducing latent space organization through input perturbation, TST can be performed using a simple vector arithmetic method (discussed in Section 6).
Background. Several well-known architectures use auxiliary objectives that serve as regularizers to ensure that the latent space geometry is smoothly interpolatable and learns high-level semantic features (refer to Appendix D). Inspired by the success of denoising approaches in vision (Creswell and Bharath, 2019; Vincent et al., 2008), we look at input perturbation based approaches for unsupervised style transfer. Unlike in text, in vision there is freedom to finely control the degree of Gaussian "blur" on the continuous input image space. We conjecture that this notion of "controlling blur" may be useful in the text domain as well, and this serves as the motivation for our chosen model of input perturbation. Shen et al. (2020) map "similar" sentences (defined by low Levenshtein distance/high word overlap) together in the latent space by introducing a simple denoising objective over an underlying Adversarial Autoencoder (AAE) (Makhzani et al., 2015). Their noise model applies simple token manipulations (token dropout and substitution) with some probability p, and the model is trained to reconstruct the original sentences from the perturbed inputs. Intuitively, discrete word dropout gives sentences with high word overlap (or low Levenshtein distance) a higher chance of being perturbed into one another, thereby mapping them close together in the latent space during training. However, this method also allows stylistically dissimilar sentences, albeit with high word overlap, to be mapped together in the latent space. Idea. We argue that this negatively impacts the quality of the final latent geometry and the results of the subsequent style transfer task. As an alternative, we explore Embedding Perturbation, where a noise vector is sampled, appropriately scaled and added to the embeddings of each input sentence, such that the resultant embeddings are constrained to live inside an E-dimensional hypersphere. The radius of this hypersphere is controllable using a tunable hyperparameter ζ.
This hypersphere constraint is partially inspired by concepts in adversarial robustness and ensures that no noised word embedding is altered to the extent of changing the underlying semantics of the sentence. We argue that this allows sentences with stylistically similar constituent embeddings to be mapped together, encouraging the formation of style-preserving latent neighbourhoods (more on this in Section 4). The resulting latent representations from "embedding-perturbed" autoencoders are consequently better semantically organized and stylistically robust, enabling us to perform TST using an inductive method, i.e. simple vector arithmetic on latent vectors. Contributions. We show that this extended model of input perturbation, with both discrete and continuous components, allows for overall better quality text style transfer, particularly in its ability to preserve style-independent "content" information. To expand on traditional definitions of style such as polarity and formality, which are based on simple "intra-sentence" attributes, we introduce the "Discourse Style Transfer Task" by repurposing NLI datasets, in which the flow of logic between sentences is captured using "Entailment", "Contradiction" and "Neutral" labels. This enables interesting applications such as discourse manipulation, in which a pair of sentences agreeing with each other can be made to contradict, and vice versa. We also test our model on the fine-grained styles present in the Style-PTB dataset. We empirically show that our model performs best on a diverse set of datasets with styles ranging from coarse-grained (like sentiment) to fine-grained (like tenses) and complex inter-sentence styles (like discourse or flow of logic).

Related work
Seminal work in TST. Autoencoder based approaches for TST on labelled non-parallel datasets are quite well explored (Shen et al., 2017; Fu et al., 2018). Some techniques involve implicit style-content disentanglement of the latent space using back translation (Prabhumoye et al., 2018) and adversarial learning (John et al., 2019). Li et al. (2018) achieve disentanglement using simple keyword replacement. Most studies look at simple non-parallel classification datasets, defining their style to be the class label. Other studies look at syntax-semantics disentanglement of the latent space (Chen et al., 2019; Bao et al., 2019). Similar to our work, the Latent-Noised AAE (LAAE) of Rubenstein et al. (2018) instead adds Gaussian perturbation to the latent encodings to promote organization; a λ_1 penalty is imposed on the log variance of the perturbations to prevent them from vanishing. Xu et al. (2020) mitigate the latent vacancy problem of the β-VAE by introducing auxiliary losses, providing one of the earliest methods for unsupervised TST. Other unsupervised work uses a language model as a discriminator for a richer feedback mechanism (Yang et al., 2018), increasing performance in word substitution decipherment, sentiment modification and related-language translation. More seminal work related to autoencoders in the context of style transfer is mentioned in Appendix D. Contemporary work in TST. More recent work treats the style transfer problem as paraphrase generation and fine-tunes pre-trained language models (Krishna et al., 2020). Malmi et al. (2020) train masked language models (MLMs) on the source and target domains to identify input tokens to be removed and replace them with tokens from the target MLM in an unsupervised manner. Liu et al.
(2020) use gradient-based update rules in the continuous latent space z from style and content predictor networks, enabling the transfer of fine-grained styles without an adversarial approach. Reid and Zhong (2021) perform TST on sentiment and politeness datasets using token editing methods (similar to Li et al. (2018)) based on Levenshtein editing operations. Lee et al. (2021) also focus on enhancing content preservation, introducing a method to remove style at the token level using reverse attention and to fuse this content representation with style via a conditional layer normalization technique. Riley et al. (2021) adapt the T5 model (Raffel et al., 2020) for few-shot text style transfer by extracting a style representation and performing conditioned decoding, using only a handful of examples at inference time.

Method
The underlying language model is a generative auto-encoder which models an input distribution p(x) assumed to arise from an underlying latent distribution p(z). A deterministic encoder E, in the form of an RNN, represents the distribution q(z|x); its output may be reparameterized by a distribution z_i ∼ N(μ_i(x), σ_i(x)) to give the aggregated posterior distribution q(z). Various auxiliary loss functions are used to encourage the learnt aggregated posterior q(z) to match p(z). The generator G, representing p(x|z) and also in the form of an RNN, decodes a sample from q(z) back into its corresponding input from p(x). Gradient descent is applied on the reconstruction loss of the autoencoder, given by

L_rec(θ_E, θ_G) = E_{x∼p(x)}[ −log p_G(x | E(x)) ]

We use the AAE (Makhzani et al., 2015) as our choice of underlying generative autoencoder over which the input perturbation techniques are applied. The AAE uses a discriminator D to enforce q(z) to match p(z), a standard Gaussian, by learning to distinguish between samples from the two distributions. This adversarial loss serves as a regularizer for the latent space, giving it the ability to organize itself better for smooth sentence interpolation.
The final min-max objective is the sum of the reconstruction loss (given above) and the λ-weighted adversarial loss:

min_{E,G} max_D  L_rec(θ_E, θ_G) − λ L_adv(θ_D),  where L_adv(θ_D) = E_{z∼p(z)}[ log D(z) ] + E_{x∼p(x)}[ log(1 − D(E(x))) ]

We found empirically that AAEs performed well and were stable during training. On the other hand, β-VAEs required careful tuning of the β hyperparameter to prevent posterior collapse and did not perform as well.
3.1 Finely-controlled continuous noise on embedding space

Figure 2: An abstract representation of the continuous embedding noise approach for two arbitrary embeddings e_i and e_{i+1} on an embedding space E ∈ R^2.

To further organize the latent geometry of the underlying AAE to encode style-based semantic similarity among sentences, we propose a perturbation-based approach on the continuous embedding space. The word embedding, of dimensionality E, of each input token x_i is denoted e_i. Consider an input sentence of length l containing tokens x_0, ..., x_{l−1} with embeddings e_0, ..., e_{l−1} respectively. Our objective is to blur every embedding vector e_i by adding an appropriately scaled noise vector n_i, producing a noised embedding ẽ_i that does not lie so far from e_i as to completely change the underlying semantics of the word. We do this by ensuring that each ẽ_i lives inside an E-dimensional hypersphere. The centre of the hypersphere is the original embedding e_i and its radius is defined as |e_i| * s_i, where s_i ∈ R is a random variable sampled from a distribution P(s). P(s) is a probability distribution of the form Y(μ = 0, σ), where Y is some arbitrarily chosen distribution and σ is a function of the hyperparameter ζ ∈ R. This distribution P(s), parameterized by (Y, ζ), models the probability density cloud inside the embedding hypersphere, controlling its radius and interior density. Figure 2 summarizes the aforementioned embedding perturbation mechanism for a simple example with only two individual word embeddings. In practice, we use a vectorized representation of the above mechanism to blur the embeddings of an entire mini-batch of sentences at once. The embedding perturbation method is summarized below in vectorized form:

s ∼ Y(μ = 0, σ),   r = |e| ⊙ s,   n = (u / |u|) ⊙ r  with u ∼ N(0, I),   ẽ = e + n

where ẽ, s, n and e are the vectorized representations of ẽ_i, s_i, n_i and e_i respectively for a mini-batch of sentences, and u is a random direction vector. ⊙ and * denote element-wise and scalar multiplication respectively, and |x| denotes the magnitude of a vector (or batch of vectors) x. r is the vectorized mini-batch representation of r_i, the expected radius of hypersphere i. We choose the Gaussian distribution as our choice of the probability distribution Y, as on testing it produced the best results†. To couple the radius r_i of hypersphere i to our hyperparameter ζ and enable fine-grained control using ζ, we constrain the probability density of Y to lie inside the hypersphere within three standard deviations. To achieve this, we equate 3σ to ζ, and consequently set the variance σ^2 of Y to (ζ/3)^2.
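As a concrete illustration, the vectorized perturbation can be sketched in NumPy as follows. This is a hedged sketch rather than the released implementation: taking |s_i| as the radius scale and filling the sphere's interior with a uniformly sampled radius fraction are our assumptions.

```python
import numpy as np

def perturb_embeddings(e, zeta, rng):
    """Blur a batch of token embeddings inside per-token hyperspheres.

    e:    array of shape (batch, length, E) -- token embeddings
    zeta: hyperparameter controlling the hypersphere radius; sigma = zeta / 3,
          so ~99.7% of the density of Y lies within the sphere
    Returns noised embeddings of the same shape.
    """
    sigma = zeta / 3.0
    # radius scale s_i per token, sampled from Y = N(0, sigma^2);
    # taking |s_i| (our assumption) keeps the radius non-negative
    s = np.abs(rng.normal(0.0, sigma, size=e.shape[:2]))       # (batch, length)
    r = np.linalg.norm(e, axis=-1) * s                         # radii |e_i| * s_i
    # random unit directions in R^E
    u = rng.normal(size=e.shape)
    u /= np.linalg.norm(u, axis=-1, keepdims=True) + 1e-12
    # place the noise inside the sphere (uniform radius fraction -- assumption)
    frac = rng.uniform(0.0, 1.0, size=e.shape[:2])
    n = u * (r * frac)[..., None]
    return e + n
```

With ζ = 0 the noise vanishes and the embeddings are returned unchanged, which is what makes the blur finely controllable.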

Discrete word dropout probability
The "Denoising Autoencoder" or DAAE (Shen et al., 2020) considered discrete token-level perturbations such as token masking, substitution and deletion. We consider token deletion as discrete noise. Drop probability p = 0.3 is found to be the best in both their experiments and ours.
Token deletion is the only type of input perturbation that can alter the number of tokens in a sentence; no continuous model of input noise can mathematically generalize the effects of this kind of discrete word dropout. Furthermore, during our experiments we find that for some datasets both discrete and continuous noise components are required to produce the overall best model. In such cases, we first perform token deletion and then subject the remaining token embeddings to embedding perturbation. We refer to this generalized noise model, parameterized by ζ and p, as the "Embedding Perturbed Adversarial Autoencoder" or "EPAAE".

† We also tried the uniform distribution with varying values of ζ.

Semantic Similarity in Latent Space Neighbourhoods
We contrast and compare the resultant latent space neighbourhoods of DAAEs and EPAAEs.

Preliminary reasoning
Here, we first investigate the question: does token deletion during training truly group semantically similar sentences together in the latent space Z?
Intuitively we can reason that the answer may be in the affirmative if the drop probability is small. For example, the sentences "The food was good" and "The food was superb" might get perturbed into a common version, i.e. "The food was" and therefore be mapped nearby in Z during training. As analysed and concluded in (Shen et al., 2020), latent neighbourhoods in the latent space of DAAEs successfully cluster sentences with high word overlap (low Levenshtein/Edit distance) together. However, the Levenshtein distance metric is not an accurate measure of the true semantic similarity between sentences. A pair of sentences with high word overlap might convey different ideas w.r.t the underlying style-based semantics of the dataset. For example, in the context of the Yelp dataset, the sentences "The food was good" and "The food was bad" are stylistically opposing (style being polarity in this case) and yet still get mapped close by in Z for DAAEs.

Testing the hypothesis with a Toy Dataset
We conduct specifically curated experiments on a synthetic dataset to verify our hypothesis that EPAAEs map stylistically similar sentences together. Details of the Toy dataset: Inspired by Yelp, each sentence in this dataset represents either a positive or a negative sentiment. It also contains sentiment-independent components, such as the identity of the person and the subject of the review. Each sentence is of the format "The <identity_token> said the <subject_token> is <decision_token>", where the identity, subject and decision token classes are the only variable parameters in each sentence. The set of all permutations of these token classes forms the dataset. For example, the decision class is further subdivided into two subclasses, i.e. positive/negative sentiment, as shown in Table 1. We produce output labels for each sentence using a 3-bit representation, one bit for each of the three token classes, where the value of each bit represents the token label subclass within that class. For example, a label of 5 is encoded as 101, corresponding to a sentence with Female, Food and Negative labels. There are 2^3 = 8 labels in total, labelled 0-7. Qual. Analysis of Latent space. We consider two models: DAAE with token deletion probability p = 0.3 and EPAAE with ζ = 2.5. Both models are initially pre-trained over the unlabelled Yelp dataset; the synthetic dataset is used at inference time only. We encode all 1950 sentences into their respective latent space vectors and use t-SNE plots to visualise the latent space of each model (Figure 1). We pick a random query sentence, e.g. "The man said the pasta is spicy", and encode it into Z. We then retrieve the top five nearest encodings to the query and observe their decoded outputs to check for the preservation of style-based semantics around the query (Table 2). We find that the EPAAE maps the latent neighbourhood such that stylistically similar sentences (positive/negative sentiment in this context) are grouped together.
This is not the case with the DAAE, evident from Table 2 in which neighbours 1 and 3 have a differing sentiment from the query. This offers an explanation as to why the DAAE is not able to produce tightly confined clusters.
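The toy dataset's 3-bit labelling scheme can be sketched as follows (the bit order is inferred from the "Female, Food, Negative → 101 = 5" example above and should be treated as an assumption):

```python
def encode_label(identity_bit, subject_bit, decision_bit):
    """Pack the three token-class subclasses into a single label in 0..7.

    e.g. identity = Female (1), subject = Food (0), decision = Negative (1)
    gives the binary string 101, i.e. label 5.
    """
    return (identity_bit << 2) | (subject_bit << 1) | decision_bit
```

Iterating each bit over {0, 1} enumerates all eight labels, matching the 2^3 = 8 classes described above.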
Quant. Analysis of Latent space. To further validate our hypothesis that EPAAEs better preserve style-based semantics in latent neighbourhoods, we generate quantitative metrics over a k-nearest-neighbours experiment, done on the test split of each dataset. We document these metrics in Table 3. Column 3 conveys the mean distance to the closest neighbour with the opposite label; Column 4 conveys the mean number of hops to reach the closest neighbour with the opposite label. In all but a few datasets, the DAAE reports a smaller "mean hops for label flip", supporting our hypothesis that it maps stylistically dissimilar sentences closer together in the latent neighbourhood than our proposed model does. We see that this trend mostly holds for the "Mean L2 Norm for Label Flip" metric in Column 3 as well, providing further evidence for our hypothesis.
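Both Table 3 metrics can be computed with a brute-force nearest-neighbour pass; a minimal sketch (our reconstruction of the procedure, not the authors' code):

```python
import numpy as np

def label_flip_stats(Z, labels):
    """For each latent vector, walk its neighbours in order of L2 distance
    until the label flips; report mean hop count and mean L2 distance."""
    hops, dists = [], []
    for i in range(len(Z)):
        d = np.linalg.norm(Z - Z[i], axis=1)
        for hop, j in enumerate(np.argsort(d)[1:], start=1):  # skip the point itself
            if labels[j] != labels[i]:
                hops.append(hop)
                dists.append(d[j])
                break
    return float(np.mean(hops)), float(np.mean(dists))
```

Larger values on both statistics indicate that same-label sentences form tighter, purer neighbourhoods in Z.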

Experimental Setup
Baselines. This work focuses on simple denoising approaches for unsupervised TST, which constrains our choice of baselines accordingly. We consider three other autoencoder-based models for our experimentation: the Denoising AAE (DAAE), the Latent-Noised AAE (LAAE) and the β-VAE.
Hyperparameters and Setup. Details on hyperparameter selection can be found in Appendix A.2. Other common hyperparameters (detailed in Appendix A.1) related to the encoder/decoder architectures remain identical across all models. Training is completely unsupervised; labels are only used at inference time. Details on computation time, number of parameters and infrastructure used can be found in Appendix E.

Datasets
In this section, we briefly describe the different kinds of datasets used for experimentation. Further details are provided in Appendix C.
Complexity of Styles in Datasets: Current studies mainly focus on high-level styles to validate their approaches. To validate our hypothesis that the EPAAE can perform fine-grained style transfer due to the semantics learnt from embeddings, we consider three tasks: sentiment, discourse and fine-grained text style transfer. As preprocessing, we remove non-essential special characters and lowercase all sentences. Except for the Yelp dataset, no pruning is done based on sentence length. Sentiment Style Datasets: We use the preprocessed version of the Yelp dataset from Shen et al. (2017). The sentiment labels (positive, negative) are considered as the style.
Discourse Style Datasets: To check for the model's ability to alter the discourse or flow of logic between two sentences we make use of NLI datasets such as SNLI, DNLI, and Scitail. Each instance in the SNLI dataset (Bowman et al., 2015) consists of two sentences that either contradict, entail (agree) or are neutral towards each other. Similarly, the DNLI dataset (Welleck et al., 2019) consists of contradiction, entailment and neutrality labelled instances. Scitail (Khot et al., 2018) is an entailment dataset created from multiple-choice science exams and the web, in a two-sentence format similar to SNLI and DNLI. The first sentence is formed from a question and answer pair from science exams and the second sentence is either a supporting (entailment) or non-supporting (neutrality) premise.
Fine-grained Style Datasets: The Style-PTB dataset (Lyu et al., 2021) consists of 21 styles/labels with themes ranging from syntax, lexical, semantic and thematic transfers as well as compositional transfers which consist of transferring more than one of the aforementioned fine-grained styles. To check whether the EPAAE can capture fine-grained styles better by leveraging its better organised latent space, we make use of three styles i.e. Tenses, Voices (Active or Passive) and Syntactic PP tag removal (PPR). In the Tenses dataset, each sentence is labelled with "Present", "Past", and "Future". The Voices dataset contains "Active" and "Passive" voices labels and the PPR dataset contains "PP removed" and "PP not removed" labels.

Automatic Evaluation metrics
Evaluation for text style transfer includes checking for a) style transfer accuracy, b) content preservation and c) fluency of output sentences. Recent studies show that automatic evaluation metrics are still an open problem and can be gamed (Xu et al., 2020).
Style Transfer Measure: A pre-trained classifier is used to check for the presence of the target label in the output sentence. Mir et al. (2019) introduce the notion of measuring style transfer intensity beyond the mere presence of the target label. While we find this notion intriguing, we wish to first accomplish the style transfer task convincingly on the current set of tasks before adopting a more complex metric.
Content Preservation: In recent work, we observe that models typically struggle more with content preservation, i.e. the ability to preserve the contextual meaning of the base sentence. The BLEU score alone does not correlate strongly enough with actual qualitative results. To properly validate our hypothesis that EPAAEs better preserve content, we augment the bucket of standard content preservation metrics‡. Apart from the standard BLEU between sentences of the source and target styles, we borrow evaluation techniques from fields adjacent to text style transfer, such as machine translation and text summarization: METEOR (Banerjee and Lavie, 2005), ROUGE-L (Lin, 2004) and CIDEr (Vedantam et al., 2015), which have been shown to correlate more strongly with human judgement. Following the study of Sharma et al. (2017), which shows that BLEU does not necessarily correlate with human evaluations in dialogue response generation, we also adopt the Embedding Average, Vector Extrema (Forgues et al., 2014) and Greedy Matching (Rus and Lintean, 2012) scores.
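As one illustration from this bucket, the Embedding Average metric compares the mean word vectors of the source and transferred sentences by cosine similarity. A minimal sketch, where the embedding lookup table is a hypothetical stand-in for real pre-trained vectors:

```python
import numpy as np

def embedding_average_sim(src_tokens, out_tokens, emb):
    """Cosine similarity between the mean word embeddings of two sentences.

    emb: dict mapping token -> embedding vector (hypothetical lookup table).
    """
    a = np.mean([emb[t] for t in src_tokens], axis=0)
    b = np.mean([emb[t] for t in out_tokens], axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

Unlike BLEU, this rewards outputs whose words are semantically close to the source even when no tokens overlap exactly.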
Fluency of generations: Past work measures perplexity using a pre-trained language model to gauge the fluency or grammatical correctness of the style-transferred outputs. Mir et al. (2019) argue that such perplexity calculations may not necessarily correlate with human judgement of fluency for style transfer tasks. Instead, adversarial classifiers in the form of logistic regression networks are trained to distinguish between human-produced and machine-produced sentences, and are then used to score the naturalness of the output sentences. We follow this metric in our evaluation of fluency or naturalness.§

Experiments
In this section, we look at the quantitative and qualitative results of the text style transfer task for four autoencoder models on seven datasets. We use the vector arithmetic method on latent space representations, inspired by Mikolov et al. (2013), who showed that learnt word embeddings can capture linguistic relationships through simple vector arithmetic. Analogous to the standard example where "King" − "Man" + "Woman" ≈ "Queen", we manipulate an arbitrary sentence encoding z_x of Style X to Style Y¶:

z_y = z_x + k * ( (1/N_y) Σ_{i=1}^{N_y} z_y^i − (1/N_x) Σ_{i=1}^{N_x} z_x^i )

where z_y^i and z_x^i denote the latent vectors of the i-th sentence in styles y and x respectively, and N_x and N_y represent the number of encodings present in the corpus for styles x and y respectively. k is a scaling parameter used to control the style transfer strength. § We use code and pretrained models from https://github.com/passeul/style-transfer-model-evaluation to measure naturalness. ¶ Simply put, the difference of the means of all latent vectors in Styles Y and X is computed, scaled by a factor k, and added to an arbitrary latent vector of style X to convert it to style Y.
The resultant latent vector is passed through the decoder to produce the output sentence.
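The transfer step itself reduces to a few lines of mean-difference arithmetic; a sketch with illustrative names (the encoder and decoder are assumed to exist elsewhere):

```python
import numpy as np

def style_transfer(z_x, Z_x, Z_y, k=1.0):
    """Move an encoding z_x of style X towards style Y by adding the
    k-scaled difference of the per-style mean latent vectors.

    Z_x, Z_y: arrays of shape (N_x, d) and (N_y, d) holding all corpus
    encodings of styles X and Y respectively.
    """
    direction = Z_y.mean(axis=0) - Z_x.mean(axis=0)
    return z_x + k * direction
```

Larger k pushes the result further along the style direction, which is what enables the fine-grained control of transfer strength examined in the qualitative analysis.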

Quantitative Analysis
TST accuracy, content preservation and naturalness were computed on the converted sentences (shown in Tables 4, 5 and 6). Content preservation metrics can be found from Column 4 onwards. We consider two versions of the EPAAE, i.e. continuous embedding noise only, and continuous embedding noise combined with token deletion, and find that in some cases a mixture of both is required for optimal performance. We find that a slightly lowered value of p for the EPAAE, combined with its optimal ζ parameter, outperforms the other models as well. Table 4 summarises the results of TST on the Yelp dataset. On visual inspection, there appears to be a general tradeoff between TST% and content preservation metrics. For example, the β-VAE achieves the best TST% but suffers from poor content preservation. In this case, we observe that EPAAE (ζ = 2.0, p = 0.1) has the best content preservation capabilities across all metrics while achieving a reasonable tradeoff of TST% = 77.1. Table 5 summarises the results on four NLI datasets. Similar to the sentiment task, we see that EPAAEs have the best content preservation capabilities as well as the best naturalness metric, while achieving a comparable TST%. The β-VAE shows the best TST% but again suffers in content preservation. DAAEs also display reasonable TST% vs content preservation tradeoffs but overall cannot match those achieved by EPAAEs. Human evaluations on the SNLI dataset (Table 8) confirm this as well. The TST% achieved by any model on any NLI task peaks at only 60.4%, compared to 81.9% on the sentiment task, a significant difference hinting that the Discourse TST task is intrinsically more complex than sentiment style; future work aiming to increase performance on the NLI task would be beneficial. On the fine-grained Style-PTB datasets (Table 6), transfer difficulty can be ranked by the Hamming distance between the base and converted sentence. According to this hierarchy, Tense inversion, PP removal/addition and Voice change are labelled in ascending order as easy, medium and hard respectively.
Our results seem to partially validate this observation, in that the maximum TST% is obtained on the Tenses dataset (100%). Generally speaking, we also observe that TST performs much better on the fine-grained Style-PTB datasets than on sentiment and discourse styles. As before, the EPAAE shows the best content preservation at competitive values of TST%. It is also worth considering the direction of style transfer, particularly for complex styles such as the Discourse styles present in NLI datasets; results and analysis on direction-specific metrics for Discourse TST are presented in Appendix B.2.2.

Qualitative Analysis
Samples of Output. For qualitative analysis, sample outputs by DAAE and EPAAE for the Yelp, SNLI and Tenses datasets are given in Table B.9 (Appendix) and Table 7 (SNLI). Smooth Interpolation. Sentence interpolation experiments are reported in Appendix B.1, in which latent space points along a specific interpolation direction are decoded to gauge the smoothness of the space and its ability to generate coherent sentences.

Human Evaluations
Each human annotator was given a set of base sentences and asked to vote for which model produced the most appropriate corresponding style-inverted sentences. Please refer to Appendix A.3 for full details of the setup. Results are shown in Table 8. We observe that the proposed EPAAE model was preferred overall across all three chosen datasets; this margin was most significant in the case of the SNLI dataset.
[Table: sample SNLI style-inverted outputs at transfer strengths k = −1, −1.5, −2, −2.5 and −3, e.g. for the base pair "a man enjoys some extravagant artwork . a man is making art".]

Conclusion
We introduce the "Embedding Perturbed AAE" or EPAAE and show that, compared to its baselines, it best captures underlying style-based semantic features in the latent space in an unsupervised manner. By inducing robust latent space organization through embedding perturbation, we also demonstrate the possibility of fine-grained TST, where we can control the strength of the target style. Using a diverse set of datasets with varying formulations and complexities of style, we empirically show that the EPAAE performs best overall on the text style transfer task, particularly in its ability to preserve style-independent content across all datasets.

Future Work and Limitations
Regarding work in TST. We wish to augment existing state-of-the-art methods with embedding perturbation to check whether doing so aids performance. We also see degrading TST performance on the Entailment-to-Contradiction task across all models; future work will focus on methods to improve this task.
Generally. It would also be interesting to further analyse the effects of embedding perturbation on latent representations and their resultant properties. A theoretical analysis would help cement the use of embedding perturbation in a more general setting. More important questions also remain to be answered, e.g. "What if continuous perturbation is applied to hidden states instead of embeddings?" and "What is the relation between this type of perturbation and techniques like dropout?". We wish to explore these questions in future work.

Ethics Statement
Any TST model can be used for nefarious purposes, e.g. performing a "non-toxic to toxic" modification of text in a real-world setting and causing social harm. Therefore, it is important to keep in mind a code of ethics (e.g. https://www.acm.org/code-of-ethics) for usage, research and development in this type of research.
We have made all our code open-source and provided all details of experimentation and implementation to the best of our knowledge.

Acknowledgements
We would like to thank the reviewers, whose feedback we believe substantially improved the quality of this work. We would also like to thank the human annotators for their participation.

A.1 Architecture Details

All baseline models were trained with identical underlying architectures, differing only in their individual objective losses. Bi-directional GRUs were used for the encoder and decoder, with input embeddings of size 300, a hidden representation of size 256 and a latent space of size 128. For all models using AAEs as the underlying autoencoder, the discriminator was a single-layered perceptron with 512 units. The ADAM optimiser was used with β_1, β_2 set to 0.5, 0.999 and a learning rate of 0.001. All models were trained for 30 epochs, as further training caused the reconstruction loss to dominate and decrease overall performance on the TST task. All input perturbations were disabled at inference time.

A.2 Hyperparameter Selection
For the hyperparameters p, λl and β of the DAAE, LAAE and β-VAE, we fixed the values at 0.3, 0.05 and 0.15 respectively. This decision was aligned with the results in Shen et al. (2020), which showed that these values produced the best reconstruction vs BLEU trade-off; we found this to be the case during subjective manual testing as well. λadv was set to 10 for all models having AAEs as the underlying architecture. For EPAAE, we found that ζ ∈ [2.0, 3.0] showed the best results overall across all datasets, so a manual search around this range was conducted to determine the optimal ζ for each dataset.
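The continuous embedding perturbation controlled by ζ can be sketched as additive Gaussian noise on the token embeddings. This is a minimal illustration assuming an isotropic Gaussian parameterised directly by ζ as its standard deviation; the exact noise parameterisation in the released code may differ, and the function name is our own.

```python
import numpy as np

def perturb_embeddings(emb, zeta, rng=None):
    """Add isotropic Gaussian noise with std `zeta` to token embeddings.

    Sketch of the continuous embedding perturbation described above.
    emb: array of shape (seq_len, emb_dim); returns a perturbed copy.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=0.0, scale=zeta, size=emb.shape)
    return emb + noise
```

Larger ζ corresponds to a stronger perturbation; as noted in A.1, all input perturbations are disabled at inference time.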

A.3 Human Evaluations
TST outputs of two models, the baseline DAAE and the proposed EPAAE, on the Yelp, SNLI and Tenses datasets were considered. The best-performing EPAAE was chosen according to the results in Tables 4, 5 and 6, particularly with respect to the content preservation metrics (since TST% was similar across all models). Two hundred sentences (one hundred from each base style) were randomly sampled from the test split of each dataset and style-inverted (with scaling factor k = 2). The resulting six hundred instances (each instance being a base and converted sentence pair) were labelled by three human evaluators, with each evaluator labelling all two hundred instances from each dataset. The models were anonymised to evaluators and randomly named "Model 1" and "Model 2". Each instance was to be labelled by an evaluator with one of four possible decision outcomes, i.e. "1 is best", "2 is best", "All are bad" and "All are good". For each instance, the majority vote among the three annotators was taken as the final decision for that instance; instances without a majority were marked as "NA". The evaluation guidelines were formulated to consider which model (a) successfully transferred the target style, (b) preserved the style-independent content and (c) was overall fluent and grammatically coherent.
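The vote-aggregation rule described above can be sketched with the standard library. This is an illustrative implementation of the stated majority rule, not code from the paper.

```python
from collections import Counter

LABELS = {"1 is best", "2 is best", "All are bad", "All are good"}

def aggregate(votes):
    """Majority decision over three annotator votes for one instance.

    A label wins if at least two of the three annotators chose it;
    otherwise the instance is marked "NA", as described above.
    """
    assert len(votes) == 3 and all(v in LABELS for v in votes)
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else "NA"
```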

B Additional Experiments

B.1 Sentence Interpolation in Latent Space - Qualitative Examples
We perform sentence interpolation using DAAE and EPAAE, starting from the same input sentence, incrementally moving along the same direction in the latent space in five fixed-size steps. We see that both the baseline and proposed model are able to produce fluent and coherent sentences, indicative of a smoothly populated latent space.
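The walk through latent space can be sketched as follows, assuming a decoder maps each latent vector back to a sentence. The function name and the unit-normalisation of the direction are our own illustrative choices.

```python
import numpy as np

def latent_walk(z_start, direction, n_steps=5, step_size=1.0):
    """Take `n_steps` fixed-size steps from `z_start` along `direction`.

    Sketch of the latent-space walk used for the interpolation examples;
    each returned vector would be fed to the decoder to produce a sentence.
    """
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)  # fix the step direction
    return [z_start + (i + 1) * step_size * d for i in range(n_steps)]
```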

B.2.1 Qualitative Examples
TST with scaling factor k = 2 was performed on the DNLI and SciTail datasets, as shown in Tables B.5 and B.6 respectively. Similarly, it was performed on the Voices and PP Removal datasets from the Style-PTB benchmark, as shown in Tables B.7 and B.8 respectively. The proposed model for each dataset was chosen to be the EPAAE with the best performance on the content preservation metrics (shown in Tables 4, 5 and 6). Samples were specifically selected in which at least one of the models was able to generate the ideal, style-converted sentence with near-perfect content preservation and coherence. This selection was done by human evaluators, to whom both models were anonymised.
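The vector-arithmetic transfer with scaling factor k (discussed in Section 6) can be sketched as moving the latent code along the difference of the two style centroids. This is a minimal illustration; it assumes the style means are estimated from encoded samples of each style, and the function name is our own.

```python
import numpy as np

def style_transfer(z, src_mean, tgt_mean, k=2.0):
    """Move latent code `z` along the source-to-target style direction.

    Sketch of vector-arithmetic TST: the style direction is the
    difference of the two style centroids, scaled by the factor k.
    """
    return z + k * (np.asarray(tgt_mean) - np.asarray(src_mean))
```

Larger |k| yields a stronger style inversion, which is the basis of the fine-grained control of transfer strength shown in the qualitative examples.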

B.2.2 Direction specific Discourse Transfer metrics
Particularly in the case of Discourse Style, it is natural to speculate that the difficulty of style transfer might be sensitive to the direction, i.e. whether the transfer is performed from Entailment to Contradiction or vice versa. To analyze this, direction-specific quantitative metrics for Discourse TST are reported for the SNLI (Table B.10), DNLI (Table B.11) and SciTail (Table B.12) datasets. We notice that a disparity in performance does in fact exist, mainly highlighted by the differences in the TST% metric. This sensitivity to direction is present in all models across all datasets, but is most significant on the SNLI dataset, in which TST% goes as low as 20.6% for the Contradiction to Entailment task and as high as 83.7% for the opposite task. Future work can focus on specifically improving the Contradiction to Entailment task, as doing so will be a measure of a model's ability to detect and carefully align the content of one sentence to match another.

Table B.1: Sentence Interpolation on a sentence from the SNLI dataset (Left: EPAAE, Right: DAAE).
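The direction-wise TST% reported in these tables can be computed per transfer direction as in this sketch. It assumes an external style classifier supplies the predicted style of each converted sentence; the record format and function name are illustrative.

```python
from collections import defaultdict

def direction_tst(records):
    """Per-direction transfer accuracy (TST%).

    `records` is a list of (source_style, predicted_style, target_style)
    triples, where predicted_style is an external classifier's label for
    the style-converted output.
    """
    hit, tot = defaultdict(int), defaultdict(int)
    for src, pred, tgt in records:
        key = f"{src}->{tgt}"
        tot[key] += 1
        hit[key] += int(pred == tgt)
    return {k: 100.0 * hit[k] / tot[k] for k in tot}
```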

C Details on Datasets
Here we provide some additional details of all the datasets used in this work. Complexity of Styles in Datasets: As discussed in Section 5.2, we consider three tasks: sentiment, discourse and fine-grained text style transfer. As preprocessing, we remove non-essential special characters and lowercase all sentences. Except for the Yelp dataset, no pruning is done based on sentence length. The vocabulary size during training was limited to 25k unless mentioned otherwise.
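The preprocessing step can be sketched as follows. The exact character whitelist used in the paper's code is not specified, so the regex below is an assumption for illustration.

```python
import re

def preprocess(sentence):
    """Lowercase and strip non-essential special characters.

    Illustrative sketch of the preprocessing described above; the
    retained character set (letters, digits, whitespace and basic
    punctuation) is a hypothetical choice.
    """
    sentence = sentence.lower()
    return re.sub(r"[^a-z0-9\s.,!?'-]", "", sentence).strip()
```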
Sentiment Style Datasets: We use the preprocessed version of the Yelp dataset from Shen et al. (2017). It contains 200k, 10k and 10k sentences in the train, dev and test splits respectively. The sentiment labels (positive, negative) were considered as styles.
Discourse Style Datasets: We used three NLI datasets: SNLI, DNLI and SciTail. Each instance in the SNLI dataset (Bowman et al., 2015) consists of two sentences that either contradict, entail (agree with) or are neutral towards each other. The resultant dataset contained 341k, 18k and 18k sentences in the train, dev and test splits respectively. The DNLI dataset (Welleck et al., 2019) instead consists of contradiction, entailment and neutrality labelled instances in a first-person, dialogue-like form. The dataset contains 208k, 11k and 11k sentences in the train, dev and test splits respectively. SciTail (Khot et al., 2018) is an entailment dataset created from multiple-choice science exams and the web, in a two-sentence format similar to SNLI and DNLI. The first sentence is formed from a question-answer pair from science exams, and the second sentence is either a supporting (entailment) or non-supporting (neutrality) premise obtained from the internet. The dataset contained 24k, 1.3k and 1.3k sentences in the train, dev and test splits respectively. For SNLI and DNLI, all instances with the "neutrality" label were removed, and the style transfer task was performed from "contradict" to "entail" and vice versa. For SciTail, the transfer task was from "neutral" to "entail" and vice versa.
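The label filtering described above can be sketched as a simple filter over sentence-pair records. This is an illustrative snippet, not code from the paper; each record is assumed to be a (premise, hypothesis, label) triple.

```python
def filter_nli(pairs, keep=("entailment", "contradiction")):
    """Drop NLI pairs whose label is not in `keep` (e.g. "neutral").

    Sketch of the dataset filtering described above, applied to SNLI
    and DNLI before training.
    """
    return [(p, h, y) for p, h, y in pairs if y in keep]
```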
Fine-grained Style Datasets: We used the Style-PTB dataset (Lyu et al., 2021) for fine-grained styles. It consists of 21 styles/labels with themes ranging over syntactic, lexical, semantic and thematic transfers, as well as compositional transfers which combine more than one of the aforementioned fine-grained styles. To check whether the EPAAE can capture fine-grained styles better by leveraging its better-organised latent space, we make use of three styles, i.e. Tenses, Voices (Active or Passive) and Syntactic PP tag removal (PPR). In the Tenses dataset, each sentence is labelled "Present", "Past" or "Future". The Voices dataset contains "Active" and "Passive" voice labels, and the PPR dataset contains "PP removed" and "PP not removed" labels. The resultant sizes of the train, dev and test splits were 71k, 8.8k, 8.8k (Tenses), 90k, 11k, 11k (PPR) and 44k, 5.5k, 5.5k (Voices).

Table B.10: Direction-wise TST metrics for the SNLI dataset. "Entailment" and "Contradiction" are denoted by "E" and "C" respectively.
Table B.11: Direction-wise TST metrics for the DNLI dataset. "Entailment" and "Contradiction" are denoted by "E" and "C" respectively.
Table B.12: Direction-wise TST metrics for the SciTail dataset. "Entailment" and "Neutrality" are denoted by "E" and "N" respectively.

D More details on related work
Bowman et al. (2016) extend Variational Auto-Encoders (Kingma and Welling, 2014) to text generation and address the posterior collapse problem, wherein the decoder completely ignores the latent channel, leading to poor generation. Adversarial auto-encoders (AAEs; Makhzani et al., 2015) substitute the KL loss with an adversarial approach to enforce the latent Gaussian prior; AAEs have been shown to naturally avoid the posterior collapse problem and to promote strong coupling between the encoder and decoder. Wasserstein autoencoders (Tolstikhin et al., 2018) introduce a family of regularised autoencoders that learn a flexible prior by solving an optimal transport problem to match P(z) and Q(z) using an adversarial approach. Adversarially regularised autoencoders (ARAEs) follow the Wasserstein autoencoder framework and use a learnt prior, unlike AAEs, which assume the prior to be a fixed standard Gaussian distribution.

E Computational Expense and Infrastructure used
The most parameter-heavy EPAAE model was the one trained on the SNLI dataset, so we report statistics for this model. It has 13 million parameters, and each epoch took approximately 60 seconds to train on an Nvidia V100-SXM2 GPU with an Intel(R) Xeon(R) E5-2698 CPU. For complete details, please refer to the log.txt files present in each model's directory at https://github.com/sharan21/EPAAE.