A Hierarchical VAE for Calibrating Attributes while Generating Text using Normalizing Flow

In this digital age, online users expect personalized content. To cater to diverse groups of audiences across online platforms, it is necessary to generate multiple variants of the same content with differing degrees of characteristics (sentiment, style, formality, etc.). Though text-style transfer is a well-explored related area, it focuses on flipping the polarity of a style attribute rather than regulating fine-grained attribute transfer. In this paper we propose a hierarchical architecture for finer control over the attribute, preserving content using attribute disentanglement. We demonstrate the effectiveness of the generative process for two attributes of varied complexity, namely sentiment and formality. With extensive experiments and human evaluation on five real-world datasets, we show that the framework can generate natural-looking sentences with a finer degree of control over the intensity of a given attribute.


Introduction
The ubiquity of online social networks and the World Wide Web has brought in diverse and often conflicting groups of users consuming similar information but from different perspectives. So the onus falls on the content producer to deliver customized content based on the users' profiles. Consider an example related to a Spanish football (soccer) league. Say the news is "Barcelona has defeated Real Madrid". This news needs to be presented in different tones to a Barcelona fan ("Barcelona smashed Real-Madrid"), a Real-Madrid fan ("Real Madrid lost the epic battle"), and a (say) Villarreal fan ("Barcelona wins three points against Real-Madrid"). Automatic generation of content with fine regulation of attributes like sentiment and style is extremely beneficial in this context. There are several related works in the similar space of text-style-transfer techniques (Hu et al., 2017; Logeswaran et al., 2018; Shen et al., 2017; Singh and Palod, 2018) which attempt to switch the polarity of a text from, e.g., formal to casual, or positive to negative sentiment. However, none of these works focuses on the more involved problem of fine-grained regulation of attributes to generate multiple variants of a sentence.
Several of the existing style-transfer methods (Fu et al., 2018; John et al., 2018) convert a continuous entangled generative representation space, obtained using a variational auto-encoder (Bowman et al., 2015), into disentangled attribute and content spaces. This facilitates attribute polarity switching by perturbing the attribute representation without interfering with the content. However, a disentangled generative representation may lose information about the complex inter-dependency of content and attributes that is otherwise captured in an unmodified entangled generative space. Hence, a trivial extension of the variational inference (encoding) mechanism for finer attribute control, allowing incremental perturbation of the attribute representation in the disentangled generative space, often generates 'not-so-natural' sentences mostly unrelated to the original content.
More specifically, two design challenges need to be tackled to achieve fine-grained attribute control: (a) smooth regulation of attributes via perturbation of a disentangled attribute space, and (b) natural sentence generation that preserves the content. This paper builds a layered VAE to tackle both problems simultaneously. Specifically, we propose the model Control Text VAE (CTVAE), which transforms a derived representation of an entangled and enriched text embedding (obtained using the BERT encoder) into a disentangled representation of attribute and context, using a transformation module followed by the imposition of a factored prior to ensure independence between the context and attribute dimensions. Further, using attribute supervision on the dimension designated for a given attribute, we establish a correlation between the continuous representation and the discrete attribute value, facilitating the smooth interpolation intended in (a). The model preserves both the disentangled and entangled representations at different hierarchies of the inference module. By designing the transformation network to be reversible, it restores the original entangled sentence representation, which is our generative space, from the disentangled space to achieve (b).
We demonstrate the effectiveness of CTVAE in generating controlled text by fine-tuning two different attributes, namely sentiment and formality. Using five publicly available datasets, we show that CTVAE improves performance significantly over previous controlled text generation models while performing content-preserving style transfer and fine-tuning of the target attribute. With human evaluation of the generated sentences on three different metrics (meaning preservation, degree of target attribute transfer, and naturalness), we show that CTVAE can generate attribute-regulated, content-preserving natural sentences.1

1 https://github.com/bidishasamantakgp/CTVAE

Related Work
Unlike style transfer, fine-grained attribute-regulated text generation is less explored yet extremely necessary. State-of-the-art methods for style transfer are categorized as supervised and unsupervised techniques. If parallel examples are available for an attribute, i.e., training data consisting of original sentences and corresponding attribute-flipped sentences, then supervised techniques (Bahdanau et al., 2014; Vaswani et al., 2017) can perform style transfer. The papers (Xu et al., 2012; Jhamtani et al., 2017; Rao and Tetreault, 2018) introduced parallel corpora consisting of formal and corresponding informal sentences, showed that coarse-grained formality transfer is possible, and benchmarked various neural frameworks for the task. Generating a parallel training corpus for fine-grained attribute transfer is expensive and impractical, as for each sentence we would need multiple style-transferred variants bearing fine-grained attribute values.
Some recent works focus on semi-supervised approaches incorporating attribute information with non-parallel datasets. These techniques mainly focus on disentangling the attribute and content representations in the latent space (Fu et al., 2018; John et al., 2018; Logeswaran et al., 2018; Shen et al., 2017; Singh and Palod, 2018) by using different encoding modules along with feature supervision. A recent work (John et al., 2018) uses an adversarial setup in a multitasking setting to achieve an attribute representation independent of the content. As this work disentangles context and attribute into multidimensional spaces, it limits interpolation of the attribute space to a desired degree. Moreover, the disentangled generative space causes a loss of important context. Similarly, Hu et al. (2017) use attribute information as a structured or one-hot vector, which is not continuous and hence restricts interpolation. They replace the attribute representation with a desired value (corresponding to the opposite polarity) and generate sentences from this disentangled space. However, a naive extension to fine-grained control by perturbing the attribute space by a small amount is difficult because the representation is multidimensional; moreover, it leads to unnatural, poorly readable sentences.
From a different perspective, a recent work (He et al., 2020) proposed an unsupervised framework for style transfer. It is a generative probabilistic model that treats a non-parallel corpus as a partially observed parallel corpus. As it does not infer the posterior distribution of the observed data, fine-grained attribute transfer is difficult.
As extensions of the current style-transfer methods are non-trivial, a recent work has proposed fine-grained sentiment regulation keeping the content intact. It gradually updates the entangled latent representation using a costly fast-gradient-iterative modification until it can generate a sentence entailing the target attribute. However, overemphasis on content preservation often results in the generation of the original unmodified sentence followed by new phrases bearing the target attribute. This makes it unsuitable for more difficult attributes like casual-to-formal transformation. Understanding the criticality of fine-grained attribute transfer, we propose a new framework in this direction, which not only facilitates fine-grained control even for complex attributes, but also mitigates the existing problems of a disentangled generative space.

CTVAE for Fine Grained Control
We propose a hierarchical model using Variational Autoencoders (Kingma and Welling, 2013) to achieve fine-grained control over the attribute space while maintaining the quality of the generated sentences. We provide a high-level overview of CTVAE along with key technical aspects of the individual components, followed by the training procedure.

Figure 1: The architecture of CTVAE. The encoder module (A) takes a word sequence x and converts the obtained BERT embedding to a continuous space z_s. Using T transformation modules, z_s is converted to z_f, and the last dimension of z_f is assigned to the attribute representation z_a. The decoder (D) samples z_f from the prior or posterior, decodes the categorical attribute from z_a, reverse-transforms z_f to z_s, and uses it to generate the word sequence x. The grey block indicates a single reversible transformation step (B indicates forward and C reverse).

Model overview
We consider an input set X = {x_0, · · · , x_{M−1}} of M observed sentences sampled from some underlying unknown data distribution p_D. Along with the sentences, we observe the corresponding ground-truth attribute values. For ease of reference, we will henceforth denote a training instance x_i and its attribute f_i by x and f respectively. A detailed architectural overview of CTVAE is depicted in Figure 1; it can be divided into two modules, a hierarchical encoder and a corresponding hierarchical decoder. We start by describing the inference model (encoder), followed by the generation model (decoder).

Inference model
The inference model is designed as a bottom-up hierarchical encoder with two distinct layers modelling the word sequence representation z_s and the feature representation z_f. We model an enriched sentence representation z_s ∈ R^d with latent dimension d from the word sequence x as follows. We first obtain the contextual word embedding for each word w in x from the BERT pre-trained model (Turc et al., 2019). Then, we generate an aggregated encoding E_s by averaging them. Finally, we transform it into a continuous d-dimensional Gaussian space using a fully connected neural network g_φ in the following two steps:

[μ_s, log σ_s^2] = g_φ(E_s),     (1)
q_φ(z_s|x) = N(μ_s, diag(σ_s^2)).     (2)
The sentence representation z_s is sampled from this posterior distribution q_φ(z_s|x). It is an entangled, complex manifold of different salient features spread across multiple dimensions. This enriched representation is the generative representation, as we decode sentences from z_s for better quality.
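The lower encoder layer described above can be sketched as follows. This is a minimal toy implementation: the dimensions are arbitrary, and a single linear map stands in for the fully connected network g_φ; in the real model the inputs would be contextual BERT embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_sentence(word_embeddings, W_mu, b_mu, W_logvar, b_logvar):
    """Map a sequence of contextual word embeddings to a Gaussian posterior
    over the sentence representation z_s, then sample via reparameterization."""
    # Aggregate word-level embeddings into one sentence encoding E_s.
    E_s = word_embeddings.mean(axis=0)
    # A linear layer (standing in for g_phi) produces the parameters
    # of the posterior q_phi(z_s | x).
    mu = W_mu @ E_s + b_mu
    logvar = W_logvar @ E_s + b_logvar
    # Reparameterization trick: z_s = mu + sigma * eps, eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    z_s = mu + np.exp(0.5 * logvar) * eps
    return z_s, mu, logvar

# Toy dimensions: 5 words with 8-dim "BERT" embeddings, latent size d = 4.
words = rng.standard_normal((5, 8))
W_mu, b_mu = rng.standard_normal((4, 8)), np.zeros(4)
W_lv, b_lv = rng.standard_normal((4, 8)), np.zeros(4)
z_s, mu, logvar = encode_sentence(words, W_mu, b_mu, W_lv, b_lv)
print(z_s.shape)  # (4,)
```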
Next, we transform the sentence representation z_s into another representation z_f, on which we impose disentanglement constraints followed by attribute supervision, such that z_f can be decomposed into independent spaces of context and attribute. We need an efficient transformation to maintain the inherent dependencies between the context and attribute during this process. It is also important to restore the enriched z_s from the decomposed z_f, i.e., to capture the reverse dependency. Instead of modeling two different transformation networks to capture the dependency in both directions, we design a single reversible transformation module. It guarantees that given a z_f, we get back an appropriate entangled z_s useful for natural sentence generation.
Hence, we build our transformation network by extending R-NVP (Dinh et al., 2016), a reversible auto-regressive normalizing flow, to achieve the mentioned interdependency and inversion. Specifically, we split z_s into two parts. The first d−1 dimensions of z_s are dedicated to modelling latent factors important for context modelling. The remaining (last) dimension is used to derive a representation for the specified attribute. The detailed interconnection between them in one transformation step is depicted in Figure 1(B). We obtain z_f by T transformation steps, where T is a hyperparameter. In transformation step t we obtain a representation distribution q_t(z_t|z_{t−1}), characterized as the ordered set of the following operations:

μ^1_t, σ^1_t = Ψ^1_t(z^{1:d−1}_{t−1}),     (3)
z^d_t = z^d_{t−1} ⊙ exp(σ^1_t) + μ^1_t,     (4)
μ^2_t, σ^2_t = Ψ^2_t(z^d_t),     (5)
z^{1:d−1}_t = z^{1:d−1}_{t−1} ⊙ exp(σ^2_t) + μ^2_t.     (6)

Eq. (4) intuitively describes that the attribute representation depends on the first d−1 dimensions, i.e., the context. Eq. (6) encodes how the context is influenced by the attribute. Here, Ψ^1_t and Ψ^2_t are designed as multilayer fully connected feed-forward networks which are not invertible. However, a careful inspection of Eqs. (4) and (6) reveals that given z_t, the input z_{t−1} can be fully recovered. We provide the reverse transformations in the next subsection. Thus, we get q_φ(z_f|z_s) := q_φ(z_T|z_s) and assign z_f := z_T. We pick the d-th (last) dimension of z_f to model the specified attribute representation z_a. To facilitate smooth interpolation in this attribute space, we keep z_a unidimensional. We further use attribute supervision to establish the correlation with the categorical values of the attribute; we discuss the process in the next subsection. The rest of the dimensions of z_f are kept for other contextual features z_u. We discuss the disentanglement of z_f in Sec. 3.4. The overall posterior distribution achieved by the hierarchical inference mechanism is:

q_φ(z_s, z_f|x) = q_φ(z_s|x) q_φ(z_f|z_s).     (7)
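One transformation step of this kind, with its exact inverse, can be sketched as below. This is a toy, single-step illustration: the conditioners `psi` are plain linear maps standing in for the feed-forward networks Ψ^1_t and Ψ^2_t, and a tanh bounds the log-scales for numerical stability (an assumption, not stated in the paper).

```python
import numpy as np

def psi(params, v):
    """Toy conditioner (stand-in for Psi^1_t or Psi^2_t); returns a shift mu
    and a bounded log-scale sigma."""
    W_mu, W_sig = params
    return W_mu @ v, np.tanh(W_sig @ v)

def coupling_forward(z, p1, p2):
    """One R-NVP-style step: the attribute dimension (last) is updated
    conditioned on the context dimensions (first d-1), then the context
    dimensions are updated conditioned on the new attribute dimension."""
    ctx, attr = z[:-1], z[-1:]
    mu1, sig1 = psi(p1, ctx)              # conditioner on the context
    attr_new = attr * np.exp(sig1) + mu1  # attribute depends on context
    mu2, sig2 = psi(p2, attr_new)         # conditioner on the new attribute
    ctx_new = ctx * np.exp(sig2) + mu2    # context influenced by the attribute
    logdet = sig1.sum() + sig2.sum()      # log|det| of the affine step
    return np.concatenate([ctx_new, attr_new]), logdet

def coupling_inverse(z, p1, p2):
    """Exact inverse: undo the two affine updates in reverse order."""
    ctx_new, attr_new = z[:-1], z[-1:]
    mu2, sig2 = psi(p2, attr_new)
    ctx = (ctx_new - mu2) * np.exp(-sig2)
    mu1, sig1 = psi(p1, ctx)
    attr = (attr_new - mu1) * np.exp(-sig1)
    return np.concatenate([ctx, attr])

rng = np.random.default_rng(1)
d = 4
p1 = (rng.standard_normal((1, d - 1)), rng.standard_normal((1, d - 1)))
p2 = (rng.standard_normal((d - 1, 1)), rng.standard_normal((d - 1, 1)))
z_s = rng.standard_normal(d)
z_f, logdet = coupling_forward(z_s, p1, p2)
z_back = coupling_inverse(z_f, p1, p2)
print(np.allclose(z_s, z_back))  # True
```

The invertibility check at the end is exactly the property the paper relies on: the conditioners themselves need not be invertible, yet the whole step is.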

Generative model
We design our generative model p_θ using a top-down hierarchy, with two different variables z_s and z_f. The overall distribution of the latent variables for generation is defined as:

p(z_f, z_s) = p_π(z_f) p_θ(z_s|z_f).     (8)

Here p_π(z_f) is a factored prior on the feature representation z_f, which can be expressed as p_π(z_f) = ∏_{i=1}^{d} p_π(z^i_f). We use a standard normal distribution, which is a factored isotropic distribution, as the prior, i.e., p_π(z_f) = N(0, I). Imposing this factored prior enforces disentanglement (Kim and Mnih, 2018) on the derived space q_φ(z_f|z_s). As discussed in the previous section, we designate the last dimension of z_f to capture the attribute of interest, and the remaining dimensions for other contextual features. Hence, the attribute representation prior can be sampled from p_π(z^d_f) and the other contextual feature prior representations from ∏_{i=1}^{d−1} p_π(z^i_f). We use feature supervision on z_a to increase the correlation between the representation and the attribute value as follows. Given z_a, we decode the categorical attribute value of the given sentence x and back-propagate the prediction loss to modify the network parameters. More specifically, the decoding distribution for the ground-truth attribute is:

p_θ(f|z_a) = softmax(ξ(z_a)).     (9)

Here ξ is a scaling network converting the scalar value z_a into a logit vector over the categorical values of the ground-truth attribute. Next, the network decodes the entangled representation z_s from the disentangled representation z_f. We apply the reverse transformation flow to recover z_s using T inverse transformations. Starting from z_f (= z_T), we recover z_s by reverse transformation steps p_t(z_{t−1}|z_t), as the following ordered set of operations:

μ^2_t, σ^2_t = Ψ^2_t(z^d_t),     (10)
z^{1:d−1}_{t−1} = (z^{1:d−1}_t − μ^2_t) ⊙ exp(−σ^2_t),     (11)
μ^1_t, σ^1_t = Ψ^1_t(z^{1:d−1}_{t−1}),     (12)
z^d_{t−1} = (z^d_t − μ^1_t) ⊙ exp(−σ^1_t).     (13)

Eq. (11) is the reverse transformation corresponding to Eq. (6). Similarly, Eq. (13) defines the reverse flow of Eq. (4). Note that μ^1_t, μ^2_t and σ^1_t, σ^2_t are derived from the same neural networks Ψ^1_t, Ψ^2_t as in Eqs. (3), (5).
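The attribute supervision on the scalar z_a can be sketched as follows. This is a toy version: the scaling network ξ is approximated by a single affine map to two logits (binary attribute), and the weights `w`, `b` are illustrative, not learned.

```python
import numpy as np

def xi(z_a, w, b):
    """Toy scaling network xi: maps the scalar attribute code z_a to a
    logit vector over the (binary) categorical attribute values."""
    return w * z_a + b

def attribute_nll(z_a, label, w, b):
    """Negative log-likelihood -log p_theta(f | z_a), the supervision loss
    that is back-propagated to tie z_a to the ground-truth attribute f."""
    logits = xi(z_a, w, b)
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
    return -log_probs[label]

# With w pushing logit 1 up for large z_a, a strongly positive code should
# fit class 1 better than a strongly negative one.
w, b = np.array([-2.0, 2.0]), np.zeros(2)
print(attribute_nll(1.5, 1, w, b) < attribute_nll(-1.5, 1, w, b))  # True
```

Minimizing this loss is what gives the single attribute dimension its monotonic correspondence with the categorical attribute, which later enables interpolation.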
Hence, given a z_t we can easily get back z_{t−1} without any loss of information; thus we get z_s := z_1. Following density estimation theory (Dinh et al., 2016), the log probability density of p_θ(z_s|z_f), i.e., log p_T(z_s|z_f), is given by:

log p_T(z_s|z_f) = log p_π(z_f) + Σ_t log |det(∂f_t/∂z_{t−1})|,     (14)

where f_t denotes the transformation function at step t described in Eqs. (3)-(6). Finally, with the decoded z_s, we sample the word sequence x(j) using a recurrent unit as follows:

x(j) ∼ softmax(m_θ(h(j))),     (15)

where h(j) = r_θ(x(j−1), z_s) is the hidden state of the gated recurrent unit r_θ, which takes the previously generated token x(j−1) and the sentence representation z_s. We then pass this hidden state to a feed-forward network m_θ to generate logits, and subsequently sample words from the softmax distribution over the generated logits. The joint likelihood of the sentence, attribute, and latent variables is:

p(x, f, z_s, z_f) = p_π(z_f) p_θ(f|z_a) p_θ(z_s|z_f) p_θ(x|z_s).     (16)
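The autoregressive sampling loop can be sketched as below. This is a simplified stand-in: a plain tanh recurrence replaces the gated recurrent unit r_θ, a linear map replaces m_θ, and the vocabulary and weights are toy values.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_sentence(z_s, params, vocab_size, max_len=6):
    """Autoregressive decoding sketch: condition each step on the previous
    token and the sentence representation z_s, then sample from softmax."""
    W_h, W_x, W_z, W_out = params
    h = np.zeros(W_h.shape[0])
    x_prev = np.zeros(vocab_size)  # one-hot of previous token (BOS = zeros)
    tokens = []
    for _ in range(max_len):
        # h(j) = r_theta(x(j-1), z_s), with a tanh cell standing in for the GRU.
        h = np.tanh(W_h @ h + W_x @ x_prev + W_z @ z_s)
        logits = W_out @ h             # m_theta producing logits
        probs = softmax(logits)
        tok = int(rng.choice(vocab_size, p=probs))
        tokens.append(tok)
        x_prev = np.eye(vocab_size)[tok]
    return tokens

V, H, D = 10, 8, 4
params = (rng.standard_normal((H, H)) * 0.1,
          rng.standard_normal((H, V)) * 0.1,
          rng.standard_normal((H, D)) * 0.1,
          rng.standard_normal((V, H)) * 0.1)
toks = sample_sentence(rng.standard_normal(D), params, V)
print(len(toks))  # 6
```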

Training
We can learn the model parameters by optimizing the joint likelihood given in Eq. (16). To learn the complex transformation of the disentangled attribute and context in z_f from the entangled z_s precisely, we first need to estimate the approximate posterior q_φ(z_s|x) accurately. However, in the initial iterations of training, the encoder fails to approximate the posterior distribution (He et al., 2019). Hence, we first train the lower layer by maximizing the ELBO (Kingma and Welling, 2013):

L_1 = E_{q_φ(z_s|x)}[log p_θ(x|z_s)] − KL(q_φ(z_s|x) || p(z_s)).     (17)

This is unsupervised training, as we are not using any attribute information, and this objective updates the encoder parameters to generate the entangled z_s. Once the lower layer is trained, we update the transformation parameters (Eq. (14)) and impose feature supervision by maximizing the marginal likelihood of z_f given below:

L_2 = E_{q_φ(z_f|z_s)}[log p_θ(z_s|z_f)] − α KL(q_φ(z_f|z_s) || p_π(z_f)) + β E_{q_φ(z_f|z_s)}[log p_θ(f|z_a)],     (18)

where α and β are regularizing parameters to enforce disentanglement of z_f and to emphasize attribute supervision, respectively. Breaking down the KL term of the above objective yields a dimension-wise KL term, minimizing which the model achieves disentanglement of z_f along its dimensions (Higgins et al., 2017). Also, the mutual information I(f, z_a) between the specified attribute and z_a, computed using the entropy function H(·) as H(f) − H(f|z_a), is lower bounded by the likelihood p_θ(f|z_a); hence, we emphasize the likelihood term in the objective using β to maintain a higher correlation between z_a and f. Thus we update the network parameters phase by phase using Eqs. (17) and (18).
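The two-phase objective can be sketched as loss computations. This is a toy numerical illustration of the weighting scheme only: the reconstruction and supervision terms are placeholder scalars that would, in the real model, come from the decoder and the attribute head.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), the standard VAE KL term."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def phase1_loss(recon_nll, mu_s, logvar_s):
    """Lower layer: plain (negative) ELBO on z_s, no attribute information."""
    return recon_nll + gaussian_kl(mu_s, logvar_s)

def phase2_loss(recon_nll_zs, kl_zf, attr_nll, alpha, beta):
    """Upper layer: alpha weights the factored-prior KL on z_f
    (disentanglement pressure), beta weights attribute supervision."""
    return recon_nll_zs + alpha * kl_zf + beta * attr_nll

# Toy numbers only; real values come from the encoder/decoder networks.
mu = np.array([0.5, -0.2])
logvar = np.array([-0.1, 0.3])
l1 = phase1_loss(4.2, mu, logvar)
l2 = phase2_loss(4.2, kl_zf=1.3, attr_nll=0.7, alpha=2.0, beta=5.0)
print(float(l1), float(l2))
```

Training alternates phases: first minimize `phase1_loss` until the lower posterior is reliable, then switch to `phase2_loss` for the transformation and supervision parameters.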

Experiments
We broadly look into two evaluation criteria to compare the performance of different generative models: (a) Attribute control: efficiency in generating sentences entailing the target attribute of interest; (b) Fine-grained transfer: efficiency of content-preserving, fine-grained attribute-regulated text generation. In this section we discuss the datasets and baselines, followed by the performance across datasets.

Datasets
We focus on two attributes of varied complexity, namely (a) sentiment and (b) formality. Table 1 describes the datasets in detail. For sentiment, we include two review datasets and one hate-speech dataset. The Gab dataset is designed for counter-hate-speech learning, and every hateful sentence has a candidate counter hate-speech, which we consider as the non-hateful (NH) class of content. Thus we have training examples with hateful (H) and non-hateful (NH) content. The formality datasets have formal (F) and corresponding casual (C) instances. We report all results on the provided test data.

Baseline methods
We compare CTVAE's performance with a semi-supervised method, (a) ctrlGen (Hu et al., 2017); a supervised method, (b) DAE (John et al., 2018), which focuses on text-style transfer using disentanglement; an unsupervised method, (c) ProbStyleTransfer (He et al., 2020); and (d) entangleGen, which focuses on fine-grained style transfer using an entangled representation. Apart from these state-of-the-art baselines, we inspect (e) CTVAE-NR (CTVAE with Non-Reversible transformation), where we replace the invertible transformations of CTVAE with two separate transformation networks responsible for capturing q_φ(z_f|z_s) and p_θ(z_s|z_f). For the different evaluation criteria we compare CTVAE with different subsets of these methods, as described in the relevant sections.

Performance on attribute control
Experimental setup: We estimate the average representation values of z_a corresponding to each categorical (binary) value of the attribute of interest, z_max and z_min, from the training data. We generate attribute-controlled sentences in two ways. First, we sample a generative representation vector from the prior distribution (i.e., p_θ(z_s|z_f), z_f ∼ N(0, I)) and assign either z_max or z_min to z_a. We sample 10 sentences from a representation and select the one which bears the target attribute; if no such sample is generated, we consider it a failure case. Similarly, we assign z_max or z_min to z_a, depending on the target attribute, in the posterior representation of a given sentence x. We sample 10 sentences from it and select the one most similar to x (BERT embeddings with cosine similarity greater than τ = 0.71) that entails the target attribute. If no candidate satisfies both criteria, we count it as a miss. We identify the generated sentences bearing the target attribute using a classifier built by extending BERT and trained on the different datasets. We investigated multiple cosine similarity thresholds τ (0.65 to 0.75 with granularity 0.01). We observed that generated sentences with cosine similarity to the original sentence below 0.7 do not contain important context words. On the contrary, all methods except CTVAE and entangleGen were able to generate only a very small number of candidates with high similarity scores (>0.73). To provide a fair comparison, we keep τ at 0.71 for all datasets across all methods.
Metrics: We report controlled generation accuracy, i.e., the percentage of sentences generated from the prior bearing the target attribute, and style inversion accuracy, i.e., the percentage of sentences generated from the posterior bearing the target attribute and related content. We also report the percentage of related content generated for style inversion. We report the mean performance of each model trained with three random initializations.
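The posterior-side candidate selection described above can be sketched as follows. This is a toy version: the embeddings are random vectors standing in for BERT embeddings, and the attribute flags stand in for the external classifier's decisions.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_candidate(orig_emb, cand_embs, cand_has_target, tau=0.71):
    """Among the sampled candidates, keep those entailing the target
    attribute (per the classifier), then return the index of the one most
    similar to the original sentence if its similarity exceeds tau;
    otherwise the instance counts as a miss (None)."""
    best, best_sim = None, tau
    for i, (emb, ok) in enumerate(zip(cand_embs, cand_has_target)):
        sim = cosine(orig_emb, emb)
        if ok and sim > best_sim:
            best, best_sim = i, sim
    return best

rng = np.random.default_rng(3)
orig = rng.standard_normal(16)
# Three near-copies of the original, then seven unrelated candidates.
cands = [orig + 0.1 * rng.standard_normal(16) for _ in range(3)] \
      + [rng.standard_normal(16) for _ in range(7)]
flags = [False, True, True] + [True] * 7  # classifier: bears target attribute?
picked = select_candidate(orig, cands, flags)
print(picked)  # index of a near-copy that also bears the target attribute
```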
Baselines: We report ctrlGen and DAE for both metrics, as they can sample generative representations from both the prior and the posterior. As entangleGen and probTrans can only generate sentences corresponding to a given posterior, we compare them only for style inversion.

Sentiment control
We report controlled generation accuracy and style inversion accuracy for Yelp, Amazon and GAB in Table 2. CTVAE outperforms all competing methods across the three datasets for controlled generation. The superior performance of CTVAE stems from the fact that attribute supervision on the disentangled representation achieves better control of attributes than the semi-supervised ctrlGen. DAE, which is also an attribute-supervised technique, performs exactly the same as ours. CTVAE effectively generates more related content than the others and achieves the best style inversion accuracy on Amazon and on both the hateful-to-non-hateful (H-NH) and non-hateful-to-hateful (NH-H) transitions for GAB; it is second best on Yelp. DAE, along with ctrlGen, uses a disentangled generative space which often causes loss of content information. Hence, they generate less related content than the other methods, which leads to a drop in style inversion accuracy. entangleGen performs best for style inversion on Yelp and second best on the other datasets. It achieves relatively low accuracy even after producing a larger amount of related content. It uses the BERT embedding space to search for a candidate embedding closest to the original sentence for style inversion. As Yelp contains shorter, coherent sentences, it is easy to find a related yet opposite-polarity sentence embedding, whereas for GAB the H and NH sets are quite different and their representation spaces are far from each other, causing poor performance. The unsupervised method probTrans performs well on the relatively simpler Yelp and Amazon datasets but fails to generate related content for the complex GAB and scores the lowest. As converting a counter-hate-speech into hateful content is difficult, all methods perform poorly there. The performance of CTVAE-NR is significantly inferior to CTVAE. Close inspection reveals that even though we achieve a very low KL between q_φ(z_f|z_s) and p_θ(z_s|z_f) at training, the decoded z_s is not exactly the same as the encoded distribution; thus it performs poorly in style inversion.

Figure 2: The variation of relatedness (R) and attribute polarity (AP) scores with respect to attribute control grades in F across datasets. Moving right from f_1, CTVAE generates sentences with a monotonic increase in AP while maintaining high R; moving left from −f_1, AP decreases monotonically. For Music the variation of AP is not consistent.

Formality control
From Table 2, we can see that CTVAE performs best on both the Music and Family datasets for all metrics. Conversion of a casual sentence into a formal one (C-F) is more difficult, as it may require structural change of the sentence, whereas the reverse transformation (F-C) is easier. Though the disentanglement-based methods perform relatively better for C-F than for F-C conversion, overall they perform poorly, as they are unable to generate related content after perturbing the disentangled generative space. entangleGen also performs poorly on both datasets for both C-F and F-C. A pair of formal and corresponding informal sentences has very high content overlap (only structure, capitalization, etc. differ), so they lie very close in the BERT representation space. The generative model of entangleGen generates sentences from this representation space and hence cannot distinguish small changes of representation; this confuses the generative model, and it very often generates the original sentence verbatim. Unlike on GAB, probTrans performs better than all the semi-supervised methods as well as entangleGen, even though formality is a difficult attribute like hatred. As the formality datasets are parallel, probTrans can accurately estimate their latent variables, which is otherwise difficult; hence it learns to successfully generate style-inverted text given parallel sentences.

Significance test
We perform a Student's t-test with significance level 0.05 and report expected p-values against the closest baseline, following Reimers and Gurevych (2018), for the two tasks, i.e., controlled generation and style inversion.
For controlled generation, the per-dataset p-values are as follows: for Yelp, 0.009 compared against ctrlGen; for Amazon, 0.019 with respect to ctrlGen; for GAB, 0.015 with ctrlGen; for Music, 0.012 against DAE; and for Family, 0.008 compared with DAE. On the first three datasets, DAE and CTVAE perform exactly the same. Similarly, for style inversion we obtain the following p-values: for Amazon, 0.028 in comparison to entangleGen; on GAB, 0.028 for H-NH compared against entangleGen and 0.032 for NH-H in comparison to ctrlGen; for Music, 0.002 (C-F) and 0.017 (F-C) with probTrans; and for Family, 0.024 (C-F) against ctrlGen and 0.030 (F-C) compared against probTrans.

Fine grained attribute control
Experimental Setup: We evaluate the performance of fine grained attribute control as follows. We create a set with n equidistant values between z min to zero denoted as {−f i } and another n values between zero to z max denoted as {f i }. The entangleGen ctrlGen CTVAE F Original sentence: every encounter i have had with her ... she is always rude or angry . Attribute transfer: Negative to Positive sentiment f1 every encounter i have had with her ... she is always friendly or angry.
i always get the burger because i have liked it.
she is always angry and she has with her ... and she is rude. f2 i love purchasing i have easy with her who has always friendly and fun.
i have always have vegetarian suite. she is always friendly and she is her ... i think that it is absolutely outstanding .. f3 i love purchasing i have easy with her who has always friendly and fun. excellent, their food is always.. she is always outstanding and i completely recommend her ... with her food. F Original sentence: yep, full retard .. political grandstanding Attribute transfer: Hateful to non-hateful f1 .. in order for little, the biggest straight humans who think it really does n't help anyone to clean up their offensive terms.
its inappropriate behavior prior to use those phrases that.
f2 .. in order for little, the biggest straight humans who think it really does n't help anyone to clean up their offensive terms.
its inappropriate behavior prior to use ' retarded ' .. lol, no. please know your political opinions. thanks.
f3 .. in order for little, the biggest straight humans who think it really does n't help anyone to clean up their offensive terms.
a word is highly offensive to those completely uncalled for.
not sure of your political points. thanks. union set F represents attribute control grades.
Greater indices indicate higher perturbation in the attribute representation space, and the sign denotes the direction. Given the posterior representation z_f of a sentence x, we assign z_a a value from F, keeping z_u fixed, and decode a z_s from it. We generate 10 sentences from it and select the one whose BERT embedding is closest to the original sentence and which bears the target attribute value. We repeat this for all values in F. For entangleGen, we consider an equivalent set F with n values by using the increasing modification weights w employed for fine-grained attribute control in the original paper, and generate sentences accordingly. Though ctrlGen does not support fine-grained transfer, we extend it by interpolating between the two structured attribute representation vectors [0, 1] and [1, 0], generating real-valued vectors in F where each vector sums to one. For each attribute representation vector, we generate sentences similarly to CTVAE. As the other models cannot be extended in this way, we do not compare their performance here.
Metrics: We report an attribute polarity score AP, which estimates the degree of attribute polarity of a generated sentence, and a relatedness score R, capturing the relatedness with the original sentence. For the review datasets Yelp and Amazon, AP is obtained from a pre-trained Stanford regressor model (Socher et al., 2013), normalized between 0 (most negative) and 1 (most positive). A pilot study on 25 randomly picked sentences shows that the pre-trained regression score is highly correlated (Spearman's rank correlation 0.68) with human judgements. We report R as the Jaccard overlap (Tustison and Gee, 2009) of unigrams between the original and generated sentences, excluding stop words, for these datasets. However, for the other three datasets the observed correlation is low; hence we resort to human evaluation via the crowdflower platform 2.
Given a test sentence, we generate n sentences corresponding to the n different grades in the set F and ask three annotators to rank these sentences from 1 to n. We take the average rank for each instance and repeat for all test sentences to obtain the average ranks as AP corresponding to each of the n values. We also ask the annotators to provide an absolute score for the relatedness (R) of the generated sentences with respect to the original sentence on a scale of 1 to 10 (1 being least related); we rescale it and present the results on a scale of 0 to 1. A coherent scheme would see a monotonic change in AP as the attribute control grade varies from −f_n to f_n, with R staying close to one throughout.
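The control-grade set F from the setup above can be sketched as follows; z_min, z_max, and n here are toy values, and the exact spacing convention (whether the endpoints or zero are included) is an assumption.

```python
import numpy as np

def control_grades(z_min, z_max, n):
    """Build the attribute control grades: n equidistant values between
    z_min and 0 (the -f_i grades) and n between 0 and z_max (the f_i
    grades); a larger |index| means a larger perturbation of z_a."""
    neg = np.linspace(z_min, 0.0, n, endpoint=False)        # -f_n ... -f_1
    pos = np.linspace(z_max, 0.0, n, endpoint=False)[::-1]  # f_1 ... f_n
    return neg, pos

neg, pos = control_grades(z_min=-2.0, z_max=2.0, n=4)
print(neg.tolist())  # [-2.0, -1.5, -1.0, -0.5]
print(pos.tolist())  # [0.5, 1.0, 1.5, 2.0]
```

Each value in the union of `neg` and `pos` is then assigned to z_a in turn (with z_u fixed) to produce the graded variants of a sentence.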

Fine-grained sentiment control
We demonstrate the performance of the generative models on the review dataset Yelp and the hate-speech dataset GAB in Figures 2(a) and (b) respectively. We show the variation of the attribute polarity AP and relatedness score R with n = 4. We can observe a smooth increase in AP as we move from f_1 to f_4 (denoting a greater shift of the original z_a values towards z_max), while CTVAE achieves consistently high R on both datasets. Similarly, as we move from −f_1 to −f_4, CTVAE shows a monotonic decrease in AP while still achieving the highest R. Though a similar pattern is observed for ctrlGen on Yelp, it has an extremely poor R score, which indicates that it generates unrelated sentences in the process of fine-grained attribute regulation; moreover, it shows minimal variation in sentiment score throughout the process. In contrast, entangleGen achieves the highest R score as it focuses on content preservation; however, its sentiment score transition is uneven and does not follow the desired coherency. CTVAE, in contrast, successfully maintains a balance between relatedness and attribute control. It can also be observed that CTVAE shows a monotonic transition as we move from left to right, denoting a higher degree of attribute representation change, for Amazon, while the other methods show haphazard changes.
On GAB, ctrlGen shows abrupt changes in AP and the lowest R score, demonstrating very little control over fine-grained attribute regulation for hatred filtering. Though entangleGen achieves the lowest AP, signifying it removes hateful content more accurately than CTVAE, its variation is not monotonic. Further inspection reveals that entangleGen mostly generates counter hate-speech, as the BERT representation clusters for H and NH in GAB lie in two distant regions; hence the relatedness R of its generated sentences is low. In contrast, CTVAE successfully balances relatedness and attribute control on both datasets.

Fine-grained formality control
We experiment with n = 3 equidistant values in each direction of F and report the performance on the Music and Family datasets in Figure 2(d, e). It can be observed from the figure that all the methods receive a similar AP score, around 2.0, for the casual-to-formal transformation from f_1 to f_3. Also, as we move to the right of f_1, the changes in AP are inconsistent for CTVAE and entangleGen; however, CTVAE achieves a relatively better formality score throughout. entangleGen achieves the best R and a low AP because it very often reproduces the original content verbatim. ctrlGen shows the lowest relatedness and achieves a transfer score of AP = 1.5 on average; that is, overall it fails to generate formal sentences. Moving towards the casual transition, i.e., from -f_1 to -f_3, we observe a similar trend for CTVAE and entangleGen: though the variation with respect to the attribute control grades in F is abrupt, we achieve the lowest AP, i.e., the most informal sentences. ctrlGen performs the worst of all the methods. For Family, no trend in AP is found. CTVAE maintains high R, whereas ctrlGen obtains the lowest relatedness score.

Fluency
We also investigate the fluency of these methods across datasets, reported in Table 4, and find that CTVAE produces a very high percentage of fluent sentences, similar to entangleGen. As we have observed, entangleGen tends to copy the content for the formality datasets because formal and casual sentences lie close in the representation space, so its fluency is high. Similarly, for the GAB dataset its fluency remains high because it tends to generate counter-hate-speech.

Table 4: Percentage of fluent sentences generated in the fine-grained attribute transition process

Finally, Table 3 provides examples of fine-grained sentiment- and hatred-regulated sentences generated by CTVAE, entangleGen, and ctrlGen. We observe that entangleGen generally produces long sentences, sometimes copies the original content, and produces the same sentence multiple times. ctrlGen, on the other hand, mostly generates sentences hardly related to the original content. In contrast, CTVAE generates related sentences and provides finer attribute variation, controlled by f_i.

Conclusion
The major contribution of this paper is CTVAE, which consists of a carefully designed hierarchical architecture facilitating a disentangled representation to control the attribute without affecting context, as well as an enriched entangled generative representation for meaningful sentence generation. The invertible normalizing flow, serving as a transformation module between the two representations of CTVAE, enables learning the complex interdependency between attribute and context without loss of information. This design choice is key to achieving accurate fine-tuning of attributes (be it sentiment or formality) while keeping the content intact, a notable achievement considering the difficulty of the problem and the modest performance of state-of-the-art techniques. Extensive experiments on real-world datasets emphatically establish the well-rounded performance of CTVAE and its superiority over the baselines.

A Analysis of attribute supervision
Here we perform an ablation study demonstrating the importance of the last dimension z_a of the representation z_f in capturing sentiment. As we ensure independence of every dimension, we calculate the correlation of each dimension of z_f with the sentiment labels on the test data. We observe that z_a achieves the highest correlation: 0.72 on Yelp and 0.42 on Amazon. We further train a logistic regression classifier with z_a of the training data as the sole feature to predict sentiment labels, achieving a high test accuracy of 0.85 on Yelp and 0.64 on Amazon. When training instead with the most correlated dimension of z_f other than z_a (correlation 0.12 on Yelp and 0.14 on Amazon), we achieve an accuracy of only 0.52 and 0.58 respectively. This implies that z_a is the most expressive dimension for capturing sentiment in comparison to any other dimension.
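The per-dimension correlation check above can be sketched in a few lines of plain Python; this is an illustrative reimplementation (not the authors' code), with toy latent vectors whose last dimension tracks the labels:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between one latent dimension and the labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def most_predictive_dim(z, labels):
    """z: list of latent vectors z_f. Returns the index of the dimension
    with the highest absolute correlation against the labels."""
    dims = len(z[0])
    corrs = [abs(pearson([v[d] for v in z], labels)) for d in range(dims)]
    return max(range(dims), key=corrs.__getitem__)

# Toy example: dimension 1 (the "z_a" slot) carries the sentiment signal.
z = [[0.1, 1.0], [0.2, -1.0], [0.2, 0.9], [0.1, -0.8]]
labels = [1, 0, 1, 0]
```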

B Parameter Setting
The sentence encoder is designed using the pre-trained BERT-base-uncased model (embedding dim = 768) followed by a 2-layer feed-forward network with hidden dimension 200; its output is the sentence embedding, of dimension 256 for every dataset. The flow network is designed as R-NVP with T = 3, and each ψ_t is a three-layer feed-forward network with tanh activation for the first two layers and hidden dimension 100 for the intermediate layers. The scaling network for sentiment classification is designed as a two-dimensional vector [-1, 1]. The sentence decoder is a gated recurrent unit whose output at each step is passed through a fully connected feed-forward network, converting it to a logit vector of vocabulary-size length. The weighting parameters β and γ for feature supervision and disentanglement are both set to 10.
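To make the R-NVP transformation concrete, the following is a minimal pure-Python sketch of one affine coupling step of the kind R-NVP stacks T = 3 times. It is not the paper's implementation: the coupling network ψ_t is replaced by a toy tanh function, and the dimensionalities are illustrative only. The key property shown is exact invertibility, which is what lets the flow map between the two representations without loss of information.

```python
import math

def psi(z_fixed):
    """Toy stand-in for the coupling network psi_t: maps the unchanged
    half of the vector to a (log-scale, shift) pair for the other half."""
    s = math.tanh(sum(z_fixed))   # log-scale, bounded by tanh
    t = 0.5 * sum(z_fixed)        # shift
    return s, t

def coupling_forward(z):
    """One affine coupling step: the first half passes through unchanged
    and parameterizes an invertible affine map of the second half."""
    half = len(z) // 2
    z1, z2 = z[:half], z[half:]
    s, t = psi(z1)
    y2 = [x * math.exp(s) + t for x in z2]
    return z1 + y2

def coupling_inverse(y):
    """Exact inverse: psi is recomputable because y1 == z1."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s, t = psi(y1)
    z2 = [(x - t) * math.exp(-s) for x in y2]
    return y1 + z2
```

Stacking several such steps, with the roles of the two halves swapped between steps, gives the full T = 3 flow.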

C Qualitative Examples
In Table 5 we provide some examples of casual-to-formal conversion. We can see that, as the perturbation increases, CTVAE introduces more formal notions into the sentences, such as proper capitalization and avoidance of abbreviations, whereas entangleGen fails to introduce such changes while keeping the content intact, and ctrlGen generates unrelated content.
Original sentence: i 've got a crush on him, like, forever !
Attribute transfer: Casual to Formal

entangleGen
  f_1: i 've got a crush on him, like forever, which is wrong !
  f_2: i 've got a crush on him, like forever, which is wrong !
  f_3: i 've got a crush on him, like forever, because in real movie.
ctrlGen
  f_1: you would have to say yes, but you are such a favorite artists.
  f_2: he is great, unfortunately.
  f_3: you would have to say yes, but you are such a favorite artists.
CTVAE
  f_1: i have a crush, about him, so I have a crush on him !
  f_2: I have a crush on him, like an crush on him.
  f_3: I have a crush on him, like a crush.

D Training time comparison
In this section we provide a comparative analysis of the training time and sampling time of CTVAE against entangleGen. Figure 3 shows that CTVAE is much faster than entangleGen in both cases.