Polarized-VAE: Proximity Based Disentangled Representation Learning for Text Generation

Learning disentangled representations of real-world data is a challenging open problem. Most previous methods have focused on either supervised approaches, which use attribute labels, or unsupervised approaches that manipulate the factorization in the latent space of models such as the variational autoencoder (VAE) by training with task-specific losses. In this work, we propose polarized-VAE, an approach that disentangles select attributes in the latent space based on proximity measures reflecting the similarity between data points with respect to these attributes. We apply our method to disentangle the semantics and syntax of sentences and carry out transfer experiments. Polarized-VAE outperforms the VAE baseline and is competitive with state-of-the-art approaches, while being a more general framework that is applicable to other attribute disentanglement tasks.


Introduction
Learning representations of real-world data using deep neural networks has accelerated research within a number of fields, including computer vision and natural language processing. Previous work has advocated for the importance of learning disentangled representations (Bengio et al., 2013; Tschannen et al., 2018). Although attempts have been made to formally define disentangled representations (Higgins et al., 2018), there is no widely accepted definition of disentanglement. However, the general consensus is that a disentangled representation should separate the distinct factors of variation that explain the data (Bengio et al., 2013). Intuitively, a greater level of interpretability can be achieved when independent latent units are used to encode different attributes of the data (Burgess et al., 2018).
However, recovering and separating all the distinct factors of variation in the data is a challenging problem. For real-world datasets, there may not be a way to separate each factor of variation into a single dimension in the learnt fixed-size vector representation. An easier problem would be to separate complex factors of interest into distinct subspaces of the learnt representation. For instance, a representation for text could be separated into content and style subspaces, which then enables style transfer.
Unsupervised disentanglement of underlying factors using variational autoencoders (Kingma and Welling, 2014) has been explored in previous work (Higgins et al., 2017; Kim and Mnih, 2018). However, Locatello et al. (2019) argue that completely unsupervised disentanglement of the underlying factors may be impossible without supervision or inductive biases. Disentangling textual attributes in a completely unsupervised manner has been shown to be especially difficult, but attempts have been made to leverage unsupervised disentanglement for controllable text generation (Xu et al., 2020).
In this work, we propose an approach referred to as polarized-VAE to disentangle the latent space into subspaces corresponding to different factors of variation. We control the relative location of representations in a particular latent subspace based on the similarity of their respective input data points according to a defined criterion (corresponding to an attribute in the input space, e.g., syntax). This encourages similar points to be grouped together and dissimilar points to lie farther away from each other in that subspace. Figuratively, we polarize the latent subspaces, hence the name.
Most previous work on supervised disentanglement for text has focused on adversarial training (John et al., 2019). Recently, the task of disentangling textual semantics and syntax into distinct subspaces has received attention from researchers. For instance, Chen et al. (2019b) use a sentence VAE model with several multitask losses, such as a paraphrase loss and a word position loss, for this disentanglement task. Bao et al. (2019) incorporate adversarial training and make use of syntax trees along with specific multitask losses to disentangle semantics and syntax.
In polarized-VAE, we achieve disentanglement through distance based learning. In contrast to previous approaches, our method does not require the use of several multitask losses or adversarial training, both of which can result in optimization challenges. Furthermore, we do not need precise attribute labels, and we show that using proxy labels based on the concept of similarity is sufficient.
In summary, the main contributions of this paper are three-fold: (1) We propose a general framework for learning disentangled representations. Even though we test our method on an NLP task, the underlying concept is very general and can be applied to other domains such as computer vision; (2) We provide a method for disentanglement that does not rely on adversarial training or specialized multitask losses; (3) We demonstrate an application of our method by disentangling the latent space into subspaces corresponding to syntax and semantics. Such a setting can be used to perform controlled text decoding such as generating a paraphrase with a desired sentence structure.

Proposed Approach
In VAEs, a probabilistic encoder $q_\phi(z|x)$ is used to encode a sentence $x$ into a latent variable $z$, and a probabilistic decoder $p_\theta(x|z)$ attempts to reconstruct the original sentence $x$ from its latent representation $z$. The objective is to minimize the following loss function:
$$\mathcal{L} = \mathcal{L}_{rec} + \lambda_{kl}\,\mathcal{L}_{kl},$$
where $\mathcal{L}_{rec} = -\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ is the sentence reconstruction loss and $\mathcal{L}_{kl} = D_{kl}(q_\phi(z|x)\,\|\,p(z))$ is the Kullback-Leibler (KL) divergence loss. The KL term ensures that the approximate posterior $q_\phi(z|x)$ is close to the prior $p(z)$, which is typically assumed to be the standard normal $\mathcal{N}(0, I)$; $\lambda_{kl}$ is a hyperparameter that controls the extent of KL regularization.
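The two terms of the objective can be sketched in a few lines of numpy. This is illustrative only (the actual model uses RNN encoders/decoders and Monte Carlo estimates of the reconstruction term); it shows the closed-form KL between a diagonal Gaussian posterior and the standard normal prior:

```python
# Minimal numpy sketch of the VAE objective above (illustrative only).
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian posterior."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def vae_loss(rec_loss, mu, log_var, lambda_kl=1.0):
    """L = L_rec + lambda_kl * L_kl, matching the objective above."""
    return rec_loss + lambda_kl * kl_to_standard_normal(mu, log_var)
```

When the posterior matches the prior (zero mean, unit variance), the KL term vanishes and only the reconstruction loss remains.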
The idea behind our polarized-VAE approach is to impose additional proximity regularization on the latent subspaces learnt by VAEs. Let $C = \{c_1, \ldots, c_k\}$ be the collection of criteria based on which we wish to disentangle the latent space $z$ of the VAE into $k$ subspaces: $z = [z^{(1)}, \ldots, z^{(k)}]$.
Here $z^{(i)}$ denotes the latent subspace corresponding to the criterion $c_i$ (see Figure 1). In this paper, we focus on the case where the latent space is disentangled into semantics ($c_1$) and syntax ($c_2$), i.e., $k = 2$.

Supervision based on Similarity
We assume that we have information (possibly noisy) about pairwise similarities of the input sentences. Given a pair of sentences, the similarity information can be either a binary label (whether or not both sentences belong to the same class) or an integer or continuous scalar (e.g., edit distance). In this work, the similarity criterion is a binary label:
$$\mathrm{Sim}(x_i, x_j \mid c) = \begin{cases} 1, & \text{if } x_i \text{ and } x_j \text{ are similar with respect to criterion } c, \\ 0, & \text{otherwise.} \end{cases}$$
In our case, the two criteria for disentanglement are semantics ($c_1$) and syntax ($c_2$). We use this additional information to regularize the latent space of the VAE by incorporating proximity based loss functions on the corresponding subspaces.

Training Method and Proximity Function
Extending the traditional VAE approach, we have a set of RNN-based encoders parameterized by $\phi_c$ that learn the approximate posteriors $q_{\phi_c}(z^{(c)}|x)$. Given two data points $x_i$ and $x_j$, we denote the proximity of their encodings in the latent subspace by $D(q_{\phi_c}(z^{(c)}|x_i), q_{\phi_c}(z^{(c)}|x_j))$. We experimented with multiple forms of proximity functions and found cosine distance, computed between the means of the approximate posteriors, to perform the best:
$$D\big(q_{\phi_c}(z^{(c)}|x_i), q_{\phi_c}(z^{(c)}|x_j)\big) = 1 - \cos\big(\mu_i^{(c)}, \mu_j^{(c)}\big).$$
Based on the above distance, we add a regularization term to the VAE loss function as follows. For each example $(x, c)$, we have a positive sample $x_p$ and $m$ negative samples $x_{n_1}, \ldots, x_{n_m}$, such that $\mathrm{Sim}(x, x_p \mid c) = 1$ and $\mathrm{Sim}(x, x_{n_j} \mid c) = 0$ for $j \in \{1, \ldots, m\}$:
$$\mathcal{L}_{prox}^{(c)} = \sum_{j=1}^{m} \max\big(0,\; \delta + D(x, x_p \mid c) - D(x, x_{n_j} \mid c)\big),$$
where $\delta$ is the margin and $D(x, x' \mid c)$ is shorthand for the proximity of the posteriors of $x$ and $x'$ in subspace $c$. This regularization function can be viewed as a max-margin loss over the proximity function. The final objective then becomes
$$\mathcal{L} = \mathcal{L}_{rec} + \lambda_{kl}\,\mathcal{L}_{kl} + \sum_{c \in C} \lambda_c\,\mathcal{L}_{prox}^{(c)},$$
where the $\lambda_c$ are hyperparameters weighting the proximity regularizers.
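The cosine-distance max-margin regularizer can be sketched as follows. This is a minimal numpy sketch operating on posterior-mean vectors (the helper names and the margin value are ours, not from the paper):

```python
import numpy as np

def cosine_distance(u, v):
    """D(u, v) = 1 - cos(u, v): smaller means closer in the latent subspace."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def proximity_loss(z, z_pos, z_negs, margin=0.5):
    """Max-margin proximity regularizer: for each negative, penalize the model
    unless the positive is closer than that negative by at least `margin`."""
    d_pos = cosine_distance(z, z_pos)
    return sum(max(0.0, margin + d_pos - cosine_distance(z, z_neg))
               for z_neg in z_negs)
```

The loss is zero whenever every negative is already farther from the anchor than the positive by the margin, so well-separated examples contribute no gradient.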

Experiments
To demonstrate the effectiveness of polarized-VAE in obtaining disentangled representations, we carry out semantics-syntax separation of textual data, using the Stanford Natural Language Inference dataset (SNLI, Bowman et al. (2015)). Model implementation details are provided in Appendix A.

Reconstruction and Sample Quality
We evaluate our model on reconstruction and sample quality to ensure that the distance-based regularization does not adversely impact its reconstruction or sampling capabilities. For this purpose, we compare our model and the standard VAE on two metrics: reconstruction BLEU (Papineni et al., 2002) and the Forward Perplexity (PPL) (Zhao et al., 2018) of generated sentences obtained by sampling from the model's latent space; PPL is computed using the KenLM toolkit (Heafield et al., 2013). As seen in Figure 2, there is a clear trade-off between reconstruction quality and sample quality, which is expected. Overall, polarized-VAE performs slightly better than the standard VAE, indicating that the proximity-based regularization does not inhibit the model's capabilities.

Controlled Generation and Transfer

We follow previous work (Chen et al., 2019a; Bao et al., 2019) and analyze the performance of controlled generation by evaluating syntax transfer in generated text. Given two sentences, x_sem and x_syn, we wish to generate a third sentence that combines the semantics of x_sem and the syntax of x_syn. To do so, we encode x_sem with the semantic encoder and x_syn with the syntactic encoder, concatenate the resulting latent codes, and decode. Following the evaluation methodology of Bao et al. (2019), we measure transfer based on (1) semantic content preservation for the semantic subspace and (2) the tree edit distance (Zhang and Shasha, 1989) for the syntactic subspace.
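The transfer procedure amounts to pairing the semantic code of one sentence with the syntactic code of another before decoding. A minimal numpy sketch (the helper is illustrative; the subspace sizes follow the implementation details in Appendix A):

```python
import numpy as np

def transfer_code(z_sem_of_xsem, z_syn_of_xsyn):
    """Controlled generation sketch: concatenate the semantic code of x_sem
    with the syntactic code of x_syn; the result would be fed to the decoder
    (the decoder itself is omitted here)."""
    return np.concatenate([z_sem_of_xsem, z_syn_of_xsyn])
```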
We consider pairs of sentences from the SNLI test set for evaluation. We would like the generated sentence to be semantically close to x_sem and different from x_syn, which is measured using BLEU scores; we report the difference, denoted ∆BLEU, to indicate the strength of the semantic transfer. Additionally, we would like the generated sentence to be syntactically similar to x_syn and different from x_sem, which is measured by the averaged sentence-level Tree Edit Distance (TED); we likewise report ∆TED to indicate the strength of the syntax transfer. Finally, we report a combined score ∆GM, the geometric mean of ∆BLEU and ∆TED.
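Given the four raw scores, the combined metric reduces to simple arithmetic. A sketch under assumed sign conventions (both deltas oriented so that higher is better, as described above; the zero fallback for non-positive deltas is our assumption):

```python
import math

def transfer_deltas(bleu_sem, bleu_syn, ted_sem, ted_syn):
    """Transfer-strength scores: Delta-BLEU rewards semantic closeness to x_sem
    over x_syn, Delta-TED rewards syntactic closeness to x_syn over x_sem,
    and Delta-GM is their geometric mean."""
    d_bleu = bleu_sem - bleu_syn
    d_ted = ted_sem - ted_syn
    d_gm = math.sqrt(d_bleu * d_ted) if d_bleu > 0 and d_ted > 0 else 0.0
    return d_bleu, d_ted, d_gm
```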
Our default variant of polarized-VAE uses the entailment labels from the SNLI dataset as a proxy for semantic similarity, based on which positive and negative samples are chosen. For this model, we threshold the difference in TED between syntax parses as a proxy for syntactic similarity. As shown in Table 1, we also evaluate three other variants of our model. In polarized-VAE (wo), we use word overlap (BLEU scores) as a heuristic proxy for estimating semantic similarity, while keeping syntactic training unchanged. We also experiment with heuristics for syntax in polarized-VAE (len), where we use sentence length as a heuristic proxy for syntax, while still using ground-truth entailment labels for semantic training. Finally, we combine these two heuristics in polarized-VAE (wo, len), which can be viewed as an unsupervised variant that does not make use of any ground-truth labels or syntax trees.

[Table 1 caption: Bao et al. (2019) report TED after multiplying by 10; we report their score after correction. For each model, the human evaluation scores represent the percentage of instances for which it was ranked best on a given criterion (semantics preservation / syntax transfer / fluency).]

Our model outperforms the VAE baseline on all metrics. In comparison to Bao et al. (2019), polarized-VAE is much better at ignoring the semantic information present in x_syn during syntax transfer, as evidenced by our lower BLEU scores w.r.t. x_syn. On the other hand, we perform slightly worse on BLEU w.r.t. x_sem. Our model does a better job of matching the syntax of x_syn, as indicated by the lower TED score w.r.t. x_syn. Qualitative samples of syntax transfer are provided in Appendix C.
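The two label-free heuristics can be sketched as simple binary similarity functions. These are stand-ins, not the paper's exact implementation: Jaccard word overlap approximates the BLEU-based semantic proxy of polarized-VAE (wo), and both thresholds are our assumptions:

```python
def word_overlap_sim(s1, s2, threshold=0.3):
    """Jaccard word-overlap proxy for semantic similarity (a simple stand-in
    for the BLEU heuristic of polarized-VAE (wo); threshold is ours)."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    overlap = len(w1 & w2) / max(len(w1 | w2), 1)
    return 1 if overlap >= threshold else 0

def length_sim(s1, s2, max_diff=2):
    """Sentence-length proxy for syntactic similarity, as in polarized-VAE
    (len); the max_diff threshold is an assumption."""
    return 1 if abs(len(s1.split()) - len(s2.split())) <= max_diff else 0
```

Either function can supply the binary Sim labels used to pick positive and negative samples during training.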

Human Evaluation
We carried out a human evaluation study to compare outputs generated by different models. The test setup is as follows: we provide two sentences, x_sem and x_syn, as input to the model and wish to generate a sentence that combines the semantics of x_sem and the syntax of x_syn. We asked 5 human annotators to evaluate the outputs from 3 models: the baseline VAE, polarized-VAE, and the model of Bao et al. (2019).
Each annotator was shown the input sentences (x_sem and x_syn) and the outputs from the 3 models, randomized so that the evaluator is unaware of which output corresponds to which model. They were then asked to pick the single best output for each of the following three criteria: (1) semantic preservation, the level of semantic similarity with respect to x_sem; (2) syntactic transfer, the level of syntactic similarity with respect to x_syn; and (3) fluency. We obtained annotations on 100 test set examples from the SNLI dataset. To aggregate the annotations, we used majority voting with manual tie-breaking to find the best model for each test example (and each criterion).
For each model, we report the percentage of instances where it was voted best for each criterion.
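The aggregation step can be sketched with the standard library; returning None on a tie mirrors the manual tie-breaking described above (the helper name is ours):

```python
from collections import Counter

def majority_vote(choices):
    """Pick the model most annotators ranked best for one example and one
    criterion; return None on a tie so the example can be broken manually."""
    counts = Counter(choices).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]
```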
From the human evaluation results in Table 1, we note that polarized-VAE is better at semantic transfer and worse at syntactic transfer in comparison to Bao et al. (2019). The human evaluation results are consistent with the automatic evaluation metrics, where polarized-VAE scores higher on ∆BLEU (an indicator of semantic transfer strength) and Bao et al. (2019) score better on ∆TED (an indicator of syntax transfer strength). With respect to the fluency criterion, polarized-VAE ranks higher than Bao et al. (2019). However, the most fluent sentences are produced by the baseline VAE. We hypothesise that this is due to the additional regularization terms in the loss functions of both Bao et al. (2019) and polarized-VAE, which affect the fluency of the generated text (due to the deviation from the reconstruction objective).

Conclusion and Future Work
We proposed a general approach for disentangling latent representations into subspaces using proximity functions. Given a pair of data points, a predefined similarity criterion in the original input space determines their relative distance in the corresponding latent subspace, which is modelled via a proximity function. We apply our approach to the task of disentangling semantics and syntax in text. Our model substantially outperforms the VAE baseline and is competitive with the state-of-the-art approach, while being more general, as we do not use specific multitask losses or architectures to encourage the preservation of semantic or syntactic information. Our methodology is orthogonal to the multitask learning approaches of Chen et al. (2019b) and Bao et al. (2019) and can be naturally combined with their methods. We would further like to investigate this approach on disentanglement applications outside of NLP. Another interesting research direction would be to further explore suitable proximity functions and identify the properties that could facilitate disentanglement.

A Implementation Details

Both the semantic and syntactic encoders are bidirectional LSTMs (Hochreiter and Schmidhuber, 1997) with a hidden size of 128, followed by two feed-forward layers to parameterize the Gaussian mean (µ) and standard deviation (σ), similar to the standard VAE formulation used by Bao et al. (2019). The latent space dimensions were taken to be dim(z^(1)) = 64 (semantics) and dim(z^(2)) = 16 (syntax). The decoder is a unidirectional LSTM with a hidden size of 128. We train the model for 30 epochs in total using the ADAM optimizer (Kingma and Ba, 2015) with the default parameters and a learning rate of 0.001. We adopt the standard tricks for VAE training, including dropout and KL annealing, following Bowman et al. (2016). We anneal both the semantic and syntactic KL weights (λ_kl) up to 0.3 (over 5000 steps) using the same sigmoid schedule (Bahuleyan, 2018).
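A sigmoid KL-annealing schedule of the kind used in training can be sketched as follows; the exact shape parameter is an assumption, but the endpoints (near zero at step 0, the 0.3 cap after 5000 steps) match the description above:

```python
import math

def kl_weight(step, max_weight=0.3, total_steps=5000, sharpness=10.0):
    """Sigmoid KL-annealing schedule: lambda_kl ramps from ~0 up to
    max_weight over total_steps (the sharpness parameter is our assumption)."""
    x = sharpness * (2.0 * step / total_steps - 1.0)  # map step to ~[-k, k]
    return max_weight / (1.0 + math.exp(-x))
```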

B Proximity Functions
We provide results for the other proximity functions that we explored for the polarized-VAE model. We note that, since there is no closed-form expression for the JS divergence between two normal random variables, we used the generalized JS divergence proposed by Nielsen (2020).

C Transfer Examples
We provide qualitative examples of our transfer experiments in Table 4, where we generate a sentence with the semantics of x_sem and the syntactic structure of x_syn. We also provide the sentences generated by a standard VAE for comparison.

D Disentanglement of Latent Subspaces
We test whether the two latent subspaces could encode similar information. This is only likely to happen if the attributes themselves are highly correlated (e.g., if we want to disentangle syntax from length). For such cases, even existing methods based on adversarial disentanglement (John et al., 2019) may fail to completely separate out correlated information. However, if the attributes are sufficiently different (or ideally independent), e.g., syntax and semantics, this is less problematic. Note that we apply our proximity loss independently to each of the subspaces (i.e., leaving the other subspace(s) untouched for a given input). This encourages the semantic encoder to encode semantically similar sentences close together and dissimilar ones far apart in the semantic space (the same applies to the syntax encoder).
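One simple way to verify that the subspaces have not collapsed into encoding the same information is to measure dimension-wise correlations between them; a minimal numpy sketch (the helper name is ours):

```python
import numpy as np

def max_mean_abs_corr(z_a, z_b):
    """Max and mean absolute Pearson correlation over all dimension pairs of
    two latent matrices of shape (n_samples, dim_a) and (n_samples, dim_b)."""
    n_a = z_a.shape[1]
    # corrcoef with rowvar=False treats columns as variables; the off-diagonal
    # block holds cross-correlations between the two sets of dimensions.
    corr = np.corrcoef(z_a, z_b, rowvar=False)[:n_a, n_a:]
    abs_corr = np.abs(corr)
    return abs_corr.max(), abs_corr.mean()
```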
We empirically compute correlations between the semantic and syntax latent vectors for 1000 test sentences as a way to check whether the two encoders learn similar information. By feeding 1000 sentences from the test set to polarized-VAE, we obtain their corresponding semantic (z_sem) and syntax (z_syn) latent vectors and compute the correlation between them. To analyze the level of overlap in the information represented in z_sem and z_syn, we report the maximum absolute correlation (across all pairs of dimensions) as well as the mean absolute correlation. A higher correlation would indicate more overlapping information learnt by the semantic and syntactic encoders. As illustrated in Table 3, the semantic and syntax latent vectors in polarized-VAE encode less correlated information than those of the standard VAE (due to the proximity-based regularization). This demonstrates that the two latent subspaces learned by our model encode sufficiently different information.

Example 1
  x_sem:         A man works near a vehicle.
  x_syn:         A woman showing her face from something to her friend.
  polarized-VAE: A man directing traffic on a bicycle to an emergency vehicle.
  standard-VAE:  A woman works on a loom while sitting outside.

Example 2
  x_sem:         A family in a party preparing food and enjoying a meal.
  x_syn:         Man reading a book.
  polarized-VAE: A person enjoying food.
  standard-VAE:  A man plays his guitar.

Example 3
  x_sem:         Two young boys are standing around a camera outdoors.
  x_syn:         Three kids are on stage with a vacuum cleaner.
  polarized-VAE: Two young boys are standing around a camera outdoors.
  standard-VAE:  Two people are standing on a snowy hill.

Example 4
  x_sem:         There are a group of people sitting down.
  x_syn:         They are outside.
  polarized-VAE: There are people.
  standard-VAE:  They are outside

Example 5
  x_sem:         a woman wearing a hat and hat is chopping coconuts with machete.
  x_syn:         The person is in a blue shirt playing with a ball.
  polarized-VAE: a woman with a hat is hanging upside down over utensils.
  standard-VAE:  A girl in a pink shirt and elbow pads is swirling bubbles.

Example 6
  x_sem:         The young girl and a grownup are standing around a table , in front of a fence.
  x_syn:         A guy stands with cane outdoors.
  polarized-VAE: The young girl is outside.
  standard-VAE:  The little boy is doing a show.

Example 7
  x_sem:         A person is sleeping on bed.
  x_syn:         A man and his son are walking to the beach , looking for something.
  polarized-VAE: A man and a child sit on the ground covered in bed with rocks.
  standard-VAE:  A man is wearing blue jeans and a blue shirt walking.

Example 8
  x_sem:         The men and women are enjoying a waterfall.
  x_syn:         A dog is holding an object.
  polarized-VAE: The man and woman are outdoors.
  standard-VAE:  The two men are working on the roof.

Example 9
  x_sem:         a man dressed in uniform.
  x_syn:         There is a man with a horse on it.
  polarized-VAE: A man dressed in black clothing works in a house.
  standard-VAE:  A man dressed in black and white holding a baby.

Table 4: Examples of transferred sentences that use the semantics of x_sem and the syntax of x_syn.