Unsupervised Representation Disentanglement of Text: An Evaluation on Synthetic Datasets

To highlight the challenges of achieving representation disentanglement for the text domain in an unsupervised setting, in this paper we select a representative set of models successfully applied in the image domain. We evaluate these models on 6 disentanglement metrics, as well as on downstream classification tasks and homotopy. To facilitate the evaluation, we propose two synthetic datasets with known generative factors. Our experiments highlight the existing gap in the text domain and illustrate that certain elements, such as representation sparsity (as an inductive bias) or the coupling of the representation with the decoder, can impact disentanglement. To the best of our knowledge, our work is the first attempt at the intersection of unsupervised representation disentanglement and text, and provides an experimental framework and datasets for examining future developments in this direction.


Introduction
Learning task-agnostic unsupervised representations of data has been the center of attention across various areas of Machine Learning, and more specifically NLP. However, little is known about the way these continuous representations organise information about data. In recent years, the NLP community has focused on the design and selection of suitable linguistic tasks to probe the presence of syntactic or semantic phenomena in representations as a whole (Bosc and Vincent, 2020; Voita and Titov, 2020; Torroba Hennigen et al., 2020; Pimentel et al., 2020; Hewitt and Liang, 2019; Ettinger et al., 2018; Marvin and Linzen, 2018; Conneau et al., 2018). Nonetheless, a fine-grained understanding of how information is organised across the coordinates of a continuous representation is yet to be achieved.
Arguably, a prerequisite for moving in this direction is agreeing on the cognitive process behind language generation (fusing semantic, syntactic, and lexical components), which can then be reflected in the design of representation learning frameworks. However, this generally remains an area of debate, and is perhaps seen as less pertinent in the era of self-supervised masked language models and the resulting surge of new state-of-the-art results.
Even in the presence of such an agreement, learning to disentangle the surface realization of the underlying factors of data (e.g., semantic, syntactic, lexical) in the representation space is a nontrivial task. Additionally, there is no established framework for evaluating such models in NLP. A handful of recent works have looked into disentanglement for text by splitting the representation space into predefined disentangled subspaces such as style and content (Cheng et al., 2020; John et al., 2019), or syntax and semantics (Balasubramanian et al., 2021; Bao et al., 2019; Chen et al., 2019), and rely on supervision during training. However, a generalizable and realistic approach needs to be unsupervised and capable of identifying the underlying factors solely via the regularities present in data.
In areas such as image processing, the same question has received a lot of attention and inspired a wave of methods for learning and evaluating unsupervised representation disentanglement (Ross and Doshi-Velez, 2021; Mathieu et al., 2019; Kim and Mnih, 2018; Burgess et al., 2018; Higgins et al., 2018, 2017) and the creation of large-scale datasets (Dittadi et al., 2021). It has been argued that disentanglement is the means towards representation interpretability (Mathieu et al., 2019), generalization (Montero et al., 2021), and robustness (Bengio, 2013). However, these benefits are yet to be realized and evaluated in the text domain.
In this work we take a representative set of unsupervised disentanglement learning frameworks widely used in the image domain (§2.1) and apply them to two artificially created corpora with known underlying generative factors (§3). Having known generative factors (which are ignored during the training phase) allows us to evaluate the performance of these models at imposing representation disentanglement via 6 disentanglement metrics (§2.2; §4.1). Additionally, taking the highest scoring models and corresponding representations, we investigate the impact of representation disentanglement on two downstream text classification tasks (§4.3) and on dimension-wise homotopy (§4.4).
We show that existing disentanglement models, when evaluated on a wide range of metrics, are inconsistent and highly sensitive to model initialisation. However, where disentanglement is achieved, it has a positive impact on downstream task performance. Our work highlights the potential and the existing challenges of disentanglement for text. We hope our proposed datasets, accessible description of disentanglement metrics and models, and experimental framework will set the path for the development of models specific to text.

Disentanglement Models and Metrics
Let x denote data points and z denote latent variables in the latent representation space, and assume data points are generated by the combination of two random processes: the first samples a point z^(i) from the latent space with the prior distribution over z, denoted by p(z); the second generates a point x^(i) from the data space, governed by p(x|z^(i)).
We consider z a disentangled representation for x if changes in single latent dimensions of z are sensitive to changes in single generative factors of x, while being relatively invariant to changes in other factors. Several probabilistic models are designed to recover this process; here we look at some of the most widely used ones.

Disentanglement Models
A prominent approach for learning disentangled representations is adjusting the Variational Auto-Encoder (VAE) (Kingma and Welling, 2014) objective function, which decomposes the representation space into independently learned coordinates. We start by introducing the vanilla VAE, and then cover some of its widely used extensions that encourage disentanglement. The VAE uses a combination of a probabilistic encoder q_φ(z|x) and decoder p_θ(x|z), parameterised by φ and θ, to learn the statistical relationship between x and z. VAEs are trained by maximizing a lower bound on the log data distribution log p(x), called the evidence lower bound (ELBO):

L(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z)).

The first term is the expectation of the log-likelihood of the data under the posterior distribution of z. The second term is the KL-divergence, measuring the distance between the posterior distribution q_φ(z|x) and the prior distribution p(z), and can be seen as a regulariser.
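For concreteness, the ELBO above can be computed in closed form when both the posterior and the prior are diagonal Gaussians. The following is a minimal stdlib-only sketch (the function names and the flat-list representation of mu/logvar are our own, not from the paper):

```python
import math

def gaussian_kl(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, logvar))

def elbo(log_likelihood, mu, logvar):
    # ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)); log_likelihood stands in
    # for the (model-dependent) reconstruction term
    return log_likelihood - gaussian_kl(mu, logvar)
```

Note that when the posterior equals the prior (mu = 0, logvar = 0), the KL term vanishes, which is exactly the collapsed regime discussed later.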
β-VAE (Higgins et al., 2017) adds a hyperparameter β to control the regularisation from the KL term via the following objective function:

L_{β-VAE}(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x|z)] − β KL(q_φ(z|x) ‖ p(z)).

Reconstructing under the β-VAE framework (with the right value of β) encourages encoding data points on a set of representational axes along which nearby points are also close in the original data space (Burgess et al., 2018).

CCI-VAE

CCI-VAE (Burgess et al., 2018) extends β-VAE via constrained optimisation:

L_{CCI-VAE}(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x|z)] − β |KL(q_φ(z|x) ‖ p(z)) − C|,

where C is a positive real value representing the target value of the KL-divergence term. This has an information-theoretic interpretation, where the constraint C placed on the KL term is seen as the amount of information transmitted from a sender (encoder) to a receiver (decoder) via the message z (Alemi et al., 2018), and impacts the sharpness of the posterior distribution (Prokhorov et al., 2019). This constraint allows the model to prioritize underlying factors of data according to the availability of channel capacity and their contributions to improving the reconstruction loss.

MAT-VAE

MAT-VAE (Mathieu et al., 2019) introduces an additional term to β-VAE, λ D_MMD(q_φ(z), p(z)), where D_MMD is computed using maximum mean discrepancy (MMD; Gretton et al., 2012) and λ is a scalar weight. This term regularises the aggregated posterior q_φ(z) towards a factorised spike-and-slab prior (Mitchell and Beauchamp, 1988), which aims for disentanglement via clustering and sparsifying the representations z.
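The three objectives above differ only in how they regularise the KL term. A minimal sketch of the corresponding (negated, to-be-minimised) losses, assuming the reconstruction log-likelihood, KL, and MMD terms are precomputed scalars (the function names are ours):

```python
def beta_vae_loss(rec_ll, kl, beta):
    # negative ELBO with a weighted KL term (Higgins et al., 2017)
    return -(rec_ll - beta * kl)

def cci_vae_loss(rec_ll, kl, beta, C):
    # push the KL term towards a target capacity C (Burgess et al., 2018)
    return -(rec_ll - beta * abs(kl - C))

def mat_vae_loss(rec_ll, kl, beta, lam, mmd):
    # mmd: precomputed MMD between the aggregated posterior and the
    # spike-and-slab prior (Mathieu et al., 2019)
    return -(rec_ll - beta * kl) + lam * mmd
```

Setting beta = 1 and lam = 0 recovers the vanilla VAE loss, and at kl == C the CCI-VAE penalty vanishes entirely.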

Issue of KL-Collapse
In text modelling, the presence of powerful autoregressive decoders poses a common optimisation challenge for training VAEs called posterior collapse, where the learned posterior distribution q_φ(z|x) collapses to the prior p(z) (Higgins et al., 2017). In this work, both β-VAE (with β < 1) and CCI-VAE serve as effective methods to avoid KL-collapse.

Disentanglement Metrics
In this section we provide a short overview of six widely used disentanglement metrics, highlighting their key differences and commonalities, and refer the reader to the corresponding papers for the exact details of computation. Eastwood and Williams (2018) define three criteria for disentangled representations: disentanglement, which measures the degree to which one dimension encodes information about no more than one generative factor; completeness, which measures whether a generative factor is captured by only one latent variable; and informativeness, which measures the degree to which representations capture the exact values of the generative factors. They design a series of classification tasks to predict the value of a generative factor based on the latent code, and extract the relative importance of each latent code for each task to calculate disentanglement and completeness scores. The informativeness score is measured directly by the accuracy of the classifier. Other existing metrics reflect at least one of these three criteria, as summarised in Table 1.

Higgins et al. (2017) focus on disentanglement and propose to use the absolute difference of two groups of representations with the same value on one generative factor to predict that generative factor. For perfectly disentangled representations, latent dimensions not encoding information about this generative factor would have zero difference; hence, even simple linear classifiers could easily identify the generative factors based on the changes of values. Kim and Mnih (2018) consider both disentanglement and completeness by first finding the dimension with the lowest (normalised) variance when fixing the value of one generative factor, and then using the found dimension to predict that generative factor. Kumar et al. (2018) propose a series of classification tasks, each of which uses a single latent variable to predict the value of a generative factor, and treat the average of the difference between the top two accuracy scores for each generative factor as the final disentanglement score.
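As an illustration of the classification-based family, the following is a stdlib-only sketch in the spirit of the Kim and Mnih (2018) metric: identify the dimension of lowest normalised variance under a fixed factor, then score a majority-vote mapping from dimensions to factors. The `sample_batch` interface and all names are our own simplifications, not the paper's exact algorithm:

```python
import random
from collections import Counter
from statistics import pvariance, pstdev

def kim_mnih_score(sample_batch, n_factors, n_votes=200, batch=64):
    """sample_batch(k) -> list of representations (lists of floats) that all
    share one fixed value of generative factor k. Hypothetical interface."""
    # estimate each dimension's global scale from a pool of mixed samples
    pool = [r for k in range(n_factors) for r in sample_batch(k)]
    dims = len(pool[0])
    scale = [pstdev([r[d] for r in pool]) or 1.0 for d in range(dims)]
    votes = []
    for _ in range(n_votes):
        k = random.randrange(n_factors)
        reps = sample_batch(k)[:batch]
        # dimension with the smallest normalised variance under the fixed factor
        var = [pvariance([r[d] / scale[d] for r in reps]) for d in range(dims)]
        votes.append((min(range(dims), key=var.__getitem__), k))
    # majority-vote classifier: accuracy of mapping argmin-dimension -> factor
    majority = {}
    for d, k in votes:
        majority.setdefault(d, Counter())[k] += 1
    correct = sum(c.most_common(1)[0][1] for c in majority.values())
    return correct / len(votes)
```

On ideally disentangled representations (one dimension per factor), the fixed factor's dimension has zero variance, so the vote is always correct and the score is 1.0.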
Apart from designing classification tasks for disentanglement evaluation, another approach is based on estimating the mutual information (MI) between a single dimension of the latent variable and a single generative factor. Chen et al. (2018) propose to use, as the disentanglement score, the average over all generative factors of the gap between the largest and second-largest MI (normalised by the information entropy of the generative factor), whereas the modularity metric of Ridgeway and Mozer (2018) measures whether a single latent variable has the highest MI with only one generative factor and none with the others.
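The MI-gap idea of Chen et al. (2018) can be sketched with discrete mutual information over binned latent dimensions. This is a stdlib-only simplification (equal-width binning and the function names are our assumptions; the paper estimates MI differently):

```python
import math
from collections import Counter

def entropy(xs):
    n = len(xs)
    return -sum(c / n * math.log(c / n) for c in Counter(xs).values())

def mutual_info(xs, ys):
    # I(X;Y) = H(X) + H(Y) - H(X,Y) for discrete sequences
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def mig(latents, factors, bins=20):
    """latents: list of representation vectors; factors: list of factor tuples.
    Latent dimensions are discretised into equal-width bins (a simplification)."""
    dims, n_factors = len(latents[0]), len(factors[0])
    binned = []
    for d in range(dims):
        col = [z[d] for z in latents]
        lo, hi = min(col), max(col)
        w = (hi - lo) or 1.0
        binned.append([min(int((v - lo) / w * bins), bins - 1) for v in col])
    gaps = []
    for k in range(n_factors):
        fk = [f[k] for f in factors]
        mis = sorted((mutual_info(b, fk) for b in binned), reverse=True)
        gaps.append((mis[0] - mis[1]) / entropy(fk))  # normalised MI gap
    return sum(gaps) / n_factors
```

For a representation in which dimension k perfectly encodes factor k, the top MI equals the factor's entropy while the runner-up is near zero, so the score approaches 1.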
The algorithmic details for computing the above metrics are provided in Appendix A.
Empirical Difference. To highlight the empirical differences between these metrics, we use a toy set built by permuting four letters: A B C D. Each letter represents a generative factor with 20 possible assignments (i.e., X = {X1, . . . , X20} where X ∈ {A, B, C, D}). We consider two settings where each generative factor is embedded in a single dimension (denoted Ex.1) or in two dimensions (denoted Ex.2). In each setting we uniformly sample 20 values from −1 to 1 to represent the 20 assignments per factor and use them to allocate the assignments into distinctive bins per corresponding dimension. By concatenating the dimensions for each generative factor, we construct ideal disentangled representations for the data points in this toy dataset, amounting to 4- and 8-dimensional representations, respectively. Using these representations (skipping the encoding step), we measure the above metrics.
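The single-dimension setting (Ex.1) can be sketched as follows; the helper name and the choice of sorting the sampled values are our own, the construction otherwise follows the description above:

```python
import random

def build_ideal_representations(n_factors=4, n_values=20, seed=0):
    """One latent dimension per generative factor (the Ex.1 setting).
    Each of the n_values assignments per factor maps to a distinct value
    sampled uniformly from [-1, 1]."""
    rng = random.Random(seed)
    codebooks = [sorted(rng.uniform(-1, 1) for _ in range(n_values))
                 for _ in range(n_factors)]
    def represent(assignment):          # assignment: tuple like (3, 0, 17, 9)
        return [codebooks[f][a] for f, a in enumerate(assignment)]
    return represent

represent = build_ideal_representations()
z = represent((3, 0, 17, 9))            # a 4-dimensional ideal representation
```

Ex.2 would simply use two codebook values per assignment, doubling the dimensionality to 8.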
Data Requirement. Computing the above metrics requires, for each generative factor f_i and each of its values v_ij, access to a set S_ij which contains observations that have value v_ij on generative factor f_i while everything else is arbitrary. We present two synthetic datasets (§3) that meet these criteria and use them in our experiments (§4).

Generative Synthetic Datasets
The use of synthetic datasets is common practice for evaluating disentanglement in the image domain (Dittadi et al., 2021; Higgins et al., 2017; Kim and Mnih, 2018). These simplistic generative datasets define independent generative factors (e.g., shape, color) behind the data generation. However, a comparable resource is missing in the text domain. We develop two synthetic generative datasets with varying degrees of difficulty to analyse and measure disentanglement: the YNOC dataset (§3.1), which has only three structures and whose generative factors appear in every sentence, and the POS dataset (§3.2), which has more structures and in which some generative factors are not guaranteed to appear in every sentence. The YNOC dataset therefore offers a simpler setting for disentanglement.

YNOC Dataset
Sentences in YNOC are generated by 4 generative factors: Year (Y), Name (N), Occupation (O), and City (C), describing the occupation of a person. Since we often use different means to express the same message, we considered three templates to generate YNOC sentences. The templates were then converted into real sentences using 10 years, 40 names, 20 occupations, and 30 cities. This amounted to a total of 720K sentences, split (60%, 20%, 20%) into training, validation, and test sets.
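Template-based generation of this kind can be sketched as below. The three template strings here are hypothetical stand-ins (the paper's exact templates are not reproduced in this text); only the factor inventory and sentence count follow the description above:

```python
import itertools

# Hypothetical templates in the spirit of YNOC; the actual wording may differ.
TEMPLATES = [
    "in {Y} , {N} worked as a {O} in {C} .",
    "{N} was a {O} in {C} in {Y} .",
    "{C} was where {N} worked as a {O} in {Y} .",
]

def generate_ynoc(years, names, occupations, cities):
    # every combination of factor values, expressed through every template
    for y, n, o, c in itertools.product(years, names, occupations, cities):
        for t in TEMPLATES:
            yield t.format(Y=y, N=n, O=o, C=c)
```

With 10 years, 40 names, 20 occupations, 30 cities, and 3 templates this yields 10 x 40 x 20 x 30 x 3 = 720,000 sentences, matching the reported corpus size.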

POS Dataset
We use part-of-speech (POS) tags to simulate the structure of sentences and define a base grammar as "n. v. n. end-punc.", where 'n.' denotes a noun, 'v.' a verb, and 'end-punc.' the punctuation appearing at the end of a sentence. We then define simple sentence structures by optionally adding adjectives ('adj.') to the nouns, and complex structures by joining simple structures with conjunctions: 'conj1.' and 'conj2.' denote two different kinds of conjunction, 'comma' denotes ',', and 'S1' and 'S2' are two simple sentence structures without 'end-punc.'. We limit the number of POS tags that appear in 'S1' and 'S2' to 9 to control the complexity of the generated sentences, obtaining 279 complex structures in total. A maximum of 5 words is chosen for each POS to construct our sentences.
The frequency of appearance of each word in a sentence is limited to one. Although this construction does not aim for "realistic" sentences, it simulates natural text in terms of the presence of an underlying grammar and rules over POS tags. We deliberately ignore semantics, since isolating semantics in terms of generative factors potentially involves analysis over multiple dimensions (a combinatorial space), and quantifying grouped disentanglement requires suitable disentanglement metrics to be developed. We leave further exploration of this to future work.
We split the dataset into training, validation, and test sets with proportions 60%, 20%, 20%. This proportion is applied to every structure to ensure each has representative sentences in every portion of the data splits. The final sizes of the (training, validation, test) sets are (1,723,680; 574,560; 574,560). All three sets are unbiased with respect to word selection for each POS tag: e.g., all 5 noun vocabulary items from Table 2 have equal frequency (i.e., 20%). Exactly the same proportions are preserved for the validation and test sets.
Through the generation process, we can define each POS tag as one ground-truth generative factor for sentences. Because the choices of words for different POS tags are independent, these generative factors are independent. However, for the same POS the choices of words are dependent, and POS tags are dependent on the structures as well. It is noteworthy that, in contrast to the image domain where all generative factors are always present in the data, in the POS dataset this cannot be guaranteed, making it a more challenging setting.

Experiments and Analysis
In this section, we examine the introduced disentanglement models on text. We measure the disentanglement scores of each model on our two synthetic datasets and quantify how well-correlated these metrics are with reconstruction loss, active units, and KL ( §4.1). We then look at various strategies for coupling the latent code during decoding and highlight their impacts on training and disentanglement behaviors ( §4.2). We continue our analysis by showing how the representation learned by the highest scoring model (on disentanglement metrics) performs compared to vanilla VAE in two text classification tasks ( §4.3), and finish our analysis by looking at these models' generative behaviors ( §4.4).
Training Configuration. We adopt the VAE architecture from Bowman et al. (2016), using an LSTM encoder-decoder. Unless stated otherwise, the (word embedding, LSTM, representation embedding) dimensionalities for the YNOC and POS datasets are (4D, 32D, 4D) and (4D, 64D, 8D), respectively, and we use the latent code to initialize the hidden state of the LSTM decoder. We use greedy decoding. All models are trained from multiple random starts using Adam (Kingma and Ba, 2015) with learning rate 0.001 for 10 epochs. We set the batch size to 256 and 512 for YNOC and POS, respectively.

Disentanglement Metrics
Taking the models (§2.1) and an Autoencoder (AE) as a baseline, we use the YNOC and POS datasets to report the average KL-divergence (KL), reconstruction loss (Rec.), and number of active units (AU) in Table 3, and illustrate the disentanglement metrics' scores in Figure 1.
As demonstrated in Table 3, different models exhibit various behaviours; noteworthy among these are: (1) the positive correlation of C with AU, which intuitively means that increasing the channel capacity demands more dimensions of the representation to carry information, which in turn translates into a better reconstruction of the data; (2) the negative correlation between increasing β and the reconstruction loss; (3) the best Rec. and AU are achieved by AE and MAT-VAE, whereas the worst are achieved by the (collapsed) vanilla VAE; (4) MAT-VAE (β = 0.01, λ = 0.1), which induces sparser representations, performs the best on both datasets, indicating the positive impact of representation sparsity as an inductive bias. Sparsity is measured using the Hoyer measure (Hurley and Rickard, 2009); we report the average Hoyer over the data points' posterior means, where Hoyer for a data point x_i with n-dimensional posterior mean μ_i is calculated as (√n − ‖μ_i‖₁/‖μ_i‖₂)/(√n − 1), ranging from 0 (fully dense) to 1 (fully sparse).

As illustrated in Figure 1, the difference between the means of each disentanglement score across models is relatively small, and due to the large standard deviations on the metrics it is difficult to single out a superior model, consistent with findings reported in the image domain. In Table 3 (Top-3 column) we report the number of appearances of each model among the top 3 highest scoring models on at least one disentanglement metric. The ranking suggests that β-VAE with smaller β values reaches better disentangled representations, and that MAT-VAE performs well on YNOC but poorly on POS, highlighting the latter's more challenging nature. For MAT-VAE we also observe an interesting correlation between sparsity and disentanglement: for instance on YNOC, MAT-VAE (β = 0.01, λ = 0.1) achieves the highest Hoyer (see Table 4) and occurs 7 times among the Top-3 (see Table 3). Interestingly, the success of MAT-VAE does not translate to the POS dataset, where it underperforms AE.
These two observations suggest that sparsity could be a facilitator of disentanglement, but achieving a stable level of sparsity remains a challenge. A more recent development in the direction of sparsity, HSVAE (Prokhorov et al., 2020), addresses the stability issue of MAT-VAE, but we leave its exploration to future work.
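The Hoyer sparsity measure and the active-unit count used in our analysis can be sketched as follows. The Hoyer formula follows Hurley and Rickard (2009); the 0.01 variance threshold for AU is a common convention and an assumption here:

```python
import math

def hoyer(mu):
    """Hoyer sparsity of a posterior-mean vector:
    0 for a fully dense vector, 1 for a fully sparse one."""
    n = len(mu)
    l1 = sum(abs(v) for v in mu)
    l2 = math.sqrt(sum(v * v for v in mu))
    if l2 == 0:
        return 1.0
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

def active_units(posterior_means, threshold=0.01):
    # a dimension counts as "active" if its posterior mean varies across data
    n = len(posterior_means)
    dims = len(posterior_means[0])
    count = 0
    for d in range(dims):
        col = [m[d] for m in posterior_means]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        count += var > threshold
    return count
```

Averaging `hoyer` over a dataset's posterior means gives the sparsity numbers reported in Table 4.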
To further analyse the inconsistency between different metrics, we calculate the Pearson product-moment correlation coefficients between them, together with reconstruction loss, active units, and KL, and report them in Figure 2. While text-specific metrics are yet to be developed, our experiment suggests that the metric of Higgins et al. (2017) is a good first candidate for the text domain, as it is the one with strong correlations with Hoyer, AU, −Rec., and KL, and has the highest overall level of agreement with the other metrics.

Coupling Latent Code and Decoder
In VAEs, we typically feed the decoder with the latent code as well as word embeddings during training. The way the latent code is coupled with the decoder can affect disentanglement for text. To highlight this, we train with 4 different coupling strategies: Init, Concat, Init Concat, and Concat w/o Emb; see Figure 3a for an accessible visualisation. To analyse the impact of coupling, we opt for CCI-VAE, which allows the comparisons to be made at the same value of KL. We first use Concat w/o Emb to find an optimal KL in vanilla VAEs, which is then used as the C to train CCI-VAEs with the other coupling strategies on the YNOC and POS datasets. For YNOC, C = 1.5, and for POS, C = 5.5. This keeps the KL-divergence and reconstruction loss at the same level for a fair comparison across strategies. We report results in Table 5. Among the investigated coupling methods, the key distinguishing factor for disentanglement is their impact on AU, which is highest for Concat.
Next, using Init as the baseline, we measure the absolute difference between the disentanglement scores of the different coupling methods in Figure 3b. In general, using concatenation brings a large improvement in disentanglement. Using both initialization and concatenation does not lead to a better result. Contrary to our expectation, not feeding word embeddings into the decoder during training does not encourage disentanglement, despite the added reliance on the latent code.
A confounding factor which could pollute this analysis is the role of strong auto-regressive decoding of VAEs and the type of information captured by the decoder in such scenario. While a preliminary analysis has been provided recently (Bosc and Vincent, 2020), this has been vastly underexplored and requires more explicit attempts. We leave deeper investigation of this to future work.

Disentanglement and Classification
To examine the performance of these models in a real-world downstream setting, we consider text classification. For our classification datasets, we use DBpedia (14 classes) and Yahoo Question (10 classes) (Zhang et al., 2015). Each class in these two datasets has (10k, 1k, 1k) randomly chosen sentences in the (train, dev, test) sets. We train vanilla VAE, β-VAE (β = 0.2), CCI-VAE (C = 10), and MAT-VAE (β = 0.01, λ = 0.1) from Table 3 on DBpedia and Yahoo (without the labels), then freeze the trained encoders and train a classifier on top, using the mean vector representations from the encoder as features.
We set the dimensionality of word embedding, LSTM, and the latent space to 128, 512, 32, respectively. The VAE models are trained using a batch size of 64, for 6 epochs with Adam (learning rate 0.001). For the classifier, we use a single linear layer with 1024 neurons, followed by a Softmax and train it for 15 epochs, using Adam (learning rate 0.001) and batch size 512. We illustrate the mean and standard deviation across 3 runs of models in Figure 4.
The results suggest that disentangled representations are likely to be easier to discriminate, although sparsely learned representations could contribute to MAT-VAE's success as well (Prokhorov et al., 2020).

Disentanglement and Generation
To observe the effect of disentanglement in homotopy (Bowman et al., 2016), we use exactly the same toy dataset introduced in §2.2 and assess the homotopy behaviour of the highest scoring VAEs vs. an ideal representation. To conduct homotopy, we interpolate between the representations of two sampled sequences and pass the intermediate representations to the decoder to generate the output. We use a 4D word embedding, 16D LSTM, and 4D latent space, and report the results for the VAEs scoring highest on the disentanglement metrics. Additionally, to highlight the role of generative factors in generation, we conduct a dimension-wise homotopy, transitioning from the first to the last sentence by interpolating over the dimensions one by one. This is implemented as follows: (i) using the prior distribution we sample two latent codes, denoted z_1 = (z_{1,1}, z_{1,2}, . . . , z_{1,n}) and z_2 = (z_{2,1}, z_{2,2}, . . . , z_{2,n}); (ii) for the i-th dimension, starting from (z_{2,1}, . . . , z_{2,i−1}, z_{1,i}, . . . , z_{1,n}), we interpolate along the i-th dimension towards (z_{2,1}, . . . , z_{2,i}, z_{1,i+1}, . . . , z_{1,n}). Table 6 illustrates this for a 3D latent code example.
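The dimension-wise interpolation scheme can be sketched as a generator over intermediate latent codes; the function name and the fixed number of steps per dimension are our own choices:

```python
def dimensionwise_path(z1, z2, steps=5):
    """Interpolate from z1 to z2 one latent dimension at a time, yielding the
    intermediate codes that would be fed to the decoder. Dimensions already
    traversed keep their z2 values, later dimensions keep their z1 values."""
    current = list(z1)
    for i in range(len(z1)):
        for t in range(1, steps + 1):
            current[i] = z1[i] + (z2[i] - z1[i]) * t / steps
            yield tuple(current)
```

The path starts one step away from z1 and ends exactly at z2, visiting len(z1) * steps intermediate codes in total.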
Results: Table 7 reports the outputs for standard homotopy (top block) and dimension-wise homotopy. The results for standard homotopy demonstrate that the presence of an ideally disentangled representation generally translates into disentangled generation. However, both VAE-Higg and VAE-Chen seem to mainly produce variations of the letter in the first position (letter A) during the interpolation. The same observation holds in the dimension-wise experiments. VAE-Chen also produces variations of the letter in the second position (letter B) along with variations of letter A, which suggests the lesser importance of completeness for disentangled representations. This indicates that, despite the relatively superior performance of certain models on the metrics and classification tasks, the amount of disentanglement present in the representation is not sufficient to be reflected in the generative behavior of these models. As future work, we will look into the role of auto-regressive decoding and teacher-forcing as confounding factors that can potentially affect the disentanglement process.

Conclusion and Future Directions
We evaluated a set of recent unsupervised disentanglement learning frameworks widely used in the image domain on two artificially created corpora with known underlying generative factors. Our experiments highlight the existing gaps in the text domain, the daunting challenges that state-of-the-art models from the image domain face on text, and the confounding elements that pose further obstacles towards representation disentanglement in the text domain. Motivated by our findings, in future work we will explore the role of inductive biases such as representation sparsity in achieving representation disentanglement. Additionally, we will look into alternative forms of decoding and training which may compromise reconstruction quality but increase the reliance of decoding on the representation, allowing for a more controlled analysis and evaluation.
Our synthetic datasets and experimental framework provide a set of quantitative and qualitative measures to facilitate future research in developing new models, datasets, and evaluation metrics specific to text.

A Disentanglement Metrics Algorithms
To evaluate representations learned by a model on a dataset meeting the Data Requirement, we further require a series of representation sets R_ij, each of which has a bijective mapping with S_ij. Hence, when sampling representations which have the same value on one generative factor, we only need to sample within one R_ij. Under these notations, we present the pseudo-code of the metrics in Algorithms 1-6. For Algorithms 5 and 6, although we only use one criterion in the main paper, we still provide the details for the other criteria. We set N = 1000 and L = 64 for Algorithms 1 and 2, and N = 10000 for Algorithms 3, 4, 5, and 6.