A Deep Decomposable Model for Disentangling Syntax and Semantics in Sentence Representation

Recently, disentanglement based on generative adversarial networks or variational autoencoders has significantly advanced the performance of diverse applications in the CV and NLP domains. Nevertheless, these models still operate at a coarse level when disentangling closely related properties, such as syntax and semantics in human languages. This paper introduces a deep decomposable model based on VAE to disentangle syntax and semantics by applying total correlation penalties to KL divergences. Notably, we decompose the KL divergence term of the original VAE so that the generated latent variables can be separated in a more clear-cut and interpretable way. Experiments on benchmark datasets show that our proposed model significantly improves the disentanglement quality between syntactic and semantic representations on both semantic similarity and syntactic similarity tasks.


Introduction
Recently, disentangled representations have significantly advanced the performance of several applications in NLP. For example, disentanglement has been used to separate representations of attributes such as sentiment from content (Fu et al., 2018; John et al., 2019), to understand subtleties in component modeling (Esmaeili et al., 2019), to detect anomalies, and to learn sentence representations that split the syntax and the semantics. Disentangled representations have also been used to boost text generation (Iyyer et al., 2018; Jain et al., 2018) or to calculate the semantic or syntactic similarity between sentences.
In this paper, we focus on the task of separating syntax and semantics in sentence representation learning. Unlike previous supervised approaches that usually resort to syntactic parsers to handle syntax processing, our approach separates syntactic and semantic variables by disentangling hidden states of deep neural nets in a self-learning and unsupervised fashion.
The first work focusing on the separation of syntax and semantics from hidden variables is Chen et al. (2019). They proposed a deep generative model based on VAE with two latent variables representing syntax and semantics. The generative model places a von Mises-Fisher (vMF) prior on the semantic latent variable and a Gaussian prior on the syntactic one, with a deep bag-of-words (BOW) decoder conditioned on these latent variables. Following previous work, they train this model by optimizing the evidence lower bound (ELBO) with a VAE-like (Kingma and Welling, 2014) objective.
However, their approach still produces a rough decomposition and thus may fail to disentangle syntax and semantics at a finer granularity. To address this weakness, we propose a decomposable variational autoencoder (DecVAE) that allows the hidden variables to be factorized. From a modeling perspective, factorized representations with statistically independent variables, usually obtained in an unsupervised or semi-supervised manner, can distill information into a compact form that is semantically useful for downstream tasks. From an application perspective, different words or phrases in a sentence represent various entities playing different roles, so decomposable latent variables are needed to capture this variety of entities and their distinct semantic meanings.
Towards building a finer-grained disentanglement, motivated by FactorVAE (Kim and Mnih, 2018), we extend the work of Chen et al. (2019) and use total correlation (Watanabe, 1960) (TC) as a penalty term to obtain a deeper and more meaningful factorization of the syntactic and semantic latent variables. To make the TC more discriminative, we also integrate multi-head attention into this framework. DecVAE can identify and cluster hierarchically independent semantic components in natural language text, which exhibits hierarchical linguistic structure (Sanh et al., 2019) in which syntax and semantics interact with each other. For experiments, we evaluate the learned semantic representations on the SemEval semantic textual similarity (STS) tasks. Following the protocol in Chen et al. (2019), we predict the syntactic structure of an unseen sentence to be that of its nearest neighbor, determined by the latent syntactic representation, in a large dataset of annotated sentences. Experiments show that DecVAE achieves the best performance on all tasks when the learned representations are mostly disentangled.
Contributions. Firstly, we propose a generic DecVAE to disentangle semantics and syntax based on the total correlation decomposition of the KL divergence. Secondly, DecVAE is integrated with a multi-head attention network to cluster embedding vectors so that the corresponding word embeddings are more discriminative. Thirdly, DecVAE achieves SOTA performance in disentangling syntax from semantics, confirming its effectiveness.

VAEs for Disentanglement
The variational autoencoder (VAE) (Kingma and Welling, 2014) is a latent variable model that pairs a top-down generator with a bottom-up inference network. Different from the traditional maximum-likelihood estimation (MLE) approach, VAE training optimizes the evidence lower bound (ELBO) in order to overcome the intractability of the posterior. The objective function of the VAE (to be maximized) is

L(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x|z)] − β KL(q_φ(z|x) || p(z)).   (1)

When β = 1, this is the standard VAE. When β > 1, it becomes β-VAE (Higgins et al., 2017), which attempts to learn a disentangled representation by optimizing a heavily penalized objective.
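As a concrete reference, the β-VAE objective above can be sketched in a few lines of NumPy. The closed-form KL assumes a diagonal-Gaussian encoder and a standard normal prior; the function names are ours, not from the paper:

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def beta_vae_loss(recon_log_lik, mu, logvar, beta=1.0):
    """Negative ELBO: -E[log p(x|z)] + beta * KL(q(z|x) || p(z)).
    beta = 1 recovers the standard VAE; beta > 1 gives beta-VAE."""
    return -recon_log_lik + beta * gaussian_kl(mu, logvar)
```

With `beta > 1` the same reconstruction term is traded off against a heavier independence pressure on the posterior, which is the mechanism β-VAE relies on for disentanglement.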
Vanilla VAEs cannot reliably disentangle latent variables. PixelGAN autoencoders (Makhzani and Frey, 2017) further break down the KL term as

E_{p(x)}[KL(q(z|x) || p(z))] = I(x; z) + KL(q(z) || p(z)),

where I(x; z) is the mutual information under the joint distribution p(x)q(z|x). Penalizing the KL(q(z) || p(z)) term pushes q(z) towards the factorial prior p(z), encouraging independence in the dimensions of z and thus disentangling them.
Alternatively, FactorVAE (Kim and Mnih, 2018) approaches this problem with a total correlation penalty, which we adopt in our work. FactorVAE achieves similar disentangling results while preserving good reconstruction quality by augmenting the vanilla VAE objective with a term that directly encourages independence in the code distribution:

KL(q(z) || ∏_j q(z_j)),

where z_j denotes the j-th dimension of z. The FactorVAE objective is still a lower bound on the marginal log-likelihood E_{p(x)}[log p(x)]. The term KL(q(z) || ∏_j q(z_j)) is known as the total correlation (TC) (Watanabe, 1960), a popular measure of dependence for multiple random variables.
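FactorVAE estimates the TC with a discriminator trained to tell samples from q(z) apart from samples from ∏_j q(z_j); the latter are produced by permuting each latent dimension independently across the batch. A minimal sketch of that permutation trick (our naming, not the authors' code):

```python
import numpy as np

def permute_dims(z, rng):
    """Shuffle each latent dimension independently across the batch.
    The result is (approximately) a sample from prod_j q(z_j), which a
    discriminator can contrast with samples from q(z) to estimate TC."""
    z_perm = z.copy()
    for j in range(z.shape[1]):
        rng.shuffle(z_perm[:, j])   # in-place shuffle of column j
    return z_perm
```

Each column keeps the same marginal distribution, but cross-dimension dependence within a row is destroyed, which is exactly the factorized distribution the TC penalty compares against.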

Disentanglement in NLP
Disentanglement in NLP has strong connections with LDA (Blei et al., 2003; Blei and Lafferty, 2006). In particular, neural topic models, which use belief networks (Mnih and Gregor, 2014; Li et al., 2019b) or enforce the Dirichlet prior via Gaussian or Wasserstein autoencoders (Nan et al., 2019; Li et al., 2018), relate topic learning to disentanglement with component analysis. Later on, seq2seq VAEs represented disentangled topics via continuous representations (Dieng et al., 2017; Ding et al., 2018; Bowman et al., 2016). Srivastava and Sutton (2017) combine LDA and VAE for topic detection, and Pergola et al. (2021) propose to treat latent topics as generative factors to be disentangled so as to improve the discriminative power of topics.
Although much work has been done on grammatical and semantic analysis, there are few explorations of disentangling syntax and semantics. This disentanglement is quite challenging since the two are heavily entangled. Except in unambiguous cases, such as unique proper names, it is usually difficult to draw absolute borderlines among words, phrases, or entities.
The work most relevant to ours is VGVAE (Chen et al., 2019), which assumes that a sentence is generated by conditioning on two independent latent variables: a semantic variable z_sem and a syntactic variable z_syn. For inference, they assume a factored posterior and maximize a lower bound on the marginal log-likelihood in the generative process. The inference models are two independent word averaging encoders with additional linear feed-forward layers, and the generative model is a feed-forward neural network whose output is a bag of words, or alternatively an RNN.
Compared with their work, we aim to construct a more generic framework by exploiting the decomposability of the KL divergence, thereby discovering more subtle components in the latent variables. Consequently, the VAE framework can achieve better disentanglement with more fine-grained decomposed parts. Further, we can flexibly add regularizers to guide the decomposition so that the decoder generates more interpretable and controllable elements.

Proposed Approach
In this work, we develop a generative model named decomposable VAE (DecVAE). Although our approach is applicable to any disentanglement task in NLP, we focus on disentangling semantic and syntactic information in sentence representations. We extend the VGVAE model (Chen et al., 2019) by incorporating total correlation as a penalty term to enable latent variable factorization.

Decomposable VAE
Our model is essentially a VAE: it is composed of a term computing the log-likelihood of the input data given latent variables, and terms computing KL divergences between the variational posteriors of the hidden variables given the input data and the priors of the hidden variables. Let x_1, ..., x_T be a sequence of T tokens (words), conditioned on a continuous latent variable z. As is common practice, for example in latent Dirichlet allocation (LDA) (Blei et al., 2003), we assume the words are conditionally independent given z:

p_θ(x_1, ..., x_T | z) = ∏_{t=1}^{T} p_θ(x_t | z).

Model parameters θ can be learned via the variational lower bound (Kingma and Welling, 2014)

log p_θ(x) ≥ E_{q_φ(z|x)}[∑_t log p_θ(x_t | z)] − KL(q_φ(z|x) || p_θ(z)),

where q_φ(z|x_t) is the encoder (recognition or inference model), parameterized by φ, i.e., the approximation to the true posterior p_θ(z|x_t), and p_θ(z) is the prior for z.
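Under the conditional-independence assumption, the reconstruction term reduces to a sum of per-token log-probabilities. A toy NumPy sketch of this bag-of-words likelihood (illustrative only; the paper's decoder is a learned network producing the vocabulary logits):

```python
import numpy as np

def bow_log_likelihood(logits, token_ids):
    """log p(x_1..x_T | z) = sum_t log p(x_t | z) under the conditional
    independence assumption; `logits` are the decoder's vocabulary scores
    given z (shared across positions for a bag-of-words decoder)."""
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    return float(np.sum(log_probs[token_ids]))
```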
As studied in Sanh et al. (2019), natural language can be regarded as a manifold, since it is hierarchically organized and its syntax and semantics interact in an intricate space. Based on the observation that different words or phrases in sentences represent different entities with different roles, either grammatical or semantic, which potentially interact with each other, we guide the generation of latent variables in the VAE to correspond to entities in sentences by designing a VAE with decomposable latent variables. Hence, our proposed DecVAE can identify hierarchically independent components in natural language. Furthermore, the reconstruction network may generate words or phrases sequentially.
DecVAE learns a decoder that maps the latent space Z (learned by the encoder from input samples) to this language manifold X. Let Z = [z_1, ..., z_K] ∈ Z be the latent variable of the decoder, with z_k the k-th component. In addition, we pair each z_k with z_0, a special latent variable that encodes the overall properties of the generated sentences and the correlations between the different grammatical and semantic components.
Let x̄ = [x̄_1, ..., x̄_K] be the variables for the output of the decoder (each element is a tuple composed of the generated token index in the vocabulary and its component index), where z_k controls the properties of the k-th component x̄_k.
Firstly, we assume that the components are conditionally independent of each other given the latent variables, i.e.,

p(x̄ | z) = ∏_{k=1}^{K} p(x̄_k | z_k, z_0).

We make an analogous independence assumption about the components of the latent variables,

q(z | x) = ∏_{k=1}^{K} q(z_k | x).

Figure 1: The proposed model consists of four layers. From bottom to top, they are the embedding layer (with masks added to attentions), the multi-head attention layer, the encoder, and the decoder. Different from the usual network structure, the first three layers comprise parallel independent streams, one for semantics and one for syntax. The attention layers yield K-dimensional attention weights f, so that an ensemble of K weighted embeddings feeds both the semantic and syntax encoders.
Let ȳ = (x̄, f̄) with each ȳ_k = (x̄_k, f̄_k), and define the distributions of the generated tokens conditioned on these variables accordingly. This model attempts to encode each component's individual features (tokens, words, or phrases) together with the global latent factors of the sentence.

Objective Function
We propose to decompose the two KL divergence terms following Eq. (1). Meanwhile, along the thread of our proposed DecVAE, we add the global controller variable z_0. This design shares some similarities with component segmentation in computer vision, such as MONet (Burgess et al., 2019). MONet shows that an attention network layer improves both component segmentation and component disentanglement; there, a variable f represents the attention. Taking these into consideration, our model is defined as follows. Let z_syn = [z_syn^1, ..., z_syn^K] be the syntactic latent variable. Based on the decomposable nature of the latent variables, the KL term for syntax decomposes as

E_{p(x)}[KL(q(z_syn|x) || p(z_syn))] = I(x; z_syn) + KL(q(z_syn) || ∏_k q(z_syn^k)) + ∑_k KL(q(z_syn^k) || p(z_syn^k)),   (4)

and a similar equation holds for semantics,

E_{p(x)}[KL(q(z_sem|x) || p(z_sem))] = I(x; z_sem) + KL(q(z_sem) || ∏_k q(z_sem^k)) + ∑_k KL(q(z_sem^k) || p(z_sem^k)),   (5)

where z_*^k, with * ∈ {sem, syn, 0}, denotes the k-th component of the corresponding latent variable. In Eq. (4) and Eq. (5), the second and third terms are derived from the minimization of total correlations as in Esmaeili et al. (2019) and Jeong and Song (2019). The second term decomposes each hidden vector of syntax and semantics into smaller categories in a hierarchical fashion, so that we obtain subtler disentanglement of each syntactic or semantic component.
The third term in Eq. (4) and Eq. (5) follows from the standard definition of total correlation; we penalize the TC to enforce disentanglement of the latent factors. To compute the second term, we use a weighted estimator for the distribution value of q(z).
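One common way to realize such a weighted estimator of q(z) is the minibatch-weighted estimator of β-TCVAE (Chen et al., 2018); whether DecVAE uses exactly this estimator is our assumption, so treat the sketch below as illustrative:

```python
import numpy as np

def log_qz_minibatch(log_qz_given_x, dataset_size):
    """Minibatch-weighted estimate of log q(z_i) for each sample i.
    log_qz_given_x[i, j] = log q(z_i | x_j) over a minibatch of size M;
    log q(z_i) ~= logsumexp_j log q(z_i | x_j) - log(M * N),
    where N is the full dataset size."""
    m = log_qz_given_x.shape[1]
    mx = log_qz_given_x.max(axis=1, keepdims=True)       # stabilize
    lse = mx[:, 0] + np.log(np.exp(log_qz_given_x - mx).sum(axis=1))
    return lse - np.log(m * dataset_size)
```

The estimate of the aggregate posterior then plugs directly into the TC term KL(q(z) || ∏_k q(z_k)).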

The Network Structure
With the above derivations as our basis, we construct the network structure shown in Figure 1. From bottom to top, the input sentences are converted to embedding vectors. Meanwhile, there is a mask input, with each mask m_k indicating whether each word or phrase x_t appears in each sentence. Outputs from this layer are fed to a multi-head attention layer to generate attention weights f_t, followed by the dot product between the embedding of x_t and its attention weight f_t. Since we model both the semantics and the syntax of input sentences, the attention procedure is run twice with different initializations. The results are passed into the semantic encoder and the syntax encoder, respectively. The encoders yield the hidden variables (z_sem,t^{1..K}, z_0,t^{1..K}) and (z_syn,t^{1..K}, z_0,t^{1..K}). A similar idea appears in recent work in the computer vision (CV) domain, MONet (Burgess et al., 2019); differently, their f_k is generated sequentially with an attention network, whereas we generate all attention at once with multi-head attention, which has proven successful in the Transformer model (Vaswani et al., 2017).
To incorporate recurrent neural networks for decoding, we adopt a structure similar to SNAIL (Mishra et al., 2018): the self-attention mechanism from the Transformer is combined with a temporal convolution. Next, the element-wise multiplication of the embedding vectors and the focus masks generates hidden vectors, which are fed into the semantic encoder and the syntax encoder respectively and encoded as pairs of variables (z_k, z_0^k). The two groups of hidden component vectors are concatenated and passed to the decoder. We obtain the reconstructed words/phrases x̄ and their component distribution f̄_k, which resembles a component assignment and is consistent with the weights f_k.
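The attention-weighting step (a softmax over K components per token, then element-wise weighting of the token embedding) can be sketched as follows; shapes and names are illustrative, not taken from the paper's code:

```python
import numpy as np

def attend(emb, attn_logits):
    """Softmax over K components per token, then weight each token's
    embedding by its component weights.
    Shapes: emb (T, D), attn_logits (T, K) -> weighted (T, K, D)."""
    w = np.exp(attn_logits - attn_logits.max(axis=1, keepdims=True))
    f = w / w.sum(axis=1, keepdims=True)          # (T, K), rows sum to 1
    weighted = f[:, :, None] * emb[:, None, :]    # K weighted copies per token
    return weighted, f
```

Because the K weights per token sum to one, summing the weighted copies over components recovers the original embedding, so the split is a lossless soft assignment.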

Multi-task Training and Inference
With the product of the embedding vectors emb_t and their corresponding focus masks m_t as the encoder's input, (z_k, z_0^k) as the latent variables, and (x̄, m̄_k) as the output of the decoder, the loss for component k is given by Eq. (6). Here a, e, and d refer to the multi-head attention layer, the encoder, and the decoder layer respectively; θ and φ are the parameters of the likelihood and the variational distribution respectively; and Ψ_k(x, f_k; θ, φ, a, e, d) is the local hidden variable.
Loss Function Components. As seen from Eq. (6), our loss function is composed of three parts, which are realized by the objective functions described in Eq. (4) and Eq. (5). Furthermore, following the success of multi-task training in Chen et al. (2019), we introduce three auxiliary objectives: the paraphrase reconstruction loss (PRL), the discriminative paraphrase loss (DPL), and the word position loss (WPL). Their purpose is to encourage z_sem to better capture semantic information and z_syn to better capture syntactic information.
Discriminative Paraphrase Loss. The discriminative paraphrase loss (DPL) encourages sentences in a paraphrase relationship to have higher similarity and those without such a relationship to have lower similarity. Because the paraphrase relationship is defined in the sense of semantic similarity, we compute it only with samples from the vMF distributions. The loss is defined as

max(0, δ − d(x_1, x_2) + d(x_1, n_1)) + max(0, δ − d(x_1, x_2) + d(x_2, n_2)),

where d is the similarity function, x_1 and x_2 are sentences in a paraphrase relationship, while x_1 and n_1 (and x_2 and n_2) are not. The similarity function is the cosine similarity between the mean directions of the semantic variables across the K components of the two sentences, where μ(x_i) = z_sem(i)^{1..K} ⊙ z_0(i)^{1..K} and ⊙ is the element-wise product.
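A minimal sketch of the DPL hinge, assuming cosine similarity as the similarity function d; the margin value `delta` here is arbitrary, not the paper's setting:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def dpl(x1, x2, n1, n2, delta=0.4):
    """Hinge loss: the paraphrase pair (x1, x2) should be more similar
    than either negative pair (x1, n1), (x2, n2) by a margin delta."""
    sim = cosine(x1, x2)
    return (max(0.0, delta - sim + cosine(x1, n1))
            + max(0.0, delta - sim + cosine(x2, n2)))
```

The loss is zero once the paraphrase pair beats both negatives by the margin, so well-separated examples stop contributing gradient.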

Word Position Loss.
Following Chen et al. (2019), we keep a word position loss (WPL) to guide the representation learning of the syntactic variable. For both the word averaging and LSTM encoders, we parameterize WPL with a three-layer feed-forward neural network f(·). The input to this network is the concatenation of a sample of the syntactic variable z_syn and the word embedding emb_i at position i. In the decoding stage, the position representation at position i is predicted as a one-hot vector:

WPL = E_{z_syn ~ q_φ} [ − ∑_i log softmax(f([z_syn; emb_i]))_i ],

where softmax(·)_i is the predicted probability of position i.
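The WPL reduces to a per-position cross-entropy. A toy sketch, with the feed-forward network abstracted into precomputed position logits (the abstraction is ours, for brevity):

```python
import numpy as np

def word_position_loss(position_logits, positions):
    """-sum_i log softmax(f([z_syn; emb_i]))_i : cross-entropy between
    each word's predicted position distribution and its true position.
    position_logits: list of logit vectors, one per word; positions: the
    true position index of each word."""
    loss = 0.0
    for logits, i in zip(position_logits, positions):
        log_probs = logits - np.log(np.sum(np.exp(logits)))
        loss -= log_probs[i]
    return float(loss)
```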
Inference Model for Word Averaging. In our framework, the syntax and semantics encoders q_φ^e(z_syn|x) and q_φ^e(z_sem|x) follow different sampling strategies, each with an additional linear feed-forward neural network. However, both use word averaging to obtain the mean vector μ(x) and the standard deviation vector σ(x).
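A sketch of a word averaging encoder head; the softplus used to keep σ positive is our assumption, as the paper does not specify the nonlinearity, and the weight names are illustrative:

```python
import numpy as np

def word_avg_encoder(emb, W_mu, W_sigma):
    """Average the word embeddings (emb: (T, D)), then apply linear maps
    to produce the mean vector and a positive standard deviation vector
    for the latent posterior."""
    h = emb.mean(axis=0)                     # sentence vector, shape (D,)
    mu = h @ W_mu                            # mean of the posterior
    sigma = np.log1p(np.exp(h @ W_sigma))    # softplus keeps sigma > 0
    return mu, sigma
```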
In the decoding stage, we generate a bag of words given z_syn and z_sem via the probability p_θ^d(x|z_syn, z_sem). Note that the decoder output is a tuple of vectors, including both the word indices and their component probability distribution. The expected output log-probability is computed as

E_{q_φ(z_sem|x) q_φ(z_syn|x)} [ ∑_{t=1}^{T} log softmax(f_θ([z_sem; z_syn]))_{x_t} ],

where the softmax is taken over the vocabulary of size V, [;] indicates concatenation, T is the sentence length, x_t is the word type index of the t-th word, and f_θ([z_sem; z_syn]) is a feed-forward neural network whose outputs form a bag of words.
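A toy bag-of-words decoder over the concatenated latents; the single linear map `W` and the toy sizes are illustrative stand-ins for the paper's feed-forward network f_θ:

```python
import numpy as np

rng = np.random.default_rng(0)
V, dz = 20, 16                             # toy vocabulary size, latent dim
W = rng.normal(size=(2 * dz, V)) * 0.1     # stand-in for f_theta's weights

def bow_decoder_logprob(z_sem, z_syn, token_ids):
    """sum_t log softmax(f_theta([z_sem; z_syn]))_{x_t}: a bag-of-words
    decoder scores every observed token under one shared distribution
    computed from the concatenated semantic and syntactic latents."""
    logits = np.concatenate([z_sem, z_syn]) @ W
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return float(sum(log_probs[t] for t in token_ids))
```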

Inference Model for BLSTM Averaging
Similarly, we compute the expected output log-probability of the generated words, including their component information, for the BLSTM. The inference model q_φ^e(z_sem|x) is still a word averaging encoder, while q_φ^e(z_syn|x) is parameterized by a bidirectional LSTM, where the forward and backward hidden states are concatenated and then averaged. The averages are used as input to a feed-forward network with one hidden layer that produces both the mean vector μ(x) and σ(x).
Since both the word averaging and the BLSTM inference models interact with the decomposed KL divergences, i.e., the total correlations, through backpropagation, our inference and generative models obtain more factorized component information. Hence, the generated tokens are more consistent between syntax and semantics.

Experiments
Following Chen et al. (2019), we sampled 50M paraphrase pairs from ParaNMT-50M as our training set. We use the SemEval 2017 semantic textual similarity (STS) task (Cer et al., 2017) as the development set, and the STS tasks and the STS benchmark as the test set for similarity evaluation. The implementation is based on the PaddlePaddle deep learning platform.

Experiment Setup
We set the dimension of the hidden variables and word embeddings to 50, which speeds up experiments while providing competitive performance over a wide range of settings. For a fair comparison, we also tune γ, the weight for the PRL and reconstruction losses, from 0.1 to 1 in increments of 0.1 based on development set performance; γ = 0.2 gives the best validation results. One sample from each latent variable is used during training. When evaluating DecVAE-based models on STS tasks, the mean direction of the semantic variable is used; in contrast, the mean vector of the syntactic variable is used for the syntactic similarity tasks. The total correlations are mainly applied to the syntactic tasks, since we find that applying total correlations to the vMF distribution makes the model too complicated. Hence, in the current work we simplify the framework by computing only the KL divergence of the attentions against the semantic components.

Baselines
We compare with the word averaging (WORDAVG) and bidirectional LSTM averaging (BLSTMAVG) variants of the VGVAE model (Chen et al., 2019). In particular, WORDAVG takes the average over the word embeddings of the input sequence to obtain the sentence representation. BLSTMAVG uses the averaged hidden states of a bidirectional LSTM as the sentence representation, where forward and backward hidden states are concatenated. As shown in the upper rows of Table 1, DecVAE+WORDAVG achieves the best semantic score on both the STS avg metric and the STS bm metric. LSTM-based models show no advantage over WORDAVG, as also observed for VGVAE (Chen et al., 2019), so averaging LSTM outputs is no more effective for the decomposed VAE than it is for vanilla VAE based approaches.

Semantic Similarity Evaluations
The lower rows in Table 1 show whether the semantic variables capture semantic information better than the syntactic variables. We reproduced VGVAE's results with their released package (Chen et al., 2019) for comparison; our results are in rows 3 to 11. As shown, the semantic and syntactic variables of the base DecVAE model perform similarly on the STS test sets. As more losses are added, the performance of the two variables gradually diverges, indicating that different information is captured by each. Therefore, the various losses play essential roles in the disentanglement of semantics and syntax in DecVAE. When all losses plus the WORDAVG encoder and decoder are fully utilized, the highest benchmark result (73.91%) is obtained, 1.7% higher than VGVAE for semantic variables. Meanwhile, all losses plus the LSTM encoder and decoder achieve the best average results for semantic variables. More impressively, this approach yields relatively low scores for both the benchmark and the average of the syntactic variables (8.05 and 9.72 for bm and avg, respectively). This clearly shows that decomposition with total correlation has excellent capacity to disentangle semantics and syntax.
Finally, Figure 3 plots the performance curves of our models and the baselines as the length of the target sentence increases. We observe a similar trend, i.e., the longer the sentence, the worse the performance. Our framework stays close to the top (red) curve and has a more consistent trend, showing that DecVAE achieves more remarkable disentanglement effects in syntax. In particular, in Table 1, the full model with the LSTM encoder and decoder achieves much lower values on the syntactic evaluations than all other models.

Figure 3: Constituency parsing F1 scores (left) and POS tagging accuracy (right) by sentence length, for 1-nearest-neighbor parsers based on semantic and syntactic variables, as well as a random baseline and an oracle nearest neighbor parser ("Best"). In the legend, "+LSTM" means "+LSTM enc & dec".

Syntactic Similarity Evaluation
Following the evaluation protocol in VGVAE (Chen et al., 2019), we use the syntactic variables to find nearest neighbors for a 1-nearest-neighbor syntactic parser or POS tagger. Several metrics quantify the quality of the output parses and tag sequences. It is worth noting that this evaluation does not directly compare parsing accuracy; rather, analogous to the semantic similarity evaluation, it demonstrates the syntactic variables' ability to capture more syntactic information than the semantic variables. We report labeled F1 of constituency parsing and accuracy of POS tagging in Table 2.

Table 2: Syntactic similarity evaluations: labeled F1 score for constituency parsing and accuracy (%) for part-of-speech tagging. Numbers are bold if they are worst in the "semantic variable" column or best in the "syntactic variable" column. "ALL" indicates that all of the multi-task losses are used. The results are averaged over five rounds, and the standard deviation is around 0.1%-0.2% for all methods.
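The 1-nearest-neighbor protocol itself is simple: retrieve the annotated sentence whose syntactic variable is closest to the query's and reuse its parse or tag sequence as the prediction. A sketch, assuming cosine similarity as the retrieval metric:

```python
import numpy as np

def nearest_neighbor(query_syn, bank_syn):
    """Index of the annotated sentence whose syntactic variable is closest
    (by cosine similarity) to the query; its parse/tags then serve as the
    prediction for the query sentence."""
    q = query_syn / np.linalg.norm(query_syn)
    b = bank_syn / np.linalg.norm(bank_syn, axis=1, keepdims=True)
    return int(np.argmax(b @ q))
```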
DecVAE outperforms VGVAE in both parsing and tagging. For the lower part, in contrast to semantic similarity, the syntactic variables are expected to boost both tasks while the semantic variables worsen them. The baseline "VGVAE ALL" initially shows similar results for the two variables; with the addition of the LSTM encoder and decoder, the expected pattern emerges. For our method, the gaps between the two variables are larger than for VGVAE, although the semantic variable is not always worst and the syntactic variable not always best. This indicates that DecVAE achieves a good disentanglement of syntax and semantics. In particular, our full combination with LSTM achieves the best results, outperforming the SOTA. Another observation is that although neither VGVAE nor DecVAE performs well compared with their LSTM counterparts, "DecVAE ALL" still obtains better performance than VGVAE. We believe it is the total correlation that brings the more accurate disentanglement effects. Nonetheless, the syntactic evaluation results are, in general, not as pronounced as their semantic counterparts.

Qualitative Analysis with Case Studies
We conduct a qualitative evaluation of the latent variables via cosine similarity, retrieving the nearest neighbor sentences and words of test set examples in terms of both the semantic and the syntactic representations. The results are reported in Table 3 and Table 4.

Lexical Analysis
The retrieved words in Table 3 show clear patterns. Among the five query words, the words retrieved by semantics have meanings similar to the queries, while those retrieved by syntax share their parts of speech. For example, for the query word exact, almost all words in the semantic row carry the sense of exactness. Likewise, most of the words in the second row semantically carry the sense of order, as does the query word command.
In contrast, the syntactic matches share the POS tag NN. For the third row, the semantic matches are mostly associated with require, while the syntactic matches are all verbs. Table 4 shows semantically similar and syntactically similar sentences in columns 2 and 3, respectively. As with the lexical similarity, the retrieved sentences in column 2 share meanings, keywords, or key phrases with the query sentences while possibly differing in sentence structure; examples include "bellowed", "Do you think", "head", "change", "internet", "love", "pull", and "court" across the rows. In contrast, the syntactically similar sentences may have different meanings while sharing grammatical patterns. For example, "go, you fools, Xar bellowed" has a syntactic construction similar to "Huh, I've got file festivals to enter he said". Likewise, in the second row, both the query and its syntactically similar sentence are yes/no questions with an object clause.

Discussions
The above results show the disentanglement effects of our proposed DecVAE in both quantitative and qualitative semantic and syntactic evaluations. Compared with the baselines, DecVAE demonstrates clearly stronger disentanglement. These results confirm our assumption that a finer-grained decomposition of the KL divergences can detect more subtle aspects of semantics and syntax. This finding can shed light on constructing more representative learning strategies for language at both the token and sentence levels.

Conclusion
We propose DecVAE, a framework to disentangle syntax and semantics in a sentence. It extends the original VAE so that the latent variables can be separated in a more interpretable way. Experiments show that DecVAE achieves better results on semantic and syntactic similarity than the SOTA. One future direction is fine-grained representation learning for words and sentences, which is essential for many downstream applications such as controllable text generation. In addition, continual and interactive feature distillation may help achieve more discriminative disentanglement.