Learning Sparse Sentence Encoding without Supervision: An Exploration of Sparsity in Variational Autoencoders

It has long been known that sparsity is an effective inductive bias for learning efficient representations of data in vectors of fixed dimensionality, and it has been explored in many areas of representation learning. Of particular interest to this work is sparsity within the VAE framework, which has been studied extensively in the image domain but lacks even a basic level of exploration in NLP. NLP also lags behind in learning sparse representations of larger units of text, e.g., sentences. We address both shortcomings with VAEs that induce sparse latent representations of such units of text. First, we measure how well the unsupervised state-of-the-art (SOTA) and other strong VAE-based sparsification baselines perform on text, and propose a hierarchical sparse VAE model to address the stability issue of the SOTA. Then, we examine the implications of sparsity for text classification across three datasets, and highlight a link between the performance of sparse latent representations on downstream tasks and their ability to encode task-related information.


Introduction
Representation learning has been pivotal in many success stories of modern NLP. Observing its success, two fundamental questions arise: How is the information encoded in these representations? and What is encoded in them? While the latter has received a lot of attention through the design of probing tasks, the former has been largely neglected. In this work, we take small steps in this non-trivial direction by building on what is known: different data points embody different characteristics (e.g., statistical, semantic, or syntactic) which should ideally occupy different sub-regions of the representation space. Therefore, the high-dimensional learned representations should ideally be sparse (Bengio et al., 2013; Burgess et al., 2018; Tonolini et al., 2019). In other words, sparsity allows a varying number of active dimensions per sentence 2 (Bengio, 2009) in a fixed-dimensional vector 3 . But if sparsity is expected, could it be learned from data without supervision? 1

* Work done while at Microsoft Research Cambridge.
1 The code is available at https://github.com/VictorProkhorov/HSVAE.
A handful of studies in NLP have delved into building sparse representations of words, either during the learning phase or as a post-processing step on top of existing representations (e.g., word2vec embeddings) (Sun et al., 2016; Subramanian et al., 2018; Arora et al., 2018; Li and Hao, 2019). These methods have not been developed for sentence embeddings, with the exception of Trifonov et al. (2018), which makes a strong assumption by forcing the latent sentence representation to be a sparse categorical distribution.
In parallel, Variational Autoencoders (VAEs) (Kingma and Welling, 2014) have been effective in capturing the semantic closeness of sentences in the learned representation space (Bowman et al., 2016; Prokhorov et al., 2019; Balasubramanian et al., 2020). Furthermore, methods have been developed to encourage sparsity in VAEs via learning a deterministic selection variable (Yeung et al., 2017) or sparse priors (Barello et al., 2018; Mathieu et al., 2019; Tonolini et al., 2019). However, the success of these is yet to be examined in the text domain.

2 This, for example, may allow us to cluster sentence representations not only based on the similarity of their active features (as is the case for dense vectors) but also on their active/inactive dimensions.
3 On a more speculative note, sparse representations may be a more natural way of modelling sentences of a language in a fixed-dimensional vector. Sentences vary in length and in the amount of information they convey; it makes sense to reflect this property in the vector representation of the sentence.
To bridge this gap, we perform a sober evaluation of the existing state-of-the-art (SOTA) VAE-based sparsification model (Mathieu et al., 2019) against several VAE-based baselines on two experimental axes: text classification accuracy and the level of representation sparsity achieved. Additionally, we propose the Hierarchical Sparse Variational Autoencoder (HSVAE) to address the stability issue of the existing SOTA model and demonstrate its performance on both axes.
Our experimental findings demonstrate that: (I) neither the simpler baseline models nor the SOTA manage to impose a satisfactory level of sparsity on text, (II) as expected, sparsity level and task performance are negatively correlated, while trading off task performance for sparse codes helps with the analysis of the representations, (III) the presence/absence of a task-related signal in the sparsity codes affects task performance, (IV) the success of capturing the task-related signal in the sparsity codes depends on the strength of the signal present in a corpus and on the representation dimensionality, (V) the success of the SOTA in the image domain does not necessarily transfer to inducing sparse representations for text, while HSVAE addresses this shortcoming.

Background
VAE. Given an input x, VAEs, Figure 1 (left), are stochastic autoencoders that map x to a corresponding latent representation z using a probabilistic encoder q_φ(z|x) and a probabilistic decoder p_θ(x|z), implemented as neural networks. Optimisation of a VAE is done by maximising the ELBO:

\mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)    (1)

where the reconstruction term maximises the expectation of the data likelihood under the posterior distribution of z, and the Kullback-Leibler (KL) divergence acts as a regulariser and minimises the distance between the learned posterior and the prior of z.
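As an illustration, below is a minimal PyTorch-style sketch of this objective for a Gaussian posterior with a standard Normal prior; the `encoder` and `decoder.log_prob` interfaces are our own assumptions, not the paper's implementation.

```python
import torch

def vae_elbo(x, encoder, decoder):
    # Encoder predicts the parameters of q(z|x) = N(mu, diag(sigma^2)).
    mu, log_var = encoder(x)
    std = torch.exp(0.5 * log_var)
    # Reparameterised sample z ~ q(z|x).
    z = mu + std * torch.randn_like(std)
    # Reconstruction term, approximated with a single Monte Carlo sample.
    log_px_z = decoder.log_prob(x, z)
    # Analytic KL(q(z|x) || N(0, I)), summed over latent dimensions.
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1.0, dim=-1)
    return log_px_z - kl  # per-example ELBO; maximise (or minimise its negative)
```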

Spike-and-Slab
The Spike-and-Slab distribution models each dimension of z as a mixture of two components: a "slab" Gaussian and a "spike" component, which is a Gaussian with σ → 0 (i.e., effectively a point mass at zero):

p(z) = \prod_{d=1}^{D} \Big[ (1-\gamma)\,\mathcal{N}(z_d; 0, 1) + \gamma\,\mathcal{N}(z_d; 0, \sigma^2) \Big], \quad \sigma \to 0    (2)

where z_d denotes the d-th dimension of z and D is the total number of dimensions of z.
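For intuition, a small sketch of sampling from such a factorised Spike-and-Slab variable, treating the spike as an exact zero; `gamma`, `mu`, and `log_var` are assumed per-dimension tensors and are not names from the paper.

```python
import torch

def sample_spike_and_slab(gamma, mu, log_var):
    """Sample z_d from (1 - gamma_d) * N(mu_d, sigma_d^2) + gamma_d * delta(0)."""
    # Bernoulli "switch" per dimension: 1 -> spike (dimension off), 0 -> slab (on).
    spike = torch.bernoulli(gamma)
    slab = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    return (1.0 - spike) * slab  # spiked dimensions collapse to exactly zero
```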

Hierarchical Sparse VAE (HSVAE)
We propose the Hierarchical Sparse VAE (HSVAE), Figure 1 (right), to learn sparse latent codes automatically. We treat the mixture weights γ = (γ_1, ..., γ_D) as a random variable and assign a factorised Beta prior p(γ) = \prod_d Beta(γ_d; α, β) to it. The latent code z is then sampled from a factorised Spike-and-Slab distribution p(z|γ) conditioned on γ, and the observation x is generated by decoding the latent variable z ∼ p(z|γ) using a GRU (Cho et al., 2014) decoder. This yields the probabilistic generative model p_θ(x, z, γ) = p_θ(x|z) p(z|γ) p(γ). For posterior inference, the encoder distribution is defined as q_φ(z, γ|x) = q_φ(γ|x) q_φ(z|γ, x), where q_φ(γ|x) is a learnable factorised Beta distribution, and q_φ(z|γ, x) is a factorised Spike-and-Slab distribution with mixture weights γ and learnable "slab" components for each dimension. The distribution is computed by first extracting a feature vector from the sequence x using a GRU, then applying MLPs to the extracted feature (and to γ for q_φ(z|γ, x)) to produce the distributional parameters.

ELBO:
We derive the ELBO, ℒ(θ, φ; x):

\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(\gamma|x)\, q_\phi(z|\gamma, x)}\big[\log p_\theta(x|z)\big] - \lambda_1\, \mathbb{E}_{q_\phi(\gamma|x)}\Big[\mathrm{KL}\big(q_\phi(z|\gamma, x) \,\|\, p(z|\gamma)\big)\Big] - \lambda_2\, \mathrm{KL}\big(q_\phi(\gamma|x) \,\|\, p(\gamma)\big)    (3)

where λ_1 ∈ ℝ and λ_2 ∈ ℝ are the coefficients for the KL terms. In practice this ELBO is approximated with Monte Carlo (MC):

\mathcal{L}_{\mathrm{MC}}(\theta, \phi; x) = \frac{1}{N_\gamma} \sum_{i=1}^{N_\gamma} \Big[ \frac{1}{N_z} \sum_{j=1}^{N_z} \log p_\theta\big(x \mid z^{(i,j)}\big) - \lambda_1\, \mathrm{KL}\big(q_\phi(z|\gamma^{(i)}, x) \,\|\, p(z|\gamma^{(i)})\big) \Big] - \lambda_2\, \mathrm{KL}\big(q_\phi(\gamma|x) \,\|\, p(\gamma)\big)    (4)

with γ^{(i)} ∼ q_φ(γ|x) and z^{(i,j)} ∼ q_φ(z|γ^{(i)}, x), where N_z and N_γ are the numbers of samples taken from q_φ(z|γ, x) and q_φ(γ|x), respectively. In this work, we set both N_z and N_γ to 1. Similar to the vanilla VAE, the first term is the reconstruction, and the second and third KL terms control the distance between the posteriors and their corresponding priors. The parameters of the priors are fixed to constant values (which can also be thought of as hyperparameters) during training. See the Appendix for the ELBO derivation.
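A hedged sketch of the single-sample (N_γ = N_z = 1) MC estimate above; `encode_gamma`, `encode_z`, and `decoder` are hypothetical module names, and `encode_z` is assumed to return both a Spike-and-Slab sample and the corresponding KL term summed over dimensions.

```python
import torch
from torch.distributions import Beta, kl_divergence

def hsvae_elbo(x, encode_gamma, encode_z, decoder, alpha, beta, lam1, lam2):
    # q(gamma|x): factorised Beta posterior over the mixture weights.
    a_q, b_q = encode_gamma(x)
    q_gamma = Beta(a_q, b_q)
    gamma = q_gamma.rsample()                 # one sample of gamma (N_gamma = 1)
    # q(z|gamma, x): one Spike-and-Slab sample plus its KL to p(z|gamma).
    z, kl_z = encode_z(x, gamma)              # (N_z = 1)
    # Reconstruction term.
    log_px_z = decoder.log_prob(x, z)
    # KL between the Beta posterior and the fixed Beta(alpha, beta) prior.
    p_gamma = Beta(torch.full_like(a_q, alpha), torch.full_like(b_q, beta))
    kl_gamma = kl_divergence(q_gamma, p_gamma).sum(-1)
    return log_px_z - lam1 * kl_z - lam2 * kl_gamma
```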
Control of Sparsity. The random variable γ in our model can be viewed as a "probabilistic switch" that determines how likely it is for the d-th dimension of z to be turned off. Intuitively, since for both generation and inference the latent code z is sampled from a Spike-and-Slab distribution with mixture weights γ, γ_d → 1 means z_d is drawn from a delta mass centred at z_d = 0. As the switch follows a Beta distribution, γ_d ∼ Beta(γ_d; α, β), we can select the parameters α and β to control the concentration of the probability mass on the γ_d ∈ [0, 1] interval.
There are three typical configurations of the (α, β) pair: (1) α < β: the density is shifted towards γ_d = 0, hence the d-th unit is likely to be on and a dense representation is expected, (2) α = β: the density is centred at γ_d = 0.5, and (3) α > β: the density is shifted towards γ_d = 1, hence the unit is likely to be off, leading to sparsity. The magnitude of these parameters also plays a role as it controls the spread and uni/bi-modal structure of the density.
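As a small illustration of this effect (the values of α and β below are arbitrary and not those used in our experiments):

```python
from scipy.stats import beta

# E[gamma] under Beta(alpha, beta) is alpha / (alpha + beta); since gamma_d -> 1
# switches dimension d off, a larger mean implies a sparser code.
for a, b in [(2.0, 8.0), (5.0, 5.0), (8.0, 2.0)]:
    print(f"Beta({a}, {b}): E[gamma] = {beta.mean(a, b):.2f}, "
          f"P(gamma > 0.5) = {1.0 - beta.cdf(0.5, a, b):.2f}")
```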

Experiments
We conduct a set of experiments on three text classification corpora: Yelp (sentiment analysis, 5 classes) (Yang et al., 2017), and DBpedia and Yahoo (topic classification, 14 and 10 classes respectively) (Zhang et al., 2015). First, we compare the performance of the sparse latent representations with their dense counterparts on the text classification tasks (§4.2). Second, the stability of sparsification in HSVAE is compared with the state-of-the-art MAT-VAE (§4.3). Then, to better understand the performance of our model on the downstream task, we examine the sparsity patterns (§4.4).
Remark. An integral part of the experiments is the analysis of the learned representations. In this sense, tasks that rely on understanding of semantics (e.g., GLUE (Wang et al., 2018)) or syntax (e.g., (Marvin and Linzen, 2018)) would be non-trivial to analyse due to their inherent complexity. We consider classification tasks because the distribution of words alone could be a good indicator of class labels. Given the unsupervised nature of the models, we explore if this surface-level distribution of words could be captured by the sparsity patterns in the learned representation.

Corpora Preprocessing
We use Yelp as is, without any additional preprocessing. For DBpedia and Yahoo, the preprocessing is as follows: (1) removing all non-ASCII characters, quotation marks, and hyperlinks, (2) tokenising with spaCy, (3) lower-casing all tokens, and then (4) for each class randomly sampling 10,000 sentences for the training corpus and 1,000 sentences each for the test and validation sets. The vocabulary of both corpora is reduced to the 20,000 most frequent words.
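A sketch of this pipeline, assuming a blank spaCy English tokeniser; the regular expression and sampling details below are illustrative choices of our own.

```python
import random
import re
import spacy

nlp = spacy.blank("en")  # tokeniser only

def preprocess(text):
    text = text.encode("ascii", errors="ignore").decode()   # (1) drop non-ASCII characters
    text = re.sub(r"https?://\S+", " ", text)                # (1) drop hyperlinks
    text = text.replace('"', " ")                            # (1) drop quotation marks
    return [tok.text.lower() for tok in nlp(text)]           # (2) tokenise, (3) lower-case

def subsample(sentences_by_class, n_train=10_000, n_eval=1_000):
    # (4) per class: 10,000 training sentences, 1,000 each for validation and test.
    splits = {"train": [], "valid": [], "test": []}
    for sents in sentences_by_class.values():
        random.shuffle(sents)
        splits["train"] += sents[:n_train]
        splits["valid"] += sents[n_train:n_train + n_eval]
        splits["test"] += sents[n_train + n_eval:n_train + 2 * n_eval]
    return splits
```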

Baselines and Models
To ground the performance of HSVAE we use four baselines: 1) VAE, a version of the vanilla VAE used in Higgins et al. (2017); 2) the same VAE model but with the activations μ and σ of q_φ(z|x) regularised by either the ℓ1 (VAE-ℓ1) or ℓ2 (VAE-ℓ2) norm; 3) MAT-VAE, the VAE framework introduced by Mathieu et al. (2019); and 4) a simple classifier, which is simply a text encoder with a classifier on top of it. For all these models we use a GRU network (Cho et al., 2014) to encode and decode text sequences. We set the dimensionality of both the encoder and decoder GRUs to 512D and the dimensionality of the word embeddings to 256D. The decoder and the encoder share the word embeddings. To train the models we use the Adam optimiser (Kingma and Ba, 2014) with a learning rate of 0.0008.

Following Li et al. (2020b), we also replace the GRU network used in the VAE and HSVAE encoders with a pretrained BERT (Devlin et al., 2019), while keeping the GRU decoder. We refer to these models as B-VAE and B-HSVAE, respectively. We also compare the task performance of these VAE models with the plain pretrained base-BERT (https://huggingface.co/transformers/model_doc/bert.html). To train B-VAE and B-HSVAE, we use the Adam optimiser with a learning rate of 0.00008.

Dimensionality of z. We use two dimensionalities: 32D and 768D. Since HSVAE and MAT-VAE induce sparse latent representations, we want to make sure that they perform robustly regardless of the number of dimensions.

KL-Collapse.
None of the VAE models used is immune to KL collapse (Bowman et al., 2016), where the KL term becomes zero and the decoder ignores the information provided by the encoder through z. To address this issue, in all the models we place a scalar weight smaller than 1 on the KL terms of the VAE objective function (He et al., 2019).

Coupling Encoder with Decoder. To connect the encoder with the decoder we concatenate the latent variable z, sampled from the posterior distribution, to the word embeddings of the decoder at each time step (Prokhorov et al., 2019). For GRU encoders we take the last hidden state to parameterise the posterior distribution. For the BERT encoder, we take the average pooling of all token embeddings produced by the last layer of BERT.
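A sketch of this coupling for a GRU decoder; module and dimension names are ours, and the actual implementation may differ.

```python
import torch
import torch.nn as nn

class LatentConditionedGRUDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512, latent_dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # z is concatenated to the word embedding at every time step.
        self.gru = nn.GRU(emb_dim + latent_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens, z):
        emb = self.emb(tokens)                               # [B, T, emb_dim]
        z_rep = z.unsqueeze(1).expand(-1, emb.size(1), -1)   # repeat z along the time axis
        hidden, _ = self.gru(torch.cat([emb, z_rep], dim=-1))
        return self.out(hidden)                              # next-token logits
```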

Evaluation Metrics
Text Classification. To report the classification performance we use accuracy as a metric.
Sparsity. We measure Hoyer (Hurley and Rickard, 2009) on the representations of all data points in a corpus and report its average as our sparsity metric (Mathieu et al., 2019). Hoyer, in a nutshell, is the ratio of the ℓ2 to ℓ1 norm, normalised by the number of dimensions; higher values indicate more sparsity. More specifically, to evaluate the average Hoyer, which we refer to as Average Hoyer (AH) in the experiments, on a validation or test corpus we employ the following procedure. First, for each x in the corpus {x_1, ..., x_N} we obtain its corresponding z by sampling it from the probabilistic encoder of a VAE model, such that for each x we sample one z, e.g., x_1 → z_1. Then we normalise z̄ = z / σ(Z), where Z = {z_1, ..., z_N} and σ(·) is the standard deviation. Finally, for each z̄ we compute Hoyer as:

\mathrm{Hoyer}(\bar{z}) = \frac{\sqrt{D} - \|\bar{z}\|_1 / \|\bar{z}\|_2}{\sqrt{D} - 1}

where D is the dimensionality of z̄. To report Hoyer for the whole corpus we compute the Average Hoyer, AH = (1/N) Σ_n Hoyer(z̄_n), where N is the number of data points in the test or validation corpus.
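A numpy sketch of this procedure (Z is an N×D matrix with one sampled code per data point; we normalise each dimension by its standard deviation over the corpus, which is one possible reading of the normalisation step):

```python
import numpy as np

def average_hoyer(Z, eps=1e-12):
    """Average Hoyer sparsity of a matrix Z of sampled codes, shape [N, D]."""
    Z = Z / (Z.std(axis=0, keepdims=True) + eps)   # normalise by the std over the corpus
    D = Z.shape[1]
    l1 = np.abs(Z).sum(axis=1)
    l2 = np.sqrt((Z ** 2).sum(axis=1)) + eps
    hoyer = (np.sqrt(D) - l1 / l2) / (np.sqrt(D) - 1.0)  # 0 = fully dense, 1 = maximally sparse
    return hoyer.mean()
```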

Text Classification
Prior to using a VAE encoder in the classification experiments, we pretrain it as part of the full VAE model, with the corresponding VAE objective, on one of the target corpora: Yelp, Yahoo, or DBpedia. We compare the performance of the sparse latent representations with their dense counterparts on the three text classification tasks (Figure 2). The classifier comprises two dense layers of 32D each with the Leaky ReLU (Maas, 2013) activation function. To establish whether a performance gain or loss on the tasks is due to the sparsity inductive bias, for all the VAE models and BERT we freeze the parameters of the encoder and only train the classifier placed on top of it. For the simple classifier model, however, the text encoder is trained together with the classifier. When the classifier p(y|z) is trained with a probabilistic VAE encoder, we marginalise the latent variable(s). For HSVAE, for instance, this is done as

p(y|x) = \mathbb{E}_{q_\phi(\gamma|x)\, q_\phi(z|\gamma, x)}\big[\,p(y|z)\,\big].

We approximate the integral with MC by taking S = 5 samples from the probabilistic encoder, both to train and to test the classifier: for each x_n in a batch {x_1, ..., x_B}, the set of sampled tuples of z and γ is {(z_{n,1}, γ_{n,1}), ..., (z_{n,S}, γ_{n,S})}; in other words, for each z_{n,s} we sample only one γ_{n,s}.
For the other VAEs the procedure is similar, with the MC approximation p(y|x) ≈ (1/5) Σ_{s=1}^{5} p(y|z_s), z_s ∼ q_φ(z|x). For a systematic comparison of the various VAEs, we collate the classification performance of VAEs with comparable reconstruction loss, which indicates how informative the latent code is for the decoder during reconstruction; in other words, the reconstruction loss serves as an intrinsic metric. Thus, for example, in Figure 2a all the VAE models have a similar reconstruction loss on the Yelp corpus. The same applies to Figure 2b and Figure 2c.
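A sketch of this MC marginalisation with S = 5 samples; `encoder.sample_posterior` and `classifier` are hypothetical names rather than the paper's interfaces.

```python
import torch

def classify_with_mc(x, encoder, classifier, n_samples=5):
    """Approximate p(y|x) = E_{q(z|x)}[p(y|z)] with n_samples posterior samples."""
    probs = 0.0
    for _ in range(n_samples):
        z = encoder.sample_posterior(x)    # one z (for HSVAE, one gamma is drawn per z)
        probs = probs + torch.softmax(classifier(z), dim=-1)
    return probs / n_samples               # averaged class probabilities
```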
Comparing the accuracy of classifiers trained with the different latent representations, i.e. sparse and dense (Figure 2), shows that in general the performance of the sparse latent representations induced by HSVAE or MAT-VAE is on par with that of their dense counterparts inferred by the VAEs. However, HSVAE lags slightly behind on the Yelp corpus when the dimensionality of the latent representation is 32D (Figure 2a); we put forward a hypothesis that may explain this in Section 4.4. Also, when the dimensionality of the latent representation is 32D, the accuracy of MAT-VAE is slightly better than that of HSVAE, but this performance is reached at lower levels of sparsity. Additionally, we found that regularising the posterior parameters of the VAE model with either the ℓ1 or ℓ2 norm in some cases helps to increase the classification accuracy, but does not reach an AH higher than the vanilla VAE. Notably, the classification performance of all the VAE models becomes almost identical when the dimensionality of the latent space is increased from 32D to 768D, with HSVAE slightly outperforming all other VAEs on the DBpedia corpus (Figure 2b). We further elaborate on this in Section 4.4.
The use of BERT as an encoder, in our setting, only gives an improvement on the Yahoo corpus, with B-HSVAE performing on par with B-VAE, but it does not reach the classification accuracy of plain BERT. We hypothesise that to reach the full potential of a pretrained encoder in a VAE model, one needs to pair it with a powerful decoder such as GPT-2, as is the case in the Li et al. (2020b) VAE model. Further exploration of this was beyond our compute resources.
Finally, one can observe that the simple classifier model performs on par with (Figure 2a) or even worse than (Figure 2b) the VAE models. Given that the VAE encoders are not trained with a supervision signal while the encoder of the simple classifier is, we speculate that this can be explained by the discussion put forward in Valpola (2014): a classifier by nature tries to remove all information that is not relevant to the supervision signal, while an autoencoder tries to preserve as much information as possible in the latent code in order to reconstruct the original input reliably. Thus, if the distribution of class-related words in a text alone (see §4.4.1) is not indicative enough of a class, the classifier may perform poorly. In our case, we hypothesise that the VAE models capture some additional information, beyond the class distribution of words, that allows them to better discriminate the classes. For example, some class may on average have shorter sentences than the other classes; this may provide an additional bias that allows the VAE models to distinguish sentences of this class from those of the other classes, and hence to perform better than the simple classifier. We leave this investigation for future work.

Representation Sparsity
In Figure 7 we compare HSVAE with MAT-VAE. We report AH both on the mean of and on samples from the posterior distributions. As illustrated, MAT-VAE struggles to achieve a steady and consistent AH regardless of the configuration of its hyperparameters. HSVAE, however, stably controls the level of sparsity with the α and β parameters, a positive effect of its more flexible posterior distribution and the learnable distribution over γ.

Can Sparsity Patterns Encode Classes?
In order to identify pertinent features, unsupervised representation learning models are typically trained/fine-tuned on corpora that are closely related to the downstream task. Without a supervisory signal, the model can only rely on the distribution of words in a text to identify the features relevant to the task. Ideally, compared to their dense counterparts, an unsupervised sparsification model such as HSVAE could improve performance on downstream tasks if it captures the task-related features and discards the noisy ones. However, if the sparsification model fails to capture the task-related signal in its sparsity pattern, it can hurt performance on the downstream task, as task-related information may be removed. In what follows we investigate this direction by analysing the sparsity patterns and relating this analysis to the classification performance of the model (§4.2).

Analysis of γ.
We hypothesise that if γ captures the class of a sentence, then sentences that belong to the same class should have similar sparsity patterns in γ. To obtain a class-specific γ, first, for each sentence x we take the mean of the posterior distribution q_φ(γ|x) and denote it γ^(x). Then we binarise the mean, γ_b^(x) = Binarise(γ^(x)), where Binarise(·) maps each dimension to 0 if γ^(x)_d < 0.5 and to 1 otherwise. Finally, for each class we average its γ_b^(x) vectors to obtain a single vector that represents the class, γ^(c) = (1/N_c) Σ_x γ_b^(x), where N_c is the number of sentences in the class. The averaging removes the information that differentiates these sentences, while preserving the class information shared among them. A similar approach was also used in Mathieu et al. (2019). Figure 4 reports the magnitudes of the γ^(c) vectors as heat maps for the three corpora. One would expect the γ^(c) of different classes to differ. For 32D (Figure 4a) this is the case when HSVAE is trained on DBpedia and Yahoo but not on Yelp. Taking into account the unsupervised nature of these models, this difference echoes the distribution of words in the classes, which is more distinct in DBpedia and Yahoo than in Yelp (see §4.4.1). We also hypothesise that this observation can explain the inferior performance of the model on the Yelp corpus (Figure 2a).
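A sketch of this aggregation (`gamma_means` is an N×D matrix of posterior means of q_φ(γ|x) and `labels` an N-vector of class ids; the names are ours):

```python
import numpy as np

def class_sparsity_patterns(gamma_means, labels):
    """Binarise per-sentence means of q(gamma|x) and average them within each class."""
    binary = (gamma_means >= 0.5).astype(float)         # Binarise(.): 0 if gamma_d < 0.5, else 1
    return {c: binary[labels == c].mean(axis=0)         # one D-dimensional pattern per class
            for c in np.unique(labels)}
```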
In contrast, for γ in 768D (Figure 4b) one can observe that the different classes have different activation patterns even when HSVAE is trained on the Yelp corpus. 11 Also, the distributedness of the activation patterns becomes more apparent when HSVAE is trained on the Yahoo corpus. This observation is also related to the distribution of words in the text (further elaborated in §4.4.1).

11 In Figure 4b we only show 32 of the 768 dimensions. This is one of the subsets of the 768 dimensions where the distributedness is present; it is not unique, and the distributedness is also present in other dimensions of the 768D code.
Intuitively, to reconstruct a sentence a VAE model first captures the aspects of data that are the most conducive to reducing reconstruction error (Burgess et al., 2018). Therefore, given the limited dimensionality of the latent vector, the model will prioritise certain aspects of data during encoding. As such, if information such as the sentence class is not strongly present in the corpus, the model could ignore it during encoding. However, when the dimensionality of the latent space is increased, the model has more capacity to represent aspects of data that might otherwise be ignored at the smaller dimensionality. We speculate that this could explain the distributedness of γ on Yelp at 768D as opposed to 32D, which also translates into matching the task performance of its dense counterpart (Figure 2b).

Class Kullback-Leibler Divergence
The question that has not yet been addressed is why the HSVAE model is more successful at capturing the class distribution when trained on DBpedia than on Yelp. We previously hypothesised that the reason for this lies in the word distribution of the text. To empirically test our hypothesis, we calculate the add-1 smoothed probabilities of words in each class and measure the pairwise KL divergence between them. The magnitudes of the pairwise KL divergences are shown in Figure 5. As demonstrated, the magnitude of the KL divergence is the largest for DBpedia and the smallest for Yelp. This indicates that separating classes in Yelp relies on more subtle aspects of data, whereas surface-level cues are more present in DBpedia and allow for easier discrimination.
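A sketch of this measurement (`counts_by_class` maps each class to a word-count vector over a shared vocabulary; add-1 smoothing and natural-log KL are our illustrative choices):

```python
import numpy as np

def pairwise_class_kl(counts_by_class):
    """Add-1 smoothed unigram distribution per class, then pairwise KL(P_i || P_j)."""
    probs = {c: (counts + 1.0) / (counts + 1.0).sum()
             for c, counts in counts_by_class.items()}
    classes = sorted(probs)
    kl = np.zeros((len(classes), len(classes)))
    for i, ci in enumerate(classes):
        for j, cj in enumerate(classes):
            kl[i, j] = np.sum(probs[ci] * np.log(probs[ci] / probs[cj]))
    return classes, kl
```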

Related Work
Learning sparse representations of data dates back to Olshausen and Field (1996), which motivates encoding images in sparse linear codes for its biological plausibility and efficiency. It was later argued by Bengio (2009) that, compared to dimensionality reduction approaches, sparsity is a more efficient method for representation learning in vectors of fixed dimensionality.
Representation Sparsity. In NLP, learning sparse representations has been explored for various units of text, with most of the focus placed on sparse representations of words. As the earliest work in this direction, Murphy et al. (2012) looked into sparse representations for ease of analysis, performance, and cognitive plausibility. This idea was further developed by many other researchers (Sun et al., 2016; Subramanian et al., 2018; Arora et al., 2018; Li and Hao, 2019). Sparsification of larger units of text (i.e., sentences) has not received much attention, perhaps due to the inherent complexity of sentence/phrase representations: encoding and analysing syntactic and semantic information in a sentence embedding is a non-trivial task. To the best of our knowledge, the only model that sparsifies sentence embeddings was introduced by Trifonov et al. (2018). The authors introduce a Seq2Seq model (Sutskever et al., 2014) with a Sparsemax layer (Martins and Astudillo, 2016) between the encoder and the decoder, which induces sparse latent codes of text. This layer yields codes that can be easier to analyse than their dense counterparts, but it is limited to modelling a categorical distribution and thus restricts the type of sentence representations that can be learned.

VAE-based Representation Sparsity. VAE-based sentence representation learning has shown superior properties compared to its deterministic counterparts on tasks such as text generation (Bowman et al., 2016), Semantic Textual Similarity (Li et al., 2020a), and a wide range of other language tasks (Li et al., 2020b).
Objective. HSVAE is trained with a principled ELBO (eq. 3), while the other two models add additional regularisers to the ELBO of the VAE (eq. 1). For instance, MAT-VAE adds a maximum mean discrepancy (MMD) divergence between the aggregated posterior of z and the prior, MMD(q_φ(z), p(z)), and includes separate scalar weights on the KL and MMD terms; see the Appendix.
Model Sparsity. Concurrent to the widespread use of large models such as Transformers (Vaswani et al., 2017) in NLP, sparsification of these models is also becoming popular (Zhang et al., 2020; Zhao et al., 2019; Correia et al., 2019; Ye et al., 2019). The most common approach to sparsifying a Transformer is to reduce the number of connections between words/tokens in the self-attention kernel, e.g. Correia et al. (2019). However, these approaches still learn dense continuous representations of token/word/sentence embeddings.

Conclusion
We provided an objective analysis of several unsupervised sparsification frameworks based on VAEs, both in terms of the impact on downstream tasks and of the level of sparsity achieved. We also presented a novel VAE model, the Hierarchical Sparse Variational Autoencoder (HSVAE), which outperforms the existing SOTA model (Mathieu et al., 2019). Ideally, sparse representations should be capable of encoding the underlying characteristics of a corpus (e.g. class) in their activation patterns, as we showed to be the case for HSVAE. Moreover, using the text classification corpora as a testbed, we established how statistical properties of a corpus, such as the word distribution within a class, affect the ability of the learned sparse codes to represent task-related information.
Moving forward, the HSVAE model, along with the analysis provided in this paper, can serve as a good basis for the design of sparse models that induce continuous sparse vectors of text. For example, a potential extension of HSVAE could incorporate explicit linguistic biases into the learned representations via group sparsity. Furthermore, as discussed in Section 5, sparsity has found applications in Transformers, but mainly to reduce the number of connections between words/tokens. With the HSVAE framework one can also learn sparse continuous representations of token/word/sentence embeddings.

F Hardware
Please refer to Table 1 for the hardware that we use.