Challenging the Semi-Supervised VAE Framework for Text Classification

Semi-Supervised Variational Autoencoders (SSVAEs) are widely used models for data-efficient learning. In this paper, we question the adequacy of the standard design of sequence SSVAEs for the task of text classification, as we exhibit two sources of over-complexity for which we provide simplifications. These simplifications to SSVAEs preserve their theoretical soundness while providing a number of practical advantages in the semi-supervised setup, where the result of training is a text classifier. The simplifications are the removal of (i) the Kullback-Leibler divergence from the objective and (ii) the fully unobserved latent variable from the probabilistic model. These changes relieve users from choosing a prior for their latent variables, make the model smaller and faster, and allow for a better flow of information into the latent variables. We compare the simplified versions to standard SSVAEs on 4 text classification tasks. On top of the above-mentioned simplifications, experiments show a speed-up of up to 26% while keeping equivalent classification scores. The code to reproduce our experiments is public.


Introduction
Obtaining labeled data to train NLP systems has often proven to be a costly and time-consuming process, and this is still largely the case (Martínez Alonso et al., 2016; Seddah et al., 2020). Consequently, semi-supervised approaches are appealing to improve performance while alleviating dependence on annotations. To that end, Variational Autoencoders (VAEs) (Kingma and Welling, 2014) have been adapted to semi-supervised learning (Kingma et al., 2014), and subsequently applied to several NLP tasks (Chen et al., 2018a; Corro and Titov, 2019; Gururangan et al., 2020).
A notable difference between the generative modeling case from which VAEs originate and the semi-supervised case is that only the decoder (generator) of the VAE is kept after training in the first case, while in the second, it is the encoder (classifier) that we keep.1 This difference, as well as the autoregressive nature of text generators, has not been sufficiently taken into account in the adaptation of VAEs to semi-supervised text classification. In this work, we show that some components can be ablated from the long-used semi-supervised VAEs (SSVAEs) when aiming only for text classification. These ablations simplify SSVAEs and offer several practical advantages while preserving their performance and theoretical soundness.

1 https://github.com/ghazi-f/Challenging-SSVAEs
The usage of unlabeled data through SSVAEs is often described as a regularization on representations (Chen et al., 2018a; Wolf-Sonkin et al., 2018; Yacoby et al., 2020). More specifically, SSVAEs add to the supervised learning signal a conditional generation learning signal that is used to train on unlabeled samples. From this observation, we study two changes to the standard SSVAE framework. The first simplification we study is the removal of a term from the objective of SSVAEs: the Kullback-Leibler term. This encourages the flow of information into latent variables, frees users from choosing priors for their latent variables, and is harmless to the theoretical soundness of the semi-supervised framework. The second simplification we study accounts for the autoregressive nature of text generators. In the general case, input samples in SSVAEs are described with two latent variables: a partially-observed latent variable, which is also used to infer the label for the supervised learning task, and an unobserved latent variable, which describes the rest of the variability in the data. However, autoregressive text generators are powerful enough to converge without the need for latent variables. Therefore, removing the unobserved latent variable is the second change we study in SSVAEs. The above modifications can be found in some rare works throughout the literature, e.g. (Corro and Titov, 2019). We, however, aim to provide justification for these changes beyond the empirical gains that they exhibit for some tasks.
Our experiments on four text classification datasets show no harm to the empirical classification performance of SSVAE in applying the simplifications above. Additionally, we show that removing the unobserved latent variable leads to a significant speed-up.
To summarize our contribution, we justify two simplifications to the standard SSVAE framework, explain the practical advantage of applying these modifications, and provide empirical results showing that they speed up the training process while causing no harm to the classification performance.

Variational Autoencoders
Variational Autoencoders (Kingma and Welling, 2019) are a class of generative models that combine Variational Inference with Deep Learning modules to train a generative model. For a latent variable z and an observed variable x, the generative model p_θ consists of a prior p_θ(z) and a decoder p_θ(x|z). VAEs also include an approximate posterior (also called the encoder) q_φ(z|x). Both are used during training to maximize an objective called the Evidence Lower Bound (ELBo), a lower bound of the log-likelihood:

ELBo(x; z) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - KL\left(q_\phi(z|x) \,\|\, p_\theta(z)\right)    (1)

Throughout the paper, we will continue to use this ELBo(.; .) operator, with the observed variable(s) as a first argument and the latent variable(s) as a second argument. In the original VAE framework, after training, the encoder q_φ is discarded and only the generative model (the prior and the decoder) is kept.
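As an illustration, the single-sample ELBo estimate can be sketched in a few lines of numpy. This is a minimal sketch under the common assumptions of a Gaussian encoder and a standard-normal prior (so the KL term is analytic); the function names and arguments are our own, not from the paper's implementation.

```python
import numpy as np

def elbo(x, mu, log_var, log_px_given_z):
    """Single-sample ELBo estimate for one example.

    x: observed input (kept for signature clarity; the likelihood term
       log_px_given_z is assumed to be precomputed from x and a sample
       z ~ q(z|x))
    mu, log_var: parameters of the Gaussian encoder q(z|x)
    Assumes a standard-normal prior p(z), making the KL analytic.
    """
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return log_px_given_z - kl
```

When the encoder matches the prior exactly (mu = 0, log_var = 0), the KL vanishes and the ELBo reduces to the reconstruction term.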

Semi-Supervised VAEs
The idea of using the VAE encoder as a classifier for semi-supervised learning was first explored by Kingma et al. (2014). Besides the usual unobserved latent variable z, the semi-supervised VAE framework also uses a partially-observed latent variable y. The encoder q_φ(y|x) serves both as the inference module for the supervised task and as an approximate posterior (and encoder) for the y variable in the VAE framework.
Consider a set of labeled examples L = {(x_1, y_1), ..., (x_{|L|}, y_{|L|})} and a set of unlabeled examples U = {x_1, ..., x_{|U|}}. For the set L, q_φ(y|x) is trained i) with the usual supervised objective (typically, a cross-entropy objective for a classification task) and ii) with an ELBo that considers x and y to be observed, and z to be a latent variable. A weight α is used on the supervised objective to control its balance with ELBo. For the set U, q_φ(y|x) is only trained as part of the VAE model, with an ELBo where y is used, this time, as a latent variable like z. Formally, the training objective J_α of an SSVAE is as follows:

J_\alpha = \sum_{(x,y)\in L}\left[ELBo(x, y; z) + \alpha \log q_\phi(y|x)\right] + \sum_{x\in U} ELBo(x; y, z)    (2)
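The labeled/unlabeled decomposition of the SSVAE objective can be sketched as follows. This is a schematic sketch: the five arguments are placeholder callables standing in for the model-dependent quantities (the two ELBo terms and the classifier log-probability), not part of the paper's code.

```python
def ssvae_objective(labeled, unlabeled, elbo_xy, elbo_x, log_q_y, alpha):
    """Sketch of the SSVAE training objective J_alpha (to be maximized).

    labeled:   list of (x, y) pairs
    unlabeled: list of x
    elbo_xy(x, y): ELBo(x, y; z) -- x and y observed, z latent
    elbo_x(x):     ELBo(x; y, z) -- both y and z latent
    log_q_y(x, y): log q_phi(y|x), the classifier's log-probability
    alpha: weight balancing the supervised term against ELBo
    """
    j = sum(elbo_xy(x, y) + alpha * log_q_y(x, y) for x, y in labeled)
    j += sum(elbo_x(x) for x in unlabeled)
    return j
```

The supervised term only appears in the labeled sum; on unlabeled data the classifier is trained solely through its role as approximate posterior over y.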

Simplifying SSVAEs for Text Classification
The simplifications we propose stem from the analysis of an alternative form under which ELBo can be written (Eq. 2.8 in Kingma and Welling, 2019). Although it is valid for any arguments of ELBo(.; .), we display it here for an observed variable x and the couple of latent variables (y, z):

ELBo(x; y, z) = \log p_\theta(x) - KL\left(q_\phi(y, z|x) \,\|\, p_\theta(y, z|x)\right)    (3)

For the case of SSVAEs, this form provides a clear reading of the additional effect of ELBo on the learning process: i) maximizing the log-likelihood of the generative model p_θ(x), and ii) bringing the parameters of the inference model q_φ(y, z|x) closer to the posterior of the generative model p_θ(y, z|x).
Since p θ (y, z|x) is the distribution of the latent variables expected by the generative model p θ for it to be able to generate x, we can conclude that ELBo trains both latent variables for conditional generation on the unsupervised dataset U .

Dropping the Unobserved Latent Variable
Building on observations from equation 3, we question the usefulness of training both latent variables for conditional generation when semi-supervised learning only aims for an improvement on the inference of the partially-observed latent variable y.
For the case of language generation, the sequence of discrete symbols in each sample is often modeled by an autoregressive distribution p_\theta(x|y, z) = \prod_i p_\theta(x_i|y, z, x_{<i}), where x_i is the i-th symbol in the sequence and x_{<i} are the symbols preceding x_i. Such a distribution is able to generate realistic samples when trained on a target text corpus, so much so that text VAEs are plagued with a problem known as posterior collapse (Bowman et al., 2016), where the latent variable is ignored by the generative model. We therefore propose to keep only y and to drop z from the model, avoiding its presence in the Kullback-Leibler divergence in Equation 3 and saving some parameters.2
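The autoregressive factorization above can be sketched directly: the sequence log-likelihood is the sum of per-token conditional log-probabilities. In this sketch, the callable `next_token_logits` is a placeholder for the decoder network (the conditioning on y, and on z when present, is folded into it).

```python
import numpy as np

def autoregressive_log_likelihood(tokens, next_token_logits):
    """log p(x|y,z) = sum_i log p(x_i | y, z, x_{<i}) for one sequence.

    tokens: list of token ids x_1..x_T
    next_token_logits(prefix): unnormalized scores over the vocabulary
        given the prefix x_{<i} -- a stand-in for the LSTM decoder.
    """
    total = 0.0
    for i, tok in enumerate(tokens):
        logits = next_token_logits(tokens[:i])
        log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
        total += log_probs[tok]
    return total
```

Because each step conditions on the full prefix x_{<i}, a sufficiently strong decoder can model the data well with little or no help from latent variables, which is the root of posterior collapse.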

Dropping the Kullback-Leibler Term
Previous work on VAE-based language models showed that the KL divergence in Eq. 1 sometimes discourages the model from using latent variables and makes them useless in practice (Bowman et al., 2016;Zhao et al., 2017;Chen et al., 2018b).
An interesting result from Zhao et al. (2017) is that ELBo without the KL divergence (KL-free) is still a theoretically sound objective for generative modeling with VAEs. The difference between the generative model resulting from a regular ELBo and from a KL-free ELBo lies in the prior of the model. A KL-free ELBo results in a generative model that uses as a prior the aggregate posterior q_\phi(z) = \int_x q_\phi(z|x)\, p_{data}(x)\, dx. This prior is intractable, which makes the resulting model impractical for generation, but causes no problem for semi-supervised VAEs. We therefore propose, as a second change to the standard SSVAE framework, the removal of the KL divergence in Eq. 1.
Note that in this case, the network formulates its own prior instead of requiring the user to choose it. That is a significant advantage since the choice of a good prior is difficult: it must model adequately the default behavior of the latent variables, and requires a closed form for the KL-divergence in Eq. 1 to stabilize training.

Resulting Objective
Applying both of the previous simplifications to the semi-supervised objective in Eq. 2 leads to the following objective:

J_\alpha = \sum_{(x,y)\in L}\left[\log p_\theta(x|y) + \alpha \log q_\phi(y|x)\right] + \sum_{x\in U} \mathbb{E}_{q_\phi(y|x)}\left[\log p_\theta(x|y)\right]    (4)

As can be seen, the first ELBo in Eq. 2 turns into a supervised conditional generation objective, while the second ELBo turns into a reconstruction term that relies only on y. Nevertheless, we stress that the second term is still an ELBo, and the whole objective is still a VAE-based semi-supervised learning objective. It should also be noted that, without z, the latent variables cannot provide the decoder with the full information about a sentence and, therefore, cannot reach a state where each sample is perfectly reconstructed. To avoid confusion, instead of reconstruction from y, the role of this term is better read in our case as raising the probability of the sample at hand under the associated label y.

2 In this case, one may be tempted to drop the VAE framework entirely and resort to other learning algorithms such as EM or direct likelihood maximization. Although possible in theory, this would disconnect q_φ from the generator's training, and thus discard the benefit from using unlabeled data.
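The simplified objective (no z, no KL term) can be sketched as follows. This is a schematic numpy sketch with placeholder callables of our own naming; for a discrete label set, the expectation over q_φ(y|x) on unlabeled data can be computed exactly by enumeration, whereas the paper uses a Gumbel-Softmax sample.

```python
import numpy as np

def simplified_objective(labeled, unlabeled, log_p_x_given_y, q_y, alpha):
    """Sketch of the simplified SSVAE objective: no z, no KL divergence.

    labeled:   list of (x, y) pairs
    unlabeled: list of x
    log_p_x_given_y(x, y): decoder conditional log-likelihood
    q_y(x): classifier distribution over labels, as a dict {y: prob}
    alpha: weight on the supervised term
    """
    # Labeled data: supervised conditional generation + classification
    j = sum(log_p_x_given_y(x, y) + alpha * np.log(q_y(x)[y])
            for x, y in labeled)
    # Unlabeled data: expected conditional log-likelihood under q(y|x)
    for x in unlabeled:
        j += sum(p * log_p_x_given_y(x, y) for y, p in q_y(x).items())
    return j
```

Compared to the full SSVAE objective, there is no prior, no KL term, and no unobserved latent variable left to parameterize.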

Experiments
In this section, we display comparisons between instances of standard SSVAEs and the same SSVAEs after applying the changes we propose.

Setup
Datasets We evaluate on four text classification datasets: AGNEWS, DBPedia, IMDB, and Yelp. For each dataset, we use the original training set as unlabeled data. We also use 4K labeled samples as a training set and 1K as a development set. We use the original test sets from each dataset. All the samples are tokenized using a simple whitespace tokenizer.

Table 2: Accuracies on AGNEWS, DBPedia, IMDB, and Yelp. The values are averages over 5 runs, with standard deviations between parentheses. The best score for each dataset and each amount of labeled data is given in bold.
Network Architecture The size of z is set to 32. For experiments without z, we simply drop all the components associated with it from the network.
The encoder consists of a pre-trained 300-dimensional fastText (Bojanowski et al., 2017) embedding layer and two Bidirectional LSTM networks with hidden size 100, one for each of the latent variables y and z. The logits of y are then obtained by passing the last state of its Bidirectional LSTM through a linear layer. Similarly, the last state of the Bidirectional LSTM for z is passed through a linear layer to obtain its mean parameter, and through a linear layer with a softplus activation to obtain its standard deviation parameter.
As for the decoding step, to allow backpropagation, z is sampled using the reparameterization trick (Kingma and Welling, 2014), and y is sampled using the Gumbel-Softmax trick (Jang et al., 2017). Xu et al. (2017) have shown that latent variables are best exploited in SSVAEs when concatenated with the previous word at each generation step to obtain the next word. We design our decoder accordingly and use a 1-layer LSTM of size 200. The only hyper-parameter we tune on the development set is α, the coefficient weighting the supervised learning objective in Eq. 2, which is selected from the set {10^0, 10^{-1}, 10^{-2}, 10^{-3}}. Further implementation details are provided in Appendix A.
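The two sampling tricks mentioned above can be sketched in numpy. This is a minimal sketch of the forward passes only (gradients would require an autodiff framework such as PyTorch); function names and the fixed seed are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    so the sample is a differentiable function of mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def gumbel_softmax(logits, tau=1.0):
    """Gumbel-Softmax trick: a differentiable relaxation of sampling a
    one-hot label y from the categorical distribution softmax(logits)."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    scores = (logits + g) / tau
    e = np.exp(scores - scores.max())  # stable softmax
    return e / e.sum()
```

Lowering the temperature tau pushes the Gumbel-Softmax output closer to a one-hot vector, at the cost of higher-variance gradients.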

Results
Classification performance In Table 2, we compare the performance of a standard SSVAE to an SSVAE where we remove the KL divergence (SSVAE-{KL}), another where z is removed (SSVAE-{z}), and a third version where both the KL divergence and z are removed (SSVAE-{KL, z}). We measure performance on all datasets using accuracy. As a baseline, we also include the results of an objective that does not use unlabeled data. The architecture we use for this objective is simply the LSTM encoder that we use to obtain y for the SSVAE objectives. This baseline is referred to as Supervised.
The aim of our experiment is to check whether the proposed simplifications harm the performance of SSVAEs. In Table 2, we see that applying both changes compares favorably to the standard SSVAE 2 times out of 4. The removal of z yields the same comparison, while removing the KL term yields an improvement 3 times out of 4. For more extensive testing, we ran experiments for varying amounts of labeled data (from 1% to 100%; cf. Appendix C), and only found 4 statistically significant differences between SSVAE and its variants: 3 in favor of one of our simplified SSVAEs, and 1 in favor of the standard SSVAE.
We performed additional experiments in an out-of-domain setting (cf. Appendix B) using our sentiment analysis datasets, and also observed improvements with our simplifications.
Speeding Up the Learning Process By removing the KL divergence and the components associated with z, an improvement in the speed of the learning process is to be expected. This improvement is highly dependent on the model and on the implementation at hand. As an example, we measure the average speed of an optimization iteration for each dataset and each version of SSVAE. In Table 3, the speed of each objective is displayed relative to the speed of the standard SSVAE. The calculations associated with the KL divergence do not seem to slow down the iterations. However, removing z and its associated components consistently cuts out a considerable proportion of the duration of optimization steps. This proportion ranges from 14% (DBPedia) to 26% (AGNEWS).

Related Works
While our work is a focused contribution dedicated to the theoretical soundness and the practical advantages of two simplifications to the SSVAE framework for text classification, it could be extended to other tasks involving text generation as the unsupervised VAE objective. For instance, the work of Corro and Titov (2019) shows that semi-supervised dependency parsing scores higher with both of the changes we study.

Conclusion
Starting from the observation that SSVAEs can be viewed as the combination of a supervised learning signal with an unsupervised conditional generation learning signal, we show that this framework needs to include neither a KL divergence nor an unobserved latent variable (z) when dealing with text classification. We subsequently perform experimental comparisons between standard SSVAEs and simplified SSVAEs, which indicate that they are globally equivalent in performance.
Our changes provide a number of practical advantages. First, removing the KL divergence frees practitioners from choosing priors for the variables they use and allows information to flow freely into these variables. Second, removing the latent variable z from the computational graph speeds up computation and shrinks the size of the network. Despite their popularity, VAEs are often tedious to train for NLP tasks. In that regard, our simplifications should facilitate their usage in future works.

A Implementation Details

Probabilistic Graphical Model For models that use both z and y, we consider the latent variables to be conditionally independent in the inference model (i.e. q_φ(y, z|x) = q_φ(y|x)q_φ(z|x)) and independent in the generative model (i.e. p_θ(y, z) = p(y)p(z)).
Training Procedure We use the STL estimator (Roeder et al., 2017), a low-variance unbiased gradient estimator for ELBo. The network is optimized using ADAM (Kingma and Ba, 2015), with a learning rate of 4e-3 and a dropout rate of 0.5. If the accuracy on the validation set doesn't increase for 4 epochs, the learning rate is divided by 4. If it doesn't increase for 8 epochs, training is stopped. For objectives that include a KL divergence, we scale it with a coefficient that is zero for the first 3K steps and is then linearly increased to 1 over the following 3K steps, to avoid posterior collapse (Li et al., 2020).
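The KL annealing schedule described above can be written as a small function of the training step; the function and parameter names are ours.

```python
def kl_weight(step, warmup=3000, ramp=3000):
    """KL annealing coefficient: 0 for the first `warmup` steps, then
    increased linearly to 1 over the next `ramp` steps, then held at 1."""
    if step < warmup:
        return 0.0
    return min(1.0, (step - warmup) / ramp)
```

The annealed coefficient multiplies the KL term of the ELBo, letting the model first learn to use its latent variables before the KL penalty is fully applied.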

B Out-of-domain experiments
The sentiment analysis tasks we use for these experiments come from different domains (restaurant reviews for Yelp, and movie reviews for IMDB). Using models trained on each domain (with 100% of the data), we measure performance on the other domain to see whether the changes we study have an effect on out-of-domain generalization. In Table 4, we compare the out-of-domain performance of each of the objectives to that of the baseline that doesn't use unlabeled data (Supervised).
The table shows no statistically significant gains from using unlabeled Yelp training data for inference on IMDB. This is to be expected, as reviews from Yelp are drastically shorter than those from IMDB (cf. Table 1). However, for out-of-domain inference in the opposite direction, all the semi-supervised objectives except the standard SSVAE show statistically significant gains. Removing the KL divergence to accumulate more information in y, and removing z to have conditional generation rely exclusively on y, seem to be effective in helping generalization beyond the original domain of the task.

C Results Over Varying Amounts of Data
We display results with varying amounts of data in Table 5.