Sentence Bottleneck Autoencoders from Transformer Language Models

Representation learning for text via pretraining a language model on a large corpus has become a standard starting point for building NLP systems. This approach stands in contrast to autoencoders, also trained on raw text, but with the objective of learning to encode each input as a vector that allows full reconstruction. Autoencoders are attractive because of their latent space structure and generative properties. We therefore explore the construction of a sentence-level autoencoder from a pretrained, frozen transformer language model. We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder. We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer (an example of controlled generation), and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.


Introduction
Recent research has focused on devising new unsupervised pretraining methods from unlabeled data that involve some form of language modeling, primarily autoregressive (Peters et al., 2018; Radford et al., 2019), masked (Devlin et al., 2019; Liu et al., 2019; Conneau et al., 2020), and generalized (Radford et al., 2019; Brown et al., 2020; Song et al., 2019), with much success on downstream tasks. Under the hood, most of these methods use transformers (Vaswani et al., 2017) for encoding text sequences, which allows them to learn powerful contextual word representations that have been used widely for building models in NLP. However, this does not hold for sentence representations derived from pretrained transformer language models based on a special token or basic pooling operations.1 To this end, representation learning methods have been designed to better capture semantic information from pretrained transformer language models, e.g., using Siamese networks trained with a triplet loss (Reimers and Gurevych, 2019) or transforming the desired sentence distribution to a Gaussian distribution through normalizing flows (Li et al., 2020).

1 Our code is available at: https://github.com/ivanmontero/autobot
Existing sentence representations directly derived from pretrained language models or learned by specialized methods cannot guarantee perfect reconstruction of the input, a property that can enhance the structure of their semantic space and enable their use for controlled generation tasks. For the latter, a few recent studies have looked into ways to steer generation of pretrained language models towards a particular style (Dathathri et al., 2020; Krause et al., 2021), although they require following the gradient during the sampling process and rely on style text classifiers, which might not always be available. The latent space of a text autoencoder allows one to perform controlled text generation by directly manipulating sentence representations using basic numerical operations (Shen et al., 2020a). Yet, how to convert pretrained transformer language models to autoencoders with such properties remains unexplored.
To fill this gap, we introduce AUTOBOT, a new autoencoder model for learning sentence "bottleneck" (i.e., fixed-size) representations from pretrained transformers that are useful for similarity, generation, and classification, displayed in Figure 1. Our model has two unique components: (i) a transformation that uses dot product attention to dynamically pool semantic information from the pretrained model's hidden states into a sentence bottleneck representation, and (ii) a shallow transformer decoder that is modified to operate based on the bottleneck representation. Instead of training our autoencoder from scratch, we directly finetune it with an input reconstruction objective on the unlabeled data on which the original pretrained transformer was trained. We keep the underlying pretrained transformer encoder fixed, which makes training more efficient than training from scratch and proves beneficial even when compared to pretrained transformers trained for an equal number of steps.
Our evaluation on representative sentence similarity, classification, and generation tasks demonstrates that the resulting sentence representations are compact, capture semantic similarity at the sentence level better than strong sentence representation methods (Reimers and Gurevych, 2019), and can be used for controlled generation tasks. Lastly, our model performs almost on par with the large RoBERTa model (Liu et al., 2019) even though it only introduces 1.6% additional parameters relative to the base RoBERTa model.

Model: AUTOBOT
Taking inspiration from recent research on text autoencoders (Bowman et al., 2016b; Shen et al., 2020b; Mai et al., 2020), we extend standard autoregressive text autoencoders, which have been predominantly based on recurrent networks, to a transformer-based architecture and integrate them with pretrained language models; here we focus on RoBERTa (Liu et al., 2019).
Autoencoders generally follow the encoder-decoder model structure to reconstruct their input, with the constraint that the encoder produces a single, fixed-length hidden representation enc(x) = z:

$$\hat{x} = \mathrm{dec}(\mathrm{enc}(x)) = \mathrm{dec}(z). \tag{1}$$

Here, we focus on denoising autoencoders that aim to reconstruct the input from a perturbed version of it (Vincent et al., 2010; Shen et al., 2020b), which is compatible with many of the pretrained language models that are based on masked language modeling.
In our experiments, we use the same masking procedure as Devlin et al. (2019) to perturb the input.
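For concreteness, below is a minimal sketch of this perturbation following the 80/10/10 corruption split of Devlin et al. (2019); the function name and interface are illustrative rather than our exact implementation.

```python
import random

def perturb(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """BERT-style corruption: select each token with probability 0.15;
    replace 80% of selected tokens with [MASK], 10% with a random token,
    and leave the remaining 10% unchanged."""
    corrupted = list(token_ids)
    for i in range(len(corrupted)):
        if random.random() < mask_prob:
            r = random.random()
            if r < 0.8:
                corrupted[i] = mask_id                       # [MASK]
            elif r < 0.9:
                corrupted[i] = random.randrange(vocab_size)  # random token
            # else: keep the original token
    return corrupted
```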

Encoder
Standard approaches use encoders that reduce the input to a single representation z. To use a pretrained transformer for this purpose, we need to reduce its output hidden representations H after processing the input to a single vector. Since using the special token representation or basic pooling methods has been shown to be sub-optimal in prior work (Reimers and Gurevych, 2019), here we opt to keep the original encoder fixed and train a transformation β that will learn to compress H into a single representation z = β(H; θ), with θ being an additional set of parameters to be learned during finetuning. We choose β to be a multi-head attention mechanism that takes as input the keys K and values V corresponding to the final representations H from the pretrained model and a query vector q corresponding to a context vector u that we choose to be the CLS vector from the pretrained model:

$$z = \beta(H; \theta) = \mathrm{MultiHeadAttn}(q = uW^Q,\ K = HW^K,\ V = HW^V),$$

where the parameters to be learned, θ = {W^Q, W^K, W^V}, are the weights used to transform the query, keys, and values, and amount to $3d^2$ per head, with d being the dimensionality of each head (d = 64 in our experiments).
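The following is a minimal PyTorch sketch of this bottleneck, reduced to a single attention head for brevity (the actual model uses multi-head attention with d = 64 per head); the class name `BottleneckPooler` and the assumption that `H[0]` holds the CLS vector are illustrative.

```python
import torch
import torch.nn as nn

class BottleneckPooler(nn.Module):
    """Compress the T x d encoder states H into a single vector z by
    attending over all tokens with the CLS representation u as query."""
    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)  # W^Q
        self.k_proj = nn.Linear(d_model, d_model)  # W^K
        self.v_proj = nn.Linear(d_model, d_model)  # W^V
        self.scale = d_model ** -0.5

    def forward(self, H):                      # H: (T, d); H[0] is CLS
        q = self.q_proj(H[0])                  # query from context vector u
        K, V = self.k_proj(H), self.v_proj(H)  # keys/values from H
        attn = torch.softmax(K @ q * self.scale, dim=0)  # weights over tokens
        return attn @ V                        # z: (d,)
```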

Decoder
The cross-attention layer in the transformer decoder architecture by Vaswani et al. (2017) expects hidden representations for every input token from the encoder in order for each output candidate to attend to each input token. In the situation where only a single representation z comes from the encoder, we have

$$\mathrm{Attn}(Q, zW^K, zW^V) = \mathrm{softmax}\!\left(\frac{Q (zW^K)^\top}{\sqrt{d}}\right) zW^V = \mathbf{1}\, zW^V,$$

since the softmax over a single key evaluates to one for every query. Note that the queries Q, which come from the previous masked self-attention layer, are not taken into account, and each step in the decoder will receive the exact same zW^V as a result. In order to mitigate this, we propose a gating method inspired by Hochreiter and Schmidhuber (1997). Concretely, let Q_t be the t-th query representation. Then, the t-th output o_t of the cross-attention layer is computed as

$$o_t = \sigma(Q_t W_G + b_G) \odot zW^V,$$

where σ(·) is the sigmoid activation function and W_G and b_G are the parameters of the transformation for the gate. One can view the role of the gate as determining the amount of per-element information from the linear transformation of the latent representation to keep for the current layer and timestep. Preliminary experiments found this method beneficial for generation.
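A sketch of this gated cross-attention in PyTorch is given below; the module name and the exact gate parameterization follow our reading of the equation above and should be taken as illustrative.

```python
import torch
import torch.nn as nn

class GatedBottleneckCrossAttention(nn.Module):
    """Cross-attention replacement for a single-vector encoder output:
    every timestep would otherwise receive the same z W^V, so a
    query-dependent sigmoid gate modulates it per element and timestep."""
    def __init__(self, d_model):
        super().__init__()
        self.v_proj = nn.Linear(d_model, d_model)  # z -> z W^V
        self.gate = nn.Linear(d_model, d_model)    # Q_t -> gate logits

    def forward(self, Q, z):         # Q: (T, d) queries, z: (d,) bottleneck
        v = self.v_proj(z)           # shared value, identical at every step
        g = torch.sigmoid(self.gate(Q))   # per-timestep, per-element gate
        return g * v                 # o_t = sigma(Q_t W_G + b_G) * (z W^V)
```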
Training considerations To avoid training our model from scratch, we finetune it for 100K optimization steps on a pretraining dataset, using the base RoBERTa model (Liu et al., 2019) on the encoder side and a single-layer decoder for efficiency purposes (Kasai et al., 2021). The model is trained with an input reconstruction loss by minimizing the negative log-likelihood computed over the reconstructed inputs. Note that only the parameters of the sentence bottleneck and the decoder are learned; the encoder parameters are kept fixed.
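A sketch of a single optimization step is shown below; `encoder`, `bottleneck`, and `decoder` are hypothetical modules standing in for the frozen RoBERTa encoder, the attention bottleneck, and the single-layer decoder, respectively.

```python
import torch
import torch.nn.functional as F

def reconstruction_step(encoder, bottleneck, decoder, optimizer,
                        x, x_perturbed):
    """One step of the reconstruction objective: only the bottleneck and
    decoder parameters receive gradients; the encoder stays frozen."""
    with torch.no_grad():                 # frozen pretrained encoder
        H = encoder(x_perturbed)
    z = bottleneck(H)                     # sentence bottleneck representation
    logits = decoder(z, x)                # teacher-forced reconstruction
    loss = F.cross_entropy(               # negative log-likelihood over tokens
        logits.view(-1, logits.size(-1)), x.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```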

Experiments
To assess the quality of the sentence representations learned by our model we evaluate on sentence similarity (Section 3.2), classification (Section 3.3), and generation tasks (Section 3.4).

Settings
Datasets Since the dataset used to pretrain RoBERTa is not publicly available, we use for pretraining the exact same dataset as BERT (Devlin et al., 2019), which is composed of BooksCorpus (Zhu et al., 2015) and English Wikipedia. For sentence similarity, we use the Natural Language Inference (NLI) dataset (Bowman et al., 2015) for finetuning and evaluate on the Semantic Textual Similarity (STS) dataset (Cer et al., 2017), following Conneau et al. (2017). For classification, we mainly use single-sentence datasets from the GLUE benchmark (Wang et al., 2018), namely the Stanford Sentiment Treebank (SST) and the Corpus of Linguistic Acceptability (CoLA), but we also report the average performance on the remaining datasets. For generation, we use the Yelp reviews dataset (Shen et al., 2017).
Baselines For sentence similarity, we compare to SBERT, a competitive method for deriving informative sentence representations from pretrained language models (Reimers and Gurevych, 2019). They obtain sentence representations by applying simple pooling methods such as mean and max over BERT representations (instead of using the CLS token representation) and then finetuning the whole pretrained model using Siamese networks on a combination of natural language inference data. To compare with them on sentence similarity, we incorporate our model within their framework and follow their settings and training/evaluation protocol (details in Appendix A.2).
For sentence classification, we compare our model to the RoBERTa-base and RoBERTa-large models (Liu et al., 2019). Note that BART (Lewis et al., 2019) achieves similar results to RoBERTa, so a similar comparison can be made.
For sentence generation tasks, we compare to a strong and efficient style transfer method by Shen et al. (2020b), a recurrent network-based denoising text autoencoder trained on in-domain data. Style transfer is achieved through vector arithmetic: a "sentiment vector" v is computed by taking the vector difference between the representations of 100 negative and 100 positive sentences; evaluation then consists of taking an input sentence, encoding it, adding a multiple of the sentiment vector to the sentence representation, and decoding the resulting representation, as sketched below. In addition to the denoising autoencoder (DAE) of Shen et al. (2020b), we include more sophisticated but more computationally expensive methods for style transfer, namely the fast gradient iterative modification (FGIM) of Wang et al. (2019) and Emb2Emb of Mai et al. (2020), for reference.
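The vector arithmetic itself is simple; here is a sketch assuming hypothetical `encode` and `decode` functions that map between sentences and bottleneck representations, and taking the difference of mean representations.

```python
import torch

def sentiment_vector(encode, negative_sents, positive_sents):
    """Difference between the mean representations of negative and
    positive sentences (100 of each, following Shen et al. (2020b))."""
    neg = torch.stack([encode(s) for s in negative_sents]).mean(dim=0)
    pos = torch.stack([encode(s) for s in positive_sents]).mean(dim=0)
    return pos - neg

def transfer(encode, decode, sentence, v, scale=1.0):
    """Encode a sentence, add a multiple of the sentiment vector,
    and decode the shifted representation."""
    z = encode(sentence)
    return decode(z + scale * v)
```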

Sentence Similarity
The results on the sentence similarity task are displayed in Table 1. Due to resource constraints and unreported results in prior work, we report our model only with RoBERTa-base. We observe that AUTOBOT applied to RoBERTa-base significantly outperforms other supervised base transformer methods. Additionally, AUTOBOT approaches the performance of large transformers while having a minimal parameter overhead of 1.6%.
We also find that AUTOBOT without any supervision (AUTOBOT-base unsup.) outperforms all of the unsupervised methods, and most notably improves upon average BERT embeddings by 26.1%. This demonstrates that our approach is effective in both supervised and unsupervised settings.
We find in Table 2 that using the proposed sentence bottleneck based on a learned context provides noticeable gains over the simpler pooling methods from prior work. We suspect this is due to the additional flexibility of our bottleneck, which acts as "weighted pooling" by attending over all tokens to compute the final representation, as opposed to an equal contribution from all tokens regardless of the input.

Sentence Classification
The results on single-sentence classification tasks and other tasks from the GLUE benchmark are displayed in Table 3. We find that AUTOBOT provides a noticeable performance increase on single-sentence tasks, specifically on the CoLA dataset, when using both the RoBERTa-base and RoBERTa-large models. Additionally, we find that AUTOBOT, when fed both sentences concatenated for dual-sentence GLUE tasks, maintains the original performance of the underlying pretrained encoder. Hence, our model improves the quality of the sentence representations from pretrained transformer models without hurting their performance.

Table 3: Single-sentence GLUE classification dev. results. Median accuracy is reported over three random seeds. Our model improves performance on single-sentence classification tasks over both base and large RoBERTa models while maintaining their performance on the remaining multi-sentence tasks.

Sentence Generation
For sentence generation, we focus on the sentiment transfer task proposed by Shen et al. (2020b), both with and without further training on in-domain data from Yelp. When finetuning, we perform an additional 10K optimization steps using the Yelp dataset. Note that all the baselines require training on in-domain data, while this is optional for our model. In Figure 2, we find that the AUTOBOT model not exposed to the Yelp dataset during finetuning performs on par with the DAE that was trained specifically on Yelp. Additionally, AUTOBOT outperforms the DAE in the above-40-percent accuracy range when finetuned on in-domain data. We include AUTOBOT results with partial finetuning of the encoder in the appendix, which we find considerably improves the self-BLEU metric.
Related Work

The encoder-decoder structure for obtaining representations has been used in pretraining (Lewis et al., 2019), sentence infilling (Huang et al., 2020), and multilingual (Artetxe and Schwenk, 2019) scenarios. In particular, Lewis et al. (2019) treat denoising as a translation task to perform pretraining from scratch, but their approach does not induce a sentence representation space with generative properties. In contrast, our method makes use of a frozen pretrained transformer to learn a shallow sentence bottleneck autoencoder on top.

Conclusion
We proposed an approach that converts a pretrained transformer language model into a sentence-level autoencoder that is able to reconstruct its pretraining data. The resulting model improves the performance of the pretrained model on sentence-level tasks while maintaining its performance on multi-sentence tasks. In addition, the new sentence representations are suitable for efficient conditional text generation, such as sentiment transfer, without the need for training on in-domain data.

A.2 Sentence Representations
We use the Sentence Transformers framework for training and evaluation of AUTOBOT. We use the default settings in their framework to train on NLI and evaluate using the Spearman correlation of the cosine similarity. During NLI finetuning, we only use the encoder and bottleneck, with the bottleneck representation used as the sentence representation, and allow all parameters to be finetuned.
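For reference, a sketch of the evaluation metric, assuming a hypothetical `encode` function that returns the bottleneck representation of a sentence:

```python
import torch.nn.functional as F
from scipy.stats import spearmanr

def sts_eval(encode, sentence_pairs, gold_scores):
    """Score each pair by the cosine similarity of its two bottleneck
    representations, then report Spearman's rank correlation with the
    gold similarity scores."""
    sims = [F.cosine_similarity(encode(a), encode(b), dim=0).item()
            for a, b in sentence_pairs]
    return spearmanr(sims, gold_scores).correlation
```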

A.3 Sentence Generation
We use a modified version of Fairseq's generation code for encoder-decoder models to perform vector arithmetic for sentiment transfer.We follow the instructions of Mai et al. (2020) to finetune a sentiment classifier using DistilBERT from the Huggingface transformers library.
For the AUTOBOT models finetuned on the Yelp dataset, we follow the exact same steps as in Appendix A.1, except that we begin with the AUTOBOT-base model, use the Yelp training set, and perform 10K optimization steps.

A.4 Sentence Classification
We use the Huggingface library to perform sentence classification using AUTOBOT. During finetuning, we only use the encoder and bottleneck, with the bottleneck representation used as a CLS representation, and allow all parameters to be finetuned. We perform a hyperparameter search similar to that of RoBERTa by comparing development performance across learning rates in {1e-5, 2e-5, 3e-5}.

B Additional Results
Below, we provide additional results that complement the experiments above.

B.1 Autoencoding Steps
We perform an ablation study on the effect of the number of autoencoding finetuning steps on downstream sentence representation performance. Table 5 provides the detailed results of the experiment from Table 4 when using a learning rate of 1e-3.

B.2 Finetunable Encoder Layers
We perform an ablation study on the effect of finetuning the underlying pretrained encoder during autoencoding on downstream sentence representation performance. We repeat the experiment from Table 4 with the optimal parameters, but vary how many of the last layers of RoBERTa-base are finetuned. Results are shown in Table 6.

B.3 Finetunable Encoder Generation
We extend the generation results from Section 3.4 to include the results obtained by allowing the top three layers of RoBERTa-base to be finetuned during autoencoding on the style generation task, using the same model as in Appendix B.2. The results are shown in Figure 3.

B.4 Style Transfer Results
We provide Table 7, which reports results on the Yelp sentiment transfer test set from the generation experiments in Section 3.4, extending the corresponding table of Mai et al. (2020). We outline the relative time differences during inference. We can observe that our model provides a competitive speed-quality tradeoff.

B.5 Detailed Sentence Classification Results
Section 3.3 provides a summary of the GLUE results and outlines the specific single-sentence classification performances. We provide the results for each task in Table 8.

Figure 1 :
Figure 1: Our autoencoder consists of a pretrained transformer encoder enc, a function β that compresses the encoder's final representations H of size T × d to a sentence bottleneck representation z of size d, and a transformer decoder dec that is trained to fully reconstruct the training sentence x.

Figure 2 :
Figure 2: Automatic evaluations of vector arithmetic for sentiment transfer, plotted as accuracy vs. self-BLEU. Accuracy (ACC) is measured by a sentiment classifier, and values for varying multiples of the sentiment vector are plotted. Upper right is better.

Figure 3 :
Figure 3: Automatic evaluations of vector arithmetic for sentiment transfer, plotted as accuracy vs. self-BLEU. Accuracy is measured by a sentiment classifier, and values for varying multiples of the sentiment vector are plotted. Upper right is better.

Table 1 :
On semantic textual similarity (STS), AUTOBOT outperforms previous sentence representation methods and reaches a score similar to RoBERTa-large while having fewer parameters. We report Spearman's rank correlation on the test set; model sizes are reported in terms of trained parameter count.

Table 2 :
Performance of sentence representations from RoBERTa trained with different pooling methods on NLI data and then evaluated on STS benchmark's development set in terms of Spearman's rank correlation.

Table 5 :
AUTOBOT pretraining steps vs. sentence representation performance when training on NLI and evaluating on STS.

Table 6 :
AUTOBOT finetunable layers vs. sentence representation performance when training on NLI and evaluating on STS.

Table 7 :
Results on the Yelp sentiment transfer test set at the highest transfer accuracy ("Acc."), following the setup of Mai et al. (2020). "+Time" reports the inference-time slowdown factor due to each method's additional computation relative to the method by Mai et al. (2020).

Table 8 :
Dev. results on GLUE.For RTE, STS and MRPC we finetune starting from the MNLI model instead of the baseline pretrained model.