Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence

Topic models extract groups of words from documents, whose interpretation as a topic hopefully allows for a better understanding of the data. However, the resulting word groups are often not coherent, making them harder to interpret. Recently, neural topic models have shown improvements in overall coherence. Concurrently, contextual embeddings have advanced the state of the art of neural models in general. In this paper, we combine contextualized representations with neural topic models. We find that our approach produces more meaningful and coherent topics than traditional bag-of-words topic models and recent neural models. Our results indicate that future improvements in language models will translate into better topic models.


Introduction
One of the crucial issues with topic models is the quality of the topics they discover. Coherent topics are easier to interpret and are considered more meaningful. E.g., a topic represented by the words "apple, pear, lemon, banana, kiwi" would be considered a meaningful topic on FRUIT and is more coherent than one defined by "apple, knife, lemon, banana, spoon." Coherence can be measured in numerous ways, from human evaluation via intrusion tests (Chang et al., 2009) to approximated scores (Lau et al., 2014; Röder et al., 2015).
However, most topic models still use Bag-of-Words (BoW) document representations as input. These representations, though, disregard the syntactic and semantic relationships among the words in a document, the two main linguistic avenues to coherent text. I.e., BoW models represent the input in an inherently incoherent manner.
Meanwhile, pre-trained language models are becoming ubiquitous in Natural Language Processing (NLP), precisely for their ability to capture and maintain sentential coherence. Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), the most prominent architecture in this category, allows us to extract pre-trained word and sentence representations. Their use as input has advanced state-of-the-art performance across many tasks. Consequently, BERT representations are used in a diverse set of NLP applications (Rogers et al., 2020; Nozza et al., 2020).
In this paper, we show that adding contextual information to neural topic models provides a significant increase in topic coherence. This effect is even more remarkable given that we cannot embed long documents due to the sentence-length limit in recent pre-trained language model architectures.
Concretely, we extend Neural ProdLDA (Product-of-Experts LDA) (Srivastava and Sutton, 2017), a state-of-the-art topic model that implements black-box variational inference (Ranganath et al., 2014), to include contextualized representations. Our approach leads to consistent and significant improvements in two standard metrics on topic coherence and produces competitive topic diversity results.
Contributions We propose a straightforward and easily implementable method that allows neural topic models to create coherent topics. We show that the use of contextualized document embeddings in neural topic models produces significantly more coherent topics. Our results suggest that topic models benefit from latent contextual information, which is missing in BoW representations. The resulting model addresses one of the most central issues in topic modeling. We release our implementation as a Python library, available at the following link: https://github.com/MilaNLProc/contextualized-topic-models.

Neural Topic Models with Language Model Pre-training

We introduce a Combined Topic Model (CombinedTM) to investigate the incorporation of contextualized representations in topic models. Our model is built around two main components: (i) the neural topic model ProdLDA (Srivastava and Sutton, 2017) and (ii) SBERT embedded representations (Reimers and Gurevych, 2019). Note that our method is agnostic about the choice of the topic model and the pre-trained representations, as long as the topic model extends an autoencoder and the pre-trained representations embed the documents.

ProdLDA is a neural topic modeling approach based on the Variational AutoEncoder (VAE). The neural variational framework trains a neural inference network to directly map the BoW document representation into a continuous latent representation. A decoder network then reconstructs the BoW by generating its words from the latent document representation (for more details, see Srivastava and Sutton, 2017). The framework explicitly approximates the Dirichlet prior using Gaussian distributions, instead of using a Gaussian prior like Neural Variational Document Models (Miao et al., 2016). Moreover, ProdLDA replaces the multinomial distribution over individual words in LDA with a product of experts (Hinton, 2002).

We extend this model with contextualized document embeddings from SBERT (Reimers and Gurevych, 2019; https://github.com/UKPLab/sentence-transformers), a recent extension of BERT that allows the quick generation of sentence embeddings. This approach has one limitation: if a document is longer than SBERT's sentence-length limit, the rest of the document is lost. The document representations are projected through a hidden layer with the same dimensionality as the vocabulary size, then concatenated with the BoW representation. Figure 1 briefly sketches the architecture of our model.
The hidden layer size could be tuned, but an extensive evaluation of different architectures is beyond the scope of this paper.
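As an illustration, forming the combined input can be sketched in a few lines of numpy. This is a minimal sketch, not the released implementation: the vocabulary size is an arbitrary placeholder, the projection weights are randomly initialized here (in the model they are learned), and 1024 is the embedding dimensionality of stsb-roberta-large.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 2000  # |V|, an illustrative placeholder
sbert_dim = 1024   # stsb-roberta-large embedding dimensionality

# Hypothetical inputs: one document's BoW vector and its SBERT embedding
bow = rng.random(vocab_size)
sbert_embedding = rng.random(sbert_dim)

# Projection through a hidden layer with the same dimensionality as the
# vocabulary (these weights are learned in the real model, random here)
W = rng.standard_normal((vocab_size, sbert_dim)) * 0.01
b = np.zeros(vocab_size)
projected = W @ sbert_embedding + b

# The input to the inference network: BoW concatenated with the
# projected contextualized representation
combined = np.concatenate([bow, projected])
```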

Experimental Setting
We provide detailed explanations of the experiments (e.g., runtimes) in the supplementary materials. To ensure full replicability, we use open-source implementations of the algorithms.

Datasets
We evaluate the models on five datasets: 20NewsGroups, Wiki20K (a collection of 20,000 English Wikipedia abstracts from Bianchi et al. (2021)), Tweets2011, Google News (Qiang et al., 2019), and the StackOverflow dataset (Qiang et al., 2019). The latter three are already pre-processed. We use a similar pipeline for 20NewsGroups and Wiki20K: removing digits, punctuation, stopwords, and infrequent words. We derive the SBERT document representations from the unpreprocessed text for Wiki20K and 20NewsGroups.

Metrics
We evaluate each model on three different metrics: two for topic coherence (normalized pointwise mutual information and a word-embedding-based measure) and one metric quantifying the diversity of the topic solutions. Normalized pointwise mutual information (τ) is a symbolic metric that relies on word co-occurrence. As Ding et al. (2018) pointed out, though, topic coherence computed on the original data is inherently limited. Coherence computed on an external corpus correlates much better with human judgment, but it may be expensive to estimate.
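For completeness, the per-pair NPMI score underlying τ can be sketched as follows. The probability estimates are simplified, and the `doc_freq`/`joint_freq` co-occurrence interface is a hypothetical placeholder for whatever count statistics are available:

```python
import itertools
import math

def npmi(p_i, p_j, p_ij, eps=1e-12):
    """NPMI of a word pair from marginal and joint document probabilities.

    Ranges from -1 (words never co-occur) through 0 (independent)
    to 1 (words always co-occur).
    """
    pmi = math.log((p_ij + eps) / (p_i * p_j))
    return pmi / -math.log(p_ij + eps)

def topic_npmi(top_words, doc_freq, joint_freq, n_docs):
    """Average NPMI over all pairs of a topic's top words."""
    scores = []
    for a, b in itertools.combinations(top_words, 2):
        p_a, p_b = doc_freq[a] / n_docs, doc_freq[b] / n_docs
        p_ab = joint_freq[frozenset((a, b))] / n_docs
        scores.append(npmi(p_a, p_b, p_ab))
    return sum(scores) / len(scores)
```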
External word embeddings topic coherence (α) provides an additional measure of how similar the words in a topic are. We follow Ding et al. (2018) and first compute the average pairwise cosine similarity of the word embeddings of the top-10 words in a topic, using word2vec embeddings (Mikolov et al., 2013). Then, we compute the overall average of those values for all the topics. We can consider this measure an external topic coherence, but it is more efficient to compute than Normalized Pointwise Mutual Information on an external corpus.
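Concretely, α can be sketched as below, assuming `embeddings` maps each word to a vector; handling of out-of-vocabulary words is simplified here:

```python
import numpy as np

def embedding_coherence(topic_words, embeddings):
    """Average pairwise cosine similarity of a topic's top words."""
    vecs = [embeddings[w] for w in topic_words if w in embeddings]
    sims = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            a, b = vecs[i], vecs[j]
            sims.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
    return sum(sims) / len(sims)

def alpha(topics, embeddings):
    """Overall alpha: mean embedding coherence across all topics."""
    return sum(embedding_coherence(t, embeddings) for t in topics) / len(topics)
```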
Inverted Rank-Biased Overlap (ρ) evaluates how diverse the topics generated by a single model are. We define ρ as the complement of the standard RBO (Webber et al., 2010; Terragni et al., 2021b). RBO compares the top-10 words of two topics. It allows disjointedness between the topic lists (i.e., two topics can have different words in them) and uses weighted ranking, i.e., two lists that share some of the same words, albeit at different rankings, are penalized less than two lists that share the same words at the highest ranks. ρ is 0 for identical topics and 1 for completely different topics.
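A truncated sketch of RBO and the resulting diversity score might look like this. The full metric adds an extrapolation term for unseen depths, which we omit, and p = 0.9 is an assumed weight-decay parameter:

```python
import itertools

def rbo(list1, list2, p=0.9):
    """Truncated Rank-Biased Overlap of two ranked word lists.

    Agreement at the top ranks is weighted more heavily; p controls how
    quickly the weights decay with depth. Note: identical lists score
    1 - p**depth rather than exactly 1, because the tail is truncated.
    """
    depth = min(len(list1), len(list2))
    score = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(list1[:d]) & set(list2[:d]))
        score += p ** (d - 1) * overlap / d
    return (1 - p) * score

def inverted_rbo(topics, p=0.9):
    """Diversity: average (1 - RBO) over all pairs of topic word lists."""
    pairs = list(itertools.combinations(topics, 2))
    return sum(1 - rbo(a, b, p) for a, b in pairs) / len(pairs)
```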

Models
Our main objective is to show that contextual information increases coherence. To show this, we compare our approach to ProdLDA (Srivastava and Sutton, 2017), the model we extend (we use the implementation of Carrow, 2018), and to LDA, NVDM, ETM, and MetaLDA. As contextualized representations we use stsb-roberta-large SBERT embeddings.

Configurations
To maximize comparability, we train all models with similar hyper-parameter configurations. The inference network for both our method and ProdLDA consists of one hidden layer of 100 softplus units, which maps the input into embeddings. This final representation is again passed through a hidden layer before the variational inference step. We follow Srivastava and Sutton (2017) for the choice of the parameters.
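To make the flow concrete, the following numpy sketch mimics the forward pass of such an inference network with random, untrained weights. Dimensions are purely illustrative, and the real model learns these parameters and applies regularization such as dropout:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

rng = np.random.default_rng(0)
input_dim = 2000   # size of the document representation, illustrative
hidden = 100       # 100 softplus units, as in the configuration above
n_topics = 25

# Hypothetical inference-network weights (randomly initialized here)
W1 = rng.standard_normal((hidden, input_dim)) * 0.01
W2 = rng.standard_normal((hidden, hidden)) * 0.01
W_mu = rng.standard_normal((n_topics, hidden)) * 0.01
W_sigma = rng.standard_normal((n_topics, hidden)) * 0.01

x = rng.random(input_dim)              # a document representation
h = softplus(W2 @ softplus(W1 @ x))    # two softplus hidden layers
mu, log_sigma = W_mu @ h, W_sigma @ h  # Gaussian posterior parameters

# Reparameterization trick: sample the latent code, then normalize it
# into a document-topic distribution
z = mu + np.exp(log_sigma) * rng.standard_normal(n_topics)
theta = np.exp(z) / np.exp(z).sum()
```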

Results
We divide our results into two parts: we first describe the results of our quantitative evaluation, and we then explore the effect on performance of using two different contextualized representations.

Quantitative Evaluation
We compute all the metrics for 25, 50, 75, 100, and 150 topics. We average results for each metric over 30 runs of each model (see Table 2). As a general remark, our model provides the most coherent topics across all corpora and topic settings, even maintaining a competitive diversity of the topics. This result suggests that the incorporation of contextualized representations can improve a topic model's performance.
LDA and NVDM obtain low coherence; this result was also confirmed by Srivastava and Sutton (2017). ETM shows good external coherence (α), especially on 20NewsGroups and StackOverflow. However, it fails to obtain good τ coherence on short texts. Moreover, ρ shows that its topics are very similar to one another; a manual inspection of the topics confirmed this problem. MetaLDA is the most competitive of the models we compare against, possibly due to its incorporation of pre-trained word embeddings. Our model provides very competitive results, with MetaLDA as the second-strongest model. For this reason, we provide a detailed comparison of τ in Table 3, showing the average coherence for each number of topics: on 4 out of 5 datasets our model provides the best results, while still keeping a very competitive score on 20NewsGroups, where MetaLDA is best.
Readers can see examples of the top words for each model in the Supplementary Materials. These descriptors illustrate the increased coherence of topics obtained with SBERT embeddings.

Using Different Contextualized Representations
Contextualized representations can be generated from different models, and some representations might be better than others. Indeed, one question left to answer is the impact of the specific contextualized model on the topic modeling task. To answer this question, we rerun all the experiments with CombinedTM using different contextualized sentence embedding methods as input to the model. We compare the performance of CombinedTM using two different models from the SBERT repository for embedding the contextualized representations: stsb-roberta-large (Ours-R), as employed in the previous experimental setting, and bert-base-nli-means (Ours-B). The latter is derived from a BERT model fine-tuned on NLI data. Table 4 shows the coherence of the two approaches on all the datasets (averaged over all results). In these experiments, RoBERTa fine-tuned on the STSb dataset has a strong positive impact on coherence. This result suggests that including newer and better contextualized embeddings can further improve a topic model's performance.
Related Work

The Embedded Topic Model (Dieng et al., 2020) represents words and topics in the same embedding space. Srivastava and Sutton (2017) propose a neural variational framework that explicitly approximates the Dirichlet prior using a Gaussian distribution. Our approach builds on this work but adds a crucial component, i.e., the representations from a pre-trained transformer, which benefit from both general language knowledge and corpus-dependent information. Similarly, Bianchi et al. (2021) replace the BoW document representation with pre-trained contextualized representations to tackle the problem of cross-lingual zero-shot topic modeling. This approach was extended by Mueller and Dredze (2021), who also considered fine-tuning the representations. A very recent approach in a similar direction (Hoyle et al., 2020) uses knowledge distillation (Hinton et al., 2015) to combine neural topic models and pre-trained transformers.

Conclusions
We propose a straightforward method to incorporate contextualized embeddings into topic models. The proposed model significantly improves the quality of the discovered topics. Our results show that contextual information is a significant element to consider for topic modeling as well.

Ethical Statement
In this research, we used datasets from the recent literature, and we do not use or infer any sensitive information. The risk of possible abuse of our models is low.
B.1 ProdLDA

The topic and document distributions are learnable parameters. Momentum is set to 0.99, the learning rate to 0.002, and we apply 20% dropout to the hidden document representation. The batch size is 200. More details on the architecture can be found in the original work (Srivastava and Sutton, 2017).

B.2 Combined TM
The model and the hyper-parameters are the same as those used for ProdLDA, with the difference that we also use SBERT features in combination with the BoW: we take the SBERT English embeddings, apply a learnable dense layer mapping R^1024 to R^|V|, and concatenate the result with the BoW representation. We train the model for 100 epochs with the Adam optimizer.

B.3 LDA
We use Gensim's implementation of this model (https://radimrehurek.com/gensim/models/ldamodel.html).
The hyper-parameters α and β, controlling the document-topic and word-topic distribution respectively, are estimated from the data during training.

B.5 Meta-LDA
We use the authors' implementation available at https://github.com/ethanhezhao/MetaLDA. As suggested, we use GloVe embeddings to initialize the models, taking the 50-dimensional embeddings from https://nlp.stanford.edu/projects/glove/. The parameters α and β are set to 0.1 and 0.01, respectively.

B.6 Neural Variational Document Model (NVDM)
We use the implementation available at https://github.com/ysmiao/nvdm with default hyper-parameters, except that we use two alternating epochs for the encoder and the decoder.

C Computing Infrastructure
We ran the experiments on two common laptops equipped with a GeForce GTX 1050: the models can easily be run on basic infrastructure (having a GPU is better than using only a CPU, but the experiments can also be replicated on CPU). Both laptops have 16GB of RAM. The CUDA version used for the experiments was 10.0.

C.1 Runtime
The vocabulary size influences the computational time the most. Table 5 shows the runtime of one epoch of both our Combined TM (CTM) and ProdLDA (PDLDA) for 25 and 50 topics on the Google News and 20NewsGroups datasets with the GeForce GTX 1050. ProdLDA is faster than our Combined TM, due to the added representation our model must process. However, the runtimes are quite similar, making our model easy to use even with common hardware.