OCTIS: Comparing and Optimizing Topic models is Simple!

In this paper, we present OCTIS, a framework for training, analyzing, and comparing Topic Models, whose optimal hyper-parameters are estimated using a Bayesian Optimization approach. The proposed solution integrates several state-of-the-art topic models and evaluation metrics. These metrics can be targeted as the objective of the underlying optimization procedure to determine the best hyper-parameter configuration. OCTIS allows researchers and practitioners to perform a fair comparison between topic models of interest on several benchmark datasets and well-known evaluation metrics, to integrate novel algorithms, and to interactively visualize the results for understanding the behavior of each model. The code is available at the following link: https://github.com/MIND-Lab/OCTIS.


Introduction
Topic models are promising statistical methods that aim to extract the hidden topics underlying a collection of documents. Although researchers have proposed several models over the years (Blei, 2012; Vayansky and Kumar, 2020), their evaluation and comparison remain a hard task. The evaluation of a topic model usually involves different datasets (with non-standard pre-processing) (Schofield and Mimno, 2016; Schofield et al., 2017) and several evaluation metrics (Lau et al., 2014; Wallach et al., 2009; Terragni et al., 2020a). Furthermore, topic models are usually compared by fixing their hyper-parameters. However, choosing the optimal hyper-parameter configuration for a given dataset and a given evaluation metric is fundamental to induce each model at the best of its capabilities, and therefore to guarantee a fair comparison with other models.
Current topic modeling frameworks (McCallum et al., 2005;Qiang et al., 2018;Lisena et al., 2020) typically focus on the release of topic modeling algorithms while ignoring one or more critical aspects of the topic modeling pipeline, such as preprocessing, evaluation, comparison of the models, and visualization. Most importantly, they disregard the hyper-parameter selection.
In this paper, we present OCTIS (Optimizing and Comparing Topic models Is Simple), a unified and open-source evaluation framework for training, analyzing, and comparing Topic Models over several datasets and evaluation metrics. Their optimal hyper-parameter configuration is determined according to a Bayesian Optimization (BO) strategy (Snoek et al., 2012; Galuzzi et al., 2020).
In the following, we summarize the main contributions of the proposed framework:
• several open-source topic models have been integrated into a unified framework, providing a common interface that allows the users to easily experiment with topic models;
• a single-objective BO approach has been integrated to determine the optimal hyper-parameter values of each model, for a given dataset and a specific evaluation metric of interest;
• an interactive visualization of the results for inspecting the details of the models, providing insights about the optimization strategy, word and topic distributions, and robustness of the estimated configuration;
• a python library for advanced exploitation of the framework, e.g. for integrating novel algorithms together with their training and inference procedures.

System design and architecture
OCTIS is an open-source evaluation framework for the comparison of a set of state-of-the-art topic models, which allows the user to optimize the models' hyper-parameters for a fair experimental comparison. The proposed framework follows an object-oriented paradigm, providing all the tools for running a topic modeling pipeline. The main functionalities of OCTIS are related to dataset pre-processing, training topic models, estimating evaluation metrics, hyper-parameter optimization, and interactive web dashboard visualization. Figure 1 summarizes the workflow involving the first four modules (the dashboard interacts with all of them). The framework can be used both as a python library and as a dashboard. The python library offers more advanced functionalities than the ones available in the dashboard. The modules that comprise the OCTIS framework are detailed in the following sections.

Datasets and Pre-processing
The first step of the topic modeling pipeline is the pre-processing of the input dataset. OCTIS includes the following pre-processing utilities:
• reducing the text to lowercase;
• punctuation removal;
• lemmatization;
• stop-words removal;
• removal of the most infrequent and most frequent words (according to a specified frequency threshold);
• removal of documents with few words (according to a specified length threshold).
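As a rough sketch, the utilities above can be chained as follows. The function name, stop-word list, and thresholds are illustrative, not the actual OCTIS interface, and lemmatization is omitted for brevity:

```python
import string
from collections import Counter

def preprocess(docs, stopwords, min_df=0.005, max_df=0.9, min_words=3):
    """Minimal pre-processing sketch: lowercasing, punctuation removal,
    stop-word removal, vocabulary filtering by document frequency, and
    removal of documents that become too short."""
    table = str.maketrans("", "", string.punctuation)
    tokenized = [[w for w in d.lower().translate(table).split()
                  if w not in stopwords] for d in docs]

    # Document frequency of each remaining word
    df = Counter(w for doc in tokenized for w in set(doc))
    n = len(tokenized)
    vocab = {w for w, c in df.items() if min_df <= c / n <= max_df}

    filtered = [[w for w in doc if w in vocab] for doc in tokenized]
    # Drop documents that end up with fewer than min_words tokens
    return [doc for doc in filtered if len(doc) >= min_words]
```

The frequency thresholds play the same role as the 0.5% and 0.05% cut-offs used for the bundled datasets described below.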
These utilities include the most common techniques for pre-processing text for topic modeling. However, some of these features may not be appropriate for specific domains and languages, e.g. those requiring language-specific or domain-specific stop-words. OCTIS currently provides 4 pre-processed datasets, i.e. 20 NewsGroups, M10 (Lim and Buntine, 2014), DBLP, and BBC News (Greene and Cunningham, 2006), which differ in nature and document length.
The datasets already available in OCTIS, and accessible through the web dashboard, have been pre-processed according to the length of the documents. In particular, we removed the punctuation, lemmatized the text, filtered out the stop-words (using the English stop-words list provided by MALLET), and removed the words that have a word frequency of less than 0.5% for 20 Newsgroups and BBC News, and less than 0.05% for DBLP and M10. Subsequently, we removed the documents with less than 5 words for 20 Newsgroups and BBC News, and less than 3 words for the other datasets. Table 1 reports some statistics about the currently available pre-processed datasets. Although OCTIS already provides some datasets, a user can upload and pre-process any dataset (using the python library) according to their needs.

Topic Modeling
OCTIS integrates both classical topic models and neural topic models. In particular, the following traditional and neural approaches are available to be trained, optimized, analyzed, and compared (a subset of these models is also available in the web dashboard):
• Latent Dirichlet Allocation (LDA) (Blei et al., 2003);
• Non-negative Matrix Factorization (NMF) (Lee and Seung, 2000);
• Latent Semantic Analysis (LSI) (Hofmann, 1999);
• Contextualized Topic Models (CTM) (Bianchi et al., 2021).
Moreover, we defined a standard interface that allows a user to integrate their own topic model implementation. A topic model is indeed treated as a black-box, i.e. a system viewed solely in terms of its inputs and outputs, whose internal workings are invisible. This black-box topic model takes as input a dataset and a set of hyper-parameter values, and returns the top-t topic words, the document-topic distributions, and the topic-word distribution in a specified format.
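The black-box contract can be sketched as a minimal python interface. The class names and dictionary keys below are illustrative assumptions, not the exact OCTIS API, and the toy model only serves to show the input/output contract:

```python
from abc import ABC, abstractmethod
from collections import Counter

class AbstractTopicModel(ABC):
    """Black-box contract: a tokenized dataset plus hyper-parameter
    values go in; topics and distributions come out."""

    def __init__(self, **hyperparameters):
        self.hyperparameters = hyperparameters

    @abstractmethod
    def train_model(self, dataset, top_words=10):
        """Return the top words per topic, the topic-word
        distribution, and the document-topic distribution."""

class MostFrequentWordsModel(AbstractTopicModel):
    """Toy single-topic model: the 'topic' is simply the most
    frequent words of the corpus."""

    def train_model(self, dataset, top_words=10):
        counts = Counter(w for doc in dataset for w in doc)
        vocab = [w for w, _ in counts.most_common()]
        total = sum(counts.values())
        return {
            "topics": [vocab[:top_words]],                    # top-t words per topic
            "topic-word-matrix": [[counts[w] / total for w in vocab]],
            "topic-document-matrix": [[1.0] * len(dataset)],  # one topic covers all documents
        }
```

Any model exposing this interface can be plugged into the evaluation and optimization modules described next.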

Evaluation Metrics
The proposed framework provides several evaluation metrics. A metric can be used as the objective targeted by the Bayesian Optimization strategy, or to monitor the behavior of a topic model while the model is optimized on a different objective. The performance of a topic model can be evaluated by investigating different aspects, according to the following evaluation metrics:
• Topic coherence metrics (Lau et al., 2014; Röder et al., 2015), which compute how related the top-k words of a topic are to each other;
• Topic significance metrics (AlSumait et al., 2009; Terragni et al., 2020b), which focus on the document-topic and topic-word distributions to discover high-quality and junk topics;
• Diversity metrics (Dieng et al., 2019; Bianchi et al., 2020), which measure how diverse the top-k words of a topic are with respect to each other;
• Classification metrics (Phan et al., 2008; Terragni et al., 2020a), where the document-topic distribution of each document is used as a K-dimensional representation to train a classifier that predicts the document's class.
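For instance, a common diversity metric, the proportion of unique words across the top-k words of all topics (Dieng et al., 2019), can be sketched in a few lines (the function name is ours, not the OCTIS one):

```python
def topic_diversity(topics, topk=10):
    """Fraction of unique words among the top-k words of all topics:
    1.0 means no word is shared between topics; values close to 0
    indicate highly redundant topics."""
    top_words = [topic[:topk] for topic in topics]
    unique_words = {w for topic in top_words for w in topic}
    return len(unique_words) / (topk * len(topics))
```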
OCTIS provides 10 evaluation metrics directly available in the web dashboard, and 13 accessible through the python library.

Hyper-parameter Optimization
The proposed framework uses Bayesian Optimization (Snoek et al., 2012; Shahriari et al., 2015) to tune the hyper-parameters of the topic models. If any of the available hyper-parameters is selected to be optimized for a given evaluation metric, BO explores the search space to determine the optimal settings. Since the performance estimated by the evaluation metrics can be affected by noise, the objective function is computed as the median of the selected evaluation metric over a given number of model runs (i.e., topic models trained with the same hyper-parameter configuration).
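This noise-handling step can be sketched as a wrapper around any train-and-evaluate routine (the names below are illustrative, not the OCTIS API):

```python
import statistics

def median_objective(config, train_and_score, model_runs=5):
    """Evaluate a hyper-parameter configuration robustly: train the
    (non-deterministic) topic model several times with the same
    configuration and return the median metric value."""
    scores = [train_and_score(config) for _ in range(model_runs)]
    return statistics.median(scores)
```

Using the median rather than a single run makes the objective less sensitive to the randomness of the training procedure.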
BO is a sequential model-based optimization strategy for expensive and noisy black-box functions (e.g. topic models). The basic idea consists of using all the configurations evaluated so far to approximate the value of the performance metric, and then selecting a new promising configuration to evaluate. The approximation is provided by a probabilistic surrogate model, which describes the prior belief over the objective function using the observed configurations. The next configuration to evaluate is selected through the optimization of an acquisition function, which leverages the uncertainty in the posterior to guide the exploration.
We integrated into OCTIS most of the BO algorithms of the Scikit-Optimize library (Head et al., 2018) to provide a robust and efficient BO implementation. We integrated Gaussian Process and Random Forest as surrogate models, and included Probability of Improvement, Expected Improvement, and Upper Confidence Bound as acquisition functions. See (Snoek et al., 2012) for more details about the use of BO for hyper-parameter optimization. Since Bayesian Optimization requires some initial configurations to fit the surrogate model, the user can provide these initial configurations according to their domain knowledge; alternatively, different sampling algorithms are available for generating them (e.g. Uniform Random Sampling or the Latin Hypercube sequence). Instead of performing BO, a user can also resort to a pure exploration of the search space, using a random search technique to find the best hyper-parameter configuration.
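The pure-exploration alternative mentioned above can be sketched as a plain random search (stdlib only; the BO variants replace the uniform sampling with a surrogate-guided choice of the next configuration):

```python
import random

def random_search(score_fn, search_space, n_calls=20, seed=42):
    """Sample each hyper-parameter uniformly from its (low, high)
    range and keep the best-scoring configuration seen so far."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_calls):
        config = {name: rng.uniform(low, high)
                  for name, (low, high) in search_space.items()}
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Unlike grid search, the cost of random search is controlled directly by the number of calls rather than by the size of the hyper-parameter grid.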

Existing frameworks
The existing topic modeling frameworks usually provide topic modeling algorithms while disregarding other essential aspects of the whole topic modeling pipeline: pre-processing, evaluation, comparison, and visualization of the results and, most importantly, the hyper-parameter selection. In the following, we outline the existing frameworks, highlighting their advantages and limitations.
MALLET (McCallum, 2002) and gensim are the best-known topic modeling libraries and include several classical topic models. They provide pre-processing methods and the estimation of the hyper-parameters using maximum likelihood estimation (MLE) techniques. These libraries do not include the recently proposed neural topic models, and they provide only topic coherence metrics.
STTM (Qiang et al., 2018) is a Java library that provides a set of topic models specifically designed for short texts, together with several evaluation metrics.
ToModAPI (Lisena et al., 2020) is a python API that allows for training, inference, and evaluation of different topic models, including some of the most recent ones. However, it does not provide a method for finding the best hyper-parameter configuration of a topic model. A tool that does allow for optimizing the hyper-parameters of a machine learning model is PyCARET (Ali, 2020). However, it employs a grid-search technique to tune the hyper-parameters, an approach that can be very time-consuming when the number of hyper-parameters is high and the search space is huge (Bergstra and Bengio, 2012).
OCTIS stands at the union of the features of the existing frameworks: we integrated both classical and recent neural topic models, providing pre-processing methods, evaluation metrics, and the possibility of optimizing the hyper-parameters. Finally, we provide a user-friendly graphical interface to launch one or more hyper-parameter optimization experiments on a given topic model and a specific dataset. Table 2 summarizes the main features of the existing topic modeling frameworks and compares them with OCTIS.

System usage
OCTIS has been designed to be used as a python library by advanced users, as well as through a simple web dashboard by anyone. Using the python library, a few lines of code suffice to execute an optimization experiment that provides an optimal configuration of the hyper-parameters α and β for LDA with 25 topics by maximizing the diversity of the topics.
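Such an experiment would look roughly as follows. This snippet is a sketch reconstructed from the library's public documentation: class names, dataset identifiers, and optimizer parameters should be checked against the current OCTIS release.

```python
from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA
from octis.evaluation_metrics.diversity_metrics import TopicDiversity
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real

# Load one of the pre-processed benchmark datasets
dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")

# LDA with a fixed number of topics; alpha and beta (eta) are optimized
model = LDA(num_topics=25)
metric = TopicDiversity(topk=10)
search_space = {"alpha": Real(low=0.001, high=5.0),
                "eta": Real(low=0.001, high=5.0)}

# BO with the median over several model runs as the objective
optimizer = Optimizer()
result = optimizer.optimize(model, dataset, metric, search_space,
                            number_of_call=30, model_runs=5,
                            save_path="results/")
```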

Web-based dashboard
The dashboard includes a set of simple but useful operations to conduct an experimental campaign on different topic models. Here we briefly explain the four main functionalities of the dashboard.
Experiment creation. First, a user can define an optimization experiment by selecting the dataset, the topic model, the corresponding hyper-parameters to optimize, the evaluation metric to be considered by the BO (and possibly other extra metrics to evaluate), and the settings of the optimization process.
Management of the experiments' queue. The user can monitor the queue of the experiments and see the corresponding progress. The user can also pause, restart, or delete an experiment that has been launched before. Additionally, the user can easily change the order of the queue of the experiments, allowing a given run to be executed before others.

[Figure 2: Example of the best-seen evolution for an optimization experiment.]

Comparison of the Topic Models. The user can select the models to be analyzed and compared. At a first stage, one can observe the progress of the BO iterations through a plot that reports, at each iteration, the best-seen evaluation, i.e. the best median value of the optimized metric obtained so far (see Figure 2). Alternatively, a user can visualize a box plot for each iteration (see Figure 3) to understand whether a given hyper-parameter configuration is noisy (high variance) or not.
Analysis of a single experiment. A user can further inspect the results of a specific topic model on a given dataset with respect to the considered metrics, by analyzing a single experiment.
Here, a user can visualize all the information and statistics related to the experiment, including the best hyper-parameter configuration and the best value of the optimized metric. They can also have an outline of the statistics of the other extra metrics that they had chosen to evaluate. We provide three different plots for inspecting the output of a single run of a topic model. Figure 4 shows the word cloud obtained from the most relevant words of a given topic, scaled by their probability. Focusing on the distributions inferred by a topic model, Figure 5 shows the topic distribution of a document, and Figure 6 represents an example of the weight of a selected word of the vocabulary for each topic.

Conclusions
In this paper, we presented OCTIS, a framework for training, analyzing, and comparing Topic Models. The proposed framework is composed of a python library and a web dashboard, and integrates several state-of-the-art topic models (both traditional and neural). These models can be trained by searching for their optimal hyper-parameter configuration, for a given metric and dataset, exploiting a BO strategy. OCTIS allows researchers to train existing models, integrate new training and inference algorithms, and fairly compare the topic models of interest. On the other hand, practitioners could use OCTIS to boost the performance of topic models on their preferred downstream task or in a wide range of practical applications, such as exploratory data analysis (Boyd-Graber et al., 2017).
Regarding future work, OCTIS could integrate a multi-objective optimization strategy to optimize multiple metrics in the same BO procedure (Paria et al., 2020). For example, this could allow a user to find an optimal hyper-parameter configuration for both topic coherence and document classification.