Neural Topic Modeling based on Cycle Adversarial Training and Contrastive Learning



Introduction
Topic modeling, which uncovers the semantic structures within a collection of documents, has been widely used in various natural language processing (NLP) tasks (Zhou et al., 2017; Yang et al., 2018, 2019; Zhou et al., 2021; Wang et al., 2022). Latent Dirichlet Allocation (LDA) (Blei et al., 2003), a probabilistic graphical model, is one of the most popular topic models due to its interpretability and effectiveness. However, the parameter estimation methods for LDA and its variants, such as collapsed Gibbs sampling (Griffiths and Steyvers, 2004), are model-specific and require specialized derivations.
To tackle these disadvantages, neural topic models have been proposed with a flexible training process. They can be divided into two categories: variational autoencoder (VAE) based and generative adversarial network (GAN) based. VAE-based neural topic models regard the encoded latent vector as the topic distribution of the input document, then employ the decoder to reconstruct the word distribution (Miao et al., 2016; Srivastava and Sutton, 2017; Miao et al., 2017; Card et al., 2018; Wang et al., 2021). To address the limitation that VAE-based neural topic models cannot approximate the Dirichlet distribution precisely, Wang et al. (2019) propose an adversarial topic model, in which the topic distribution is sampled from the Dirichlet prior distribution directly and transformed into the word distribution by the generator. To uncover topic distributions and infer document topics simultaneously, the Bidirectional Adversarial Topic model (Wang et al., 2020) and Topic Modeling with Cycle-consistent Adversarial Training (Hu et al., 2020) have been proposed in turn.
Recently, a neural topic model named CLNTM has been proposed to apply contrastive learning to VAE-based neural topic models (Nguyen and Luu, 2021). A data augmentation strategy is proposed to replace the salient and non-salient parts of the document representation according to word frequency information, constructing positive and negative examples. Although it achieves promising results, CLNTM has a notable disadvantage. As shown in Figure 1, due to the limitation of the unidirectional structure of VAE, the contrastive loss optimizes the encoder rather than the decoder which generates the topic-word distribution, leading to a gap between model training and evaluation.
Therefore, in this paper, we consider discovering topics based on cycle adversarial training and contrastive learning. As illustrated in the lower left part of Figure 1, by incorporating cycle adversarial training, the topic distribution θ and word distribution x are transformed bidirectionally, breaking the structural limitations of VAE-based neural topic models.

Contrastive Learning
Figure 1: Difference between CLNTM (Nguyen and Luu, 2021) and the proposed approach.

However, it is not straightforward to combine contrastive learning and cycle adversarial training. On the one hand, it is crucial to construct applicable positive samples for topic distributions. On the other hand, it is hard to improve the learning of the topic-word distribution while maintaining the bidirectional mapping ability of cycle adversarial training. To overcome these challenges, we propose a novel Neural Topic Modeling framework based on Adversarial training and Contrastive Learning (NTM-ACL). A Self-supervised contrastive loss is employed to make the generator capture similar topic information between positive pairs. The generation of the topic-word distribution is improved directly, which mitigates the gap between model training and evaluation. Meanwhile, a Discriminative contrastive loss is designed to cooperate with the Self-supervised contrastive loss, preventing the adversarial training from being undermined by an imbalance between generation and discrimination. Moreover, data augmentation is applied to construct positive samples of topic distributions using the reconstruction ability of the cycle generative adversarial network structure: the minimum items in the reconstructed distribution are substituted for the corresponding items in the original distribution, which has not been explored before. We conduct extensive experiments to fully demonstrate the effectiveness of our proposed model.
In a nutshell, the main contributions of our paper can be summarized as follows:
• We propose NTM-ACL, a novel neural topic modeling framework where contrastive learning is directly applied to the generation of the topic-word distribution.
• We propose a novel data augmentation strategy for topic distribution based on the reconstruction ability of cycle adversarial training.
To the best of our knowledge, we are the first to apply data augmentation to construct positive samples of topic distributions.
• We conduct extensive experiments and experimental results show that NTM-ACL outperforms several competitive baselines on four benchmark datasets.

Related Work
Our work is mainly related to two lines of research, including neural topic models and contrastive learning.

Neural Topic Model
Inspired by VAE, Miao et al. (2016) proposed the Neural Variational Document Model (NVDM) for text modeling, employing a Gaussian as the prior distribution of latent topics. Following that, further VAE-based models were proposed with different priors and richer structures (Srivastava and Sutton, 2017; Card et al., 2018). On the adversarial side, the adversarial topic model (ATM) (Wang et al., 2019) samples topic distributions directly from a Dirichlet prior, and Bidirectional Adversarial Topic (BAT) (Wang et al., 2020) constructs two-way adversarial training on the basis of ATM. Hu et al. (2020) propose Topic Modeling with Cycle-consistent Adversarial Training (ToMCAT) to realize the transformation between topic distributions and word distributions, inspired by CycleGAN (Zhu et al., 2017).

Contrastive Learning
Contrastive learning, a self-supervised learning method, improves the representation ability of models without large-scale labeled data and has become a popular technique in the computer vision domain (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b; Grill et al., 2020; Zhao et al., 2021). Chen et al. (2020a) proposed SimCLR, which randomly applies image transformations to generate two positive samples for each image and uses the normalized temperature-scaled cross-entropy loss (NT-Xent) as the training objective to pull positive pairs close in the representation space.
With the success of contrastive learning in computer vision tasks, recent studies attempt to extend it to other domains. Khosla et al. (2020) extended the self-supervised approach to the fully-supervised setting, allowing models to effectively leverage label information. Jeong and Shin (2021) proposed ContraD to incorporate a contrastive learning scheme into GAN. In natural language processing, contrastive learning is widely applied to various tasks, such as sentence embedding, text classification, information extraction, and stance detection (Gao et al., 2021; Yan et al., 2021; Zhang et al., 2022; Wu et al., 2022; Chuang et al., 2022; Liang et al., 2022). In neural topic modeling, contrastive learning has been used to improve the VAE-based neural topic model by adding a contrastive objective to the training loss and taking a more principled approach to creating positive and negative samples (Nguyen and Luu, 2021).

Method
The overall architecture of the proposed NTM-ACL is shown in Figure 2, which consists of three parts: 1) Cycle Adversarial Training based Neural Topic Model, which includes the generator, the encoder, and discriminators to transform topic distributions and word distributions bidirectionally; 2) Topic-Augmented Contrastive Learning, which includes the Self-supervised contrastive loss and the Discriminative contrastive loss to enhance the generator without affecting the adversarial training; 3) Reconstruct Min-Term Replacement, which exploits the reconstruction ability of the cycle generative adversarial network to create positive samples of topic distributions.

Problem Setting
We denote the corpus as $D$, which consists of $M$ documents $\{x_i\}_{i=1}^{M}$. Given a document $x_i \in \mathbb{R}^V$, where $V$ is the vocabulary size, the first purpose of topic modeling is topic inference: inferring the corresponding topic distribution $\theta_i \in \mathbb{R}^K$, where $K$ is the number of topics.
To formalize topic modeling, we use $X$ to stand for the word distribution set, where each document is represented in normalized Term Frequency-Inverse Document Frequency (TF-IDF), and $\Theta$ to stand for the topic distribution set, where each topic distribution is sampled from a Dirichlet distribution with parameter $\alpha \in \mathbb{R}^K$.
During the training process, we need to learn two mapping functions, the generator $G$ and the encoder $E$. $G$ transforms samples from $\Theta$ into $X$, while $E$ is the reverse function of $G$. After $G$ is well-trained, the indicator vector of each topic is fed in to obtain the topic-word distribution. This is the other purpose of topic modeling, referred to as topic discovery. The one-hot vector $I_k \in \mathbb{R}^K$ denotes the indicator vector of the $k$-th topic, where the value at the $k$-th index is 1.
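The two purposes above, topic inference and topic discovery, can be sketched numerically. The random softmax-normalized linear map below is a hypothetical stand-in for a trained generator $G$, and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 20, 1000  # number of topics, vocabulary size (illustrative values)

# Hypothetical stand-in for a trained generator G: Theta -> X,
# implemented here as a random linear map followed by softmax.
W = rng.normal(size=(K, V))

def G(theta):
    logits = theta @ W
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# A topic distribution sampled from the Dirichlet prior with alpha = 1/K.
alpha = np.full(K, 1.0 / K)
theta = rng.dirichlet(alpha)

# Topic discovery: feed the one-hot indicator vector I_k to G
# to read off the k-th topic-word distribution.
I_3 = np.eye(K)[3]
topic_word = G(I_3)
top_words = np.argsort(topic_word)[::-1][:10]  # indices of top-10 topic words
```

The same pattern generalizes to batches by stacking distributions row-wise.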

Cycle Adversarial Training
Following Hu et al. (2020), NTM-ACL consists of two mapping functions, the generator $G: \Theta \rightarrow X$ and the encoder $E: X \rightarrow \Theta$, together with their related discriminators $D_x$ and $D_\theta$. They are all implemented as three-layer multi-layer perceptrons (MLPs), with an $H$-dim hidden layer using LeakyReLU as the activation function and batch normalization, followed by a softmax output layer. The cycle adversarial training objective is composed of an adversarial loss and a cycle-consistency loss. We apply Wasserstein GAN (WGAN) (Arjovsky et al., 2017) adversarial losses to train $G$ and the corresponding discriminator $D_x$:

$$\mathcal{L}_{adv}(G, D_x) = \mathbb{E}_{x \sim X}[D_x(x)] - \mathbb{E}_{\theta \sim \Theta}[D_x(G(\theta))],$$

in which $G$ tries to generate word distributions similar to samples in $X$, while $D_x$ aims to distinguish generated samples from real samples. $G$ aims to minimize this objective against an adversary $D_x$ that tries to maximize it. A symmetric loss $\mathcal{L}_{adv}(E, D_\theta)$ is applied to $E$ and $D_\theta$.
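As a rough numerical sketch (not the authors' implementation), the WGAN objective decomposes into a critic loss and a generator loss over batches of critic scores:

```python
import numpy as np

# Sketch of the WGAN adversarial objective, assuming D_x is a critic that
# outputs unconstrained real-valued scores for word distributions.
def critic_loss(dx_real, dx_fake):
    # D_x maximizes E[D_x(x)] - E[D_x(G(theta))]; we return the negation
    # so that gradient descent on this value performs that maximization.
    return -(np.mean(dx_real) - np.mean(dx_fake))

def generator_loss(dx_fake):
    # G minimizes -E[D_x(G(theta))], pushing generated samples toward
    # higher critic scores.
    return -np.mean(dx_fake)

# Toy critic scores for a batch of real and generated samples.
dx_real = np.array([0.9, 1.1, 1.0])
dx_fake = np.array([-0.5, 0.2, -0.1])
c_loss = critic_loss(dx_real, dx_fake)
g_loss = generator_loss(dx_fake)
```

When the critic cleanly separates real from generated samples, its loss is strongly negative and the generator loss is large, which drives $G$ toward the real data distribution.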
To further constrain the relationship between the original distribution and the target distribution, we additionally use cycle-consistency losses, encouraging $G$ and $E$ to reconstruct the original distribution. The cycle-consistency losses are implemented as follows:

$$\mathcal{L}_{cyc}(G, E) = \mathbb{E}_{\theta \sim \Theta}\big[\|E(G(\theta)) - \theta\|_1\big] + \mathbb{E}_{x \sim X}\big[\|G(E(x)) - x\|_1\big],$$

where $\|\cdot\|_1$ denotes the L1 norm. Combining the adversarial losses and the cycle-consistency loss, the objective of cycle adversarial training is:

$$\mathcal{L}_{Cyc\text{-}adv} = \lambda_1 \big(\mathcal{L}_{adv}(G, D_x) + \mathcal{L}_{adv}(E, D_\theta)\big) + \lambda_2 \mathcal{L}_{cyc}(G, E),$$

where $\lambda_1$ and $\lambda_2$ control the importance of the losses respectively.

Data Augmentation
In this subsection, we describe how to apply data augmentation to construct positive samples of topic distributions, taking advantage of the structural features of cycle adversarial training. Given a distribution $\theta_i = \{\theta_{i1}, \theta_{i2}, \cdots, \theta_{iK}\}$, the reconstructed distribution $\hat{\theta}_i$ created by the cycle of $G$ and $E$ is similar to the original distribution:

$$\hat{\theta}_i = E(G(\theta_i)) \approx \theta_i.$$

We hypothesize that items with the maximum values in $\theta_i$ indicate the salient topic information, which has a significant effect on generating the word distribution $\hat{x}_i$ through $G$. On the contrary, items with the minimum values in $\theta_i$ have limited effects. After making slight modifications to them, $G$ can still generate a word distribution similar to $\hat{x}_i$.
Based on the above assumptions, we propose a data augmentation strategy for topic distributions named Reconstruct Min-Term Replacement (RMR). For the reconstructed topic distribution $\hat{\theta}_i = \{\hat{\theta}_{i1}, \hat{\theta}_{i2}, \cdots, \hat{\theta}_{iK}\}$, we select the $p$ minimum items from it. The indices of these items in $\hat{\theta}_i$ are denoted as $\{a_1, a_2, \cdots, a_p\}$. We replace the values at the corresponding indices in $\theta_i$:

$$\theta'_{ik} = \begin{cases} \hat{\theta}_{ik}, & k \in \{a_1, a_2, \cdots, a_p\} \\ \theta_{ik}, & \text{otherwise.} \end{cases}$$

For the topic distribution $\theta_i$, we denote its data-augmented distribution as $\theta'_i$. Correspondingly, the topic distribution set $\Theta$ after data augmentation is denoted as $\Theta'$.
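The RMR replacement rule can be sketched in a few lines of NumPy. Whether to renormalize the augmented distribution afterwards is left open here, since the text does not specify it:

```python
import numpy as np

def rmr(theta, theta_rec, p):
    """Reconstruct Min-Term Replacement: copy the p smallest-valued items of
    the reconstructed distribution theta_rec into theta at the same indices.
    (Renormalization of the result is an open choice, not specified here.)"""
    theta_aug = theta.copy()
    idx = np.argsort(theta_rec)[:p]  # indices a_1..a_p of the p minimum items
    theta_aug[idx] = theta_rec[idx]
    return theta_aug

theta = np.array([0.50, 0.30, 0.12, 0.05, 0.03])
theta_rec = np.array([0.48, 0.33, 0.10, 0.06, 0.03])
theta_aug = rmr(theta, theta_rec, p=2)
# Only the two minimum items of theta_rec (indices 4 and 3) are copied over,
# so the salient (large) entries of theta are left untouched.
```

Because only low-probability entries change, the augmented sample stays close to the original in the regions that matter for generating the word distribution.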

Topic-Augmented Contrastive Learning
In this subsection, we introduce Topic-Augmented Contrastive Learning, which enhances $G$ while keeping the balance between generation and discrimination. This part mainly consists of two training objectives: the Self-supervised contrastive loss and the Discriminative contrastive loss.
Self-supervised contrastive loss. We follow the setting in SimCLR (Chen et al., 2020a) and use the Normalized Temperature-Scaled Cross-Entropy Loss (NT-Xent Loss) to calculate the Self-supervised contrastive loss $\mathcal{L}_{SelfCon}$. $\mathcal{L}_{SelfCon}$ helps improve the mapping ability of $G$, capturing similar topic information to generate a better topic-word distribution. The Self-supervised contrastive loss pulls the word distributions of positive topic distribution pairs together while pushing apart the word distributions corresponding to negative sample pairs, as shown in the upper right part of Figure 2. Given a representation $r_i$, with its positive sample denoted as $r_i^+$ and the set of its negative samples denoted as $r^-$, the NT-Xent Loss between $r_i$, $r_i^+$ and $r^-$ is:

$$\ell(r_i, r_i^+, r^-) = -\log \frac{\exp(\mathrm{sim}(r_i, r_i^+)/\tau)}{\exp(\mathrm{sim}(r_i, r_i^+)/\tau) + \sum_{r_j \in r^-} \exp(\mathrm{sim}(r_i, r_j)/\tau)},$$

where $\tau$ is a temperature hyperparameter.
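A minimal NumPy sketch of the NT-Xent loss, assuming cosine similarity as sim(·,·) (the common choice in SimCLR):

```python
import numpy as np

def cos(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nt_xent(r_i, r_pos, negatives, tau=0.5):
    """NT-Xent loss for anchor r_i with positive r_pos and a list of
    negatives, using cosine similarity scaled by temperature tau."""
    pos = np.exp(cos(r_i, r_pos) / tau)
    neg = sum(np.exp(cos(r_i, r_n) / tau) for r_n in negatives)
    return -np.log(pos / (pos + neg))

# Toy 2-d representations: a nearby positive yields a lower loss than a
# distant one, which is exactly the pull-together / push-apart behavior.
r_i = np.array([1.0, 0.0])
r_pos = np.array([0.9, 0.1])
negs = [np.array([0.0, 1.0]), np.array([-1.0, 0.2])]
loss_close = nt_xent(r_i, r_pos, negs)
loss_far = nt_xent(r_i, np.array([0.0, 1.0]), [r_pos, negs[1]])
```

Lower temperature sharpens the softmax, concentrating the gradient on the hardest negatives.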
Assuming that the topic set $\Theta$ of the current training batch contains $N$ samples, we obtain $\Theta'$ after data augmentation, and the number of training samples is expanded to $2N$. The two sets are transformed into $\hat{X}$ and $\hat{X}'$ respectively. For a word distribution $\hat{x}_i$ in $\hat{X}$, we can find its positive sample $\hat{x}'_i$ in $\hat{X}'$. The remaining $2N-2$ word distributions form the set of negative samples, denoted as $\hat{X}_i^-$:

$$\hat{X}_i^- = (\hat{X} \cup \hat{X}') \setminus \{\hat{x}_i, \hat{x}'_i\}.$$

Based on the above description, we define $\mathcal{L}_{SelfCon}$ as follows:

$$\mathcal{L}_{SelfCon} = \frac{1}{2N} \sum_{i=1}^{N} \big[\ell(\hat{x}_i, \hat{x}'_i, \hat{X}_i^-) + \ell(\hat{x}'_i, \hat{x}_i, \hat{X}_i^-)\big].$$

Discriminative contrastive loss. The Self-supervised contrastive loss $\mathcal{L}_{SelfCon}$ can make the generator better perceive the similarity between two topics and then generate topic-word distributions that are more in line with the corresponding topics. However, only improving the mapping ability leads to an imbalance between generation and discrimination, which undermines the performance of cycle adversarial training. Therefore, we additionally design a Discriminative contrastive loss $\mathcal{L}_{DisCon}$, leveraging the category information of real and generated samples to keep the balance between generation and discrimination.
It is obvious that samples in $X$ belong to the real category, while samples in $\hat{X}$ and $\hat{X}'$ belong to the generated category. For any $x_i$ in $X$, we denote $U_i = [X \setminus x_i; \hat{X}; \hat{X}']$. The main purpose of the Discriminative contrastive loss is not to focus on the similarity between positive sample pairs but to make samples of the same category closer. We define the Discriminative contrastive loss between $x_i$ and $U_i$ as:

$$\mathcal{L}_{Dis}(x_i, U_i) = -\frac{1}{N-1} \sum_{x^+ \in X \setminus x_i} \log \frac{\exp(\mathrm{sim}(x_i, x^+)/\tau)}{\sum_{u \in U_i} \exp(\mathrm{sim}(x_i, u)/\tau)},$$

where $x^+$ stands for samples of the same category as $x_i$.
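Under the assumption that the Discriminative contrastive loss follows the supervised-contrastive form (every same-category sample acts as a positive, normalized against the full candidate set $U_i$), a sketch might look like:

```python
import numpy as np

def sim_exp(a, b, tau=0.5):
    cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.exp(cos / tau)

def l_dis(x_i, same_category, u_i, tau=0.5):
    """Supervised-contrastive-style loss for one anchor x_i: each sample of
    x_i's category (here, the other real samples) is treated as a positive,
    normalized against every candidate in U_i. A sketch of the loss
    described above, not the exact published formulation."""
    denom = sum(sim_exp(x_i, u, tau) for u in u_i)
    terms = [-np.log(sim_exp(x_i, xp, tau) / denom) for xp in same_category]
    return float(np.mean(terms))

# Toy anchor, one same-category (real) sample, and two generated samples.
x_i = np.array([1.0, 0.0])
real_other = [np.array([0.9, 0.1])]
generated = [np.array([0.0, 1.0]), np.array([0.1, 0.9])]
loss = l_dis(x_i, real_other, real_other + generated)
```

Because generated samples appear only in the denominator, minimizing this loss pulls real samples together while implicitly pushing the generated category away, which is what keeps the discriminator competitive.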
For the whole batch, we define the Discriminative contrastive loss $\mathcal{L}_{DisCon}$ as:

$$\mathcal{L}_{DisCon} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_{Dis}(x_i, U_i).$$

Overall Training Objective. Summing up $\mathcal{L}_{Cyc\text{-}adv}$, $\mathcal{L}_{SelfCon}$ and $\mathcal{L}_{DisCon}$, the overall training objective of our model is:

$$\mathcal{L} = \mathcal{L}_{Cyc\text{-}adv} + \lambda_3 \mathcal{L}_{SelfCon} + \lambda_4 \mathcal{L}_{DisCon},$$

where $\lambda_3$ and $\lambda_4$ control the relative significance of the Self-supervised contrastive loss and the Discriminative contrastive loss respectively. At each training iteration, the parameters of $G$ and $E$ are updated once after the parameters of $D_\theta$ and $D_x$ have been updated 5 times.
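The 5:1 update schedule at the end of this subsection can be sketched as follows; `update_discriminators` and `update_generators` are hypothetical closures over the actual parameter updates and the combined objective:

```python
# Alternating update schedule: the discriminators D_theta and D_x are
# updated 5 times for every single update of G and E, the usual WGAN-style
# critic/generator ratio described above.
N_CRITIC = 5

def train_epoch(batches, update_discriminators, update_generators):
    step = 0
    for batch in batches:
        update_discriminators(batch)
        step += 1
        if step % N_CRITIC == 0:  # G and E move once every 5 critic steps
            update_generators(batch)
```

Training the critics more often keeps their Wasserstein estimate accurate, which in turn gives the generator and encoder useful gradients.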

Baselines
We compare NTM-ACL with the following baselines:
• LDA (Blei et al., 2003), a probabilistic graphical model and one of the most popular conventional topic models; we use the GibbsLDA++ implementation.
• NVDM (Miao et al., 2016), a VAE-based neural topic model that employs Gaussian prior for topic distributions.
• ProdLDA (Srivastava and Sutton, 2017), a VAE-based neural topic model that employs logistic normal prior to approximate Dirichlet prior.
• Scholar (Card et al., 2018), a VAE-based neural topic model that integrates metadata on the basis of ProdLDA.
• CLNTM (Nguyen and Luu, 2021), the first attempt to combine contrastive learning with a VAE-based topic model.

Implementation Details and Evaluation
We set the Dirichlet parameter $\alpha$ to $1/K$. The dimension $H$ of the hidden layer is set to 100. The number of replacement items $p$ changes dynamically with the number of topics $K$; specifically, we set $p = K/4$. For the training objective, we set $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ to 2, 0.2, 1e-3, and 1e-3 respectively, aligning the magnitudes of the different losses. During training, we set the batch size to 256 for NYTimes and Grolier, 1,024 for DBPedia, and 64 for 20Newsgroups. The number of training epochs is set to 150. We use the Adam optimizer to update the model parameters, with a learning rate of 1e-4 and a momentum term of 0.5.
Following previous work (Wang et al., 2020), we evaluate the performance of NTM-ACL and the baselines using topic coherence measures that are highly correlated with human subjective judgments. For each topic, we select the top 10 topic words by probability to represent the topic. C_A (Aletras and Stevenson, 2013), C_P (Röder et al., 2015), and NPMI (Aletras and Stevenson, 2013) are the three topic coherence measures we use to evaluate models. We apply the Palmetto tool to calculate coherence scores. We refer readers to (Röder et al., 2015) for more details of the topic coherence measures.
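To illustrate how an NPMI-style coherence score rewards co-occurring topic words (the actual measures are computed with Palmetto over Wikipedia; this toy version estimates probabilities from a handful of document word sets):

```python
import numpy as np
from itertools import combinations

def npmi_coherence(topic_words, docs, eps=1e-12):
    """Toy NPMI over pairs of top topic words, with word probabilities
    estimated from document-level co-occurrence in `docs` (a list of sets)."""
    n = len(docs)
    def p(words):
        return sum(all(w in d for w in words) for d in docs) / n
    scores = []
    for wi, wj in combinations(topic_words, 2):
        pij, pi, pj = p([wi, wj]), p([wi]), p([wj])
        if pij == 0:
            scores.append(-1.0)  # convention for never-co-occurring pairs
            continue
        pmi = np.log(pij / (pi * pj + eps))
        scores.append(pmi / -np.log(pij + eps))  # normalize PMI into [-1, 1]
    return float(np.mean(scores))

docs = [{"game", "team", "season"}, {"team", "player"}, {"music", "album"}]
coherent = npmi_coherence(["game", "team"], docs)      # words that co-occur
incoherent = npmi_coherence(["game", "album"], docs)   # words that never do
```

Real coherence tools use sliding-window counts over a large reference corpus, but the scoring principle is the same.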

Experiment Results
To make a robust comparison of NTM-ACL with the baselines, we set the number of topics to 20, 30, 50, 75, and 100 on each dataset. We then calculate the average topic coherence score over the 5 settings. The experimental results are presented in Table 2.
Compared with baselines of diverse structures, NTM-ACL performs better on most datasets and topic coherence measures, illustrating the effectiveness of our proposed approach. VAE-based neural topic models perform poorly due to their prior assumptions. Compared with GAN-based neural topic models, especially ToMCAT, which is also based on cycle adversarial training, NTM-ACL achieves state-of-the-art results on all datasets, demonstrating the effectiveness of Topic-Augmented Contrastive Learning. CLNTM, the first method to incorporate contrastive learning into topic modeling, performs worse than NTM-ACL except on the 20Newsgroups dataset in terms of the C_A and C_P scores. This result illustrates that combining contrastive learning with cycle adversarial training is more effective in improving topic discovery by eliminating the gap between model training and evaluation. Note that to calculate coherence scores, Palmetto uses Wikipedia as the reference corpus, while CLNTM uses the training corpus itself as the reference, which explains the difference between our reported results and those of (Nguyen and Luu, 2021). Based on the C_P score, we select the 4 topics with the best coherence scores from the 50-topic results of NYTimes and Grolier respectively. Each topic is represented by its top-5 topic words. As shown in Table 3, the best topics of NYTimes are related to sports, politics, and music news, while the topics of Grolier reflect science and culture.

Ablation Study
We conduct an ablation study on the relative contributions of the different training objectives to topic modeling performance. We compare our full model with the following ablated variants: 1) Self-supervised only removes $\mathcal{L}_{DisCon}$ from the contrastive learning objective. 2) Discriminative only removes $\mathcal{L}_{SelfCon}$ from the contrastive learning objective. 3) w/o Adversarial Loss removes the adversarial loss for word distributions and relies only on contrastive learning to distinguish samples. 4) w/o Cycle-Consistency Loss removes the cycle-consistency losses. We perform experiments on the Grolier dataset. The results are shown in Table 4.
From Table 4, we can make the following observations: 1) Removing either the Adversarial Loss or the Cycle-Consistency Loss leads to a performance drop, indicating that preserving the full objective of cycle adversarial training is a necessary condition for the proposed method. 2) Self-supervised only creates an imbalance between generation and discrimination, harming model performance. 3) Although Discriminative only achieves a higher C_P score, its overall performance decreases compared to NTM-ACL, indicating the effectiveness of the Self-supervised contrastive loss in improving topic-word generation.

Different Data Augmentation Strategies
To fully demonstrate the effectiveness of the proposed Reconstruct Min-Term Replacement strategy, we design two simple data augmentation strategies for comparison: 1) Noise Added (NA): the topic distribution $\theta_i$ is added with a noise distribution of the same dimension, randomly sampled from a Gaussian distribution with expectation 0 and variance 0.01. 2) Zero Masked (ZM): after obtaining the indices $\{a_1, a_2, \cdots, a_p\}$, the values at the corresponding indices are set to 0. We apply the different data augmentation strategies and keep all other experimental settings unchanged. The results are shown in Table 5. It can be observed that most coherence scores increase compared to GAN-based neural topic models, indicating the robustness of our contrastive learning approach. Moreover, the results of NTM-ACL are the highest among the three strategies, proving RMR to be a more suitable strategy for topic distributions.
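The two comparison strategies can be sketched alongside RMR; the Gaussian parameters follow the description above (mean 0, variance 0.01, i.e. standard deviation 0.1):

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_added(theta, std=0.1):
    """NA: add element-wise Gaussian noise with mean 0 and variance 0.01."""
    return theta + rng.normal(0.0, std, size=theta.shape)

def zero_masked(theta, theta_rec, p):
    """ZM: zero out theta at the indices of the p minimum items of the
    reconstructed distribution theta_rec."""
    theta_aug = theta.copy()
    theta_aug[np.argsort(theta_rec)[:p]] = 0.0
    return theta_aug

theta = np.array([0.50, 0.30, 0.12, 0.05, 0.03])
theta_rec = np.array([0.48, 0.33, 0.10, 0.06, 0.03])
zm = zero_masked(theta, theta_rec, p=2)
```

Unlike RMR, NA perturbs every entry (including the salient ones) and ZM discards the low-probability mass entirely, which is consistent with their weaker results in Table 5.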

Effect of Replacement Number
The number of replacement items $p$ is one of the important hyperparameters of RMR. For different numbers of topics $K$, it is inappropriate to set $p$ to a fixed value. In this subsection, we compare the dynamic setting with fixed values (1, 5, 15) on the four datasets, using the C_P coherence measure. The results are shown in Figure 3.
It can be observed that the data augmentation strategy with a dynamic replacement number achieves the best performance. When $p$ is too small, the difference between the positive pair is too slight; when $p$ is too large, the similarity between positive samples cannot provide sufficient information for the Self-supervised contrastive loss.

Training Strategy
In this subsection, we explore different training strategies, making contrastive learning and cycle adversarial training work in different stages of the 150 epochs. We abbreviate contrastive learning as CL and cycle adversarial training as CA, and design the following strategies: 1) Disentangle, separating CL from CA: the first 50 epochs use CL/CA to update the model parameters, and the next 100 epochs only employ CA/CL; we also design an alternation strategy in which CL is used to update the parameters after every 5 epochs of CA only. 2) Warm up, using CL to warm up in the first 50 epochs; the next 100 epochs adopt the original strategy or are divided into two equal stages, i.e., CL+CA→CA. The results are shown in Table 6.

From the results, we can observe that NTM-ACL achieves the best performance. The result of CL→CA is second only to the best result. Using CL after CA undermines the stable symmetric structure instead of further improving the mapping ability, which should be avoided in future studies.

Conclusion
In this paper, we have proposed NTM-ACL, a novel topic modeling framework based on cycle adversarial training and contrastive learning. The Self-supervised contrastive loss improves the generation of the topic-word distribution that is used for the evaluation of topic modeling, while the Discriminative contrastive loss keeps the balance between generation and discrimination. Moreover, a novel data augmentation strategy is designed to create positive samples of topic distributions based on the reconstruction ability of cycle adversarial training. The experimental results show that the proposed method outperforms competitive baselines of different structures.

Limitations
In this section, we describe the limitations of our proposed method in terms of data augmentation and the way contrastive learning is combined with cycle adversarial training. First, our data augmentation strategy relies on the reconstruction ability of cycle adversarial training. We believe that more data augmentation strategies for topic distributions will be studied. Second, given the symmetrical structure of cycle adversarial training, it is worth exploring how to optimize the encoder $E$ and the generator $G$ through contrastive learning simultaneously. We can extend the proposed framework to a conjugated structure in future work. Moreover, although we have shown that contrastive learning and cycle adversarial training performing synchronously works better, we believe that more sophisticated training strategies can be designed to further improve the performance of topic modeling.

ACL 2023 Responsible NLP Checklist
A. For every submission:

A1. Did you describe the limitations of your work?
We discuss the limitations of our work after the conclusion section.

A2. Did you discuss any potential risks of your work?
As a research domain with extensive practical use, topic models do not show obvious potential risks, so we do not discuss this aspect specifically.

B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)? Section 4.

B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it? Previous research extensively validated the datasets we use to ensure data security.
B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.? Section 4.

B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created? Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be. Section 4.

C Did you run computational experiments?
Our model does not use pre-trained language models and requires few computing resources.
C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used? No response.

Figure 2: The architecture of the proposed model, NTM-ACL.

A3. Do the abstract and introduction summarize the paper's main claims? Section 1.

A4. Have you used AI writing assistants when working on this paper? Left blank.

B. Did you use or create scientific artifacts? Section 4.

B1. Did you cite the creators of artifacts you used? Section 4.

B2. Did you discuss the license or terms for use and / or distribution of any artifacts? Section 4.

The statistics of the processed datasets are shown in Table 1.

Table 1: Dataset statistics.

Table 3: Top 4 topics discovered by NTM-ACL on NYTimes and Grolier.

Table 4: Performance of different ablated variants compared with the full model.

Table 5: Effectiveness of different data augmentation strategies.

Table 6: Comparison between different training strategies on Grolier.