Discovering Topics in Long-tailed Corpora with Causal Intervention

Topic models are effective at capturing the latent semantics of large-scale textual data, but existing methods are normally designed and evaluated on balanced corpora. This contradicts the fact that real-world corpora are naturally long-tailed, and the long-tailed bias can severely impair topic modeling performance. Therefore, in this paper, we propose a causal inference framework to explain and overcome the issues of topic modeling on long-tailed corpora. Causal intervention is applied in training to remove the influence of the long-tailed bias. Extensive experiments on manually constructed and naturally collected datasets demonstrate that our model can mitigate the bias effect, greatly improve topic quality, and better discover the hidden semantics on the tail.


Introduction
Topic models are designed to discover the underlying topics and semantic structures of unlabelled text collections. Owing to their effectiveness and interpretability, topic models have been applied in various downstream tasks such as information retrieval (Wang et al., 2007), content summarization (Ma et al., 2012) and recommendation systems (McAuley and Leskovec, 2013). One of the most widely used topic models is Latent Dirichlet Allocation (LDA) (Blei et al., 2003), a probabilistic graphical model that exploits the conjugacy of the Dirichlet and Multinomial distributions and infers its parameters with approximation methods (Griffiths and Steyvers, 2004; Blei et al., 2017). Recently, neural topic models based on the Variational AutoEncoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014) have been introduced, such as the Neural Variational Document Model (NVDM) (Miao et al., 2016) and Product of Experts LDA (ProdLDA) (Srivastava and Sutton, 2017). Compared to probabilistic models, they can easily carry out inference by gradient backpropagation.
However, these topic models are generally designed and evaluated on balanced corpora, such as the commonly used 20News (Lang, 1995), whose evenly distributed labels suggest that the latent topics are also evenly distributed. This conflicts with the fact that natural text collections typically follow a long-tailed distribution in line with Zipf's law (Reed, 2001), especially textual data on social network platforms (Zhang and Luo, 2019). More precisely, Figure 1 illustrates that in a collected corpus, documents about a few hot topics are numerous (head topics), while documents about most topics are few (tail topics). Due to this bias, similar to long-tailed classification tasks where a classifier tends to assign an image to the head classes (Kang et al., 2020; Zhou et al., 2020), topic models on long-tailed corpora tend to reveal the semantics of documents about head topics and largely ignore the documents about tail topics. Namely, the discovered topics are mostly about the latent head topics in the corpus. As a result, their diversity is impaired and they cannot represent the full semantics of the corpus. Thus, it is crucial to explore effective approaches to long-tailed topic modeling.
Different from other long-tailed tasks such as image classification or relation extraction, the key challenge of this problem is that topic modeling is designed for unlabelled datasets, so we have no access to classification labels from which to infer the latent global topic distribution while designing solutions. For this reason, we avoid introducing complicated modules that depend on accessible labels, e.g., the re-weighting (Mahajan et al., 2018) or re-sampling (Khan et al., 2017; Lin et al., 2017; Cui et al., 2019) approaches used for other long-tailed problems.
To overcome this challenge, in this paper, we present a Structural Causal Model (SCM) (Pearl et al., 2016; Pearl and Mackenzie, 2018) to explain how the long-tailed bias undermines topic modeling performance. Then, to remove the bias effect, we propose an approach based on causal intervention (Pearl et al., 2016) on topic distributions and adopt the backdoor adjustment (Pearl, 1995) to compute the causal effect without any auxiliary information. Furthermore, we introduce a novel neural model named Deconfounded Topic Model (DecTM), built in the VAE framework, with deconfounded training carried out through an approximation. Through comprehensive experiments, we show that our new model can mitigate the influence of the long-tailed bias and produce high-quality topics that are more diverse and better disclose the semantics of documents about tail topics. The main contributions of this paper can be summarized as follows:
1. We present a structural causal model to clarify in detail how the long-tailed bias causes the problems of topic modeling;
2. We further propose a neat method to approximate the causal intervention for reducing the bias influence, based on which a novel neural topic model is introduced with deconfounded training;
3. We validate our model on both manually constructed and extreme multi-label text classification datasets and demonstrate that it effectively alleviates the impact of the bias and greatly improves topic quality compared to both probabilistic and neural baseline models.

Related Work
Topic Modeling Probabilistic topic models date back to Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999) and LDA (Blei et al., 2003), from which numerous variants have been derived (Blei and Lafferty, 2006; Yan et al., 2013; Wu and Li, 2019). Previously, Wang et al. (2015) adapted LDA to discover long-tail semantics from large-scale corpora. Those models usually adopt Gibbs sampling or other approximate inference methods. More recently, neural topic models (Miao et al., 2016; Srivastava and Sutton, 2017; Wu et al., 2020a) have been introduced. They are derivation-free and can apply gradient backpropagation directly. Nevertheless, these earlier works, both probabilistic and neural, are normally evaluated on balanced datasets. Since long-tailed data are common in the natural world (Reed, 2001), this motivates us to investigate how these topic models perform on long-tailed corpora and to propose effective ways of alleviating the impact of the long-tailed bias.
Causal Inference Causal inference (Pearl et al., 2016) has been widely adopted for years in fields such as psychology, epidemiology, and medicine (MacKinnon et al., 2007; Richiardi et al., 2013), providing tools to investigate causation between research objects. Recently, causal inference has also attracted increasing attention in the computer vision and NLP communities for removing biases in datasets (Tang et al., 2020; Wu et al., 2020c) or providing counterfactual examples (Zeng et al., 2020) in domain-specific applications.
In this paper, we employ causal inference to investigate the causes of the issues in long-tailed topic modeling and propose a solution that alleviates the bias effect through deconfounded training via intervention.

Method
In this section, we first explain how the long-tailed bias affects topic modeling from the perspective of causal inference, and then propose a novel model to overcome this issue with deconfounded training by the causal intervention.

SCM for Topic Modeling
First of all, we investigate the causal relationship between the latent variables in a topic model with a Structural Causal Model (SCM). SCMs are expressed visually by using directed acyclic graphs.
In the graph, vertices are random variables, and directed edges represent direct causation from one variable to another (Pearl et al., 2016). There is a special kind of vertex in the graph: a confounder, a variable that influences both the independent and the dependent variable, creating a spurious statistical correlation between them. For example, consider a study reporting that chocolate consumption is statistically related to the number of Nobel prizes of a country (Dablander, 2020). Is it justified to argue that people can win Nobel prizes if they eat more chocolate? Common sense tells us this assertion is inaccurate. We can draw a causal graph to explain it: chocolate consumption ← economy → number of Nobel prizes. Chocolate consumption is usually higher in a developed country with a strong economy, and the number of Nobel prizes is also larger since the education level of its citizens is higher. Therefore, the economy acts as a confounder that creates a spurious correlation between chocolate consumption and the number of Nobel prizes.
Similar to the above example, we build an SCM, shown in Figure 2a, to describe how a biased corpus influences the text generation process of topic modeling. In the graph, C denotes the unobserved confounding bias in a long-tailed corpus. We denote the vocabulary size as V and set K topic entries (the topic number is K), which means the model needs to discover K latent topics. In topic modeling, a topic entry k is interpreted through its related words and represented by a word distribution $\beta_k \in \mathbb{R}^V$. The word distributions of all topic entries then form the topic-word distribution matrix $\beta = (\beta_1, \dots, \beta_k, \dots, \beta_K) \in \mathbb{R}^{V \times K}$. A document x is assigned to each topic entry k with probability $\theta_k$, so its distribution over all topic entries (the topic distribution of x) is $\theta = (\theta_1, \dots, \theta_k, \dots, \theta_K)^T \in \mathbb{R}^K$.
Then, x is generated from its topic distribution θ and the topic-word distribution matrix β of the whole corpus. The paths in Figure 2a can be interpreted as follows:
• C → θ: This path says that the topic distributions are trained under bias. If there is no bias, different topic entries are ideally assigned to documents about various topics, and the inferred topic distributions of these documents are also different. However, in a long-tailed corpus with bias, the topic distributions of documents about different topics can be similar. As shown in Figure 1, since documents about the head topics are the absolute majority, most of the topic entries are assigned to them. In this case, the topic entries assigned to a document about tail topics are probably also assigned to documents about head topics; as a result, its inferred topic distribution becomes similar to the topic distributions of some documents about head topics.
• C → β → x: This path denotes that the topic-word distribution matrix β is trained under the bias and is used to generate the document x. Due to the long-tailed bias, the generated x tends to contain words from the documents about head topics.
• θ ← C → β → x: Because of the confounder C, the inferred topic distribution of a document about tail topics could be similar to the topic distributions of some documents about head topics, and the generated documents of these similar topic distributions tend to include words in the documents about head topics instead of tail topics. Therefore, this backdoor path via C causes the spurious correlation between the topic distribution of a document about tail topics and the words in the documents about head topics.
In consequence, this spurious correlation through the confounder causes the topics discovered from documents about tail topics to be mixed with words of latent head topics. As the bias in the corpus becomes more severe, the discovered topics may even be entirely occupied by these words. Namely, topic models tend to ignore the semantics of the documents about tail topics and cannot discover the latent tail topics of a corpus.
In the above discussion, we clarify how the bias leads to the problems of long-tailed topic modeling with the presented SCM. In the next section, we propose a neat method to solve this issue without any auxiliary information.

Intervention on Topic Distribution
To remove the spurious correlation (deconfound), we propose to perform causal intervention via the do-operator (Pearl et al., 2016). Taking chocolate and Nobel prizes as an example again, intervening on chocolate consumption means fixing its value, which curtails its natural tendency to vary in response to the economy. This amounts to removing the edges directed into chocolate consumption. For example, if we were to close all chocolate factories, denoted as do(chocolate consumption = 0), we could observe the causal effect of chocolate consumption on the number of Nobel prizes.
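The effect of the do-operator in the chocolate example can be illustrated with a small simulation. This is only a toy sketch: the linear-Gaussian mechanism, the noise levels and the function names `simulate` and `pearson` are assumptions made for illustration, and the intervention sets chocolate consumption to externally chosen random values (rather than a single constant) so that a correlation can still be computed.

```python
import random

def simulate(n=20000, intervene=False, seed=0):
    """Draw (chocolate, nobel) pairs from a toy SCM: chocolate <- economy -> nobel."""
    rng = random.Random(seed)
    chocolate, nobel = [], []
    for _ in range(n):
        economy = rng.gauss(0.0, 1.0)
        if intervene:
            # do(chocolate): the value is set externally, so the edge
            # economy -> chocolate is removed.
            c = rng.gauss(0.0, 1.0)
        else:
            c = economy + rng.gauss(0.0, 0.5)
        # Nobel prizes depend on the economy only, never on chocolate.
        p = economy + rng.gauss(0.0, 0.5)
        chocolate.append(c)
        nobel.append(p)
    return chocolate, nobel

def pearson(a, b):
    """Sample Pearson correlation of two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

observed = pearson(*simulate(intervene=False))
intervened = pearson(*simulate(intervene=True))
print(f"observational correlation: {observed:.2f}")
print(f"correlation under do():    {intervened:.2f}")
```

In this toy setting the observational correlation is strong even though chocolate never enters the mechanism for Nobel prizes, while under the intervention the correlation is close to zero.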
Similarly, we intervene on the topic distribution θ to compute the causal effect of θ on x, i.e., p(x|do(θ)). As shown in Figure 2b, intervening on θ means cutting off the edge C → θ so that C cannot affect θ. However, it is difficult to actually intervene on variables (like closing all chocolate factories), so we utilize the backdoor adjustment (Pearl, 1995). The variable β meets the backdoor criterion and blocks the backdoor path θ ← C → β → x. Following the backdoor adjustment, we use Inverse Probability Weighting (Pearl et al., 2016) to express p(x|do(θ)). In Figure 2b, all association between θ and x flows along the directed path from θ to x, since there cannot be any backdoor paths when θ has no incoming edges, so p(x|do(θ), β) = p(x|θ, β). Also, p(β|do(θ)) = p(β|θ) since there are no other edges from θ to β except through the collider x. However, the resulting expression is intractable, so we need to approximate it.
To find a proper way, we bear in mind that topics are interpreted as word distributions, so long-tailed topics can also be seen as long-tailed words. If we treat these words as "labels", then the generative process of a document roughly amounts to predicting the probability under each "label". This inspires us to relate long-tailed topic modeling to long-tailed classification tasks (Kang et al., 2020; Tang et al., 2020). Similar to these tasks, as shown in Figure 3, we observe that the magnitudes of the topic distributions of words, i.e., $\beta_{i*}$ for word i, gradually decrease along with the word frequency. Intuitively, the magnitude of $\beta_{i*}$ reflects the "correlation score" between word i and all topics; therefore, this phenomenon may arise because most inferred topics tend to relate to the words in documents about head topics, as mentioned before. Based on this finding, we propose an approximation method following propensity score modeling (Rosenbaum and Rubin, 1983; Austin, 2011), in which i refers to a word in x and we also empirically include the magnitude of θ. Here, the denominator works as a normalizer that balances the magnitudes of the variables $\beta_{i*}$ and θ when approximating the intervention probability.
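The role of such a normalizer can be sketched in code. The exact form of the approximation is not reproduced here, so the snippet below assumes one plausible reading of the description: the score of word i is the inner product of its row of β with θ, divided by the product of the two magnitudes, followed by a softmax over the vocabulary. The function names and the toy matrix are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def norm(v):
    """Euclidean magnitude of a vector."""
    return math.sqrt(sum(a * a for a in v))

def plain_decoder(beta, theta):
    """Standard decoder: softmax over the raw scores beta_{i*} . theta."""
    return softmax([sum(b * t for b, t in zip(row, theta)) for row in beta])

def normalized_decoder(beta, theta, eps=1e-12):
    """Assumed deconfounded variant: each score is divided by ||beta_{i*}|| * ||theta||."""
    theta_norm = norm(theta)
    scores = []
    for row in beta:
        dot = sum(b * t for b, t in zip(row, theta))
        scores.append(dot / (norm(row) * theta_norm + eps))
    return softmax(scores)

# Toy V=3, K=2 matrix: word 0 is a frequent "head" word with a large row norm,
# word 2 is a rare "tail" word pointing in the same direction as word 0.
beta = [[4.0, 2.0], [1.0, 3.0], [0.4, 0.2]]
theta = [0.7, 0.3]
plain = plain_decoder(beta, theta)
balanced = normalized_decoder(beta, theta)
print(plain)
print(balanced)
```

In this toy case the plain decoder gives the head word a much higher probability than the tail word purely because its row of β has a larger magnitude, whereas the normalized scores depend only on direction, so the two words receive the same probability.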

Proposed Model
In this section, we propose a neural topic model for long-tailed corpora with deconfounded training based on the aforementioned intervention method, named as Deconfounded Topic Model (DecTM). Our network architecture is under the basic framework of VAE (Kingma and Welling, 2014;Rezende et al., 2014) with an encoder and a deconfounded decoder.

Encoder
The encoder transforms a text x into its topic distribution θ. Following the setting of Miao et al. (2016), we adopt the bag-of-words (BoW) assumption, which ignores word order, since topic models normally leverage word co-occurrences for topic inference. Given the BoW representation of x as input, we first obtain its intermediate representation π with a Multi-Layer Perceptron (MLP). Based on π, we then compute q(r|x), the variational distribution of the latent representation r. Since the prior distribution p(r) is assumed to be a Logistic Normal distribution approximating the Dirichlet distribution (Srivastava and Sutton, 2017), we model q(r|x) as $\mathcal{N}(\mu, \Sigma)$. The mean µ and variance Σ are calculated as $\mu = W_\mu \pi + b_\mu$ and $\Sigma = \mathrm{diag}(W_\Sigma \pi + b_\Sigma)$, where $W_\mu$, $W_\Sigma$ and $b_\mu$, $b_\Sigma$ are weight matrices and biases respectively, and diag(·) converts a vector to a diagonal matrix. Then, to reduce the gradient variance, we adopt the reparameterization trick (Kingma and Welling, 2014) to sample r as $r = \mu + \Sigma^{1/2} \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. Next, r is normalized with a softmax function to obtain the topic distribution θ as θ = softmax(r).
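The encoder steps described above (MLP, linear layers for the mean and variance, reparameterized sampling, softmax) can be sketched in plain Python. This is a minimal forward pass under stated assumptions, not the authors' implementation: the single tanh layer, the toy sizes, the parameter names and the use of exp() to keep the variance positive are all choices made for this illustration.

```python
import math
import random

def linear(weights, bias, x):
    """Affine map W x + b for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(values):
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def encode(bow, params, rng):
    """Map a bag-of-words vector to (theta, mu, var) with one reparameterized sample."""
    # Intermediate representation pi from a one-layer MLP with tanh (assumed activation).
    pi = [math.tanh(h) for h in linear(params["W_h"], params["b_h"], bow)]
    mu = linear(params["W_mu"], params["b_mu"], pi)
    # exp() keeps the diagonal variance positive; this is an assumption of the sketch.
    var = [math.exp(s) for s in linear(params["W_sigma"], params["b_sigma"], pi)]
    # Reparameterization: r = mu + sqrt(var) * eps with eps ~ N(0, I).
    r = [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]
    return softmax(r), mu, var

def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weight matrix used only to make the example runnable."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
V, H, K = 6, 4, 3  # toy vocabulary size, hidden size, topic number
params = {
    "W_h": random_matrix(H, V, rng), "b_h": [0.0] * H,
    "W_mu": random_matrix(K, H, rng), "b_mu": [0.0] * K,
    "W_sigma": random_matrix(K, H, rng), "b_sigma": [0.0] * K,
}
theta, mu, var = encode([2, 0, 1, 0, 0, 3], params, rng)
print(theta)
```

Because the randomness enters only through the standard normal draw, µ and the variance remain deterministic functions of the input, which is what allows gradients to flow through them in an actual training framework.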

Deconfounded Decoder
After obtaining the topic distribution of the input text, we feed it to the proposed deconfounded decoder for reconstruction. According to the method in Equation (4), the objective function of DecTM consists of two terms. The first term is the Kullback-Leibler (KL) divergence between the posterior and prior distributions, which can be computed analytically for two Normal distributions. The second term is the reconstruction error between the input and output text. Different from standard neural topic models (Miao et al., 2016; Srivastava and Sutton, 2017), the deconfounded decoder in our model employs the approximated probability for causal intervention on θ to weaken the long-tailed bias. Note that our model can be directly applied to naturally collected corpora since it requires no additional auxiliary information.
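The two terms of such an objective can be sketched as follows. The snippet assumes diagonal covariances for both the posterior and the prior and takes the reconstruction error to be the negative log-likelihood of the word counts under the decoder output; the function names, the unit-variance prior in the example and the toy numbers are illustrative assumptions rather than the paper's settings.

```python
import math

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ) in closed form."""
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += vq / vp + (mp - mq) ** 2 / vp - 1.0 + math.log(vp / vq)
    return 0.5 * total

def reconstruction_error(bow, word_probs, eps=1e-12):
    """Negative log-likelihood of the word counts under the decoder output."""
    return -sum(c * math.log(p + eps) for c, p in zip(bow, word_probs))

def objective(bow, word_probs, mu_q, var_q, mu_p, var_p):
    """Sum of the KL term and the reconstruction term for one document."""
    return kl_diag_gaussians(mu_q, var_q, mu_p, var_p) + reconstruction_error(bow, word_probs)

# Toy values: K=2 latent dimensions, V=3 words.
loss = objective(
    bow=[2, 0, 1],
    word_probs=[0.5, 0.2, 0.3],
    mu_q=[0.1, -0.2], var_q=[0.9, 1.1],
    mu_p=[0.0, 0.0], var_p=[1.0, 1.0],
)
print(loss)
```

In a deconfounded decoder, `word_probs` would come from the normalized (intervention-approximating) decoder rather than from a plain softmax over β θ; the loss itself keeps the same two-term structure.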

Datasets
Unfortunately, common datasets for topic modeling are almost all balanced, so we manually construct long-tailed variants of them by repeating and deleting documents according to the given labels, making them follow a long-tailed distribution. Given this label distribution, we can roughly assume that the latent topics are also long-tailed. In this way, we form the long-tailed versions (-LT) of 20News (Lang, 1995) and Yahoo Answer, called 20News-LT and Yahoo Answer-LT respectively. Moreover, to better evaluate the performance of long-tailed topic modeling, we adopt datasets for eXtreme Multi-label Text Classification (XMTC) (You et al., 2019), a task that predicts the most relevant labels for a text from an extremely large label set. The label set includes hundreds, thousands, or even millions of labels, and most are tail labels with very few positive samples. These plentiful labels can be naturally interpreted as the latent topics of documents; thus, we can evaluate the proposed model on these long-tailed datasets. We conduct experiments on subsets of the standard benchmark XMTC datasets Amazon-670K, Wiki-500K, AmazonCat-13K and Amazon-3M (Bhatia et al.) 5 .
For all datasets, we conduct the following preprocessing steps: (1) tokenize texts and lowercase words; (2) remove stop words and illegal characters; (3) remove low-frequency words. The statistics of the preprocessed datasets are reported in Table 1. It is worth noting that although labels are provided in these datasets, they are not used by our model.
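The construction of a long-tailed variant by repeating and deleting documents per label can be sketched as below. The geometric decay of class sizes, the parameter names `head_size` and `ratio`, and the function `make_long_tailed` are assumptions for illustration; the text does not specify the exact target distribution.

```python
import random
from collections import defaultdict

def make_long_tailed(docs, labels, head_size=100, ratio=0.5, seed=0):
    """Resample a labelled corpus so that class sizes decay geometrically.

    The r-th largest class gets max(1, round(head_size * ratio**r)) documents:
    classes with too many documents are subsampled (deleting), classes with
    too few are sampled with replacement (repeating).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_label[label].append(doc)
    # Rank labels by original size so the largest class becomes the head.
    ranked = sorted(by_label, key=lambda lab: (-len(by_label[lab]), str(lab)))
    out_docs, out_labels = [], []
    for rank, label in enumerate(ranked):
        target = max(1, round(head_size * ratio ** rank))
        pool = by_label[label]
        if target <= len(pool):
            chosen = rng.sample(pool, target)
        else:
            chosen = pool + [rng.choice(pool) for _ in range(target - len(pool))]
        out_docs.extend(chosen)
        out_labels.extend([label] * target)
    return out_docs, out_labels

docs = [f"doc {i} about class {i % 4}" for i in range(80)]
labels = [i % 4 for i in range(80)]
lt_docs, lt_labels = make_long_tailed(docs, labels, head_size=40, ratio=0.5)
sizes = {lab: lt_labels.count(lab) for lab in sorted(set(lt_labels))}
print(sizes)
```

The labels are used only to build the benchmark; a topic model trained on `lt_docs` would never see `lt_labels`.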

Baseline Models
We take both probabilistic and neural topic models as baselines. For probabilistic models, we consider the widely used LDA (Blei et al., 2003) with python-lda 6 package for topic inference. For neural topic models, we use NVDM (Miao et al., 2016) 7 , ProdLDA (Srivastava and Sutton, 2017) 8 and Scholar (Card et al., 2018) 9 . Scholar is an extension of ProdLDA via optionally incorporating metadata of documents like sentiments.

Evaluation Metrics
Following Nan et al. (2019) and Wu et al. (2020b), we evaluate the topic quality concerning two aspects, topic coherence and diversity. Topic coherence means that the words in the discovered topics are supposed to be as coherent as possible instead of irrelevant ones, and topic diversity means that topics should differ from each other instead of being similar ones.
5 http://manikvarma.org/downloads/XC/XMLRepository.html
6 https://github.com/lda-project/lda
7 https://github.com/ysmiao/nvdm
8 https://github.com/akashgit/autoencoding_vi_for_topic_models
9 https://github.com/dallascard/scholar

Topic Coherence For topic coherence, we employ $C_V$ (Röder et al., 2015), an improved variant of the Normalized Pointwise Mutual Information (NPMI) (Bouma, 2009; Chang et al., 2009; Newman et al., 2010). Its detailed calculation can be found in Wu et al. (2020b). Given a topic z and its top T probable words $(x_1, x_2, \dots, x_T)$, the NPMI of $(x_i, x_j)$ used in the $C_V$ computation is defined as $\mathrm{NPMI}(x_i, x_j) = \frac{\log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)}}{-\log p(x_i, x_j)}$, where $p(x_i)$ is the occurrence probability of word $x_i$ and $p(x_i, x_j)$ the co-occurrence probability of $(x_i, x_j)$. These probabilities are estimated on a reference corpus. To thoroughly assess the topic coherence of long-tailed topic modeling, we use three kinds of $C_V$ scores with the probabilities estimated on different reference corpora. First, we adopt the public tool that uses Wikipedia documents as the external reference corpus (-E), and name the resulting score $C_V$-E. Then, we directly use the internal training documents (-I) as the reference corpus, giving $C_V$-I. However, since documents about head topics make up the main portion of a long-tailed corpus, $C_V$-E and $C_V$-I are probably insufficient to appraise the performance on documents about tail topics. To this end, we heuristically introduce $C_V$-T, which employs the documents carrying tail labels (-T) provided by the datasets, instead of all the training documents, as the reference corpus, so it can assess whether the discovered topics reveal the hidden semantics of documents about tail topics, i.e., discover the tail topics.
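The NPMI of a word pair can be estimated from a reference corpus as in the sketch below. It is deliberately simplified: probabilities are plain document frequencies, and the helper averages NPMI over word pairs, which is not the full $C_V$ measure (that metric additionally uses sliding windows and an indirect cosine confirmation). The function names and the toy reference corpus are hypothetical; swapping the reference corpus is what would distinguish external, internal and tail-only variants.

```python
import math
from itertools import combinations

def npmi(word_i, word_j, documents):
    """NPMI of a word pair with probabilities estimated as document frequencies."""
    n = len(documents)
    doc_sets = [set(doc) for doc in documents]
    p_i = sum(word_i in d for d in doc_sets) / n
    p_j = sum(word_j in d for d in doc_sets) / n
    p_ij = sum(word_i in d and word_j in d for d in doc_sets) / n
    if p_ij == 0.0:
        return -1.0  # conventional limit when the words never co-occur
    if p_ij == 1.0:
        return 1.0  # both words occur in every document
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)

def mean_pairwise_npmi(top_words, documents):
    """Average NPMI over all unordered pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, documents) for a, b in pairs) / len(pairs)

reference = [
    ["printer", "ink", "cartridge"],
    ["printer", "ink", "paper"],
    ["hockey", "nhl", "player"],
    ["hockey", "player", "season"],
]
print(npmi("printer", "ink", reference))
print(npmi("printer", "hockey", reference))
print(mean_pairwise_npmi(["printer", "ink", "hockey"], reference))
```

Words that always appear together score at the top of the range, and words that never co-occur in the reference corpus score at the bottom, which is why a tail-only reference corpus rewards topics made of tail vocabulary.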
Topic Diversity For topic diversity evaluation, we employ the Topic Unique (TU) (Nan et al., 2019), defined for a topic k as $TU(k) = \frac{1}{T} \sum_{i=1}^{T} \frac{1}{\mathrm{cnt}(x_i)}$, where $\mathrm{cnt}(x_i)$ is the total number of times that word $x_i$ appears in the top T words of all topics. The overall TU is the average of TU(k) over all K topics. Accordingly, TU ranges from 1/K to 1, and a higher TU score means topics are more diverse since fewer words are repeated across topics.

Table 2: Topic quality results concerning topic coherence and diversity on 20News-LT, Yahoo Answer-LT, Wiki-500K, AmazonCat-13K and Amazon-3M. The best in each column is in bold.
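The TU score can be computed directly from the top words of each topic. The sketch below assumes that each topic is given as a list of distinct top words and averages the per-topic score over all topics; the function name and the toy topics are illustrative.

```python
from collections import Counter

def topic_unique(topics):
    """Average TU over topics, where each topic is a list of its top-T words."""
    # cnt(x): how many times a word appears among the top words of all topics.
    counts = Counter(word for topic in topics for word in topic)
    per_topic = [sum(1.0 / counts[w] for w in topic) / len(topic) for topic in topics]
    return sum(per_topic) / len(per_topic)

diverse = [["ink", "printer"], ["hockey", "nhl"]]
redundant = [["census", "median"], ["census", "median"]]
print(topic_unique(diverse))    # 1.0: no word is shared between topics
print(topic_unique(redundant))  # 0.5: every word appears in both of the K=2 topics
```

The two toy cases hit the extremes of the range: fully distinct topics give 1, and K identical topics give 1/K.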
Comprehensive Evaluation It is worth mentioning that if the topic coherence of a model remains about the same while its diversity increases, the overall topic quality is also improved since the model uncovers more varied semantics of documents. To provide a direct and comprehensive evaluation of both coherence and diversity, following Dieng et al. (2019), we propose Topic Quality (TQ), which combines $C_V$ and TU as $TQ = TU \times \frac{1}{3}\,(C_V\text{-E} + C_V\text{-I} + C_V\text{-T})$, i.e., the product of TU and the average of the three different $C_V$ scores. Thus, TQ provides a direct comparison of overall topic quality.

Table 2 reports the topic quality results under the different metrics, computed on the top 15 words of each topic with topic number K = 50. First, we notice that the $C_V$-E scores of DecTM are the highest on 20News-LT and very close to the best on Yahoo Answer-LT and Wiki-500K. Although the $C_V$-E scores of some baseline models are higher on other datasets, DecTM consistently outperforms them in terms of TU by a large margin, and the $C_V$-I scores of DecTM are also mostly better. This implies that the baseline models are prone to generating repetitive topics because of the bias in these long-tailed corpora, while the topics of DecTM are more diverse. Therefore, although the $C_V$-E scores of some baselines are higher, their lower diversity indicates that the topics they yield are redundant. More significantly, DecTM generally surpasses the baseline models in terms of $C_V$-T, which shows that its discovered topics better reflect the semantics of documents about tail topics; thus, the topics produced by DecTM are more complete. These arguments are further illustrated with topic examples in Section 5.3. Finally, DecTM achieves the highest TQ scores on all datasets, showing that its overall performance is better. These results confirm the problems of long-tailed topic modeling mentioned in Section 1 and Section 3.1: the performance of existing topic models, especially topic diversity, deteriorates on account of the bias. With the help of the deconfounded decoder, however, our proposed method alleviates the effect of the bias and substantially improves topic diversity while maintaining good coherence, with a better ability to expose the semantics of documents about tail topics.

Impact of Topic Number
To investigate how performance varies with the topic number, we report the TU and the average $C_V$ (Avg $C_V$) scores defined in Equation (12) for topic numbers K ranging from 50 to 100 on Wiki-500K in Figure 4. Figure 4a shows that the Avg $C_V$ of DecTM is relatively better. Besides, as shown in Figure 4b, the TU scores of all models gradually decline as the topic number grows, but the TU of DecTM is consistently higher and decreases more slowly. The reason is that the baseline models tend to focus on documents about head topics, which are inadequate to support larger topic numbers, while DecTM can also discover the semantics of documents about tail topics. Furthermore, Figure 5 presents the TQ with different topic numbers on all datasets. On both the manually constructed and the XMTC datasets, DecTM outperforms the baseline models under different topic numbers. These experiments demonstrate that the performance of our model is relatively stable.

Discovered Topic Examples
To further illustrate the topic quality of different models, Table 3 reports some discovered topic examples. Consistent with the comparison of topic diversity in Section 5.1.2, the baseline models produce some topics that include repeated words. To be more specific, LDA generates topics with repetitive words like "subject" and "organization" from 20News-LT and "book" and "author" from Amazon-3M. Similar topics about "newcastle", "orchestra" and "hockey" are yielded by NVDM, ProdLDA, and Scholar respectively. We also notice that NVDM, ProdLDA and Scholar all yield several nearly identical topics about "census" from Wiki-500K. These topics are indeed coherent, but they artificially inflate the $C_V$ scores and are redundant in downstream applications. In contrast, DecTM generates only one coherent topic corresponding to each of the aforementioned ones.
Models: Topic examples
LDA: hp nasa organization subject article new re subject organization access good pc support subject problem organization file world get scsi organization subject mark university ide thanks organization subject pt scott imagine life book god church author christian spiritual book students guide text chapter reading book story love read author stories characters books author lives writing new years book guide book design new using use techniques
NVDM: galaxy texas newcastle sky austin theta madrid edinburgh harbour newcastle fortress tunnel edwards leeds birmingham newcastle townships cdp islander nonfamilies couples females males husband nonfamilies
ProdLDA: orchestra hits songs symphony unreleased song concert orchestra opera biography symphony translation symphony subtitles orchestra mozart median capita nonfamilies residing household nonfamilies households householder residing residing township householder nonfamilies quot bmw yamaha honda macbook laptop
Scholar: guitarist pianist composer hockey player montreal nhl provincial provinces hockey championships finals medal olympics hockey householder nonfamilies households residing nonfamilies residing households householder township norway nonfamilies residing demographics median census hamlet town
DecTM: median residing nonfamilies household tires tire steering truck motorcycle honda wales welsh yorkshire scotland glasgow violin orchestra symphony concerto piano nonfiction copies manga novels bestseller championships mens olympic competed ink inkjet paper printer printers cartridges episodes episode season vol inspector series europe russian paris germany german spain

Moreover, DecTM also discovers latent topics such as "printer", "series" and "europe" that cannot be found by the baseline models, which supports the superior $C_V$-T performance of our model. These topic examples qualitatively show that the overall topic quality of DecTM is better.

Conclusion
In this paper, to discover topics in long-tailed corpora, we present a causal inference model that describes how the bias influences topic modeling. To reduce the impact of the bias, we then propose a causal intervention method for deconfounding, based on which we introduce the Deconfounded Topic Model (DecTM) with a deconfounded decoder. Comprehensive experiments demonstrate that our model can produce topics of better quality and mitigate the effect of the long-tailed bias.