Improving Privacy Guarantee and Efficiency of Latent Dirichlet Allocation Model Training Under Differential Privacy

Latent Dirichlet allocation (LDA), a widely used topic model, is often employed as a fundamental tool for text analysis in various applications. However, training an LDA model typically requires a massive text corpus. On one hand, such massive data may expose private information in the training data, thereby incurring significant privacy concerns. On the other hand, the efficiency of LDA model training suffers, since training must repeatedly process this massive corpus. To address the privacy issues in LDA model training, some recent works combine LDA training algorithms based on collapsed Gibbs sampling (CGS) with differential privacy. Nevertheless, these works usually accumulate a high privacy budget due to the vast number of iterations in CGS, and they remain inefficient because they handle the full text corpus in every iteration. To improve the privacy guarantee and efficiency, we combine a subsampling method with CGS and propose a novel differentially private LDA training algorithm, SUB-LDA. We find that subsampling in CGS naturally improves efficiency while amplifying privacy. We propose a novel metric, the efficiency-privacy function, to evaluate improvements in the privacy guarantee and efficiency. Based on a conventional subsampling method, we further propose an adaptive subsampling method to improve the utility of the model produced by SUB-LDA when the subsampling ratio is small. We provide a comprehensive analysis of SUB-LDA, and the experimental results validate its efficiency and privacy-guarantee improvements.


Introduction
Latent Dirichlet allocation (LDA) (Blei et al., 2003) is a widely used topic model for discovering the latent semantics of text data. High-dimensional text data can be mapped to a low-dimensional latent topic space via LDA. Thus, LDA simplifies subsequent text analysis tasks, such as similarity judgment.
Platforms based on LDA for analyzing various text data have been established by many enterprises, such as Tencent (Wang et al., 2014) (Yut et al., 2017) and Microsoft (Yuan et al., 2015).
Differential privacy (DP) is a de-facto standard privacy-protection definition with rigorous mathematical foundations and is widely used for quantifying the privacy risks of randomized algorithms. To address privacy issues arising when the training process of LDA accesses datasets containing sensitive information, some works (Park et al., 2016) (Zhu et al., 2016) (Wang et al., 2020) (Zhao et al., 2019) (Zhao et al., 2020) combine DP with LDA. In this study, we focus on LDA training algorithms based on collapsed Gibbs sampling (CGS).
HDP-LDA, proposed by Zhao et al. (Zhao et al., 2020), has been demonstrated to be effective and outperforms other relevant works (Park et al., 2016) (Zhu et al., 2016) (Zhao et al., 2019) in protecting sensitive word-count information in CGS training. HDP-LDA injects noise into the word counts in each training iteration. However, this method suffers from poor efficiency when dealing with massive text corpus data. Moreover, even when HDP-LDA chooses a small privacy budget in each iteration, the accumulative privacy budget over the whole training may be very large due to the large number of iterations.
Subsampling is a widely used method to achieve privacy amplification in differentially private algorithms (Dwork et al., 2014) (Balle et al., 2020) (Zhu and Wang, 2019) (Mironov et al., 2019). A subsampled randomized algorithm takes a subsample of the original dataset generated by some subsampling procedure, and then applies a known randomized mechanism to the subsampled data. When introducing a subsampling operation in CGS, we discover that subsampling naturally improves the efficiency of CGS while amplifying privacy.
A natural question is thus whether we can amplify privacy and improve the efficiency of CGS simultaneously; answering it requires a metric to evaluate the efficiency-privacy improvement.
In this study, we propose a subsampling solution to improve the privacy guarantee and efficiency of HDP-LDA. We call our novel LDA training algorithm with differential privacy SUB-LDA. Then, we propose a novel metric, the efficiency-privacy function, to evaluate the privacy guarantee and efficiency improvements of SUB-LDA. When the subsampling ratio is small, the model always suffers from heavy utility loss. We propose an adaptive subsampling (AS) method to mitigate the dilemma. Our contributions are summarized as follows.
• We combine subsampling with HDP-LDA, a general differentially private CGS algorithm, and propose our SUB-LDA algorithm, which provides a better privacy guarantee and efficiency than existing methods.
• We propose a novel metric, called the efficiency-privacy function, and provide a comprehensive analysis of SUB-LDA and how the metric behaves when we change the subsampling ratio. We find that we can improve efficiency and privacy guarantee simultaneously only under a certain range of the subsampling ratio.
• We propose an AS method that can be used to improve the model's utility. We conduct extensive experiments on several real-world datasets to validate the effectiveness of SUB-LDA. The experiments show that SUB-LDA achieves better efficiency and amplifies privacy.

Related works
We divide related works into the following three categories.
(a) LDA training with differential privacy. As a widely used machine learning model, LDA with DP has attracted the interest of researchers. Zhu et al. (Zhu et al., 2016) propose a privacy-preserving tag-release algorithm. To protect intermediate private weight information, they add Laplace noise to the weights in the last iteration of CGS. Zhao et al. (Zhao et al., 2019) propose a locally private LDA training algorithm on crowdsourced data to provide local DP for individual data contributors. (Zhao et al., 2020) propose a centralized privacy-preserving algorithm that can prevent data inference from the intermediate statistics in CGS training. Variational Bayes for parameter estimation of LDA is the focus of (Park et al., 2016). In this study, we aim to provide an LDA model trained via CGS with a DP guarantee in a centralized setting.
(b) Subsampled differential privacy. Since machine learning algorithms often handle massive sensitive data and perform many iterations before finding the optimal solution, limitations arise when we want to bound the privacy budget of iterative machine learning algorithms. Privacy amplification by subsampling has therefore gradually attracted the interest of researchers. A general "RDP-amplification" bound has been derived that applies to any randomized mechanism equipped with subsampling without replacement; however, this bound is a constant factor away from being optimal. Zhu and Wang (Zhu and Wang, 2019) provide a more general and tighter RDP-amplification bound under Poisson subsampling. Mironov et al. (Mironov et al., 2019) study the special case of the sampled Gaussian mechanism, which is successfully used in several machine learning applications. They describe a numerically stable procedure for precise computation of the sampled Gaussian mechanism's Rényi differential privacy (RDP) and prove a nearly tight closed-form bound. Dwork et al. (Dwork et al., 2014) give a general bound on the privacy loss of subsampled mechanisms in terms of (ε, δ)-DP. Balle et al. (Balle et al., 2020) improve this bound and propose a general framework to derive tight bounds on the privacy loss of subsampled mechanisms in terms of (ε, δ)-DP.
(c) Efficient collapsed Gibbs sampling. CGS is a widely used method to train the LDA model. However, the complexity of traditional CGS is O(NZ), where N and Z are the total numbers of words and latent topics in the text corpus, respectively. To improve the efficiency of traditional CGS, several efficient CGS algorithms (Porteous et al., 2008) (Yao et al., 2009) (Li et al., 2014) (Yuan et al., 2015) (Hu et al., 2017) have been proposed. FastLDA (Porteous et al., 2008) reduces the operations per sample to improve the efficiency of CGS. Yao et al. (Yao et al., 2009) obtain better efficiency by reducing the complexity from O(NZ) to O(N(Z_w + Z_d)), where Z_w and Z_d are the numbers of distinct topics assigned to a word w and a document d, respectively. Usually, Z_w + Z_d is much smaller than Z. Li et al. (Li et al., 2014) utilize the sparsity in the topic model and reduce the complexity from O(NZ) to O(NZ_d) by combining Metropolis-Hastings sampling with the alias-table method (Walker, 1977). Yuan et al. (Yuan et al., 2015) propose a compute- and memory-efficient distributed LDA implementation, called LightLDA, whose complexity is O(N). Hu et al. (Hu et al., 2017) observe that topic distributions of words are skewed and that only a subset of documents can approximately represent the semantics of the whole corpus. They reduce N via approximate semantics and reduce Z via the skewed topic distribution.

Latent Dirichlet Allocation and Collapsed Gibbs Sampling

The LDA model is widely used to discover the latent structures of text corpus datasets. The latent structures are depicted as probability distributions with priors, and their posterior distributions are obtained after training via Bayes' rule. The text corpus is considered a mixture of K different latent topics, and each document m in the corpus is represented by a K-dimensional document-topic distribution θ_m. Moreover, each latent topic k is represented by a V-dimensional topic-word distribution φ_k, where V is the total number of unique words in the corpus. The CGS training process aims to discover the topic-word distributions φ_k. For each word w_i, CGS samples a new topic z_i from the following full conditional distribution:

p(z_i = k | z_¬i, w) ∝ (n^t_{k,¬i} + β) / (Σ_{t'} n^{t'}_{k,¬i} + Vβ) · (n^k_{m,¬i} + τ),    (1)

where ¬i denotes all words in the text corpus except word w_i, n^k_m is the count of topic k appearing in document m, and n^t_k is the count of topic k assigned to word t. τ is the document-topic prior hyper-parameter and β is the topic-word prior hyper-parameter. CGS runs over three periods: initialization, burn-in, and estimation. During initialization, each word w in the corpus is randomly assigned to a topic k ∈ {1, ..., K}. Then, the document-topic counts n^k_m and topic-word counts n^t_k are obtained. In the subsequent burn-in process, the topic assignment of each word w is updated via sampling from a multinomial distribution P = [p_1, ..., p_k, ..., p_K], where p_k is calculated according to equation (1). After a series of iterations, the burn-in process ends and we can estimate φ^t_k by

φ^t_k = (n^t_k + β) / (Σ_{t'} n^{t'}_k + Vβ).    (2)

More details about LDA and CGS can be found in (Porteous et al., 2008) (Xiao and Stibor, 2010) (MacKay and Mac Kay, 2003) (Carlo, 2004) (Liu, 2008). Since counting n^t_k needs to touch the original dataset, n^t_k is considered sensitive information and thus needs to be protected. HDP-LDA (Zhao et al., 2020) suggests adding noise, for example, Laplace noise, to each n^t_k independently in each iteration of CGS.
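As an illustration, one CGS update implementing the full conditional of equation (1) might be sketched as follows; the variable names are our own, and the counts are held in NumPy arrays:

```python
import numpy as np

def cgs_step(n_mk, n_kt, n_k, m, t, k_old, tau, beta, V, rng):
    """One collapsed Gibbs sampling step for word w_i = t in document m.

    n_mk[m, k]: count of topic k in document m; n_kt[k, t]: count of word t
    in topic k; n_k[k]: total words assigned to topic k.
    """
    # Remove the current assignment of w_i to obtain the "not-i" counts.
    n_mk[m, k_old] -= 1
    n_kt[k_old, t] -= 1
    n_k[k_old] -= 1
    # Full conditional: p_k ∝ (n^t_k + β)/(Σ_t' n^t'_k + Vβ) · (n^k_m + τ).
    p = (n_kt[:, t] + beta) / (n_k + V * beta) * (n_mk[m] + tau)
    p /= p.sum()
    k_new = int(rng.choice(len(p), p=p))
    # Record the newly sampled assignment.
    n_mk[m, k_new] += 1
    n_kt[k_new, t] += 1
    n_k[k_new] += 1
    return k_new
```

Repeating this step over all words constitutes one burn-in iteration; φ^t_k is then estimated from the final n^t_k counts as in equation (2).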
Thus, even if the adversary monitors the whole training process of CGS, the n^t_k in each iteration remain protected.

Poisson Subsampled Rényi Differential Privacy
In this subsection, we introduce background on DP, RDP, Poisson subsampling, and its privacy-amplification effects. DP has been embraced by multiple research communities as a standard principle of privacy for algorithms. DP bounds the shift in the output distribution of a randomized algorithm when a small change is induced in its input.
Definition 3.1 ((ε, δ)-DP) (Dwork et al., 2014). A randomized mechanism f : G → R offers (ε, δ)-DP if for any adjacent G, G′ ∈ G and any set of outcomes R ⊆ R,

Pr[f(G) ∈ R] ≤ e^ε · Pr[f(G′) ∈ R] + δ.

This definition restrains an adversary's ability to infer whether the input dataset is G or G′. RDP is a refinement of DP that uses the Rényi divergence as a distance metric instead of the sup-divergence in DP.
Definition 3.2 ((α, ε)-RDP) (Mironov, 2017). A randomized mechanism f : G → R is said to have ε-RDP of order α, abbreviated as (α, ε)-RDP, if for any adjacent G, G′ ∈ G it holds that

D_α(f(G) ‖ f(G′)) ≤ ε,

where D_α denotes the Rényi divergence of order α. Recent works have often adopted privacy amplification by subsampling in differentially private machine learning. Applying a randomized mechanism to a subsampled dataset always produces a lower privacy loss; that is, privacy is amplified. RDP is a useful technique for analyzing how much the privacy loss is improved by the subsampling operation (Zhu and Wang, 2019) (Mironov et al., 2019). Before introducing the subsampled RDP privacy amplification theorem, we first introduce Poisson subsampling: given a dataset and a subsampling probability γ, Poisson subsampling includes each record in the subsample independently with probability γ.
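Poisson subsampling is straightforward to implement; a minimal sketch (function name is ours):

```python
import numpy as np

def poisson_subsample(records, gamma, rng):
    """Poisson subsampling: keep each record independently with probability gamma.

    Unlike fixed-size sampling, the subsample size is random with mean
    gamma * len(records).
    """
    mask = rng.random(len(records)) < gamma
    return [r for r, keep in zip(records, mask) if keep]
```

Note that the output size is a Binomial(n, γ) random variable, which is exactly the property the amplification analysis of (Zhu and Wang, 2019) relies on.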
Zhu and Wang (Zhu and Wang, 2019) give a tight bound to the privacy loss of the Poisson subsampling mechanism when noise is drawn independently from Gaussian or Laplace distribution.
Theorem 3.1 (Privacy amplification theorem for subsampled RDP (Zhu and Wang, 2019)). Let M be a randomized algorithm that obeys (α, ε(α))-RDP whose randomness comes from Gaussian or Laplace noise, and let γ be the Poisson subsampling probability. Then, for any integer α ≥ 2, M ∘ PoissonSubsample obeys (α, ε_subsample(α))-RDP with

ε_subsample(α) ≤ 1/(α−1) · log( (1−γ)^{α−1} (αγ − γ + 1) + C(α,2) γ² (1−γ)^{α−2} e^{ε(2)} + 3 Σ_{j=3}^{α} C(α,j) (1−γ)^{α−j} γ^j e^{(j−1)ε(j)} ),    (3)

where C(α,j) denotes the binomial coefficient.
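As a concrete illustration, assume the Gaussian mechanism with sensitivity 1, so that ε(α) = α/(2σ²). The amplified per-iteration budget can then be computed numerically from our reading of the general bound of (Zhu and Wang, 2019); integer α ≥ 2 is assumed:

```python
import math

def rdp_gaussian(alpha, sigma):
    """RDP of the Gaussian mechanism with sensitivity 1: eps(alpha) = alpha / (2 sigma^2)."""
    return alpha / (2 * sigma ** 2)

def subsampled_rdp(alpha, gamma, sigma):
    """Upper bound on eps_subsample(alpha) under Poisson subsampling with
    probability gamma, for integer alpha >= 2 (a sketch of the general
    amplification bound)."""
    terms = [
        (1 - gamma) ** (alpha - 1) * (alpha * gamma - gamma + 1),
        math.comb(alpha, 2) * gamma ** 2 * (1 - gamma) ** (alpha - 2)
        * math.exp(rdp_gaussian(2, sigma)),
    ]
    for j in range(3, alpha + 1):
        terms.append(3 * math.comb(alpha, j) * (1 - gamma) ** (alpha - j)
                     * gamma ** j * math.exp((j - 1) * rdp_gaussian(j, sigma)))
    return math.log(sum(terms)) / (alpha - 1)
```

With α = 14 and σ chosen so that ε(14) = 2 (as in our experiments), the bound shrinks monotonically as γ decreases, illustrating the amplification effect.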

Framework of SUB-LDA
In this section, we introduce our algorithm SUB-LDA, which achieves a better privacy guarantee and efficiency than HDP-LDA. SUB-LDA is presented in Algorithm 1. Given a document corpus G, SUB-LDA first preprocesses the corpus and randomly allocates a topic to each word in the corpus. Then, SUB-LDA conducts CGS on a subset of the words in each document produced by the Poisson subsampling process. When the convergence condition is satisfied or the number of accumulative iterations reaches the maximum value ITER, the burn-in period of CGS is stopped. Since SUB-LDA touches only a subset of the sensitive words in each document during each iteration, privacy is amplified by Theorem 3.1. We present additional discussions on privacy and efficiency below.

Privacy Amplification and Efficiency Improvement
Algorithm 1 SUB-LDA
Input: Document corpus G; prior parameters τ, β; subsampling ratio γ; topic number K; clipping bound clip
Output: Trained document-topic distribution Θ, topic-word distribution Φ, accumulative privacy loss ε = Iter · ε_subsample(α)
// Initialization
for d_m ∈ G do
    for w = t ∈ d_m do
        Sample a topic: k ∼ Mult(1/K · I_K)
        Initialize the word counts n^t_k and n^k_m
    end for
end for
// Collapsed Gibbs sampling
Set Iter = 0
while not convergent and Iter ≤ ITER do
    for d_m ∈ G do
        Take a batch of words W_t from d_m according to the subsampling ratio γ
        for w = t ∈ W_t do
            Add noise to each n^t_k independently: ñ^t_k ← n^t_k + η
            Clip: ñ^t_k ← min(ñ^t_k, clip)
            Compute the sampling distribution p with
                p_k ∝ (ñ^t_k + β) / (Σ_{t'} ñ^{t'}_k + Vβ) · (n^k_m + τ) / (Σ_{k=1}^K (n^k_m + τ))
            Sample a topic via p and update n^t_k and n^k_m
        end for
    end for
    Iter ← Iter + 1
end while
Output the trained document-topic distribution Θ, topic-word distribution Φ, and accumulative privacy loss ε = Iter · ε_subsample(α)

Given the per-iteration RDP privacy budget ε(α), the noise η is drawn independently from the Gaussian distribution N(0, σ²), where σ² = α / (2ε(α)). Since SUB-LDA conducts Poisson subsampling before counting and noise injection, the per-iteration privacy budget ε_subsample(α) is obtained by equation (3), and ε_subsample(α) ≤ ε(α). Intuitively, the smaller γ is, the stronger the privacy amplification.
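The noisy inner step of Algorithm 1 can be sketched as follows. The names are ours, only the topic-word counts are perturbed, and using the un-noised totals in the denominator is a simplification of this sketch:

```python
import numpy as np

def sub_lda_word_step(n_mk, n_kt, m, t, tau, beta, V, sigma, clip, rng):
    """One noisy sampling step of SUB-LDA for word t in document m (a sketch)."""
    K = n_kt.shape[0]
    # Perturb each topic-word count n^t_k independently with Gaussian noise,
    # where sigma^2 = alpha / (2 * eps(alpha)).
    noisy = n_kt[:, t] + rng.normal(0.0, sigma, size=K)
    # Clip to the bound `clip`; the lower clip at 0 keeps the distribution
    # valid (our addition to the sketch).
    noisy = np.clip(noisy, 0.0, clip)
    doc = n_mk[m] + tau
    # Sampling distribution of Algorithm 1, built from the noisy counts.
    p = (noisy + beta) / (n_kt.sum(axis=1) + V * beta) * doc / doc.sum()
    p /= p.sum()
    return int(rng.choice(K, p=p))
```

In a full implementation, this step is applied only to the words selected by Poisson subsampling, and the counts are updated with the sampled topic afterwards.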
To discuss the efficiency improvement, we note that the running time of each iteration is proportional to the sum of sampling times over all documents: with subsampling ratio γ, each iteration performs roughly γ · N_d sampling steps for a document d, so the running time of one iteration is approximately t · γ · Σ_{d=1}^D N_d. The notation used in this paper is summarized in Table 1.

Table 1: Symbols and their meanings.
τ, β: hyper-parameters of the Dirichlet distributions
G: text corpus
n^k_m, n^t_k: count of topic k in document m, and count of word t in topic k
K: topic amount
V: amount of unique words in the corpus
γ: subsampling ratio
M: a randomized mechanism
M ∘ PoissonSubsample: a randomized mechanism equipped with Poisson subsampling
ε(α): RDP privacy loss of order α of a randomized mechanism in one iteration
ε_subsample(α): RDP privacy loss of order α of a randomized mechanism equipped with Poisson subsampling in one iteration
N_d: the length of document d
t: running time of one CGS step for a single word
I: efficiency-privacy function

Obviously, a smaller γ induces a shorter total time; thus, CGS is more efficient. To evaluate privacy amplification and efficiency improvement synthetically, we propose the efficiency-privacy function I, defined as

I(γ) = t · (Σ_{d=1}^D N_d) · γ · ε_subsample(α).

For a given text corpus dataset, a smaller value of I indicates better efficiency and privacy amplification. In the following analysis, we omit the constant t · Σ_{d=1}^D N_d, and I is simplified to the following kernel:

I(γ) = γ · ε_subsample(α).

We find an important property of I, which is expressed in Lemma 4.1.
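Assuming the simplified kernel I(γ) = γ · ε_subsample(α) (our reading of the definition above), evaluating I over a grid of candidate ratios is straightforward; the accountant passed in can be any per-iteration RDP bound, and the toy accountant in the usage below is purely illustrative:

```python
def efficiency_privacy_kernel(gamma, eps_subsample):
    """Kernel of the efficiency-privacy function: I(gamma) = gamma * eps_subsample(gamma),
    after dropping the constant factor t * sum_d N_d.

    eps_subsample: callable mapping a subsampling ratio to the per-iteration
    RDP budget of the subsampled mechanism.
    """
    return gamma * eps_subsample(gamma)

def best_ratio(gammas, eps_subsample):
    """Subsampling ratio minimizing I over a candidate grid."""
    return min(gammas, key=lambda g: efficiency_privacy_kernel(g, eps_subsample))
```

For example, with a toy accountant eps_subsample(γ) = min(2.0, 20γ²), the kernel at γ = 0.1 is far smaller than at γ = 0.5, matching the intuition that small ratios help both efficiency and per-iteration privacy.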
Lemma 4.1 indicates that we can improve efficiency and amplify privacy simultaneously by decreasing γ within a certain range of the subsampling ratio. The proof of Lemma 4.1 is presented in the appendix.
We plot the properties of the efficiency-privacy function I in Figure 1, where an extremum of I can be observed.

Does subsampling actually amplify privacy?
Does Poisson subsampling actually amplify the privacy of HDP-LDA? The answer could be yes or no. If we concentrate on a single iteration with the total iteration number ITER fixed, the per-iteration privacy budget indeed shrinks, and we can conclude that Poisson subsampling amplifies privacy. If we instead consider the whole training process of SUB-LDA, we cannot reach the same conclusion. Poisson subsampling amplifies the privacy of each iteration of HDP-LDA; nevertheless, the efficiency improvement comes at the cost of more iterations to reach convergence (as the latent topics are updated for only a subset of the words in a document). Thus, the accumulative privacy loss ε = Iter · ε_subsample(α) of SUB-LDA may increase. We show the results in our experiments.

Experiment results
This section reports on our evaluation of SUB-LDA. We implement our method on three real-world datasets: the 20 Newsgroups dataset (Lang, 1995), NIPS, and ENRON. The statistics of the datasets are shown in Table 2.

Efficiency improvement of SUB-LDA
In our experiments, we set the order α of RDP to 14 and the original privacy budget ε(α) = 2. We vary the subsampling ratio γ from 0.1 to 0.9 with a step of 0.2. Obviously, when γ = 1, SUB-LDA is simply HDP-LDA. The topic amount varies from 20 to 100 with a step of 20. We omit the convergence condition and use ITER = 100 to stop the iterations. We record the average running time t_sub of each SUB-LDA iteration and the average running time t_hdp of each HDP-LDA iteration. The speed-up ratio is calculated as follows:

Speed-up ratio = t_hdp / t_sub.

The results are shown in Figure 2. We conclude that SUB-LDA has better efficiency when we choose a smaller subsampling ratio. The topic amount K indicates the complexity of the LDA model. A larger K often results in a better efficiency improvement, which indicates that SUB-LDA could be suitable for complex LDA models. (The NIPS and ENRON datasets are available from the UCI bag-of-words repository: https://archive.ics.uci.edu/ml/datasets/bag+of+words.)

Effectiveness difference between SUB-LDA and HDP-LDA
We choose perplexity to evaluate the model's utility and focus on the impact of the Poisson subsampling in SUB-LDA on perplexity. After ITER = 100 iterations, we record the perplexity per_sub of SUB-LDA and the perplexity per_hdp of HDP-LDA. We utilize the perplexity ratio

Perplexity ratio = |per_sub − per_hdp| / per_hdp    (8)

to show the difference in effectiveness between SUB-LDA and HDP-LDA. Lower perplexity always indicates a better generalization ability of the LDA model. Subsampling has a non-negligible impact on the model's perplexity. Figure 3 shows that the perplexity difference becomes smaller when the value of the subsampling ratio is larger.
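Perplexity and the ratio of equation (8) can be computed as in the following sketch; this is the standard held-out perplexity estimate, and the function names are ours:

```python
import numpy as np

def perplexity(docs, theta, phi):
    """Perplexity of an LDA model: exp(-sum_w log p(w) / N), where
    p(w_{m,n} = t) = sum_k theta[m, k] * phi[k, t]."""
    log_lik, n_words = 0.0, 0
    for m, doc in enumerate(docs):
        for t in doc:
            log_lik += np.log(theta[m] @ phi[:, t])
            n_words += 1
    return np.exp(-log_lik / n_words)

def perplexity_ratio(per_sub, per_hdp):
    # Equation (8): relative perplexity difference between SUB-LDA and HDP-LDA.
    return abs(per_sub - per_hdp) / per_hdp
```

As a sanity check, a model that assigns every word uniform probability 1/V has perplexity exactly V.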

Convergence difference between SUB-LDA and HDP-LDA
In this subsection, we fix the topic amount at 100 and track the change of perplexity during the iterations. The results are shown in Figure 4. A smaller subsampling ratio results in higher perplexity. Moreover, the convergence of SUB-LDA is influenced by subsampling: SUB-LDA with a larger subsampling ratio often has a faster convergence rate.

Privacy guarantee difference between SUB-LDA and HDP-LDA under convergence
In this subsection, the topic amount is fixed at 100. We utilize the convergence condition to stop the burn-in process. For the i-th iteration, per_i denotes the perplexity of that iteration, and the perplexity difference of the i-th iteration is defined as D_i = |per_i − per_{i−1}|. Given a threshold D̄, we consider that SUB-LDA reaches convergence if the following condition is satisfied for some iteration T and window size s:

D_i ≤ D̄ for all T ≤ i ≤ T + s.

We use Theorem 6 in (Zhu and Wang, 2019) to approximate ε_subsample(α) in each iteration. We then calculate the accumulative privacy budget of SUB-LDA; the results for NIPS are shown in Table 3. Unsurprisingly, ε_subsample(α) shrinks when the subsampling ratio decreases, but we obtain a larger value of Iter. From Table 3, we observe that the accumulative privacy losses of SUB-LDA with subsampling ratios of 0.9 and 0.7 are greater than those with a subsampling ratio of 1. Table 3 provides insights into the joint impact of the iteration count and the subsampling ratio on the privacy guarantee. Thus, in practice, we should find a suitable subsampling ratio, not simply the smallest one, to provide a rigorous privacy guarantee.
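The stopping rule and the accumulative accounting can be sketched as follows; the threshold and window names are ours, and the stopping rule reflects our reading of the condition above:

```python
def converged(perplexities, threshold, s):
    """Convergence check: the last s successive differences
    D_i = |per_i - per_{i-1}| all fall below the threshold."""
    if len(perplexities) < s + 1:
        return False
    n = len(perplexities)
    diffs = [abs(perplexities[i] - perplexities[i - 1]) for i in range(n - s, n)]
    return all(d <= threshold for d in diffs)

def accumulated_privacy(iters, eps_subsample_alpha):
    """Accumulative RDP budget: eps = Iter * eps_subsample(alpha);
    RDP of a fixed order composes additively over iterations."""
    return iters * eps_subsample_alpha
```

This makes the trade-off in Table 3 explicit: a smaller γ lowers eps_subsample_alpha but may raise iters, so the product can move in either direction.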

Relationship between efficiency-privacy function and perplexity
Obviously, it is difficult to analyze the properties of perplexity analytically. Nevertheless, we discover that the efficiency-privacy function I tends to behave similarly to perplexity. We show the values of perplexity for each dataset after SUB-LDA terminates in Figure 5. In Figure 1, we observe that the gradient of I first increases and then decreases. The gradient of the perplexity curves in Figure 5 also tends to first increase and then decrease (these curves are first concave and then convex). This is reasonable in practice, since the improvements in efficiency and privacy usually come at the cost of a variation in the model's utility. This indicates that an analysis of the efficiency-privacy function could substitute for an analysis of perplexity.

An effective method to improve the model's utility in practice
In our subsampling experiments, we discover that the model's perplexity produced by SUB-LDA usually changes little when the subsampling ratio is small (e.g., γ = 0.1). To improve utility in this case, we propose an AS method. The AS method is based on the fact that, given a certain topic k, a small subset of frequent words has much higher probabilities than the other words. Thus, we can increase the subsampling ratio of the frequent words of topic k while decreasing the subsampling ratio of the infrequent words. Given a corpus dataset G, we denote {N_1, ..., N_k, ..., N_K} as the partition of G in terms of topics in the i-th iteration, where N_k = Σ_{t=1}^V n^t_k. For each topic k, the AS method first constructs a frequent-word subset F_k = {t ∈ {1, 2, ..., V} : n^t_k ≥ qN_k}, where q is fixed beforehand. For the n-th word w_{m,n} = t with topic z_{m,n} = k in document d_m, AS sets the subsampling ratio as follows:

γ^i_{w_{m,n}=t, z_{m,n}=k} = vγ if t ∈ F_k (with 1/γ > v > 1), and 0 if t ∉ F_k.    (10)

Denote by G_sub and Ḡ_sub the sizes of the word subsets produced by the conventional subsampling method (we call this a uniform subsample) and the AS method, respectively. Then, we have

Var(G_sub) / Var(Ḡ_sub) = (1 − γ) / (qv(1 − vγ)).
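Equation (10) can be sketched as follows; the function and argument names are ours:

```python
import numpy as np

def adaptive_ratio(t, k, n_kt, gamma, v, q):
    """Adaptive subsampling ratio of equation (10): frequent words of topic k
    (those with n^t_k >= q * N_k) are sampled with the boosted ratio v * gamma,
    while infrequent words are dropped. Requires 1 < v < 1 / gamma."""
    assert 1 < v < 1.0 / gamma
    N_k = n_kt[k].sum()  # total count of topic k over all words
    return v * gamma if n_kt[k, t] >= q * N_k else 0.0
```

In each iteration, the per-word ratios returned here replace the uniform γ of Poisson subsampling, concentrating the sampling budget on the words that dominate topic k.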
The AS method takes η as input. We apply the AS method to each dataset with the topic amount K = 20 and γ = 0.1. The results are shown in Figure 6. We observe that prominent utility improvements are achieved for NIPS and ENRON. For 20Newsgroups, we achieve similar utility, since the scale of 20Newsgroups is small compared to NIPS and ENRON.
Privacy guarantees for the uniform subsample and the adaptive subsample are provided in Lemma 5.1.

Conclusion and Future Works
In this study, we combine Poisson subsampling with HDP-LDA to improve efficiency and amplify privacy in LDA model training. We find that subsampling naturally improves efficiency while amplifying privacy. Moreover, we propose a metric, the efficiency-privacy function I, to evaluate the joint efficiency-privacy improvement, and we discuss the properties of I. We then conduct comprehensive experiments to evaluate the efficiency improvements and privacy amplification effects. In future work, we plan to combine SUB-LDA with distributed CGS algorithms that satisfy local DP to further boost the efficiency and privacy guarantee.