Online topic model for Twitter considering dynamics of user interests and topic trends

Latent Dirichlet allocation (LDA) is a topic model that has been applied to various fields, including user profiling and event summarization on Twitter. When LDA is applied to tweet collections, it generally treats all aggregated tweets of a user as a single document. Twitter-LDA, which assumes a single tweet consists of a single topic, has been proposed and has shown that it is superior in topic semantic coherence. However, Twitter-LDA is not capable of online inference. In this study, we extend Twitter-LDA in the following two ways. First, we model the generation process of tweets more accurately by estimating the ratio between topic words and general words for each user. Second, we enable it to estimate the dynamics of user interests and topic trends online based on the topic tracking model (TTM), which models consumer purchase behaviors.


Introduction
Microblogs such as Twitter have spread rapidly in recent years. Twitter users post messages of up to 140 characters, called tweets. The character limit makes it easy for users to post not only about personal interests and daily life but also about public events such as traffic accidents or earthquakes. There have been many studies on how to extract and utilize the information in tweets (Diao et al., 2012; Pennacchiotti and Popescu, 2011; Sakaki et al., 2010; Weng et al., 2010).
Topic models such as latent Dirichlet allocation (LDA) (Blei et al., 2003) are widely used to identify latent topic structure in large collections of documents. Recently, some studies have applied LDA to Twitter for user classification (Pennacchiotti and Popescu, 2011), detection of influential users (Weng et al., 2010), and so on. LDA is a generative document model that assumes each document is represented as a probability distribution over topics and each word has a latent topic. When LDA is applied to tweets directly, each tweet is treated as a single document. This direct application does not work well because a tweet is very short compared with traditional media such as newspapers. To deal with the shortness of tweets, some studies aggregated all the tweets of a user into a single document (Hong and Davison, 2010; Pennacchiotti and Popescu, 2011; Weng et al., 2010). On the other hand, Zhao et al. (2011) proposed "Twitter-LDA," a model that takes the shortness of a tweet into account. Twitter-LDA assumes that a single tweet consists of a single topic and that tweets consist of topic words and background words. Zhao et al. (2011) showed that it yields more semantically coherent topics than LDA. However, like LDA, Twitter-LDA cannot take the order of tweets into account because it assumes that samples are exchangeable, whereas user interests and topic trends on Twitter change dynamically. In addition, Twitter-LDA does not support online inference, so when new data arrive, the model must be re-estimated from all the data. Therefore, it cannot efficiently analyze the large number of tweets generated every day. To overcome these difficulties, a model that considers the time sequence and is capable of online inference is required.
In this study, we first propose an improved model based on Twitter-LDA, which assumes that the ratio between topic and background words differs for each user. This study evaluates the proposed method based on perplexity and shows the efficacy of the new assumption in the improved model. Second, we propose a new topic model called "Twitter-TTM" by extending the improved model based on the topic tracking model (TTM) (Iwata et al., 2009), which models the purchase behavior of consumers and is capable of online inference. Finally, we demonstrate that Twitter-TTM can effectively capture the dynamics of user interests and topic trends in Twitter.

2 Improvement of Twitter-LDA

2.1 Improved Model
Figure 1(a) shows the graphical representation of Twitter-LDA, which is based on the following assumptions. There are $K$ topics in Twitter, and each topic is represented by a topic word distribution. Each user has his/her topic interests $\phi_u$, represented by a distribution over the $K$ topics. Topic $k$ is assigned to each tweet of user $u$ according to the topic interests $\phi_u$. Each word in a tweet assigned to topic $k$ is generated from either a background word distribution $\theta_B$ or a topic word distribution $\theta_k$. Whether the word is a background word or a topic word is determined by a latent value $y$. When $y = 0$, the word is generated from the background word distribution $\theta_B$; when $y = 1$, it is generated from the topic word distribution $\theta_k$. The latent value $y$ is chosen according to a distribution $\pi$. In other words, the ratio between background and topic words is determined by $\pi$.
In Twitter-LDA, $\pi$ is common to all users, meaning that the ratio between background and topic words is the same for every user. However, this assumption could be incorrect, and the ratio could differ from user to user. Thus, we develop an improved model based on Twitter-LDA that assumes $\pi$ is different for each user, as shown in Figure 1(b). In the improved model, the ratio between background and topic words for user $u$ is determined by a user-specific distribution $\pi_u$. The improved model is expected to model the generative process of tweets more accurately.
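The generative assumptions of the improved model can be illustrated with a short simulation. This is a minimal sketch rather than the authors' implementation: the function name `generate_tweet`, the toy distributions, and the choice to represent the user-specific distribution $\pi_u$ by a single number (the probability that a word is a topic word) are all assumptions made for illustration.

```python
import random


def generate_tweet(rng, phi_u, pi_u, theta_topics, theta_bg, n_words):
    """Sample one tweet for a user under the per-user switch assumption.

    phi_u        : topic interests of the user (weights over K topics)
    pi_u         : assumed scalar, probability that a word is a topic word
    theta_topics : K word distributions, one per topic
    theta_bg     : background word distribution
    """
    # One topic per tweet, drawn from the user's topic interests.
    z = rng.choices(range(len(phi_u)), weights=phi_u)[0]
    vocab = range(len(theta_bg))
    words, switches = [], []
    for _ in range(n_words):
        # y = 1 means topic word, y = 0 means background word.
        y = 1 if rng.random() < pi_u else 0
        dist = theta_topics[z] if y == 1 else theta_bg
        words.append(rng.choices(vocab, weights=dist)[0])
        switches.append(y)
    return z, words, switches


if __name__ == "__main__":
    rng = random.Random(0)
    theta_topics = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]]  # toy values
    theta_bg = [0.25, 0.25, 0.25, 0.25]
    # Two hypothetical users with the same interests but different pi_u.
    for pi_u in (0.2, 0.9):
        print(pi_u, generate_tweet(rng, [0.9, 0.1], pi_u, theta_topics, theta_bg, 6))
```

In Twitter-LDA the same switch probability would be passed for every user; the improved model differs only in letting that argument vary by user.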

Experiment for Improved Model
We performed an experiment to compare the predictive performances of LDA, Twitter-LDA, and the improved model shown in Section 2.1. In this experiment, LDA was applied by aggregating all tweets of a user into a single document. The original Twitter data set contains 14,305 users and 292,105 tweets collected on October 18, 2013. We removed stop words and words that occurred fewer than 20 times. Retweets (tweets written by another Twitter user and republished) were treated the same as other tweets because they reflect the user's interests. After this preprocessing, we obtained the final data set with 14,139 users, 252,842 tweets, and a vocabulary of 7,763 words. Each model was inferred with collapsed Gibbs sampling (Griffiths and Steyvers, 2004), and the number of iterations was set at 500. For a fair comparison, the hyperparameters in these models were optimized in each Gibbs sampling iteration by maximizing the likelihood using fixed-point iterations (Minka, 2000).
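The vocabulary filtering step described above could look roughly like the following. This is a sketch under stated assumptions: the function name `filter_vocabulary`, the tokenized input format, and the decision to drop tweets that become empty after filtering are not specified in the text and are chosen here only for illustration.

```python
from collections import Counter


def filter_vocabulary(tweets, stop_words, min_count):
    """Remove stop words and words occurring fewer than min_count times.

    tweets is a list of token lists. Tweets left empty after filtering
    are dropped (an assumption; the paper does not state this).
    """
    counts = Counter(word for tweet in tweets for word in tweet)
    keep = {
        word
        for word, count in counts.items()
        if count >= min_count and word not in stop_words
    }
    filtered = [[word for word in tweet if word in keep] for tweet in tweets]
    return [tweet for tweet in filtered if tweet], keep


if __name__ == "__main__":
    toy_tweets = [["arsenal", "wins", "the"], ["arsenal", "the"], ["cake"]]
    # The experiment above uses a threshold of 20; 2 is used for the toy data.
    kept_tweets, vocab = filter_vocabulary(toy_tweets, {"the"}, min_count=2)
    print(kept_tweets, sorted(vocab))
```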
This study employs perplexity as the evaluation index, which is the standard metric in information retrieval literature. The perplexity of a held-out test set is defined as

$$
\mathrm{perplexity} = \exp\left(-\frac{1}{N}\sum_{u} \log p(\mathbf{w}_u)\right), \tag{1}
$$

where $\mathbf{w}_u$ represents the words contained in the tweets of user $u$ and $N$ is the number of words in the test set. A lower perplexity means higher predictive performance. We set the number of topics $K$ at 50, 100, 150, 200, and 250 and evaluated the perplexity of each model for each $K$ via 10-fold cross-validation.
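As a small worked example of this metric, assuming the usual definition $\exp(-\frac{1}{N}\sum_u \log p(\mathbf{w}_u))$, the computation reduces to a few lines. The function name `perplexity` and the toy log-probabilities are illustrative and are not taken from the experiments.

```python
import math


def perplexity(log_probs_per_user, n_words):
    """Perplexity from per-user held-out log-likelihoods.

    log_probs_per_user : log p(w_u) for each user u (natural log)
    n_words            : total number of held-out words N
    """
    return math.exp(-sum(log_probs_per_user) / n_words)


if __name__ == "__main__":
    # Two hypothetical users with 6 and 4 held-out words. A model that gives
    # every word probability 1/4 has a perplexity of about 4.
    log_probs = [6 * math.log(0.25), 4 * math.log(0.25)]
    print(perplexity(log_probs, n_words=10))
```

Intuitively, a perplexity of 4 means the model is as uncertain as if it were choosing uniformly among four words at each position.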
The results are shown in Table 1, which shows that the improved model performs better than the other models for every K. Therefore, the new assumption of the improved model, that the ratio between background and topic words is different for each user, appears to be more appropriate. LDA performance worsens with an increase in K because aggregating the tweets of a single user neglects the topic of each tweet. Table 2 shows examples of the tweets of users with high and low rates of background words. Users with a high rate of background words tend to use basic words that appear in any topic, such as "like," "about," and "people," and they tend to tweet about their personal lives. On the other hand, users with a low rate of background words often use topical words such as "Arsenal," "Justin," and "Google." They tend to tweet about their interests, including music, sports, and movies.

Table 2: Example tweets of users with high and low rates of background words

High rate of background words | Low rate of background words
I hope today goes quickly | Team Arsenal v will Ozil be
I want to work in a cake | Making Justin smile and laugh as he is working on music
All need your support please | Google nexus briefly appears in Google play store

Model Extension based on Topic Tracking Model
We extend the improved model shown in Section 2.1 to consider the time sequence and to support online inference, based on TTM (Iwata et al., 2009). TTM is a probabilistic consumer purchase behavior model based on LDA for tracking the interests of each user and the trends in each topic. Other topic models that consider the dynamics of topics include the dynamic topic model (DTM) (Blei and Lafferty, 2006) and topic over time (ToT) (Wang and McCallum, 2006). DTM is a model for analyzing the time evolution of topics in time-ordered document collections. It does not track the interests of each user, as shown in Figure 2(a), because it assumes that a user (document) has only one time stamp. ToT requires all the data over time for inference; thus, it is not appropriate for continuously generated data such as tweets. We consider that a model for tweets must be capable of online inference and must track the dynamics of user interests and topic trends. Since TTM has these abilities, we adapt it to the improved model described in Section 2.
Figure 2(b) shows the graphical representation of TTM. TTM assumes that the mean of user interests at the current time is the same as that at the previous time, unless new data is observed. Formally, the current interest $\phi_{t,u}$ is drawn from the following Dirichlet distribution, in which the mean is the previous interest $\hat{\phi}_{t-1,u}$ and the precision is $\alpha_{t,u}$:

$$
p(\phi_{t,u} \mid \hat{\phi}_{t-1,u}, \alpha_{t,u}) \propto \prod_{k=1}^{K} \phi_{t,u,k}^{\alpha_{t,u}\hat{\phi}_{t-1,u,k} - 1}, \tag{2}
$$

where $\phi_{t,u,k}$ represents the probability that user $u$ is interested in topic $k$ at time $t$. Here $t$ is a discrete variable, and the unit time interval can be set arbitrarily, e.g., to one day or one week. The precision $\alpha_{t,u}$ represents the interest persistence, i.e., how consistently user $u$ maintains his/her interests at time $t$ compared with the previous time $t-1$. $\alpha_{t,u}$ is estimated for each time period and each user because interest persistence depends on both time and users. Similarly, the current topic trend $\theta_{t,k}$ is drawn from the following Dirichlet distribution with the previous trend $\hat{\theta}_{t-1,k}$ as its mean and $\beta_{t,k}$ as its precision:

$$
p(\theta_{t,k} \mid \hat{\theta}_{t-1,k}, \beta_{t,k}) \propto \prod_{v=1}^{V} \theta_{t,k,v}^{\beta_{t,k}\hat{\theta}_{t-1,k,v} - 1}, \tag{3}
$$

where $\theta_{t,k,v}$ represents the probability that word $v$ is chosen in topic $k$ at time $t$.
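The role of the precision can be illustrated numerically. The sketch below assumes the prior has the form of a Dirichlet distribution with parameter vector equal to the precision times the previous mean, and it samples from that distribution using normalized Gamma draws from the Python standard library. The function name `draw_dirichlet` and the toy values are hypothetical.

```python
import random


def draw_dirichlet(rng, mean, precision):
    """Draw from Dirichlet(precision * mean) via normalized Gamma variates.

    mean must be strictly positive and sum to 1; precision must be positive.
    """
    gammas = [rng.gammavariate(precision * m, 1.0) for m in mean]
    total = sum(gammas)
    return [g / total for g in gammas]


if __name__ == "__main__":
    rng = random.Random(0)
    previous_interest = [0.6, 0.3, 0.1]  # toy estimate from time t-1
    for precision in (5.0, 1000.0):
        draws = [draw_dirichlet(rng, previous_interest, precision) for _ in range(3)]
        # Larger precision keeps the draws closer to the previous interest.
        print(precision, [[round(x, 3) for x in d] for d in draws])
```

With a large precision the sampled interests stay close to the previous estimate, which corresponds to a user whose interests persist; with a small precision they can move far from it.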
Our proposed Twitter-TTM adapts the above TTM assumptions to the improved model. That is, we extend the improved model so that the user interest $\phi_{t,u}$ and the topic trend $\theta_{t,k}$ depend on their previous states. Time dependency is not considered for $\theta_B$ and $\pi_u$ because they can be regarded as being independent of time. Figures 3 and 4 show the generative process and a graphical representation of Twitter-TTM, respectively. Twitter-TTM can capture the dynamics of user interests and topic trends in Twitter online while taking the characteristics of tweets into account. Moreover, Twitter-TTM can be extended to capture long-term dependencies, as described in Iwata et al. (2009).

Model Inference
We use a stochastic expectation-maximization algorithm for Twitter-TTM inference, as described in Wallach (2006), in which Gibbs sampling of latent variables and maximum joint likelihood estimation of parameters are alternately iterated. At time $t$, we estimate user interests $\Phi_t = \{\phi_{t,u}\}_{u=1}^{U}$, topic trends $\Theta_t = \{\theta_{t,k}\}_{k=1}^{K}$, the background word distribution $\theta_{t,B}$, the word usage rate distribution $\pi_{t,u}$, interest persistence parameters $\alpha_t = \{\alpha_{t,u}\}_{u=1}^{U}$, and trend persistence parameters $\beta_t = \{\beta_{t,k}\}_{k=1}^{K}$ using the previous time interests $\hat{\Phi}_{t-1}$ and trends $\hat{\Theta}_{t-1}$.
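We employ collapsed Gibbs sampling to infer the latent variables. Let $D_t$ be the set of tweets and $Z_t$, $Y_t$ be the sets of latent variables $z$, $y$ at time $t$. Integrating out the parameters gives the joint distribution

$$
\begin{aligned}
p(D_t, Z_t, Y_t \mid \hat{\Phi}_{t-1}, \hat{\Theta}_{t-1}, \alpha_t, \beta_t, \lambda, \gamma)
={}& \prod_{u=1}^{U} \frac{\Gamma(2\gamma)}{\Gamma(\gamma)^2} \frac{\Gamma(n_{t,u,B}+\gamma)\,\Gamma(n_{t,u,K}+\gamma)}{\Gamma(n_{t,u}+2\gamma)} \\
&\times \frac{\Gamma(V\lambda)}{\Gamma(\lambda)^V} \frac{\prod_{v=1}^{V}\Gamma(n_{t,B,v}+\lambda)}{\Gamma(n_{t,B}+V\lambda)} \\
&\times \prod_{k=1}^{K} \frac{\Gamma(\beta_{t,k})}{\Gamma(n_{t,k}+\beta_{t,k})} \prod_{v=1}^{V} \frac{\Gamma(n_{t,k,v}+\beta_{t,k}\hat{\theta}_{t-1,k,v})}{\Gamma(\beta_{t,k}\hat{\theta}_{t-1,k,v})} \\
&\times \prod_{u=1}^{U} \frac{\Gamma(\alpha_{t,u})}{\Gamma(c_{t,u}+\alpha_{t,u})} \prod_{k=1}^{K} \frac{\Gamma(c_{t,u,k}+\alpha_{t,u}\hat{\phi}_{t-1,u,k})}{\Gamma(\alpha_{t,u}\hat{\phi}_{t-1,u,k})},
\end{aligned} \tag{4}
$$

where $n_{t,u,B}$ and $n_{t,u,K}$ are the numbers of background and topic words of user $u$ at time $t$, $n_{t,B,v}$ is the number of times that word $v$ is assigned as a background word at time $t$, $n_{t,k,v}$ is the number of times that word $v$ is assigned to topic $k$ at time $t$, and $c_{t,u,k}$ is the number of tweets assigned to topic $k$ for user $u$ at time $t$. In addition, $n_{t,u} = n_{t,u,B} + n_{t,u,K}$, $n_{t,B} = \sum_v n_{t,B,v}$, $n_{t,k} = \sum_v n_{t,k,v}$, and $c_{t,u} = \sum_k c_{t,u,k}$. Here $\lambda$ and $\gamma$ denote the hyperparameters of the priors on the background word distribution and on the word usage rate distribution, respectively.

Given the assignment of all other latent variables, we derive the following formula from eq.(4) to infer a latent topic:

$$
p(z_i = k \mid D_t, Z_{t \setminus i}, Y_t, \hat{\Phi}_{t-1}, \hat{\Theta}_{t-1}, \alpha_t, \beta_t)
\propto \frac{c_{t,u,k \setminus i} + \alpha_{t,u}\hat{\phi}_{t-1,u,k}}{c_{t,u \setminus i} + \alpha_{t,u}}
\cdot \frac{\Gamma(n_{t,k \setminus i} + \beta_{t,k})}{\Gamma(n_{t,k} + \beta_{t,k})}
\prod_{v=1}^{V} \frac{\Gamma(n_{t,k,v} + \beta_{t,k}\hat{\theta}_{t-1,k,v})}{\Gamma(n_{t,k,v \setminus i} + \beta_{t,k}\hat{\theta}_{t-1,k,v})}, \tag{5}
$$

where $i = (t, u, s)$; thus $z_i$ represents the topic assigned to the $s$-th tweet of user $u$ at time $t$, and $\setminus i$ represents a count excluding the $i$-th tweet. Then, when $z_i = k$ is given, we derive the following formulas to infer a latent variable $y_j$:

$$
p(y_j = 0 \mid \cdot) \propto \frac{n_{t,B,v \setminus j} + \lambda}{n_{t,B \setminus j} + V\lambda} \cdot \frac{n_{t,u,B \setminus j} + \gamma}{n_{t,u \setminus j} + 2\gamma}, \tag{6}
$$

$$
p(y_j = 1 \mid \cdot) \propto \frac{n_{t,k,v \setminus j} + \beta_{t,k}\hat{\theta}_{t-1,k,v}}{n_{t,k \setminus j} + \beta_{t,k}} \cdot \frac{n_{t,u,K \setminus j} + \gamma}{n_{t,u \setminus j} + 2\gamma}, \tag{7}
$$

where $j = (t, u, s, n)$; thus $y_j$ represents the latent variable assigned to the $n$-th word in the $s$-th tweet of user $u$ at time $t$, and $\setminus j$ represents a count excluding the $j$-th word.

The persistence parameters $\alpha_t$ and $\beta_t$ are estimated by maximizing the joint likelihood eq.(4) using a fixed-point iteration (Minka, 2000). The update formulas are as follows:

$$
\alpha_{t,u} \leftarrow \alpha_{t,u} \frac{\sum_{k} \hat{\phi}_{t-1,u,k} A_{t,u,k}}{\Psi(c_{t,u} + \alpha_{t,u}) - \Psi(\alpha_{t,u})}, \tag{8}
$$

where $A_{t,u,k} = \Psi(c_{t,u,k} + \alpha_{t,u}\hat{\phi}_{t-1,u,k}) - \Psi(\alpha_{t,u}\hat{\phi}_{t-1,u,k})$, and

$$
\beta_{t,k} \leftarrow \beta_{t,k} \frac{\sum_{v} \hat{\theta}_{t-1,k,v} B_{t,k,v}}{\Psi(n_{t,k} + \beta_{t,k}) - \Psi(\beta_{t,k})}, \tag{9}
$$

where $B_{t,k,v} = \Psi(n_{t,k,v} + \beta_{t,k}\hat{\theta}_{t-1,k,v}) - \Psi(\beta_{t,k}\hat{\theta}_{t-1,k,v})$ and $\Psi$ is the digamma function. We can estimate the latent variables $Z_t$, $Y_t$ and the parameters $\alpha_t$ and $\beta_t$ by iterating Gibbs sampling with eq.(5), eq.(6), and eq.(7) and maximum joint likelihood estimation with eq.(8) and eq.(9). After the iterations, the means of $\phi_{t,u,k}$ and $\theta_{t,k,v}$ are obtained as follows:

$$
\hat{\phi}_{t,u,k} = \frac{c_{t,u,k} + \alpha_{t,u}\hat{\phi}_{t-1,u,k}}{c_{t,u} + \alpha_{t,u}}, \tag{10}
$$

$$
\hat{\theta}_{t,k,v} = \frac{n_{t,k,v} + \beta_{t,k}\hat{\theta}_{t-1,k,v}}{n_{t,k} + \beta_{t,k}}. \tag{11}
$$

These estimates are used as the hyperparameters of the prior distributions at the next time period $t + 1$.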
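To make the parameter estimation step concrete, the sketch below implements a fixed-point update for the interest persistence of one user and the resulting mean interest. It assumes the update has the standard Minka form, alpha <- alpha * sum_k(m_k * A_k) / (Psi(c + alpha) - Psi(alpha)), with A_k = Psi(c_k + alpha * m_k) - Psi(alpha * m_k), where m is the previous interest and c_k are the tweet counts per topic. The function names, the hand-written digamma approximation, and the toy counts are assumptions for illustration, not the authors' code.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def update_alpha(alpha, prev_mean, counts, n_iter=20):
    """Fixed-point iterations for the persistence of one user.

    prev_mean must be strictly positive; counts must not all be zero.
    """
    total = sum(counts)
    for _ in range(n_iter):
        numerator = sum(
            m * (digamma(c + alpha * m) - digamma(alpha * m))
            for m, c in zip(prev_mean, counts)
        )
        denominator = digamma(total + alpha) - digamma(alpha)
        alpha *= numerator / denominator
    return alpha


def posterior_mean(alpha, prev_mean, counts):
    """Mean interest given counts and the prior centred on prev_mean."""
    total = sum(counts)
    return [(c + alpha * m) / (total + alpha) for m, c in zip(prev_mean, counts)]


if __name__ == "__main__":
    prev = [0.5, 0.3, 0.2]      # toy previous interest of one user
    same = [50, 30, 20]         # today's tweets match yesterday's interests
    shifted = [0, 0, 100]       # today's tweets concentrate on one topic
    for counts in (same, shifted):
        alpha = update_alpha(1.0, prev, counts)
        print(counts, alpha, posterior_mean(alpha, prev, counts))
```

In this toy setting the estimated persistence grows when the new counts agree with the previous interest and shrinks when they diverge, which matches the interpretation of the precision as interest persistence. The same routine would apply to the trend persistence by substituting word counts and the previous topic trend.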

Related Work
Recently, topic models for Twitter have been proposed. Diao et al. (2012) proposed a topic model that considers both the temporal information of tweets and users' personal interests. They applied their model to find bursty topics from Twitter. Yan et al. (2013) proposed a biterm topic model (BTM), which assumes that a word pair is independently drawn from a specific topic. They demonstrated that BTM can capture the topics within short texts such as tweets more effectively than LDA. Chua and Asur (2013) proposed two topic models considering time order and tweet intervals to extract the tweets summarizing a given event. The models mentioned above do not consider the dynamics of user interests, nor do they have the capability of online inference; thus, they cannot efficiently model the large number of tweets generated every day, whereas Twitter-TTM can capture the dynamics of user interests and topic trends and has the capability of online inference.

Figure 3: Generative process of tweets in Twitter-TTM
Some online topic models have also been proposed. Wang et al. (2012) proposed TM-LDA, which can efficiently model online the topics and topic transitions that naturally arise in a tweet stream. Their model learns the transition parameters among topics by minimizing the prediction error on the topic distribution in subsequent tweets. However, TM-LDA does not consider dynamic word distributions. In other words, their model cannot capture the dynamics of topic trends. Lau et al. (2012) proposed a topic model implementing a dynamic vocabulary based on online LDA (OLDA) (AlSumait et al., 2008) and applied it to track emerging events on Twitter. An online variational Bayes algorithm for LDA has also been proposed (Hoffman et al., 2010). However, these methods are based on LDA and do not consider the shortness of a tweet. Twitter-TTM tackles the shortness of a tweet by assuming that a single tweet consists of a single topic. This assumption is based on the following observation: a tweet is much shorter than a normal document, so a single tweet rarely contains multiple topics but rather a single one.

Setting
We evaluated the effectiveness of the proposed Twitter-TTM using an actual Twitter data set. The original Twitter data set contains 15,962 users and 4,146,672 tweets collected from October 18 to 31, 2013. We removed stop words and words that occurred fewer than 30 times. After this preprocessing, we obtained the final data set with 15,944 users, 3,679,481 tweets, and a vocabulary of 30,096 words. We compared the predictive performance of Twitter-TTM with that of LDA, TTM, Twitter-LDA, Twitter-LDA+TTM, and the improved model based on the perplexity of the tweets at the next time step. Twitter-LDA+TTM is a combination of Twitter-LDA and TTM. It is equivalent to Twitter-TTM, except that the rate between background and topic words is shared by all users rather than estimated for each user. We set the number of topics K at 100, the number of iterations of each model at 500, and the unit time interval at one day. The hyperparameters in these models were optimized in each Gibbs sampling iteration by maximizing the likelihood using fixed-point iterations (Minka, 2000). The inferences of LDA, Twitter-LDA, and the improved model were made for current time tweets. Figure 5 shows the perplexity of each model for each time, where t = 1 on the horizontal axis represents October 18, t = 2 represents October 19, ..., and t = 14 represents October 31. The perplexity at time t measures how well each model, inferred from the tweets of the previous time step, predicts the tweets of the current time step. Note that at t = 1, the performance of LDA and TTM, that of Twitter-LDA and Twitter-LDA+TTM, and that of Twitter-TTM and the improved model were found to be equivalent.

Result
As shown in Figure 5(a), the proposed Twitter-TTM shows lower perplexity than conventional models such as LDA, Twitter-LDA, and TTM at every time step, which implies that Twitter-TTM can appropriately model the dynamics of user interests and topic trends in Twitter. TTM did not achieve lower perplexity than LDA even though it considers the dynamics. Because TTM is based on LDA, which cannot appropriately model tweets, the user interests $\hat{\Phi}_{t-1}$ and topic trends $\hat{\Theta}_{t-1}$ of the previous time step are not estimated well in TTM. Figure 5(b) shows the perplexities of the improved model and Twitter-TTM. From t = 2, Twitter-TTM shows lower perplexity than the improved model at each time step. The reason for the high perplexity of the improved model is that it does not consider the dynamics. Twitter-TTM also shows lower perplexity than Twitter-LDA+TTM at each time step, as shown in Figure 5(c), because Twitter-TTM's assumption that the rate between background and topic words is different for each user is more appropriate, as demonstrated in Section 2.2. These results suggest that Twitter-TTM also outperforms other conventional methods, such as DTM, OLDA, and TM-LDA, which do not consider the shortness of a tweet or the dynamics of user interests or topic trends. Table 3 shows two examples of the topic evolution analyzed by Twitter-TTM, and Figure 6 shows the trend persistence parameters $\beta$ of each topic at each time. The persistence parameters of the topic "Football" are lower than those of "Birthday" because the former is strongly affected by trends in the real world. In fact, the top words in "Football" change more dynamically than those of "Birthday." For example, in the "Football" topic, although "Arsenal" is usually popular, "Madrid" becomes more popular on October 24.

Conclusion
We first proposed an improved model based on Twitter-LDA, which estimates the rate between background and topic words for each user. We demonstrated that the improved model could model tweets more accurately than LDA and Twitter-LDA. Next, we proposed a novel probabilistic topic model for Twitter, called Twitter-TTM, which can capture the dynamics of user interests and topic trends and is capable of online inference. We evaluated Twitter-TTM using an actual Twitter data set and demonstrated that it could model tweets more accurately than conventional methods.
The proposed method currently requires the number of topics to be determined in advance, and this number is fixed over time. In future work, we plan to extend the proposed method to capture the birth and death of topics along the timeline with a variable number of topics, as in the model proposed by Ahmed and Xing (2010). We also plan to apply the proposed method to content recommendations and trend analysis in Twitter to investigate this method further.

Table 3: Two examples of topic evolution analyzed by Twitter-TTM

Label | Date | Top words
Birthday | 10/18 | birthday, happy, maria, hope, good, love, thanks, bday, lovely, enjoy