@inproceedings{rieger-etal-2021-rollinglda-update,
    title = "{RollingLDA}: {A}n Update Algorithm of {L}atent {D}irichlet {A}llocation to Construct Consistent Time Series from Textual Data",
    author = {Rieger, Jonas and
      Jentsch, Carsten and
      Rahnenf{\"u}hrer, J{\"o}rg},
    editor = "Moens, Marie-Francine and
      Huang, Xuanjing and
      Specia, Lucia and
      Yih, Scott Wen-tau",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.201",
    doi = "10.18653/v1/2021.findings-emnlp.201",
    pages = "2337--2347",
    abstract = "We propose a rolling version of the Latent Dirichlet Allocation, called RollingLDA. By a sequential approach, it enables the construction of LDA-based time series of topics that are consistent with previous states of LDA models. After an initial modeling, updates can be computed efficiently, allowing for real-time monitoring and detection of events or structural breaks. For this purpose, we propose suitable similarity measures for topics and provide simulation evidence of superiority over other commonly used approaches. The adequacy of the resulting method is illustrated by an application to an example corpus. In particular, we compute the similarity of sequentially obtained topic and word distributions over consecutive time periods. For a representative example corpus consisting of The New York Times articles from 1980 to 2020, we analyze the effect of several tuning parameter choices and we run the RollingLDA method on the full dataset of approximately 4 million articles to demonstrate its feasibility.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="rieger-etal-2021-rollinglda-update">
    <titleInfo>
        <title>RollingLDA: An Update Algorithm of Latent Dirichlet Allocation to Construct Consistent Time Series from Textual Data</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Jonas</namePart>
        <namePart type="family">Rieger</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Carsten</namePart>
        <namePart type="family">Jentsch</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Jörg</namePart>
        <namePart type="family">Rahnenführer</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2021-11</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Findings of the Association for Computational Linguistics: EMNLP 2021</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Marie-Francine</namePart>
            <namePart type="family">Moens</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Xuanjing</namePart>
            <namePart type="family">Huang</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Lucia</namePart>
            <namePart type="family">Specia</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Scott</namePart>
            <namePart type="given">Wen-tau</namePart>
            <namePart type="family">Yih</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Punta Cana, Dominican Republic</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>We propose a rolling version of the Latent Dirichlet Allocation, called RollingLDA. By a sequential approach, it enables the construction of LDA-based time series of topics that are consistent with previous states of LDA models. After an initial modeling, updates can be computed efficiently, allowing for real-time monitoring and detection of events or structural breaks. For this purpose, we propose suitable similarity measures for topics and provide simulation evidence of superiority over other commonly used approaches. The adequacy of the resulting method is illustrated by an application to an example corpus. In particular, we compute the similarity of sequentially obtained topic and word distributions over consecutive time periods. For a representative example corpus consisting of The New York Times articles from 1980 to 2020, we analyze the effect of several tuning parameter choices and we run the RollingLDA method on the full dataset of approximately 4 million articles to demonstrate its feasibility.</abstract>
    <identifier type="citekey">rieger-etal-2021-rollinglda-update</identifier>
    <identifier type="doi">10.18653/v1/2021.findings-emnlp.201</identifier>
    <location>
        <url>https://aclanthology.org/2021.findings-emnlp.201</url>
    </location>
    <part>
        <date>2021-11</date>
        <extent unit="page">
            <start>2337</start>
            <end>2347</end>
        </extent>
    </part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T RollingLDA: An Update Algorithm of Latent Dirichlet Allocation to Construct Consistent Time Series from Textual Data
%A Rieger, Jonas
%A Jentsch, Carsten
%A Rahnenführer, Jörg
%Y Moens, Marie-Francine
%Y Huang, Xuanjing
%Y Specia, Lucia
%Y Yih, Scott Wen-tau
%S Findings of the Association for Computational Linguistics: EMNLP 2021
%D 2021
%8 November
%I Association for Computational Linguistics
%C Punta Cana, Dominican Republic
%F rieger-etal-2021-rollinglda-update
%X We propose a rolling version of the Latent Dirichlet Allocation, called RollingLDA. By a sequential approach, it enables the construction of LDA-based time series of topics that are consistent with previous states of LDA models. After an initial modeling, updates can be computed efficiently, allowing for real-time monitoring and detection of events or structural breaks. For this purpose, we propose suitable similarity measures for topics and provide simulation evidence of superiority over other commonly used approaches. The adequacy of the resulting method is illustrated by an application to an example corpus. In particular, we compute the similarity of sequentially obtained topic and word distributions over consecutive time periods. For a representative example corpus consisting of The New York Times articles from 1980 to 2020, we analyze the effect of several tuning parameter choices and we run the RollingLDA method on the full dataset of approximately 4 million articles to demonstrate its feasibility.
%R 10.18653/v1/2021.findings-emnlp.201
%U https://aclanthology.org/2021.findings-emnlp.201
%U https://doi.org/10.18653/v1/2021.findings-emnlp.201
%P 2337-2347
Markdown (Informal)
[RollingLDA: An Update Algorithm of Latent Dirichlet Allocation to Construct Consistent Time Series from Textual Data](https://aclanthology.org/2021.findings-emnlp.201) (Rieger et al., Findings 2021)
ACL
Jonas Rieger, Carsten Jentsch, and Jörg Rahnenführer. 2021. [RollingLDA: An Update Algorithm of Latent Dirichlet Allocation to Construct Consistent Time Series from Textual Data](https://aclanthology.org/2021.findings-emnlp.201). In *Findings of the Association for Computational Linguistics: EMNLP 2021*, pages 2337–2347, Punta Cana, Dominican Republic. Association for Computational Linguistics.
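
The abstract describes the core mechanic: fit an LDA model once, then update it chunk by chunk so that topics remain comparable over time, and track how similar each topic's word distribution is across consecutive time periods. The sketch below illustrates only that rolling loop; it is not the authors' RollingLDA algorithm or implementation (the paper proposes its own update scheme and topic similarity measures). It substitutes gensim's online variational LDA update, a single fixed vocabulary, toy time chunks, two topics, and cosine similarity, all of which are illustrative assumptions.

```python
# Minimal sketch of a rolling LDA update loop with per-topic similarity tracking.
# NOT the paper's RollingLDA: gensim's online variational update stands in for the
# authors' method, and cosine similarity stands in for their proposed topic
# similarity measures. Chunks, vocabulary handling, and the topic count are toy
# assumptions for illustration.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy "time chunks": each chunk is a list of tokenized documents from one period.
chunks = [
    [["economy", "inflation", "market"], ["election", "vote", "senate"]],
    [["market", "stocks", "inflation"], ["vote", "campaign", "senate"]],
    [["virus", "vaccine", "health"], ["stocks", "market", "rally"]],
]

# Simplification: build one fixed vocabulary over all chunks up front.
dictionary = Dictionary(doc for chunk in chunks for doc in chunk)
num_topics = 2

# Initial model on the first chunk.
lda = LdaModel(
    corpus=[dictionary.doc2bow(doc) for doc in chunks[0]],
    id2word=dictionary,
    num_topics=num_topics,
    passes=10,
    random_state=0,
)
previous_topics = lda.get_topics()  # (num_topics, vocab_size) word distributions


def cosine(a, b):
    """Cosine similarity between two topic-word distributions."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Sequential updates: one update per new chunk, then compare each topic's word
# distribution with its state after the previous chunk.
for t, chunk in enumerate(chunks[1:], start=1):
    lda.update([dictionary.doc2bow(doc) for doc in chunk])
    current_topics = lda.get_topics()
    sims = [cosine(previous_topics[k], current_topics[k]) for k in range(num_topics)]
    print(f"chunk {t}: topic self-similarities = {np.round(sims, 3)}")
    previous_topics = current_topics
```

A drop in such a per-topic similarity series is the kind of signal the abstract refers to when it mentions monitoring for events or structural breaks; everything beyond this loop-and-compare structure is specific to the paper and not reproduced here.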