@inproceedings{murugaraj-etal-2025-mining,
  author    = {Murugaraj, Keerthana and
               Lamsiyah, Salima and
               During, Marten and
               Theobald, Martin},
  editor    = {H{\"a}m{\"a}l{\"a}inen, Mika and
               {\"O}hman, Emily and
               Bizzoni, Yuri and
               Miyagawa, So and
               Alnajjar, Khalid},
  title     = {Mining the Past: A Comparative Study of Classical and Neural Topic Models on Historical Newspaper Archives},
  booktitle = {Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities},
  month     = may,
  year      = {2025},
  address   = {Albuquerque, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {452--463},
  isbn      = {979-8-89176-234-3},
  doi       = {10.18653/v1/2025.nlp4dh-1.39},
  url       = {https://aclanthology.org/2025.nlp4dh-1.39/},
  abstract  = {Analyzing historical discourse in large-scale newspaper archives requires scalable and interpretable methods to uncover hidden themes. This study systematically evaluates topic modeling approaches for newspaper articles from 1955 to 2018, comparing probabilistic LDA, matrix factorization NMF, and neural-based models such as Top2Vec and BERTopic across various preprocessing strategies. We benchmark these methods on topic coherence, diversity, scalability, and interpretability. While LDA is commonly used in historical text analysis, our findings demonstrate that BERTopic, leveraging contextual embeddings, consistently outperforms classical models in all tested aspects, making it a more robust choice for large-scale textual corpora. Additionally, we highlight the trade-offs between preprocessing strategies and model performance, emphasizing the importance of tailored pipeline design. These insights advance the field of historical NLP, offering concrete guidance for historians and computational social scientists in selecting the most effective topic-modeling approach for analyzing digitized archives. Our code will be publicly available on GitHub.},
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="murugaraj-etal-2025-mining">
<titleInfo>
<title>Mining the Past: A Comparative Study of Classical and Neural Topic Models on Historical Newspaper Archives</title>
</titleInfo>
<name type="personal">
<namePart type="given">Keerthana</namePart>
<namePart type="family">Murugaraj</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Salima</namePart>
<namePart type="family">Lamsiyah</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Marten</namePart>
<namePart type="family">During</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Martin</namePart>
<namePart type="family">Theobald</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-05</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities</title>
</titleInfo>
<name type="personal">
<namePart type="given">Mika</namePart>
<namePart type="family">Hämäläinen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Emily</namePart>
<namePart type="family">Öhman</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yuri</namePart>
<namePart type="family">Bizzoni</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">So</namePart>
<namePart type="family">Miyagawa</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Khalid</namePart>
<namePart type="family">Alnajjar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Albuquerque, USA</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-234-3</identifier>
</relatedItem>
<abstract>Analyzing historical discourse in large-scale newspaper archives requires scalable and interpretable methods to uncover hidden themes. This study systematically evaluates topic modeling approaches for newspaper articles from 1955 to 2018, comparing probabilistic LDA, matrix factorization NMF, and neural-based models such as Top2Vec and BERTopic across various preprocessing strategies. We benchmark these methods on topic coherence, diversity, scalability, and interpretability. While LDA is commonly used in historical text analysis, our findings demonstrate that BERTopic, leveraging contextual embeddings, consistently outperforms classical models in all tested aspects, making it a more robust choice for large-scale textual corpora. Additionally, we highlight the trade-offs between preprocessing strategies and model performance, emphasizing the importance of tailored pipeline design. These insights advance the field of historical NLP, offering concrete guidance for historians and computational social scientists in selecting the most effective topic-modeling approach for analyzing digitized archives. Our code will be publicly available on GitHub.</abstract>
<identifier type="citekey">murugaraj-etal-2025-mining</identifier>
<identifier type="doi">10.18653/v1/2025.nlp4dh-1.39</identifier>
<location>
<url>https://aclanthology.org/2025.nlp4dh-1.39/</url>
</location>
<part>
<date>2025-05</date>
<extent unit="page">
<start>452</start>
<end>463</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Mining the Past: A Comparative Study of Classical and Neural Topic Models on Historical Newspaper Archives
%A Murugaraj, Keerthana
%A Lamsiyah, Salima
%A During, Marten
%A Theobald, Martin
%Y Hämäläinen, Mika
%Y Öhman, Emily
%Y Bizzoni, Yuri
%Y Miyagawa, So
%Y Alnajjar, Khalid
%S Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
%D 2025
%8 May
%I Association for Computational Linguistics
%C Albuquerque, USA
%@ 979-8-89176-234-3
%F murugaraj-etal-2025-mining
%X Analyzing historical discourse in large-scale newspaper archives requires scalable and interpretable methods to uncover hidden themes. This study systematically evaluates topic modeling approaches for newspaper articles from 1955 to 2018, comparing probabilistic LDA, matrix factorization NMF, and neural-based models such as Top2Vec and BERTopic across various preprocessing strategies. We benchmark these methods on topic coherence, diversity, scalability, and interpretability. While LDA is commonly used in historical text analysis, our findings demonstrate that BERTopic, leveraging contextual embeddings, consistently outperforms classical models in all tested aspects, making it a more robust choice for large-scale textual corpora. Additionally, we highlight the trade-offs between preprocessing strategies and model performance, emphasizing the importance of tailored pipeline design. These insights advance the field of historical NLP, offering concrete guidance for historians and computational social scientists in selecting the most effective topic-modeling approach for analyzing digitized archives. Our code will be publicly available on GitHub.
%R 10.18653/v1/2025.nlp4dh-1.39
%U https://aclanthology.org/2025.nlp4dh-1.39/
%U https://doi.org/10.18653/v1/2025.nlp4dh-1.39
%P 452-463
Markdown (Informal)
[Mining the Past: A Comparative Study of Classical and Neural Topic Models on Historical Newspaper Archives](https://aclanthology.org/2025.nlp4dh-1.39/) (Murugaraj et al., NLP4DH 2025)
ACL