Semantic Component Analysis: Introducing Multi-Topic Distributions to Clustering-Based Topic Modeling

Florian Eichin, Carolin M. Schuster, Georg Groh, Michael A. Hedderich


Abstract
Topic modeling is a key method in text analysis, but existing approaches fail to efficiently scale to large datasets or are limited by assuming one topic per document. Overcoming these limitations, we introduce Semantic Component Analysis (SCA), a topic modeling technique that discovers multiple topics per sample by introducing a decomposition step to the clustering-based topic modeling framework. We evaluate SCA on Twitter datasets in English, Hausa and Chinese. There, it achieves competitive coherence and diversity compared to BERTopic, while uncovering at least double the topics and maintaining a noise rate close to zero. We also find that SCA outperforms the LLM-based TopicGPT in scenarios with similar compute budgets. SCA thus provides an effective and efficient approach for topic modeling of large datasets.
Anthology ID:
2025.findings-emnlp.964
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
17748–17771
URL:
https://aclanthology.org/2025.findings-emnlp.964/
Cite (ACL):
Florian Eichin, Carolin M. Schuster, Georg Groh, and Michael A. Hedderich. 2025. Semantic Component Analysis: Introducing Multi-Topic Distributions to Clustering-Based Topic Modeling. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 17748–17771, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Semantic Component Analysis: Introducing Multi-Topic Distributions to Clustering-Based Topic Modeling (Eichin et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.964.pdf
Checklist:
 2025.findings-emnlp.964.checklist.pdf