Topic Modeling by Clustering Language Model Embeddings: Human Validation on an Industry Dataset

Anton Eklund, Mona Forsman


Abstract
Topic models are powerful tools for getting an overview of large collections of text data, a need that is common in industry applications. A rising trend within topic modeling is to directly cluster dimension-reduced embeddings created with pretrained language models. These models are difficult to evaluate because there is no ground truth and automatic measurements may not match human judgment. To address this problem, we created STELLAR, a tool for interactive topic browsing, and used it for human evaluation of topics created from a real-world dataset used in industry. Embeddings created with BERT were used together with UMAP and HDBSCAN to model the topics. The human evaluation found that our topic model creates coherent topics. The discussion that follows addresses the requirements of industry and the research needed for production-ready systems.
Anthology ID:
2022.emnlp-industry.65
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
December
Year:
2022
Address:
Abu Dhabi, UAE
Editors:
Yunyao Li, Angeliki Lazaridou
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
635–643
URL:
https://aclanthology.org/2022.emnlp-industry.65
DOI:
10.18653/v1/2022.emnlp-industry.65
Cite (ACL):
Anton Eklund and Mona Forsman. 2022. Topic Modeling by Clustering Language Model Embeddings: Human Validation on an Industry Dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 635–643, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Topic Modeling by Clustering Language Model Embeddings: Human Validation on an Industry Dataset (Eklund & Forsman, EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-industry.65.pdf